FeatureTerminatoR

Loading the packages

To load the package, along with the supporting packages used in this walkthrough, you can use the following commands:

library(FeatureTerminatoR)
library(caret)
library(dplyr)
library(ggplot2)
library(randomForest)

Recursive Feature Elimination

The trick here is to use cross validation, or repeated cross validation, to eliminate n features from the model. This is achieved by fitting the model multiple times at each step and removing the weakest features, determined either by the model coefficients or by the model's feature importance attributes.

Within the package there are a number of different selection functions you can utilise:

  • rfFuncs - this uses the random forest method of assessing the mean decrease in accuracy over the features of interest, i.e. the x (independent) variables. Through the recursive nature of the algorithm, it looks at which IVs have the largest effect on the mean decrease in accuracy for the predicted y. The algorithm then purges the features with low feature importance, i.e. those that have little effect on changing this accuracy metric.
  • nbFuncs - this uses the naive Bayes algorithm to assess which features have the greatest effect on the overall probability of the dependent variable, utilising their effect on the prior and the posterior probabilities. It is termed naive because it assumes all the variables in the model are equally important at the outset of the test.
  • treebagFuncs - this looks at how many times a variable occurs as a decision node in bagged trees. The number of occurrences and the position of a given decision node in the tree give an indication of the importance of the respective predictor: the more often a variable occurs, and the closer a decision node is to the root node, the more important the variable and the node, respectively.
  • lmFuncs - this uses the sum of squared errors from the regression line, with the important variables defined as those whose deviation falls outside the expected Gaussian distribution.

See the underlying caretFuncs() documentation.
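
For reference, a minimal sketch of the underlying caret::rfe call that these selection functions plug into might look like the following (illustrative only, not the exact internals of rfeTerminator):

#Sketch: the underlying caret::rfe mechanism with rfFuncs and 10-fold cross validation
set.seed(123)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rfe_raw <- rfe(x = iris[, 1:4], y = iris[, 5], sizes = 1:4, rfeControl = ctrl)
rfe_raw$optVariables #features retained after recursive elimination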

The package implements all of these methods. I will utilise the random forest variable importance selection method, as it is quick to train on our test dataset.

Using the rfeTerminator function in FeatureTerminatoR

The following steps will take you through how to use this function.

Loading the test data

For the test data we will use the built-in iris dataset.

df <- iris
print(head(df,10))
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa

Fitting an RFE method to the data

Now it is time to use the workhorse function for the RFE (Recursive Feature Elimination) methods:

#Passing in the indexes as slices: x values located at indexes 1:4 and the y value in column 5
rfe_fit <- rfeTerminator(df, x_cols= 1:4, y_cols=5, alter_df = TRUE, eval_funcs = rfFuncs)
#> [INFO] Removing features as a result of recursive feature enginnering. Expose rfe_reduced_data from returned list using $ selectors.
#> [IVS SELECTED] Optimal variables are: Petal.Width
#> [IVS SELECTED] Optimal variables are: Petal.Length
#Passing by column name
rfe_fit_col_name <- rfeTerminator(df, x_cols=1:4, y_cols="Species", alter_df=TRUE)
#> [INFO] Removing features as a result of recursive feature enginnering. Expose rfe_reduced_data from returned list using $ selectors.
#> [IVS SELECTED] Optimal variables are: Petal.Width
#> [IVS SELECTED] Optimal variables are: Petal.Length
# A further example
ref_x_col_name <- rfeTerminator(df,
                                x_cols=c("Sepal.Length", "Sepal.Width",
                                        "Petal.Length", "Petal.Width"),
                                y_cols = "Species")
#> [INFO] Removing features as a result of recursive feature enginnering. Expose rfe_reduced_data from returned list using $ selectors.
#> [IVS SELECTED] Optimal variables are: Petal.Length
#> [IVS SELECTED] Optimal variables are: Petal.Width

This shows that it does not matter whether you pass the columns by index or by name, but if you use x column names they need to be wrapped in a vector, as the further example highlights. Otherwise, you can simply pass the columns as an index slice of the data frame.

Exploring the model output results

The model will select the best combination of features, with the sizes argument indicating the feature subset sizes to evaluate during the search. This defaults to the integer sequence 1:10.
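
For example, to restrict the search to smaller subsets you could pass the sizes argument explicitly (a sketch, assuming sizes is forwarded to the underlying caret::rfe call):

#Sketch: restrict the evaluated feature subset sizes
rfe_fit_small <- rfeTerminator(df, x_cols=1:4, y_cols=5, sizes=1:3,
                               alter_df = TRUE, eval_funcs = rfFuncs)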

#Explore the optimal model results
print(rfe_fit$rfe_model_fit_results)
#> 
#> Recursive feature selection
#> 
#> Outer resampling method: Cross-Validated (10 fold) 
#> 
#> Resampling performance over subset size:
#> 
#>  Variables Accuracy Kappa AccuracySD KappaSD Selected
#>          1   0.9200  0.88    0.06126 0.09189         
#>          2   0.9667  0.95    0.03514 0.05270        *
#>          3   0.9533  0.93    0.04500 0.06749         
#>          4   0.9600  0.94    0.04661 0.06992         
#> 
#> The top 2 variables (out of 2):
#>    Petal.Width, Petal.Length
#View the optimum variables selected
print(rfe_fit$rfe_model_fit_results$optVariables)
#> [1] "Petal.Width"  "Petal.Length"

Outputting the original and reduced data

The returned list retains the original data, with the alter_df argument indicating whether the results should simply be output for manual evaluation of the backward elimination, or whether a reduced data frame should also be returned. The data passed in could be the full data before a training / testing split, or the training set only, depending on your ML pipeline strategy.
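
For instance, in a split-first pipeline you would fit the RFE step on the training data only and then subset the test data to the same columns. A minimal sketch, assuming caret is used for the partitioning:

#Sketch: fit RFE on the training split only, then align the test split columns
set.seed(123)
train_idx <- caret::createDataPartition(df$Species, p = 0.8, list = FALSE)
train_df <- df[train_idx, ]
test_df <- df[-train_idx, ]
rfe_train <- rfeTerminator(train_df, x_cols=1:4, y_cols=5, alter_df=TRUE)
keep_cols <- names(rfe_train$rfe_reduced_data)
test_reduced <- test_df[, keep_cols]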

Viewing the original data

To view the original data:

#Explore the original data passed to the function
print(head(rfe_fit$rfe_original_data))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

Obtaining the data after rfe termination

Viewing the outputs post termination, you can observe that the features that have little bearing on the dependent (predicted) variable are terminated:

#Explore the data adapted with the less important features removed
print(head(rfe_fit$rfe_reduced_data))
#>   Petal.Width Petal.Length Species
#> 1         0.2          1.4  setosa
#> 2         0.2          1.4  setosa
#> 3         0.2          1.3  setosa
#> 4         0.2          1.5  setosa
#> 5         0.2          1.4  setosa
#> 6         0.4          1.7  setosa

The features that do not have a significant impact have been removed, and this reduction in the number of predictors should speed up the ML or predictive model prior to training it.
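
As a rough check, you could compare training times on the original and reduced data (an illustrative sketch using randomForest, which we loaded earlier):

#Sketch: compare training time on the full versus reduced feature set
set.seed(123)
system.time(randomForest(Species ~ ., data = rfe_fit$rfe_original_data))
system.time(randomForest(Species ~ ., data = rfe_fit$rfe_reduced_data))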

Next, we move on to another feature selection method; this time we are utilising a correlation method to remove potential effects of multicollinearity.

Removing Highly Correlated Features - mutlicol_terminator

The main reason you would want to do this is to avoid multicollinearity. This is an effect caused when there are high intercorrelations among two or more independent variables in linear models. It is not so much of a problem with non-linear models, such as trees, but it can still cause high variance in the models, so scaling of independent variables is always recommended.
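
If you do want to scale the numeric predictors before modelling, a minimal base R sketch would be:

#Sketch: centre and scale the numeric independent variables
df_scaled <- df
df_scaled[, 1:4] <- scale(df_scaled[, 1:4])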

Why bother about multicollinearity?

In general, multicollinearity can lead to wider confidence intervals that produce less reliable probabilities in terms of the effect of independent variables in a model. That is, the statistical inferences from a model with multicollinearity may not be dependable.

Key takeaways:

  • Multicollinearity is a statistical concept where independent variables in a model are correlated.
  • Multicollinearity among independent variables will result in less reliable statistical inferences.
  • It is better to use independent variables that are not correlated or repetitive when building multiple regression models that use two or more variables.

This is why you would want to remove highly correlated features.
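
As a quick base R illustration of the problem, when two predictors are nearly identical the coefficient standard errors blow up (a hypothetical simulated example, not part of the package):

#Sketch: two nearly identical predictors inflate coefficient standard errors
set.seed(42)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05) #almost a copy of x1
y <- 2 * x1 + rnorm(100)
summary(lm(y ~ x1 + x2))$coefficients #note the inflated standard errors on x1 and x2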

Getting started with the high correlation removal

We already have our test data loaded in, so we will reuse the dataset from the previous example.

#Fit the multicollinearity removal method and define a correlation cut-off limit
mc_term_fit <- FeatureTerminatoR::mutlicol_terminator(df, x_cols=1:4,
                                   y_cols="Species",
                                   alter_df=TRUE,
                                   cor_sig = 0.90)
#> [INFO] Removing features as a result of highly correlated value cut off.

Visualising the outputs

Exploring the outputs:

# Visualise the quantile distributions of where the correlations lie
mc_term_fit$corr_quant_chart

This shows that our cut-off starts at about the 85th percentile of the correlation distribution, at the top end. This would also work for strong negative associations. Here, we could probably be a little stricter than our 0.90 limit, but we will keep it at this for now, as we do not want to purge all the features.
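
If you did want to be stricter, you could simply lower the cut-off and refit (a sketch reusing the same arguments as above):

#Sketch: refit with a stricter correlation cut-off
mc_term_strict <- FeatureTerminatoR::mutlicol_terminator(df, x_cols=1:4,
                                      y_cols="Species",
                                      alter_df=TRUE,
                                      cor_sig = 0.85)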

Viewing the raw correlation and covariance matrices

This has been built into the tool for ease:

# View the correlation matrix
mc_term_fit$corr_matrix
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
#> Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
#> Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
#> Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
# View the covariance matrix
mc_term_fit$cov_matrix
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
#> Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
#> Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
#> Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
# View the quantile range
mc_term_fit$corr_quantile #This excludes the diagonal correlations, as this would inflate the quantile distribution
#>         5%        10%        15%        20%        25%        30%        35% 
#> -0.4284401 -0.4222087 -0.3879359 -0.3661259 -0.3661259 -0.2915591 -0.1548532 
#>        40%        45%        50%        55%        60%        65%        70% 
#> -0.1175698 -0.1175698  0.3501857  0.8179411  0.8179411  0.8260130  0.8556100 
#>        75%        80%        85%        90%        95%       100% 
#>  0.8717538  0.8717538  0.9036429  0.9537543  0.9628654  0.9628654

There are some strong correlations between petal length and petal width, so these will be clipped by our choice of cut-off.
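
For comparison, you can reproduce the same idea on the returned correlation matrix with caret's findCorrelation (a sketch; this is assumed to mirror, not necessarily match, what the package does internally):

#Sketch: identify candidate columns to drop from the correlation matrix
high_corr_idx <- caret::findCorrelation(mc_term_fit$corr_matrix, cutoff = 0.90)
colnames(mc_term_fit$corr_matrix)[high_corr_idx]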

Viewing the reduced data

To get the outputs from the feature selection method, we use the following call to obtain the output tibble:

# Get the removed and reduced data
new_df_post_feature_removal <- mc_term_fit$feature_removed_df
glimpse(new_df_post_feature_removal)
#> Rows: 150
#> Columns: 4
#> $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
#> $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
#> $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
#> $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

Here, the algorithm has removed a feature based on the cut-off limit provided.
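
To confirm exactly which column was terminated, you can compare the column names before and after (in this run it is Petal.Length, as the glimpse output above shows):

#Which feature was terminated by the cut-off?
setdiff(names(df), names(new_df_post_feature_removal))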

Still to be included

These algorithms will form the first version of the package, but still to be developed are:

  • Simulated Annealing methods - this is a probabilistic technique for approximating the global optimum of a given function. The name of the algorithm comes from annealing in metallurgy, a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects.
  • Lasso Regression - a regularisation method that allows coefficients to shrink to exactly zero, meaning the variable is of very little, or no, importance to the predicted yhat measure (see the sketch below for the general idea).
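
A minimal sketch of the lasso idea using glmnet (this is not part of FeatureTerminatoR yet; it simply illustrates coefficients shrinking to exactly zero):

#Sketch: lasso on the iris numeric columns, predicting Petal.Width
library(glmnet)
x <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")])
y <- iris$Petal.Width
lasso_fit <- cv.glmnet(x, y, alpha = 1)
coef(lasso_fit, s = "lambda.1se") #coefficients shrunk exactly to zero drop out of the model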