To load the package, you can use the below command:
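The command itself is not shown in this excerpt; assuming the package is FeatureTerminatoR (the package that exports rfeTerminator), it would look like this:
library(FeatureTerminatoR)
#caret supplies rfFuncs and the underlying RFE machinery, if it is not already attached
library(caret)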
The trick to this is to use cross validation, or repeated cross validation, to eliminate n features from the model. This is achieved by fitting the model multiple times at each step and removing the weakest features, determined either by the coefficients in the model or by the model's feature importance attributes.
Within the package there are a number of different evaluation function types you can utilise; see the underlying caretFuncs() documentation for the full list.
The package supports all of these methods. I will utilise the random forest variable importance selection method, as it is quick to train on our test dataset.
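For context, a minimal sketch of the equivalent workflow using caret's rfe() directly with rfFuncs (the function names and arguments below are caret's, not this package's) would look something like this:
library(caret)
set.seed(123)
#10-fold cross-validated RFE using random forest variable importance
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rfe_profile <- rfe(x = iris[, 1:4], y = iris[, 5], sizes = 1:4, rfeControl = ctrl)
#The variables retained by the backward elimination
predictors(rfe_profile)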
The following steps will take you through how to use this function.
For the test data we will use the built-in iris dataset.
df <- iris
print(head(df,10))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5.0 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
Now is the time to use the workhorse function for the RFE (Recursive Feature Elimination) methods:
#Passing in the indexes as slices: x values located in columns 1:4 and the y value in column 5
rfe_fit <- rfeTerminator(df, x_cols= 1:4, y_cols=5, alter_df = TRUE, eval_funcs = rfFuncs)
#> [INFO] Removing features as a result of recursive feature enginnering. Expose rfe_reduced_data from returned list using $ selectors.
#> [IVS SELECTED] Optimal variables are: Petal.Width
#> [IVS SELECTED] Optimal variables are: Petal.Length
#Passing the y value by column name
rfe_fit_col_name <- rfeTerminator(df, x_cols=1:4, y_cols="Species", alter_df=TRUE)
#> [INFO] Removing features as a result of recursive feature enginnering. Expose rfe_reduced_data from returned list using $ selectors.
#> [IVS SELECTED] Optimal variables are: Petal.Width
#> [IVS SELECTED] Optimal variables are: Petal.Length
# A further example
ref_x_col_name <- rfeTerminator(df,
                                x_cols = c("Sepal.Length", "Sepal.Width",
                                           "Petal.Length", "Petal.Width"),
                                y_cols = "Species")
#> [INFO] Removing features as a result of recursive feature enginnering. Expose rfe_reduced_data from returned list using $ selectors.
#> [IVS SELECTED] Optimal variables are: Petal.Length
#> [IVS SELECTED] Optimal variables are: Petal.Width
This shows that it does not matter how you pass the data to the function, but when passing by name the x column names need to be wrapped in a vector, as the further example highlights. Otherwise, you can simply pass the columns as an index slice of the data frame.
The function will select the best combination of features, with the sizes argument indicating the subset sizes of features to evaluate and retain. This defaults to an integer slice of 1:10.
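If you want to constrain the subset sizes that the elimination evaluates, sizes can be passed explicitly. A hedged sketch (the exact signature may differ from what is assumed here):
#Evaluate subsets of 1 to 3 features only (sizes assumed to take an integer vector)
rfe_fit_sizes <- rfeTerminator(df, x_cols = 1:4, y_cols = 5,
                               sizes = 1:3, alter_df = TRUE, eval_funcs = rfFuncs)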
#Explore the optimal model results
print(rfe_fit$rfe_model_fit_results)
#>
#> Recursive feature selection
#>
#> Outer resampling method: Cross-Validated (10 fold)
#>
#> Resampling performance over subset size:
#>
#> Variables Accuracy Kappa AccuracySD KappaSD Selected
#> 1 0.9200 0.88 0.06126 0.09189
#> 2 0.9667 0.95 0.03514 0.05270 *
#> 3 0.9533 0.93 0.04500 0.06749
#> 4 0.9600 0.94 0.04661 0.06992
#>
#> The top 2 variables (out of 2):
#> Petal.Width, Petal.Length
#View the optimum variables selected
print(rfe_fit$rfe_model_fit_results$optVariables)
#> [1] "Petal.Width" "Petal.Length"
The returned list will also retain the original data, with the alter_df argument indicating whether the results should be output for manual evaluation of the backward elimination, or whether the data frame should be reduced. This could be the full data before a training / testing split, or the training set alone, depending on your ML pipeline strategy.
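For the manual-evaluation route, a sketch with alter_df = FALSE (assuming the fitted results are still returned for inspection while the passed data frame is left untouched):
#Keep the passed data frame as-is and inspect the RFE results manually
rfe_manual <- rfeTerminator(df, x_cols = 1:4, y_cols = 5,
                            alter_df = FALSE, eval_funcs = rfFuncs)
print(rfe_manual$rfe_model_fit_results$optVariables)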
To view the original data:
#Explore the original data passed to the frame
print(head(rfe_fit$rfe_original_data))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
Viewing the outputs post termination, you can observe that the features that have little bearing on the dependent (predicted) variable are terminated:
#Explore the data adapted with the less important features removed
print(head(rfe_fit$rfe_reduced_data))
#> Petal.Width Petal.Length Species
#> 1 0.2 1.4 setosa
#> 2 0.2 1.4 setosa
#> 3 0.2 1.3 setosa
#> 4 0.2 1.5 setosa
#> 5 0.2 1.4 setosa
#> 6 0.4 1.7 setosa
The features that do not have a significant impact have been removed from the data, which should speed up training of the ML or predictive model.
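As an illustration of where this sits in a pipeline, the reduced frame can be passed straight to a downstream model. A minimal sketch assuming the randomForest package (any modelling function would do):
library(randomForest)
set.seed(123)
#Train on the reduced feature set returned by rfeTerminator
rf_model <- randomForest(Species ~ ., data = rfe_fit$rfe_reduced_data)
print(rf_model)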
Next, we move on to another feature selection method; this time we utilise a correlation method to remove the potential effects of multicollinearity.
These algorithms will form the first version of the package, but still to be developed are: