Title: | Wrapper Algorithm for All Relevant Feature Selection |
---|---|
Description: | An all relevant feature selection wrapper algorithm. It finds relevant features by comparing original attributes' importance with importance achievable at random, estimated using their permuted copies (shadows). |
Authors: | Miron Bartosz Kursa [aut, cre] , Witold Remigiusz Rudnicki [aut] |
Maintainer: | Miron Bartosz Kursa <[email protected]> |
License: | GPL (>= 2) |
Version: | 8.0.0 |
Built: | 2024-12-19 06:26:20 UTC |
Source: | CRAN |
attStats
shows a summary of a Boruta run in an attribute-centred way.
It produces a data frame containing some importance stats as well as the number of hits that attribute scored and the decision it was given.
attStats(x)
x |
an object of a class Boruta, from which attribute stats should be extracted. |
A data frame containing, for each attribute originally in the information system, the mean, median, maximal and minimal importance, the number of hits normalised to the number of importance source runs performed, and the decision copied from finalDecision
.
When using a Boruta object generated by TentativeRoughFix
, the resulting data frame will contain the rough-fixed decision.
x
has to be made with holdHistory
set to TRUE
for this code to run.
## Not run: 
library(mlbench); data(Sonar)
#Takes some time, so be patient
Boruta(Class~.,data=Sonar,doTrace=2)->Bor.son
print(Bor.son)
stats<-attStats(Bor.son)
print(stats)
plot(normHits~meanImp,col=stats$decision,data=stats)
## End(Not run)
Boruta is an all relevant feature selection wrapper algorithm, capable of working with any classification method that outputs a variable importance measure (VIM); by default, Boruta uses Random Forest. The method performs a top-down search for relevant features by comparing original attributes' importance with importance achievable at random, estimated using their permuted copies, and progressively eliminating irrelevant features to stabilise that test.
Boruta(x, ...)

## Default S3 method:
Boruta(
  x,
  y,
  pValue = 0.01,
  mcAdj = TRUE,
  maxRuns = 100,
  doTrace = 0,
  holdHistory = TRUE,
  getImp = getImpRfZ,
  ...
)

## S3 method for class 'formula'
Boruta(formula, data, ...)
x |
data frame of predictors. |
... |
additional parameters passed to |
y |
response vector; factor for classification, numeric vector for regression, |
pValue |
confidence level. Default value should be used. |
mcAdj |
if set to |
maxRuns |
maximal number of importance source runs. You may increase it to resolve attributes left Tentative. |
doTrace |
verbosity level. 0 means no tracing, 1 means reporting decision about each attribute as soon as it is justified, 2 means the same as 1, plus reporting each importance source run, 3 means the same as 2, plus reporting of hits assigned to yet undecided attributes. |
holdHistory |
if set to |
getImp |
function used to obtain attribute importance.
The default is getImpRfZ, which runs random forest from the |
formula |
alternatively, formula describing model to be analysed. |
data |
in which to interpret formula. |
Boruta iteratively compares importances of attributes with importances of shadow attributes, created by shuffling original ones.
Attributes that have significantly worse importance than shadow ones are consecutively dropped.
On the other hand, attributes that are significantly better than shadows are admitted to be Confirmed.
Shadows are re-created in each iteration.
The algorithm stops when only Confirmed attributes are left, or when it reaches maxRuns
importance source runs.
If the second scenario occurs, some attributes may be left without a decision.
They are claimed Tentative.
You may try to extend maxRuns
or lower pValue
to clarify them, but in some cases their importances do fluctuate too much for Boruta to converge.
Instead, you can use TentativeRoughFix
function, which performs a different, weaker test to make a final decision, or simply treat them as undecided in further analysis.
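The shadow comparison described above can be sketched in base R. This is only an illustration under stated assumptions: `toyImp` (absolute Pearson correlation) is a stand-in importance that Boruta does not use, `oneRun` is a hypothetical helper, and the significance test that the package applies to the accumulated hits is left out.

```r
set.seed(42)

# Toy data: x1 drives y, x2 and x3 are pure noise.
n <- 200
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * x$x1 + rnorm(n, sd = 0.5)

# Stand-in importance (assumption): absolute Pearson correlation with y.
# Boruta itself uses Random Forest importance by default.
toyImp <- function(x, y) sapply(x, function(col) abs(cor(col, y)))

# One importance source run (hypothetical helper): permute every column to
# build shadows, score originals and shadows together, and flag the
# attributes that beat the best shadow.
oneRun <- function(x, y) {
  shadows <- as.data.frame(lapply(x, sample))
  names(shadows) <- paste0("shadow_", names(x))
  imp <- toyImp(cbind(x, shadows), y)
  imp[names(x)] > max(imp[names(shadows)])
}

# Shadows are re-created in each run; hits accumulate across runs.
runs <- 30
hits <- rowSums(replicate(runs, oneRun(x, y)))
print(hits)
```

An attribute that collects hits in nearly every run is a candidate for Confirmed, while one that rarely beats the best shadow is a candidate for rejection.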
An object of class Boruta
, which is a list with the following components:
finalDecision |
a factor of three values: |
ImpHistory |
a data frame of importances of attributes gathered in each importance source run.
Besides predictors' importances, it contains the maximal, mean and minimal importance of shadow attributes in each run.
Rejected attributes get |
timeTaken |
time taken by the computation. |
impSource |
string describing the source of importance, equal to a comment attribute of the |
call |
the original call of the |
Miron B. Kursa, Witold R. Rudnicki (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), p. 1-13. URL: doi:10.18637/jss.v036.i11
set.seed(777)

#Boruta on the "small redundant XOR" problem; read ?srx for details
data(srx)
Boruta(Y~.,data=srx)->Boruta.srx

#Results summary
print(Boruta.srx)

#Result plot
plot(Boruta.srx)

#Attribute statistics
attStats(Boruta.srx)

#Using alternative importance source, rFerns
Boruta(Y~.,data=srx,getImp=getImpFerns)->Boruta.srx.ferns
print(Boruta.srx.ferns)

#Verbose
Boruta(Y~.,data=srx,doTrace=2)->Boruta.srx

## Not run: 
#Boruta on the iris problem extended with artificial irrelevant features
#Generate said features
iris.extended<-data.frame(iris,apply(iris[,-5],2,sample))
names(iris.extended)[6:9]<-paste("Nonsense",1:4,sep="")
#Run Boruta on this data
Boruta(Species~.,data=iris.extended,doTrace=2)->Boruta.iris.extended
#Nonsense attributes should be rejected
print(Boruta.iris.extended)
## End(Not run)

## Not run: 
#Boruta on the HouseVotes84 data from mlbench
library(mlbench); data(HouseVotes84)
na.omit(HouseVotes84)->hvo
#Takes some time, so be patient
Boruta(Class~.,data=hvo,doTrace=2)->Bor.hvo
print(Bor.hvo)
plot(Bor.hvo)
plotImpHistory(Bor.hvo)
## End(Not run)

## Not run: 
#Boruta on the Ozone data from mlbench
library(mlbench); data(Ozone)
library(randomForest)
na.omit(Ozone)->ozo
Boruta(V4~.,data=ozo,doTrace=2)->Bor.ozo
cat('Random forest run on all attributes:\n')
print(randomForest(V4~.,data=ozo))
cat('Random forest run only on confirmed attributes:\n')
print(randomForest(ozo[,getSelectedAttributes(Bor.ozo)],ozo$V4))
## End(Not run)

## Not run: 
#Boruta on the Sonar data from mlbench
library(mlbench); data(Sonar)
#Takes some time, so be patient
Boruta(Class~.,data=Sonar,doTrace=2)->Bor.son
print(Bor.son)
#Shows important bands
plot(Bor.son,sort=FALSE)
## End(Not run)
Applies downstream importance source on a given object strata and averages their outputs.
conditionalTransdapter(groups, adapter = getImpRfZ)
groups |
groups. |
adapter |
importance adapter to transform. |
transformed importance adapter which can be fed into getImp
argument of the Boruta
function.
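The idea of scoring each stratum separately and averaging the results can be sketched as follows. This is a minimal illustration, not the package's implementation: `stratifiedAdapter` is a hypothetical helper and `toyAdapter` (absolute correlation with a numeric response) is an assumed stand-in for a real importance source such as getImpRfZ.

```r
# Stand-in importance adapter (assumption): absolute correlation with a
# numeric response; it follows the (x, y, ...) shape of the getImp* functions.
toyAdapter <- function(x, y, ...) sapply(x, function(col) abs(cor(col, y)))

# Hypothetical helper, not the package's code: build an adapter that scores
# each stratum separately and averages the per-stratum importances.
stratifiedAdapter <- function(groups, adapter) {
  groups <- as.factor(groups)
  function(x, y, ...) {
    perStratum <- lapply(levels(groups), function(g) {
      rows <- groups == g
      adapter(x[rows, , drop = FALSE], y[rows], ...)
    })
    Reduce(`+`, perStratum) / length(perStratum)
  }
}

set.seed(1)
x <- data.frame(a = rnorm(60), b = rnorm(60))
y <- x$a + rnorm(60, sd = 0.3)
groups <- rep(c("g1", "g2", "g3"), each = 20)

imp <- stratifiedAdapter(groups, toyAdapter)(x, y)
print(imp)
```

The wrapper returns a function with the same (x, y, ...) shape as the adapter it wraps, which is what lets a transformed adapter stand in wherever a plain one is expected.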
Applies the decoherence transformation to the input, destroying all multivariate interactions. It will trash the Boruta result; only apply it if you know what you are doing! Works only for a categorical decision.
decohereTransdapter(adapter = getImpRfZ)
adapter |
importance adapter to transform. |
transformed importance adapter which can be fed into getImp
argument of the Boruta
function.
set.seed(777)
# SRX data only contains multivariate interactions
data(srx)
# Decoherence transform removes them all,
# leaving no confirmed features
Boruta(Y~.,data=srx,getImp=decohereTransdapter())
Functions which convert the Boruta selection into a formula, so that it can be passed on to other functions.
getConfirmedFormula(x)
getNonRejectedFormula(x)
x |
an object of a class Boruta, made using a formula interface. |
Formula, corresponding to the Boruta results.
getConfirmedFormula
returns only Confirmed attributes, getNonRejectedFormula
also adds Tentative ones.
This operation is possible only when Boruta selection was invoked using a formula interface.
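A short usage sketch, assuming the Boruta package is installed and using the srx data shipped with it. The variable names `bor`, `confirmed`, `nonRejected` and `reduced` are illustrative, and `model.frame` merely stands in for any formula-aware function.

```r
library(Boruta)

set.seed(777)
data(srx)

# The formula interface is required for the conversions below.
bor <- Boruta(Y ~ ., data = srx)

# Formula restricted to Confirmed attributes
confirmed <- getConfirmedFormula(bor)
print(confirmed)

# Formula that also keeps Tentative attributes
nonRejected <- getNonRejectedFormula(bor)
print(nonRejected)

# Pass the result on to any formula-aware function, here stats::model.frame
reduced <- model.frame(confirmed, data = srx)
print(names(reduced))
```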
These functions are intended to be given to the getImp
argument of Boruta
function to be called by the Boruta algorithm as an importance source.
getImpExtraZ
generates default, normalized permutation importance, getImpExtraRaw
raw permutation importance, finally getImpExtraGini
generates Gini impurity importance.
getImpExtraZ(x, y, ntree = 500, num.trees = ntree, ...)
getImpExtraGini(x, y, ntree = 500, num.trees = ntree, ...)
getImpExtraRaw(x, y, ntree = 500, num.trees = ntree, ...)
x |
data frame of predictors including shadows. |
y |
response vector. |
ntree |
Number of trees in the forest; copied into |
num.trees |
Number of trees in the forest, as according to |
... |
parameters passed to the underlying |
This function is intended to be given to a getImp
argument of Boruta
function to be called by the Boruta algorithm as an importance source.
getImpFerns(x, y, ...)
x |
data frame of predictors including shadows. |
y |
response vector. |
... |
parameters passed to the underlying |
Random Ferns importance calculation should be much faster than using Random Forest; however, one must first optimize the value of the depth
parameter and
it is quite likely that the number of ferns in the ensemble required for the importance to converge will be higher than the number of trees in case of Random Forest.
These functions are intended to be given to the getImp
argument of Boruta
function to be called by the Boruta algorithm as an importance source.
getImpLegacyRfZ
generates default, normalized permutation importance, getImpLegacyRfRaw
raw permutation importance, finally getImpLegacyRfGini
generates Gini index importance, all using randomForest
as a Random Forest algorithm implementation.
getImpLegacyRfZ(x, y, ...)
getImpLegacyRfRaw(x, y, ...)
getImpLegacyRfGini(x, y, ...)
x |
data frame of predictors including shadows. |
y |
response vector. |
... |
parameters passed to the underlying |
The getImpLegacyRfZ
function was the default importance source in Boruta versions prior to 5.0; since then the ranger
Random Forest implementation is used instead of randomForest
, for speed, memory conservation and an ability to utilise multithreading.
Both importance sources should generally lead to the same results, yet there are differences.
Most notably, ranger by default treats factor attributes as ordered (and works very slowly if instructed otherwise with respect.unordered.factors=TRUE
); on the other hand it lifts 32 levels limit specific to randomForest
.
As a result, Boruta decisions for factor attributes may differ.
Random Forest methods have two main parameters, the number of attributes tried at each split and the number of trees in the forest; the first one is called mtry
in both implementations, but the second is called ntree
in randomForest
and num.trees
in ranger
.
To maintain compatibility, getImpRf*
functions still accept the ntree
parameter, relaying it into num.trees
.
Still, both parameters take the same defaults in both implementations (square root of the number of all attributes and 500 respectively).
Moreover, ranger
brings some additional capabilities to Boruta, like analysis of survival problems or sticky variables which are always considered on splits.
Finally, the results for the same PRNG seed will be different.
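The relay of ntree into num.trees follows from R's lazy evaluation of default arguments, as in the `ntree = 500, num.trees = ntree` signatures of the getImpRf* functions. A minimal base R sketch, where `treesUsed` is a hypothetical stand-in that only reports the tree count it would hand to the underlying forest:

```r
# Hypothetical stand-in for an importance source; it mimics only the
# ntree/num.trees part of the signature and returns the resolved tree count.
treesUsed <- function(ntree = 500, num.trees = ntree, ...) num.trees

n_default <- treesUsed()                  # neither given: ntree's default
n_legacy  <- treesUsed(ntree = 1000)      # legacy name relayed to num.trees
n_ranger  <- treesUsed(num.trees = 250)   # ranger-style name used directly
n_both    <- treesUsed(ntree = 1000, num.trees = 250)  # num.trees wins

print(c(n_default, n_legacy, n_ranger, n_both))
```

Because the default for num.trees is only evaluated when num.trees is not supplied, code written against either parameter name ends up with the same forest size.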
set.seed(777)
#Add some nonsense attributes to iris dataset by shuffling original attributes
iris.extended<-data.frame(iris,apply(iris[,-5],2,sample))
names(iris.extended)[6:9]<-paste("Nonsense",1:4,sep="")
#Run Boruta on this data
Boruta(Species~.,getImp=getImpLegacyRfZ,
 data=iris.extended,doTrace=2)->Boruta.iris.extended
#Nonsense attributes should be rejected
print(Boruta.iris.extended)
These functions are intended to be given to the getImp
argument of Boruta
function to be called by the Boruta algorithm as an importance source.
getImpRfZ
generates default, normalized permutation importance, getImpRfRaw
raw permutation importance, finally getImpRfGini
generates Gini index importance.
getImpRfZ(x, y, ntree = 500, num.trees = ntree, ...)
getImpRfGini(x, y, ntree = 500, num.trees = ntree, ...)
getImpRfRaw(x, y, ntree = 500, num.trees = ntree, ...)
x |
data frame of predictors including shadows. |
y |
response vector. |
ntree |
Number of trees in the forest; copied into |
num.trees |
Number of trees in the forest, as according to |
... |
parameters passed to the underlying |
Prior to Boruta 5.0, getImpLegacyRfZ
function was the default importance source in Boruta; see getImpLegacyRf for more details.
This function is intended to be given to a getImp
argument of Boruta
function to be called by the Boruta algorithm as an importance source.
This functionality is inspired by the Python package BoostARoota by Chase DeHan.
In practice, due to the eager way XgBoost works, this adapter changes Boruta into a minimal optimal method, hence I strongly recommend against using this.
getImpXgboost(x, y, nrounds = 5, verbose = 0, ...)
x |
data frame of predictors including shadows. |
y |
response vector. |
nrounds |
Number of rounds; passed to the underlying |
verbose |
Verbosity level of xgboost; either 0 (silent) or 1 (progress reports). Passed to the underlying |
... |
other parameters passed to the underlying |
Only the dense matrix interface is supported; all predictors given to the Boruta
call have to be numeric (not integer).
Categorical features should be split into indicator attributes.
https://github.com/chasedehan/BoostARoota
getSelectedAttributes
returns a vector of names of attributes selected during a Boruta run.
getSelectedAttributes(x, withTentative = FALSE)
x |
an object of a class Boruta, from which relevant attributes names should be extracted. |
withTentative |
if set to |
A character vector with names of the relevant attributes.
## Not run: 
data(iris)
#Takes some time, so be patient
Boruta(Species~.,data=iris,doTrace=2)->Bor.iris
print(Bor.iris)
print(getSelectedAttributes(Bor.iris))
## End(Not run)
Wraps the importance adapter to accept NAs in input.
imputeTransdapter(adapter = getImpRfZ)
adapter |
importance adapter to transform. |
transformed importance adapter which can be fed into getImp
argument of the Boruta
function.
An all-NA feature will be converted to all zeroes, which should be ok as a totally non-informative value with most methods, but it is not universally correct. Ideally, one should avoid having such features in input altogether.
## Not run: 
set.seed(777)
data(srx)
srx_na<-srx
# Randomly punch 25 holes in the SRX data
holes<-25
holes<-cbind(
  sample(nrow(srx),holes,replace=TRUE),
  sample(ncol(srx),holes,replace=TRUE)
)
srx_na[holes]<-NA
# Use impute transdapter to mitigate them with internal imputation
Boruta(Y~.,data=srx_na,getImp=imputeTransdapter(getImpRfZ))
## End(Not run)
Default plot method for Boruta objects, showing boxplots of attribute importances over the run.
## S3 method for class 'Boruta'
plot(
  x,
  colCode = c("green", "yellow", "red", "blue"),
  sort = TRUE,
  whichShadow = c(TRUE, TRUE, TRUE),
  col = NULL,
  xlab = "Attributes",
  ylab = "Importance",
  ...
)
x |
an object of a class Boruta. |
colCode |
a vector containing colour codes for attribute decisions, respectively Confirmed, Tentative, Rejected and shadow. |
sort |
controls whether boxplots should be ordered, or left in original order. |
whichShadow |
a logical vector controlling which shadows should be drawn; switches respectively max shadow, mean shadow and min shadow. |
col |
standard |
xlab |
X axis label that will be passed to |
ylab |
Y axis label that will be passed to |
... |
additional graphical parameter that will be passed to |
Invisible copy of x
.
If col
is given and sort
is TRUE
, the col
will be permuted, so that its order corresponds to attribute order in ImpHistory
.
This function will throw an error when x
lacks importance history, i.e., was made with holdHistory
set to FALSE
.
## Not run: 
library(mlbench); data(HouseVotes84)
na.omit(HouseVotes84)->hvo
#Takes some time, so be patient
Boruta(Class~.,data=hvo,doTrace=2)->Bor.hvo
print(Bor.hvo)
plot(Bor.hvo)
## End(Not run)
Alternative plot method for Boruta objects, showing a matplot of attribute importances over the run.
plotImpHistory(
  x,
  colCode = c("green", "yellow", "red", "blue"),
  col = NULL,
  type = "l",
  lty = 1,
  pch = 0,
  xlab = "Classifier run",
  ylab = "Importance",
  ...
)
x |
an object of a class Boruta. |
colCode |
a vector containing colour codes for attribute decisions, respectively Confirmed, Tentative, Rejected and shadow. |
col |
standard |
type |
Plot type that will be passed to |
lty |
Line type that will be passed to |
pch |
Point mark type that will be passed to |
xlab |
X axis label that will be passed to |
ylab |
Y axis label that will be passed to |
... |
additional graphical parameter that will be passed to |
Invisible copy of x
.
This function will throw an error when x
lacks importance history, i.e., was made with holdHistory
set to FALSE
.
## Not run: 
library(mlbench); data(Sonar)
#Takes some time, so be patient
Boruta(Class~.,data=Sonar,doTrace=2)->Bor.son
print(Bor.son)
plotImpHistory(Bor.son)
## End(Not run)
Print method for the Boruta objects.
## S3 method for class 'Boruta'
print(x, ...)
x |
an object of a class Boruta. |
... |
additional arguments passed to |
Invisible copy of x
.
A synthetic data set with 32 rows corresponding to all combinations of values of five logical features, A, B, N1, N2 and N3. The decision Y is equal to A xor B, hence N1–N3 are irrelevant attributes. The set also contains 3 additional features, A or B (AoB), A and B (AnB) and not A (nA), which provide a redundant, but still relevant way to reconstruct Y.
data(srx)
A data frame with 8 predictors, 5 relevant: A, B, AoB, AnB and nA, as well as 3 irrelevant: N1, N2 and N3, and the decision attribute Y.
This set is an easy way to demonstrate the difference between all relevant feature selection methods, which should select all features except N1–N3, and minimal optimal ones, which will probably ignore most of them.
https://blog.mbq.me/relevance-and-redundancy/
In some circumstances (too short Boruta run, unfortunate mixing of shadow attributes, tricky dataset...), Boruta can leave some attributes Tentative.
TentativeRoughFix
performs a simplified, weaker test for judging such attributes.
TentativeRoughFix(x, averageOver = Inf)
x |
an object of a class Boruta. |
averageOver |
Either number of last importance source runs to average over or Inf for averaging over the whole Boruta run. |
The function claims as Confirmed those attributes whose median importance is higher than the median importance of the maximal shadow attribute, and the rest as Rejected. Depending on the user's choice, medians for the test are computed over the last round, all rounds or the N last importance source runs.
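The median rule above can be sketched on a toy importance history. Everything here is illustrative: the numbers are made up, the column names `attrA`, `attrB` and `shadowMax` are assumptions, and `roughDecision` is a hypothetical helper rather than the package's code.

```r
# Toy importance history (made-up numbers): rows are importance source runs,
# columns are two Tentative attributes plus the per-run maximal shadow importance.
impHistory <- data.frame(
  attrA     = c(3.1, 2.8, 3.5, 2.9, 3.3, 3.0),
  attrB     = c(1.2, 2.6, 0.9, 1.5, 2.4, 2.8),
  shadowMax = c(2.0, 2.2, 1.9, 2.5, 2.1, 2.3)
)

# Hypothetical helper mirroring the rule in the text; not the package's code.
roughDecision <- function(history, attrs, averageOver = Inf) {
  # Keep only the last `averageOver` runs (all of them for Inf)
  rows <- tail(seq_len(nrow(history)), min(averageOver, nrow(history)))
  shadowMedian <- median(history$shadowMax[rows])
  sapply(attrs, function(a) {
    if (median(history[[a]][rows]) > shadowMedian) "Confirmed" else "Rejected"
  })
}

overAll   <- roughDecision(impHistory, c("attrA", "attrB"))
overLast2 <- roughDecision(impHistory, c("attrA", "attrB"), averageOver = 2)
print(overAll)
print(overLast2)
```

In this toy history attrB is Rejected when medians are taken over all runs but Confirmed when only the last two runs are used, which shows how the choice of averageOver can change the outcome.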
A Boruta class object with modified finalDecision
element.
Such an object has a few additional elements:
originalDecision |
Original |
averageOver |
Copy of |
This function should be used only when a strict decision is highly desired, because this test is much weaker than Boruta and can lower the confidence of the final result.
x
has to be made with holdHistory
set to
TRUE
for this code to run.