library(gfboost)
#> Loading required package: mvtnorm
#> Loading required package: pcaPP
#> Loading required package: mboost
#> Loading required package: parallel
#> Loading required package: stabs
The gfboost package extends the mboost package
(Hothorn et al. (2017), Hofner et al. (2014), Hothorn et al. (2010), Bühlmann and Hothorn (2007), Hofner, Boccuto, and Göker (2015), Hothorn and Bühlmann (2006)) as it provides an
implementation of a so-called “gradient-free Gradient Boosting”
algorithm that allows for the application of Boosting to
non-differentiable and non-convex loss functions as well as a modified,
loss-based Stability Selection variant (Werner
(2019b), Werner and Ruckdeschel
(2019)). The motivation behind this type of Boosting algorithm is
the application of Boosting to ranking problems which suffer from
complicated, non-continuous loss functions.
We assume that our training set is given by a data matrix 𝒟_train ∈ ℝ^(n × (p+1)) where n is the number of observations and p is the number of predictors, i.e., our data matrix can be written as 𝒟_train = (X_train, Y_train) for the regressor matrix X_train ∈ ℝ^(n × p) and the response vector Y_train ∈ ℝ^n.
The relation between the response and the regressors can be described by some map f : ℝ^p ↦ ℝ in the sense that y = f(x) + ϵ for some error term ϵ with mean zero and variance σ² ∈ ]0, ∞[.
The goal is to approximate f by a sparse and stable model, where “stable” refers to the property that the set of selected predictors changes only insignificantly on similar data.
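For illustration, the following hypothetical simulation generates a data set of this form; the dimensions, coefficients and noise level are arbitrary choices and not taken from the package.
# Hypothetical simulated data set matching the setup above: n = 100 observations,
# p = 10 predictors of which only the first three are relevant (sparse f), and
# y = f(x) + eps with centered noise; the response is stored in the last column.
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), nrow = n)
beta <- c(2, -1, 0.5, rep(0, p - 3))
Y <- as.numeric(X %*% beta + rnorm(n, sd = 0.5))
Dsim <- data.frame(X, Y = Y)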
Our package uses the family objects from the package mboost to define new loss functions. We refer to mboost (Hothorn et al. (2017)) for more details. For our applications, we allow for non-differentiable loss functions like ranking losses, i.e., losses that do not provide a gradient, leaving the ngradient argument of the family objects empty. We provide some specific losses for ranking problems. As a simple example, we demonstrate the family Rank():
y<-c(-3, 10.3,-8, 12, 14,-0.5, 29,-1.1,-5.7, 119)
yhat<-c(0.02, 0.6, 0.1, 0.47, 0.82, 0.04, 0.77, 0.09, 0.01, 0.79)
Rank()@risk(y,yhat)
#> [1] 0.1777778
The Rank family implements the hard ranking loss (S. Clémençon, Lugosi, and Vayatis (2008)) which compares the orderings of the two inserted vectors. More precisely, it sums up all misrankings, i.e., all pairs (i, j) where Y_i is higher (lower) than Y_j but where Ŷ_i is lower (higher) than Ŷ_j. At the end, this sum is standardized such that a hard ranking loss of 0 indicates that there are no misrankings, while a hard ranking loss of, say, 0.5 indicates that a misranking occurs in half of all such possible pairs. We provide further families for the so-called localized and weak ranking losses (Stéphan Clémençon and Vayatis (2007)), including a normalized variant of the latter.
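As a sanity check of this definition, the value above can also be computed by hand as the fraction of discordant pairs. The following minimal sketch assumes, consistent with the standardization described above, that the number of misrankings is divided by the total number of pairs n(n−1)/2; ties are ignored since the example contains none.
# Manual computation of the hard ranking loss as the fraction of misranked pairs.
hardrank <- function(y, yhat) {
  pairs <- combn(length(y), 2)   # all n(n-1)/2 index pairs (i, j)
  disc <- sum((y[pairs[1, ]] - y[pairs[2, ]]) * (yhat[pairs[1, ]] - yhat[pairs[2, ]]) < 0)
  disc / ncol(pairs)
}
hardrank(y, yhat)
#> [1] 0.1777778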
For more details about ranking problems, see S. Clémençon, Lugosi, and Vayatis (2008) or
Werner (2019a) as well as references
therein.
In this section, we describe our gradient-free Gradient Boosting
algorithm SingBoost which is based on L2−Boosting and allows
for loss functions that do not provide a gradient. We further show how
to get coefficient paths for the variables and describe an aggregation
procedure for SingBoost.
The main function of this package is the singboost function
which is based on glmboost (Hothorn et
al. (2017)). In this version of the package, we restricted
ourselves to linear regression base learners.
The usage of singboost is very similar to that of glmboost. The main difference is that SingBoost includes so-called “singular iterations”. In contrast to the gradient descent step executed in standard Boosting algorithms, where the baselearner that best fits the current negative gradient vector is selected, a singular iteration corresponds to a secant step, i.e., the least-squares baselearner whose fit of the current residuals is best when evaluated in the (possibly complicated) loss function is taken. Therefore, singboost allows for loss functions that do not provide a (unique) gradient, so even a family object that has an empty ngradient argument can be inserted as the family argument of the singboost function.
The motivation behind these singular iterations comes from a measure-theoretical perspective in the sense that the selection frequencies can be regarded as a “column measure” (cf. Werner and Ruckdeschel (2019)). Clearly, different loss functions lead to different column measures and potentially to different sets of selected variables, i.e., the columns that are relevant for our target loss function, maybe a ranking loss, but that are not selected by L-Boosting for another loss function L may be identified with a “singular part” of column measures (cf. Werner and Ruckdeschel (2019)). We always use L2-Boosting (Bühlmann and Yu (2003), Bühlmann (2006)) as the reference Boosting model. Moreover, we assume that L2-Boosting already selects relevant variables, which makes a Boosting algorithm that only contains singular iterations (and would therefore be computationally quite expensive) unnecessary. Therefore, SingBoost alternates between singular iterations and standard L2-Boosting iterations, where the frequency of the singular iterations can be set by the parameter M, i.e., every M-th iteration is a singular iteration. For the sake of wide applicability, we do not exclude standard (i.e., differentiable) loss functions. Note that our implementation contains an additional LS argument indicating whether the singular iterations are gradient- or secant-based; it must be set to TRUE if a non-differentiable loss function is used.
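To make this alternation concrete, the following schematic sketch mimics the idea with simple componentwise least-squares baselearners, e.g. with risk = Rank()@risk as the target loss. It is illustrative pseudocode only; the function and variable names are ours and this is not the implementation used in singboost.
# Illustrative sketch of the SingBoost iteration scheme (not the package code).
# D: data frame with the response in the last column; risk: target loss function;
# every M-th iteration is a singular (secant) iteration.
singboost_sketch <- function(D, risk, M = 10, m_iter = 100, kap = 0.1) {
  X <- as.matrix(D[, -ncol(D)]); y <- D[, ncol(D)]
  f <- rep(mean(y), length(y))                    # offset
  for (m in 1:m_iter) {
    r <- y - f                                    # current residuals
    # componentwise least-squares fits of the current residuals
    fits <- apply(X, 2, function(x) x * sum(x * r) / sum(x * x))
    if (m %% M == 0) {
      # singular iteration: best baselearner w.r.t. the target loss (secant step)
      jstar <- which.min(apply(fits, 2, function(fit) risk(r, fit)))
    } else {
      # standard L2-Boosting iteration: best baselearner w.r.t. the squared loss
      jstar <- which.min(colSums((r - fits)^2))
    }
    f <- f + kap * fits[, jstar]                  # update the fit
  }
  f
}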
The further input arguments are the number of iterations (m_iter), the learning rate κ (kap) and the data set 𝒟 (D). It is important to note that the inserted data set must not contain an intercept column since this column is added automatically and internally in singboost. If one does not want an intercept coefficient, one has to center the data beforehand. Furthermore, the last column must contain the response variable: singboost always treats the last column as the response column and all other columns as regressor columns. Since a formula argument cannot be inserted, one has to generate the model matrix first if basis functions, interaction terms or categorical variables are included. For the particular case of localized ranking losses, the argument best controls how large the proportion of the best instances is.
Consider the following simple example:
glmres<-glmboost(Sepal.Length~.,iris)
glmres
#>
#> Generalized Linear Models Fitted via Gradient Boosting
#>
#> Call:
#> glmboost.formula(formula = Sepal.Length ~ ., data = iris)
#>
#>
#> Squared Error (Regression)
#>
#> Loss function: (y - f)^2
#>
#>
#> Number of boosting iterations: mstop = 100
#> Step size: 0.1
#> Offset: 5.843333
#>
#> Coefficients:
#> (Intercept) Sepal.Width Petal.Length Speciesvirginica
#> -3.36560001 0.53639026 0.46090783 -0.01924627
#> attr(,"offset")
#> [1] 5.843333
attributes(varimp(glmres))$self
#> [1] 0.00 0.38 0.57 0.00 0.00 0.05
attributes(varimp(glmres))$var
#> [1] (Intercept) Sepal.Width Petal.Length Petal.Width
#> [5] Speciesversicolor Speciesvirginica
#> 6 Levels: (Intercept) < Petal.Width < ... < Petal.Length
firis<-as.formula(Sepal.Length~.)
Xiris<-model.matrix(firis,iris)
Diris<-data.frame(Xiris[,-1],iris$Sepal.Length)
colnames(Diris)[6]<-"Y"
coef(glmboost(Xiris,iris$Sepal.Length))
#> (Intercept) Sepal.Width Petal.Length Speciesvirginica
#> -3.36560001 0.53639026 0.46090783 -0.01924627
#> attr(,"offset")
#> [1] 5.843333
singboost(Diris)
#> $`Selected variables`
#> [1] "Petal.Length>=Sepal.Width>=Speciesvirginica"
#>
#> $Coefficients
#> [1] 2.47773332 0.53639026 0.46090783 0.00000000 0.00000000 -0.01924627
#>
#> $Freqs
#> [1] 0.00 0.38 0.57 0.00 0.00 0.05
#>
#> $VarCoef
#> Intercept Sepal.Width Petal.Length Speciesvirginica
#> 2.47773332 0.53639026 0.46090783 -0.01924627
singboost(Diris,LS=TRUE)
#> $`Selected variables`
#> [1] "Petal.Length>=Sepal.Width>=Speciesvirginica"
#>
#> $Coefficients
#> [1] 2.47773332 0.53639026 0.46090783 0.00000000 0.00000000 -0.01924627
#>
#> $Freqs
#> [1] 0.00 0.38 0.57 0.00 0.00 0.05
#>
#> $VarCoef
#> Intercept Sepal.Width Petal.Length Speciesvirginica
#> 2.47773332 0.53639026 0.46090783 -0.01924627
The application to the iris data set demonstrates that L2-Boosting is a special instance of our SingBoost algorithm (Werner and Ruckdeschel (2019)). More precisely, using the squared loss function, our SingBoost algorithm results in exactly the same model and coefficients, provided that the hyperparameters are identical. We did not specify them explicitly in the example, but the number of iterations in singboost is m_iter = 100 by default and the learning rate is κ = 0.1, mimicking the default setting of these parameters for L2-Boosting in mboost (Hothorn et al. (2017)). Note that we did not explicitly insert a Boosting family. By default, singboost uses Gaussian(), which corresponds to standard L2-Boosting. The LS argument indicates that we indeed perform so-called singular iterations. Clearly, for the squared loss function, both results are the same since L2-Boosting with least-squares baselearners implicitly also takes the best baselearner in the sense that it fits the current residuals best, evaluated in the squared loss (which is, however, implemented in a much more sophisticated way in mboost by using correlation updates); see e.g. Zhao and Yu (2004), Zhao and Yu (2007). For clarity, we manually renamed the response variable Sepal.Length as Y in advance in the previous example (which is not necessary). singboost reports the selected variables, the coefficients and the selection frequencies, i.e., the relative number of iterations in which a particular predictor has been selected. Note that singboost does not report the offset and the intercept separately.
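As a quick check of this equivalence, the selection frequencies can be compared directly (ignoring attributes such as names, and assuming both are returned in the same variable order):
# The selection frequencies of the default singboost call coincide with the ones
# obtained from glmboost via varimp().
all.equal(as.numeric(attributes(varimp(glmres))$self), singboost(Diris)$Freqs,
          check.attributes = FALSE)
#> [1] TRUE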
singboost of course also allows for categorical variables, interaction terms, etc. See the following example:
glmres2<-glmboost(Sepal.Length~Petal.Length+Sepal.Width:Species,iris)
finter<-as.formula(Sepal.Length~Petal.Length+Sepal.Width:Species-1)
Xinter<-model.matrix(finter,iris)
Dinter<-data.frame(Xinter,iris$Sepal.Length)
singboost(Dinter)
#> $`Selected variables`
#> [1] "Petal.Length>=Sepal.Width.Speciessetosa>=Sepal.Width.Speciesvirginica"
#>
#> $Coefficients
#> [1] 4.00831289 0.45662497 0.08896750 0.00000000 0.01751541
#>
#> $Freqs
#> [1] 0.00 0.60 0.36 0.00 0.04
#>
#> $VarCoef
#> Intercept Petal.Length
#> 4.00831289 0.45662497
#> Sepal.Width.Speciessetosa Sepal.Width.Speciesvirginica
#> 0.08896750 0.01751541
coef(glmres2)
#> (Intercept) Petal.Length
#> -1.83502045 0.45662497
#> Sepal.Width:Speciessetosa Sepal.Width:Speciesvirginica
#> 0.08896750 0.01751541
#> attr(,"offset")
#> [1] 5.843333
The example above shows how to deal with interaction terms. Instead of inserting a formula argument as input for singboost, one builds the model matrix (without the intercept column) and enters this matrix as the input argument. The same procedure is valid for categorical variables or basis functions. Note that we have not yet implemented a group structure, i.e., we always treat all columns separately and do not perform block-wise updates as suggested in Gertheiss et al. (2011).
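The same recipe also works for basis functions. As a purely illustrative variant of the iris example, one can, for instance, expand Petal.Length into a quadratic polynomial basis before calling singboost:
# Hypothetical example with basis functions: build the model matrix containing a
# polynomial basis for Petal.Length, drop the intercept column and append the
# response as the last column.
fpoly <- as.formula(Sepal.Length ~ Sepal.Width + poly(Petal.Length, 2))
Xpoly <- model.matrix(fpoly, iris)
Dpoly <- data.frame(Xpoly[, -1], iris$Sepal.Length)
singboost(Dpoly)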
We now demonstrate how to apply SingBoost for quantile regression and hard ranking regression:
glmquant<-glmboost(Sepal.Length~.,iris,family=QuantReg(tau=0.75))
coef(glmquant)
#> (Intercept) Sepal.Width Petal.Length
#> -3.1274211 0.5191849 0.4756341
#> attr(,"offset")
#> 50%
#> 5.8
attributes(varimp(glmquant))$self
#> [1] 0.24 0.27 0.49 0.00 0.00 0.00
singboost(Diris,singfamily=QuantReg(tau=0.75),LS=TRUE)
#> $`Selected variables`
#> [1] "Petal.Length>=Sepal.Width>=Speciesvirginica"
#>
#> $Coefficients
#> [1] 2.47836568 0.53612658 0.46095525 0.00000000 0.00000000 -0.01925953
#>
#> $Freqs
#> [1] 0.00 0.38 0.57 0.00 0.00 0.05
#>
#> $VarCoef
#> Intercept Sepal.Width Petal.Length Speciesvirginica
#> 2.47836568 0.53612658 0.46095525 -0.01925953
singboost(Diris,singfamily=QuantReg(tau=0.75),LS=TRUE,M=2)
#> $`Selected variables`
#> [1] "Petal.Length>=Sepal.Width>=Speciesvirginica"
#>
#> $Coefficients
#> [1] 2.44900980 0.54702733 0.45925162 0.00000000 0.00000000 -0.01196688
#>
#> $Freqs
#> [1] 0.00 0.42 0.55 0.00 0.00 0.03
#>
#> $VarCoef
#> Intercept Sepal.Width Petal.Length Speciesvirginica
#> 2.44900980 0.54702733 0.45925162 -0.01196688
singboost(Diris,singfamily=Rank(),LS=TRUE)
#> $`Selected variables`
#> [1] "Petal.Length>=Sepal.Width>=Speciesvirginica>=Speciesversicolor"
#>
#> $Coefficients
#> [1] 2.485989684 0.534163798 0.460436981 0.000000000 0.000344599
#> [6] -0.018630537
#>
#> $Freqs
#> [1] 0.00 0.38 0.56 0.00 0.01 0.05
#>
#> $VarCoef
#> Intercept Sepal.Width Petal.Length Speciesversicolor
#> 2.485989684 0.534163798 0.460436981 0.000344599
#> Speciesvirginica
#> -0.018630537
The second quantile regression simulation uses the argument M=2, which indicates that we strictly alternate between singular and L2-Boosting iterations, whereas the other simulations use the default M=10.
More details concerning the implementation of the SingBoost algorithm
can be found in Werner (2019b).
If coefficient paths have to be drawn, use the path.singboost function followed by the singboost.plot function instead of the singboost function, since only path.singboost also saves the intercept and coefficient paths. In the singboost.plot function, M and m_iter have to be inserted again:
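A minimal sketch of this workflow is given below; the argument names passed to singboost.plot are an assumption based on the description above and may differ from the actual interface.
# Hedged sketch: path.singboost additionally stores the intercept and coefficient
# paths, which singboost.plot then draws; M and m_iter are inserted again.
pathiris <- path.singboost(Diris)
singboost.plot(pathiris, M = 10, m_iter = 100)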
CMB intends to aggregate SingBoost models that have been fitted on subsamples of the training data. The main motivation behind this idea is that SingBoost is expected to detect relevant variables for the respective (complicated) loss function that L2-Boosting would not select. From a measure-theoretical perspective, these columns may be identified with a “singular part” (cf. Werner and Ruckdeschel (2019)). In order to provide the opportunity to study this singular part for specific loss functions, we provide the CMB procedure which aggregates SingBoost models in order to get a more “stable” singular part. Note that the CMB procedure does not replace a Stability Selection. Furthermore, if one is only interested in obtaining a suitable stable model, we recommend, for the sake of computational cost, using only a Stability Selection without the CMB procedure by setting Bsing=1 and nsing=ncmb when using the CMB3S function (see below).
The main idea is to draw Bsing subsamples with nsing rows from the training set and to compute the singboost model on each of them. Then, the models are aggregated, where different aggregation procedures can be used, i.e., either a simple average (weights1) or a weighted average based on the reciprocal of the out-of-sample loss (weights2), computed on the complement of the current subsample.
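To fix ideas, here is a small schematic sketch of the two aggregation schemes applied to the selection frequencies of three hypothetical SingBoost models; the numbers are illustrative only and this is not the package's internal code.
# Rows of freqmat: selection frequencies of Bsing = 3 hypothetical SingBoost
# models for four predictors; oosloss: their out-of-sample losses.
freqmat <- rbind(c(0.4, 0.5, 0.0, 0.1),
                 c(0.3, 0.6, 0.1, 0.0),
                 c(0.5, 0.4, 0.0, 0.1))
oosloss <- c(0.20, 0.10, 0.40)
colMeans(freqmat)                       # 'weights1': simple average
w <- (1 / oosloss) / sum(1 / oosloss)   # 'weights2': reciprocal-loss weights
colSums(w * freqmat)                    # weighted average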
Let us nevertheless provide an example for CMB:
set.seed(19931023)
cmb1<-CMB(Diris,nsing=100,Bsing=50,alpha=0.8,singfam=Rank(),
evalfam=Rank(),sing=TRUE,M=10,m_iter=100,
kap=0.1,LS=TRUE,wagg='weights1',robagg=FALSE,lower=0)
cmb1
#> $`Column measure`
#> [1] 0.000 1.000 1.000 0.325 0.525 0.800
#>
#> $`Selected variables`
#> [1] "Sepal.Width>=Petal.Length>=Speciesvirginica>=Speciesversicolor>=Petal.Width"
#>
#> $`Variable names`
#> [1] "Intercept" "Sepal.Width" "Petal.Length"
#> [4] "Petal.Width" "Speciesversicolor" "Speciesvirginica"
#> [7] "Y"
#>
#> $`Row measure`
#> [1] 0.725 0.650 0.600 0.650 0.750 0.650 0.775 0.700 0.600 0.500 0.725 0.725
#> [13] 0.675 0.650 0.675 0.700 0.725 0.600 0.575 0.600 0.625 0.750 0.650 0.675
#> [25] 0.650 0.650 0.750 0.625 0.625 0.650 0.675 0.700 0.700 0.725 0.600 0.600
#> [37] 0.550 0.675 0.625 0.725 0.700 0.675 0.625 0.600 0.650 0.650 0.650 0.675
#> [49] 0.825 0.525 0.525 0.750 0.625 0.600 0.625 0.750 0.600 0.700 0.700 0.625
#> [61] 0.575 0.675 0.625 0.475 0.775 0.600 0.725 0.700 0.675 0.550 0.575 0.700
#> [73] 0.600 0.700 0.775 0.625 0.700 0.700 0.525 0.700 0.800 0.700 0.650 0.625
#> [85] 0.650 0.825 0.675 0.725 0.675 0.675 0.800 0.775 0.675 0.650 0.700 0.725
#> [97] 0.400 0.700 0.775 0.775 0.750 0.625 0.700 0.650 0.650 0.675 0.725 0.725
#> [109] 0.700 0.700 0.625 0.650 0.675 0.600 0.650 0.650 0.525 0.650 0.650 0.650
#> [121] 0.625 0.700 0.600 0.625 0.575 0.675 0.750 0.700 0.700 0.575 0.700 0.650
#> [133] 0.725 0.725 0.700 0.725 0.775 0.700 0.725 0.650 0.725 0.700 0.750 0.525
#> [145] 0.650 0.625 0.725 0.650 0.600 0.725
In this example, we apply SingBoost with the hard ranking loss (inserted through singfam=Rank()) to Bsing = 50 subsamples with nsing = 100 rows each from the iris data set. After computing the SingBoost models, their out-of-sample performance is evaluated on the complement of the subsample used for training (which thus always contains 50 observations), again using the hard ranking loss (inserted through evalfam=Rank()). The parameter alpha=0.8 indicates that we only use the best 80% (i.e., the best 40) of the SingBoost models for aggregation, and wagg='weights1' leads to a simple average of the selection frequencies. Finally, the aggregated selection frequencies (which form an empirical aggregated “column measure”) are reported as well as the selected variables. The last part of the output, the empirical aggregated “row measure”, reports the “selection frequencies” for the rows. This is not trivial: due to the partitioning, these frequencies are not just the relative numbers of occurrences of the respective rows in the training sets; they also contain information about the “quality” of the instances, since only the relative numbers of occurrences in the training sets corresponding to the best models are computed.
L2-Boosting is known to suffer from overfitting. This drawback is not limited to L2-Boosting, but also affects the Lasso and other Boosting models, and therefore also SingBoost. Usually, one applies the Stability Selection of Meinshausen and Bühlmann (2010) (for the Lasso and its variants) or of Hofner, Boccuto, and Göker (2015) (for Boosting models), which computes the models on subsamples of the training set and selects a set of stable variables according to some threshold. Motivated by the fact that our SingBoost essentially combines L2-Boosting and a secant step according to some other target loss function, we propose a Stability Selection that is based on the performance of the models on some validation set, evaluated in this target loss function.
The main issue concerning the standard Stability Selection is that two of three parameters (number of variables per model, PFER and threshold) have to be inserted into the function stabsel available in the package stabs (Hofner and Hothorn (2017)). However, since the recommendations for these parameters do not necessarily hold for SingBoost models, our idea is to define a grid of thresholds or a grid of cardinalities (numbers of variables in the final (stable) model). Then, we compute SingBoost models on B subsamples with ncmb rows from the training set and average the selection frequencies. Given a grid of thresholds (indicated by setting gridtype='pigrid'), we compute the stable model for each threshold by taking the variables whose aggregated selection frequencies are at least as high as the threshold, and evaluate their performance on a validation set according to our target loss function. To this end, we either compute SingBoost or least squares coefficients on the reduced data set which only contains the regressor columns corresponding to the stable variables. The stable model with the best validation performance (and with it, implicitly, the optimal threshold) is finally reported.

The main issue that we observed when working with thresholds is that once the signal-to-noise ratio is rather low, it frequently happens that no variable passes the threshold, which leads to an empty model, see Werner (2019b). Since an empty model is never a reasonable model, we propose to alternatively define a grid of numbers of variables in the final model such that for each number, say q (not to be confused with the q in Hofner’s Stability Selection where this variable indicates the number of variables selected in each Boosting model), the q variables with the highest aggregated selection frequencies enter the stable model (using this type of grid is indicated by setting gridtype='qgrid'). The selection of the best model, and therefore of the optimal number of variables, is again based on the validation performance.
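The two grid types can be illustrated with a small schematic example (hypothetical aggregated selection frequencies, not package output):
# 'pigrid': all variables whose aggregated selection frequency reaches the
# threshold enter the stable model; 'qgrid': the q most frequently selected
# variables enter the stable model.
freqs <- c(Sepal.Width = 1.00, Petal.Length = 1.00, Petal.Width = 0.22,
           Speciesversicolor = 0.41, Speciesvirginica = 0.52)
names(freqs)[freqs >= 0.8]                   # threshold 0.8 ('pigrid')
names(sort(freqs, decreasing = TRUE))[1:2]   # q = 2 ('qgrid')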
Note that although our Stability Selection requires the arguments nsing and Bsing, one can easily use it without the CMB aggregation procedure by specifying nsing = ncmb and Bsing = 1. This is important if one wants to compare our Stability Selection directly with the Stability Selection of Hofner, Boccuto, and Göker (2015).
On the other hand, the aggregation paradigms that we already mentioned in the CMB subsection can be applied directly when aggregating the SingBoost models for the Stability Selection. By default, our CMB3S function averages the selection frequencies across the models. If a weighted aggregation or the aggregation of only the best models is desired, set ntrain = ncmb, B = 1, Bsing > 1 and nsing < ncmb. Effectively, Bsing subsamples with nsing observations each are drawn from the training set, and the aggregation schemes that we mentioned above are available through the arguments alpha and wagg; a hypothetical configuration of this kind is sketched after the examples below.
Let us now consider the following illustrative example:
set.seed(19931023)
ind<-sample(1:150,120,replace=FALSE)
Dtrain<-Diris[ind,]
Dvalid<-Diris[-ind,]
set.seed(19931023)
cmb3s1<-CMB3S(Dtrain,nsing=80,Dvalid=Dvalid,ncmb=80,Bsing=1,B=100,alpha=0.5,singfam=Gaussian(),
evalfam=Gaussian(),sing=FALSE,M=10,m_iter=100,kap=0.1,LS=FALSE,wagg='weights1',gridtype='pigrid',
grid=seq(0.8,0.9,1),robagg=FALSE,lower=0,singcoef=TRUE,Mfinal=10)
cmb3s1$Fin
#> Intercept Sepal.Width Petal.Length
#> 2.2572372 0.5979883 0.4636778
cmb3s1$Stab
#> [1] 0.00 1.00 1.00 0.22 0.41 0.52
We ran the CMB3S function, where CMB-3S is the acronym for “Column Measure Boosting with SingBoost and Stability Selection”. We divided the data set into a training set and a validation set. From the training set containing 120 instances, we draw B = 100 subsamples containing ncmb = 80 instances each. On each subsample, we compute the L2-Boosting model (since singfam=Gaussian()). Having aggregated the SingBoost selection frequencies and the overall CMB selection frequencies, the 30 instances that were not contained in the initial training set form the validation set for the Stability Selection. We used a grid of thresholds, since gridtype='pigrid', with the thresholds 0.8, 0.9 and 1 (inserted as the grid argument), so for each threshold the stable model is computed as well as its performance on the validation set. To this end, the required coefficients are computed by SingBoost (since singcoef=TRUE), where every 10th iteration is a singular iteration (due to Mfinal=10) according to the squared loss (since singfam=Gaussian()). The validation loss is also the squared loss since evalfam=Gaussian(). The CMB3S function outputs the aggregated selection frequencies for each variable as well as the finally selected stable variables. Note that only Sepal.Width and Petal.Length are selected since only these two variables pass the optimal threshold.
Alternatively, one can run the same simulation with a grid of cardinalities by just changing the input arguments gridtype and grid:
set.seed(19931023)
ind<-sample(1:150,120,replace=FALSE)
Dtrain<-Diris[ind,]
Dvalid<-Diris[-ind,]
set.seed(19931023)
cmb3s2<-CMB3S(Dtrain,nsing=80,Dvalid=Dvalid,ncmb=80,Bsing=1,B=100,alpha=0.5,singfam=Gaussian(),
evalfam=Gaussian(),sing=FALSE,M=10,m_iter=100,kap=0.1,LS=FALSE,wagg='weights1',gridtype='qgrid',
grid=seq(1,2,3),robagg=FALSE,lower=0,singcoef=TRUE,Mfinal=10)
cmb3s2$Fin
#> Intercept Sepal.Width Petal.Length
#> 2.2572372 0.5979883 0.4636778
cmb3s2$Stab
#> [1] 0.00 1.00 1.00 0.22 0.41 0.52
One can see that the optimal number of final variables turned out to be 2, so the stable predictor set is the same as in the previous example.
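A configuration with weighted aggregation as described above could look as follows; this is a hypothetical call that is not evaluated here, with argument values chosen purely for illustration and the interface taken from the examples above.
# Hypothetical configuration (not run): weighted aggregation ('weights2') of the
# best 50% of Bsing = 50 SingBoost models, each fitted on nsing = 60 rows, while
# the CMB step itself uses the full training set (B = 1, ncmb = nrow(Dtrain)).
cmb3s_w <- CMB3S(Dtrain, nsing = 60, Dvalid = Dvalid, ncmb = 120, Bsing = 50,
                 B = 1, alpha = 0.5, singfam = Rank(), evalfam = Rank(),
                 sing = TRUE, M = 10, m_iter = 100, kap = 0.1, LS = TRUE,
                 wagg = 'weights2', gridtype = 'pigrid', grid = c(0.8, 0.9, 1),
                 robagg = FALSE, lower = 0, singcoef = TRUE, Mfinal = 10)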
Optionally, we also implemented a function CV.CMB3S which takes into account that the initial data set is randomly divided into a training and a validation set. Therefore, the CV.CMB-3S procedure cross-validates the Stability Selection itself by using multiple partitions of the initial data set. The main reason to use this function is to compare the performance of our stable models against some competitor models, so the initial data set is not only partitioned into training and validation sets, but a test set is also drawn such that these three sets form a partition of the initial data set. The cross-validated Stability Selection reports the cross-validated loss, evaluated in the target loss function (inserted through targetfam). We strongly recommend parallelizing the cross-validated Stability Selection due to its huge computational cost.
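Since the exact interface of CV.CMB3S is not shown here, the following hypothetical sketch only illustrates the underlying idea: the Stability Selection is repeated on several random partitions in parallel, reusing the CMB3S arguments from the examples above, and the aggregated selection frequencies are averaged across the partitions.
# Hypothetical parallelized sketch of the idea behind CV.CMB-3S (not its actual
# interface): repeat CMB3S on several random partitions and average the
# aggregated selection frequencies. On Windows, set mc.cores = 1.
cv_stab <- parallel::mclapply(1:5, function(seed) {
  set.seed(seed)
  ind <- sample(1:150, 120, replace = FALSE)
  CMB3S(Diris[ind, ], nsing = 80, Dvalid = Diris[-ind, ], ncmb = 80, Bsing = 1,
        B = 100, alpha = 0.5, singfam = Gaussian(), evalfam = Gaussian(),
        sing = FALSE, M = 10, m_iter = 100, kap = 0.1, LS = FALSE,
        wagg = 'weights1', gridtype = 'pigrid', grid = seq(0.8, 0.9, 1),
        robagg = FALSE, lower = 0, singcoef = TRUE, Mfinal = 10)$Stab
}, mc.cores = 2)
Reduce("+", cv_stab) / length(cv_stab)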