Title: Gradient-Free Gradient Boosting
Description: Implementation of routines of the author's PhD thesis on gradient-free Gradient Boosting (Werner, Tino (2020) "Gradient-Free Gradient Boosting", URL '<https://oops.uni-oldenburg.de/id/eprint/4290>').
Authors: Tino Werner [aut, cre, cph]
Maintainer: Tino Werner <[email protected]>
License: GPL (>= 2)
Version: 0.1.1
Built: 2024-12-01 08:37:10 UTC
Source: CRAN
Aggregates the selection frequencies of multiple SingBoost models. Should be used with caution since there are no recommendations about good hyperparameters yet.
CMB(D, nsing, Bsing = 1, alpha = 1, singfam = Gaussian(), evalfam = Gaussian(),
    sing = FALSE, M = 10, m_iter = 100, kap = 0.1, LS = FALSE, best = 1,
    wagg, robagg = FALSE, lower = 0, ...)
D |
Data matrix. Has to be an |
nsing |
Number of observations (rows) used for the SingBoost submodels. |
Bsing |
Number of subsamples based on which the SingBoost models are validated. Default is 1. Not to confuse with parameter |
alpha |
Optional real number in |
singfam |
A SingBoost family. The SingBoost models are trained based on the corresponding loss function. Default is |
evalfam |
A SingBoost family. The SingBoost models are validated according to the corresponding loss function. Default is |
sing |
If |
M |
An integer between 2 and |
m_iter |
Number of SingBoost iterations. Default is 100. |
kap |
Learning rate (step size). Must be a real number in |
LS |
If a |
best |
Needed in the case of localized ranking. The parameter |
wagg |
Type of row weight aggregation. |
robagg |
Optional. If setting |
lower |
Optional argument. Only reasonable when setting |
... |
Optional further arguments |
SingBoost is designed to detect variables that standard Boosting procedures may miss but that may be relevant w.r.t. the target loss function. One may try to stabilize this "singular part" of the column measure by aggregating several SingBoost models: they are evaluated on a validation set and their selection frequencies are averaged, possibly in a weighted manner according to the validation losses. Warning: This procedure does not replace a Stability Selection!
Column measure |
Aggregated column measure as |
Selected variables |
Names of the variables with positive aggregated column measure. |
Variables names |
Names of all variables including the intercept. |
Row measure |
Aggregated row measure as |
Werner, T., Gradient-Free Gradient Boosting, PhD Thesis, Carl von Ossietzky University Oldenburg, 2020
firis <- as.formula(Sepal.Length ~ .)
Xiris <- model.matrix(firis, iris)
Diris <- data.frame(Xiris[, -1], iris$Sepal.Length)
colnames(Diris)[6] <- "Y"
set.seed(19931023)
cmb1 <- CMB(Diris, nsing = 100, Bsing = 50, alpha = 0.8, singfam = Rank(),
            evalfam = Rank(), sing = TRUE, M = 10, m_iter = 100,
            kap = 0.1, LS = TRUE, wagg = 'weights1', robagg = FALSE, lower = 0)
cmb1
set.seed(19931023)
cmb2 <- CMB(Diris, nsing = 100, Bsing = 50, alpha = 0.8, singfam = Rank(),
            evalfam = Rank(), sing = TRUE, M = 2, m_iter = 100,
            kap = 0.1, LS = TRUE, wagg = 'weights1', robagg = FALSE, lower = 0)
cmb2[[1]]
set.seed(19931023)
cmb3 <- CMB(Diris, nsing = 100, Bsing = 50, alpha = 0.8, singfam = Rank(),
            evalfam = Rank(), sing = TRUE, M = 10, m_iter = 100,
            kap = 0.1, LS = TRUE, wagg = 'weights2', robagg = FALSE, lower = 0)
cmb3[[1]]
Draws a Stability plot for CMB.
CMB.stabpath(D, nsing, Bsing = 1, alpha = 1, singfam = Gaussian(), evalfam = Gaussian(),
             sing = FALSE, Mseq, m_iter = 100, kap = 0.1, LS = FALSE, best = 1,
             wagg, robagg = FALSE, lower = 0, B, ncmb, ...)
D |
Data matrix. Has to be an |
nsing |
Number of observations (rows) used for the SingBoost submodels. |
Bsing |
Number of subsamples based on which the SingBoost models are validated. Default is 1. Not to confuse with parameter |
alpha |
Optional real number in |
singfam |
A SingBoost family. The SingBoost models are trained based on the corresponding loss function. Default is |
evalfam |
A SingBoost family. The SingBoost models are validated according to the corresponding loss function. Default is |
sing |
If |
Mseq |
A vector of different values for |
m_iter |
Number of SingBoost iterations. Default is 100. |
kap |
Learning rate (step size). Must be a real number in |
LS |
If a |
best |
Needed in the case of localized ranking. The parameter |
wagg |
Type of row weight aggregation. |
robagg |
Optional. If setting |
lower |
Optional argument. Only reasonable when setting |
B |
Number of subsamples of size |
ncmb |
Number of samples used for |
... |
Optional further arguments |
relev |
List of relevant variables (represented as their column number). |
ind |
Vector of relevant variables (represented as their column number). |
Werner, T., Gradient-Free Gradient Boosting, PhD Thesis, Carl von Ossietzky University Oldenburg, 2020
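A possible call on the iris data, following the pattern of the CMB examples above; all hyperparameter values are purely illustrative and not tuned recommendations:

firis <- as.formula(Sepal.Length ~ .)
Xiris <- model.matrix(firis, iris)
Diris <- data.frame(Xiris[, -1], iris$Sepal.Length)
colnames(Diris)[6] <- "Y"
set.seed(19931023)
## stability paths of the aggregated selection frequencies over several values of M
CMB.stabpath(Diris, nsing = 80, Bsing = 5, alpha = 0.8, singfam = Gaussian(),
             evalfam = Gaussian(), sing = FALSE, Mseq = c(2, 5, 10, 20),
             m_iter = 100, kap = 0.1, LS = FALSE, wagg = 'weights1',
             robagg = FALSE, lower = 0, B = 50, ncmb = 100)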
Workhorse function for the Stability Selection variant in which either a grid of thresholds or a grid of cardinalities is given, so that the Boosting models are evaluated on a validation set according to all elements of the respective grid. The model which performs best is finally selected as the stable model.
CMB.Stabsel(Dtrain, nsing, Bsing = 1, B = 100, alpha = 1, singfam = Gaussian(),
            evalfam = Gaussian(), sing = FALSE, M = 10, m_iter = 100, kap = 0.1,
            LS = FALSE, best = 1, wagg, gridtype, grid, Dvalid, ncmb,
            robagg = FALSE, lower = 0, singcoef = FALSE, Mfinal, ...)
Dtrain |
Data matrix. Has to be an |
nsing |
Number of observations (rows) used for the SingBoost submodels. |
Bsing |
Number of subsamples based on which the SingBoost models are validated. Default is 1. Not to confuse with parameter |
B |
Number of subsamples based on which the CMB models are validated. Default is 100. Not to confuse with |
alpha |
Optional real number in |
singfam |
A SingBoost family. The SingBoost models are trained based on the corresponding loss function. Default is |
evalfam |
A SingBoost family. The SingBoost models are validated according to the corresponding loss function. Default is |
sing |
If |
M |
An integer between 2 and |
m_iter |
Number of SingBoost iterations. Default is 100. |
kap |
Learning rate (step size). Must be a real number in |
LS |
If a |
best |
Needed in the case of localized ranking. The parameter |
wagg |
Type of row weight aggregation. |
gridtype |
Choose between |
grid |
The grid for the thresholds (in |
Dvalid |
Validation data for selecting the optimal element of the grid and with it the best corresponding model. |
ncmb |
Number of samples used for |
robagg |
Optional. If setting |
lower |
Optional argument. Only reasonable when setting |
singcoef |
Default is |
Mfinal |
Optional. Necessary if |
... |
Optional further arguments |
The Stability Selection in the packages stabs and mboost requires fixing two of the three parameters per-family error rate, threshold and number of variables to be selected in each model. Our Stability Selection is based on a different idea: we also train Boosting models on subsamples, but we use a validation step to determine the size of the optimal model. More precisely, if 'pigrid' is used as gridtype, the corresponding stable models for each threshold are computed by selecting all variables whose aggregated selection frequency exceeds the threshold. These candidate stable models are then validated according to the target loss function (inserted through evalfam) and the optimal one is finally selected. If 'qgrid' is used as gridtype, a vector of positive integers has to be entered instead of a vector of thresholds. The candidate stable models then consist of the q best variables (ordered by their aggregated selection frequencies) for each grid element q. The validation step is the same.
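For illustration, the two grid formats differ as follows (the concrete values are arbitrary):

## gridtype = 'pigrid': thresholds for the aggregated selection frequencies
grid <- c(0.5, 0.7, 0.9)
## gridtype = 'qgrid': candidate model sizes (positive integers)
grid <- c(3, 5, 10)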
colind.opt |
The column numbers of the variables that form the best stable model as a vector. |
coeff.opt |
The coefficients corresponding to the optimal stable model as a vector. |
aggnu |
Aggregated empirical column measure (i.e., selection frequencies) as a vector. |
aggzeta |
Aggregated empirical row measure (i.e., row weights) as a vector. |
Werner, T., Gradient-Free Gradient Boosting, PhD Thesis, Carl von Ossietzky University Oldenburg, 2020
T. Hothorn, P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner. mboost: Model-Based Boosting, 2017
B. Hofner and T. Hothorn. stabs: Stability Selection with Error Control, 2017.
B. Hofner, L. Boccuto, and M. Göker. Controlling false discoveries in high-dimensional situations: Boosting with stability selection. BMC Bioinformatics, 16(1):144, 2015.
N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
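A sketch of a direct call on the iris data with purely illustrative hyperparameters; as a workhorse function, CMB.Stabsel is usually invoked via CMB3S instead:

firis <- as.formula(Sepal.Length ~ .)
Xiris <- model.matrix(firis, iris)
Diris <- data.frame(Xiris[, -1], iris$Sepal.Length)
colnames(Diris)[6] <- "Y"
set.seed(19931023)
ind <- sample(1:150, 120, replace = FALSE)
Dtrain <- Diris[ind, ]
Dvalid <- Diris[-ind, ]
## candidate stable models from three selection-frequency thresholds ('pigrid')
stab <- CMB.Stabsel(Dtrain, nsing = 100, Bsing = 5, B = 50, alpha = 0.8,
                    singfam = Gaussian(), evalfam = Gaussian(), sing = FALSE,
                    M = 10, m_iter = 100, kap = 0.1, LS = FALSE, wagg = 'weights1',
                    gridtype = 'pigrid', grid = c(0.5, 0.7, 0.9), Dvalid = Dvalid,
                    ncmb = 110, robagg = FALSE, lower = 0, singcoef = FALSE, Mfinal = 10)
stab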
Executes CMB and the loss-based Stability Selection.
CMB3S(Dtrain, nsing, Bsing = 1, B = 100, alpha = 1, singfam = Gaussian(),
      evalfam = Gaussian(), sing = FALSE, M = 10, m_iter = 100, kap = 0.1,
      LS = FALSE, best = 1, wagg, gridtype, grid, Dvalid, ncmb,
      robagg = FALSE, lower = 0, singcoef = FALSE, Mfinal = 10, ...)
Dtrain |
Data matrix. Has to be an |
nsing |
Number of observations (rows) used for the SingBoost submodels. |
Bsing |
Number of subsamples based on which the SingBoost models are validated. Default is 1. Not to confuse with parameter |
B |
Number of subsamples based on which the CMB models are validated. Default is 100. Not to confuse with |
alpha |
Optional real number in |
singfam |
A SingBoost family. The SingBoost models are trained based on the corresponding loss function. Default is |
evalfam |
A SingBoost family. The SingBoost models are validated according to the corresponding loss function. Default is |
sing |
If |
M |
An integer between 2 and |
m_iter |
Number of SingBoost iterations. Default is 100. |
kap |
Learning rate (step size). Must be a real number in |
LS |
If a |
best |
Needed in the case of localized ranking. The parameter |
wagg |
Type of row weight aggregation. |
gridtype |
Choose between |
grid |
The grid for the thresholds (in |
Dvalid |
Validation data for selecting the optimal element of the grid and with it the best corresponding model. |
ncmb |
Number of samples used for |
robagg |
Optional. If setting |
lower |
Optional argument. Only reasonable when setting |
singcoef |
Default is |
Mfinal |
Optional. Necessary if |
... |
Optional further arguments |
See CMB and CMB.Stabsel.
Final coefficients |
The coefficients corresponding to the optimal stable model as a vector. |
Stable column measure |
Aggregated empirical column measure (i.e., selection frequencies) as a vector. |
Selected columns |
The column numbers of the variables that form the best stable model as a vector. |
Used row measure |
Aggregated empirical row measure (i.e., row weights) as a vector. |
Werner, T., Gradient-Free Gradient Boosting, PhD Thesis, Carl von Ossietzky University Oldenburg, 2020
T. Hothorn, P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner. mboost: Model-Based Boosting, 2017
B. Hofner and T. Hothorn. stabs: Stability Selection with Error Control, 2017.
B. Hofner, L. Boccuto, and M. Göker. Controlling false discoveries in high-dimensional situations: Boosting with stability selection. BMC Bioinformatics, 16(1):144, 2015.
N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
firis <- as.formula(Sepal.Length ~ .)
Xiris <- model.matrix(firis, iris)
Diris <- data.frame(Xiris[, -1], iris$Sepal.Length)
colnames(Diris)[6] <- "Y"
set.seed(19931023)
ind <- sample(1:150, 120, replace = FALSE)
Dtrain <- Diris[ind, ]
Dvalid <- Diris[-ind, ]
set.seed(19931023)
cmb3s <- CMB3S(Dtrain, nsing = 120, Dvalid = Dvalid, ncmb = 120, Bsing = 1, B = 1, alpha = 1,
               singfam = Gaussian(), evalfam = Gaussian(), sing = FALSE, M = 10, m_iter = 100,
               kap = 0.1, LS = FALSE, wagg = 'weights1', gridtype = 'pigrid',
               grid = seq(0.8, 0.9, 1), robagg = FALSE, lower = 0, singcoef = TRUE, Mfinal = 10)
cmb3s$Fin
cmb3s$Stab
cmb3s$Sel
glmres4 <- glmboost(Sepal.Length ~ ., iris[ind, ])
coef(glmres4)
set.seed(19931023)
cmb3s1 <- CMB3S(Dtrain, nsing = 80, Dvalid = Dvalid, ncmb = 100, Bsing = 10, B = 100, alpha = 0.5,
                singfam = Gaussian(), evalfam = Gaussian(), sing = FALSE, M = 10, m_iter = 100,
                kap = 0.1, LS = FALSE, wagg = 'weights1', gridtype = 'pigrid',
                grid = seq(0.8, 0.9, 1), robagg = FALSE, lower = 0, singcoef = TRUE, Mfinal = 10)
cmb3s1$Fin
cmb3s1$Stab
## This may take around a minute
set.seed(19931023)
cmb3s2 <- CMB3S(Dtrain, nsing = 80, Dvalid = Dvalid, ncmb = 100, Bsing = 10, B = 100, alpha = 0.5,
                singfam = Rank(), evalfam = Rank(), sing = TRUE, M = 10, m_iter = 100,
                kap = 0.1, LS = TRUE, wagg = 'weights2', gridtype = 'pigrid',
                grid = seq(0.8, 0.9, 1), robagg = FALSE, lower = 0, singcoef = TRUE, Mfinal = 10)
cmb3s2$Fin
cmb3s2$Stab
set.seed(19931023)
cmb3s3 <- CMB3S(Dtrain, nsing = 80, Dvalid = Dvalid, ncmb = 100, Bsing = 10, B = 100, alpha = 0.5,
                singfam = Huber(), evalfam = Huber(), sing = FALSE, M = 10, m_iter = 100,
                kap = 0.1, LS = FALSE, wagg = 'weights2', gridtype = 'pigrid',
                grid = seq(0.8, 0.9, 1), robagg = FALSE, lower = 0, singcoef = FALSE, Mfinal = 10)
cmb3s3$Fin
cmb3s3$Stab
Cross-validates the whole loss-based Stability Selection by aggregating several stable models according to their performance on validation sets. Also computes a cross-validated test loss on a disjoint test set.
CV.CMB3S(D, nsing, Bsing = 1, B = 100, alpha = 1, singfam = Gaussian(),
         evalfam = Gaussian(), sing = FALSE, M = 10, m_iter = 100, kap = 0.1,
         LS = FALSE, best = 1, wagg, gridtype, grid, ncmb, CVind,
         targetfam = Gaussian(), print = TRUE, robagg = FALSE, lower = 0,
         singcoef = FALSE, Mfinal = 10, ...)
D |
Data matrix. Has to be an |
nsing |
Number of observations (rows) used for the SingBoost submodels. |
Bsing |
Number of subsamples based on which the SingBoost models are validated. Default is 1. Not to confuse with parameter |
B |
Number of subsamples based on which the CMB models are validated. Default is 100. Not to confuse with |
alpha |
Optional real number in |
singfam |
A SingBoost family. The SingBoost models are trained based on the corresponding loss function. Default is |
evalfam |
A SingBoost family. The SingBoost models are validated according to the corresponding loss function. Default is |
sing |
If |
M |
An integer between 2 and |
m_iter |
Number of SingBoost iterations. Default is 100. |
kap |
Learning rate (step size). Must be a real number in |
LS |
If a |
best |
Needed in the case of localized ranking. The parameter |
wagg |
Type of row weight aggregation. |
gridtype |
Choose between |
grid |
The grid for the thresholds (in |
ncmb |
Number of samples used for |
CVind |
A list where each element contains a vector of length |
targetfam |
Target loss. Should be the same family as |
print |
If set to |
robagg |
Optional. If setting |
lower |
Optional argument. Only reasonable when setting |
singcoef |
Default is |
Mfinal |
Optional. Necessary if |
... |
Optional further arguments |
In CMB3S, a validation set is given based on which the optimal stable model is chosen. The CV.CMB3S function adds an outer cross-validation step such that both the training and the validation data sets (and optionally the test data sets) are chosen randomly by disjointly dividing the initial data set. The aggregated stable models form an "ultra-stable" model. It is strongly recommended to use this function in a parallelized manner due to the huge computation time.
Cross-validated loss |
A vector containing the cross-validated test losses. |
Ultra-stable column measure |
A vector containing the aggregated selection frequencies of the stable models. |
Werner, T., Gradient-Free Gradient Boosting, PhD Thesis, Carl von Ossietzky University Oldenburg, 2020
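A hedged sketch with purely illustrative values; it assumes that CVind is supplied in the format produced by random.CVind, and the run may take a while:

firis <- as.formula(Sepal.Length ~ .)
Xiris <- model.matrix(firis, iris)
Diris <- data.frame(Xiris[, -1], iris$Sepal.Length)
colnames(Diris)[6] <- "Y"
set.seed(19931023)
## 3 cross-validation loops: 100 training, 30 validation, 20 test observations each
cvind <- random.CVind(n = 150, ncmb = 100, nval = 30, CV = 3)
cvres <- CV.CMB3S(Diris, nsing = 80, Bsing = 5, B = 50, alpha = 0.8,
                  singfam = Gaussian(), evalfam = Gaussian(), sing = FALSE,
                  M = 10, m_iter = 100, kap = 0.1, LS = FALSE, wagg = 'weights1',
                  gridtype = 'pigrid', grid = c(0.5, 0.7, 0.9), ncmb = 100,
                  CVind = cvind, targetfam = Gaussian(), print = TRUE,
                  robagg = FALSE, lower = 0, singcoef = FALSE, Mfinal = 10)
cvres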
Auxiliary function for generating simple artificial data sets with normally distributed coefficients and regressors. Note that this function is only provided for reproducibility of the simulations from the author's PhD thesis.
genDataFromExamples( p, n, s = 1, xmean = 0, betamean = 0, betasd = 1, snr = 2, rho = 0 )
p |
Number of variables (columns). |
n |
Number of observations (rows). |
s |
Sparsity. Real number between 0 and 1. |
xmean |
Mean of each of the normally distributed columns. Default is 0. |
betamean |
Mean of each of the normally distributed coefficients. Default is 0. |
betasd |
Standard deviation of the normally distributed coefficients. Default is 1. |
snr |
Signal to noise ratio. Real number greater than zero. Default is 2. |
rho |
Parameter for a Toeplitz covariance structure of the regressors. Real number between -1 and 1. Default is 0 which corresponds to uncorrelated columns. |
D |
Data matrix |
vars |
A list of the relevant variables. |
genDataFromExamples(10,25,0.3)
Gradient-free Gradient Boosting family for the localized ranking loss function including its fast computation.
LocRank(K)
K |
Indicates that we are interested in the top K of the list, i.e., the K best instances according to their response values. |
The localized ranking loss combines the hard and the weak ranking loss, i.e., it penalizes misrankings at the top of the list (the best instances according to the response value) and "misclassification" in the sense that instances belonging to the top of the list are ranked lower and vice versa. The localized ranking loss already returns a normalized loss that can take values between 0 and 1. LocRank returns a family object as in the package mboost.
A Boosting family object
Werner, T., Gradient-Free Gradient Boosting, PhD Thesis, Carl von Ossietzky University Oldenburg, 2020, Equation (5.2.5)
T. Hothorn, P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner. mboost: Model-Based Boosting, 2017
y <- c(-3, 10.3, -8, 12, 14, -0.5, 29, -1.1, -5.7, 119)
yhat <- c(0.02, 0.6, 0.1, 0.47, 0.82, 0.04, 0.77, 0.09, 0.01, 0.79)
LocRank(4)@risk(y, yhat)
LocRank(5)@risk(y, yhat)
Runs SingBoost but additionally saves the coefficient paths. If no coefficient path plot is needed, just use singboost.
path.singboost( D, M = 10, m_iter = 100, kap = 0.1, singfamily = Gaussian(), best = 1, LS = FALSE )
D |
Data matrix. Has to be an |
M |
An integer between 2 and |
m_iter |
Number of SingBoost iterations. Default is 100. |
kap |
Learning rate (step size). Must be a real number in |
singfamily |
A Boosting family corresponding to the target loss function. See . |
best |
Needed in the case of localized ranking. The parameter |
LS |
If a |
Selected variables |
Names of the selected variables. |
Coefficients |
The selected coefficients as an |
Freqs |
Selection frequencies and a matrix for intercept and coefficient paths, respectively. |
Intercept path |
The intercept path as an |
Coefficient path |
The coefficient paths as a |
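A short usage sketch, mirroring the singboost.plot example further below:

firis <- as.formula(Sepal.Length ~ .)
Xiris <- model.matrix(firis, iris)
Diris <- data.frame(Xiris[, -1], iris$Sepal.Length)
colnames(Diris)[6] <- "Y"
## fit SingBoost while recording the intercept and coefficient paths
singpath <- path.singboost(Diris, M = 10, m_iter = 100, kap = 0.1,
                           singfamily = Gaussian(), LS = FALSE)
singboost.plot(singpath, M = 10, m_iter = 100, subnames = FALSE)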
Simple auxiliary function for randomly generating the indices for training, validation and test data for cross validation.
random.CVind(n, ncmb, nval, CV)
n |
Number of observations (rows). |
ncmb |
Number of training samples for the SingBoost models in CMB. Must be an integer between 1 and |
nval |
Number of validation samples in the CMB aggregation procedure. Must be an integer between 1 and |
CV |
Number of cross validation steps. Must be a positive integer. |
The data set consists of n observations; ncmb of them are used for the CMB aggregation procedure. Note that within CMB itself, only a subset of these observations may be used for SingBoost training. The Stability Selection is based on the validation set consisting of nval observations. The cross-validated loss of the final model is evaluated on the test data set with the remaining (n - ncmb - nval) observations. Clearly, all data sets need to be disjoint.
CVind |
List of row indices for training, validation and test data for each cross validation loop. |
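A minimal sketch with illustrative values:

set.seed(19931023)
## 3 cross-validation loops on 150 rows: 100 for CMB training, 30 for validation,
## the remaining 20 for testing
cvind <- random.CVind(n = 150, ncmb = 100, nval = 30, CV = 3)
str(cvind, max.level = 1)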
Gradient-free Gradient Boosting family for the hard ranking loss function including its fast computation.
Rank()
The hard ranking loss is used to compare different orderings, usually the true ordering of the instances of a data set according to their responses with the predicted counterparts. The usage of the pcaPP package avoids the cumbersome computation that would otherwise require comparing every pair of observations. Rank returns a family object as in the package mboost.
A Boosting family object
Werner, T., Gradient-Free Gradient Boosting, PhD Thesis, Carl von Ossietzky University Oldenburg, 2020, Equations (5.2.2) and (5.2.3)
T. Hothorn, P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner. mboost: Model-Based Boosting, 2017
y <- c(-3, 10.3, -8, 12, 14, -0.5, 29, -1.1, -5.7, 119)
yhat <- c(0.02, 0.6, 0.1, 0.47, 0.82, 0.04, 0.77, 0.09, 0.01, 0.79)
Rank()@risk(y, yhat)
x <- 1:6
z <- 6:1
Rank()@risk(x, z)
x <- 1:6
z <- 1:6
Rank()@risk(x, z)
Validation step to combine different SingBoost models.
RejStep( D, nsing, Bsing = 1, ind, sing = FALSE, singfam = Gaussian(), evalfam = Gaussian(), M = 10, m_iter = 100, kap = 0.1, LS = FALSE, best = 1 )
D |
Data matrix. Has to be an |
nsing |
Number of observations (rows) used for the SingBoost submodels. |
Bsing |
Number of subsamples based on which the SingBoost models are validated. Default is 1. Not to confuse with parameter |
ind |
Vector with indices for dividing the data set into training and validation data. |
sing |
If |
singfam |
A SingBoost family. The SingBoost models are trained based on the corresponding loss function. Default is |
evalfam |
A SingBoost family. The SingBoost models are validated according to the corresponding loss function. Default is |
M |
An integer between 2 and |
m_iter |
Number of SingBoost iterations. Default is 100. |
kap |
Learning rate (step size). Must be a real number in |
LS |
If a |
best |
Needed in the case of localized ranking. The parameter |
Divides the data set into a training and a validation set. The SingBoost models are computed on the training set and evaluated on the validation set based on the loss function corresponding to the selected Boosting family.
loss |
Vector of validation losses. |
occ |
Selection frequencies for each Boosting model. |
Werner, T., Gradient-Free Gradient Boosting, PhD Thesis, Carl von Ossietzky University Oldenburg, 2020
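A hedged sketch of a direct call; RejStep is mainly an internal building block of CMB, and treating ind as the training row indices is an assumption of this sketch:

firis <- as.formula(Sepal.Length ~ .)
Xiris <- model.matrix(firis, iris)
Diris <- data.frame(Xiris[, -1], iris$Sepal.Length)
colnames(Diris)[6] <- "Y"
set.seed(19931023)
ind <- sample(1:150, 120, replace = FALSE)  # assumed training rows; the rest validate
RejStep(Diris, nsing = 100, Bsing = 5, ind = ind, sing = TRUE,
        singfam = Rank(), evalfam = Rank(), M = 10, m_iter = 100,
        kap = 0.1, LS = TRUE, best = 1)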
SingBoost is a Boosting method that can deal with complicated loss functions that do not allow for a gradient. SingBoost is based on L2-Boosting in its current implementation.
singboost( D, M = 10, m_iter = 100, kap = 0.1, singfamily = Gaussian(), best = 1, LS = FALSE )
D |
Data matrix. Has to be an |
M |
An integer between 2 and |
m_iter |
Number of SingBoost iterations. Default is 100. |
kap |
Learning rate (step size). Must be a real number in |
singfamily |
A Boosting family corresponding to the target loss function. See . |
best |
Needed in the case of localized ranking. The parameter |
LS |
If a |
Gradient Boosting algorithms require convexity and differentiability of the underlying loss function. SingBoost is a Boosting algorithm based on L2-Boosting that allows for complicated loss functions that do not need to satisfy these requirements. In fact, SingBoost alternates between standard L2-Boosting iterations and singular iterations in which essentially an empirical gradient step is executed, in the sense that the baselearner that performs best, evaluated in the complicated loss, is selected in the respective iteration. The implementation is based on glmboost from the package mboost, and using the L2 loss in the singular iterations returns exactly the same coefficients as L2-Boosting.
Selected variables |
Names of the selected variables. |
Coefficients |
The selected coefficients as an |
Freqs |
Selection frequencies and a matrix for intercept and coefficient paths, respectively. |
VarCoef |
Vector of the non-zero coefficients. |
Werner, T., Gradient-Free Gradient Boosting, PhD Thesis, Carl von Ossietzky University Oldenburg, 2020
P. Bühlmann and B. Yu. Boosting with the l2 loss: Regression and Classification. Journal of the American Statistical Association, 98(462):324–339, 2003
T. Hothorn, P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner. mboost: Model-Based Boosting, 2017
glmres <- glmboost(Sepal.Length ~ ., iris)
glmres
attributes(varimp(glmres))$self
attributes(varimp(glmres))$var
firis <- as.formula(Sepal.Length ~ .)
Xiris <- model.matrix(firis, iris)
Diris <- data.frame(Xiris[, -1], iris$Sepal.Length)
colnames(Diris)[6] <- "Y"
coef(glmboost(Xiris, iris$Sepal.Length))
singboost(Diris)
singboost(Diris, LS = TRUE)
glmres2 <- glmboost(Sepal.Length ~ Petal.Length + Sepal.Width:Species, iris)
finter <- as.formula(Sepal.Length ~ Petal.Length + Sepal.Width:Species - 1)
Xinter <- model.matrix(finter, iris)
Dinter <- data.frame(Xinter, iris$Sepal.Length)
singboost(Dinter)
coef(glmres2)
glmres3 <- glmboost(Xiris, iris$Sepal.Length, control = boost_control(mstop = 250, nu = 0.05))
coef(glmres3)
attributes(varimp(glmres3))$self
singboost(Diris, m_iter = 250, kap = 0.05)
singboost(Diris, LS = TRUE, m_iter = 250, kap = 0.05)
glmquant <- glmboost(Sepal.Length ~ ., iris, family = QuantReg(tau = 0.75))
coef(glmquant)
attributes(varimp(glmquant))$self
singboost(Diris, singfamily = QuantReg(tau = 0.75), LS = TRUE)
singboost(Diris, singfamily = QuantReg(tau = 0.75), LS = TRUE, M = 2)
singboost(Diris, singfamily = Rank(), LS = TRUE)
singboost(Diris, singfamily = Rank(), LS = TRUE, M = 2)
Plot function for the SingBoost coefficient paths
singboost.plot(mod, M, m_iter, subnames = FALSE)
mod |
singboost object. |
M |
An integer between 2 and |
m_iter |
Number of SingBoost iterations. Default is 100. |
subnames |
Use it only if the variable names are of the form "letter plus number". Better just ignore it. |
Nothing. Plots the SingBoost coefficient paths.
glmres <- glmboost(Sepal.Length ~ ., iris)
glmres
attributes(varimp(glmres))$self
attributes(varimp(glmres))$var
firis <- as.formula(Sepal.Length ~ .)
Xiris <- model.matrix(firis, iris)
Diris <- data.frame(Xiris[, -1], iris$Sepal.Length)
plot(glmres)
singpath <- path.singboost(Diris)
singboost.plot(singpath, 10, 100, subnames = FALSE)
Gradient-free Gradient Boosting family for the weak ranking loss function.
WeakRank(K)
K |
Indicates that we are only interested in the top K of the list, i.e., the K best instances according to their response values. |
The weak ranking loss may be regarded as a classification loss. The parameter K defines the top of the list, consisting of the K best instances according to their response values. Then the weak ranking loss penalizes "misclassification" in the sense that instances belonging to the top of the list are ranked lower and vice versa. WeakRank returns a family object as in the package mboost.
A Boosting family object
Werner, T., Gradient-Free Gradient Boosting, PhD Thesis, Carl von Ossietzky University Oldenburg, 2020, Remark (5.2.1)
T. Hothorn, P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner. mboost: Model-Based Boosting, 2017
y <- c(-3, 10.3, -8, 12, 14, -0.5, 29, -1.1, -5.7, 119)
yhat <- c(0.02, 0.6, 0.1, 0.47, 0.82, 0.04, 0.77, 0.09, 0.01, 0.79)
WeakRank(4)@risk(y, yhat)
WeakRank(5)@risk(y, yhat)
Gradient-free Gradient Boosting family for the normalized weak ranking loss function.
WeakRankNorm(K)
K |
Indicates that we are only interested in the top K of the list, i.e., the K best instances according to their response values. |
A more intuitive loss function than the weak ranking loss thanks to its normalization to a maximum value of 1. For example, if k of the top K instances have not been ranked at the top of the list, the normalized weak ranking loss is k/K. WeakRankNorm returns a family object as in the package mboost.
A Boosting family object
Werner, T., Gradient-Free Gradient Boosting, PhD Thesis, Carl von Ossietzky University Oldenburg, 2020, Remark (5.2.4)
T. Hothorn, P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner. mboost: Model-Based Boosting, 2017
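A usage sketch by analogy with the WeakRank example above (not part of the original examples):

y <- c(-3, 10.3, -8, 12, 14, -0.5, 29, -1.1, -5.7, 119)
yhat <- c(0.02, 0.6, 0.1, 0.47, 0.82, 0.04, 0.77, 0.09, 0.01, 0.79)
WeakRankNorm(4)@risk(y, yhat)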