The sgs R package fits sparse-group SLOPE (SGS) and group SLOPE (gSLOPE) models. The package has implementations for linear and logistic regression, both of which are demonstrated here. It also uses strong screening rules to reduce computation time, described in detail in F. Feser, M. Evangelou (2024) “Strong Screening Rules for Group-based SLOPE Models”. Screening rules are applied by default; their impact is demonstrated in the Screening section at the end.
Sparse-group SLOPE (SGS) is a penalised regression approach that performs bi-level selection with FDR control under orthogonal designs. SGS is described in detail in F. Feser, M. Evangelou (2023) “Sparse-group SLOPE: adaptive bi-level selection with FDR-control”.
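Schematically, and omitting the precise definitions of the adaptive weight sequences (see the cited paper for these), SGS solves a problem of the form

$$
\hat{\beta} = \arg\min_{\beta} \ \tfrac{1}{2} \lVert y - X\beta \rVert_2^2 + \lambda \alpha \sum_{i=1}^{p} v_i \lvert \beta \rvert_{(i)} + \lambda (1 - \alpha) \sum_{g=1}^{m} w_g \sqrt{p_g}\, \lVert \beta^{(g)} \rVert_2,
$$

where the $\lvert \beta \rvert_{(i)}$ are sorted in decreasing order (the weighted group norms are treated analogously), $v$ and $w$ are decreasing SLOPE-style weight sequences, $p_g$ is the size of group $g$, and $\alpha$ balances the variable-level and group-level penalties (the alpha argument in the fitting calls below).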
For this example, a 400 × 500 input matrix with a simple grouping structure is used, sampled from a multivariate Gaussian distribution with no correlation.
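As a minimal sketch (the seed, signal strength, and exact generator here are assumptions; the signal counts of 28 variables in 10 groups are inferred from the sensitivity values reported later), data with the structure accessed throughout this vignette can be generated by hand:

set.seed(3)
n = 400
p = 500
groups = rep(1:100, each = 5) # 100 groups of 5 variables each
X = matrix(rnorm(n * p), nrow = n, ncol = p) # uncorrelated Gaussian design
true_grp_id = sort(sample(1:100, 10)) # 10 signal groups (assumed count)
true_var_id = sort(sample(which(groups %in% true_grp_id), 28)) # 28 signal variables (assumed count)
true_beta = numeric(p)
true_beta[true_var_id] = rnorm(28, mean = 0, sd = sqrt(10)) # signal strength is an arbitrary choice
y = as.numeric(X %*% true_beta + rnorm(n))
data = list(X = X, y = y, true_beta = true_beta,
            true_var_id = true_var_id, true_grp_id = true_grp_id)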
We now fit an SGS model to the data using linear regression. The SGS model has many different hyperparameters which can be tuned/selected. Of particular importance is the λ parameter, which defines the level of sparsity in the model. First, we select this manually and then next use cross-validation to tune it. The other parameters we leave as their default values, although they can easily be changed.
model = fit_sgs(X = data$X, y = data$y, groups = groups, type = "linear", lambda = 0.5, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, standardise = "l2", intercept = TRUE, verbose = FALSE, screen = TRUE)
Note: we have fitted an intercept and applied ℓ2 standardisation. This is the recommended usage when applying SGS with linear regression. The lambda values can also be calculated automatically, starting at the null model and continuing along a path specified by the min_frac and path_length arguments (as in the cross-validation call below).
The package provides several useful outputs after fitting a model. The beta vector shows the fitted values (note the intercept). We can also recover the indices of the non-zero variables (selected_var) and groups (selected_grp), which are indexed from the first variable, not the intercept.
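The calls producing the output below are not part of this extract; a sketch, using the beta, selected_var, and selected_grp components accessed elsewhere in this vignette (group_effects is a hypothetical name for the accessor behind the second block of output):

model$beta[model$selected_var + 1] # non-zero coefficients (+1 offsets the intercept)
model$group_effects                # hypothetical accessor: fitted group-level effects
model$selected_var                 # indices of the non-zero variables
model$selected_grp                 # indices of the non-zero groups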
## [1] -1.6734720 -1.4893849 3.1678789 1.6350236 5.0369209 -1.4619740
## [7] 1.2779414 -0.9364065 4.9809334 -0.6548784 -1.4470782 -2.2343330
## [13] 1.6188817 -1.6665858 -0.7298015
## [1] 3.879978 5.295647 1.461974 1.277941 5.068190 1.588364 2.234333 2.435343
## [1] 97 99 100 133 136 170 217 231 234 260 263 334 391 393 394
## [1] 30 39 46 56 59 64 76 85
Defining a function that lets us calculate various metrics (including the FDR and sensitivity):
fdr_sensitivity = function(fitted_ids, true_ids, num_coef) {
  # calculates the FDR, sensitivity, FPR, and F1 score of a selection
  num_true = length(intersect(fitted_ids, true_ids))
  num_false = length(fitted_ids) - num_true
  num_missed = length(true_ids) - num_true
  num_true_negatives = num_coef - length(true_ids)
  out = list()
  out$fdr = num_false / (num_true + num_false)
  if (is.nan(out$fdr)) {
    out$fdr = 0 # no discoveries made
  }
  out$sensitivity = num_true / length(true_ids)
  if (length(true_ids) == 0) {
    out$sensitivity = 1 # nothing to discover
  }
  out$fpr = num_false / num_true_negatives
  out$f1 = (2 * num_true) / (2 * num_true + num_false + num_missed)
  if (is.nan(out$f1)) {
    out$f1 = 1 # empty selection and empty truth
  }
  return(out)
}
Calculating the relevant metrics gives the results below.
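A sketch of the calls, mirroring those applied to the cross-validated model later on (model is the manually-fitted object above); first, for the variable selection:

fdr_sensitivity(fitted_ids = model$selected_var, true_ids = data$true_var_id, num_coef = 500)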
## $fdr
## [1] 0
##
## $sensitivity
## [1] 0.5357143
##
## $fpr
## [1] 0
##
## $f1
## [1] 0.6976744
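and for the group selection:

fdr_sensitivity(fitted_ids = model$selected_grp, true_ids = data$true_grp_id, num_coef = 100)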
## $fdr
## [1] 0
##
## $sensitivity
## [1] 0.8
##
## $fpr
## [1] 0
##
## $f1
## [1] 0.8888889
The model is currently too sparse, as our choice of λ is too high. We can instead use cross-validation.
Cross-validation is used to fit SGS models along a λ path of length 20. The first value, λmax, is chosen to give the null model and the path is terminated at λmin = δλmax, where δ is some value between 0 and 1 (given by min_frac in the function). The 1se rule (as in the glmnet package) is used to choose the optimal model.
cv_model = fit_sgs_cv(X = data$X, y = data$y, groups = groups, type = "linear", path_length = 20, nfolds = 10, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, min_frac = 0.05, standardise = "l2", intercept = TRUE, verbose = TRUE, screen = TRUE)
## [1] "Fold 1/10 done. Error: 3910.0939772"
## [1] "Fold 2/10 done. Error: 3376.48939972751"
## [1] "Fold 3/10 done. Error: 3062.30130094167"
## [1] "Fold 4/10 done. Error: 3858.4636925338"
## [1] "Fold 5/10 done. Error: 2513.53852887078"
## [1] "Fold 6/10 done. Error: 4400.18480300288"
## [1] "Fold 7/10 done. Error: 3353.92178900251"
## [1] "Fold 8/10 done. Error: 2936.67725189627"
## [1] "Fold 9/10 done. Error: 3214.80654464167"
## [1] "Fold 10/10 done. Error: 2644.28024293398"
The verbose output contains useful information, showing the error for each fold. Aside from this, we can see a more succinct summary; a sketch, assuming the fitted CV object supports the generic print method:
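print(cv_model) # assumption: prints the regression type and the lambda path summary below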
##
## regression type: linear
##
## lambda error num.nonzero
## [1,] 1.8885960 9523.7357 0
## [2,] 1.6131093 9207.3047 1
## [3,] 1.3778075 8483.2428 2
## [4,] 1.1768288 7660.2673 3
## [5,] 1.0051665 6853.3743 4
## [6,] 0.8585444 5743.6287 14
## [7,] 0.7333098 4449.9434 15
## [8,] 0.6263430 3427.6404 15
## [9,] 0.5349793 2666.8889 15
## [10,] 0.4569427 2096.7952 16
## [11,] 0.3902891 1624.7847 22
## [12,] 0.3333582 1237.4624 22
## [13,] 0.2847318 935.0866 22
## [14,] 0.2431984 714.0932 22
## [15,] 0.2077234 549.4968 24
## [16,] 0.1774231 423.4776 24
## [17,] 0.1515426 328.7270 26
## [18,] 0.1294373 254.9699 26
## [19,] 0.1105565 200.4616 27
## [20,] 0.0944298 160.1339 27
The best model is found to be the one at the end of the path:
## [1] 20
Checking the metrics again, we see that CV has produced a model with the right level of sparsity, giving FDR values below the specified targets.
fdr_sensitivity(fitted_ids = cv_model$fit$selected_var, true_ids = data$true_var_id, num_coef = 500)
## $fdr
## [1] 0.03703704
##
## $sensitivity
## [1] 0.9285714
##
## $fpr
## [1] 0.002118644
##
## $f1
## [1] 0.9454545
fdr_sensitivity(fitted_ids = cv_model$fit$selected_grp, true_ids = data$true_grp_id, num_coef = 100)
## $fdr
## [1] 0.09090909
##
## $sensitivity
## [1] 1
##
## $fpr
## [1] 0.01111111
##
## $f1
## [1] 0.952381
As mentioned, the package can also be used to fit SGS to a binary response. First, we generate some binary data, reusing the same input matrix, X, and true β as before, and split the data into training and test sets to assess the model's classification performance.
sigmoid = function(x) {
1 / (1 + exp(-x))
}
y = ifelse(sigmoid(data$X %*% data$true_beta + rnorm(400)) > 0.5, 1, 0)
train_y = y[1:350]
test_y = y[351:400]
train_X = data$X[1:350,]
test_X = data$X[351:400,]
We can again apply CV.
cv_model = fit_sgs_cv(X = train_X, y = train_y, groups = groups, type = "logistic", path_length = 20, nfolds = 10, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, min_frac = 0.05, standardise = "l2", intercept = FALSE, verbose = TRUE, screen = TRUE)
## [1] "Fold 1/10 done. Error: 0.211428571428571"
## [1] "Fold 2/10 done. Error: 0.21"
## [1] "Fold 3/10 done. Error: 0.182857142857143"
## [1] "Fold 4/10 done. Error: 0.274285714285714"
## [1] "Fold 5/10 done. Error: 0.188571428571429"
## [1] "Fold 6/10 done. Error: 0.16"
## [1] "Fold 7/10 done. Error: 0.157142857142857"
## [1] "Fold 8/10 done. Error: 0.191428571428571"
## [1] "Fold 9/10 done. Error: 0.222857142857143"
## [1] "Fold 10/10 done. Error: 0.21"
and again, use the predict function. For logistic regression, the function returns both the predicted class probabilities (response) and the predicted classes (class). We can use these to check the prediction accuracy, given as 82%.
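A sketch of this step, assuming the predict method takes the new design matrix via an x argument and returns a list with response and class components (the exact calling convention may differ):

predictions = predict(cv_model, x = test_X)       # assumption: new data passed via x
predictions$response[1:5]                         # predicted class probabilities
predictions$class[1:5]                            # predicted classes
mean(predictions$class == test_y)                 # prediction accuracy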
## [1] 0.4090457 0.6485261 0.1668541 0.1466649 0.3901439
## [1] 0 1 0 0 0
## [1] 0.82
Group SLOPE (gSLOPE) applies adaptive group penalisation to control the group FDR under orthogonal designs. gSLOPE is described in detail in D. Brzyski, A. Gossmann, W. Su, M. Bogdan (2019) “Group SLOPE – Adaptive Selection of Groups of Predictors”. gSLOPE is implemented in the sgs package with the same features as SGS. Here, we briefly demonstrate how to fit a gSLOPE model.
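For example, mirroring the earlier fit_sgs call (this assumes fit_gslope, like fit_sgs, accepts a manually chosen lambda; its remaining arguments appear in the Screening section below):

model_gslope = fit_gslope(X = data$X, y = data$y, groups = groups, type = "linear", lambda = 0.5, gFDR = 0.1, standardise = "l2", intercept = TRUE, verbose = FALSE, screen = TRUE)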
Screening rules allow the input dimensionality to be reduced before fitting. The strong screening rules for gSLOPE and SGS are described in detail in F. Feser, M. Evangelou (2024) “Strong Screening Rules for Group-based SLOPE Models”. Here, we demonstrate the effectiveness of screening rules by looking at the speed improvement they provide. For SGS:
screen_time = system.time(model_screen <- fit_sgs(X = data$X, y = data$y, groups = groups, type = "linear", path_length = 100, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, standardise = "l2", intercept = TRUE, verbose = FALSE, screen = TRUE))
no_screen_time = system.time(model_no_screen <- fit_sgs(X = data$X, y = data$y, groups = groups, type = "linear", path_length = 100, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, standardise = "l2", intercept = TRUE, verbose = FALSE, screen = FALSE))
screen_time
## user system elapsed
## 8.058 10.316 6.110
no_screen_time
## user system elapsed
## 10.183 11.654 6.974
and for gSLOPE:
screen_time = system.time(model_screen <- fit_gslope(X = data$X, y = data$y, groups = groups, type = "linear", path_length = 100, gFDR = 0.1, standardise = "l2", intercept = TRUE, verbose = FALSE, screen = TRUE))
no_screen_time = system.time(model_no_screen <- fit_gslope(X = data$X, y = data$y, groups = groups, type = "linear", path_length = 100, gFDR = 0.1, standardise = "l2", intercept = TRUE, verbose = FALSE, screen = FALSE))
screen_time
## user system elapsed
## 4.993 7.446 3.367
no_screen_time
## user system elapsed
## 13.064 18.411 8.190
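The speed-up can be read off directly from the timing objects; for gSLOPE, screening makes the fit roughly 2.4 times faster here:

# ratio of elapsed times: without screening vs with screening
as.numeric(no_screen_time["elapsed"]) / as.numeric(screen_time["elapsed"])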