The package provides a tool to select variables in a nonlinear multivariate model. More precisely, it selects variables from n observations satisfying the following nonparametric regression model:
Yi = f(xi) + εi, 1 ≤ i ≤ n,
where f is an unknown real-valued function and the εi’s are i.i.d. centered random variables with variance σ². The xi’s are observation points which belong to a compact set S of ℝ^p. We also assume that f actually depends on only d of the p variables, with d < p, which means that there exists a real-valued function f̃ such that f(x) = f̃(x̃), where x ∈ ℝ^p and x̃ ∈ ℝ^d. Variable selection consists in identifying the components of x̃. This variable selection approach is described in [1]. We refer the reader to this paper for further details and references.
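To make the setting concrete, the following base R sketch simulates observations from a model of this form. The function f_toy is a hypothetical example that depends only on variables 3 and 5; it is not the function f1 of [1], and the sample size is chosen arbitrarily.

```r
set.seed(1)

n <- 200      # number of observations (arbitrary choice for this sketch)
p <- 5        # total number of variables
sigma <- 0.25 # standard deviation of the noise

# Observation points drawn uniformly in [0, 1]^p
x <- matrix(runif(n * p), nrow = n, ncol = p)

# Hypothetical regression function: depends only on columns 3 and 5
f_toy <- function(x) sin(2 * pi * x[, 3]) + (x[, 5] - 0.5)^2

# Noisy responses: Y_i = f(x_i) + eps_i with centered Gaussian noise
y <- f_toy(x) + rnorm(n, mean = 0, sd = sigma)

dim(x)
length(y)
```

Here d = 2 and the components of x̃ are columns 3 and 5 of x: changing any other column leaves f_toy(x) unchanged.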
We first apply our method to n = 700 observations satisfying the model above with f = f1, as defined in [1], where p = 5. These observations are generated with Gaussian noise of standard deviation σ = 0.25. In the following, the d = 2 relevant variables to select are {3, 5} and the irrelevant ones to discard are {1, 2, 4}.
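The code further down refers to an object named true.dimensions whose definition is not shown. A minimal sketch, assuming it is simply the vector of relevant indices (false.dimensions is a hypothetical name introduced here for illustration):

```r
p <- 5                          # total number of variables
true.dimensions <- c(3, 5)      # relevant variables to select
# Irrelevant variables: every index in 1..p that is not relevant
false.dimensions <- setdiff(seq_len(p), true.dimensions)
print(false.dimensions)
```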
The observation set is loaded from files which are provided within the package, as follows:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.3687684 0.16895845 0.7114856 0.1493075 0.2300115
## [2,] 0.7162858 0.47407370 0.2271114 0.8187909 0.3845692
## [3,] 0.5543277 0.63473174 0.9341467 0.4209710 0.1551578
## [4,] 0.2551628 0.55242762 0.8940447 0.8587429 0.6602330
## [5,] 0.1468073 0.21261063 0.8249912 0.7159358 0.6177809
## [6,] 0.3917696 0.01350068 0.6862343 0.8377919 0.6143807
## --- Loading the corresponding noisy values of the response variable --- ##
data('y_obs')
head(y_obs)
## [1] -0.09049367 -1.56817050 0.02365417 0.32580069 1.07158399 1.21354888
absorber to select the relevant variables
The absorber function of the absorber package is applied by using the following arguments:
x: matrix of the observation points xi, where xi belongs to [0, 1]^p, 1 ≤ i ≤ n,
y: vector of the corresponding noisy values of the response variable,
M: order of the B-spline basis. The default value is 3 (quadratic B-splines).
Additional arguments can also be provided in this function:
K: Integer, number K of evenly spaced knots to use in the B-spline basis. The default value is 1.
all.variables: List of characters or integers, labels of the variables. The default value is NULL.
parallel: Logical, if set to TRUE then a parallelized version of the code is used. The default value is FALSE.
nbCore: Numerical, it represents the number of cores used for parallelization, if parallel is set to TRUE.
The resulting outputs are the following:
lambdas: sequence of the used penalization parameters λ.
selec.var: list of sequences of the selected variables, one sequence for each penalization parameter.
aic.var: sequence of variables selected using the AIC.
First, we can print the sequence of penalization parameters λ used in our method:
## [1] 0.01563831 0.01492752 0.01424904 0.01360140 0.01298320 0.01239309
We can then print the corresponding sequences of selected variables for each penalization parameter:
## [[1]]
## NULL
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 3
##
## [[5]]
## [1] 3
##
## [[6]]
## [1] 3
Finally, we can print the variables selected with the AIC:
## [1] 3 5
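The chunks that produced the three outputs above are not shown. The following is a sketch of what they may look like, not the authors' exact code. It assumes the observation matrix ships as a dataset named x_obs (by analogy with y_obs) and that the result is stored in res, the name used later in this document. The call only runs when the absorber package is installed.

```r
# Sketch only: the block is skipped when the absorber package is unavailable.
has_absorber <- requireNamespace("absorber", quietly = TRUE)

if (has_absorber) {
  library(absorber)
  data("x_obs")  # assumed dataset name, by analogy with y_obs
  data("y_obs")

  # x, y and M are the arguments listed above; M = 3 is the stated default.
  res <- absorber(x = x_obs, y = y_obs, M = 3)

  print(head(res$lambdas))    # first penalization parameters
  print(head(res$selec.var))  # selected variables for each of them
  print(res$aic.var)          # variables selected with the AIC
}
```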
plot_selection
The plot_selection function of the absorber package produces a histogram of the variable selection percentage for each variable on which f depends. It also displays in red the results obtained with the AIC.
We can compare this visualization to one that shows the relevant variables in green and the irrelevant ones in red, as in Figure 6 of [1]. To do so, we gather the results into a data.frame as follows:
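# Number of penalization parameters
nlam = length(res$lambdas)
# Count how often each variable is selected across all penalization parameters
occurrence = data.frame(table(unlist(res$selec.var)))
colnames(occurrence) = c("Covariable", "Percentage")
# Convert counts to percentages
occurrence$Percentage = occurrence$Percentage * 100 / nlam
# Sort the variables by decreasing selection percentage
occurrence = occurrence[order(-occurrence$Percentage), , drop = FALSE]
occurrence$Covariable = factor(occurrence$Covariable,
                               levels = unique(occurrence$Covariable))
# Label each variable as relevant ('real features') or irrelevant ('fake features')
occurrence$Category = as.factor(ifelse(occurrence$Covariable %in% true.dimensions,
                                       'real features', 'fake features'))
str(occurrence)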
## 'data.frame': 5 obs. of 3 variables:
## $ Covariable: Factor w/ 5 levels "3","5","4","2",..: 1 2 3 4 5
## $ Percentage: num 99 65 45 37 36
## $ Category : Factor w/ 2 levels "fake features",..: 2 2 1 1 1
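To see what the table(unlist(...)) step computes, here is a small base R example on a hypothetical list of selections. The values are invented for illustration and are not output of the package.

```r
# Hypothetical selections for 4 penalization parameters (not package output)
toy.selec <- list(NULL, 3, c(3, 5), c(3, 5, 1))
nlam.toy <- length(toy.selec)

# unlist() drops the NULL entry and concatenates the rest;
# table() then counts how many times each variable appears.
counts <- table(unlist(toy.selec))

# Convert counts to percentages of the number of penalization parameters
toy.percent <- as.vector(counts) * 100 / nlam.toy
names(toy.percent) <- names(counts)

# Sort by decreasing selection percentage
toy.percent <- sort(toy.percent, decreasing = TRUE)
print(toy.percent)
```

Variable 3 is selected for three of the four parameters (75%), variable 5 for two (50%) and variable 1 for one (25%). An empty selection still counts in the denominator.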
We can then plot the results as a histogram of variable selection percentage:
library(ggplot2)
# Colors follow the alphabetical order of the factor levels:
# 'fake features' in red, 'real features' in green
color.order = c('firebrick', 'forestgreen')[which(c('fake features', 'real features')
                                                  %in% levels(occurrence$Category))]
plt_occ = ggplot(data = occurrence, aes(x = Covariable, y = Percentage, fill = Category)) +
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = color.order) +
  ylab('Percentage of selection') +
  theme_bw() +
  theme(legend.title = element_blank(),
        axis.text.x = element_text(size = 16, face = 'bold'),
        axis.text.y = element_text(size = 14),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 15),
        legend.text = element_text(size = 14),
        legend.position = 'bottom',
        legend.key.size = unit(1, "cm"),
        # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
        panel.grid.major = element_line(linewidth = 0.6, linetype = 'solid',
                                        colour = "darkgrey"),
        panel.grid.minor = element_line(linewidth = 0.2, linetype = 'solid',
                                        colour = "darkgrey"))
print(plt_occ)
The results obtained with the AIC allow us to retrieve the correct relevant variables: the AIC selects {3, 5} and discards the irrelevant ones.
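This comparison can also be checked programmatically. The sketch below hard-codes the AIC selection printed earlier in this document and the set of relevant variables; missed and false.pos are hypothetical names introduced for illustration.

```r
aic.var <- c(3, 5)          # AIC selection printed earlier in this document
true.dimensions <- c(3, 5)  # relevant variables of the example

# Relevant variables that the AIC failed to select
missed <- setdiff(true.dimensions, aic.var)
# Irrelevant variables that the AIC selected by mistake
false.pos <- setdiff(aic.var, true.dimensions)
# TRUE when the selection matches the relevant set exactly
exact <- setequal(aic.var, true.dimensions)

print(exact)
```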
References
[1] Savino, M. E. and Lévy-Leduc, C. (2024) A novel variable selection method in nonlinear multivariate models using B-splines with an application to geoscience. ⟨hal-04434820⟩.