Title: | Confidence Interval Post-Selection of Variable |
---|---|
Description: | Calculates confidence intervals after variable selection using repeated data splits. The package offers methods to address the challenges of post-selection inference, ensuring more accurate confidence intervals in models involving variable selection. The two main functions are 'lmps', which records the different models selected across multiple data splits as well as the corresponding coefficient estimates, and 'CIps', which takes the lmps object as input to select variables and perform inference using two types of voting. |
Authors: | Boubacar DIALLO [aut, cre] |
Maintainer: | Boubacar DIALLO <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.1 |
Built: | 2024-12-22 06:21:02 UTC |
Source: | CRAN |
This package calculates post-selection confidence intervals for the coefficients of the selected variables. It uses repeated data splitting with a voting mechanism and offers two variable selection methods: Lasso and BIC. For Lasso, cross-validation is used to choose the penalty parameter lambda. For BIC, since it is not feasible to evaluate every candidate model, a stepwise search (backward elimination or forward selection) is applied. Selection is done on one part of the data, followed by calibration (estimation of the coefficients of the selected model) on the other part, and this process is repeated multiple times.
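The stepwise BIC search mentioned above can be sketched with base R. This is a minimal illustration under stated assumptions, not the package's implementation: it uses `stats::step` with the penalty `k = log(n)` (which makes the criterion BIC rather than AIC), and the built-in `mtcars` data stands in for the user's data.

```r
# Sketch: BIC-guided stepwise selection with stats::step (not the package's code).
# mtcars is a stand-in data set; mpg is the response.
n <- nrow(mtcars)

# Backward: start from the full model and drop terms while BIC improves.
full <- lm(mpg ~ ., data = mtcars)
backward <- step(full, direction = "backward", k = log(n), trace = 0)

# Forward: start from the intercept-only model and add terms,
# with the full model as the upper bound of the search.
null <- lm(mpg ~ 1, data = mtcars)
forward <- step(null, scope = formula(full), direction = "forward",
                k = log(n), trace = 0)

# Variables kept by each search (the two searches need not agree).
selected_backward <- attr(terms(backward), "term.labels")
selected_forward  <- attr(terms(forward), "term.labels")
print(selected_backward)
print(selected_forward)
```

Neither search visits every subset of predictors, which is why the result can depend on the direction.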
This package provides two main functions:
- lmps: This function provides the model selection matrices for the different data splits, as well as the matrix of coefficient estimates for the selected models. Its 'summary' method gives important information about the appropriate voting type to use with the CIps function.
- CIps: This function takes an 'lmps' object as an argument, along with other parameters that specify the type of vote and the confidence level for the confidence intervals (calculated empirically).
Package: CIpostSelect
Version: 0.1.0
Date: 2024-09-26
License: MIT
Author: Boubacar DIALLO
Maintainer: Boubacar Diallo <[email protected]>
    library(mlbench)
    data("BostonHousing")
    # Create lmps object
    model = lmps(medv ~ ., data = BostonHousing, method = "Lasso", N = 100)
    # Summary of lmps
    summary(model)  # helps choose the appropriate vote type
    # Create CIps object
    cips = CIps(model, vote = "coef", alpha = 0.05, s.vote_coef = 0.5)
    # Results
    print(cips)
    # Summary plot
    plot(cips)
Creates an object of class CIps based on the provided parameters.
CIps(x, vote, alpha = 0.05, s.vote_coef = 0.5)
Argument | Description |
---|---|
x | An object of class lmps, which contains the selection and coefficient estimation matrices. |
vote | The type of vote to perform: "model" for selection based on the most frequent model, or "coef" for variable selection (e.g., if a variable is selected more than 50 percent of the time). |
alpha | Significance level of the confidence intervals: the intervals have confidence level 1 - alpha, so the default 0.05 corresponds to 95% intervals. |
s.vote_coef | A parameter between 0 and 1 that, when using "coef" voting, indicates the frequency threshold for selecting a variable. |
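The two vote types can be illustrated on a toy selection matrix. The sketch below is plain base R and is not the package's internal code; the matrix values and the names `sel`, `s_vote_coef`, `coef_vote` and `model_vote` are made up for illustration.

```r
# Toy selection matrix (made-up values): one row per split, one column per
# variable, TRUE when the variable was selected in that split.
sel <- matrix(
  c(TRUE,  TRUE,  FALSE,
    TRUE,  TRUE,  FALSE,
    TRUE,  FALSE, TRUE,
    TRUE,  TRUE,  TRUE,
    FALSE, FALSE, TRUE),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("x1", "x2", "x3"))
)

# "coef" vote: keep each variable whose selection frequency exceeds the threshold.
s_vote_coef <- 0.5
freq <- colMeans(sel)
coef_vote <- names(freq)[freq > s_vote_coef]

# "model" vote: label each split by its set of selected variables and keep
# the label that occurs most often.
patterns <- apply(sel, 1, function(r) paste(colnames(sel)[r], collapse = "+"))
model_counts <- table(patterns)
model_vote <- names(model_counts)[which.max(model_counts)]

print(freq)
print(coef_vote)
print(model_vote)
```

In this toy case the two votes disagree: every variable passes the 0.5 frequency threshold, while the most frequent single model contains only x1 and x2.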
After obtaining the lmps object, which provides the selection matrices (models and coefficients), this function computes confidence intervals for the chosen voting method at the desired confidence level. The intervals are empirical: each one is calculated from the vector of estimates, across splits, of the corresponding coefficient.
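As a rough illustration of an empirical interval, the sketch below takes percentiles of one coefficient's estimates across splits. It rests on two assumptions that the text does not confirm: that "empirical" means a percentile interval, and that splits in which the variable was not selected are left out. The estimates are simulated.

```r
set.seed(1)

# Simulated estimates of one coefficient over N = 200 splits; NA marks the
# splits in which the variable was not selected (an assumed convention).
N <- 200
est <- rnorm(N, mean = 2, sd = 0.3)
est[sample(N, 40)] <- NA

alpha <- 0.05

# Percentile interval over the splits in which the variable was selected.
ci <- quantile(est, probs = c(alpha / 2, 1 - alpha / 2), na.rm = TRUE)
point <- mean(est, na.rm = TRUE)

print(ci)
print(point)
```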
CIps also provides an intercept (test version), estimated as follows. With a vote on models, it is the average of the intercept estimates over the splits in which the most frequently selected model (among the N splits) was chosen. With a vote on coefficients, the retained coefficient that was selected the fewest times is identified, and the intercept is averaged only over the splits in which that coefficient was selected.
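The two intercept rules described above can be traced on toy inputs. This is a sketch of the description only, not the package's code; the selection matrix, the per-split intercepts and the 0.5 threshold are made-up values.

```r
# Toy inputs (made-up): selection matrix and one intercept estimate per split.
sel <- matrix(
  c(TRUE,  TRUE,  FALSE,
    TRUE,  TRUE,  FALSE,
    TRUE,  FALSE, TRUE,
    TRUE,  TRUE,  TRUE,
    TRUE,  TRUE,  FALSE,
    FALSE, FALSE, TRUE),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("x1", "x2", "x3"))
)
intercepts <- c(10, 12, 9, 14, 11, 8)

# Vote on models: average the intercept over the splits that chose the
# most frequently selected model.
patterns <- apply(sel, 1, function(r) paste(colnames(sel)[r], collapse = "+"))
modal <- names(which.max(table(patterns)))
intercept_model <- mean(intercepts[patterns == modal])

# Vote on coefficients: among the retained variables, find the one selected
# least often and average the intercept over the splits that selected it.
freq <- colMeans(sel)
retained <- names(freq)[freq > 0.5]
rarest <- retained[which.min(freq[retained])]
intercept_coef <- mean(intercepts[sel[, rarest]])

print(c(model = intercept_model, coef = intercept_coef))
```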
An object of class CIps.
    library(mlbench)
    data("BostonHousing")
    # lmps object
    model = lmps(medv~., data = BostonHousing, method = "Lasso", N = 50)
    # CIps object
    cips = CIps(model, vote = "coef", alpha = 0.05, s.vote_coef = 0.5)
    # lmps object
    model = lmps(medv~., data = BostonHousing, method = "Lasso", N = 50, cores = 2)
    # CIps object
    cips = CIps(model, vote = "coef", alpha = 0.05, s.vote_coef = 0.5)
Function that runs the different splits and stores the resulting estimation and variable selection matrices.
    lmps(
      formula,
      data,
      method,
      N,
      p_split = 0.5,
      cores = NULL,
      direction = "backward",
      forced_var = NULL
    )
Argument | Description |
---|---|
formula | Regression model to use, specified as a formula. |
data | Data set to be used for regression modeling. |
method | Method for variable selection. Should be one of the two methods the package offers, Lasso or BIC; the examples pass "Lasso". |
N | Number of splits. |
p_split | Probability used to divide the data into two parts at each split. Defaults to 0.5. |
cores | Number of cores for parallel processing. Defaults to NULL. |
direction | Direction of the stepwise search used for BIC selection (backward or forward). Defaults to "backward". |
forced_var | A character string specifying a predictor variable to be forced into selection. By default, it is NULL, allowing for no forced selection. If provided, this variable will be consistently selected during the N splits. |
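One way to picture a forced variable is a stepwise search that is never allowed to drop a given predictor. The sketch below does this with the `lower` component of the `scope` argument of `stats::step`; it is an assumption-laden analogy on the built-in `mtcars` data, not a description of how `forced_var` is implemented, and the choice of `hp` as the forced predictor is arbitrary.

```r
# Sketch: a BIC backward search that always keeps one predictor ("hp" here).
# Uses stats::step on mtcars; this is not the package's code.
n <- nrow(mtcars)
full <- lm(mpg ~ ., data = mtcars)

# Unconstrained search: any term may be dropped.
free <- step(full, direction = "backward", k = log(n), trace = 0)

# Constrained search: terms in the lower scope can never be dropped.
forced <- step(full, scope = list(lower = ~ hp), direction = "backward",
               k = log(n), trace = 0)

free_terms   <- attr(terms(free), "term.labels")
forced_terms <- attr(terms(forced), "term.labels")
print(free_terms)
print(forced_terms)
```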
The data are split N times, with the rows shuffled each time. At each split, the data are divided into two parts according to the splitting probability p_split. Model selection is performed on the first part, followed by calibration on the second part. At the end of these steps, we obtain matrices of dimension N × p, with one row per split, that record the selected models and the estimated coefficients associated with these models.
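The repeated split, select and calibrate loop can be sketched in base R. This is an illustration under assumptions rather than the package's implementation: a BIC backward search via `stats::step` stands in for the selection step, a subset of `mtcars` stands in for the data, and the names `sel_mat` and `coef_mat` are hypothetical.

```r
set.seed(42)

# Stand-in data: mpg is the response, four numeric predictors.
dat <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]
predictors <- setdiff(names(dat), "mpg")
N <- 20          # number of splits
p_split <- 0.5   # share of rows used for selection
n <- nrow(dat)

# One row per split, one column per predictor.
sel_mat  <- matrix(FALSE, N, length(predictors), dimnames = list(NULL, predictors))
coef_mat <- matrix(NA_real_, N, length(predictors), dimnames = list(NULL, predictors))

for (i in seq_len(N)) {
  # Shuffle and divide the rows into two parts.
  idx <- sample(n, size = floor(p_split * n))
  part1 <- dat[idx, ]
  part2 <- dat[-idx, ]

  # Selection on the first part (BIC backward search as a stand-in).
  fit_sel <- step(lm(mpg ~ ., data = part1), direction = "backward",
                  k = log(nrow(part1)), trace = 0)
  chosen <- attr(terms(fit_sel), "term.labels")
  if (length(chosen) == 0) next  # nothing selected in this split

  # Calibration: refit the chosen model on the second part.
  fit_cal <- lm(reformulate(chosen, response = "mpg"), data = part2)
  sel_mat[i, chosen]  <- TRUE
  coef_mat[i, chosen] <- coef(fit_cal)[chosen]
}

print(colMeans(sel_mat))  # selection frequency of each predictor
```

Keeping selection and estimation on disjoint rows means the coefficient estimates in `coef_mat` are not computed from the same observations that chose the model.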
An object of class lmps
    library(mlbench)
    data("BostonHousing")
    # lmps object
    model = lmps(medv ~ ., data = BostonHousing, method = "Lasso", N = 50)
    # A parallelized example
    # lmps object
    model = lmps(medv ~ ., data = BostonHousing, method = "Lasso", N = 50, cores = 2)
It provides a ggplot graphic where the x-axis displays all the explanatory variables, with the confidence intervals of the selected variables shown in green and the coefficient estimates represented as red points.
    ## S3 method for class 'CIps'
    plot(x, ...)
Argument | Description |
---|---|
x | An object of class CIps. |
... | Additional arguments to be passed to the plot function. |
No return value, called for its side effects, which is plotting a graph.
    library(mlbench)
    data("BostonHousing")
    # lmps object
    model = lmps(medv~., data = BostonHousing, method = "Lasso", N = 50)
    # CIps object
    cips = CIps(model, vote = "coef", alpha = 0.05, s.vote_coef = 0.5)
    # plot
    plot(cips)
    # lmps object
    model = lmps(medv~., data = BostonHousing, method = "Lasso", N = 50, cores = 2)
    # CIps object
    cips = CIps(model, vote = "coef", alpha = 0.05, s.vote_coef = 0.5)
    # plot
    plot(cips)
It provides information on the selected variables, the estimated confidence intervals, and the coefficients of these selected variables.
    ## S3 method for class 'CIps'
    print(x, ...)
Argument | Description |
---|---|
x | An object of class CIps. |
... | Additional arguments to be passed to the print function. |
No return value, called for its side effects, which is printing the object to the console.
    library(mlbench)
    data("BostonHousing")
    # lmps object
    model = lmps(medv~., data = BostonHousing, method = "Lasso", N = 50)
    # CIps object
    cips = CIps(model, vote = "coef", alpha = 0.05, s.vote_coef = 0.5)
    # print
    print(cips)
    # lmps object
    model = lmps(medv~., data = BostonHousing, method = "Lasso", N = 50, cores = 2)
    # CIps object
    cips = CIps(model, vote = "coef", alpha = 0.05, s.vote_coef = 0.5)
    # print
    print(cips)
Summary function for our lmps object
    ## S3 method for class 'lmps'
    summary(object, ...)
Argument | Description |
---|---|
object | An object of class lmps. |
... | Other arguments, ignored (for compatibility with the generic). |
This function summarizes the information collected while running the lmps function. It reports how many times the most frequently selected model was chosen across the N splits, as well as the selection frequency of each variable across the splits. It can also report the execution time of the lmps function, which may vary significantly depending on the chosen selection method and the dimensionality of the data.
A summary of our lmps object
    library(mlbench)
    data("BostonHousing")
    # lmps object
    model = lmps(medv~., data = BostonHousing, method = "Lasso", N = 50)
    summary(model)
    # lmps object
    model = lmps(medv~., data = BostonHousing, method = "Lasso", N = 50, cores = 2)
    summary(model)