Title: | Projection Predictive Feature Selection |
---|---|
Description: | Performs projection predictive feature selection for generalized linear models (Piironen, Paasiniemi, and Vehtari, 2020, <doi:10.1214/20-EJS1711>) with or without multilevel or additive terms (Catalina, Bürkner, and Vehtari, 2022, <https://proceedings.mlr.press/v151/catalina22a.html>), for some ordinal and nominal regression models (Weber, Glass, and Vehtari, 2023, <arXiv:2301.01660>), and for many other regression models (using the latent projection by Catalina, Bürkner, and Vehtari, 2021, <arXiv:2109.04702>, which can also be applied to most of the former models). The package is compatible with the 'rstanarm' and 'brms' packages, but other reference models can also be used. See the vignettes and the documentation for more information and examples. |
Authors: | Juho Piironen [aut], Markus Paasiniemi [aut], Alejandro Catalina [aut], Frank Weber [cre, aut], Aki Vehtari [aut], Jonah Gabry [ctb], Marco Colombo [ctb], Paul-Christian Bürkner [ctb], Hamada S. Badr [ctb], Brian Sullivan [ctb], Sölvi Rögnvaldsson [ctb], The LME4 Authors [cph] (see file 'LICENSE' for details), Yann McLatchie [ctb], Juho Timonen [ctb] |
Maintainer: | Frank Weber <[email protected]> |
License: | GPL-3 | file LICENSE |
Version: | 2.8.0 |
Built: | 2024-11-09 06:39:34 UTC |
Source: | CRAN |
The R package projpred performs the projection predictive variable (or
"feature") selection for various regression models. We recommend to read the
README
file (available with enhanced formatting
online) and the main vignette (topic = "projpred"
, but also available
online) before
continuing here.
Throughout the whole package documentation, we use the term "submodel" for
all kinds of candidate models onto which the reference model is projected.
For custom reference models, the candidate models don't need to be actual
submodels of the reference model, but in any case (even for custom
reference models), the candidate models are always actual submodels of the
full formula
used by the search procedure. In this regard, it is correct
to speak of submodels, even in case of a custom reference model.
The following model type abbreviations will be used at multiple places throughout the documentation: GLM (generalized linear model), GLMM (generalized linear multilevel—or "mixed"—model), GAM (generalized additive model), and GAMM (generalized additive multilevel—or "mixed"—model). Note that the term "generalized" includes the Gaussian family as well.
For the projection of the reference model onto a submodel, projpred
currently relies on the following functions as draw-wise divergence
minimizers (in other words, these are the workhorse functions employed by
projpred's internal default div_minimizer
functions, see
init_refmodel()
):
Submodel without multilevel or additive terms:
For the traditional (or latent) projection (or the augmented-data
projection in case of the binomial()
or brms::bernoulli()
family): An
internal C++ function which basically serves the same purpose as lm()
for the gaussian()
family and glm()
for all other families. The
returned object inherits from class subfit
. Possible tuning parameters
for this internal C++ function are: regul
(amount of ridge
regularization; default: 1e-4
), thresh_conv
(convergence threshold;
default: 1e-7
), qa_updates_max
(maximum number of quadratic
approximation updates; default: 100
, but fixed to 1
in case of the
Gaussian family with identity link), ls_iter_max
(maximum number of
line search iterations; default: 30
, but fixed to 1
in case of the
Gaussian family with identity link), normalize
(single logical value
indicating whether to scale the predictors internally with the returned
regression coefficient estimates being back-adjusted appropriately;
default: TRUE
), beta0_init
(single numeric value giving the starting
value for the intercept at centered predictors; default: 0
), and
beta_init
(numeric vector giving the starting values for the regression
coefficients; default: vector of 0
s).
For the augmented-data projection: MASS::polr()
(the returned object
inherits from class polr
) for the brms::cumulative()
family or
rstanarm::stan_polr()
fits, nnet::multinom()
(the returned object
inherits from class multinom
) for the brms::categorical()
family.
Submodel with multilevel but no additive terms:
For the traditional (or latent) projection (or the augmented-data
projection in case of the binomial()
or brms::bernoulli()
family):
lme4::lmer()
(the returned object inherits from class lmerMod
) for
the gaussian()
family, lme4::glmer()
(the returned object inherits
from class glmerMod
) for all other families.
For the augmented-data projection: ordinal::clmm()
(the returned
object inherits from class clmm
) for the brms::cumulative()
family,
mclogit::mblogit()
(the returned object inherits from class mmblogit
)
for the brms::categorical()
family.
Submodel without multilevel but additive terms: mgcv::gam()
(the returned
object inherits from class gam
).
Submodel with multilevel and additive terms: gamm4::gamm4()
(within
projpred, the returned object inherits from class gamm4
).
Setting global option projpred.extra_verbose
to TRUE
will print out which
submodel projpred is currently projecting onto as well as (if method = "forward"
and verbose = TRUE
in varsel()
or cv_varsel()
) which
submodel has been selected at those steps of the forward search for which a
percentage (of the maximum submodel size that the search is run up to) is
printed. In general, however, we cannot recommend setting this global option
to TRUE
for cv_varsel()
with validate_search = TRUE
(simply due to the
amount of information that will be printed, but also due to the progress bar
which will not work as intended anymore).
By default, projpred catches messages and warnings from the draw-wise
divergence minimizers and throws their unique collection after performing all
draw-wise divergence minimizations (i.e., draw-wise projections). This can be
deactivated by setting global option projpred.warn_prj_drawwise
to FALSE
.
Furthermore, by default, projpred checks the convergence of the
draw-wise divergence minimizers and throws a warning if any seem to have not
converged. This warning is thrown after the warning message from global
option projpred.warn_prj_drawwise
(see above) and can be deactivated by
setting global option projpred.check_conv
to FALSE
.
The projection of the reference model onto a submodel can be run in parallel
(across the projected draws). This is powered by the foreach package.
Thus, any parallel (or sequential) backend compatible with foreach can
be used, e.g., the backends from packages doParallel, doMPI, or
doFuture. Using the global option projpred.prll_prj_trigger
, the
number of projected draws below which no parallelization is applied (even if
a parallel backend is registered) can be modified. Such a "trigger" threshold
exists because of the computational overhead of a parallelization which makes
the projection parallelization only useful for a sufficiently large number of
projected draws. By default, the projection parallelization is turned off,
which can also be achieved by supplying Inf
(or NULL
) to option
projpred.prll_prj_trigger
. Note that we cannot recommend the projection
parallelization on Windows because in our experience, the parallelization
overhead is larger there, causing a parallel run to take longer than a
sequential run. Also note that the projection parallelization works well for
submodels which are GLMs (and hence also for the latent projection if the
submodel has no multilevel or additive predictor terms), but for all other
types of submodels, the fitted submodel objects are quite big, which—when
running in parallel—may lead to excessive memory usage which in turn may
crash the R session (on Unix systems, setting an appropriate memory limit via
unix::rlimit_as()
may avoid crashing the whole machine). Thus, we currently
cannot recommend parallelizing projections onto submodels which are GLMs (in
this context, the latent projection onto a submodel without multilevel and
without additive terms may be regarded as a projection onto a submodel which
is a GLM). However, for cv_varsel()
, there is also a CV parallelization
(i.e., a parallelization of projpred's cross-validation) which can be
activated via argument parallel
.
In case of multilevel models, projpred offers two global options for
"integrating out" group-level effects: projpred.mlvl_pred_new
and
projpred.mlvl_proj_ref_new
. When setting projpred.mlvl_pred_new
to TRUE
(default is FALSE
), then at
prediction time, projpred will treat group levels existing in the
training data as new group levels, implying that their group-level effects
are drawn randomly from a (multivariate) Gaussian distribution. This concerns
both, the reference model and the (i.e., any) submodel. Furthermore, setting
projpred.mlvl_pred_new
to TRUE
causes as.matrix.projection()
and
as_draws_matrix.projection()
to omit the projected group-level effects (for
the group levels from the original dataset). When setting
projpred.mlvl_proj_ref_new
to TRUE
(default is FALSE
), then at
projection time, the reference model's fitted values (that the submodels
fit to) will be computed by treating the group levels from the original
dataset as new group levels, implying that their group-level effects will
be drawn randomly from a (multivariate) Gaussian distribution (as long as the
reference model is a multilevel model, which—for custom reference
models—does not need to be the case). This also affects the latent response
values for a latent projection correspondingly. Setting
projpred.mlvl_pred_new
to TRUE
makes sense, e.g., when the prediction
task is such that any group level will be treated as a new one. Typically,
setting projpred.mlvl_proj_ref_new
to TRUE
only makes sense when
projpred.mlvl_pred_new
is already set to TRUE
. In that case, the default
of FALSE
for projpred.mlvl_proj_ref_new
ensures that at projection time,
the submodels fit to the best possible fitted values from the reference
model, and setting projpred.mlvl_proj_ref_new
to TRUE
would make sense if
the group-level effects should be integrated out completely.
By setting the global option projpred.run_gc
to TRUE
, projpred will
call gc()
at some places (e.g., after each size that the forward search
passes through) to free up some memory. These gc()
calls are not always
necessary to reduce the peak memory usage, but they add runtime (hence the
default of FALSE
for that global option).
Most examples are not executed when called via example()
. To execute them,
their code has to be copied and pasted manually to the console.
init_refmodel()
, get_refmodel()
For setting up an object
containing information about the reference model, the submodels, and how
the projection should be carried out. Explicit calls to init_refmodel()
and get_refmodel()
are only rarely needed.
varsel()
, cv_varsel()
For running the search part and the evaluation part for a projection predictive variable selection, possibly with cross-validation (CV).
summary.vsel()
, print.vsel()
, plot.vsel()
,
suggest_size.vsel()
, ranking()
, cv_proportions()
,
plot.cv_proportions()
, performances()
For post-processing the results
from varsel()
and cv_varsel()
.
project()
For projecting the reference model onto submodel(s). Typically, this follows the variable selection, but it can also be applied directly (without a variable selection).
as.matrix.projection()
and as_draws_matrix.projection()
For extracting projected parameter draws.
proj_linpred()
, proj_predict()
For making predictions from a submodel (after projecting the reference model onto it).
Maintainer: Frank Weber [email protected]
Authors:
Juho Piironen [email protected]
Markus Paasiniemi
Alejandro Catalina [email protected]
Aki Vehtari
Other contributors:
Jonah Gabry [contributor]
Marco Colombo [contributor]
Paul-Christian Bürkner [contributor]
Hamada S. Badr [contributor]
Brian Sullivan [contributor]
Sölvi Rögnvaldsson [contributor]
The LME4 Authors (see file 'LICENSE' for details) [copyright holder]
Yann McLatchie [contributor]
Juho Timonen [contributor]
Useful links:
Report bugs at https://github.com/stan-dev/projpred/issues/
draws_matrix
(see package
posterior)These are the posterior::as_draws()
and posterior::as_draws_matrix()
methods for projection
objects (returned by project()
, possibly as
elements of a list
). They extract the projected parameter draws and return
them as a draws_matrix
. In case of different (i.e., nonconstant) weights
for the projected draws, a draws_matrix
allows for a safer handling of
these weights (safer in contrast to the matrix returned by
as.matrix.projection()
), in particular by providing the natural input for
posterior::resample_draws()
(see section "Examples" below).
## S3 method for class 'projection' as_draws_matrix(x, ...) ## S3 method for class 'projection' as_draws(x, ...)
## S3 method for class 'projection' as_draws_matrix(x, ...) ## S3 method for class 'projection' as_draws(x, ...)
x |
An object of class |
... |
Arguments passed to |
In case of the augmented-data projection for a multilevel submodel
of a brms::categorical()
reference model, the multilevel parameters (and
therefore also their names) slightly differ from those in the brms
reference model fit (see section "Augmented-data projection" in
extend_family()
's documentation).
An
draws_matrix
(see
posterior::draws_matrix()
) of projected draws, with
denoting the number of projected draws and
the number of parameters. If the projected draws have nonconstant
weights,
posterior::weight_draws()
is applied internally.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `nclusters`, but only for illustrative purposes; this is not # recommended in general): prj <- project(fit, predictor_terms = c("X1", "X3", "X5"), nclusters = 5, seed = 9182) # Applying the posterior::as_draws_matrix() generic to the output of # project() dispatches to the projpred::as_draws_matrix.projection() # method: prj_draws <- posterior::as_draws_matrix(prj) # Resample the projected draws according to their weights: set.seed(3456) prj_draws_resampled <- posterior::resample_draws(prj_draws, ndraws = 1000) # The values from the following two objects should be the same (in general, # this only holds approximately): print(proportions(table(rownames(prj_draws_resampled)))) print(weights(prj_draws)) # Treat the resampled draws like ordinary draws, e.g., summarize them: print(posterior::summarize_draws( prj_draws_resampled, "median", "mad", function(x) quantile(x, probs = c(0.025, 0.975)) )) # Or visualize them using the `bayesplot` package: if (requireNamespace("bayesplot", quietly = TRUE)) { print(bayesplot::mcmc_intervals(prj_draws_resampled)) }
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `nclusters`, but only for illustrative purposes; this is not # recommended in general): prj <- project(fit, predictor_terms = c("X1", "X3", "X5"), nclusters = 5, seed = 9182) # Applying the posterior::as_draws_matrix() generic to the output of # project() dispatches to the projpred::as_draws_matrix.projection() # method: prj_draws <- posterior::as_draws_matrix(prj) # Resample the projected draws according to their weights: set.seed(3456) prj_draws_resampled <- posterior::resample_draws(prj_draws, ndraws = 1000) # The values from the following two objects should be the same (in general, # this only holds approximately): print(proportions(table(rownames(prj_draws_resampled)))) print(weights(prj_draws)) # Treat the resampled draws like ordinary draws, e.g., summarize them: print(posterior::summarize_draws( prj_draws_resampled, "median", "mad", function(x) quantile(x, probs = c(0.025, 0.975)) )) # Or visualize them using the `bayesplot` package: if (requireNamespace("bayesplot", quietly = TRUE)) { print(bayesplot::mcmc_intervals(prj_draws_resampled)) }
This is the as.matrix()
method for projection
objects (returned by
project()
, possibly as elements of a list
). It extracts the projected
parameter draws and returns them as a matrix. In case of different (i.e.,
nonconstant) weights for the projected draws, see
as_draws_matrix.projection()
for a better solution.
## S3 method for class 'projection' as.matrix(x, nm_scheme = NULL, allow_nonconst_wdraws_prj = FALSE, ...)
## S3 method for class 'projection' as.matrix(x, nm_scheme = NULL, allow_nonconst_wdraws_prj = FALSE, ...)
x |
An object of class |
nm_scheme |
The naming scheme for the columns of the output matrix.
Either |
allow_nonconst_wdraws_prj |
A single logical value indicating whether to
allow projected draws with different (i.e., nonconstant) weights ( |
... |
Currently ignored. |
In case of the augmented-data projection for a multilevel submodel
of a brms::categorical()
reference model, the multilevel parameters (and
therefore also their names) slightly differ from those in the brms
reference model fit (see section "Augmented-data projection" in
extend_family()
's documentation).
An matrix of projected
draws, with
denoting the number of projected
draws and
the number of parameters. If
allow_nonconst_wdraws_prj
is set to TRUE
, the weights of the projected draws are stored in an
attribute wdraws_prj
. (If allow_nonconst_wdraws_prj
is FALSE
,
projected draws with nonconstant weights cause an error.)
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `ndraws`, but only for the sake of speed in this example; this # is not recommended in general): prj <- project(fit, predictor_terms = c("X1", "X3", "X5"), ndraws = 21, seed = 9182) # Applying the as.matrix() generic to the output of project() dispatches to # the projpred::as.matrix.projection() method: prj_mat <- as.matrix(prj) # Since the draws have all the same weight here, we can treat them like # ordinary MCMC draws, e.g., we can summarize them using the `posterior` # package: if (requireNamespace("posterior", quietly = TRUE)) { print(posterior::summarize_draws( posterior::as_draws_matrix(prj_mat), "median", "mad", function(x) quantile(x, probs = c(0.025, 0.975)) )) } # Or visualize them using the `bayesplot` package: if (requireNamespace("bayesplot", quietly = TRUE)) { print(bayesplot::mcmc_intervals(prj_mat)) }
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `ndraws`, but only for the sake of speed in this example; this # is not recommended in general): prj <- project(fit, predictor_terms = c("X1", "X3", "X5"), ndraws = 21, seed = 9182) # Applying the as.matrix() generic to the output of project() dispatches to # the projpred::as.matrix.projection() method: prj_mat <- as.matrix(prj) # Since the draws have all the same weight here, we can treat them like # ordinary MCMC draws, e.g., we can summarize them using the `posterior` # package: if (requireNamespace("posterior", quietly = TRUE)) { print(posterior::summarize_draws( posterior::as_draws_matrix(prj_mat), "median", "mad", function(x) quantile(x, probs = c(0.025, 0.975)) )) } # Or visualize them using the `bayesplot` package: if (requireNamespace("bayesplot", quietly = TRUE)) { print(bayesplot::mcmc_intervals(prj_mat)) }
This is the function which has to be supplied to extend_family()
's argument
augdat_ilink
in case of the augmented-data projection for the binomial()
family.
augdat_ilink_binom(eta_arr, link = "logit")
augdat_ilink_binom(eta_arr, link = "logit")
eta_arr |
An array as described in section "Augmented-data projection"
of |
link |
The same as argument |
An array as described in section "Augmented-data projection" of
extend_family()
's documentation.
This is the function which has to be supplied to extend_family()
's argument
augdat_link
in case of the augmented-data projection for the binomial()
family.
augdat_link_binom(prb_arr, link = "logit")
augdat_link_binom(prb_arr, link = "logit")
prb_arr |
An array as described in section "Augmented-data projection"
of |
link |
The same as argument |
An array as described in section "Augmented-data projection" of
extend_family()
's documentation.
Sometimes there can be terms in a formula that refer to a matrix instead of a single predictor. This function breaks up the matrix term into individual predictors to handle separately, as that is probably the intention of the user.
break_up_matrix_term(formula, data)
break_up_matrix_term(formula, data)
formula |
A |
data |
The original |
A list
containing the expanded formula
and the expanded
data.frame
.
This function aggregates parameter draws that have been clustered
into
clusters by averaging across the draws that
belong to the same cluster. This averaging can be done in a weighted fashion.
cl_agg( draws, cl = seq_len(nrow(draws)), wdraws = rep(1, nrow(draws)), eps_wdraws = 0 )
cl_agg( draws, cl = seq_len(nrow(draws)), wdraws = rep(1, nrow(draws)), eps_wdraws = 0 )
draws |
An |
cl |
A numeric vector of length |
wdraws |
A numeric vector of length |
eps_wdraws |
A positive numeric value (typically small) which will be
used to improve numerical stability: The weights of the draws within each
cluster are multiplied by |
An matrix of aggregated
parameter draws.
set.seed(323) S <- 100L P <- 3L draws <- matrix(rnorm(S * P), nrow = S, ncol = P) # Clustering example: S_cl <- 10L cl_draws <- sample.int(S_cl, size = S, replace = TRUE) draws_cl <- cl_agg(draws, cl = cl_draws) # Clustering example with nonconstant `wdraws`: w_draws <- rgamma(S, shape = 4) draws_cl <- cl_agg(draws, cl = cl_draws, wdraws = w_draws) # Thinning example (implying constant `wdraws`): S_th <- 50L idxs_thin <- round(seq(1, S, length.out = S_th)) th_draws <- rep(NA, S) th_draws[idxs_thin] <- seq_len(S_th) draws_th <- cl_agg(draws, cl = th_draws)
set.seed(323) S <- 100L P <- 3L draws <- matrix(rnorm(S * P), nrow = S, ncol = P) # Clustering example: S_cl <- 10L cl_draws <- sample.int(S_cl, size = S, replace = TRUE) draws_cl <- cl_agg(draws, cl = cl_draws) # Clustering example with nonconstant `wdraws`: w_draws <- rgamma(S, shape = 4) draws_cl <- cl_agg(draws, cl = cl_draws, wdraws = w_draws) # Thinning example (implying constant `wdraws`): S_th <- 50L idxs_thin <- round(seq(1, S, length.out = S_th)) th_draws <- rep(NA, S) th_draws[idxs_thin] <- seq_len(S_th) draws_th <- cl_agg(draws, cl = th_draws)
Calculates the ranking proportions from the fold-wise predictor rankings in
a cross-validation (CV) with fold-wise searches. For a given predictor
and a given submodel size
, the ranking proportion is the
proportion of CV folds which have predictor
at position
of
their predictor ranking. While these ranking proportions are helpful for
investigating variability in the predictor ranking, they can also be
cumulated across submodel sizes. The cumulated ranking proportions are more
helpful when it comes to model selection.
cv_proportions(object, ...) ## S3 method for class 'ranking' cv_proportions(object, cumulate = FALSE, ...) ## S3 method for class 'vsel' cv_proportions(object, ...)
cv_proportions(object, ...) ## S3 method for class 'ranking' cv_proportions(object, cumulate = FALSE, ...) ## S3 method for class 'vsel' cv_proportions(object, ...)
object |
For |
... |
For |
cumulate |
A single logical value indicating whether the ranking
proportions should be cumulated across increasing submodel sizes ( |
A numeric matrix containing the ranking proportions. This matrix has
nterms_max
rows and nterms_max
columns, with nterms_max
as specified
in the (possibly implicit) ranking()
call. The rows correspond to the
submodel sizes and the columns to the predictor terms (sorted according to
the full-data predictor ranking). If cumulate
is FALSE
, then the
returned matrix is of class cv_proportions
. If cumulate
is TRUE
, then
the returned matrix is of classes cv_proportions_cumul
and
cv_proportions
(in this order).
Note that if cumulate
is FALSE
, then the values in the returned matrix
only need to sum to 1 (column-wise and row-wise) if nterms_max
(see
above) is equal to the full model size. Likewise, if cumulate
is TRUE
,
then the value 1
only needs to occur in each column of the returned
matrix if nterms_max
is equal to the full model size.
The cv_proportions()
function is only applicable if the ranking
object
includes fold-wise predictor rankings (i.e., if it is based on a vsel
object created by cv_varsel()
with validate_search = TRUE
). If the
ranking
object contains only a full-data predictor ranking (i.e., if it
is based on a vsel
object created by varsel()
or by cv_varsel()
, but
the latter with validate_search = FALSE
), then an error is thrown because
in that case, there are no fold-wise predictor rankings from which to
calculate ranking proportions.
# For an example, see `?plot.cv_proportions`.
# For an example, see `?plot.cv_proportions`.
Run the search part and the evaluation part for a projection predictive
variable selection. The search part determines the predictor ranking (also
known as solution path), i.e., the best submodel for each submodel size
(number of predictor terms). The evaluation part determines the predictive
performance of the submodels along the predictor ranking. In contrast to
varsel()
, cv_varsel()
performs a cross-validation (CV) by running the
search part with the training data of each CV fold separately (an exception
is explained in section "Note" below) and by running the evaluation part on
the corresponding test set of each CV fold. A special method is
cv_varsel.vsel()
because it re-uses the search results from an earlier
cv_varsel()
(or varsel()
) run, as illustrated in the main vignette.
cv_varsel(object, ...) ## Default S3 method: cv_varsel(object, ...) ## S3 method for class 'vsel' cv_varsel( object, cv_method = object$cv_method %||% "LOO", nloo = object$nloo, K = object$K %||% if (!inherits(object, "datafit")) 5 else 10, cvfits = object$cvfits, validate_search = object$validate_search %||% TRUE, ... ) ## S3 method for class 'refmodel' cv_varsel( object, method = "forward", cv_method = if (!inherits(object, "datafit")) "LOO" else "kfold", ndraws = NULL, nclusters = 20, ndraws_pred = 400, nclusters_pred = NULL, refit_prj = !inherits(object, "datafit"), nterms_max = NULL, penalty = NULL, verbose = TRUE, nloo = object$nobs, K = if (!inherits(object, "datafit")) 5 else 10, cvfits = object$cvfits, search_control = NULL, lambda_min_ratio = 1e-05, nlambda = 150, thresh = 1e-06, validate_search = TRUE, seed = NA, search_terms = NULL, search_out = NULL, parallel = getOption("projpred.prll_cv", FALSE), ... )
cv_varsel(object, ...) ## Default S3 method: cv_varsel(object, ...) ## S3 method for class 'vsel' cv_varsel( object, cv_method = object$cv_method %||% "LOO", nloo = object$nloo, K = object$K %||% if (!inherits(object, "datafit")) 5 else 10, cvfits = object$cvfits, validate_search = object$validate_search %||% TRUE, ... ) ## S3 method for class 'refmodel' cv_varsel( object, method = "forward", cv_method = if (!inherits(object, "datafit")) "LOO" else "kfold", ndraws = NULL, nclusters = 20, ndraws_pred = 400, nclusters_pred = NULL, refit_prj = !inherits(object, "datafit"), nterms_max = NULL, penalty = NULL, verbose = TRUE, nloo = object$nobs, K = if (!inherits(object, "datafit")) 5 else 10, cvfits = object$cvfits, search_control = NULL, lambda_min_ratio = 1e-05, nlambda = 150, thresh = 1e-06, validate_search = TRUE, seed = NA, search_terms = NULL, search_out = NULL, parallel = getOption("projpred.prll_cv", FALSE), ... )
object |
An object of class |
... |
For |
cv_method |
The CV method, either |
nloo |
Caution: Still experimental. Only relevant if |
K |
Only relevant if |
cvfits |
Only relevant if |
validate_search |
A single logical value indicating whether to
cross-validate also the search part, i.e., whether to run the search
separately for each CV fold ( |
method |
The method for the search part. Possible options are
|
ndraws |
Number of posterior draws used in the search part. Ignored if
|
nclusters |
Number of clusters of posterior draws used in the search
part. Ignored in case of L1 search (because L1 search always uses a single
cluster). For the meaning of |
ndraws_pred |
Only relevant if |
nclusters_pred |
Only relevant if |
refit_prj |
For the evaluation part, should the submodels along the
predictor ranking be fitted again ( |
nterms_max |
Maximum submodel size (number of predictor terms) up to
which the search is continued. If |
penalty |
Only relevant for L1 search. A numeric vector determining the
relative penalties or costs for the predictors. A value of |
verbose |
A single logical value indicating whether to print out additional information during the computations. |
search_control |
A
|
lambda_min_ratio |
Deprecated (please use |
nlambda |
Deprecated (please use |
thresh |
Deprecated (please use |
seed |
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument |
search_terms |
Only relevant for forward search. A custom character
vector of predictor term blocks to consider for the search. Section
"Details" below describes more precisely what "predictor term block" means.
The intercept ( |
search_out |
Intended for internal use. |
parallel |
A single logical value indicating whether to run costly parts
of the CV in parallel ( |
Arguments ndraws
, nclusters
, nclusters_pred
, and ndraws_pred
are automatically truncated at the number of posterior draws in the
reference model (which is 1
for datafit
s). Using less draws or clusters
in ndraws
, nclusters
, nclusters_pred
, or ndraws_pred
than posterior
draws in the reference model may result in slightly inaccurate projection
performance. Increasing these arguments affects the computation time
linearly.
For argument method
, there are some restrictions: For a reference model
with multilevel or additive formula terms or a reference model set up for
the augmented-data projection, only the forward search is available.
Furthermore, argument search_terms
requires a forward search to take
effect.
L1 search is faster than forward search, but forward search may be more accurate. Furthermore, forward search may find a sparser model with comparable performance to that found by L1 search, but it may also start overfitting when more predictors are added.
An L1 search may select an interaction term before all involved lower-order interaction terms (including main-effect terms) have been selected. In projpred versions > 2.6.0, the resulting predictor ranking is automatically modified so that the lower-order interaction terms come before this interaction term, but if this is conceptually undesired, choose the forward search instead.
The elements of the search_terms
character vector don't need to be
individual predictor terms. Instead, they can be building blocks consisting
of several predictor terms connected by the +
symbol. To understand how
these building blocks work, it is important to know how projpred's
forward search works: It starts with an empty vector chosen
which will
later contain already selected predictor terms. Then, the search iterates
over model sizes (with
denoting the maximum submodel size, not counting the intercept). The
candidate models at model size
are constructed from those elements
from
search_terms
which yield model size when combined with the
chosen
predictor terms. Note that sometimes, there may be no candidate
models for model size . Also note that internally,
search_terms
is
expanded to include the intercept ("1"
), so the first step of the search
(model size 0) always consists of the intercept-only model as the only
candidate.
As a search_terms
example, consider a reference model with formula y ~ x1 + x2 + x3
. Then, to ensure that x1
is always included in the
candidate models, specify search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")
(or, in a simpler way that leads to the same results,
search_terms = c("x1", "x1 + x2", "x1 + x3")
, for which helper function
force_search_terms()
exists). This search would start with y ~ 1
as the
only candidate at model size 0. At model size 1, y ~ x1
would be the only
candidate. At model size 2, y ~ x1 + x2
and y ~ x1 + x3
would be the
two candidates. At the last model size of 3, y ~ x1 + x2 + x3
would be
the only candidate. As another example, to exclude x1
from the search,
specify search_terms = c("x2", "x3", "x2 + x3")
(or, in a simpler way
that leads to the same results, search_terms = c("x2", "x3")
).
An object of class vsel
. The elements of this object are not meant
to be accessed directly but instead via helper functions (see the main
vignette and projpred-package).
If validate_search
is FALSE
, the search is not included in the CV
so that only a single full-data search is run.
For PSIS-LOO CV, projpred calls loo::psis()
(or, exceptionally,
loo::sis()
, see below) with r_eff = NA
. This is only a problem if there
was extreme autocorrelation between the MCMC iterations when the reference
model was built. In those cases however, the reference model should not
have been used anyway, so we don't expect projpred's r_eff = NA
to
be a problem.
PSIS cannot be used if the draws have different (i.e., nonconstant) weights or if the number of draws is too small. In such cases, projpred resorts to standard importance sampling (SIS) and throws a warning about this. Throughout the documentation, the term "PSIS" is used even though in fact, projpred resorts to SIS in these special cases.
With parallel = TRUE
, costly parts of projpred's CV are run in
parallel. Costly parts are the fold-wise searches and performance
evaluations in case of validate_search = TRUE
. (Note that in case of
-fold CV, the
reference model refits are not affected by
argument
parallel
; only projpred's CV is affected.) The
parallelization is powered by the foreach package. Thus, any parallel
(or sequential) backend compatible with foreach can be used, e.g.,
the backends from packages doParallel, doMPI, or
doFuture. For GLMs, this CV parallelization should work reliably, but
for other models (such as GLMMs), it may lead to excessive memory usage
which in turn may crash the R session (on Unix systems, setting an
appropriate memory limit via unix::rlimit_as()
may avoid crashing the
whole machine). However, the problem of excessive memory usage is less
pronounced for the CV parallelization than for the projection
parallelization described in projpred-package. In that regard, the CV
parallelization is recommended over the projection parallelization.
Magnusson, Måns, Michael Andersen, Johan Jonasson, and Aki Vehtari. 2019. "Bayesian Leave-One-Out Cross-Validation for Large Data." In Proceedings of the 36th International Conference on Machine Learning, edited by Kamalika Chaudhuri and Ruslan Salakhutdinov, 97:4244–53. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v97/magnusson19a.html.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. "Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC." Statistics and Computing 27 (5): 1413–32. doi:10.1007/s11222-016-9696-4.
Vehtari, Aki, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. 2022. "Pareto Smoothed Importance Sampling." arXiv. doi:10.48550/arXiv.1507.02646.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 1000, refresh = 0, seed = 9876 ) # Run cv_varsel() (with L1 search and small values for `K`, `nterms_max`, and # `nclusters_pred`, but only for the sake of speed in this example; this is # not recommended in general): cvvs <- cv_varsel(fit, method = "L1", cv_method = "kfold", K = 2, nterms_max = 3, nclusters_pred = 10, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible post-processing functions.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 1000, refresh = 0, seed = 9876 ) # Run cv_varsel() (with L1 search and small values for `K`, `nterms_max`, and # `nclusters_pred`, but only for the sake of speed in this example; this is # not recommended in general): cvvs <- cv_varsel(fit, method = "L1", cv_method = "kfold", K = 2, nterms_max = 3, nclusters_pred = 10, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible post-processing functions.
These are helper functions to create cross-validation (CV) folds, i.e., to
split up the indices from 1 to n
into K
subsets ("folds") for
-fold CV. These functions are potentially useful when creating the
input for arguments
cvfits
and cvfun
of init_refmodel()
(or argument
cvfits
of cv_varsel.refmodel()
). Function cvfolds()
is deprecated;
please use cv_folds()
instead (apart from the name, they are the same). The
return value of cv_folds()
and cv_ids()
is different, see below for
details.
cv_folds(n, K, seed = NA) cvfolds(n, K, seed = NA) cv_ids(n, K, out = c("foldwise", "indices"), seed = NA)
cv_folds(n, K, seed = NA) cvfolds(n, K, seed = NA) cv_ids(n, K, out = c("foldwise", "indices"), seed = NA)
n |
Number of observations. |
K |
Number of folds. Must be at least 2 and not exceed |
seed |
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument |
out |
Format of the output, either |
cv_folds()
returns a vector of length n
such that each element is
an integer between 1 and K
denoting which fold the corresponding data
point belongs to. The return value of cv_ids()
depends on the out
argument. If out = "foldwise"
, the return value is a list
with K
elements, each being a list
with elements tr
and ts
giving the
training and test indices, respectively, for the corresponding fold. If
out = "indices"
, the return value is a list
with elements tr
and ts
each being a list
with K
elements giving the training and test indices,
respectively, for each fold.
n <- 100 set.seed(1234) y <- rnorm(n) cv <- cv_ids(n, K = 5) # Mean within the test set of each fold: cvmeans <- sapply(cv, function(fold) mean(y[fold$ts]))
n <- 100 set.seed(1234) y <- rnorm(n) cv <- cv_ids(n, K = 5) # Mean within the test set of each fold: cvmeans <- sapply(cv, function(fold) mean(y[fold$ts]))
Binomial toy example
df_binom
df_binom
A simulated classification dataset containing 100 observations.
response, 0 or 1.
predictors, 30 in total.
https://web.stanford.edu/~hastie/glmnet/glmnetData/BNExample.RData
Gaussian toy example
df_gaussian
df_gaussian
A simulated regression dataset containing 100 observations.
response, real-valued.
predictors, 20 in total. Mean and SD are approximately 0 and 1, respectively.
https://web.stanford.edu/~hastie/glmnet/glmnetData/QSExample.RData
This function adds some internally required elements to an object of class
family
(see, e.g., family()
). It is called internally by
init_refmodel()
, so you will rarely need to call it yourself.
extend_family( family, latent = FALSE, latent_y_unqs = NULL, latent_ilink = NULL, latent_ll_oscale = NULL, latent_ppd_oscale = NULL, augdat_y_unqs = NULL, augdat_link = NULL, augdat_ilink = NULL, augdat_args_link = list(), augdat_args_ilink = list(), ... )
extend_family( family, latent = FALSE, latent_y_unqs = NULL, latent_ilink = NULL, latent_ll_oscale = NULL, latent_ppd_oscale = NULL, augdat_y_unqs = NULL, augdat_link = NULL, augdat_ilink = NULL, augdat_args_link = list(), augdat_args_ilink = list(), ... )
family |
An object of class |
latent |
A single logical value indicating whether to use the latent
projection ( |
latent_y_unqs |
Only relevant for a latent projection where the original
response space has finite support (i.e., the original response values may
be regarded as categories), in which case this needs to be the character
vector of unique response values (which will be assigned to |
latent_ilink |
Only relevant for the latent projection, in which case
this needs to be the inverse-link function. If the original response family
was the |
latent_ll_oscale |
Only relevant for the latent projection, in which
case this needs to be the function computing response-scale (not
latent-scale) log-likelihood values. If |
latent_ppd_oscale |
Only relevant for the latent projection, in which
case this needs to be the function sampling response values given latent
predictors that have been transformed to response scale using
|
augdat_y_unqs |
Only relevant for augmented-data projection, in which
case this needs to be the character vector of unique response values (which
will be assigned to |
augdat_link |
Only relevant for augmented-data projection, in which case
this needs to be the link function. Use |
augdat_ilink |
Only relevant for augmented-data projection, in which
case this needs to be the inverse-link function. Use |
augdat_args_link |
Only relevant for augmented-data projection, in which
case this may be a named |
augdat_args_ilink |
Only relevant for augmented-data projection, in
which case this may be a named |
... |
Ignored (exists only to swallow up further arguments which might be passed to this function). |
In the following, ,
,
,
, and
from help topic refmodel-init-get are used.
Note that
does not necessarily denote the number of original
observations; it can also refer to new observations. Furthermore, let
denote either
or
,
whichever is appropriate in the context where it is used.
The family
object extended in the way needed by projpred.
As their first input, the functions supplied to arguments augdat_link
and
augdat_ilink
have to accept:
For augdat_link
: an array containing the probabilities for the response categories. The
order of the response categories is the same as in
family$cats
(see
argument augdat_y_unqs
).
For augdat_ilink
: an array containing the linear predictors.
The return value of these functions needs to be:
For augdat_link
: an array containing the linear predictors.
For augdat_ilink
: an array containing the probabilities for the response categories. The
order of the response categories has to be the same as in
family$cats
(see
argument augdat_y_unqs
).
For the augmented-data projection, the response vector resulting from
extract_model_data
(see init_refmodel()
) is coerced to a factor
(using
as.factor()
) at multiple places throughout this package. Inside of
init_refmodel()
, the levels of this factor
have to be identical to
family$cats
(after applying extend_family()
inside of
init_refmodel()
). Everywhere else, these levels have to be a subset of
<refmodel>$family$cats
(where <refmodel>
is an object resulting from
init_refmodel()
). See argument augdat_y_unqs
for how to control
family$cats
.
For ordinal brms families, be aware that the submodels (onto which the reference model is projected) currently have the following restrictions:
The discrimination parameter disc
is not supported (i.e., it is a
constant with value 1).
The thresholds are "flexible"
(see brms::brmsfamily()
).
The thresholds do not vary across the levels of a factor
-like variable
(see argument gr
of brms::resp_thres()
).
The "probit_approx"
link is replaced by "probit"
.
For the brms::categorical()
family, be aware that:
For multilevel submodels, the group-level effects are allowed to be correlated between different response categories.
For multilevel submodels, mclogit versions < 0.9.4 may throw the
error 'a' (<number> x 1) must be square
. Updating mclogit to a
version >= 0.9.4 should fix this.
The function supplied to argument latent_ilink
needs to have the prototype
latent_ilink(lpreds, cl_ref, wdraws_ref = rep(1, length(cl_ref)))
where:
lpreds
accepts an matrix containing the linear
predictors.
cl_ref
accepts a numeric vector of length ,
containing projpred's internal cluster indices for these draws.
wdraws_ref
accepts a numeric vector of length
, containing weights for these draws. These
weights should be treated as not being normalized (i.e., they don't
necessarily sum to
1
).
The return value of latent_ilink
needs to contain the linear predictors
transformed to the original response space, with the following structure:
If is.null(family$cats)
(after taking latent_y_unqs
into account): an
matrix.
If !is.null(family$cats)
(after taking latent_y_unqs
into account): an
array. In that case,
latent_ilink
needs to return probabilities (for the response categories
given in family$cats
, after taking latent_y_unqs
into account).
The function supplied to argument latent_ll_oscale
needs to have the
prototype
latent_ll_oscale(ilpreds, y_oscale, wobs = rep(1, length(y_oscale)), cl_ref, wdraws_ref = rep(1, length(cl_ref)))
where:
ilpreds
accepts the return value from latent_ilink
.
y_oscale
accepts a vector of length containing response values on
the original response scale.
wobs
accepts a numeric vector of length containing observation
weights.
cl_ref
accepts the same input as argument cl_ref
of latent_ilink
.
wdraws_ref
accepts the same input as argument wdraws_ref
of
latent_ilink
.
The return value of latent_ll_oscale
needs to be an
matrix containing the response-scale (not latent-scale) log-likelihood values
for the
observations from its inputs.
The function supplied to argument latent_ppd_oscale
needs to have the
prototype
latent_ppd_oscale(ilpreds_resamp, wobs, cl_ref, wdraws_ref = rep(1, length(cl_ref)), idxs_prjdraws)
where:
ilpreds_resamp
accepts the return value from latent_ilink
, but possibly
with resampled (clustered) draws (see argument nresample_clusters
of
proj_predict()
).
wobs
accepts a numeric vector of length containing observation
weights.
cl_ref
accepts the same input as argument cl_ref
of latent_ilink
.
wdraws_ref
accepts the same input as argument wdraws_ref
of
latent_ilink
.
idxs_prjdraws
accepts a numeric vector of length dim(ilpreds_resamp)[1]
containing the resampled indices of the projected draws (i.e., these indices
are values from the set where
ilpreds
denotes the return value of
latent_ilink
).
The return value of latent_ppd_oscale
needs to be a
matrix containing the response-scale (not latent-scale) draws from the
posterior(-projection) predictive distributions for the
observations
from its inputs.
If the bodies of these three functions involve parameter draws from the
reference model which have not been projected (e.g., for latent_ilink
, the
thresholds in an ordinal model), cl_agg()
is provided as a helper function
for aggregating these reference model draws in the same way as the draws have
been aggregated for the first argument of these functions (e.g., lpreds
in
case of latent_ilink
).
In fact, the weights passed to argument wdraws_ref
are nonconstant only in
case of cv_varsel()
with cv_method = "LOO"
and validate_search = TRUE
.
In that case, the weights passed to this argument are the PSIS-LOO CV weights
for one observation. Note that although argument wdraws_ref
has the suffix
_ref
, wdraws_ref
does not necessarily obtain weights for the initial
reference model's posterior draws: In case of cv_varsel()
with cv_method = "kfold"
, these weights may refer to one of the reference model
refits (but in that case, they are constant anyway).
If family$cats
is not NULL
(after taking latent_y_unqs
into account),
then the response vector resulting from extract_model_data
(see
init_refmodel()
) is coerced to a factor
(using as.factor()
) at multiple
places throughout this package. Inside of init_refmodel()
, the levels of
this factor
have to be identical to family$cats
(after applying
extend_family()
inside of init_refmodel()
). Everywhere else, these levels
have to be a subset of <refmodel>$family$cats
(where <refmodel>
is an
object resulting from init_refmodel()
).
Family objects not in the set of default family
objects.
Student_t(link = "identity", nu = 3)
Student_t(link = "identity", nu = 3)
link |
Name of the link function. In contrast to the default |
nu |
Degrees of freedom for the Student- |
A family object analogous to those described in family
.
Support for the Student_t()
family is still experimental.
A helper function to construct the input for argument search_terms
of
varsel()
or cv_varsel()
if certain predictor terms should be forced to be
selected first whereas other predictor terms are optional (i.e., they are
subject to the variable selection, but only after the inclusion of the
"forced" terms).
force_search_terms(forced_terms, optional_terms)
force_search_terms(forced_terms, optional_terms)
forced_terms |
A character vector of predictor terms that should be selected first. |
optional_terms |
A character vector of predictor terms that should be subject to the variable selection after the inclusion of the "forced" terms. |
A character vector that may be used as input for argument
search_terms
of varsel()
or cv_varsel()
.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # We will force X1 and X2 to be selected first: search_terms_forced <- force_search_terms( forced_terms = paste0("X", 1:2), optional_terms = paste0("X", 3:5) ) # Run varsel() (here without cross-validation and with small values for # `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, nclusters = 5, nclusters_pred = 10, search_terms = search_terms_forced, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible post-processing functions.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # We will force X1 and X2 to be selected first: search_terms_forced <- force_search_terms( forced_terms = paste0("X", 1:2), optional_terms = paste0("X", 3:5) ) # Run varsel() (here without cross-validation and with small values for # `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, nclusters = 5, nclusters_pred = 10, search_terms = search_terms_forced, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible post-processing functions.
The mesquite bushes yields dataset from Gelman and Hill (2006) (http://www.stat.columbia.edu/~gelman/arm/).
mesquite
mesquite
The response variable is the total weight (in grams) of photosynthetic material as derived from actual harvesting of the bush. The predictor variables are:
diameter of the canopy (the leafy area of the bush) in meters, measured along the longer axis of the bush.
canopy diameter measured along the shorter axis.
height of the canopy.
total height of the bush.
plant unit density (# of primary stems per plant unit).
group of measurements (0 for the first group, 1 for the second group).
http://www.stat.columbia.edu/~gelman/arm/examples/mesquite/mesquite.dat
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, UK: Cambridge University Press. doi:10.1017/CBO9780511790942.
Retrieves the predictive performance summaries after running varsel()
or
cv_varsel()
. These summaries are computed by summary.vsel()
, so the main
method of performances()
is performances.vselsummary()
(objects of class
vselsummary
are returned by summary.vsel()
). As a shortcut method,
performances.vsel()
is provided as well (objects of class vsel
are
returned by varsel()
and cv_varsel()
). For a graphical representation,
see plot.vsel()
.
performances(object, ...) ## S3 method for class 'vselsummary' performances(object, ...) ## S3 method for class 'vsel' performances(object, ...)
performances(object, ...) ## S3 method for class 'vselsummary' performances(object, ...) ## S3 method for class 'vsel' performances(object, ...)
object |
The object from which to retrieve the predictive performance results. Possible classes may be inferred from the names of the corresponding methods (see also the description). |
... |
For |
An object of class performances
which is a list
with the
following elements:
submodels
: The predictive performance results for the submodels, as a
data.frame
.
reference_model
: The predictive performance results for the reference
model, as a named vector.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without cross-validation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(performances(vs))
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without cross-validation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(performances(vs))
Plots the ranking proportions (see cv_proportions()
) from the fold-wise
predictor rankings in a cross-validation with fold-wise searches. This is a
visualization of the transposed matrix returned by cv_proportions()
. The
proportions printed as text inside of the colored tiles are rounded to whole
percentage points (the plotted proportions themselves are not rounded).
## S3 method for class 'cv_proportions' plot(x, text_angle = NULL, ...) ## S3 method for class 'ranking' plot(x, ...)
## S3 method for class 'cv_proportions' plot(x, text_angle = NULL, ...) ## S3 method for class 'ranking' plot(x, ...)
x |
For |
text_angle |
Passed to argument |
... |
For |
A ggplot2 plotting object (of class gg
and ggplot
).
Idea and original code by Aki Vehtari. Slight modifications of the original code by Frank Weber, Yann McLatchie, and Sölvi Rögnvaldsson. Final implementation in projpred by Frank Weber.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 1000, refresh = 0, seed = 9876 ) # Run cv_varsel() (with L1 search and small values for `K`, `nterms_max`, and # `nclusters_pred`, but only for the sake of speed in this example; this is # not recommended in general): cvvs <- cv_varsel(fit, method = "L1", cv_method = "kfold", K = 2, nterms_max = 3, nclusters_pred = 10, seed = 5555) # Extract predictor rankings: rk <- ranking(cvvs) # Compute ranking proportions: pr_rk <- cv_proportions(rk) # Visualize the ranking proportions: gg_pr_rk <- plot(pr_rk) print(gg_pr_rk) # Since the object returned by plot.cv_proportions() is a standard ggplot2 # plotting object, you can modify the plot easily, e.g., to remove the # legend: print(gg_pr_rk + ggplot2::theme(legend.position = "none"))
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 1000, refresh = 0, seed = 9876 ) # Run cv_varsel() (with L1 search and small values for `K`, `nterms_max`, and # `nclusters_pred`, but only for the sake of speed in this example; this is # not recommended in general): cvvs <- cv_varsel(fit, method = "L1", cv_method = "kfold", K = 2, nterms_max = 3, nclusters_pred = 10, seed = 5555) # Extract predictor rankings: rk <- ranking(cvvs) # Compute ranking proportions: pr_rk <- cv_proportions(rk) # Visualize the ranking proportions: gg_pr_rk <- plot(pr_rk) print(gg_pr_rk) # Since the object returned by plot.cv_proportions() is a standard ggplot2 # plotting object, you can modify the plot easily, e.g., to remove the # legend: print(gg_pr_rk + ggplot2::theme(legend.position = "none"))
This is the plot()
method for vsel
objects (returned by varsel()
or
cv_varsel()
). It visualizes the predictive performance of the reference
model (possibly also that of some other "baseline" model) and that of the
submodels along the full-data predictor ranking. Basic information about the
(CV) variability in the ranking of the predictors is included as well (if
available; inferred from cv_proportions()
). For a tabular representation,
see summary.vsel()
and performances()
.
## S3 method for class 'vsel' plot( x, nterms_max = NULL, stats = "elpd", deltas = FALSE, alpha = 2 * pnorm(-1), baseline = if (!inherits(x$refmodel, "datafit")) "ref" else "best", thres_elpd = NA, resp_oscale = TRUE, point_size = 3, bar_thickness = 1, ranking_nterms_max = NULL, ranking_abbreviate = FALSE, ranking_abbreviate_args = list(), ranking_repel = NULL, ranking_repel_args = list(), ranking_colored = FALSE, show_cv_proportions = TRUE, cumulate = FALSE, text_angle = NULL, size_position = "primary_x_bottom", ... )
## S3 method for class 'vsel' plot( x, nterms_max = NULL, stats = "elpd", deltas = FALSE, alpha = 2 * pnorm(-1), baseline = if (!inherits(x$refmodel, "datafit")) "ref" else "best", thres_elpd = NA, resp_oscale = TRUE, point_size = 3, bar_thickness = 1, ranking_nterms_max = NULL, ranking_abbreviate = FALSE, ranking_abbreviate_args = list(), ranking_repel = NULL, ranking_repel_args = list(), ranking_colored = FALSE, show_cv_proportions = TRUE, cumulate = FALSE, text_angle = NULL, size_position = "primary_x_bottom", ... )
x |
An object of class |
nterms_max |
Maximum submodel size (number of predictor terms) for which
the performance statistics are calculated. Using |
stats |
One or more character strings determining which performance
statistics (i.e., utilities or losses) to estimate based on the
observations in the evaluation (or "test") set (in case of
cross-validation, these are all observations because they are partitioned
into multiple test sets; in case of
|
deltas |
If |
alpha |
A number determining the (nominal) coverage |
baseline |
For |
thres_elpd |
Only relevant if |
resp_oscale |
Only relevant for the latent projection. A single logical
value indicating whether to calculate the performance statistics on the
original response scale ( |
point_size |
Passed to argument |
bar_thickness |
Passed to argument |
ranking_nterms_max |
Maximum submodel size (number of predictor terms)
for which the predictor names and the corresponding ranking proportions are
added on the x-axis. Using |
ranking_abbreviate |
A single logical value indicating whether the
predictor names in the full-data predictor ranking should be abbreviated by
|
ranking_abbreviate_args |
A |
ranking_repel |
Either |
ranking_repel_args |
A |
ranking_colored |
A single logical value indicating whether the points
and the uncertainty bars should be gradient-colored according to the CV
ranking proportions ( |
show_cv_proportions |
A single logical value indicating whether the CV
ranking proportions (see |
cumulate |
Passed to argument |
text_angle |
Passed to argument |
size_position |
A single character string specifying the position of the
submodel sizes. Either |
... |
Arguments passed to the internal function which is used for
bootstrapping (if applicable; see argument |
The stats
options "mse"
and "rmse"
are only available for:
the traditional projection,
the latent projection with resp_oscale = FALSE
,
the latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
.
The stats
option "acc"
(= "pctcorr"
) is only available for:
the binomial()
family in case of the traditional projection,
all families in case of the augmented-data projection,
the binomial()
family (on the original response scale) in case of the
latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
,
all families (on the original response scale) in case of the latent
projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being not NULL
.
The stats
option "auc"
is only available for:
the binomial()
family in case of the traditional projection,
the binomial()
family (on the original response scale) in case of the
latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
.
A ggplot2 plotting object (of class gg
and ggplot
). If
ranking_abbreviate
is TRUE
, the output of abbreviate()
is stored in
an attribute called projpred_ranking_abbreviated
(to allow the
abbreviations to be easily mapped back to the original predictor names).
As long as the reference model's performance is computable, it is always
shown in the plot as a dashed red horizontal line. If baseline = "best"
,
the baseline model's performance is shown as a dotted black horizontal line.
If !is.na(thres_elpd)
and any(stats %in% c("elpd", "mlpd", "gmpd"))
, the
value supplied to thres_elpd
(which is automatically adapted internally in
case of the MLPD or the GMPD or deltas = FALSE
) is shown as a dot-dashed
gray horizontal line for the reference model and, if baseline = "best"
, as
a long-dashed green horizontal line for the baseline model.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without cross-validation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(plot(vs))
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without cross-validation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(plot(vs))
After the projection of the reference model onto a submodel, the linear
predictors (for the original or a new dataset) based on that submodel can be
calculated by proj_linpred()
. These linear predictors can also be
transformed to response scale and averaged across the projected parameter
draws. Furthermore, proj_linpred()
returns the corresponding log predictive
density values if the (original or new) dataset contains response values. The
proj_predict()
function draws from the predictive distributions (there is
one such distribution for each observation from the original or new dataset)
of the submodel that the reference model has been projected onto. If the
projection has not been performed yet, both functions call project()
internally to perform the projection. Both functions can also handle multiple
submodels at once (for object
s of class vsel
or object
s returned by a
project()
call to an object of class vsel
; see project()
).
proj_linpred( object, newdata = NULL, offsetnew = NULL, weightsnew = NULL, filter_nterms = NULL, transform = FALSE, integrated = FALSE, allow_nonconst_wdraws_prj = return_draws_matrix, return_draws_matrix = FALSE, .seed = NA, ... ) proj_predict( object, newdata = NULL, offsetnew = NULL, weightsnew = NULL, filter_nterms = NULL, nresample_clusters = 1000, return_draws_matrix = FALSE, .seed = NA, resp_oscale = TRUE, ... )
proj_linpred( object, newdata = NULL, offsetnew = NULL, weightsnew = NULL, filter_nterms = NULL, transform = FALSE, integrated = FALSE, allow_nonconst_wdraws_prj = return_draws_matrix, return_draws_matrix = FALSE, .seed = NA, ... ) proj_predict( object, newdata = NULL, offsetnew = NULL, weightsnew = NULL, filter_nterms = NULL, nresample_clusters = 1000, return_draws_matrix = FALSE, .seed = NA, resp_oscale = TRUE, ... )
object |
An object returned by |
newdata |
Passed to argument |
offsetnew |
Passed to argument |
weightsnew |
Passed to argument |
filter_nterms |
Only applies if |
transform |
For |
integrated |
For |
allow_nonconst_wdraws_prj |
Only relevant for |
return_draws_matrix |
A single logical value indicating whether to
return an object (in case of |
.seed |
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument |
... |
Arguments passed to |
nresample_clusters |
For |
resp_oscale |
Only relevant for the latent projection. A single logical
value indicating whether to draw from the posterior-projection predictive
distributions on the original response scale ( |
Currently, proj_predict()
ignores observation weights that are not
equal to 1
. A corresponding warning is thrown if this is the case.
In case of the latent projection and transform = FALSE
:
Output element pred
contains the linear predictors without any
modifications that may be due to the original response distribution (e.g.,
for a brms::cumulative()
model, the ordered thresholds are not taken into
account).
Output element lpd
contains the latent log predictive density values,
i.e., those corresponding to the latent Gaussian distribution. If newdata
is not NULL
, this requires the latent response values to be supplied in a
column called .<response_name>
of newdata
where <response_name>
needs
to be replaced by the name of the original response variable (if
<response_name>
contained parentheses, these have been stripped off by
init_refmodel()
; see the left-hand side of formula(<refmodel>)
). For
technical reasons, the existence of column <response_name>
in newdata
is another requirement (even though .<response_name>
is actually used).
In the following, ,
,
, and
from help
topic refmodel-init-get are used. (For
proj_linpred()
with integrated = TRUE
, we have .) Furthermore, let
denote either
(if
transform = TRUE
)
or (if
transform = FALSE
). Then, if the
prediction is done for one submodel only (i.e., length(nterms) == 1 || !is.null(predictor_terms)
in the explicit or implicit call to project()
,
see argument object
):
proj_linpred()
returns a list
with the following elements:
Element pred
contains the actual predictions, i.e., the linear
predictors, possibly transformed to response scale (depending on
argument transform
).
Element lpd
is non-NULL
only if newdata
is NULL
or if
newdata
contains response values in the corresponding column. In that
case, it contains the log predictive density values (conditional on
each of the projected parameter draws if integrated = FALSE
and
averaged across the projected parameter draws if integrated = TRUE
).
In case of (i) the traditional projection, (ii) the latent projection
with transform = FALSE
, or (iii) the latent projection with
transform = TRUE
and <refmodel>$family$cats
(where <refmodel>
is
an object resulting from init_refmodel()
; see also
extend_family()
's argument latent_y_unqs
) being NULL
, both
elements are matrices
(converted to a—possibly weighted—
draws_matrix
if argument
return_draws_matrix
is TRUE
, see the description of this argument).
In case of (i) the augmented-data projection or (ii) the latent
projection with transform = TRUE
and <refmodel>$family$cats
being
not NULL
, pred
is an array (if argument
return_draws_matrix
is TRUE
, this array
is "compressed" to an matrix—with the columns consisting of
blocks of
rows—and then converted to a—possibly
weighted—
draws_matrix
) and lpd
is an matrix (converted to a—possibly
weighted—
draws_matrix
if argument return_draws_matrix
is TRUE
).
If return_draws_matrix
is FALSE
and allow_nonconst_wdraws_prj
is
TRUE
and integrated
is FALSE
and the projected draws have
nonconstant weights, then both list
elements have the weights of
these draws stored in an attribute wdraws_prj
. (If
return_draws_matrix
, allow_nonconst_wdraws_prj
, and integrated
are all FALSE
, then projected draws with nonconstant weights cause an
error.)
proj_predict()
returns an
matrix of predictions where
denotes
nresample_clusters
in case of clustered projection (or, more generally,
in case of projected draws with nonconstant weights). If argument
return_draws_matrix
is TRUE
, the returned matrix is converted to a
draws_matrix
(see posterior::draws_matrix()
). In case of (i) the
augmented-data projection or (ii) the latent projection with resp_oscale = TRUE
and <refmodel>$family$cats
being not NULL
, the returned matrix
(or draws_matrix
) has an attribute called cats
(the character vector of
response categories) and the values of the matrix (or draws_matrix
) are
the predicted indices of the response categories (these indices refer to
the order of the response categories from attribute cats
).
If the prediction is done for more than one submodel, the output from above
is returned for each submodel, giving a named list
with one element for
each submodel (the names of this list
being the numbers of predictor
terms of the submodels when counting the intercept, too).
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `ndraws`, but only for the sake of speed in this example; this # is not recommended in general): prj <- project(fit, predictor_terms = c("X1", "X3", "X5"), ndraws = 21, seed = 9182) # Predictions (at the training points) from the submodel onto which the # reference model was projected: prjl <- proj_linpred(prj) prjp <- proj_predict(prj, .seed = 7364)
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `ndraws`, but only for the sake of speed in this example; this # is not recommended in general): prj <- project(fit, predictor_terms = c("X1", "X3", "X5"), ndraws = 21, seed = 9182) # Predictions (at the training points) from the submodel onto which the # reference model was projected: prjl <- proj_linpred(prj) prjp <- proj_predict(prj, .seed = 7364)
This is the predict()
method for refmodel
objects (returned by
get_refmodel()
or init_refmodel()
). It offers three types of output which
are all based on the reference model and new (or old) observations: Either
the linear predictor on link scale, the linear predictor transformed to
response scale, or the log posterior predictive density.
## S3 method for class 'refmodel' predict( object, newdata = NULL, ynew = NULL, offsetnew = NULL, weightsnew = NULL, type = "response", ... )
## S3 method for class 'refmodel' predict( object, newdata = NULL, ynew = NULL, offsetnew = NULL, weightsnew = NULL, type = "response", ... )
object |
An object of class |
newdata |
Passed to argument |
ynew |
If not |
offsetnew |
Passed to argument |
weightsnew |
Passed to argument |
type |
Usually only relevant if |
... |
Currently ignored. |
Argument weightsnew
is only relevant if !is.null(ynew)
.
In case of a multilevel reference model, group-level effects for new group
levels are drawn randomly from a (multivariate) Gaussian distribution. When
setting projpred.mlvl_pred_new
to TRUE
, all group levels from newdata
(even those that already exist in the original dataset) are treated as new
group levels (if is.null(newdata)
, all group levels from the original
dataset are considered as new group levels in that case).
In the following, ,
, and
from help topic refmodel-init-get are used.
Furthermore, let
denote either
(if
type = "response"
) or (if
type = "link"
).
Then, if is.null(ynew)
, the returned object contains the reference
model's predictions (with the scale depending on argument type
) as:
a length- vector in case of (i) the traditional projection, (ii)
the latent projection with
type = "link"
, or (iii) the latent projection
with type = "response"
and object$family$cats
being NULL
;
an matrix in case of (i) the augmented-data
projection or (ii) the latent projection with
type = "response"
and
object$family$cats
being not NULL
.
If !is.null(ynew)
, the returned object is a length- vector of log
posterior predictive densities evaluated at
ynew
.
project()
runFor a projection
object (returned by project()
, possibly as elements of a
list
), this function extracts the combination of predictor terms onto which
the projection was performed.
predictor_terms(object, ...) ## S3 method for class 'projection' predictor_terms(object, ...)
predictor_terms(object, ...) ## S3 method for class 'projection' predictor_terms(object, ...)
object |
An object of class |
... |
Currently ignored. |
A character vector of predictor terms.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `nclusters`, but only for the sake of speed in this example; # this is not recommended in general): prj <- project(fit, predictor_terms = c("X1", "X3", "X5"), nclusters = 10, seed = 9182) print(predictor_terms(prj)) # gives `c("X1", "X3", "X5")`
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `nclusters`, but only for the sake of speed in this example; # this is not recommended in general): prj <- project(fit, predictor_terms = c("X1", "X3", "X5"), nclusters = 10, seed = 9182) print(predictor_terms(prj)) # gives `c("X1", "X3", "X5")`
project()
outputThis is the print()
method for objects of class projection
. This method
mainly exists to avoid cluttering the console when printing such objects
accidentally.
## S3 method for class 'projection' print(x, ...)
## S3 method for class 'projection' print(x, ...)
x |
An object of class |
... |
Currently ignored. |
The input object x
(invisible).
This is the print()
method for reference model objects (objects of class
refmodel
). This method mainly exists to avoid cluttering the console when
printing such objects accidentally.
## S3 method for class 'refmodel' print(x, ...)
## S3 method for class 'refmodel' print(x, ...)
x |
An object of class |
... |
Currently ignored. |
The input object x
(invisible).
varsel()
or cv_varsel()
runThis is the print()
method for vsel
objects (returned by varsel()
or
cv_varsel()
). It displays a summary of a varsel()
or cv_varsel()
run by
first calling summary.vsel()
and then print.vselsummary()
.
## S3 method for class 'vsel' print(x, digits = getOption("projpred.digits", 2), ...)
## S3 method for class 'vsel' print(x, digits = getOption("projpred.digits", 2), ...)
x |
An object of class |
digits |
Passed to argument |
... |
Arguments passed to |
The output of summary.vsel()
(invisible).
varsel()
or cv_varsel()
runThis is the print()
method for summary objects created by summary.vsel()
.
It displays a summary of the results from a varsel()
or cv_varsel()
run.
## S3 method for class 'vselsummary' print(x, digits = getOption("projpred.digits", 2), ...)
## S3 method for class 'vselsummary' print(x, digits = getOption("projpred.digits", 2), ...)
x |
An object of class |
digits |
Passed to |
... |
Arguments passed to |
In the submodel predictive performance table printed at (or towards)
the bottom, column ranking_fulldata
contains the full-data predictor
ranking and column cv_proportions_diag
contains the main diagonal of the
matrix returned by cv_proportions()
(with cumulate
as set in the
summary.vsel()
call that created x
). To retrieve the fold-wise
predictor rankings, use the ranking()
function, possibly followed by
cv_proportions()
for computing the ranking proportions (which can be
visualized by plot.cv_proportions()
).
The output of summary.vsel()
(invisible).
Project the posterior of the reference model onto the parameter space of a single submodel consisting of a specific combination of predictor terms or (after variable selection) onto the parameter space of a single or multiple submodels of specific sizes.
project( object, nterms = NULL, solution_terms = predictor_terms, predictor_terms = NULL, refit_prj = TRUE, ndraws = 400, nclusters = NULL, seed = NA, verbose = getOption("projpred.verbose_project", TRUE), ... )
project( object, nterms = NULL, solution_terms = predictor_terms, predictor_terms = NULL, refit_prj = TRUE, ndraws = 400, nclusters = NULL, seed = NA, verbose = getOption("projpred.verbose_project", TRUE), ... )
object |
An object which can be used as input to |
nterms |
Only relevant if |
solution_terms |
Deprecated. Please use argument |
predictor_terms |
If not |
refit_prj |
A single logical value indicating whether to fit the
submodels (again) ( |
ndraws |
Only relevant if |
nclusters |
Only relevant if |
seed |
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument |
verbose |
A single logical value indicating whether to print out
additional information during the computations. More precisely, this gets
passed as |
... |
Arguments passed to |
Arguments ndraws
and nclusters
are automatically truncated at
the number of posterior draws in the reference model (which is 1
for
datafit
s). Using less draws or clusters in ndraws
or nclusters
than
posterior draws in the reference model may result in slightly inaccurate
projection performance. Increasing these arguments affects the computation
time linearly.
If refit_prj = FALSE
(which is only possible if object
is of class
vsel
), project()
retrieves the submodel fits from the full-data search
that was run when creating object
. Usually, the search relies on a rather
coarse clustering or thinning of the reference model's posterior draws (by
default, varsel()
and cv_varsel()
use nclusters = 20
). Consequently,
project()
with refit_prj = FALSE
then inherits this coarse clustering
or thinning.
If the projection is performed onto a single submodel (i.e.,
length(nterms) == 1 || !is.null(predictor_terms)
), an object of class
projection
which is a list
containing the following elements:
dis
Projected draws for the dispersion parameter.
ce
The cross-entropy part of the Kullback-Leibler (KL) divergence from the reference model to the submodel. For some families, this is not the actual cross-entropy, but a reduced one where terms which would cancel out when calculating the KL divergence have been dropped. In case of the Gaussian family, that reduced cross-entropy is further modified, yielding merely a proxy.
wdraws_prj
Weights for the projected draws.
predictor_terms
A character vector of the submodel's predictor terms.
outdmin
A list
containing the submodel fits (one fit per
projected draw). This is the same as the return value of the
div_minimizer
function (see init_refmodel()
), except if project()
was used with an object
of class vsel
based on an L1 search as well
as with refit_prj = FALSE
, in which case this is the output from an
internal L1-penalized divergence minimizer.
cl_ref
A numeric vector of length equal to the number of posterior draws in the reference model, containing the cluster indices of these draws.
wdraws_ref
A numeric vector of length equal to the number of
posterior draws in the reference model, giving the weights of these
draws. These weights should be treated as not being normalized (i.e.,
they don't necessarily sum to 1
).
const_wdraws_prj
A single logical value indicating whether the
projected draws have constant weights (TRUE
) or not (FALSE
).
refmodel
The reference model object.
If the projection is performed onto more than one submodel, the output from
above is returned for each submodel, giving a list
with one element for
each submodel.
The elements of an object of class projection
are not meant to be
accessed directly but instead via helper functions (see the main vignette
and projpred-package; see also as_draws_matrix.projection()
, argument
return_draws_matrix
of proj_linpred()
, and argument
nresample_clusters
of proj_predict()
for the intended use of the
weights stored in element wdraws_prj
).
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without cross-validation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) # Projection onto the best submodel with 2 predictor terms (with a small # value for `nclusters`, but only for the sake of speed in this example; # this is not recommended in general): prj_from_vs <- project(vs, nterms = 2, nclusters = 10, seed = 9182) # Projection onto an arbitrary combination of predictor terms (with a small # value for `nclusters`, but only for the sake of speed in this example; # this is not recommended in general): prj <- project(fit, predictor_terms = c("X1", "X3", "X5"), nclusters = 10, seed = 9182)
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without cross-validation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) # Projection onto the best submodel with 2 predictor terms (with a small # value for `nclusters`, but only for the sake of speed in this example; # this is not recommended in general): prj_from_vs <- project(vs, nterms = 2, nclusters = 10, seed = 9182) # Projection onto an arbitrary combination of predictor terms (with a small # value for `nclusters`, but only for the sake of speed in this example; # this is not recommended in general): prj <- project(fit, predictor_terms = c("X1", "X3", "X5"), nclusters = 10, seed = 9182)
Extracts the predictor ranking(s) from an object of class vsel
(returned
by varsel()
or cv_varsel()
). A predictor ranking is simply a character
vector of predictor terms ranked by predictive relevance (with the most
relevant term first). In any case, objects of class vsel
contain the
predictor ranking based on the full-data search. If an object of class
vsel
is based on a cross-validation (CV) with fold-wise searches (i.e., if
it was created by cv_varsel()
with validate_search = TRUE
), then it also
contains fold-wise predictor rankings.
ranking(object, ...) ## S3 method for class 'vsel' ranking(object, nterms_max = NULL, ...)
ranking(object, ...) ## S3 method for class 'vsel' ranking(object, nterms_max = NULL, ...)
object |
The object from which to retrieve the predictor ranking(s). Possible classes may be inferred from the names of the corresponding methods (see also the description). |
... |
Currently ignored. |
nterms_max |
Maximum submodel size (number of predictor terms) for the
predictor ranking(s), i.e., the submodel size at which to cut off the
predictor ranking(s). Using |
An object of class ranking
which is a list
with the following
elements:
fulldata
: The predictor ranking from the full-data search.
foldwise
: The predictor rankings from the fold-wise
searches in the form of a character matrix (only available if object
is
based on a CV with fold-wise searches, otherwise element foldwise
is
NULL
). The rows of this matrix correspond to the CV folds and the columns
to the submodel sizes. Each row contains the predictor ranking from the
search of that CV fold.
# For an example, see `?plot.cv_proportions`.
# For an example, see `?plot.cv_proportions`.
Function get_refmodel()
is a generic function whose methods usually call
init_refmodel()
which is the underlying workhorse (and may also be used
directly without a call to get_refmodel()
).
Both, get_refmodel()
and init_refmodel()
, create an object containing
information needed for the projection predictive variable selection, namely
about the reference model, the submodels, and how the projection should be
carried out. For the sake of simplicity, the documentation may refer to the
resulting object also as "reference model" or "reference model object", even
though it also contains information about the submodels and the projection.
A "typical" reference model object is created by get_refmodel.stanreg()
and
brms::get_refmodel.brmsfit()
, either implicitly by a call to a top-level
function such as project()
, varsel()
, and cv_varsel()
or explicitly by
a call to get_refmodel()
. All non-"typical" reference model objects will be
called "custom" reference model objects.
Some arguments are for -fold cross-validation (
-fold CV) only;
see
cv_varsel()
for the use of -fold CV in projpred.
get_refmodel(object, ...) ## S3 method for class 'refmodel' get_refmodel(object, ...) ## S3 method for class 'vsel' get_refmodel(object, ...) ## S3 method for class 'projection' get_refmodel(object, ...) ## Default S3 method: get_refmodel(object, family = NULL, ...) ## S3 method for class 'stanreg' get_refmodel(object, latent = FALSE, dis = NULL, ...) init_refmodel( object, data, formula, family, ref_predfun = NULL, div_minimizer = NULL, proj_predfun = NULL, extract_model_data = NULL, cvfun = NULL, cvfits = NULL, dis = NULL, cvrefbuilder = NULL, called_from_cvrefbuilder = FALSE, ... )
get_refmodel(object, ...) ## S3 method for class 'refmodel' get_refmodel(object, ...) ## S3 method for class 'vsel' get_refmodel(object, ...) ## S3 method for class 'projection' get_refmodel(object, ...) ## Default S3 method: get_refmodel(object, family = NULL, ...) ## S3 method for class 'stanreg' get_refmodel(object, latent = FALSE, dis = NULL, ...) init_refmodel( object, data, formula, family, ref_predfun = NULL, div_minimizer = NULL, proj_predfun = NULL, extract_model_data = NULL, cvfun = NULL, cvfits = NULL, dis = NULL, cvrefbuilder = NULL, called_from_cvrefbuilder = FALSE, ... )
object |
For |
... |
For |
family |
An object of class |
latent |
A single logical value indicating whether to use the latent
projection ( |
dis |
A vector of posterior draws for the reference model's dispersion
parameter or—more precisely—the posterior values for the reference
model's parameter-conditional predictive variance (assuming that this
variance is the same for all observations). May be |
data |
A |
formula |
The full formula to use for the search procedure. For custom
reference models, this does not necessarily coincide with the reference
model's formula. For general information about formulas in R, see
|
ref_predfun |
Prediction function for the linear predictor of the
reference model, including offsets (if existing). See also section
"Arguments |
div_minimizer |
A function for minimizing the Kullback-Leibler (KL)
divergence from the reference model to a submodel (i.e., for performing the
projection of the reference model onto a submodel). The output of
|
proj_predfun |
Prediction function for the linear predictor of a
submodel onto which the reference model is projected. See also section
"Arguments |
extract_model_data |
A function for fetching some variables (response,
observation weights, offsets) from the original dataset (supplied to
argument |
cvfun |
For |
cvfits |
For |
cvrefbuilder |
For |
called_from_cvrefbuilder |
A single logical value indicating whether
|
An object that can be passed to all the functions that take the
reference model fit as the first argument, such as varsel()
,
cv_varsel()
, project()
, proj_linpred()
, and proj_predict()
.
Usually, the returned object is of class refmodel
. However, if object
is NULL
, the returned object is of class datafit
as well as of class
refmodel
(with datafit
being first). Objects of class datafit
are
handled differently at several places throughout this package.
The elements of the returned object are not meant to be accessed directly
but instead via downstream functions (see the functions mentioned above as
well as predict.refmodel()
).
Although bad practice (in general), a reference model lacking an intercept can be used within projpred. However, it will always be projected onto submodels which include an intercept. The reason is that even if the true intercept in the reference model is zero, this does not need to hold for the submodels.
In multilevel (group-level) terms, function calls on the right-hand side of
the |
character (e.g., (1 | gr(group_variable))
, which is possible in
brms) are currently not allowed in projpred.
For additive models (still an experimental feature), only mgcv::s()
and
mgcv::t2()
are currently supported as smooth terms. Furthermore, these need
to be called without any arguments apart from the predictor names (symbols).
For example, for smoothing the effect of a predictor x
, only s(x)
or
t2(x)
are allowed. As another example, for smoothing the joint effect of
two predictors x
and z
, only s(x, z)
or t2(x, z)
are allowed (and
analogously for higher-order joint effects, e.g., of three predictors). Note
that all smooth terms need to be included in formula
(there is no random
argument as in rstanarm::stan_gamm4()
, for example).
ref_predfun
, proj_predfun
, and div_minimizer
Arguments ref_predfun
, proj_predfun
, and div_minimizer
may be NULL
for using an internal default (see projpred-package for the functions used
by the default divergence minimizers). Otherwise, let denote the
number of observations (in case of CV, these may be reduced to each fold),
the number of posterior draws for the reference
model's parameters, and
the number of draws for
the parameters of a submodel that the reference model has been projected onto
(short: the number of projected draws). For the augmented-data projection,
let
denote the number of response categories,
the number of latent response categories (which
typically equals
), and define
as well as
. Then the functions supplied to these arguments need to have the
following prototypes:
ref_predfun
: ref_predfun(fit, newdata = NULL)
where:
fit
accepts the reference model fit as given in argument object
(but possibly refitted to a subset of the observations, as done in
-fold CV).
newdata
accepts either NULL
(for using the original dataset,
typically stored in fit
) or data for new observations (at least in the
form of a data.frame
).
proj_predfun
: proj_predfun(fits, newdata)
where:
fits
accepts a list
of length
containing this number of submodel fits. This
list
is the same as that
returned by project()
in its output element outdmin
(which in turn is
the same as the return value of div_minimizer
, except if project()
was used with an object
of class vsel
based on an L1 search as well
as with refit_prj = FALSE
).
newdata
accepts data for new observations (at least in the form of a
data.frame
).
div_minimizer
does not need to have a specific prototype, but it needs to
be able to be called with the following arguments:
formula
accepts either a standard formula
with a single response
(if or in case of the
augmented-data projection) or a
formula
with response variables
cbind()
-ed on the left-hand side in
which case the projection has to be performed for each of the response
variables separately.
data
accepts a data.frame
to be used for the projection. In case of
the traditional or the latent projection, this dataset has rows.
In case of the augmented-data projection, this dataset has
rows.
family
accepts an object of class family
.
weights
accepts either observation weights (at least in the form of a
numeric vector) or NULL
(for using a vector of ones as weights).
projpred_var
accepts an
matrix of predictive variances (necessary for projpred's internal
GLM fitter) in case of the traditional or the latent projection and an
matrix (containing only
NA
s) in case of the augmented-data projection.
projpred_ws_aug
accepts an
matrix of expected values for the response in case of the traditional or
the latent projection and an
matrix of probabilities for the
response categories in case of the augmented-data projection.
...
accepts further arguments specified by the user (or by
projpred).
The return value of these functions needs to be:
ref_predfun
: for the traditional or the latent projection, an matrix; for the augmented-data
projection, an
array (the only exception is the augmented-data projection for
the
binomial()
family in which case ref_predfun
needs to return an matrix just like for the traditional
projection because the array is constructed by an internal wrapper function).
proj_predfun
: for the traditional or the latent projection, an matrix; for the augmented-data
projection, an
array.
div_minimizer
: a list
of length
containing this number of submodel fits.
extract_model_data
The function supplied to argument extract_model_data
needs to have the
prototype
extract_model_data(object, newdata, wrhs = NULL, orhs = NULL, extract_y = TRUE)
where:
object
accepts the reference model fit as given in argument object
(but
possibly refitted to a subset of the observations, as done in -fold
CV).
newdata
accepts data for new observations (at least in the form of a
data.frame
).
wrhs
accepts at least (i) a right-hand side formula consisting only of
the variable in newdata
containing the observation weights or (ii) NULL
for using the observation weights corresponding to newdata
(typically, the
observation weights are stored in a column of newdata
; if the model was
fitted without observation weights, a vector of ones should be used).
orhs
accepts at least (i) a right-hand side formula consisting only of
the variable in newdata
containing the offsets or (ii) NULL
for using the
offsets corresponding to newdata
(typically, the offsets are stored in a
column of newdata
; if the model was fitted without offsets, a vector of
zeros should be used).
extract_y
accepts a single logical value indicating whether output
element y
(see below) shall be NULL
(TRUE
) or not (FALSE
).
The return value of extract_model_data
needs to be a list
with elements
y
, weights
, and offset
, each being a numeric vector containing the data
for the response, the observation weights, and the offsets, respectively. An
exception is that y
may also be NULL
(depending on argument extract_y
),
a non-numeric vector, or a factor
.
The weights and offsets returned by extract_model_data
will be assumed to
hold for the reference model as well as for the submodels.
Above, arguments wrhs
and orhs
were assumed to have defaults of NULL
.
It should be possible to use defaults other than NULL
, but we strongly
recommend to use NULL
. If defaults other than NULL
are used, they need to
imply the behaviors described at items "(ii)" (see the descriptions of wrhs
and orhs
).
If a custom reference model for an augmented-data projection is needed, see
also extend_family()
.
For the augmented-data projection, the response vector resulting from
extract_model_data
is internally coerced to a factor
(using
as.factor()
). The levels of this factor
have to be identical to
family$cats
(after applying extend_family()
internally; see
extend_family()
's argument augdat_y_unqs
).
Note that response-specific offsets (i.e., one length- offset vector
per response category) are not supported by projpred yet. So far, only
offsets which are the same across all response categories are supported. This
is why in case of the
brms::categorical()
family, offsets are currently not
supported at all.
Currently, object = NULL
(i.e., a datafit
; see section "Value") is not
supported in case of the augmented-data projection.
If a custom reference model for a latent projection is needed, see also
extend_family()
.
For the latent projection, family$cats
(after applying extend_family()
internally; see extend_family()
's argument latent_y_unqs
) currently must
not be NULL
if the original (i.e., non-latent) response is a factor
.
Conversely, if family$cats
(after applying extend_family()
) is
non-NULL
, the response vector resulting from extract_model_data
is
internally coerced to a factor
(using as.factor()
). The levels of this
factor
have to be identical to that non-NULL
element family$cats
.
Currently, object = NULL
(i.e., a datafit
; see section "Value") is not
supported in case of the latent projection.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Define the reference model object explicitly: ref <- get_refmodel(fit) print(class(ref)) # gives `"refmodel"` # Now see, for example, `?varsel`, `?cv_varsel`, and `?project` for # possible post-processing functions. Most of the post-processing functions # call get_refmodel() internally at the beginning, so you will rarely need # to call get_refmodel() yourself. # A custom reference model object which may be used in a variable selection # where the candidate predictors are not a subset of those used for the # reference model's predictions: ref_cust <- init_refmodel( fit, data = dat_gauss, formula = y ~ X6 + X7, family = gaussian(), cvfun = function(folds) { kfold( fit, K = max(folds), save_fits = TRUE, folds = folds, cores = 1 )$fits[, "fit"] }, dis = as.matrix(fit)[, "sigma"], cvrefbuilder = function(cvfit) { init_refmodel(cvfit, data = dat_gauss[-cvfit$omitted, , drop = FALSE], formula = y ~ X6 + X7, family = gaussian(), dis = as.matrix(cvfit)[, "sigma"], called_from_cvrefbuilder = TRUE) } ) # Now, the post-processing functions mentioned above (for example, # varsel(), cv_varsel(), and project()) may be applied to `ref_cust`.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Define the reference model object explicitly: ref <- get_refmodel(fit) print(class(ref)) # gives `"refmodel"` # Now see, for example, `?varsel`, `?cv_varsel`, and `?project` for # possible post-processing functions. Most of the post-processing functions # call get_refmodel() internally at the beginning, so you will rarely need # to call get_refmodel() yourself. # A custom reference model object which may be used in a variable selection # where the candidate predictors are not a subset of those used for the # reference model's predictions: ref_cust <- init_refmodel( fit, data = dat_gauss, formula = y ~ X6 + X7, family = gaussian(), cvfun = function(folds) { kfold( fit, K = max(folds), save_fits = TRUE, folds = folds, cores = 1 )$fits[, "fit"] }, dis = as.matrix(fit)[, "sigma"], cvrefbuilder = function(cvfit) { init_refmodel(cvfit, data = dat_gauss[-cvfit$omitted, , drop = FALSE], formula = y ~ X6 + X7, family = gaussian(), dis = as.matrix(cvfit)[, "sigma"], called_from_cvrefbuilder = TRUE) } ) # Now, the post-processing functions mentioned above (for example, # varsel(), cv_varsel(), and project()) may be applied to `ref_cust`.
cvfits
from cvfun
A helper function that can be used to create input for
cv_varsel.refmodel()
's argument cvfits
by running first cv_folds()
and
then the reference model object's cvfun
(see init_refmodel()
). This is
helpful if -fold CV is run multiple times based on the same
reference model refits.
run_cvfun(object, ...) ## Default S3 method: run_cvfun(object, ...) ## S3 method for class 'refmodel' run_cvfun( object, K = if (!inherits(object, "datafit")) 5 else 10, folds = NULL, seed = NA, ... )
run_cvfun(object, ...) ## Default S3 method: run_cvfun(object, ...) ## S3 method for class 'refmodel' run_cvfun( object, K = if (!inherits(object, "datafit")) 5 else 10, folds = NULL, seed = NA, ... )
object |
An object of class |
... |
For |
K |
Number of folds. Must be at least 2 and not exceed the number of
observations. Ignored if |
folds |
Either |
seed |
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument |
An object that can be used as input for cv_varsel.refmodel()
's
argument cvfits
.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Define the reference model object explicitly (not really necessary here # because the get_refmodel() call is quite fast in this example, but in # general, this approach is faster than defining the reference model object # multiple times implicitly): ref <- get_refmodel(fit) # Run the reference model object's `cvfun` (with a small value for `K`, but # only for the sake of speed in this example; this is not recommended in # general): cv_fits <- run_cvfun(ref, K = 2, seed = 184) # Run cv_varsel() (with L1 search and small values for `nterms_max` and # `nclusters_pred`, but only for the sake of speed in this example; this is # not recommended in general) and use `cv_fits` there: cvvs_L1 <- cv_varsel(ref, method = "L1", cv_method = "kfold", cvfits = cv_fits, nterms_max = 3, nclusters_pred = 10, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible post-processing functions. # The purpose of run_cvfun() is to create an object that can be used in # multiple cv_varsel() calls, e.g., to check the sensitivity to the search # method (L1 or forward): cvvs_fw <- cv_varsel(ref, method = "forward", cv_method = "kfold", cvfits = cv_fits, nterms_max = 3, nclusters = 5, nclusters_pred = 10, seed = 5555) # Stratified K-fold CV is straightforward: n_strat <- 3L set.seed(692) # Some example strata: strat_fac <- sample(paste0("lvl", seq_len(n_strat)), size = nrow(dat_gauss), replace = TRUE, prob = diff(c(0, pnorm(seq_len(n_strat - 1L) - 0.5), 1))) table(strat_fac) # Use loo::kfold_split_stratified() to create the folds vector: folds_strat <- loo::kfold_split_stratified(K = 2, x = strat_fac) table(folds_strat, strat_fac) # Call run_cvfun(), but this time with argument `folds` instead of `K` (here, # specifying argument `seed` would not be necessary because of the set.seed() # call above, but we specify it nonetheless for the sake of generality): cv_fits_strat <- run_cvfun(ref, folds = folds_strat, seed = 391) # Now use `cv_fits_strat` analogously to `cv_fits` from above.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Define the reference model object explicitly (not really necessary here # because the get_refmodel() call is quite fast in this example, but in # general, this approach is faster than defining the reference model object # multiple times implicitly): ref <- get_refmodel(fit) # Run the reference model object's `cvfun` (with a small value for `K`, but # only for the sake of speed in this example; this is not recommended in # general): cv_fits <- run_cvfun(ref, K = 2, seed = 184) # Run cv_varsel() (with L1 search and small values for `nterms_max` and # `nclusters_pred`, but only for the sake of speed in this example; this is # not recommended in general) and use `cv_fits` there: cvvs_L1 <- cv_varsel(ref, method = "L1", cv_method = "kfold", cvfits = cv_fits, nterms_max = 3, nclusters_pred = 10, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible post-processing functions. # The purpose of run_cvfun() is to create an object that can be used in # multiple cv_varsel() calls, e.g., to check the sensitivity to the search # method (L1 or forward): cvvs_fw <- cv_varsel(ref, method = "forward", cv_method = "kfold", cvfits = cv_fits, nterms_max = 3, nclusters = 5, nclusters_pred = 10, seed = 5555) # Stratified K-fold CV is straightforward: n_strat <- 3L set.seed(692) # Some example strata: strat_fac <- sample(paste0("lvl", seq_len(n_strat)), size = nrow(dat_gauss), replace = TRUE, prob = diff(c(0, pnorm(seq_len(n_strat - 1L) - 0.5), 1))) table(strat_fac) # Use loo::kfold_split_stratified() to create the folds vector: folds_strat <- loo::kfold_split_stratified(K = 2, x = strat_fac) table(folds_strat, strat_fac) # Call run_cvfun(), but this time with argument `folds` instead of `K` (here, # specifying argument `seed` would not be necessary because of the set.seed() # call above, but we specify it nonetheless for the sake of generality): cv_fits_strat <- run_cvfun(ref, folds = folds_strat, seed = 391) # Now use `cv_fits_strat` analogously to `cv_fits` from above.
varsel()
or cv_varsel()
run
or the predictor combination from a project()
runThe solution_terms.vsel()
method retrieves the solution path from a
full-data search (vsel
objects are returned by varsel()
or
cv_varsel()
). The solution_terms.projection()
method retrieves the
predictor combination onto which a projection was performed (projection
objects are returned by project()
, possibly as elements of a list
). Both
methods (and hence also the solution_terms()
generic) are deprecated and
will be removed in a future release. Please use ranking()
instead of
solution_terms.vsel()
(ranking()
's output element fulldata
contains the
full-data predictor ranking that is extracted by solution_terms.vsel()
;
ranking()
's output element foldwise
contains the fold-wise predictor
rankings—if available—which were previously not accessible via a built-in
function) and predictor_terms()
instead of solution_terms.projection()
.
solution_terms(object, ...) ## S3 method for class 'vsel' solution_terms(object, ...) ## S3 method for class 'projection' solution_terms(object, ...)
solution_terms(object, ...) ## S3 method for class 'vsel' solution_terms(object, ...) ## S3 method for class 'projection' solution_terms(object, ...)
object |
The object from which to retrieve the predictor terms. Possible classes may be inferred from the names of the corresponding methods (see also the description). |
... |
Currently ignored. |
A character vector of predictor terms.
This function can suggest an appropriate submodel size based on a decision
rule described in section "Details" below. Note that this decision is quite
heuristic and should be interpreted with caution. It is recommended to
examine the results via plot.vsel()
, cv_proportions()
,
plot.cv_proportions()
, and/or summary.vsel()
and to make the final
decision based on what is most appropriate for the problem at hand.
suggest_size(object, ...) ## S3 method for class 'vsel' suggest_size( object, stat = "elpd", pct = 0, type = "upper", thres_elpd = NA, warnings = TRUE, ... )
suggest_size(object, ...) ## S3 method for class 'vsel' suggest_size( object, stat = "elpd", pct = 0, type = "upper", thres_elpd = NA, warnings = TRUE, ... )
object |
An object of class |
... |
Arguments passed to |
stat |
Performance statistic (i.e., utility or loss) used for the
decision. See argument |
pct |
A number giving the proportion (not percents) of the relative null model utility one is willing to sacrifice. See section "Details" below for more information. |
type |
Either |
thres_elpd |
Only relevant if |
warnings |
Mainly for internal use. A single logical value indicating
whether to throw warnings if automatic suggestion fails. Usually there is
no reason to set this to |
In general (beware of special cases below), the suggested model
size is the smallest model size for which either the
lower or upper bound (depending on argument
type
) of the
normal-approximation (or bootstrap or exponentiated normal-approximation;
see argument stat
) confidence interval (with nominal coverage 1 - alpha
; see argument alpha
of summary.vsel()
) for (with
denoting the
-th
submodel's true utility and
denoting the
baseline model's true utility)
falls above (or is equal to)
where denotes the null
model's estimated utility and
the baseline
model's estimated utility. The baseline model is either the reference model
or the best submodel found (see argument
baseline
of summary.vsel()
).
In doing so, loss statistics like the root mean squared error (RMSE) and
the mean squared error (MSE) are converted to utilities by multiplying them
by -1
, so a call such as suggest_size(object, stat = "rmse", type = "upper")
finds the smallest model size whose upper confidence interval
bound for the negative RMSE or MSE exceeds (or is equal to) the cutoff
(or, equivalently, has the lower confidence interval bound for the RMSE or
MSE below—or equal to—the cutoff). This is done to make the
interpretation of argument type
the same regardless of argument stat
.
For the geometric mean predictive density (GMPD), the decision rule above
is applied on log()
scale. In other words, if the true GMPD is denoted by
for the
-th submodel and
for the baseline model (so that
and
from above are given by
and
), then
suggest_size()
yields the smallest model size whose
lower or upper (depending on argument type
) confidence interval bound for
exceeds (or
is equal to)
where denotes the null
model's estimated GMPD and
the
baseline model's estimated GMPD.
If !is.na(thres_elpd)
and stat = "elpd"
, the decision rule above is
extended: The suggested model size is then the smallest model size
fulfilling the rule above or
. Correspondingly, in case
of
stat = "mlpd"
(and !is.na(thres_elpd)
), the suggested model size is
the smallest model size fulfilling the rule above or
with
denoting the number of observations.
Correspondingly, in case of
stat = "gmpd"
(and !is.na(thres_elpd)
), the
suggested model size is the smallest model size fulfilling the rule
above or
.
For example (disregarding the special extensions in case of
!is.na(thres_elpd)
with stat %in% c("elpd", "mlpd", "gmpd")
), alpha = 2 * pnorm(-1)
, pct = 0
, and type = "upper"
means that we select the
smallest model size for which the upper bound of the 1 - 2 * pnorm(-1)
(approximately 68.3%) confidence interval for
(
in case of
the GMPD) exceeds (or is equal to) zero (one in case of the GMPD), that is
(if
stat
is a performance statistic for which the normal approximation is
used, not the bootstrap and not the exponentiated normal approximation),
for which the submodel's utility estimate is at most one standard error
smaller than the baseline model's utility estimate (with that standard
error referring to the utility difference).
Apart from the two summary.vsel()
arguments mentioned above (alpha
and
baseline
), resp_oscale
is another important summary.vsel()
argument
that may be passed via ...
.
A single numeric value, giving the suggested submodel size (or NA
if the suggestion failed).
The intercept is not counted by suggest_size()
, so a suggested size of
zero stands for the intercept-only model.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without cross-validation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(suggest_size(vs))
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without cross-validation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(suggest_size(vs))
varsel()
or cv_varsel()
runThis is the summary()
method for vsel
objects (returned by varsel()
or
cv_varsel()
). Apart from some general information about the varsel()
or
cv_varsel()
run, it shows the full-data predictor ranking, basic
information about the (CV) variability in the ranking of the predictors (if
available; inferred from cv_proportions()
), and estimates for
user-specified predictive performance statistics. For a graphical
representation, see plot.vsel()
. For extracting the predictive performance
results printed at the bottom of the output created by this summary()
method, see performances()
.
## S3 method for class 'vsel' summary( object, nterms_max = NULL, stats = "elpd", type = c("mean", "se", "diff", "diff.se"), deltas = FALSE, alpha = 2 * pnorm(-1), baseline = if (!inherits(object$refmodel, "datafit")) "ref" else "best", resp_oscale = TRUE, cumulate = FALSE, ... )
## S3 method for class 'vsel' summary( object, nterms_max = NULL, stats = "elpd", type = c("mean", "se", "diff", "diff.se"), deltas = FALSE, alpha = 2 * pnorm(-1), baseline = if (!inherits(object$refmodel, "datafit")) "ref" else "best", resp_oscale = TRUE, cumulate = FALSE, ... )
object |
An object of class |
nterms_max |
Maximum submodel size (number of predictor terms) for which
the performance statistics are calculated. Using |
stats |
One or more character strings determining which performance
statistics (i.e., utilities or losses) to estimate based on the
observations in the evaluation (or "test") set (in case of
cross-validation, these are all observations because they are partitioned
into multiple test sets; in case of
|
type |
One or more items from |
deltas |
If |
alpha |
A number determining the (nominal) coverage |
baseline |
For |
resp_oscale |
Only relevant for the latent projection. A single logical
value indicating whether to calculate the performance statistics on the
original response scale ( |
cumulate |
Passed to argument |
... |
Arguments passed to the internal function which is used for
bootstrapping (if applicable; see argument |
The stats
options "mse"
and "rmse"
are only available for:
the traditional projection,
the latent projection with resp_oscale = FALSE
,
the latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
.
The stats
option "acc"
(= "pctcorr"
) is only available for:
the binomial()
family in case of the traditional projection,
all families in case of the augmented-data projection,
the binomial()
family (on the original response scale) in case of the
latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
,
all families (on the original response scale) in case of the latent
projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being not NULL
.
The stats
option "auc"
is only available for:
the binomial()
family in case of the traditional projection,
the binomial()
family (on the original response scale) in case of the
latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
.
An object of class vselsummary
. The elements of this object are not
meant to be accessed directly but instead via helper functions
(print.vselsummary()
and performances.vselsummary()
).
print.vselsummary()
, performances.vselsummary()
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without cross-validation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(summary(vs), digits = 1)
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without cross-validation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(summary(vs), digits = 1)
Run the search part and the evaluation part for a projection predictive
variable selection. The search part determines the predictor ranking (also
known as solution path), i.e., the best submodel for each submodel size
(number of predictor terms). The evaluation part determines the predictive
performance of the submodels along the predictor ranking. A special method is
varsel.vsel()
because it re-uses the search results from an earlier
varsel()
(or cv_varsel()
) run, as illustrated in the main vignette.
varsel(object, ...) ## Default S3 method: varsel(object, ...) ## S3 method for class 'vsel' varsel(object, ...) ## S3 method for class 'refmodel' varsel( object, d_test = NULL, method = "forward", ndraws = NULL, nclusters = 20, ndraws_pred = 400, nclusters_pred = NULL, refit_prj = !inherits(object, "datafit"), nterms_max = NULL, verbose = TRUE, search_control = NULL, lambda_min_ratio = 1e-05, nlambda = 150, thresh = 1e-06, penalty = NULL, search_terms = NULL, search_out = NULL, seed = NA, ... )
varsel(object, ...) ## Default S3 method: varsel(object, ...) ## S3 method for class 'vsel' varsel(object, ...) ## S3 method for class 'refmodel' varsel( object, d_test = NULL, method = "forward", ndraws = NULL, nclusters = 20, ndraws_pred = 400, nclusters_pred = NULL, refit_prj = !inherits(object, "datafit"), nterms_max = NULL, verbose = TRUE, search_control = NULL, lambda_min_ratio = 1e-05, nlambda = 150, thresh = 1e-06, penalty = NULL, search_terms = NULL, search_out = NULL, seed = NA, ... )
object |
An object of class |
... |
For |
d_test |
A |
method |
The method for the search part. Possible options are
|
ndraws |
Number of posterior draws used in the search part. Ignored if
|
nclusters |
Number of clusters of posterior draws used in the search
part. Ignored in case of L1 search (because L1 search always uses a single
cluster). For the meaning of |
ndraws_pred |
Only relevant if |
nclusters_pred |
Only relevant if |
refit_prj |
For the evaluation part, should the submodels along the
predictor ranking be fitted again ( |
nterms_max |
Maximum submodel size (number of predictor terms) up to
which the search is continued. If |
verbose |
A single logical value indicating whether to print out additional information during the computations. |
search_control |
A
|
lambda_min_ratio |
Deprecated (please use |
nlambda |
Deprecated (please use |
thresh |
Deprecated (please use |
penalty |
Only relevant for L1 search. A numeric vector determining the
relative penalties or costs for the predictors. A value of |
search_terms |
Only relevant for forward search. A custom character
vector of predictor term blocks to consider for the search. Section
"Details" below describes more precisely what "predictor term block" means.
The intercept ( |
search_out |
Intended for internal use. |
seed |
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument |
Arguments ndraws
, nclusters
, nclusters_pred
, and ndraws_pred
are automatically truncated at the number of posterior draws in the
reference model (which is 1
for datafit
s). Using less draws or clusters
in ndraws
, nclusters
, nclusters_pred
, or ndraws_pred
than posterior
draws in the reference model may result in slightly inaccurate projection
performance. Increasing these arguments affects the computation time
linearly.
For argument method
, there are some restrictions: For a reference model
with multilevel or additive formula terms or a reference model set up for
the augmented-data projection, only the forward search is available.
Furthermore, argument search_terms
requires a forward search to take
effect.
L1 search is faster than forward search, but forward search may be more accurate. Furthermore, forward search may find a sparser model with comparable performance to that found by L1 search, but it may also start overfitting when more predictors are added.
An L1 search may select an interaction term before all involved lower-order interaction terms (including main-effect terms) have been selected. In projpred versions > 2.6.0, the resulting predictor ranking is automatically modified so that the lower-order interaction terms come before this interaction term, but if this is conceptually undesired, choose the forward search instead.
The elements of the search_terms
character vector don't need to be
individual predictor terms. Instead, they can be building blocks consisting
of several predictor terms connected by the +
symbol. To understand how
these building blocks work, it is important to know how projpred's
forward search works: It starts with an empty vector chosen
which will
later contain already selected predictor terms. Then, the search iterates
over model sizes (with
denoting the maximum submodel size, not counting the intercept). The
candidate models at model size
are constructed from those elements
from
search_terms
which yield model size when combined with the
chosen
predictor terms. Note that sometimes, there may be no candidate
models for model size . Also note that internally,
search_terms
is
expanded to include the intercept ("1"
), so the first step of the search
(model size 0) always consists of the intercept-only model as the only
candidate.
As a search_terms
example, consider a reference model with formula y ~ x1 + x2 + x3
. Then, to ensure that x1
is always included in the
candidate models, specify search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")
(or, in a simpler way that leads to the same results,
search_terms = c("x1", "x1 + x2", "x1 + x3")
, for which helper function
force_search_terms()
exists). This search would start with y ~ 1
as the
only candidate at model size 0. At model size 1, y ~ x1
would be the only
candidate. At model size 2, y ~ x1 + x2
and y ~ x1 + x3
would be the
two candidates. At the last model size of 3, y ~ x1 + x2 + x3
would be
the only candidate. As another example, to exclude x1
from the search,
specify search_terms = c("x2", "x3", "x2 + x3")
(or, in a simpler way
that leads to the same results, search_terms = c("x2", "x3")
).
An object of class vsel
. The elements of this object are not meant
to be accessed directly but instead via helper functions (see the main
vignette and projpred-package).
d_test
If not NULL
, then d_test
needs to be a list
with the following
elements:
data
: a data.frame
containing the predictor variables for the test set.
offset
: a numeric vector containing the offset values for the test set
(if there is no offset, use a vector of zeros).
weights
: a numeric vector containing the observation weights for the test
set (if there are no observation weights, use a vector of ones).
y
: a vector or a factor
containing the response values for the test
set. In case of the latent projection, this has to be a vector containing the
latent response values, but it can also be a vector full of NA
s if
latent-scale post-processing is not needed.
y_oscale
: Only needs to be provided in case of the latent projection
where this needs to be a vector or a factor
containing the original
(i.e., non-latent) response values for the test set.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without cross-validation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible post-processing functions.
# Data: dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit <- rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without cross-validation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible post-processing functions.
A helper function for extracting response values, observation weights, and
offsets from a dataset. It is designed for use in the extract_model_data
function of custom reference model objects (see init_refmodel()
).
y_wobs_offs(newdata, wrhs = NULL, orhs = NULL, resp_form)
y_wobs_offs(newdata, wrhs = NULL, orhs = NULL, resp_form)
newdata |
The |
wrhs |
Either a right-hand side formula consisting only of the variable
in |
orhs |
Either a right-hand side formula consisting only of the variable
in |
resp_form |
If this is a formula, then the second element of this
formula (if the formula is a standard formula with both left-hand and
right-hand side, then its second element is the left-hand side; if the
formula is a right-hand side formula, then its second element is the
right-hand side) will be extracted from |
A list
with elements y
, weights
, and offset
, each being a
numeric vector containing the data for the response, the observation
weights, and the offsets, respectively. An exception is that y
may also
be NULL
(depending on argument resp_form
), a non-numeric vector, or a
factor
.
# For an example, see `?init_refmodel`.
# For an example, see `?init_refmodel`.