varsel() or cv_varsel() call can now be re-used with the help of the new varsel.vsel() and cv_varsel.vsel() methods (i.e., by applying varsel() or cv_varsel() to the output of the earlier varsel() or cv_varsel() call). This can save a lot of time when re-running the predictive performance evaluation part multiple times based on the same search results. An illustration may be found in the updated main vignette (section "Preliminary cv_varsel() run"; a more general description may also be found in section "Speed"). (GitHub: #461, #463, #465, #466)
validate_search = FALSE. Related to this is an internal change which may cause subsampled PSIS-LOO CV (an experimental feature controlled by argument nloo of cv_varsel()) with clustered projection during the search (i.e., 1 < nclusters && nclusters < S, where S denotes the number of posterior draws in the reference model) to yield slightly different results due to different internal pseudorandom number generator (PRNG) states. Furthermore, if is.na(seed), then the PRNG state for code downstream of such a cv_varsel() call will be different due to this internal change. (GitHub: #464)
print.vselsummary() (and hence also print.vsel()) now prints the reference model's performance evaluation results as well (not just those of the submodels). Correspondingly, a new helper function performances() has been added which allows accessing the reference model's (as well as the submodels') performance evaluation results. (GitHub: #471)
solution_terms of project() has been deprecated. Please use the new argument predictor_terms instead. (GitHub: #472)
augmat or augvec do not need to have an attribute called nobs_orig anymore, but a new attribute called ndiscrete, giving the number of (possibly latent) response categories instead of the number of observations (see ?`augdat-internals`). This simplifies the subsetting of such objects. (GitHub: #473)
projpred.warn_prj_drawwise to FALSE. Previously, projpred suppressed such messages and warnings. (GitHub: #478)
projpred.check_conv to FALSE. (GitHub: #478)
as.matrix.projection(), nm_scheme = "auto" is deprecated. Please use nm_scheme = NULL instead.
plot.vsel() now includes a title and a subtitle, with the subtitle mentioning the nominal coverage as well as the type of the confidence intervals (CIs) explicitly. However, in case of a faceted plot (i.e., in case of multiple stats) and some stats implying a different CI type than other stats, the CI types are omitted (because mentioning them would make the subtitle too complicated). Note that title and subtitle can always be omitted with <plot.vsel() output object> + ggplot2::labs(title = NULL, subtitle = NULL). (GitHub: #468)
plot.vsel() has gained a new argument show_cv_proportions, which allows omitting the CV ranking proportions. (GitHub: #470)
summary.vsel()'s output element selection to perf_sub and made the names of this data.frame's columns more consistent so that it is easier to handle that data.frame programmatically. This should not be a breaking change because elements of vselsummary objects (i.e., elements of objects returned by summary.vsel()) are not meant to be accessed directly (for elements perf_sub and perf_ref, the new helper function performances() has been added, see "Major changes" above). (GitHub: #471)
summary.vsel() and plot.vsel(), the NA_character_ "string" (which was previously used as a placeholder for the predictor term of the intercept-only model at size 0) was replaced by the string "(Intercept)". (GitHub: #471)
project()'s output element solution_terms to predictor_terms. This should not be a breaking change because that element is meant to be accessed via predictor_terms(). (GitHub: #472)
solution_terms and solution_terms_cv of vsel objects (returned by varsel() and cv_varsel()) to predictor_ranking and predictor_ranking_cv, respectively. This should not be a breaking change because those elements are meant to be accessed via ranking(). (GitHub: #472)
projpred.verbose_project now affects the verbosity of all projections performed by the built-in divergence minimizers (except for the built-in L1-projection divergence minimizer). In particular, the divergence minimizer (no matter whether built-in or user-specified) is also employed when calling varsel() or cv_varsel(), so setting option projpred.verbose_project to TRUE now shows the progress of the projections during a varsel() or cv_varsel() call. Previously, that option only affected the projections performed through project() (see the default for project()'s argument verbose). Usually, setting projpred.verbose_project to TRUE only makes sense when setting global option projpred.extra_verbose and argument verbose (of varsel() or cv_varsel()) to TRUE as well.
print() methods for objects of class refmodel and projection, mainly to avoid cluttering the console when printing such objects accidentally.
extract_model_data of init_refmodel() is now allowed to be NULL for using an internal default.
print.vselsummary() and print.vsel() now use a minimum number of significant digits of 2 by default. The previous behavior can be restored by setting options(projpred.digits = getOption("digits")).
stats of the ?summary.vsel help. (GitHub: #476)
project()'s argument verbose now gets passed to argument verbose_divmin (not projpred_verbose) of the divergence minimizer function (see argument div_minimizer of init_refmodel()).
lambda_min_ratio, nlambda, and thresh of varsel() and cv_varsel() have been deprecated. Instead, varsel() and cv_varsel() have gained a new argument called search_control which accepts control arguments for the search as a list. Thus, former arguments lambda_min_ratio, nlambda, and thresh should now be specified via search_control (but note that search_control is more general because it also accepts control arguments for a forward search). (GitHub: #477)
run_cvfun() has gained a new argument folds, accepting a vector of fold indices (the default is NULL, meaning that the folds are constructed internally, as before). This new argument is helpful, for example, to perform a stratified K-fold CV in a convenient manner (an example of this has been added to the ?run_cvfun help). (GitHub: #480)
plot.vsel() has gained a new argument size_position. Setting it to "primary_x_top" moves the text for the submodel sizes above the x-axis. Setting it to "secondary_x" moves that text into a secondary x-axis located at the top of the plot. (GitHub: #484)
plot.vsel() have been changed: x-axis text is now right-aligned (left-aligned) for text_angle > 0 (< 0) and also top-aligned for -90 < text_angle && text_angle < 90 && text_angle != 0. We emphasize that alignments can always be customized with <plot.vsel() output object> + ggplot2::theme(axis.text.x.bottom = ggplot2::element_text(hjust = <hjust_value>, vjust = <vjust_value>)). (GitHub: #484)
plot.vsel() to produce extra ("empty") ticks on the x-axis. (GitHub: #462)
summary.vsel() and plot.vsel() causing bootstrap results (i.e., standard error and confidence interval for RMSE and AUC) to be incorrect if deltas = TRUE. (GitHub: #474)
summary.vsel() and plot.vsel() sometimes causing incorrect predictive performance results in case of subsampled PSIS-LOO CV (an experimental feature controlled by argument nloo of cv_varsel()). (GitHub: #475)
cvfits (the new structure was introduced by version 2.7.0, see GitHub pull request #456).
The default search method is now "forward" search for all kinds of models (previously, "L1" search was used by default where available). The reason for this change is that in general, forward search is more favorable compared to L1 search (see section "Details" in ?varsel or ?cv_varsel). (GitHub: #453, #459)
Several enhancements with respect to projected draws with different (i.e., nonconstant) weights, which typically occur in case of clustered projection (GitHub: #206, #439):
as.matrix.projection() now throws an error if the projected draws have nonconstant weights. (This error is the default behavior; it can be avoided by setting the new argument allow_nonconst_wdraws_prj to TRUE, but this is for expert use only because in that case, the weights of the projected draws are stored in an attribute wdraws_prj and handling this attribute requires special care, e.g., when subsetting the returned matrix.) Instead, a posterior::as_draws_matrix() method (as_draws_matrix.projection()) has been added which allows for a safer handling of these weights (e.g., with the help of posterior::resample_draws(), see section "Examples" of the ?as_draws_matrix.projection help). Just like as.matrix.projection(), as_draws_matrix.projection() also works for the more common case of projected draws with constant weights. A posterior::as_draws() method (as_draws.projection()) has also been added, but this is merely a wrapper for as_draws_matrix.projection().
proj_linpred() now also throws an error (by default) if the projected draws have nonconstant weights and has gained the new arguments allow_nonconst_wdraws_prj and return_draws_matrix. As in as.matrix.projection(), argument allow_nonconst_wdraws_prj is for expert use only. Instead, return_draws_matrix is the intended argument in case of projected draws with nonconstant weights. Similarly to as_draws_matrix.projection(), it requires the posterior package and returns a draws_matrix (with weighted draws if the projected draws have nonconstant weights and integrated is FALSE).
proj_predict() has gained an argument return_draws_matrix for converting the returned matrix to a draws_matrix (which again requires the posterior package). For proj_predict(), no further modifications were necessary because its argument nresample_clusters already takes weights of projected draws appropriately into account.
Added helper function run_cvfun() which can be used to create input for cv_varsel.refmodel()'s new argument cvfits (which is the same as init_refmodel()'s argument cvfits, but avoids having to call init_refmodel() or get_refmodel() twice). See the documentation of run_cvfun() for details. (GitHub: #458)
Users applying varsel() or cv_varsel() to an object of class vsel now need to use varsel(get_refmodel(<vsel_object>), <...>) or cv_varsel(get_refmodel(<vsel_object>), <...>) instead of varsel(<vsel_object>, <...>) and cv_varsel(<vsel_object>, <...>), respectively. The reason is that new methods varsel.vsel() and cv_varsel.vsel() have been added. Currently, these are only placeholders, but in a future release, they will offer new functionality.
If an L1 search selects an interaction term before all involved lower-order interaction terms (including main-effect terms) have been selected, the predictor ranking is now automatically modified so that the lower-order interaction terms come before this interaction term. A corresponding warning is thrown, which may be deactivated by setting the global option projpred.warn_L1_interactions to FALSE. Previously, beginning with version 2.5.0, only a warning was thrown and this only if an L1 search selected an interaction term before all involved main-effect terms had been selected. (GitHub: #420)
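The reordering of an L1 predictor ranking described above can be illustrated with a small base R sketch. This is not projpred's internal implementation; the helper names components(), is_lower_order(), and reorder_ranking() are hypothetical, and the sketch assumes that interaction terms are written with ":" between the involved variables.

```r
# Hypothetical helpers (not part of projpred), for illustration only.

# Split an interaction term such as "x1:x2" into its variables.
components <- function(term) strsplit(term, ":", fixed = TRUE)[[1]]

# TRUE if `lower` involves a proper subset of the variables in `higher`.
is_lower_order <- function(lower, higher) {
  lo <- components(lower)
  hi <- components(higher)
  length(lo) < length(hi) && all(lo %in% hi)
}

# Move lower-order terms in front of each interaction term that involves them.
reorder_ranking <- function(ranking) {
  out <- character(0)
  for (term in ranking) {
    if (term %in% out) next
    # Lower-order terms from the ranking that have not been placed yet
    lower <- Filter(function(x) is_lower_order(x, term), ranking)
    lower <- setdiff(lower, out)
    # Main effects first, then two-way interactions, and so on
    lower <- lower[order(lengths(strsplit(lower, ":", fixed = TRUE)))]
    out <- c(out, lower, term)
  }
  out
}

ranking_l1 <- c("x1", "x1:x2", "x2", "x3")
ranking_fixed <- reorder_ranking(ranking_l1)
print(ranking_fixed)
```

In this toy ranking, the main effect x2 is moved in front of the interaction x1:x2, while the remaining order is left untouched.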
Added a progress bar for project() (when using the built-in divergence minimizers). For this, project() has gained a new argument verbose which can also be controlled via the global option projpred.verbose_project. By default, the new progress bar is activated. (GitHub: #421)
Added a new argument parallel to cv_varsel(). With parallel = TRUE, costly parts of projpred's cross-validation (CV) can be run in parallel. See the documentation of that new argument (and section "Note" of cv_varsel()'s documentation) for details. (GitHub: #422)
Added a warning for issue #323 (for multilevel Gaussian models, the projection onto the full model can be unstable). (GitHub: #426)
plot.vsel() has gained the new arguments point_size and bar_thickness which control the size of the points and the thickness of the uncertainty bars, respectively. By default, the points are slightly larger now and the uncertainty bars slightly thicker than before. The previous appearance can be achieved by setting point_size = 1.5 and bar_thickness = 0.5. (GitHub: #429, #443)
plot.vsel(): Added argument ranking_colored for coloring the points and the uncertainty bars according to the magnitude of the (possibly cumulated) CV ranking proportions. (GitHub: #430; thanks to @yannmclatchie for the suggestion)
Added warnings for most of the problems described in section "Troubleshooting" of the main vignette. (GitHub: #431)
Output element p_type of project() has been removed. Instead, output element const_wdraws_prj has been added, but its definition is essentially the inverse of former element p_type (see the updated documentation of project()'s output). This should not be a breaking change for users (as p_type was mainly intended for internal use and the new element const_wdraws_prj is so, too) but this slightly enhances the cases where as.matrix.projection() used to throw a warning (and now throws an error; see "Major changes" above) concerning the weights of the projected draws and the cases where proj_predict() resamples from the projected draws using argument nresample_clusters. (GitHub: #432)
Improved handling of PSIS-LOO CV warnings. (GitHub: #438, #451)
Reduced peak memory usage during forward search. A global option projpred.run_gc has also been added, see the general package documentation (available online or by typing ?`projpred-package`). (GitHub: #442)
Slightly improved efficiency in K-fold and PSIS-LOO CV, especially in case of a large number of observations. Under very special conditions (refit_prj = FALSE, 1 < nclusters && nclusters < S, and 1 < nclusters_pred && nclusters_pred < S; note that 1 < nclusters requires forward search, S denotes the number of posterior draws in the reference model, and nclusters_pred is essentially unused if refit_prj = FALSE), this change might affect K-fold CV results, due to a different pseudorandom number generator (PRNG) state in folds other than the first one. Under similarly special conditions (refit_prj = FALSE and 1 < nclusters_pred && nclusters_pred < S), the PRNG state for LOO subsampling (see argument nloo) is affected. Furthermore, if is.na(seed), then the PRNG state for code downstream of such cv_varsel() calls will be different due to this change. (GitHub: #446)
Slightly improved efficiency at the end of cv_varsel(), especially in case of a large number of observations. If is.na(seed), then the PRNG state for code downstream of a cv_varsel() call with refit_prj = TRUE and 1 < nclusters_pred && nclusters_pred < S (where S denotes the number of posterior draws in the reference model) will be different due to this change. (GitHub: #447)
Slightly improved memory usage in varsel(), cv_varsel(), and project(). In case of LOO subsampling (see argument nloo) with clustered projection (i.e., 1 < nclusters && nclusters < S or 1 < nclusters_pred && nclusters_pred < S, where S denotes the number of posterior draws in the reference model), this change may lead to slightly different results due to different internal PRNG states. Furthermore, if is.na(seed), then the PRNG state for code downstream of such a cv_varsel() call will be different due to this change. (GitHub: #448)
The internal function .extract_model_data has been removed. As an alternative (with some differences compared to .extract_model_data), the new function y_wobs_offs() is exported.
Fixes/enhancements with respect to observation weights and offsets (GitHub: #449):
weightsnew and offsetnew (see proj_linpred(), proj_predict(), and predict.refmodel()) now cause the original observation weights and offsets to be used if possible (instead of ones and zeros, respectively; the former behavior could even be considered a bug, which is why this is mentioned under "Bug fixes" as well). For brms reference models, this behavior had already been implemented before.
weights or offset is returned by the function supplied to argument extract_model_data of init_refmodel() (before, a vector of ones or zeros was used silently for the observation weights and offsets, respectively).
Added the helper function force_search_terms() which allows constructing search_terms where certain predictor terms are forced to be included (i.e., they are forced to be selected first) whereas other predictor terms are optional (i.e., they are subject to the variable selection, but only after the inclusion of the "forced" terms). (GitHub: #346)
Reduced peak memory usage during performance evaluation (more precisely, during the re-projections done for the performance evaluation). This reduction is considerable especially for multilevel submodels, but possibly also for additive submodels. (GitHub: #440, #450)
A message is now thrown when cutting off the search at nterms_max's internal default of (currently) 19. (GitHub: #452)
Added sub-section "Speed" to the main vignette's "Troubleshooting" section. (GitHub: #455)
In case of K-fold CV, the list passed to argument cvfits of init_refmodel() should not have a sub-list called fits anymore. Instead, the content of this former sub-list called fits should be moved one level up, i.e., should be placed directly in the list passed to cvfits (the empty element fits should then be removed). For some time, the old structure will continue to work, but this possibility is deprecated and will be removed in the future. (GitHub: #456)
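Moving the content of the former sub-list fits one level up can be done with plain list manipulation. The following base R sketch uses placeholder strings instead of real reference model fits; the helper flatten_cvfits() and the attribute example_attr are hypothetical and only illustrate that any attributes of the outer list may need to be carried over.

```r
# Old (deprecated) structure: the fold-wise fits sit in a sub-list called `fits`.
# Placeholder strings stand in for the actual reference model fits.
cvfits_old <- structure(
  list(fits = list("fit_for_fold_1", "fit_for_fold_2")),
  example_attr = "kept" # hypothetical attribute of the outer list
)

# Hypothetical helper (not part of projpred): move `fits` one level up.
flatten_cvfits <- function(cvfits) {
  if (is.null(cvfits$fits)) {
    return(cvfits) # already in the new structure
  }
  attrs <- attributes(cvfits)
  attrs$names <- NULL # the name "fits" must not be carried over
  flat <- cvfits$fits
  for (nm in names(attrs)) {
    attr(flat, nm) <- attrs[[nm]]
  }
  flat
}

cvfits_new <- flatten_cvfits(cvfits_old)
str(cvfits_new)
```

The resulting list contains the fold-wise fits directly, so it no longer has an element called fits.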
In case of K-fold CV, the K reference model fits (i.e., the elements of the return value of the function passed to argument cvfun of init_refmodel() or the elements of the list supplied to argument cvfits of init_refmodel()) do not need to be lists anymore (see the documentation for argument cvrefbuilder of init_refmodel()). (GitHub: #457)
print.vselsummary() based on output from varsel() with refit_prj = FALSE.
cv_varsel() with validate_search = FALSE used to call loo::psis() (for the submodel performance evaluation PSIS-LOO CV) even in case of draws with different (i.e., nonconstant) weights. In such cases, loo::sis() is called now (with a warning). (GitHub: #438)
: syntax) between grouping variables, caused by missing columns in the reference model's data.frame (for brms reference models, this was already done correctly). (GitHub: #445)
weightsnew and offsetnew (see proj_linpred(), proj_predict(), and predict.refmodel()) now cause the original observation weights and offsets to be used if possible (instead of ones and zeros, respectively, which could be considered to have been a bug). For brms reference models, this behavior had already been implemented before. (GitHub: #449)
validate_search = FALSE to fail in case of a single projected draw. (GitHub: #454)
In anticipation of a larger overhaul of the projpred user interface, this release comes with several new functions for accessing and investigating solution paths (which are now termed predictor rankings by these new functions, a term that is hopefully easier to grasp for new users):
ranking() which returns the predictor ranking from the full-data search and possibly also the predictor rankings from fold-wise searches in case of cross-validation (CV). (More precisely, ranking() is a generic. The only method is ranking.vsel(), applicable to objects returned by varsel() or cv_varsel(). The output is of class ranking.)
cv_proportions() which computes ranking proportions (across CV folds, see ?cv_proportions for details) from fold-wise predictor rankings. (More precisely, cv_proportions() is a generic. The main method is cv_proportions.ranking(), but as a shortcut, cv_proportions.vsel() has also been added. The output is of class cv_proportions.)
plot() method called plot.cv_proportions() for plotting ranking proportions from fold-wise predictor rankings. (As a shortcut, plot.ranking() has also been added.)
Because of these new functions, a message has been added to print.vselsummary(), mentioning how to access and investigate the fold-wise predictor rankings (if they exist). Furthermore, due to these changes, element pct_solution_terms_cv of vsel objects has been replaced with element solution_terms_cv which contains the fold-wise predictor rankings instead of the corresponding ranking proportions. However, elements of vsel objects are not meant to be accessed directly, so this replacement should not be a breaking change for most users. Finally, method solution_terms.vsel() (which, until now, was the only possibility to extract the full-data predictor ranking) has now been deprecated and will be removed in a future release. Please use the new function ranking() instead (more precisely, ranking()'s output element fulldata contains the full-data predictor ranking that is also extracted by solution_terms.vsel(); ranking()'s output element foldwise contains the fold-wise predictor rankings, if available, which were previously not accessible via a built-in function). (GitHub: #289, #406, #411)
Added function predictor_terms() which retrieves the predictor terms used in a project() run. Correspondingly, method solution_terms.projection() has now been deprecated and will be removed in a future release. Please use predictor_terms() instead. (GitHub: #411)
seed (and .seed) arguments now have a default of NA instead of sample.int(.Machine$integer.max, 1) and the pseudorandom number generator (PRNG) state is reset only if the user-supplied seed is not NA. This allows setting a seed once at the beginning of any projpred-related code and then leaving all seed (and .seed) arguments at their default. Previously, such practice could lead to results which were "less random" than they should have been because the former default of sample.int(.Machine$integer.max, 1) caused projpred functions with a seed (or .seed) argument to reset the PRNG state upon exit, meaning that two repeated calls to cv_varsel() (for example) with no PRNG-using code between them would use the same seed internally. (GitHub: #412)
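The effect of the former seed default can be reproduced in base R without projpred. The sketch below is only an illustration under stated assumptions: draw_old() and draw_new() are hypothetical stand-ins for a projpred function with a seed argument, and current_state() is a hypothetical helper; the actual internals of projpred may differ.

```r
# Hypothetical helper: return the current PRNG state, creating it if needed.
current_state <- function() {
  if (!exists(".Random.seed", envir = .GlobalEnv)) set.seed(NULL)
  get(".Random.seed", envir = .GlobalEnv)
}

# Sketch of the former behavior: the default seed is drawn from the current
# PRNG state, and the PRNG state is restored upon exit.
draw_old <- function(seed = sample.int(.Machine$integer.max, 1)) {
  old_state <- current_state()
  on.exit(assign(".Random.seed", old_state, envir = .GlobalEnv))
  set.seed(seed) # the default for `seed` is evaluated here (lazily)
  runif(1)
}

# Sketch of the new behavior: the PRNG state is only touched if the
# user-supplied seed is not NA.
draw_new <- function(seed = NA) {
  if (!is.na(seed)) {
    old_state <- current_state()
    on.exit(assign(".Random.seed", old_state, envir = .GlobalEnv))
    set.seed(seed)
  }
  runif(1)
}

set.seed(123)
old_1 <- draw_old()
old_2 <- draw_old() # same value as old_1: the same seed is used internally

set.seed(123)
new_1 <- draw_new()
new_2 <- draw_new() # differs from new_1: the PRNG state has advanced
```

With the NA default, a single set.seed() call at the top of a script is enough to make the whole script reproducible while successive calls still use different random numbers.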
Added the main diagonal of the matrix returned by cv_proportions() to a new column called cv_proportions_diag of the summary table computed by summary.vsel(). The purpose of this new column is to give a basic sense of the (CV) variability in the ranking of the predictors. Argument cumulate of cv_proportions() has been added to summary.vsel() as well (to allow the ranking proportions in the newly added column to be cumulated ranking proportions, if desired). (GitHub: #289, #413)
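To make the notion of ranking proportions and their main diagonal more concrete, here is a base R sketch that computes them by hand from made-up fold-wise rankings. It does not call projpred; the objects foldwise, fulldata, props, and props_cumul are hypothetical, and the layout (rows for submodel sizes, columns for predictors) as well as the reading of cumulate as "ranked at this size or earlier" are assumptions of this sketch.

```r
# Made-up fold-wise predictor rankings: one row per CV fold, one column per
# submodel size.
foldwise <- rbind(
  c("x1", "x2", "x3"),
  c("x1", "x3", "x2"),
  c("x2", "x1", "x3"),
  c("x1", "x2", "x3")
)
# Made-up full-data predictor ranking
fulldata <- c("x1", "x2", "x3")

# Proportion of folds in which each predictor is ranked at each size
# (rows: submodel sizes, columns: predictors in full-data ranking order).
props <- sapply(fulldata, function(term) colMeans(foldwise == term))
rownames(props) <- seq_along(fulldata)

# Cumulated version: proportion of folds in which each predictor is ranked
# at the given size or earlier.
props_cumul <- apply(props, 2, cumsum)

# Main diagonals: how often the fold-wise rankings agree with the full-data
# ranking at each size (non-cumulated and cumulated).
diag(props)
diag(props_cumul)
```

Values close to 1 on the diagonal indicate that the fold-wise searches mostly agree with the full-data ranking at that size.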
Added the full-data predictor ranking and the main diagonal of the matrix returned by cv_proportions() to the plot created by plot.vsel(). These new elements can be omitted by setting plot.vsel()'s new argument ranking_nterms_max to NA (setting it to some specific submodel size causes the full-data predictor ranking and the corresponding ranking proportions to be omitted after that size). Argument cumulate of cv_proportions() has been added to plot.vsel() as well (to allow the ranking proportions to be cumulated ranking proportions, if desired). Other new arguments are ranking_abbreviate (together with ranking_abbreviate_args), ranking_repel (together with ranking_repel_args), and text_angle (see the plot.vsel() documentation for details). (GitHub: #289, #414, #416, #417)
ranking(), cv_proportions(), and plot.cv_proportions() (see "Major changes" above) are now illustrated in the main vignette. (GitHub: #407, #411)
cv_varsel() with cv_method = "kfold". This may slightly change results from such a cv_varsel() run compared to older projpred versions due to different pseudorandom number generator (PRNG) states when clustering posterior draws. (GitHub: #419)
cvfits list (see init_refmodel()) does not need to have an attribute called K anymore.
I() terms. (GitHub: #404, #408)
poly() or polym() terms. Note that just like step() and MASS::stepAIC(), projpred's search algorithms do not split up a poly() or polym() term into its lower-degree polynomial terms (which would be helpful, for example, if the linear part of a poly() term with degree = 2 was relevant but the quadratic part not). Such a split-up of a poly() or polym() term needs to be performed manually (if desired). (GitHub: #183, #409)
seed (or .seed) argument to use the same seed internally when users set a seed once at the beginning (via set.seed()) and then had two or more calls to such projpred functions with their seed (or .seed) argument being at its default and no PRNG-using code between those calls. (GitHub: #412)
Setting the new global option projpred.extra_verbose to TRUE will print out which submodel projpred is currently projecting onto. Furthermore, if method = "forward" and verbose = TRUE in varsel() or cv_varsel(), this new option will also make projpred print out which submodel has been selected at those steps of the forward search for which a percentage is printed (the percentage refers to the maximum submodel size that the search is run up to). In general, however, we cannot recommend setting this new global option to TRUE for cv_varsel() with validate_search = TRUE (simply due to the amount of information that will be printed, but also due to the progress bar which will not work anymore as intended). (GitHub: #363; thanks to @jtimonen)
Enhanced verbose output. In particular, varsel() is now more verbose, similarly to how cv_varsel() has already been for a long time. The verbose output for cv_varsel() has also been updated, with the aim to give users a better understanding of the methodology behind projpred. (GitHub: #382)
Slightly improved the calculation of predictive variances to make them less prone to numerical inaccuracies. (GitHub: #199)
Improved computational efficiency by avoiding an unnecessary final full-data performance evaluation (including costly re-projections if refit_prj = TRUE, which is the default for non-datafit reference models) in cv_varsel() with validate_search = TRUE. Due to this change, results from cv_varsel() (with validate_search = TRUE) may slightly change due to a different pseudorandom number generator (PRNG) state when clustering posterior draws. The different PRNG state was necessary to make the PRNG state for the full-data search in the validate_search = TRUE case consistent with the PRNG state for the full-data search in the validate_search = FALSE case. (GitHub: #385)
Reduced dependencies. (GitHub: #388)
Argument digits of print.vselsummary() which used to be passed to an internal round() call was removed. Instead, digits can now be passed to print.data.frame() via ..., thereby determining the minimum number of significant digits to be printed. (GitHub: #389)
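The difference between rounding to a fixed number of decimal places and requesting a minimum number of significant digits can be seen with an ordinary data.frame. The following base R sketch uses a made-up table called perf (hypothetical values, not projpred output) and only relies on round() and the digits argument of print.data.frame().

```r
# Made-up performance table (values are for illustration only)
perf <- data.frame(
  size = 0:2,
  elpd = c(-123.456789, -98.7654321, -97.001234),
  se = c(0.0123456, 0.0234567, 0.0345678)
)

# Rounding to a fixed number of decimal places before printing (as an
# internal round() call would do); small values lose most of their precision.
print(round(perf, digits = 1))

# Passing `digits` to print.data.frame() instead requests a minimum number of
# significant digits for the printed columns.
print(perf, digits = 3)
```

Because digits in print.data.frame() refers to significant digits, small quantities such as standard errors remain readable without printing excessive decimals for large quantities.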
Although bad practice (in general), a reference model lacking an intercept can now be used within projpred. However, it will always be projected onto submodels which include an intercept. The reason is that even if the true intercept in the reference model is zero, this does not need to hold for the submodels. An informational message mentioning the projection onto intercept-including submodels is thrown when projpred encounters a reference model lacking an intercept. (GitHub: #96, #391)
In case of non-predictor arguments of s() or t2(), projpred now throws an error. (This had already been documented before, but a suitable error message was missing.) (GitHub: #393, based on #156 and #269)
In case of the brms::categorical() family (supported since version 2.4.0), projpred now strips underscores from response category names in as.matrix.projection() output, as done by brms. (GitHub: #394)
L1 search now throws a warning if an interaction term is selected before all involved main-effect terms have been selected. (GitHub: #395)
Documented that in multilevel (group-level) terms, function calls on the right-hand side of the | character (e.g., (1 | gr(group_variable)), which is possible in brms) are currently not allowed in projpred. A corresponding error message has also been added. (GitHub: #319)
Due to internal refactoring:
project()'s output elements submodl and weights have been renamed to outdmin and wdraws_prj, respectively.
varsel()'s and cv_varsel()'s output element d_test has been replaced with new output elements type_test and y_wobs_test.
Apart from project()'s output element wdraws_prj, these elements are not meant to be accessed manually, so changes are mentioned here only for the sake of completeness. Output element wdraws_prj of project() is only needed if project() was used for a clustered projection, which is not the default (and discouraged in most applied cases, at least with a small number of clusters). Thus, these renamings are breaking changes only in very rare cases.
print.vselsummary() now also prints K in case of K-fold CV.
The print.vselsummary() output has been slightly improved, e.g., by adding a remark on what "search included" or "search not included" means.
print.vselsummary() now also prints whether deltas = TRUE or deltas = FALSE was used.
Output element pct_solution_terms_cv has now also been added to vsel objects returned by varsel(), but in that case, it is simply NULL. This (pct_solution_terms_cv being NULL) is now also the case if validate_search = FALSE was used in cv_varsel().
Minor enhancements in the documentation.
Enhancements in the vignettes. In particular, section "Troubleshooting" of the main vignette has been revised.
If proj_predict() is used with observation weights that are not all equal to 1, a warning is now thrown. (GitHub: starts to address #402)
predict.refmodel()
to require newdata
to contain the response variable in case of a brms reference model. This is similar to paul-buerkner/brms#1457, but concerns predict.refmodel()
(paul-buerkner/brms#1457 referred to predictions from the submodels). In order to make this predict.refmodel()
fix work, brms version 2.19.0 or later is needed. (GitHub: #381)p_type
of project()
to be incorrect in case of refit_prj = FALSE
, !is.null(nclusters)
, and an object
of class vsel
that was created with a non-clustered (thinned) projection during the search phase. The fix comes with a slightly different behavior of proj_predict()
for datafit
s: It will not draw nresample_clusters
times from the posterior-projection predictive distribution (which is based on the same single projected draw), but only once. (GitHub: #211, #401)refit_prj = FALSE
after an L1 search), a new dataset containing a character
predictor variable with only a single unique value (or a new dataset containing a factor
predictor variable with a single level) used to cause an error. The case of a character
(not factor
) predictor variable with only a single unique value occurred, e.g., during the performance evaluation in a LOO CV if a character
predictor got selected into a fold's solution path. The character
issue existed from version 2.1.0 on (in earlier versions, however, there were other issues which caused character
predictors to throw an error). Now, all issues with respect to character
predictor variables should be resolved. The issue with single-level factor
predictor variables is resolved now as well. (GitHub: #403)refit_prj = FALSE
after an L1 search), a new dataset containing a factor
predictor with re-ordered levels (compared to this same factor
in the original dataset) used to lead to incorrect predictions. This bug existed at least from version 2.0.2 on (possibly even in earlier versions), but has been resolved now. (GitHub: #403)factor
. This issue existed at least from version 2.0.2 on (possibly even in earlier versions), but should have only affected rstanarm reference model fits (brms reference model fits were only affected in case of a brms::brm()
call with drop_unused_levels = FALSE
, which is not the default). (GitHub: #403)refit_prj = FALSE
(which is the default only for datafit
s, not for the reference model objects of class refmodel
that are usually employed in practice) to lead to incorrect predictions from the L1-searched submodels (which are L1-penalized GLMs) if the solution path had a main effect ranked after an interaction term. This bug existed at least from version 2.0.2 on (possibly even in earlier versions). The mentioned submodel predictions did not only affect the performance evaluation, but also the projected dispersion parameter and the returned Kullback-Leibler divergence (and the corresponding cross-entropy). (GitHub: #403)resp_oscale = TRUE
default in summary.vsel()
) is that varsel()
and cv_varsel()
no longer call suggest_size()
internally at the end. Thus, print()
-ing an object of class vsel
no longer includes the suggested projection size in the output (the stat
for this suggested size was fixed to "elpd"
anyway, a fact that many users were probably not aware of). (GitHub: #372)projpred.mlvl_pred_new
and projpred.mlvl_proj_ref_new
. These are explained in detail in the general package documentation (available online or by typing ?`projpred-package`
). (GitHub: #379)family
(see init_refmodel()
) has a non-identity link function: After clustering the reference model's posterior draws, the fitted values to aggregate within a given cluster must be those that already take the offsets into account, rather than aggregating offset-free fitted values and taking the offsets into account only afterwards. This fix should affect results only very slightly. Due to projpred's internal adjustment for numerical stability when averaging a quantity across the draws within a given cluster, this also changes the projected residual standard deviations in Gaussian models on the order of 1e-10
. (GitHub: #374)plot.vsel()
and summary.vsel()
, the default of alpha = 0.32
is replaced by alpha = 2 * pnorm(-1)
(= 1 - diff(pnorm(c(-1, 1)))
, which is only approximately 0.32) so that now, a normal-approximation confidence interval with default alpha
extends exactly one standard error on either side of the point estimate. Typically, this changes results only slightly. In some cases, however, the new default may lead to a different suggested size, which is why this is regarded as a major change. (GitHub: #371)ggplot2::aes_string()
is not used anymore, thereby avoiding an occasional soft-deprecation warning thrown by ggplot2 3.4.0. (GitHub: #367)ce
of project()
. The reason for this change is that the former KL divergence assumed the reference model's family to be the same as the submodel's family, which need not be the case for custom reference models. This should not be a user-facing change as users are discouraged from making use of specific output elements (like the former element kl
of objects of class projection
or vsel
) directly. (GitHub: #369)family
of init_refmodel()
and get_refmodel.default()
).
get_refmodel()
and init_refmodel()
(thereby also distinguishing more clearly between "typical" and "custom" reference model objects) in (i) the description and several arguments of get_refmodel()
and init_refmodel()
, (ii) sections "Reference model" and "Supported types of models" of the vignette. (GitHub: #357, #359, #364, #365, #366)validate_search = FALSE
case of cv_varsel()
.
search_terms
(at least in some instances), also affecting the output of solution_terms(<vsel_object>)
in those cases. (GitHub: #360; thanks to @sor16)validate_search = FALSE
case of cv_varsel()
. This bug was introduced in v2.2.0 (and existed up to and including v2.2.1).
cv_varsel()
with cv_method = "LOO"
(more precisely, only the LOO posterior predictive expected values <vsel_object>$summaries$ref$mu
were affected, not the (pointwise) LOO log posterior predictive density values <vsel_object>$summaries$ref$lppd
). (GitHub: #186 (partly), #356)cv_varsel()
with custom search_terms
(in some instances). (GitHub: #345, #360; thanks to @sor16)stats
of summary.vsel()
), the bootstrapping results are now also used for inferring the lower and upper confidence interval bounds. (GitHub: #318, #347; thanks to @awd97 and @VisionResearchBlog)datafit
s, offsets are not supported anymore. (GitHub: #186 (partly), #351)datafit
s (and other, unlikely cases where nclusters == S
and S <= 20
, with S
denoting the number of draws in the reference model).
datafit
s). (GitHub: #350)validate_search = FALSE
case of cv_varsel()
(with cv_method = "LOO"
), the PSIS weights are now calculated based on the reference model (they used to be calculated based on the submodels, which is incorrect). (GitHub: #325)"mse"
, "rmse"
, "acc"
(= "pctcorr"
), and "auc"
(i.e., all performance statistics except for "elpd"
and "mlpd"
).
plot.vsel()
and suggest_size()
gain a new argument thres_elpd
. By default, this argument doesn't have any impact, but a non-NA
value can be used for a customized model size selection rule (see ?suggest_size
for details). (GitHub: #335)suggest_size()
heuristic).
seed
and .seed
are now allowed to be NA
for not calling set.seed()
internally at all.
d_test
of varsel()
is no longer considered an internal feature. This became possible after fixing a bug for d_test
(see below). (GitHub: #341)<vsel_object>$summaries
and <vsel_object>$d_test
now corresponds to the order of the observations in the original dataset if <vsel_object>
was created by a call to cv_varsel(<...>, cv_method = "kfold")
(formerly, in that case, the observations in those sub-elements were ordered by fold). Thereby, the order of the observations in those sub-elements now always corresponds to the order of the observations in the original dataset, except if <vsel_object>
was created by a call to varsel(<...>, d_test = <non-NULL_d_test_object>)
, in which case the order of the observations in those sub-elements corresponds to the order of the observations in <non-NULL_d_test_object>
. (GitHub: #341)search_terms
caused the R session to crash).
validate_search = FALSE
bug described above in "Major changes": The PSIS weights are now calculated based on the reference model (they used to be calculated based on the submodels, which is incorrect). (GitHub: #325)\mbox{}
commands displayed incorrectly in the HTML help from R version 4.2.0 on. (GitHub: #326)plot.vsel()
now draws the dashed red horizontal line for the reference model (and, if present, the dotted black horizontal line for the baseline model) first (i.e., before the submodel-specific graphical elements), to avoid overplotting.
d_test
of varsel()
: Not only the predictive performance of the reference model needs to be evaluated on the test data, but also the predictive performance of the submodels. (GitHub: #341)cv_varsel()
with LOO CV and validate_search = FALSE
instead of K-fold CV. (GitHub: #305)search_terms
of varsel()
and cv_varsel()
. (GitHub: #155, #308)NULL
) search_terms
, method = NULL
is internally changed to method = "forward"
and method = "L1"
throws a warning. This is done because search_terms
only takes effect in case of a forward search. (GitHub: #155, #308)search_terms
. This is necessary to prevent a bug described below. (GitHub: #308)PIRLS loop resulted in NaN value
errors automatically. (GitHub: #314)b
of projpred:::bootstrap()
to B
.
search_terms
vector which excluded the intercept in conjunction with refit_prj = FALSE
(the latter in project()
, varsel()
, or cv_varsel()
) led to incorrect submodels being fetched from the search or to an error while doing so. This has been fixed now by internally forcing the inclusion of the intercept in search_terms
. (GitHub: #308)solution_terms
of project()
to fix a test failure in R versions >= 4.2.
cv_varsel()
with nloo < n
where n
denotes the number of observations. (GitHub: #94, #252, commit feea39e)validate_search = FALSE
in cv_varsel()
.
nclusters
(= 1
) and nclusters_pred
(= 5
) of varsel()
and cv_varsel()
were set internally (the user-visible defaults were NULL
). Now, nclusters
and ndraws_pred
(note the ndraws_pred
, not nclusters_pred
) have non-NULL
user-visible defaults of 20
and 400
, respectively. In general, this increases the runtime of these functions a lot. With respect to cv_varsel()
, the new vignette (see vignettes) mentions two ways to quickly obtain rough preliminary results (which, in general, should not be used as final results): (i) varsel()
and (ii) cv_varsel()
with validate_search = FALSE
(which only takes effect for cv_method = "LOO"
). (GitHub: #291 and several commits beforehand, in particular bbd0f0a, babe031, 4ef95d3, and ce7d1e0)proj_linpred()
and proj_predict()
, arguments nterms
, ndraws
, and seed
have been removed to allow the user to pass them to project()
. New arguments filter_nterms
, nresample_clusters
, and .seed
have been introduced (see the documentation for details). (GitHub: #92, #135)proj_linpred()
, dimensions are not dropped anymore (i.e., output elements pred
and lpd
are always S x N matrices now). (GitHub: #143)integrated = TRUE
, proj_linpred()
now averages the LPD (across the projected posterior draws) instead of taking the LPD at the averaged linear predictors. (GitHub: #143)newdata
does not contain the response variable, proj_linpred()
now returns NULL
for output element lpd
. (GitHub: #143)stanreg
(from package rstanarm) with offsets to have these offsets specified via an offset()
term in the model formula (and not via argument offset
).
NULL
to a user-visible value (and NULL
is not allowed anymore).
data
of get_refmodel.stanreg()
has been removed. (GitHub: #219)div_minimizer
of init_refmodel()
now always needs to return a list
of submodels (see the documentation for details). Correspondingly, the function passed to argument proj_predfun
of init_refmodel()
can now always expect a list
as input for argument fits
(see the documentation for details). (GitHub: #230)proj_predfun
of init_refmodel()
now always needs to return a matrix (see the documentation for details). (GitHub: #230)?`projpred-package`
. (GitHub: #235)Student_t()
family is regarded as experimental. Therefore, a corresponding warning is thrown when creating the reference model. (GitHub: #233, #252)Gamma()
family is regarded as experimental. Therefore, a corresponding warning is thrown when creating the reference model. (GitHub: paul-buerkner/brms#1255, #240, #252)init_refmodel()
in case of argument dis
being NULL
(the default) was dangerous for custom reference models with a family
having a dispersion parameter (in that case, dis
values of all-zeros were used silently). The new behavior now requires a non-NULL
argument dis
in that case. (GitHub: #254)cv_search
has been renamed to refit_prj
. (GitHub: #154, #265)as.matrix.projection()
has gained a new argument nm_scheme
which allows choosing the naming scheme for the column names of the returned matrix. The default ("auto"
) follows the naming scheme of the reference model fit (and uses the "rstanarm"
naming scheme if the reference model fit is of an unknown class). (GitHub: #82, #279)seed
(and .seed
) arguments now have a default of sample.int(.Machine$integer.max, 1)
instead of NULL
. Furthermore, the value supplied to these arguments is now used to generate new seeds internally on the fly. In many cases, this will change results compared to older projpred versions. Also note that now, the internal seeds are never fixed to a specific value if seed
(and .seed
) arguments are set to NULL
. (GitHub: #84, #286)as.matrix.projection()
method now also returns the estimated group-level effects themselves. (GitHub: #75)as.matrix.projection()
method now returns the variance components (population SD(s) and population correlation(s)) instead of the empirical SD(s) of the group-level effects. (GitHub: #74)README
file. (GitHub: #245)nclusters_pred
was removed. (GitHub: commit 5062f2f)project()
: Warn if elements of solution_terms
are not found in the reference model (and therefore ignored). (GitHub: #140)get_refmodel.default()
now passes arguments via the ellipsis (...
) to init_refmodel()
. (GitHub: #153, commit dd3716e)init_refmodel()
: The default (NULL
) for argument extract_model_data
has been removed as it wasn't meaningful anyway. (GitHub: #219)folds
of init_refmodel()
has been removed as it was effectively unused. (GitHub: #220)solution_terms()
. This allowed the introduction of a solution_terms.projection()
method. (GitHub: #223)predict.refmodel()
now uses a default of newdata = NULL
. (GitHub: #223)weights
of init_refmodel()
's argument proj_predfun
has been removed. (GitHub: #163, #224)div_minimizer
functions have been unified into a single div_minimizer
which chooses an appropriate submodel fitter based on the formula of the submodel, not based on that of the reference model. Furthermore, the automatic handling of errors in the submodel fitters has been improved. (GitHub: #230)plot.vsel()
. (GitHub: #234, #270)cvfun
for stanreg
fits will now always use inner parallelization in rstanarm::kfold.stanreg()
(i.e., across chains, not across CV folds), with getOption("mc.cores", 1)
cores. We do so on all systems (not only Windows). (GitHub: #249)fit
of init_refmodel()
's argument proj_predfun
was renamed to fits
. This is a non-breaking change since all calls to proj_predfun
in projpred have that argument unnamed. However, this cannot be guaranteed in the future, so we strongly encourage users with a custom proj_predfun
to rename argument fit
to fits
. (GitHub: #263)init_refmodel()
has gained argument cvrefbuilder
which may be a custom function for constructing the K reference models in a K-fold CV. (GitHub: #271)project()
, varsel()
, and cv_varsel()
to the divergence minimizer. (GitHub: #278)init_refmodel()
, any contrasts
attributes of the dataset's columns are silently removed. (GitHub: #284)NA
s in data supplied to newdata
arguments now trigger an error. (GitHub: #285)as.matrix.projection()
(causing incorrect column names for the returned matrix). (GitHub: #72, #73)vsel
object. (GitHub: #79, #80)varsel()
. (GitHub: #90)nloo
of cv_varsel()
. (GitHub: #93)cv_varsel()
, causing an error in case of !validate_search && cv_method != "LOO"
. (GitHub: #95)proj_linpred()
to raise an error if argument newdata
was NULL
. (GitHub: #97)lpd
in proj_linpred()
(for integrated = TRUE
as well as for integrated = FALSE
). (GitHub: #105)proj_linpred()
's calculation of output element lpd
(for integrated = TRUE
). (GitHub: #106, #112)proj_linpred()
's output elements pred
and lpd
(for integrated = FALSE
): Now, they are both S x N matrices, with S denoting the number of (possibly clustered) posterior draws and N denoting the number of observations. (GitHub: #107, #112)proj_predict()
's output matrix to be transposed in case of nrow(newdata) == 1
. (GitHub: #112)proj_linpred()
. (GitHub: #114)varsel()
/make_formula
to fail with multidimensional interaction terms. (GitHub: #102, #103)cv_varsel()
for models with a single predictor. (GitHub: #115)nterms
of proj_linpred()
and proj_predict()
. (GitHub: #110)as.matrix.projection()
in case of 1 (clustered) draw after projection. (GitHub: #130)subfit
, make the column names of as.matrix.projection()
's output matrix consistent with other classes of submodels. (GitHub: #132)nterms_max
of plot.vsel()
if there is just the intercept-only submodel. (GitHub: #138)search_path
in, e.g., varsel()
's output. (GitHub: #140)unused argument
) when initializing the K reference models in a K-fold CV with CV fits not of class brmsfit
or stanreg
. (GitHub: #140)get_refmodel.default()
, remove old defunct arguments fetch_data
, wobs
, and offset
. (GitHub: #140)get_refmodel.stanreg()
. (GitHub: #142, #184)extract_model_data()
's argument extract_y
in get_refmodel.default()
. (GitHub: #153, commit 39fece8)extract_model_data()
in K-fold CV. (GitHub: #153, commit 4f32195)proj_predfun()
for GLMMs. (GitHub: #174)proj_predfun()
for datafit
s. (GitHub: #177)summary.vsel()$selection
for objects of class vsel
created by varsel()
. (GitHub: #179)search_terms
are not consecutive in size. (GitHub: commit 34e24de)cv_varsel()$pct_solution_terms_cv
. (GitHub: #188, commit e529ec1)glm_elnet()
(the workhorse for L1 search), causing the grid for lambda to be constructed without taking observation weights into account. (GitHub: #198; note that the second part of #198 did not have any consequences for users)print.vsel()
causing argument digits
to be ignored. (GitHub: #222)cv_search
in varsel()
and cv_varsel()
to be TRUE
for datafit
s, although it should be FALSE
in that case. (GitHub: #223)Error: Levels '<...>' of grouping factor '<...>' cannot be found in the fitted model. Consider setting argument 'allow_new_levels' to TRUE.
) when predicting from submodels which are GLMMs for newdata
containing new levels for grouping factors. (GitHub: #223)predict.refmodel()
: Fix a bug for integer ynew
. (GitHub: #223)predict.refmodel()
: Fix input checks for offsetnew
and weightsnew
. (GitHub: #223)extract_model_data()
, the weights and offsets are now checked for being of length 0 (and if so, they are set to vectors of ones and zeros, respectively). This is important for extract_model_data()
functions which return weights and offsets of length 0 (see, e.g., brms
version <= 2.16.1). (GitHub: #223)var
(the predictive variances) and regul
(amount of ridge regularization) to the internal submodel fitter for GLMs. (GitHub: #230)NA
s, an appropriate error is now thrown. Previously, the reference model was created successfully, but this caused opaque errors in downstream code such as project()
. (GitHub: #274)
We have fully rewritten the internals in several ways. Most importantly, we now delegate maximum likelihood estimation to third parties, depending on the reference model's family. This allows a lot of flexibility and extensibility for various models. Functionality-wise, the major updates since the last release are:
search_terms
that allows the user to specify custom unit building blocks of the projections. A new vignette is coming up.
Better validation of function arguments.
Added print methods for vsel and cvsel objects. Added AUC statistics for the binomial family. A few additional minor patches.
Removed the dependency on the rngtools package.
This version contains only a few patches, no new features for the user.
stan_glm(log(y) ~ log(x), ...)
, that is, it did not allow a transformation of y
.
refmodel
-objects using the generic get_refmodel
-function, and all the functions use only this object. This makes it much easier to use projpred with other reference models by writing a new get_refmodel
-function for them. The syntax has been changed so that varsel
and cv_varsel
both return an object that always has a similar structure, and the reference model is stored in this object.
plot/summary
. It is now also possible to compare to the best submodel found, not only to the reference model.
nloo = n
by default in cv_varsel
. regul=1e-4
now by default in all functions.
cv_search
argument for the main functions (varsel
, cv_varsel
, project
and the prediction functions). It is now also possible to make predictions with the parameter estimates that were computed during the L1-penalized search. This change also allows the user to compute the Lasso solution by providing the observed data as the 'reference fit' for init_refmodel. An example will be added to the vignette.
Until this version, we did not keep a record of the changes between versions. We started doing so from version 0.9.0 onwards.