Package 'sparseR'

Title: Variable Selection under Ranked Sparsity Principles for Interactions and Polynomials
Description: An implementation of ranked sparsity methods, including penalized regression methods such as the sparsity-ranked lasso, its non-convex alternatives, and elastic net, as well as the sparsity-ranked Bayesian Information Criterion. As described in Peterson and Cavanaugh (2022) <doi:10.1007/s10182-021-00431-7>, ranked sparsity is a philosophy with methods primarily useful for variable selection in the presence of prior informational asymmetry, which occurs in the context of trying to perform variable selection in the presence of interactions and/or polynomials. Ultimately, this package attempts to facilitate dealing with cumbersome interactions and polynomials while not avoiding them entirely. Typically, models selected under ranked sparsity principles will also be more transparent, having fewer falsely selected interactions and polynomials than other methods.
Authors: Ryan Andrew Peterson [aut, cre]
Maintainer: Ryan Andrew Peterson <[email protected]>
License: GPL-3
Version: 0.3.1
Built: 2024-09-16 06:23:16 UTC
Source: CRAN

Help Index


sparseR: Implement ranked sparsity for selecting interactions and polynomials

Description

The sparseR package implements various techniques for selecting from a set of interaction and polynomial terms under ranked sparsity. Additional tools for data pre-processing, post-selection inference, and visualization are also included.

Author(s)

Maintainer: Ryan Andrew Peterson [email protected] (ORCID)



Data sets

Description

Detrano data sets (cleveland, hungarian, switzerland, va); the Iowa Radon Lung Cancer Study (irlcs_radon_syn): data simulated to resemble the IRLCS study; Shedden survival data (Z: clinical covariates, S: survival outcome)

Usage

cleveland

hungarian

switzerland

va

irlcs_radon_syn

Z

S

Format

An object of class data.frame with 303 rows and 14 columns.

An object of class data.frame with 294 rows and 14 columns.

An object of class data.frame with 123 rows and 14 columns.

An object of class data.frame with 200 rows and 14 columns.

An object of class data.frame with 1027 rows and 16 columns.

An object of class data.frame with 442 rows and 6 columns.

An object of class Surv with 442 rows and 2 columns.

Source

Detrano data

https://archive.ics.uci.edu/ml/datasets/heart+disease

IRLCS data sets

https://cheec.uiowa.edu/research/residential-radon-and-lung-cancer-case-control-study

Shedden

https://www.gsea-msigdb.org/gsea/msigdb/cards/SHEDDEN_LUNG_CANCER_POOR_SURVIVAL_A6

References

Detrano

Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304–310.

IRLCS

Field, R., Smith, B., Steck, D., et al. (2002). Residential radon exposure and lung cancer: Variation in risk estimates using alternative exposure scenarios. J Expo Sci Environ Epidemiol, 12, 197–203. https://www.nature.com/articles/7500215

Shedden

Director's Challenge Consortium for the Molecular Classification of Lung Adenocarcinoma, Shedden, K., Taylor, J. M., Enkemann, S. A., Tsao, M. S., Yeatman, T. J., Gerald, W. L., Eschrich, S., Jurisica, I., Giordano, T. J., Misek, D. E., Chang, A. C., Zhu, C. Q., Strumpf, D., Hanash, S., Shepherd, F. A., Ding, K., Seymour, L., Naoki, K., Pennell, N., … Beer, D. G. (2008). Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature medicine, 14(8), 822–827. https://www.nature.com/articles/nm.1790
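
Examples

A brief sketch (not from the package documentation): the data sets are assumed to be lazy-loaded with the package, so they can be referenced directly once sparseR is attached.

library(sparseR)

dim(cleveland)          # 303 x 14
head(irlcs_radon_syn)
dim(Z)                  # clinical covariates for the Shedden data
S[1:5]                  # the corresponding survival outcome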


Custom IC functions for stepwise models

Description

Custom IC functions for stepwise models

Usage

EBIC(fit, ...)

## Default S3 method:
EBIC(fit, varnames, pen_info, gammafn = NULL, return_df = TRUE, ...)

RBIC(fit, ...)

## Default S3 method:
RBIC(fit, varnames, pen_info, gammafn = NULL, return_df = TRUE, ...)

RAIC(fit, ...)

## Default S3 method:
RAIC(fit, varnames, pen_info, gammafn = NULL, return_df = TRUE, ...)

Arguments

...

additional arguments

fit

a fitted object

varnames

names of variables

pen_info

penalty information

gammafn

what to use for gamma in the information criterion's formula

return_df

should the degrees of freedom be returned?

Value

A vector of values for the requested criterion, with the degrees of freedom appended to the front of the vector if return_df == TRUE.
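
Examples

A hedged sketch (not from the package documentation): these criteria are most easily used indirectly, through the ic argument of sparseRBIC_step. Calling them directly requires a fitted model, its term names, and penalty information such as that produced by get_penalties().

library(sparseR)

# Stepwise selection guided by RBIC; the iris data is used only for illustration
srbic <- sparseRBIC_step(Sepal.Length ~ ., data = iris, k = 1,
                         ic = "RBIC", message = FALSE)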


Plot relevant effects of a sparseR object

Description

Plot relevant effects of a sparseR object

Usage

effect_plot(fit, ...)

## S3 method for class 'sparseR'
effect_plot(
  fit,
  coef_name,
  at = c("cvmin", "cv1se"),
  by = NULL,
  by_levels,
  nn = 101,
  plot.args = list(),
  resids = TRUE,
  legend_location = "bottomright",
  ...
)

## S3 method for class 'sparseRBIC'
effect_plot(
  fit,
  coef_name,
  by = NULL,
  by_levels,
  nn = 101,
  plot.args = list(),
  resids = TRUE,
  legend_location = "bottomright",
  ...
)

Arguments

fit

a 'sparseR' object

...

additional arguments

coef_name

The name of the coefficient to plot along the x-axis

at

which "smart" choice of lambda to use: "cvmin" or "cv1se"

by

the variable(s) involved in the (possible) interaction

by_levels

values at which to cut a continuous 'by' variable (defaults to three quantiles)

nn

number of points to plot along prediction line

plot.args

list of arguments passed to the plot itself

resids

should residuals be plotted or not?

legend_location

location for legend passed to 'legend'

Value

Nothing is returned (invisibly).
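
Examples

A hedged sketch (not from the package documentation), using the iris data only to illustrate the call pattern; the choice of coefficient and 'by' variable is arbitrary.

library(sparseR)

fit <- sparseR(Sepal.Length ~ ., data = iris, k = 1, poly = 2)

effect_plot(fit, coef_name = "Sepal.Width")
effect_plot(fit, coef_name = "Sepal.Width", by = "Petal.Length")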


Helper function to help set up penalties

Description

Helper function to help set up penalties

Usage

get_penalties(
  varnames,
  poly,
  poly_prefix = "poly_",
  int_sep = "\\:",
  pool = FALSE,
  gamma = 0.5,
  cumulative_k = FALSE,
  cumulative_poly = TRUE
)

Arguments

varnames

names of the covariates in the model matrix

poly

max polynomial considered

poly_prefix

what comes before the polynomial specification in these varnames?

int_sep

What denotes the multiplication for interactions?

pool

Should polynomials and interactions be pooled?

gamma

How much should the penalty increase with group size (0.5 assumes equal contribution of prior information)

cumulative_k

Should penalties be increased cumulatively as the interaction order increases? (only used if !pool)

cumulative_poly

Should penalties be increased cumulatively as the polynomial order increases? (only used if !pool)

Details

This is primarily a helper function for sparseR, but it may be useful if doing the model matrix set up by hand.

Value

a list of relevant information for the variables, including:

penalties

the numeric value of the penalties

vartype

Variable type (main effect, order k interaction, etc)

varname

names of variables
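
Examples

A hedged sketch (not taken from the package documentation): deriving ranked-sparsity penalty multipliers for a hand-built set of term names. The term names below are hypothetical and assume sparseR's default naming conventions ("_poly_" preceding the polynomial degree and ":" separating interaction components).

library(sparseR)

vn <- c("x1", "x2", "x3",                # main effects
        "x1:x2", "x1:x3", "x2:x3",       # pairwise interactions
        "x1_poly_2", "x2_poly_2")        # quadratic terms

pens <- get_penalties(vn, poly = 2, poly_prefix = "_poly_")
pens$penalties   # numeric penalty multipliers for each term
pens$vartype     # main effect / interaction / polynomial labels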


Plot relevant properties of sparseR objects

Description

Plot relevant properties of sparseR objects

Usage

## S3 method for class 'sparseR'
plot(x, plot_type = c("both", "cv", "path"), cols = NULL, log.l = TRUE, ...)

Arguments

x

a 'sparseR' object

plot_type

should the solution path, CV results, or both be plotted?

cols

option to specify color of groups

log.l

should the x-axis (lambda) be logged?

...

extra plotting options

Value

Nothing is returned; the plot is produced as a side effect.
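
Examples

A hedged sketch (not from the package documentation) showing the available plot types for a fitted sparseR object.

library(sparseR)

fit <- sparseR(Sepal.Length ~ ., data = iris)

plot(fit)                                      # CV results and solution path
plot(fit, plot_type = "cv")                    # cross-validation results only
plot(fit, plot_type = "path", log.l = FALSE)   # path on the raw lambda scale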


Predict coefficients or responses for sparseR object

Description

Predict coefficients or responses for sparseR object

Usage

## S3 method for class 'sparseR'
predict(object, newdata, lambda, at = c("cvmin", "cv1se"), ...)

## S3 method for class 'sparseR'
coef(object, lambda, at = c("cvmin", "cv1se"), ...)

Arguments

object

sparseR object

newdata

new data on which to make predictions

lambda

a particular value of lambda to predict with

at

which "smart" choice of lambda to use when 'lambda' is not specified: "cvmin" or "cv1se"

...

additional arguments passed to predict.ncvreg

Value

predicted outcomes for 'newdata' (or coefficients) at specified (or smart) lambda value
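
Examples

A hedged sketch (not from the package documentation): predictions and coefficients at the two "smart" lambda values.

library(sparseR)

fit <- sparseR(Sepal.Length ~ ., data = iris)

predict(fit, newdata = head(iris), at = "cvmin")
predict(fit, newdata = head(iris), at = "cv1se")
head(coef(fit, at = "cv1se"))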


Print sparseR object

Description

Print sparseR object

Usage

## S3 method for class 'sparseR'
print(x, prep = FALSE, ...)

Arguments

x

a sparseR object

prep

Should the SR set-up information be printed as well?

...

additional arguments passed to print.ncvreg

Value

returns x invisibly


Fit a ranked-sparsity model with regularized regression

Description

Fit a ranked-sparsity model with regularized regression

Usage

sparseR(
  formula,
  data,
  family = c("gaussian", "binomial", "poisson", "coxph"),
  penalty = c("lasso", "MCP", "SCAD"),
  alpha = 1,
  ncvgamma = 3,
  lambda.min = 0.005,
  k = 1,
  poly = 2,
  gamma = 0.5,
  cumulative_k = FALSE,
  cumulative_poly = TRUE,
  pool = FALSE,
  ia_formula = NULL,
  pre_process = TRUE,
  model_matrix = NULL,
  y = NULL,
  poly_prefix = "_poly_",
  int_sep = "\\:",
  pre_proc_opts = c("knnImpute", "scale", "center", "otherbin", "none"),
  filter = c("nzv", "zv"),
  extra_opts = list(),
  ...
)

Arguments

formula

Model formula specifying the outcome and the main-effect terms

data

A data frame containing the variables in the formula

family

The family of the model

penalty

What penalty should be used (lasso, MCP, or SCAD)

alpha

The mix of L1 penalty (lower values introduce more L2 ridge penalty)

ncvgamma

The tuning parameter for ncvreg (for MCP or SCAD)

lambda.min

The minimum value to be used for lambda (as ratio of max, see ?ncvreg)

k

The maximum order of interactions to consider (default: 1; all pairwise)

poly

The maximum order of polynomials to consider (default: 2)

gamma

The degree of extremity of sparsity rankings (see details)

cumulative_k

Should penalties be increased cumulatively as the interaction order increases?

cumulative_poly

Should penalties be increased cumulatively as the polynomial order increases?

pool

Should interactions of order k and polynomials of order k+1 be pooled together for calculating the penalty?

ia_formula

formula to be passed to step_interact (for interactions, see details)

pre_process

Should the data be preprocessed (if FALSE, must provide model_matrix)

model_matrix

A data frame or matrix specifying the full model matrix (used if !pre_process)

y

A vector of responses (used if !pre_process)

poly_prefix

If model_matrix is specified, what is the prefix for polynomial terms?

int_sep

If model_matrix is specified, what is the separator for interaction terms?

pre_proc_opts

List of preprocessing steps (see details)

filter

The type of filter applied to main effects + interactions

extra_opts

A list of options for all preprocess steps (see details)

...

Additional arguments (passed to fitting function)

Details

Selecting gamma: higher values of gamma will penalize "group" size more. By default, this is set to 0.5, which yields equal contribution of prior information across orders of interactions/polynomials (this is a good default for most settings).

Additionally, setting cumulative_poly or cumulative_k to TRUE increases the penalty cumulatively based on the order of either polynomial or interaction.

The options that can be passed to pre_proc_opts are:

  • knnImpute (should missing data be imputed?)

  • scale (should data be standardized?)

  • center (should data be centered to the mean or another value?)

  • otherbin (should factors with low prevalence be combined?)

  • none (should no preprocessing be done? can also specify a null object)

The options that can be passed to extra_opts are:

  • centers (named numeric vector which denotes where each covariate should be centered)

  • center_fn (alternatively, a function can be specified to calculate center such as min or median)

  • freq_cut, unique_cut (see ?step_nzv; these get used by the filtering steps)

  • neighbors (the number of neighbors for knnImpute)

  • one_hot (see ?step_dummy); this defaults to cell-means coding, which is permissible in regularized regression (change at your own risk)

  • raw (should raw rather than orthogonal polynomials be used? Defaults to TRUE because, by default, variables have already been centered and scaled by this point)

ia_formula will by default interact all variables with each other up to order k. If specified, ia_formula will be passed as the terms argument to recipes::step_interact, so the help documentation for that function can be investigated for further assistance in specifying specific interactions.

Value

an object of class sparseR containing the following:

fit

the fit object returned by ncvreg

srprep

a recipes object used to prep the data

pen_factors

the factor multiple on penalties for ranked sparsity

results

all coefficients and penalty factors at minimum CV lambda

results_summary

a tibble of summary results at minimum CV lambda

results1se

all coefficients and penalty factors at lambda_1se

results1se_summary

a tibble of summary results at lambda_1se

data

the (unprocessed) data

family

the family argument (for non-normal models, e.g. poisson)

info

a list containing meta-info about the procedure

References

For fitting functionality, the ncvreg package is used; see Breheny, P. and Huang, J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Statist., 5: 232-253.
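
Examples

A hedged sketch (not from the package documentation): a default sparsity-ranked lasso fit, followed by a call with customized preprocessing that centers covariates at their medians via extra_opts. The data sets are arbitrary.

library(sparseR)

# defaults: k = 1 (pairwise interactions), poly = 2, lasso penalty
fit <- sparseR(Sepal.Length ~ ., data = iris)
fit
plot(fit)

# customized preprocessing: scale, then center numeric covariates at medians
fit2 <- sparseR(mpg ~ ., data = mtcars, k = 1, poly = 2,
                pre_proc_opts = c("scale", "center"),
                extra_opts = list(center_fn = median))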


Preprocess & create a model matrix with interactions + polynomials

Description

Preprocess & create a model matrix with interactions + polynomials

Usage

sparseR_prep(
  formula,
  data,
  k = 1,
  poly = 1,
  pre_proc_opts = c("knnImpute", "scale", "center", "otherbin", "none"),
  ia_formula = NULL,
  filter = c("nzv", "zv"),
  extra_opts = list(),
  family = "gaussian"
)

Arguments

formula

A formula of the main effects + outcome of the model

data

A required data frame or tibble containing the variables in formula

k

The maximum order of interactions to consider among numeric variables

poly

the maximum order of polynomials to consider

pre_proc_opts

A character vector specifying methods for preprocessing (see details)

ia_formula

formula to be passed to step_interact (for interactions, see details)

filter

which methods should be used to filter out variables with (near) zero variance? (see details)

extra_opts

extra options to be used for preprocessing

family

family passed from sparseR

Details

The pre_proc_opts argument acts as a wrapper for the corresponding procedures in the recipes package. The currently supported options that can be passed to pre_proc_opts are:

  • knnImpute: should k-nearest-neighbors imputation be performed (if necessary)?

  • scale: should variables be scaled prior to creating interactions? (does not scale factor variables or dummy variables)

  • center: should variables be centered? (will not center factor variables or dummy variables)

  • otherbin: should factors with low prevalence be combined?

ia_formula will by default interact all variables with each other up to order k. If specified, ia_formula will be passed as the terms argument to recipes::step_interact, so the help documentation for that function can be investigated for further assistance in specifying specific interactions.

The methods specified in filter are important; filtering is necessary to cut down on extraneous polynomials and interactions (in cases where they really don't make sense). This is true, for instance, when using dummy variables in polynomials, or when using interactions of dummy variables that relate to the same categorical variable.

Value

an object of class recipe; see recipes::recipe()
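
Examples

A hedged sketch (not from the package documentation): building the preprocessing recipe by hand. Whether the returned recipe still requires recipes::prep() before baking is an assumption, so that step is left commented.

library(sparseR)

rec <- sparseR_prep(Sepal.Length ~ ., data = iris, k = 1, poly = 2)
rec   # prints the sequence of preprocessing steps

# Assumed workflow for obtaining the model matrix:
# X <- recipes::bake(recipes::prep(rec, training = iris), new_data = iris)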


Bootstrap procedure for stepwise regression

Description

Runs a bootstrap of the model selection procedure using RBIC to obtain bootstrapped standard errors (smoothed; see Efron 2014) as well as the selection percentage for each candidate variable. (experimental)

Usage

sparseRBIC_bootstrap(srbic_fit, B = 100, quiet = FALSE)

Arguments

srbic_fit

An object fitted by sparseRBIC_step

B

Number of bootstrap samples

quiet

Should the display of a progress bar be silenced?

Value

a list containing:

results

a tibble containing coefficients, p-values, selection pct

bootstraps

a tibble of bootstrapped coefficients
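
Examples

A hedged sketch (not from the package documentation); B is kept small only so the example runs quickly.

library(sparseR)

srbic <- sparseRBIC_step(Sepal.Length ~ ., data = iris, k = 1, message = FALSE)
boot  <- sparseRBIC_bootstrap(srbic, B = 25, quiet = TRUE)
boot$results      # coefficients, p-values, and selection percentages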


Sample split procedure for stepwise regression

Description

Repeatedly runs the RBIC-based model selection procedure on random sample splits to achieve valid post-selection inferential results

Usage

sparseRBIC_sampsplit(srbic_fit, S = 100, quiet = FALSE)

Arguments

srbic_fit

An object fitted by sparseRBIC_step

S

Number of splitting iterations

quiet

Should the display of a progress bar be silenced?

Value

a list containing:

results

a tibble containing coefficients, p-values, selection pct

splits

a tibble of different split-based coefficients
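
Examples

A hedged sketch (not from the package documentation); S is kept small only so the example runs quickly.

library(sparseR)

srbic <- sparseRBIC_step(Sepal.Length ~ ., data = iris, k = 1, message = FALSE)
ss <- sparseRBIC_sampsplit(srbic, S = 25, quiet = TRUE)
ss$results      # coefficients, p-values, and selection percentages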


Fit a ranked-sparsity model with forward stepwise RBIC (experimental)

Description

Fit a ranked-sparsity model with forward stepwise RBIC (experimental)

Usage

sparseRBIC_step(
  formula,
  data,
  family = c("gaussian", "binomial", "poisson"),
  k = 1,
  poly = 1,
  ic = c("RBIC", "RAIC", "BIC", "AIC", "EBIC"),
  hier = c("strong", "weak", "none"),
  sequential = (hier[1] != "none"),
  cumulative_k = FALSE,
  cumulative_poly = TRUE,
  pool = FALSE,
  ia_formula = NULL,
  pre_process = TRUE,
  model_matrix = NULL,
  y = NULL,
  poly_prefix = "_poly_",
  int_sep = "\\:",
  pre_proc_opts = c("knnImpute", "scale", "center", "otherbin", "none"),
  filter = c("nzv", "zv"),
  extra_opts = list(),
  trace = 0,
  message = TRUE,
  ...
)

Arguments

formula

Model formula specifying the outcome and the main-effect terms

data

A data frame containing the variables in the formula

family

The family of the model

k

The maximum order of interactions to consider

poly

The maximum order of polynomials to consider

ic

The information criterion to use

hier

Should hierarchy be enforced (weak or strong)? Must be set with sequential == TRUE (see details)

sequential

Should the main effects be considered first, with higher orders sequentially added/considered?

cumulative_k

Should penalties be increased cumulatively as order interaction increases?

cumulative_poly

Should penalties be increased cumulatively as order polynomial increases?

pool

Should interactions of order k and polynomials of order k+1 be pooled together for calculating the penalty?

ia_formula

formula to be passed to step_interact via terms argument

pre_process

Should the data be preprocessed (if FALSE, must provide model_matrix)

model_matrix

A data frame or matrix specifying the full model matrix (used if !pre_process)

y

A vector of responses (used if !pre_process)

poly_prefix

If model_matrix is specified, what is the prefix for polynomial terms?

int_sep

If model_matrix is specified, what is the separator for interaction terms?

pre_proc_opts

List of preprocessing steps (see details)

filter

The type of filter applied to main effects + interactions

extra_opts

A list of options for all preprocess steps (see details)

trace

Should intermediate results of the model selection process be output?

message

should the message noting that this function is experimental be suppressed?

...

additional arguments for running stepwise selection

Details

This function mirrors sparseR but uses stepwise selection guided by RBIC.

Additionally, setting cumulative_poly or cumulative_k to TRUE increases the penalty cumulatively based on the order of either polynomial or interaction.

The hier hierarchy enforcement will only work if sequential == TRUE, and notably will only consider the "first gen" hierarchy, that is, that all main effects which make up an interaction are already in the model. It is therefore possible for a third order interaction (x1:x2:x3) to enter a model without x1:x2 or x2:x3, so long as x1, x2, and x3 are all in the model.

The options that can be passed to pre_proc_opts are:

  • knnImpute (should missing data be imputed?)

  • scale (should data be standardized?)

  • center (should data be centered to the mean or another value?)

  • otherbin (should factors with low prevalence be combined?)

  • none (should no preprocessing be done? can also specify a null object)

The options that can be passed to extra_opts are:

  • centers (named numeric vector which denotes where each covariate should be centered)

  • center_fn (alternatively, a function can be specified to calculate center such as min or median)

  • freq_cut, unique_cut (see ?step_nzv - these get used by the filtering steps)

  • neighbors (the number of neighbors for knnImpute)

  • one_hot (see ?step_dummy); this defaults to cell-means coding, which is permissible in regularized regression (change at your own risk)

  • raw (should raw rather than orthogonal polynomials be used? Defaults to TRUE because, by default, variables have already been centered and scaled by this point)

Value

an object of class sparseRBIC containing the following:

fit

the final fit object

srprep

a recipes object used to prep the data

pen_info

coefficient-level variable counts, types + names

data

the (unprocessed) data

family

the family argument (for non-normal models, e.g. poisson)

info

a list containing meta-info about the procedure

stats

the IC for each fit and respective terms included
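
Examples

A hedged sketch (not from the package documentation): forward stepwise selection under RBIC with the default (strong) hierarchy enforcement.

library(sparseR)

srbic <- sparseRBIC_step(Sepal.Length ~ ., data = iris, k = 1, poly = 2,
                         ic = "RBIC", message = FALSE)
srbic$stats    # the information criterion at each step and the terms included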


Centering numeric data to a value besides their mean

Description

'step_center_to' generalizes 'step_center' to allow for a different function than the 'mean' function to calculate centers. It creates a *specification* of a recipe step that will normalize numeric data to have a 'center' of zero.

Usage

step_center_to(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  centers = NULL,
  center_fn = mean,
  na_rm = TRUE,
  skip = FALSE,
  id = rand_id("center_to")
)

## S3 method for class 'step_center_to'
tidy(x, ...)

Arguments

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

...

One or more selector functions to choose which variables are affected by the step. See [selections()] for more details. For the 'tidy' method, these are not currently used.

role

Not used by this step since no new variables are created.

trained

A logical to indicate if the quantities for preprocessing have been estimated.

centers

A named numeric vector of centers. This is 'NULL' until computed by [prep.recipe()], or it can be specified directly as a named numeric vector.

center_fn

a function to be used to calculate where the center should be

na_rm

A logical value indicating whether 'NA' values should be removed during computations.

skip

A logical. Should the step be skipped when the recipe is baked by [bake.recipe()]? While all operations are baked when [prep.recipe()] is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using 'skip = TRUE' as it may affect the computations for subsequent operations.

id

A character string that is unique to this step to identify it.

x

A 'step_center_to' object.

Details

Centering data means that a variable's center (by default its mean, or the value computed by 'center_fn') is subtracted from the data. 'step_center_to' estimates the variable centers from the data used in the 'training' argument of 'prep.recipe'. 'bake.recipe' then applies the centering to new data sets using these centers.

Value

An updated version of 'recipe' with the new step added to the sequence of existing steps (if any). For the 'tidy' method, a tibble with columns 'terms' (the selectors or variables selected) and 'value' (the centers).

See Also

[recipe()] [prep.recipe()] [bake.recipe()]

Examples

library(sparseR)
library(recipes)

data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]

rec <- recipes::recipe(
 HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
 data = biomass_tr)

center_trans <- rec %>%
  step_center_to(carbon, contains("gen"), -hydrogen)

center_obj <- recipes::prep(center_trans, training = biomass_tr)

transformed_te <- recipes::bake(center_obj, biomass_te)

biomass_te[1:10, names(transformed_te)]
transformed_te

recipes::tidy(center_trans)
recipes::tidy(center_obj)

Summary of sparseR model coefficients

Description

Summary of sparseR model coefficients

Usage

## S3 method for class 'sparseR'
summary(object, lambda, at = c("cvmin", "cv1se"), ...)

Arguments

object

a sparseR object

lambda

a particular value of lambda to predict with

at

which "smart" choice of lambda to use when 'lambda' is not specified: "cvmin" or "cv1se"

...

additional arguments to be passed to summary.ncvreg

Value

an object of class 'summary.ncvreg' at the specified (or "smart") value of lambda.
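
Examples

A hedged sketch (not from the package documentation): summaries at both "smart" choices of lambda.

library(sparseR)

fit <- sparseR(Sepal.Length ~ ., data = iris)
summary(fit, at = "cvmin")
summary(fit, at = "cv1se")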