Title: | Exact Variable-Subset Selection in Linear Regression |
---|---|
Description: | Exact and approximation algorithms for variable-subset selection in ordinary linear regression models. Either compute all submodels with the lowest residual sum of squares, or determine the single-best submodel according to a pre-determined statistical criterion. Hofmann et al. (2020) <doi:10.18637/jss.v093.i03>. |
Authors: | Marc Hofmann [aut, cre], Cristian Gatu [aut], Erricos J. Kontoghiorghes [aut], Ana Colubi [aut], Achim Zeileis [aut] , Martin Moene [cph] (for the GSL Lite library), Microsoft Corporation [cph] (for the GSL Lite library), Free Software Foundation, Inc. [cph] (for snippets from the GNU ISO C++ Library) |
Maintainer: | Marc Hofmann <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.5-2 |
Built: | 2024-12-19 06:51:02 UTC |
Source: | CRAN |
lmSubsets
Variable-subset selection in ordinary linear regression.
Marc Hofmann ([email protected])
Cristian Gatu ([email protected])
Erricos J. Kontoghiorghes ([email protected])
Ana Colubi ([email protected])
Achim Zeileis ([email protected])
Hofmann M, Gatu C, Kontoghiorghes EJ, Colubi A, Zeileis A (2020). lmSubsets: Exact variable-subset selection in linear regression for R. Journal of Statistical Software, 93, 1–21. doi:10.18637/jss.v093.i03.
Hofmann M, Gatu C, Kontoghiorghes EJ (2007). Efficient algorithms for computing the best subset regression models for large-scale problems. Computational Statistics \& Data Analysis, 52, 16–29. doi:10.1016/j.csda.2007.03.017.
Gatu C, Kontoghiorghes EJ (2006). Branch-and-bound algorithms for computing the best subset regression models. Journal of Computational and Graphical Statistics, 15, 139–156. doi:10.1198/106186006x100290.
Home page: https://github.com/marc-hofmann/lmSubsets.R
Evaluate Akaike's information criterion (AIC) for the specified submodels.
## S3 method for class 'lmSubsets' AIC(object, size, best = 1, ..., k = 2, na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' AIC(object, best = 1, ..., k = 2, na.rm = TRUE, drop = TRUE)
## S3 method for class 'lmSubsets' AIC(object, size, best = 1, ..., k = 2, na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' AIC(object, best = 1, ..., k = 2, na.rm = TRUE, drop = TRUE)
object |
|
size |
|
best |
|
... |
ignored |
k |
|
na.rm |
|
drop |
|
double[]
—the AIC values
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
AIC()
for the S3 generic
Data relating air pollution and mortality, frequently used for illustrations in ridge regression and related tasks.
data(AirPollution)
data(AirPollution)
A data frame containing 60 observations on 16 variables.
average annual precipitation in inches
average January temperature in degrees Fahrenheit
average July temperature in degrees Fahrenheit
percentage of 1960 SMSA population aged 65 or older
average household size
median school years completed by those over 22
percentage of housing units which are sound and with all facilities
population per square mile in urbanized areas, 1960
percentage of non-Caucasian population in urbanized areas, 1960
percentage employed in white collar occupations
percentage of families with income < USD 3000
relative hydrocarbon pollution potential
relative nitric oxides potential
relative sulphur dioxide potential
annual average percentage of relative humidity at 13:00
total age-adjusted mortality rate per 100,000
http://lib.stat.cmu.edu/datasets/pollution
McDonald GC, Schwing RC (1973). Instabilities of regression estimates relating air pollution to mortality. Technometrics, 15, 463–482.
Miller AJ (2002). Subset selection in regression. New York: Chapman and Hall.
## load data (with logs for relative potentials) data("AirPollution", package = "lmSubsets") for (i in 12:14) AirPollution[[i]] <- log(AirPollution[[i]]) ## fit subsets lm_all <- lmSubsets(mortality ~ ., data = AirPollution) plot(lm_all) ## refit best model lm6 <- refit(lm_all, size = 6) summary(lm6)
## load data (with logs for relative potentials) data("AirPollution", package = "lmSubsets") for (i in 12:14) AirPollution[[i]] <- log(AirPollution[[i]]) ## fit subsets lm_all <- lmSubsets(mortality ~ ., data = AirPollution) plot(lm_all) ## refit best model lm6 <- refit(lm_all, size = 6) summary(lm6)
Evaluate the Bayesian information criterion (BIC) for the specified submodels.
## S3 method for class 'lmSubsets' BIC(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' BIC(object, best = 1, ..., na.rm = TRUE, drop = TRUE)
## S3 method for class 'lmSubsets' BIC(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' BIC(object, best = 1, ..., na.rm = TRUE, drop = TRUE)
object |
|
size |
|
best |
|
... |
ignored |
na.rm |
|
drop |
|
double[]
—the BIC values
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
BIC()
for the S3 generic
Return the coefficients for the specified submodels.
## S3 method for class 'lmSubsets' coef(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' coef(object, best = 1, ..., na.rm = TRUE, drop = TRUE)
## S3 method for class 'lmSubsets' coef(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' coef(object, best = 1, ..., na.rm = TRUE, drop = TRUE)
object |
|
size |
|
best |
|
... |
ignored |
na.rm |
|
drop |
|
double[,]
, "data.frame"
—the submodel coefficients
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
coef()
for the S3 generic
Return the deviance for the specified submodels.
## S3 method for class 'lmSubsets' deviance(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' deviance(object, best = 1, ..., na.rm = TRUE, drop = TRUE)
## S3 method for class 'lmSubsets' deviance(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' deviance(object, best = 1, ..., na.rm = TRUE, drop = TRUE)
object |
|
size |
|
best |
|
... |
ignored |
na.rm |
|
drop |
|
double[]
, "data.frame"
—the submodel deviances
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
deviance()
for the S3 generic
Return the fitted values for the specified submodel.
## S3 method for class 'lmSubsets' fitted(object, size, best = 1, ...) ## S3 method for class 'lmSelect' fitted(object, best = 1, ...)
## S3 method for class 'lmSubsets' fitted(object, size, best = 1, ...) ## S3 method for class 'lmSelect' fitted(object, best = 1, ...)
object |
|
size |
|
best |
|
... |
ignored |
double[]
—the fitted values
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
fitted()
for the S3 generic
Return the formula for the specified submodel.
## S3 method for class 'lmSubsets' formula(x, size, best = 1, ...) ## S3 method for class 'lmSelect' formula(x, best, ...)
## S3 method for class 'lmSubsets' formula(x, size, best = 1, ...) ## S3 method for class 'lmSelect' formula(x, best, ...)
x |
|
size |
|
best |
|
... |
ignored |
"formula"
—the submodel formula
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
formula()
for the S3 generic
00UTC temperature observations and corresponding 24-hour reforecast ensemble means from the Global Ensemble Forecast System (GEFS, Hamill et al. 2013) for SYNOP station Innsbruck Airport (11120; 47.260, 11.357) from 2011-01-01 to 2015-12-31.
data(IbkTemperature)
data(IbkTemperature)
A data frame containing 1824 daily observations/forecasts for 42
variables. The first column (temp
) contains temperature
observations at 00UTC (coordinated universal time), columns 2–37 are
24-hour lead time GEFS reforecast ensemble means for different
variables (see below). Columns 38–42 are deterministic time
trend/season patterns.
observed temperature at Innsbruck Airport (deg )
total accumulated precipitation ()
temperature at 2 meters ()
U-component of wind at 10 meters ()
V-component of wind at 10 meters ()
U-component of wind at 80 meters ()
U-component of wind at 80 meters ()
convective available potential energy
()
convective inhibition ()
surface downward long-wave radiation flux
()
surface downward short-wave radiation flux
()
surface upward long-wave radiation flux
()
surface upward short-wave radiation flux
()
ground heat flux ()
surface latent heat net flux ()
surface sensible heat net flux ()
mean sea level pressure ()
surface pressure ()
precipitable water ()
volumetric soil moisture content (fraction)
specific humidity at 2 meters ()
total cloud cover (percent)
total column-integrated condensate
()
skin temperature ()
maximum temperature ()
minimum temperature ()
soil temperature (0–10 cm below surface) ()
upward long-wave radiation flux ()
water runoff ()
water equivalent of accumulated snow depth
()
wind mixing energy ()
vertical velocity at 850 hPa surface ()
temperature on 2 PVU surface ()
pressure on 2 PVU surface ()
U-component of wind on 2 PVU surface ()
U-component of wind on 2 PVU surface ()
Potential vorticity on 320 K isentrope
()
time in years
sine and cosine component of annual harmonic pattern
sine and cosine component of bi-annual harmonic pattern
Observations: https://www.ogimet.com/synops.phtml.en. Reforecasts: https://psl.noaa.gov/forecasts/reforecast2/.
Hamill TM, Bates GT, Whitaker JS, Murray DR, Fiorino M, Galarneau Jr. TJ, Zhu Y, Lapenta W (2013). NOAA's second-generation global medium-range ensemble reforecast data set. Bulletin of the American Meteorological Society, 94(10), 1553–1565. doi:10.1175/BAMS-D-12-00014.1.
## load data and omit missing values data("IbkTemperature", package = "lmSubsets") IbkTemperature <- na.omit(IbkTemperature) ## fit a simple climatological model for the temperature ## with a linear trend and annual/bi-annual harmonic seasonal pattern CLIM <- lm(temp ~ time + sin + cos + sin2 + cos2, data = IbkTemperature) ## fit a simple MOS with 2-meter temperature forecast in addition ## to the climatological model MOS0 <- lm(temp ~ t2m + time + sin + cos + sin2 + cos2, data = IbkTemperature) ## graphical comparison and MOS summary plot(temp ~ time, data = IbkTemperature, type = "l", col = "darkgray") lines(fitted(MOS0) ~ time, data = IbkTemperature, col = "darkred") lines(fitted(CLIM) ~ time, data = IbkTemperature, lwd = 2) MOS0 ## best subset selection of remaining variables for the MOS ## (i.e., forcing the regressors of m1 into the model) MOS1_all <- lmSubsets(temp ~ ., data = IbkTemperature, include = c("t2m", "time", "sin", "cos", "sin2", "cos2")) plot(MOS1_all) image(MOS1_all, size = 8:20) ## -> Note that soil temperature and maximum temperature are selected ## in addition to the 2-meter temperature ## best subset selection of all variables MOS2_all <- lmSubsets(temp ~ ., data = IbkTemperature) plot(MOS2_all) image(MOS2_all, size = 2:20) ## -> Note that 2-meter temperature is not selected into the best ## BIC model but soil-temperature (and maximum temperature) are used instead ## refit the best BIC subset selections MOS1 <- refit(lmSelect(MOS1_all)) MOS2 <- refit(lmSelect(MOS2_all)) ## compare BIC BIC(CLIM, MOS0, MOS1, MOS2) ## compare RMSE sqrt(sapply(list(CLIM, MOS0, MOS1, MOS2), deviance)/ nrow(IbkTemperature)) ## compare coefficients cf0 <- coef(CLIM) cf1 <- coef(MOS0) cf2 <- coef(MOS1) cf3 <- coef(MOS2) names(cf2) <- gsub("^x", "", names(coef(MOS1))) names(cf3) <- gsub("^x", "", names(coef(MOS2))) nam <- unique(c(names(cf0), names(cf1), names(cf2), names(cf3))) cf <- matrix(NA, nrow = length(nam), ncol = 4, dimnames = list(nam, c("CLIM", "MOS0", "MOS1", "MOS2"))) cf[names(cf0), 1] <- cf0 cf[names(cf1), 2] <- cf1 cf[names(cf2), 3] <- cf2 cf[names(cf3), 4] <- cf3 print(round(cf, digits = 3), na.print = "")
## load data and omit missing values data("IbkTemperature", package = "lmSubsets") IbkTemperature <- na.omit(IbkTemperature) ## fit a simple climatological model for the temperature ## with a linear trend and annual/bi-annual harmonic seasonal pattern CLIM <- lm(temp ~ time + sin + cos + sin2 + cos2, data = IbkTemperature) ## fit a simple MOS with 2-meter temperature forecast in addition ## to the climatological model MOS0 <- lm(temp ~ t2m + time + sin + cos + sin2 + cos2, data = IbkTemperature) ## graphical comparison and MOS summary plot(temp ~ time, data = IbkTemperature, type = "l", col = "darkgray") lines(fitted(MOS0) ~ time, data = IbkTemperature, col = "darkred") lines(fitted(CLIM) ~ time, data = IbkTemperature, lwd = 2) MOS0 ## best subset selection of remaining variables for the MOS ## (i.e., forcing the regressors of m1 into the model) MOS1_all <- lmSubsets(temp ~ ., data = IbkTemperature, include = c("t2m", "time", "sin", "cos", "sin2", "cos2")) plot(MOS1_all) image(MOS1_all, size = 8:20) ## -> Note that soil temperature and maximum temperature are selected ## in addition to the 2-meter temperature ## best subset selection of all variables MOS2_all <- lmSubsets(temp ~ ., data = IbkTemperature) plot(MOS2_all) image(MOS2_all, size = 2:20) ## -> Note that 2-meter temperature is not selected into the best ## BIC model but soil-temperature (and maximum temperature) are used instead ## refit the best BIC subset selections MOS1 <- refit(lmSelect(MOS1_all)) MOS2 <- refit(lmSelect(MOS2_all)) ## compare BIC BIC(CLIM, MOS0, MOS1, MOS2) ## compare RMSE sqrt(sapply(list(CLIM, MOS0, MOS1, MOS2), deviance)/ nrow(IbkTemperature)) ## compare coefficients cf0 <- coef(CLIM) cf1 <- coef(MOS0) cf2 <- coef(MOS1) cf3 <- coef(MOS2) names(cf2) <- gsub("^x", "", names(coef(MOS1))) names(cf3) <- gsub("^x", "", names(coef(MOS2))) nam <- unique(c(names(cf0), names(cf1), names(cf2), names(cf3))) cf <- matrix(NA, nrow = length(nam), ncol = 4, dimnames = list(nam, c("CLIM", "MOS0", "MOS1", "MOS2"))) cf[names(cf0), 1] <- cf0 cf[names(cf1), 2] <- cf1 cf[names(cf2), 3] <- cf2 cf[names(cf3), 4] <- cf3 print(round(cf, digits = 3), na.print = "")
Plot a heatmap of the specified submodels.
## S3 method for class 'lmSubsets' image(x, size = NULL, best = 1, which = NULL, hilite, hilite_penalty, main, sub, xlab = NULL, ylab, ann = par("ann"), axes = TRUE, col = c("gray40", "gray90"), lab = "lab", col_hilite = cbind("red", "pink"), lab_hilite = "lab", pad_size = 3, pad_best = 1, pad_which = 3, axis_pos = -4, axis_tck = -4, axis_lab = -10, ...) ## S3 method for class 'lmSelect' image(x, best = NULL, which = NULL, hilite, hilite_penalty, main, sub = NULL, xlab = NULL, ylab, ann = par("ann"), axes = TRUE, col = c("gray40", "gray90"), lab = "lab", col_hilite = cbind("red", "pink"), lab_hilite = "lab", pad_best = 2, pad_which = 2, axis_pos = -4, axis_tck = -4, axis_lab = -10, ...)
## S3 method for class 'lmSubsets' image(x, size = NULL, best = 1, which = NULL, hilite, hilite_penalty, main, sub, xlab = NULL, ylab, ann = par("ann"), axes = TRUE, col = c("gray40", "gray90"), lab = "lab", col_hilite = cbind("red", "pink"), lab_hilite = "lab", pad_size = 3, pad_best = 1, pad_which = 3, axis_pos = -4, axis_tck = -4, axis_lab = -10, ...) ## S3 method for class 'lmSelect' image(x, best = NULL, which = NULL, hilite, hilite_penalty, main, sub = NULL, xlab = NULL, ylab, ann = par("ann"), axes = TRUE, col = c("gray40", "gray90"), lab = "lab", col_hilite = cbind("red", "pink"), lab_hilite = "lab", pad_best = 2, pad_which = 2, axis_pos = -4, axis_tck = -4, axis_lab = -10, ...)
x |
|
size , best
|
submodels to be plotted |
which |
regressors to be plotted |
hilite , hilite_penalty
|
submodels to be highlighted |
main , sub , xlab , ylab
|
main, sub-, and axis titles |
ann |
annotate plot |
axes |
plot axes |
col , lab
|
color and label style |
col_hilite , lab_hilite
|
highlighting style |
pad_size , pad_best , pad_which
|
padding |
axis_pos , axis_tck , axis_lab
|
position of axes, tick length, and position of labels |
... |
ignored |
invisible(x)
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
## data data("AirPollution", package = "lmSubsets") ################# ## lmSubsets ## ################# lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 20) ## heatmap image(lm_all, best = 1:3) ## highlight 5 best (BIC) image(lm_all, best = 1:3, hilite = 1:5, hilite_penalty = "BIC") ################ ## lmSelect ## ################ ## default criterion: BIC lm_best <- lmSelect(lm_all) ## highlight 5 best (AIC) image(lm_best, hilite = 1:5, hilite_penalty = "AIC") ## axis labels image(lm_best, lab = c("bold(lab)", "lab"), hilite = 1, lab_hilite = "underline(lab)")
## data data("AirPollution", package = "lmSubsets") ################# ## lmSubsets ## ################# lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 20) ## heatmap image(lm_all, best = 1:3) ## highlight 5 best (BIC) image(lm_all, best = 1:3, hilite = 1:5, hilite_penalty = "BIC") ################ ## lmSelect ## ################ ## default criterion: BIC lm_best <- lmSelect(lm_all) ## highlight 5 best (AIC) image(lm_best, hilite = 1:5, hilite_penalty = "AIC") ## axis labels image(lm_best, lab = c("bold(lab)", "lab"), hilite = 1, lab_hilite = "underline(lab)")
Best-variable-subset selection in ordinary linear regression.
lmSelect(formula, ...) ## Default S3 method: lmSelect(formula, data, subset, weights, na.action, model = TRUE, x = FALSE, y = FALSE, contrasts = NULL, offset, ...)
lmSelect(formula, ...) ## Default S3 method: lmSelect(formula, data, subset, weights, na.action, model = TRUE, x = FALSE, y = FALSE, contrasts = NULL, offset, ...)
formula , data , subset , weights , na.action , model , x , y , contrasts , offset
|
standard formula interface |
... |
forwarded to |
The lmSelect()
generic provides various methods to conveniently
specify the regressor and response variables. The standard formula
interface (see lm()
) can be used, or the model
information can be extracted from an already fitted "lm"
object. The model matrix and response can also be passed in directly.
After processing the arguments, the call is forwarded to
lmSelect_fit()
.
"lmSelect"
—a list
containing the components returned
by lmSelect_fit()
Further components include call
, na.action
,
weights
, offset
, contrasts
, xlevels
,
terms
, mf
, x
, and y
. See
lm()
for more information.
lmSelect.matrix()
for the
matrix interface
lmSelect.lmSubsets()
for
coercing an all-subsets regression
lmSelect_fit()
for the low-level
interface
lmSubsets()
for all-subsets
regression
## load data data("AirPollution", package = "lmSubsets") ################### ## basic usage ## ################### ## fit 20 best subsets (BIC) lm_best <- lmSelect(mortality ~ ., data = AirPollution, nbest = 20) lm_best ## summary statistics summary(lm_best) ## visualize plot(lm_best) ######################## ## custom criterion ## ######################## ## the same as above, but with a custom criterion: M <- nrow(AirPollution) ll <- function (rss) { -M/2 * (log(2 * pi) - log(M) + log(rss) + 1) } aic <- function (size, rss, k = 2) { -2 * ll(rss) + k * (size + 1) } bic <- function (size, rss) { aic(size, rss, k = log(M)) } lm_cust <- lmSelect(mortality ~ ., data = AirPollution, penalty = bic, nbest = 20) lm_cust
## load data data("AirPollution", package = "lmSubsets") ################### ## basic usage ## ################### ## fit 20 best subsets (BIC) lm_best <- lmSelect(mortality ~ ., data = AirPollution, nbest = 20) lm_best ## summary statistics summary(lm_best) ## visualize plot(lm_best) ######################## ## custom criterion ## ######################## ## the same as above, but with a custom criterion: M <- nrow(AirPollution) ll <- function (rss) { -M/2 * (log(2 * pi) - log(M) + log(rss) + 1) } aic <- function (size, rss, k = 2) { -2 * ll(rss) + k * (size + 1) } bic <- function (size, rss) { aic(size, rss, k = log(M)) } lm_cust <- lmSelect(mortality ~ ., data = AirPollution, penalty = bic, nbest = 20) lm_cust
Low-level interface to best-variable-subset selection in ordinary linear regression.
lmSelect_fit(x, y, weights = NULL, offset = NULL, include = NULL, exclude = NULL, penalty = "BIC", tolerance = 0, nbest = 1, ..., pradius = NULL)
lmSelect_fit(x, y, weights = NULL, offset = NULL, include = NULL, exclude = NULL, penalty = "BIC", tolerance = 0, nbest = 1, ..., pradius = NULL)
x |
|
y |
|
weights |
|
offset |
|
include |
|
exclude |
|
penalty |
|
tolerance |
|
nbest |
|
... |
ignored |
pradius |
|
The best variable-subset model is determined, where the "best" model is the one with the lowest information criterion value. The information criterion belongs to the AIC family.
The regression data is specified with the x
, y
,
weights
, and offset
parameters. See
lm.fit()
for further details.
To force regressors into or out of the regression, a list of
regressors can be passed as an argument to the include
or
exclude
parameters, respectively.
The information criterion is specified with the penalty
parameter. Accepted values are "AIC"
, "BIC"
, or a
"numeric"
value representing the penalty-per-model-parameter.
A custom selection criterion may be specified by passing an R
function as an argument. The expected signature is function
(size, rss)
, where size
is the number of predictors (including
the intercept, if any), and rss
is the residual sum of squares.
The function must be non-decreasing in both parameters.
An approximation tolerance
can be specified to speed up the
search.
The number of returned submodels is determined by the nbest
parameter.
The preordering radius is given with the pradius
parameter.
A list
with the following components:
NOBS |
|
nobs |
|
nvar |
|
weights |
|
intercept |
|
include |
|
exclude |
|
size |
|
ic |
information criterion |
tolerance |
|
nbest |
|
submodel |
|
subset |
|
Hofmann M, Gatu C, Kontoghiorghes EJ, Colubi A, Zeileis A (2020). lmSubsets: Exact variable-subset selection in linear regression for R. Journal of Statistical Software, 93, 1–21. doi:10.18637/jss.v093.i03.
lmSelect()
for the high-level
interface
lmSubsets_fit()
for all-subsets
regression
data("AirPollution", package = "lmSubsets") x <- as.matrix(AirPollution[, names(AirPollution) != "mortality"]) y <- AirPollution[, names(AirPollution) == "mortality"] f <- lmSelect_fit(x, y) f
data("AirPollution", package = "lmSubsets") x <- as.matrix(AirPollution[, names(AirPollution) != "mortality"]) y <- AirPollution[, names(AirPollution) == "mortality"] f <- lmSelect_fit(x, y) f
Coerce an all-subsets regression.
## S3 method for class 'lmSubsets' lmSelect(formula, penalty = "BIC", ...)
## S3 method for class 'lmSubsets' lmSelect(formula, penalty = "BIC", ...)
formula |
|
penalty |
|
... |
ignored |
Computes a best-subset regression from an all-subsets regression.
"lmSelect"
—a best-subset regression
lmSelect()
for the S3 generic
lmSubsets()
for all-subsets
regression
data("AirPollution", package = "lmSubsets") lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 20) lm_best <- lmSelect(lm_all) lm_best
data("AirPollution", package = "lmSubsets") lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 20) lm_best <- lmSelect(lm_all) lm_best
Matrix interface to best-variable-subset selection in ordinary linear regression.
## S3 method for class 'matrix' lmSelect(formula, y, intercept = TRUE, ...)
## S3 method for class 'matrix' lmSelect(formula, y, intercept = TRUE, ...)
formula |
|
y |
|
intercept |
|
... |
forwarded to
|
This is a utility interface. Use the standard formula interface wherever possible.
"lmSelect"
—a best-subset regression
lmSelect()
for the S3 generic
lmSelect.default()
for the
standard formula interface
All-variable-subsets selection in ordinary linear regression.
lmSubsets(formula, ...) ## Default S3 method: lmSubsets(formula, data, subset, weights, na.action, model = TRUE, x = FALSE, y = FALSE, contrasts = NULL, offset, ...)
lmSubsets(formula, ...) ## Default S3 method: lmSubsets(formula, data, subset, weights, na.action, model = TRUE, x = FALSE, y = FALSE, contrasts = NULL, offset, ...)
formula , data , subset , weights , na.action , model , x , y , contrasts , offset
|
standard formula interface |
... |
fowarded to |
The lmSubsets()
generic provides various methods to
conveniently specify the regressor and response variables. The
standard formula interface (see lm()
) can be used,
or the model information can be extracted from an already fitted
"lm"
object. The model matrix and response can also be passed
in directly.
After processing of the arguments, the call is forwarded to
lmSubsets_fit()
.
"lmSubsets"
—a list
containing the components returned
by lmSubsets_fit()
Further components include call
, na.action
,
weights
, offset
, contrasts
, xlevels
,
terms
, mf
, x
, and y
. See
lm()
for more information.
lmSubsets.matrix()
for the
"matrix"
interface
lmSubsets_fit()
for the
low-level interface
lmSelect()
for best-subset
regression
## load data data("AirPollution", package = "lmSubsets") ################### ## basic usage ## ################### ## canonical example: fit all subsets lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 5) lm_all ## plot RSS and BIC plot(lm_all) ## summary statistics summary(lm_all) ############################ ## forced in-/exclusion ## ############################ lm_force <- lmSubsets(lm_all, include = c("nox", "so2"), exclude = "whitecollar") lm_force
## load data data("AirPollution", package = "lmSubsets") ################### ## basic usage ## ################### ## canonical example: fit all subsets lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 5) lm_all ## plot RSS and BIC plot(lm_all) ## summary statistics summary(lm_all) ############################ ## forced in-/exclusion ## ############################ lm_force <- lmSubsets(lm_all, include = c("nox", "so2"), exclude = "whitecollar") lm_force
Low-level interface to all-variable-subsets selection in ordinary linear regression.
lmSubsets_fit(x, y, weights = NULL, offset = NULL, include = NULL, exclude = NULL, nmin = NULL, nmax = NULL, tolerance = 0, nbest = 1, ..., pradius = NULL)
lmSubsets_fit(x, y, weights = NULL, offset = NULL, include = NULL, exclude = NULL, nmin = NULL, nmax = NULL, tolerance = 0, nbest = 1, ..., pradius = NULL)
x |
|
y |
|
weights |
|
offset |
|
include |
|
exclude |
|
nmin |
|
nmax |
|
tolerance |
|
nbest |
|
... |
ignored |
pradius |
|
The best variable-subset model for every subset size is determined, where the "best" model is the one with the lowest residual sum of squares (RSS).
The regression data is specified with the x
, y
,
weights
, and offset
parameters. See
lm.fit()
for further details.
To force regressors into or out of the regression, a list of
regressors can be passed as an argument to the include
or
exclude
parameters, respectively.
The scope of the search can be limited to a range of subset sizes by
setting nmin
and nmax
, the minimum and maximum number of
regressors allowed in the regression, respectively.
A tolerance
vector can be specified to speed up the search,
where tolerance[j]
is the approximation tolerance applied to
subset models of size j
.
The number of submodels returned for each subset size is determined by
the nbest
parameter.
The preordering radius is given with the pradius
parameter.
A list
with the following components:
NOBS |
|
nobs |
|
nvar |
|
weights |
|
intercept |
|
include |
|
exclude |
|
size |
|
tolerance |
|
nbest |
|
submodel |
|
subset |
|
Hofmann M, Gatu C, Kontoghiorghes EJ, Colubi A, Zeileis A (2020). lmSubsets: Exact variable-subset selection in linear regression for R. Journal of Statistical Software, 93, 1–21. doi:10.18637/jss.v093.i03.
lmSubsets()
for the high-level
interface
lmSelect_fit()
for best-subset
regression
data("AirPollution", package = "lmSubsets") x <- as.matrix(AirPollution[, names(AirPollution) != "mortality"]) y <- AirPollution[, names(AirPollution) == "mortality"] f <- lmSubsets_fit(x, y) f
data("AirPollution", package = "lmSubsets") x <- as.matrix(AirPollution[, names(AirPollution) != "mortality"]) y <- AirPollution[, names(AirPollution) == "mortality"] f <- lmSubsets_fit(x, y) f
Matrix interface to all-variable-subsets selection in ordinary linear regression.
## S3 method for class 'matrix' lmSubsets(formula, y, intercept = TRUE, ...)
## S3 method for class 'matrix' lmSubsets(formula, y, intercept = TRUE, ...)
formula |
|
y |
|
intercept |
|
... |
forwarded to
|
This is a utility interface. Use the standard formula interface wherever possible.
"lmSubsets"
—an all-subsets regression
lmSubsets()
for the S3 generic
lmSubsets.default()
for the
standard formula interface
data("AirPollution", package = "lmSubsets") x <- as.matrix(AirPollution) lm_mat <- lmSubsets(x, y = "mortality") lm_mat
data("AirPollution", package = "lmSubsets") x <- as.matrix(AirPollution) lm_mat <- lmSubsets(x, y = "mortality") lm_mat
Return the log-likelihood of the the specified submodels.
## S3 method for class 'lmSubsets' logLik(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' logLik(object, best = 1, ..., na.rm = TRUE, drop = TRUE)
## S3 method for class 'lmSubsets' logLik(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' logLik(object, best = 1, ..., na.rm = TRUE, drop = TRUE)
object |
|
size |
|
best |
|
... |
ignored |
na.rm |
|
drop |
|
double[]
—the log-likelihoods
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
logLik()
for the S3 generic
Extract the model response.
model_response(data, ...) ## Default S3 method: model_response(data, type = "any", ...)
model_response(data, ...) ## Default S3 method: model_response(data, type = "any", ...)
data |
an object |
type |
|
... |
further arguments |
The default method simply forwards the call to
model.response()
.
double[]
—the model response
model.response()
for the
default implementation
Return the model response.
## S3 method for class 'lmSubsets' model_response(data, ...) ## S3 method for class 'lmSelect' model_response(data, ...)
## S3 method for class 'lmSubsets' model_response(data, ...) ## S3 method for class 'lmSelect' model_response(data, ...)
data |
|
... |
ignored |
double[]
—the model response
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
model_response()
for the S3
generic
Return the model frame.
## S3 method for class 'lmSubsets' model.frame(formula, ...) ## S3 method for class 'lmSelect' model.frame(formula, ...)
## S3 method for class 'lmSubsets' model.frame(formula, ...) ## S3 method for class 'lmSelect' model.frame(formula, ...)
formula |
|
... |
forwarded to |
"data.frame"
—the model frame
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
model.frame()
for the S3 generic
Returns the model matrix for the specified submodel.
## S3 method for class 'lmSubsets' model.matrix(object, size, best = 1, ...) ## S3 method for class 'lmSelect' model.matrix(object, best, ...)
## S3 method for class 'lmSubsets' model.matrix(object, size, best = 1, ...) ## S3 method for class 'lmSelect' model.matrix(object, best, ...)
object |
|
size |
|
best |
|
... |
forwarded to |
double[,]
—the model matrix
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
model.matrix()
for the S3
generic
Plot the deviance of the selected submodels, as well as a specified information criterion.
## S3 method for class 'lmSubsets' plot(x, penalty = "BIC", xlim, ylim_rss, ylim_ic, type_rss = "o", type_ic = "o", main, sub, xlab, ylab_rss, ylab_ic, legend_rss, legend_ic, ann = par("ann"), axes = TRUE, lty_rss = c(1, 3), pch_rss = c(16, 21), col_rss = "black", bg_rss = "white", lty_ic = c(1, 3), pch_ic = c(16, 21), col_ic = "red", bg_ic = "white", ...) ## S3 method for class 'lmSelect' plot(x, xlim, ylim, type = "o", main, sub, xlab, ylab, legend, ann = par("ann"), axes = TRUE, lty = 1, pch = 16, col = "red", bg = "white", ...)
## S3 method for class 'lmSubsets' plot(x, penalty = "BIC", xlim, ylim_rss, ylim_ic, type_rss = "o", type_ic = "o", main, sub, xlab, ylab_rss, ylab_ic, legend_rss, legend_ic, ann = par("ann"), axes = TRUE, lty_rss = c(1, 3), pch_rss = c(16, 21), col_rss = "black", bg_rss = "white", lty_ic = c(1, 3), pch_ic = c(16, 21), col_ic = "red", bg_ic = "white", ...) ## S3 method for class 'lmSelect' plot(x, xlim, ylim, type = "o", main, sub, xlab, ylab, legend, ann = par("ann"), axes = TRUE, lty = 1, pch = 16, col = "red", bg = "white", ...)
x |
|
penalty |
the information criterion |
xlim , ylim , ylim_rss , ylim_ic
|
x and y limits |
type , type_rss , type_ic
|
type of plot |
main , sub
|
main and sub-title |
xlab , ylab , ylab_rss , ylab_ic
|
axis titles |
legend , legend_rss , legend_ic
|
plot legend |
ann |
annotate plot |
axes |
plot axes |
lty , lty_rss , lty_ic
|
line type |
pch , pch_rss , pch_ic
|
plotting character |
col , col_rss , col_ic
|
color |
bg , bg_rss , bg_ic
|
background color |
... |
further graphical parameters |
invisible(x)
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
plot()
for the S3 generic
## load data data("AirPollution", package = "lmSubsets") ################# ## lmSubsets ## ################# lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 5) plot(lm_all) ################ ## lmSelect ## ################ lm_best <- lmSelect(mortality ~ ., data = AirPollution, nbest = 20) plot(lm_best)
## load data data("AirPollution", package = "lmSubsets") ################# ## lmSubsets ## ################# lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 5) plot(lm_all) ################ ## lmSelect ## ################ lm_best <- lmSelect(mortality ~ ., data = AirPollution, nbest = 20) plot(lm_best)
Generic function for refitting a model on a subset or reweighted data set.
refit(object, ...)
refit(object, ...)
object |
an object to be refitted |
... |
forwarded arguments |
The refit
generic is a new function for refitting a certain
model object on multiple versions of a data set (and is hence
different from update
). Applications refit models after some
kind of model selection, e.g., variable subset selection,
partitioning, reweighting, etc.
The generic is similar to the one provided in modeltools and fxregime (and should fulfill the same purpose). To avoid dependencies, it is also provided here.
"lm"
—the refitted model
Fit the specified submodel and return the obtained "lm"
object.
## S3 method for class 'lmSubsets' refit(object, size, best = 1, ...) ## S3 method for class 'lmSelect' refit(object, best = 1, ...)
## S3 method for class 'lmSubsets' refit(object, size, best = 1, ...) ## S3 method for class 'lmSelect' refit(object, best = 1, ...)
object |
|
size |
|
best |
|
... |
ignored |
"lm"
—the fitted model
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
refit()
for the S3 generic
## load data data("AirPollution", package = "lmSubsets") ## fit subsets lm_all <- lmSubsets(mortality ~ ., data = AirPollution) ## refit best model lm5 <- refit(lm_all, size = 5) summary(lm5)
## load data data("AirPollution", package = "lmSubsets") ## fit subsets lm_all <- lmSubsets(mortality ~ ., data = AirPollution) ## refit best model lm5 <- refit(lm_all, size = 5) summary(lm5)
Return the residuals for the specified submodel.
## S3 method for class 'lmSubsets' residuals(object, size, best = 1, ...) ## S3 method for class 'lmSelect' residuals(object, best = 1, ...)
## S3 method for class 'lmSubsets' residuals(object, size, best = 1, ...) ## S3 method for class 'lmSelect' residuals(object, best = 1, ...)
object |
|
size |
|
best |
|
... |
ignored |
double[]
—the residuals
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
residuals()
for the S3 generic
Return the residual standard deviation for the specified submodels.
## S3 method for class 'lmSubsets' sigma(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' sigma(object, best = 1, ..., na.rm = TRUE, drop = TRUE)
## S3 method for class 'lmSubsets' sigma(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' sigma(object, best = 1, ..., na.rm = TRUE, drop = TRUE)
object |
|
size |
|
best |
|
... |
ignored |
na.rm |
|
drop |
|
double[]
—the residual standard deviations
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
sigma()
for the S3 generic
Evaluate summary statistics for the selected submodels.
## S3 method for class 'lmSubsets' summary(object, ..., na.rm = TRUE) ## S3 method for class 'lmSelect' summary(object, ..., na.rm = TRUE)
## S3 method for class 'lmSubsets' summary(object, ..., na.rm = TRUE) ## S3 method for class 'lmSelect' summary(object, ..., na.rm = TRUE)
object |
|
... |
ignored |
na.rm |
if |
"summary.lmSubsets"
, "summary.lmSelect"
—a subset
regression summary
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
Return the variable names for the specified submodels.
## S3 method for class 'lmSubsets' variable.names(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' variable.names(object, best = 1, ..., na.rm = TRUE, drop = TRUE)
## S3 method for class 'lmSubsets' variable.names(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE) ## S3 method for class 'lmSelect' variable.names(object, best = 1, ..., na.rm = TRUE, drop = TRUE)
object |
|
size |
|
best |
|
... |
ignored |
na.rm |
|
drop |
|
logical[,]
, "data.frame"
—the variable names
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
variable.names()
for the S3
generic
Return the variance-covariance matrix for the specified submodel.
## S3 method for class 'lmSubsets' vcov(object, size, best = 1, ...) ## S3 method for class 'lmSelect' vcov(object, best = 1, ...)
## S3 method for class 'lmSubsets' vcov(object, size, best = 1, ...) ## S3 method for class 'lmSelect' vcov(object, best = 1, ...)
object |
|
size |
|
best |
|
... |
ignored |
double[,]
—the variance-covariance matrix
lmSubsets()
for all-subsets
regression
lmSelect()
for best-subset
regression
vcov()
for the S3 generic