Package 'lmSubsets'

Title:	Exact Variable-Subset Selection in Linear Regression
Description:	Exact and approximation algorithms for variable-subset selection in ordinary linear regression models. Either compute all submodels with the lowest residual sum of squares, or determine the single-best submodel according to a pre-determined statistical criterion. Hofmann et al. (2020) <doi:10.18637/jss.v093.i03>.
Authors:	Marc Hofmann [aut, cre], Cristian Gatu [aut], Erricos J. Kontoghiorghes [aut], Ana Colubi [aut], Achim Zeileis [aut], Martin Moene [cph] (for the GSL Lite library), Microsoft Corporation [cph] (for the GSL Lite library), Free Software Foundation, Inc. [cph] (for snippets from the GNU ISO C++ Library)
Maintainer:	Marc Hofmann <[email protected]>
License:	GPL (>= 3)
Version:	0.5-2
Built:	2025-02-17 06:55:57 UTC
Source:	CRAN

Help Index

Package lmSubsets
Extract AIC values from a subset regression
Air pollution and mortality
Extract BIC values from a subset regression
Extract the coefficients from a subset regression
Extract the deviance from a subset regression
Extract the fitted values from a subset regression
Extract a formula from a subset regression
Temperature observations and numerical weather predictions for Innsbruck
Heatmap of a subset regression
Best-subset regression
Best-subset regression
Best-subset regression
Best-subset regression
All-subsets regression
All-subsets regression
All-subsets regression
Extract the log-likelihood from a subset regression
Model response
Extract the model response from a subset regression
Extract the model frame from a subset regression
Extract a model matrix from a subset regression
Plot a subset regression
Refitting models
Refit a subset regression
Extract the residuals from all-subsets regression
Extract the residual standard deviation from a subset regression
Summarize a subset regression
Extract variable names from a subset regression
Extract the variance-covariance matrix from a subset regression

Package `lmSubsets`

Description

Variable-subset selection in ordinary linear regression.

Author(s)

Marc Hofmann ([email protected])
Cristian Gatu ([email protected])
Erricos J. Kontoghiorghes ([email protected])
Ana Colubi ([email protected])
Achim Zeileis ([email protected])

References

Hofmann M, Gatu C, Kontoghiorghes EJ, Colubi A, Zeileis A (2020). lmSubsets: Exact variable-subset selection in linear regression for R. Journal of Statistical Software, 93, 1–21. doi:10.18637/jss.v093.i03.

Hofmann M, Gatu C, Kontoghiorghes EJ (2007). Efficient algorithms for computing the best subset regression models for large-scale problems. Computational Statistics & Data Analysis, 52, 16–29. doi:10.1016/j.csda.2007.03.017.

Gatu C, Kontoghiorghes EJ (2006). Branch-and-bound algorithms for computing the best subset regression models. Journal of Computational and Graphical Statistics, 15, 139–156. doi:10.1198/106186006x100290.

Extract AIC values from a subset regression

Description

Evaluate Akaike's information criterion (AIC) for the specified submodels.

Usage

## S3 method for class 'lmSubsets'
AIC(object, size, best = 1, ..., k = 2, na.rm = TRUE, drop = TRUE)

## S3 method for class 'lmSelect'
AIC(object, best = 1, ..., k = 2, na.rm = TRUE, drop = TRUE)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer[]`—the submodel sizes
`best`	`integer[]`—the submodel positions
`...`	ignored
`k`	`double`—the penalty per model parameter
`na.rm`	`logical`—if `TRUE`, remove `NA` entries
`drop`	`logical`—if `TRUE`, simplify structure

Value

double[]—the AIC values
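
The role of the penalty `k` can be illustrated with base R alone. The sketch below does not call the lmSubsets methods; it uses `stats::AIC()` on an ordinary `lm` fit to the built-in `mtcars` data (a data set chosen purely for illustration) and shows that `k = log(n)` reproduces the BIC. The methods documented here are assumed to follow the same convention, since they share the `k = 2` default.

```r
## illustration with stats::AIC(), not the lmSubsets methods
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)

aic_default <- AIC(fit)              # penalty k = 2 per parameter
bic_via_k   <- AIC(fit, k = log(n))  # penalty k = log(n) per parameter

## the second value coincides with the BIC
all.equal(bic_via_k, BIC(fit))
```

Because `log(n)` exceeds 2 once `n` is at least 8, the BIC penalizes additional parameters more heavily than the default AIC for all but the smallest samples.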

Air pollution and mortality

Description

Data relating air pollution and mortality, frequently used for illustrations in ridge regression and related tasks.

Usage

data(AirPollution)

Format

A data frame containing 60 observations on 16 variables.

precipitation: average annual precipitation in inches
temperature1: average January temperature in degrees Fahrenheit
temperature7: average July temperature in degrees Fahrenheit
age: percentage of 1960 SMSA population aged 65 or older
household: average household size
education: median school years completed by those over 22
housing: percentage of housing units which are sound and with all facilities
population: population per square mile in urbanized areas, 1960
noncauc: percentage of non-Caucasian population in urbanized areas, 1960
whitecollar: percentage employed in white collar occupations
income: percentage of families with income < USD 3000
hydrocarbon: relative hydrocarbon pollution potential
nox: relative nitric oxides potential
so2: relative sulphur dioxide potential
humidity: annual average percentage of relative humidity at 13:00
mortality: total age-adjusted mortality rate per 100,000

Source

http://lib.stat.cmu.edu/datasets/pollution

References

McDonald GC, Schwing RC (1973). Instabilities of regression estimates relating air pollution to mortality. Technometrics, 15, 463–482.

Miller AJ (2002). Subset selection in regression. New York: Chapman and Hall.

Examples

## load data (with logs for relative potentials)
data("AirPollution", package = "lmSubsets")
for (i in 12:14)  AirPollution[[i]] <- log(AirPollution[[i]])

## fit subsets
lm_all <- lmSubsets(mortality ~ ., data = AirPollution)
plot(lm_all)

## refit best model
lm6 <- refit(lm_all, size = 6)
summary(lm6)

Extract BIC values from a subset regression

Description

Evaluate the Bayesian information criterion (BIC) for the specified submodels.

Usage

## S3 method for class 'lmSubsets'
BIC(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE)

## S3 method for class 'lmSelect'
BIC(object, best = 1, ..., na.rm = TRUE, drop = TRUE)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer[]`—the submodel sizes
`best`	`integer[]`—the submodel positions
`...`	ignored
`na.rm`	`logical`—if `TRUE`, remove `NA` entries
`drop`	`logical`—if `TRUE`, simplify structure

Value

double[]—the BIC values

Extract the coefficients from a subset regression

Description

Return the coefficients for the specified submodels.

Usage

## S3 method for class 'lmSubsets'
coef(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE)

## S3 method for class 'lmSelect'
coef(object, best = 1, ..., na.rm = TRUE, drop = TRUE)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer[]`—the submodel sizes
`best`	`integer[]`—the submodel positions
`...`	ignored
`na.rm`	`logical`—if `TRUE`, remove `NA` entries
`drop`	`logical`—if `TRUE`, simplify structure

Value

double[,], "data.frame"—the submodel coefficients

Extract the deviance from a subset regression

Description

Return the deviance for the specified submodels.

Usage

## S3 method for class 'lmSubsets'
deviance(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE)

## S3 method for class 'lmSelect'
deviance(object, best = 1, ..., na.rm = TRUE, drop = TRUE)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer[]`—the submodel sizes
`best`	`integer[]`—the submodel positions
`...`	ignored
`na.rm`	`logical`—if `TRUE`, remove `NA` entries
`drop`	`logical`—if `TRUE`, simplify structure

Value

double[], "data.frame"—the submodel deviances

Extract the fitted values from a subset regression

Description

Return the fitted values for the specified submodel.

Usage

## S3 method for class 'lmSubsets'
fitted(object, size, best = 1, ...)

## S3 method for class 'lmSelect'
fitted(object, best = 1, ...)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer`—the submodel size
`best`	`integer`—the submodel position
`...`	ignored

Value

double[]—the fitted values

Extract a formula from a subset regression

Description

Return the formula for the specified submodel.

Usage

## S3 method for class 'lmSubsets'
formula(x, size, best = 1, ...)

## S3 method for class 'lmSelect'
formula(x, best, ...)

Arguments

`x`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer`—the submodel size
`best`	`integer`—the submodel position
`...`	ignored

Value

"formula"—the submodel formula

Temperature observations and numerical weather predictions for Innsbruck

Description

00UTC temperature observations and corresponding 24-hour reforecast ensemble means from the Global Ensemble Forecast System (GEFS, Hamill et al. 2013) for SYNOP station Innsbruck Airport (11120; 47.260, 11.357) from 2011-01-01 to 2015-12-31.

Usage

data(IbkTemperature)

Format

A data frame containing 1824 daily observations/forecasts for 42 variables. The first column (temp) contains temperature observations at 00UTC (coordinated universal time), columns 2–37 are 24-hour lead time GEFS reforecast ensemble means for different variables (see below). Columns 38–42 are deterministic time trend/season patterns.

temp: observed temperature at Innsbruck Airport (deg $C$ )
tp: total accumulated precipitation ( $kg~m^{-2}$ )
t2m: temperature at 2 meters ( $K$ )
u10m: U-component of wind at 10 meters ( $m~s^{-1}$ )
v10m: V-component of wind at 10 meters ( $m~s^{-1}$ )
u80m: U-component of wind at 80 meters ( $m~s^{-1}$ )
v80m: V-component of wind at 80 meters ( $m~s^{-1}$ )
cape: convective available potential energy ( $J~kg^{-1}$ )
ci: convective inhibition ( $J~kg^{-1}$ )
sdlwrf: surface downward long-wave radiation flux ( $W~m^{-2}$ )
sdswrf: surface downward short-wave radiation flux ( $W~m^{-2}$ )
sulwrf: surface upward long-wave radiation flux ( $W~m^{-2}$ )
suswrf: surface upward short-wave radiation flux ( $W~m^{-2}$ )
ghf: ground heat flux ( $W~m^{-2}$ )
slhnf: surface latent heat net flux ( $W~m^{-2}$ )
sshnf: surface sensible heat net flux ( $W~m^{-2}$ )
mslp: mean sea level pressure ( $Pa$ )
psfc: surface pressure ( $Pa$ )
pw: precipitable water ( $kg~m^{-2}$ )
vsmc: volumetric soil moisture content (fraction)
sh2m: specific humidity at 2 meters ( $kg~kg^{-1}$ )
tcc: total cloud cover (percent)
tcic: total column-integrated condensate ( $kg~m^{-2}$ )
tsfc: skin temperature ( $K$ )
tmax2m: maximum temperature ( $K$ )
tmin2m: minimum temperature ( $K$ )
st: soil temperature (0–10 cm below surface) ( $K$ )
ulwrf: upward long-wave radiation flux ( $W~m^{-2}$ )
wr: water runoff ( $kg~m^{-2}$ )
we: water equivalent of accumulated snow depth ( $kg~m^{-2}$ )
wp: wind mixing energy ( $J$ )
w850: vertical velocity at 850 hPa surface ( $Pa~s^{-1}$ )
t2pvu: temperature on 2 PVU surface ( $K$ )
p2pvu: pressure on 2 PVU surface ( $Pa$ )
u2pvu: U-component of wind on 2 PVU surface ( $m~s^{-1}$ )
v2pvu: V-component of wind on 2 PVU surface ( $m~s^{-1}$ )
pv: potential vorticity on 320 K isentrope ( $K~m^{2}~kg^{-1}~s^{-1}$ )
time: time in years
sin, cos: sine and cosine component of annual harmonic pattern
sin2, cos2: sine and cosine component of bi-annual harmonic pattern

Source

Observations: https://www.ogimet.com/synops.phtml.en. Reforecasts: https://psl.noaa.gov/forecasts/reforecast2/.

References

Hamill TM, Bates GT, Whitaker JS, Murray DR, Fiorino M, Galarneau Jr. TJ, Zhu Y, Lapenta W (2013). NOAA's second-generation global medium-range ensemble reforecast data set. Bulletin of the American Meteorological Society, 94(10), 1553–1565. doi:10.1175/BAMS-D-12-00014.1.

Examples

## load data and omit missing values
data("IbkTemperature", package = "lmSubsets")
IbkTemperature <- na.omit(IbkTemperature)

## fit a simple climatological model for the temperature
## with a linear trend and annual/bi-annual harmonic seasonal pattern
CLIM <- lm(temp ~ time + sin + cos + sin2 + cos2,
  data = IbkTemperature)

## fit a simple MOS with 2-meter temperature forecast in addition
## to the climatological model
MOS0 <- lm(temp ~ t2m + time + sin + cos + sin2 + cos2,
  data = IbkTemperature)

## graphical comparison and MOS summary
plot(temp ~ time, data = IbkTemperature, type = "l", col = "darkgray")
lines(fitted(MOS0) ~ time, data = IbkTemperature, col = "darkred")
lines(fitted(CLIM) ~ time, data = IbkTemperature, lwd = 2)
MOS0

## best subset selection of remaining variables for the MOS
## (i.e., forcing the regressors of MOS0 into the model)
MOS1_all <- lmSubsets(temp ~ ., data = IbkTemperature,
  include = c("t2m", "time", "sin", "cos", "sin2", "cos2"))
plot(MOS1_all)
image(MOS1_all, size = 8:20)
## -> Note that soil temperature and maximum temperature are selected
## in addition to the 2-meter temperature

## best subset selection of all variables
MOS2_all <- lmSubsets(temp ~ ., data = IbkTemperature)
plot(MOS2_all)
image(MOS2_all, size = 2:20)
## -> Note that 2-meter temperature is not selected into the best
## BIC model but soil-temperature (and maximum temperature) are used instead

## refit the best BIC subset selections
MOS1 <- refit(lmSelect(MOS1_all))
MOS2 <- refit(lmSelect(MOS2_all))

## compare BIC
BIC(CLIM, MOS0, MOS1, MOS2)

## compare RMSE
sqrt(sapply(list(CLIM, MOS0, MOS1, MOS2), deviance)/
  nrow(IbkTemperature))

## compare coefficients
cf0 <- coef(CLIM)
cf1 <- coef(MOS0)
cf2 <- coef(MOS1)
cf3 <- coef(MOS2)
names(cf2) <- gsub("^x", "", names(coef(MOS1)))
names(cf3) <- gsub("^x", "", names(coef(MOS2)))
nam <- unique(c(names(cf0), names(cf1), names(cf2), names(cf3)))
cf <- matrix(NA, nrow = length(nam), ncol = 4,
  dimnames = list(nam, c("CLIM", "MOS0", "MOS1", "MOS2")))
cf[names(cf0), 1] <- cf0
cf[names(cf1), 2] <- cf1
cf[names(cf2), 3] <- cf2
cf[names(cf3), 4] <- cf3
print(round(cf, digits = 3), na.print = "")

Heatmap of a subset regression

Description

Plot a heatmap of the specified submodels.

Usage

## S3 method for class 'lmSubsets'
image(x, size = NULL, best = 1, which = NULL, hilite, hilite_penalty,
      main, sub, xlab = NULL, ylab, ann = par("ann"), axes = TRUE,
      col = c("gray40", "gray90"), lab = "lab",
      col_hilite = cbind("red", "pink"), lab_hilite = "lab",
      pad_size = 3, pad_best = 1, pad_which = 3, axis_pos = -4,
      axis_tck = -4, axis_lab = -10, ...)

## S3 method for class 'lmSelect'
image(x, best = NULL, which = NULL, hilite, hilite_penalty,
      main, sub = NULL, xlab = NULL, ylab, ann = par("ann"),
      axes = TRUE, col = c("gray40", "gray90"), lab = "lab",
      col_hilite = cbind("red", "pink"), lab_hilite = "lab",
      pad_best = 2, pad_which = 2, axis_pos = -4, axis_tck = -4,
      axis_lab = -10, ...)

Arguments

`x`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`, `best`	submodels to be plotted
`which`	regressors to be plotted
`hilite`, `hilite_penalty`	submodels to be highlighted
`main`, `sub`, `xlab`, `ylab`	main, sub-, and axis titles
`ann`	annotate plot
`axes`	plot axes
`col`, `lab`	color and label style
`col_hilite`, `lab_hilite`	highlighting style
`pad_size`, `pad_best`, `pad_which`	padding
`axis_pos`, `axis_tck`, `axis_lab`	position of axes, tick length, and position of labels
`...`	ignored

Value

invisible(x)

Examples

## data
data("AirPollution", package = "lmSubsets")


#################
##  lmSubsets  ##
#################

lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 20)

## heatmap
image(lm_all, best = 1:3)

## highlight 5 best (BIC)
image(lm_all, best = 1:3, hilite = 1:5, hilite_penalty = "BIC")


################
##  lmSelect  ##
################

## default criterion: BIC
lm_best <- lmSelect(lm_all)

## highlight 5 best (AIC)
image(lm_best, hilite = 1:5, hilite_penalty = "AIC")

## axis labels
image(lm_best, lab = c("bold(lab)", "lab"), hilite = 1,
      lab_hilite = "underline(lab)")

Best-subset regression

Description

Best-variable-subset selection in ordinary linear regression.

Usage

lmSelect(formula, ...)

## Default S3 method:
lmSelect(formula, data, subset, weights, na.action,
         model = TRUE, x = FALSE, y = FALSE, contrasts = NULL,
         offset, ...)

Arguments

`formula`, `data`, `subset`, `weights`, `na.action`, `model`, `x`, `y`, `contrasts`, `offset`	standard formula interface
`...`	forwarded to `lmSelect_fit()`

Details

The lmSelect() generic provides various methods to conveniently specify the regressor and response variables. The standard formula interface (see lm()) can be used, or the model information can be extracted from an already fitted "lm" object. The model matrix and response can also be passed in directly.

After processing the arguments, the call is forwarded to lmSelect_fit().

Value

"lmSelect"—a list containing the components returned by lmSelect_fit()

Further components include call, na.action, weights, offset, contrasts, xlevels, terms, mf, x, and y. See lm() for more information.

Examples

## load data
data("AirPollution", package = "lmSubsets")


###################
##  basic usage  ##
###################

## fit 20 best subsets (BIC)
lm_best <- lmSelect(mortality ~ ., data = AirPollution, nbest = 20)
lm_best

## summary statistics
summary(lm_best)

## visualize
plot(lm_best)


########################
##  custom criterion  ##
########################

## the same as above, but with a custom criterion:
M <- nrow(AirPollution)

ll <- function (rss) {
  -M/2 * (log(2 * pi) - log(M) + log(rss) + 1)
}

aic <- function (size, rss, k = 2) {
  -2 * ll(rss) + k * (size + 1)
}

bic <- function (size, rss) {
  aic(size, rss, k = log(M))
}

lm_cust <- lmSelect(mortality ~ ., data = AirPollution,
                    penalty = bic, nbest = 20)
lm_cust

Best-subset regression

Description

Low-level interface to best-variable-subset selection in ordinary linear regression.

Usage

lmSelect_fit(x, y, weights = NULL, offset = NULL, include = NULL,
             exclude = NULL, penalty = "BIC", tolerance = 0,
             nbest = 1, ..., pradius = NULL)

Arguments

`x`	`double[,]`—the model matrix
`y`	`double[]`—the model response
`weights`	`double[]`—the model weights
`offset`	`double[]`—the model offset
`include`	`logical[]`, `integer[]`, `character[]`—the regressors to force in
`exclude`	`logical[]`, `integer[]`, `character[]`—the regressors to force out
`penalty`	`double`, `character`, `"function"`—the penalty per model parameter
`tolerance`	`double`—the approximation tolerance
`nbest`	`integer`—the number of best subsets
`...`	ignored
`pradius`	`integer`—the preordering radius

Details

The best variable-subset model is determined, where the "best" model is the one with the lowest information criterion value. The information criterion belongs to the AIC family.

The regression data is specified with the x, y, weights, and offset parameters. See lm.fit() for further details.

To force regressors into or out of the regression, a list of regressors can be passed as an argument to the include or exclude parameters, respectively.

The information criterion is specified with the penalty parameter. Accepted values are "AIC", "BIC", or a "numeric" value representing the penalty-per-model-parameter. A custom selection criterion may be specified by passing an R function as an argument. The expected signature is function (size, rss), where size is the number of predictors (including the intercept, if any), and rss is the residual sum of squares. The function must be non-decreasing in both parameters.
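
The monotonicity requirement on a custom criterion can be checked numerically before the function is passed as `penalty`. The following base-R sketch is only a sanity check on a grid, not a proof; the sample size `n`, the grid values, and the helper `is_nondecreasing()` are assumptions made for this illustration and are not part of the package.

```r
## hypothetical sample size, used only for this illustration
n <- 60

## candidate criterion with the expected signature function (size, rss)
bic_like <- function (size, rss) {
  n * log(rss / n) + log(n) * size
}

## grid check: TRUE if f never decreases along either argument
is_nondecreasing <- function (f, sizes, rss) {
  vals <- outer(sizes, rss, f)   # rows: sizes, columns: rss values
  d_size <- diff(vals)           # differences between successive sizes
  d_rss  <- diff(t(vals))        # differences between successive rss values
  all(d_size >= 0) && all(d_rss >= 0)
}

sizes <- 1:16
rss   <- seq(1e3, 1e5, length.out = 50)

ok_bic <- is_nondecreasing(bic_like, sizes, rss)

## counterexample: rewards larger models, so it decreases in size
bad <- function (size, rss) rss - 1e3 * size
ok_bad <- is_nondecreasing(bad, sizes, rss)
```

A criterion that fails such a check, like `bad` above, does not meet the stated requirement and should not be used as a custom penalty.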

An approximation tolerance can be specified to speed up the search.

The number of returned submodels is determined by the nbest parameter.

The preordering radius is given with the pradius parameter.

Value

A list with the following components:

`NOBS`	`integer`—number of observations in model (before `weights` processing)
`nobs`	`integer`—number of observations in model (after `weights` processing)
`nvar`	`integer`—number of regressors in model
`weights`	`double[]`—model weights
`intercept`	`logical`—is `TRUE` if model contains an intercept term, `FALSE` otherwise
`include`	`logical[]`—regressors forced into the regression
`exclude`	`logical[]`—regressors forced out of the regression
`size`	`integer[]`—subset sizes
`ic`	information criterion
`tolerance`	`double`—approximation tolerance
`nbest`	`integer`—number of best subsets
`submodel`	`"data.frame"`—submodel information
`subset`	`"data.frame"`—selected subsets

Examples

data("AirPollution", package = "lmSubsets")

x <- as.matrix(AirPollution[, names(AirPollution) != "mortality"])
y <-           AirPollution[, names(AirPollution) == "mortality"]

f <- lmSelect_fit(x, y)
f

Best-subset regression

Description

Coerce an all-subsets regression.

Usage

## S3 method for class 'lmSubsets'
lmSelect(formula, penalty = "BIC", ...)

Arguments

`formula`	`"lmSubsets"`—an all-subsets regression
`penalty`	`double`, `character`, `"function"`—penalty per model parameter
`...`	ignored

Details

Computes a best-subset regression from an all-subsets regression.
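
Why this coercion is possible can be sketched in base R. For a criterion of the AIC family, the penalty depends on a subset only through its size, so the overall best model must be one of the per-size RSS winners that an all-subsets regression already holds. The RSS values below are invented for illustration, and the formula mirrors the custom-criterion example in the `lmSelect()` help page rather than the package internals.

```r
## invented values: lowest RSS found for each subset size
n    <- 60
size <- 2:6
rss  <- c(98000, 76000, 64000, 62500, 62000)

## BIC as a function of size and RSS (Gaussian log-likelihood)
ll  <- function (rss) -n/2 * (log(2 * pi) - log(n) + log(rss) + 1)
bic <- function (size, rss) -2 * ll(rss) + log(n) * (size + 1)

## evaluate the criterion on the per-size winners and pick the minimum
ic <- bic(size, rss)
best_size <- size[which.min(ic)]
```

With these invented numbers the criterion is minimized at size 4: beyond that point the reduction in RSS no longer offsets the additional penalty of `log(n)` per parameter.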

Value

"lmSelect"—a best-subset regression

Examples

data("AirPollution", package = "lmSubsets")

lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 20)

lm_best <- lmSelect(lm_all)
lm_best

Best-subset regression

Description

Matrix interface to best-variable-subset selection in ordinary linear regression.

Usage

## S3 method for class 'matrix'
lmSelect(formula, y, intercept = TRUE, ...)

Arguments

`formula`	`"matrix"`—the model matrix
`y`	`double[]`—the model response
`intercept`	`logical[]`—if `FALSE`, remove intercept term
`...`	forwarded to `lmSelect.default()`

Details

This is a utility interface. Use the standard formula interface wherever possible.

Value

"lmSelect"—a best-subset regression

All-subsets regression

Description

All-variable-subsets selection in ordinary linear regression.

Usage

lmSubsets(formula, ...)

## Default S3 method:
lmSubsets(formula, data, subset, weights, na.action,
          model = TRUE, x = FALSE, y = FALSE, contrasts = NULL,
          offset, ...)

Arguments

`formula`, `data`, `subset`, `weights`, `na.action`, `model`, `x`, `y`, `contrasts`, `offset`	standard formula interface
`...`	forwarded to `lmSubsets_fit()`

Details

The lmSubsets() generic provides various methods to conveniently specify the regressor and response variables. The standard formula interface (see lm()) can be used, or the model information can be extracted from an already fitted "lm" object. The model matrix and response can also be passed in directly.

After processing the arguments, the call is forwarded to lmSubsets_fit().

Value

"lmSubsets"—a list containing the components returned by lmSubsets_fit()

Further components include call, na.action, weights, offset, contrasts, xlevels, terms, mf, x, and y. See lm() for more information.

Examples

## load data
data("AirPollution", package = "lmSubsets")


###################
##  basic usage  ##
###################

## canonical example: fit all subsets
lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 5)
lm_all

## plot RSS and BIC
plot(lm_all)

## summary statistics
summary(lm_all)


############################
##  forced in-/exclusion  ##
############################

lm_force <- lmSubsets(lm_all, include = c("nox", "so2"),
                      exclude = "whitecollar")
lm_force

All-subsets regression

Description

Low-level interface to all-variable-subsets selection in ordinary linear regression.

Usage

lmSubsets_fit(x, y, weights = NULL, offset = NULL, include = NULL,
              exclude = NULL, nmin = NULL, nmax = NULL,
              tolerance = 0, nbest = 1, ..., pradius = NULL)

Arguments

`x`	`double[,]`—the model matrix
`y`	`double[]`—the model response
`weights`	`double[]`—the model weights
`offset`	`double[]`—the model offset
`include`	`logical[]`, `integer[]`, `character[]`—the regressors to force in
`exclude`	`logical[]`, `integer[]`, `character[]`—the regressors to force out
`nmin`	`integer`—the minimum number of regressors
`nmax`	`integer`—the maximum number of regressors
`tolerance`	`double[]`—the approximation tolerances
`nbest`	`integer`—the number of best subsets
`...`	ignored
`pradius`	`integer`—the preordering radius

Details

The best variable-subset model for every subset size is determined, where the "best" model is the one with the lowest residual sum of squares (RSS).
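
To make this definition concrete, the following sketch enumerates every subset by brute force on the built-in `mtcars` data and keeps the lowest-RSS model for each size. It only illustrates what is being computed; it is not the algorithm used by the package (the package references describe branch-and-bound strategies), and the helper `rss_of()` and the object `best_by_size` are made up for this example.

```r
## exhaustive search; feasible only for a handful of regressors
vars <- c("wt", "hp", "qsec", "drat")

## RSS of the model mpg ~ <subset> (intercept always included)
rss_of <- function (subset) {
  deviance(lm(reformulate(subset, response = "mpg"), data = mtcars))
}

## for each number of regressors k, the subset with the lowest RSS
## (k counts regressors only, not the intercept)
best_by_size <- lapply(seq_along(vars), function (k) {
  cand <- combn(vars, k, simplify = FALSE)
  rss  <- vapply(cand, rss_of, numeric(1))
  list(size = k, vars = cand[[which.min(rss)]], rss = min(rss))
})

## compact overview
data.frame(
  size = vapply(best_by_size, `[[`, integer(1), "size"),
  rss  = vapply(best_by_size, `[[`, numeric(1), "rss"),
  vars = vapply(best_by_size, function (b) paste(b$vars, collapse = " + "),
                character(1))
)
```

The number of candidate subsets doubles with every additional regressor, which is why exhaustive enumeration of this kind stops being practical quickly.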

The regression data is specified with the x, y, weights, and offset parameters. See lm.fit() for further details.

To force regressors into or out of the regression, a list of regressors can be passed as an argument to the include or exclude parameters, respectively.

The scope of the search can be limited to a range of subset sizes by setting nmin and nmax, the minimum and maximum number of regressors allowed in the regression, respectively.

A tolerance vector can be specified to speed up the search, where tolerance[j] is the approximation tolerance applied to subset models of size j.

The number of submodels returned for each subset size is determined by the nbest parameter.

The preordering radius is given with the pradius parameter.
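
To make the definition of "best" concrete, the following minimal sketch finds the lowest-RSS subset for every subset size by brute force, using only base R. It is not the package's algorithm, which is designed to avoid exhaustive enumeration. The `mtcars` data, the helper `rss_of()`, and the convention that the size counts regressors without the intercept are assumptions of this illustration.

```r
# Brute-force illustration: lowest-RSS subset for each subset size.
data("mtcars")
x <- as.matrix(mtcars[, c("cyl", "disp", "hp", "wt")])
y <- mtcars$mpg

# RSS of the model with an intercept plus the given columns of x
# (rss_of is a hypothetical helper, not part of lmSubsets)
rss_of <- function(cols) {
  xs <- cbind("(Intercept)" = 1, x[, cols, drop = FALSE])
  sum(lm.fit(xs, y)$residuals^2)
}

# For each size k, enumerate all subsets of k regressors and keep the best
best_by_size <- lapply(seq_len(ncol(x)), function(k) {
  subsets <- combn(colnames(x), k, simplify = FALSE)
  rss <- vapply(subsets, rss_of, numeric(1))
  list(size = k, vars = subsets[[which.min(rss)]], rss = min(rss))
})

# One line per subset size: size, RSS, selected regressors
for (b in best_by_size) {
  cat(b$size, format(b$rss), paste(b$vars, collapse = " + "), "\n")
}
```

Enumeration grows exponentially with the number of regressors, so it is only practical for very small problems.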

Value

A list with the following components:

`NOBS`	`integer`—number of observations in model (before `weights` processing)
`nobs`	`integer`—number of observations in model (after `weights` processing)
`nvar`	`integer`—number of regressors in model
`weights`	`double[]`—model weights
`intercept`	`logical`—is `TRUE` if model contains an intercept term, `FALSE` otherwise
`include`	`logical[]`—regressors forced into the regression
`exclude`	`logical[]`—regressors forced out of the regression
`size`	`integer[]`—subset sizes
`tolerance`	`double[]`—approximation tolerances
`nbest`	`integer`—number of best subsets
`submodel`	`"data.frame"`—submodel information
`subset`	`"data.frame"`—variable subsets

Examples

data("AirPollution", package = "lmSubsets")

x <- as.matrix(AirPollution[, names(AirPollution) != "mortality"])
y <-           AirPollution[, names(AirPollution) == "mortality"]

f <- lmSubsets_fit(x, y)
f
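
As a further sketch, the search can be narrowed with the `nmax` and `nbest` parameters described above. The call uses only arguments listed in the Usage section; the `requireNamespace()` guard and the chosen values (`nmax = 5`, `nbest = 3`) are assumptions made so the snippet degrades gracefully when the package is not installed.

```r
# Sketch: limit the subset size to nmax = 5 and keep the 3 best
# submodels per size. Skipped entirely if lmSubsets is not installed.
f <- NULL
if (requireNamespace("lmSubsets", quietly = TRUE)) {
  data("AirPollution", package = "lmSubsets")
  x <- as.matrix(AirPollution[, names(AirPollution) != "mortality"])
  y <- AirPollution[, "mortality"]
  f <- lmSubsets::lmSubsets_fit(x, y, nmax = 5, nbest = 3)
  print(f$submodel)  # submodel information, see the Value section
}
```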

All-subsets regression

Description

Matrix interface to all-variable-subsets selection in ordinary linear regression.

Usage

## S3 method for class 'matrix'
lmSubsets(formula, y, intercept = TRUE, ...)

Arguments

`formula`	`"matrix"`—the model matrix
`y`	`double[]`—the model response
`intercept`	`logical`—if `FALSE`, remove intercept term
`...`	forwarded to `lmSubsets.default()`

Details

This is a utility interface. Use the standard formula interface wherever possible.

Value

"lmSubsets"—an all-subsets regression

Examples

data("AirPollution", package = "lmSubsets")

x <- as.matrix(AirPollution)

lm_mat <- lmSubsets(x, y = "mortality")
lm_mat

Extract the log-likelihood from a subset regression

Description

Return the log-likelihood of the specified submodels.

Usage

## S3 method for class 'lmSubsets'
logLik(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE)

## S3 method for class 'lmSelect'
logLik(object, best = 1, ..., na.rm = TRUE, drop = TRUE)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer[]`—the submodel sizes
`best`	`integer[]`—the submodel positions
`...`	ignored
`na.rm`	`logical`—if `TRUE`, remove `NA` entries
`drop`	`logical`—if `TRUE`, simplify structure

Value

double[]—the log-likelihoods
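
For orientation, the Gaussian log-likelihood of a linear model can be computed directly from its residual sum of squares. The sketch below checks the formula against `stats::logLik()` on an ordinary unweighted `lm` fit. That the methods documented here use the same definition is an assumption, and the `mtcars` model is purely illustrative.

```r
# Gaussian log-likelihood of an unweighted linear model, from its RSS
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)       # number of observations
rss <- deviance(fit)   # residual sum of squares

ll_manual <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
ll_stats  <- as.numeric(logLik(fit))

all.equal(ll_manual, ll_stats)
```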

Model response

Description

Extract the model response.

Usage

model_response(data, ...)

## Default S3 method:
model_response(data, type = "any", ...)

Arguments

`data`	an object
`type`	`character`—the return type
`...`	further arguments

Details

The default method simply forwards the call to model.response().

Value

double[]—the model response
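
Since the default method forwards to `model.response()`, its behaviour can be sketched with base R alone. The model frame built from `mtcars` is an illustrative assumption.

```r
# Extract the response from a model frame, as model.response() does
mf <- model.frame(mpg ~ wt + hp, data = mtcars)
y  <- model.response(mf, type = "any")

head(y)  # named numeric vector holding the response (mpg)
```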

Extract the model response from a subset regression

Description

Return the model response.

Usage

## S3 method for class 'lmSubsets'
model_response(data, ...)

## S3 method for class 'lmSelect'
model_response(data, ...)

Arguments

`data`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`...`	ignored

Value

double[]—the model response

Extract the model frame from a subset regression

Description

Return the model frame.

Usage

## S3 method for class 'lmSubsets'
model.frame(formula, ...)

## S3 method for class 'lmSelect'
model.frame(formula, ...)

Arguments

`formula`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`...`	forwarded to `model.frame()`

Value

"data.frame"—the model frame

Extract a model matrix from a subset regression

Description

Return the model matrix for the specified submodel.

Usage

## S3 method for class 'lmSubsets'
model.matrix(object, size, best = 1, ...)

## S3 method for class 'lmSelect'
model.matrix(object, best, ...)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer`—the submodel size
`best`	`integer`—the submodel position
`...`	forwarded to `model.frame()`

Value

double[,]—the model matrix

Plot a subset regression

Description

Plot the deviance of the selected submodels, as well as a specified information criterion.

Usage

## S3 method for class 'lmSubsets'
plot(x, penalty = "BIC", xlim, ylim_rss, ylim_ic, type_rss = "o",
     type_ic = "o", main, sub, xlab, ylab_rss, ylab_ic, legend_rss,
     legend_ic, ann = par("ann"), axes = TRUE, lty_rss = c(1, 3),
     pch_rss = c(16, 21), col_rss = "black", bg_rss = "white",
     lty_ic = c(1, 3), pch_ic = c(16, 21), col_ic = "red",
     bg_ic = "white", ...)

## S3 method for class 'lmSelect'
plot(x, xlim, ylim, type = "o", main, sub, xlab, ylab, legend,
     ann = par("ann"), axes = TRUE, lty = 1, pch = 16, col = "red",
     bg = "white", ...)

Arguments

`x`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`penalty`	the information criterion
`xlim`, `ylim`, `ylim_rss`, `ylim_ic`	x and y limits
`type`, `type_rss`, `type_ic`	type of plot
`main`, `sub`	main and sub-title
`xlab`, `ylab`, `ylab_rss`, `ylab_ic`	axis titles
`legend`, `legend_rss`, `legend_ic`	plot legend
`ann`	annotate plot
`axes`	plot axes
`lty`, `lty_rss`, `lty_ic`	line type
`pch`, `pch_rss`, `pch_ic`	plotting character
`col`, `col_rss`, `col_ic`	color
`bg`, `bg_rss`, `bg_ic`	background color
`...`	further graphical parameters

Value

invisible(x)

Examples

## load data
data("AirPollution", package = "lmSubsets")


#################
##  lmSubsets  ##
#################

lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 5)
plot(lm_all)


################
##  lmSelect  ##
################

lm_best <- lmSelect(mortality ~ ., data = AirPollution, nbest = 20)
plot(lm_best)

Refitting models

Description

Generic function for refitting a model on a subset or reweighted data set.

Usage

refit(object, ...)

Arguments

`object`	an object to be refitted
`...`	forwarded arguments

Details

The refit generic is a new function for refitting a model object on multiple versions of a data set, and hence differs from update(). Typical applications refit models after some kind of model selection, e.g., variable-subset selection, partitioning, or reweighting.

The generic is similar to the one provided in modeltools and fxregime (and should fulfill the same purpose). To avoid dependencies, it is also provided here.

Value

"lm"—the refitted model

Refit a subset regression

Description

Fit the specified submodel and return the obtained "lm" object.

Usage

## S3 method for class 'lmSubsets'
refit(object, size, best = 1, ...)

## S3 method for class 'lmSelect'
refit(object, best = 1, ...)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer`—the submodel size
`best`	`integer`—the submodel position
`...`	ignored

Value

"lm"—the fitted model

Examples

## load data
data("AirPollution", package = "lmSubsets")

## fit subsets
lm_all <- lmSubsets(mortality ~ ., data = AirPollution)

## refit best model
lm5 <- refit(lm_all, size = 5)
summary(lm5)
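
Conceptually, refitting amounts to an ordinary `lm()` call restricted to the selected regressors. The sketch below shows that idea with base R only. It is not the package's implementation, and the selected variables `vars` and the `mtcars` data are hypothetical.

```r
# Refit an ordinary linear model on a (hypothetical) selected subset
vars <- c("wt", "hp")                        # assumed result of a subset search
fml  <- reformulate(vars, response = "mpg")  # mpg ~ wt + hp
fit  <- lm(fml, data = mtcars)

summary(fit)
```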

Extract the residuals from all-subsets regression

Description

Return the residuals for the specified submodel.

Usage

## S3 method for class 'lmSubsets'
residuals(object, size, best = 1, ...)

## S3 method for class 'lmSelect'
residuals(object, best = 1, ...)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer`—the submodel size
`best`	`integer`—the submodel position
`...`	ignored

Value

double[]—the residuals

Extract the residual standard deviation from a subset regression

Description

Return the residual standard deviation for the specified submodels.

Usage

## S3 method for class 'lmSubsets'
sigma(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE)

## S3 method for class 'lmSelect'
sigma(object, best = 1, ..., na.rm = TRUE, drop = TRUE)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer[]`—the submodel sizes
`best`	`integer[]`—the submodel positions
`...`	ignored
`na.rm`	`logical`—if `TRUE`, remove `NA` entries
`drop`	`logical`—if `TRUE`, simplify structure

Value

double[]—the residual standard deviations
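
As a reminder of the quantity involved, the residual standard deviation of a linear model is the square root of the RSS divided by the residual degrees of freedom. The sketch checks this against `stats::sigma()` for a plain `lm` fit. That the methods documented here use the same definition is an assumption.

```r
# Residual standard deviation computed from the RSS
fit <- lm(mpg ~ wt + hp, data = mtcars)
rss <- sum(residuals(fit)^2)

sd_manual <- sqrt(rss / df.residual(fit))

all.equal(sd_manual, sigma(fit))
```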

Summarize a subset regression

Description

Evaluate summary statistics for the selected submodels.

Usage

## S3 method for class 'lmSubsets'
summary(object, ..., na.rm = TRUE)

## S3 method for class 'lmSelect'
summary(object, ..., na.rm = TRUE)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`...`	ignored
`na.rm`	if `TRUE`, remove `NA` values

Value

"summary.lmSubsets", "summary.lmSelect"—a subset regression summary

Extract variable names from a subset regression

Description

Return the variable names for the specified submodels.

Usage

## S3 method for class 'lmSubsets'
variable.names(object, size, best = 1, ..., na.rm = TRUE, drop = TRUE)

## S3 method for class 'lmSelect'
variable.names(object, best = 1, ..., na.rm = TRUE, drop = TRUE)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer[]`—the submodel sizes
`best`	`integer[]`—the submodel positions
`...`	ignored
`na.rm`	`logical`—if `TRUE`, remove `NA` entries
`drop`	`logical`—if `TRUE`, simplify structure

Value

logical[,], "data.frame"—the variable names

Extract the variance-covariance matrix from a subset regression

Description

Return the variance-covariance matrix for the specified submodel.

Usage

## S3 method for class 'lmSubsets'
vcov(object, size, best = 1, ...)

## S3 method for class 'lmSelect'
vcov(object, best = 1, ...)

Arguments

`object`	`"lmSubsets"`, `"lmSelect"`—a subset regression
`size`	`integer`—the submodel size
`best`	`integer`—the submodel position
`...`	ignored

Value

double[,]—the variance-covariance matrix
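
For a single ordinary least-squares fit, the variance-covariance matrix of the coefficients is the estimated error variance times the inverse of X'X. The sketch below verifies this against `stats::vcov()` for a plain `lm` fit. The `mtcars` model is illustrative, and that the methods documented here return the same quantity for a submodel is an assumption.

```r
# Variance-covariance matrix of OLS coefficients: s^2 * (X'X)^{-1}
fit <- lm(mpg ~ wt + hp, data = mtcars)
X   <- model.matrix(fit)
s2  <- sum(residuals(fit)^2) / df.residual(fit)  # estimated error variance

vcov_manual <- s2 * solve(crossprod(X))

all.equal(vcov_manual, vcov(fit), check.attributes = FALSE)
```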

Package `lmSubsets`