Title: Efficient Best Subset Selection for GLMs via Branch and Bound Algorithms
Description: Performs efficient and scalable glm best subset selection using a novel implementation of a branch and bound algorithm. To speed up the model fitting process, a range of optimization methods are implemented in 'RcppArmadillo'. Parallel computation is available using 'OpenMP'.
Authors: Jacob Seedorff [aut, cre]
Maintainer: Jacob Seedorff <[email protected]>
License: Apache License (>= 2)
Version: 3.0.1
Built: 2024-11-28 06:55:59 UTC
Source: CRAN
Creates a bar plot with the L0-penalization based variable importance values.
## S3 method for class 'BranchGLMVI' barplot( height, modified = FALSE, horiz = TRUE, decreasing = FALSE, which = "all", las = ifelse(horiz, 1, 2), main = NULL, lab = NULL, ... )
height |
a BranchGLMVI object. |
modified |
a logical value indicating if the modified variable importance values should be plotted. |
horiz |
a logical value to indicate whether bars should be horizontal. |
decreasing |
a logical value to indicate whether variables should be sorted in decreasing order. Can use NA if no ordering is desired. |
which |
which variable importance values to plot, can use a numeric vector of indices, a character vector of names, or "all" for all variables. |
las |
the style of axis labels, see par for more details. |
main |
the overall title for the plot. |
lab |
the title for the axis corresponding to the variable importance values. |
... |
further arguments passed to barplot.default. |
This only produces a plot, nothing is returned.
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
# Doing branch and bound selection
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC", showprogress = FALSE)
# Getting variable importance
VI <- VariableImportance(VS, showprogress = FALSE)
VI
# Plotting variable importance
oldmar <- par("mar")
par(mar = c(4, 6, 3, 1) + 0.1)
barplot(VI)
par(mar = oldmar)
Creates box-and-whisker plots of approximate null distributions for the modified variable importance values.
## S3 method for class 'BranchGLMVI.boot' boxplot( x, which = "all", linecol = "red", linelwd = 2, horizontal = TRUE, lim = NULL, show.names = TRUE, lab = "Modified Variable Importance", main = NULL, las = ifelse(horizontal, 1, 2), ... )
x |
a BranchGLMVI.boot object. |
which |
which approximate null distributions to plot, can use a numeric vector of indices, a character vector of names, or "all" for all variables. The default is to create box-and-whisker plots for each set of variables that are not kept in each model. |
linecol |
the color of the line which indicates the observed modified variable importance values. |
linelwd |
the width of the line which indicates the observed modified variable importance values. |
horizontal |
a logical value indicating if the boxplots should be horizontal. |
lim |
a numeric vector of length 2, giving the coordinates range. |
show.names |
set to TRUE or FALSE to override the defaults on whether an axis label is printed for each group. |
lab |
a label for the axis corresponding to the modified variable importance values. |
main |
a main title for the plot. |
las |
the style of axis labels, see more at par. |
... |
further arguments passed to boxplot.default. |
This only produces a plot, nothing is returned.
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
# Doing branch and bound selection
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC", showprogress = FALSE)
# Getting approximate null distributions
set.seed(40174)
myBoot <- VariableImportance.boot(VS, showprogress = FALSE)
# Plotting boxplots of selected sets of variables
oldmar <- par("mar")
par(mar = c(4, 6, 3, 1) + 0.1)
boxplot(myBoot, las = 1)
par(mar = oldmar)
# Plotting boxplots of selected sets of variables
boxplot(myBoot, las = 1, cex.axis = 0.55)
Fits generalized linear models (GLMs) via RcppArmadillo with the ability to perform some computation in parallel with OpenMP.
BranchGLM(formula, data = NULL, family, link, offset = NULL,
  method = "Fisher", grads = 10, parallel = FALSE, nthreads = 8,
  tol = 1e-06, maxit = NULL, init = NULL, fit = TRUE,
  contrasts = NULL, keepData = TRUE, keepY = TRUE)

BranchGLM.fit(x, y, family, link, offset = NULL, method = "Fisher",
  grads = 10, parallel = FALSE, nthreads = 8, init = NULL,
  maxit = NULL, tol = 1e-06)
formula |
a formula for the model. |
data |
an optional data.frame, list or environment (or object coercible by
as.data.frame to a data.frame), containing the variables in formula.
Neither a matrix nor an array will be accepted. If not found in data, the variables are taken from the environment of the formula. |
family |
the distribution used to model the data, one of "gaussian", "gamma", "binomial", or "poisson". A family object may also be supplied. |
link |
the link used to link the mean structure to the linear predictors. One of
"identity", "logit", "probit", "cloglog", "sqrt", "inverse", or "log". The accepted
links depend on the specified family, see more in details. This only needs to be
supplied if a string is supplied for family. |
offset |
the offset vector, by default the zero vector is used. |
method |
one of "Fisher", "BFGS", or "LBFGS". BFGS and L-BFGS are quasi-Newton methods that are typically faster than Fisher's scoring when there are many covariates (at least 50). |
grads |
a positive integer to denote the number of gradients used to
approximate the inverse information with, only used for method = "LBFGS". |
parallel |
a logical value to indicate if parallelization should be used. |
nthreads |
a positive integer to denote the number of threads used with OpenMP,
only used if parallel = TRUE. |
tol |
a positive number to denote the tolerance used to determine model convergence. |
maxit |
a positive integer to denote the maximum number of iterations performed. The default for Fisher's scoring is 50 and for the other methods the default is 200. |
init |
a numeric vector of initial values for the betas, if not specified then they are automatically selected via linear regression with the transformation specified by the link function. This is ignored for linear regression models. |
fit |
a logical value to indicate whether to fit the model or not. |
contrasts |
see the contrasts.arg argument of model.matrix.default. |
keepData |
a logical value to indicate whether or not to store a copy of
data and the design matrix, the default is TRUE. If this is FALSE, then the
results from this cannot be used inside of VariableSelection. |
keepY |
a logical value to indicate whether or not to store a copy of y,
the default is TRUE. If this is FALSE, then the binomial GLM helper functions
may not work and this cannot be used inside of VariableSelection. |
x |
design matrix used for the fit, must be numeric. |
y |
outcome vector, must be numeric. |
Can use BFGS, L-BFGS, or Fisher's scoring to fit the GLM. BFGS and L-BFGS are
typically faster than Fisher's scoring when there are at least 50 covariates
and Fisher's scoring is typically best when there are fewer than 50 covariates.
This function does not currently support the use of weights. In the special
case of gaussian regression with identity link the method
argument is ignored
and the normal equations are solved directly.
The models are fit in C++ by using Rcpp and RcppArmadillo. In order to help
convergence, each of the methods makes use of a backtracking line-search using
the strong Wolfe conditions to find an adequate step size. There are
three conditions used to determine convergence, the first is whether there is a
sufficient decrease in the negative log-likelihood, the second is whether
the l2-norm of the score is sufficiently small, and the last condition is
whether the change in each of the beta coefficients is sufficiently
small. The tol
argument controls all of these criteria. If the algorithm fails to
converge, then iterations
will be -1.
All observations with any missing values are removed before model fitting.
BranchGLM.fit
can be faster than calling BranchGLM
if the
x matrix and y vector are already available, but doesn't return as much information.
The object returned by BranchGLM.fit
is not of class BranchGLM
, so
all of the methods for BranchGLM
objects such as predict
or
VariableSelection
cannot be used.
The dispersion parameter for binomial and Poisson regression is always fixed to be 1. For gaussian and gamma regression, the MLE of the dispersion parameter is used for the calculation of the log-likelihood and the Pearson estimator of the dispersion parameter is used for the calculation of standard errors for the coefficient estimates.
The binomial family accepts "cloglog", "log", "logit", and "probit" as possible link functions. The gamma and gaussian families accept "identity", "inverse", "log", and "sqrt" as possible link functions. The Poisson family accepts "identity", "log", and "sqrt" as possible link functions.
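As an illustrative sketch of these family/link pairings (using datasets that also appear in the examples below; the particular variable choices are not prescriptive), a binomial model can use a "probit" link while a gamma model can use a "sqrt" link:
Data <- ToothGrowth
# "probit" is an accepted link for the binomial family
BranchGLM(supp ~ ., data = Data, family = "binomial", link = "probit")
# "sqrt" is an accepted link for the gamma family, but not for the binomial family
BranchGLM(Sepal.Length ~ ., data = iris, family = "gamma", link = "sqrt")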
BranchGLM
returns a BranchGLM
object which is a list with the following components
coefficients |
a data.frame with the coefficient estimates, SEs, Wald test statistics, and p-values |
iterations |
number of iterations it took the algorithm to converge, if the algorithm failed to converge then this is -1 |
dispersion |
a vector of length 2 with the MLE of the dispersion parameter first and the Pearson estimator of the dispersion parameter second |
logLik |
the log-likelihood of the fitted model |
vcov |
the variance-covariance matrix of the fitted model |
resDev |
the residual deviance of the fitted model |
AIC |
the AIC of the fitted model |
preds |
predictions from the fitted model |
linpreds |
linear predictors from the fitted model |
residuals |
a numeric vector with the Pearson residuals |
variance |
a numeric vector with the variance evaluated at the final coefficient estimates |
tol |
tolerance used to fit the model |
maxit |
maximum number of iterations used to fit the model |
formula |
formula used to fit the model |
method |
iterative method used to fit the model |
grads |
number of gradients used to approximate inverse information for L-BFGS |
y |
y vector used in the model, not included if keepY = FALSE |
x |
design matrix used to fit the model, not included if keepData = FALSE |
rownames |
rownames taken from the design matrix |
offset |
offset vector in the model, not included if |
fulloffset |
supplied offset vector, not included if |
data |
original data supplied to the model, not included if keepData = FALSE |
mf |
the model frame, not included if keepData = FALSE |
numobs |
number of observations in the design matrix |
names |
names of the predictor variables |
yname |
name of y variable |
parallel |
whether parallelization was employed to speed up model fitting process |
missing |
number of missing values removed from the original dataset |
link |
link function used to model the data |
family |
family used to model the data |
ylevel |
the levels of y, only included for binomial glms |
xlev |
the levels of the factors in the dataset |
terms |
the terms object used |
BranchGLM.fit
returns a list with the following components
coefficients |
a data.frame with the coefficients estimates, SEs, Wald test statistics, and p-values |
iterations |
number of iterations it took the algorithm to converge, if the algorithm failed to converge then this is -1 |
dispersion |
a vector of length 2 with the MLE of the dispersion parameter first and the Pearson estimator of the dispersion parameter second |
logLik |
the log-likelihood of the fitted model |
vcov |
the variance-covariance matrix of the fitted model |
resDev |
the residual deviance of the fitted model |
AIC |
the AIC of the fitted model |
preds |
predictions from the fitted model |
linpreds |
linear predictors from the fitted model |
residuals |
a numeric vector with the Pearson residuals |
variance |
a numeric vector with the variance evaluated at the final coefficient estimates |
tol |
tolerance used to fit the model |
maxit |
maximum number of iterations used to fit the model |
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman & Hall.
predict.BranchGLM, coef.BranchGLM, VariableSelection, confint.BranchGLM, logLik.BranchGLM
Data <- iris
# Linear regression
## Using BranchGLM
BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
## Using BranchGLM.fit
x <- model.matrix(Sepal.Length ~ ., data = Data)
y <- Data$Sepal.Length
BranchGLM.fit(x, y, family = "gaussian", link = "identity")
# Gamma regression
## Using BranchGLM
BranchGLM(Sepal.Length ~ ., data = Data, family = "gamma", link = "log")
### init
BranchGLM(Sepal.Length ~ ., data = Data, family = "gamma", link = "log",
          init = rep(0, 6), maxit = 50, tol = 1e-6, contrasts = NULL)
### method
BranchGLM(Sepal.Length ~ ., data = Data, family = "gamma", link = "log",
          init = rep(0, 6), maxit = 50, tol = 1e-6, contrasts = NULL, method = "LBFGS")
### offset
BranchGLM(Sepal.Length ~ ., data = Data, family = "gamma", link = "log",
          init = rep(0, 6), maxit = 50, tol = 1e-6, contrasts = NULL, offset = Data$Sepal.Width)
## Using BranchGLM.fit
x <- model.matrix(Sepal.Length ~ ., data = Data)
y <- Data$Sepal.Length
BranchGLM.fit(x, y, family = "gamma", link = "log", init = rep(0, 6),
              maxit = 50, tol = 1e-6, offset = Data$Sepal.Width)
Calculates the c-index/AUC.
Cindex(object, ...) AUC(object, ...) ## S3 method for class 'numeric' Cindex(object, y, ...) ## S3 method for class 'BranchGLM' Cindex(object, ...) ## S3 method for class 'BranchGLMROC' Cindex(object, ...)
object |
a BranchGLM object, a BranchGLMROC object, or a numeric vector of predicted probabilities. |
... |
further arguments passed to other methods. |
y |
Observed values, can be a numeric vector of 0s and 1s, a two-level factor vector, or a logical vector. |
Uses the trapezoidal rule to calculate the AUC when given a BranchGLMROC object and uses the Mann-Whitney U statistic to calculate it otherwise. The trapezoidal rule method is less accurate, so the two methods may give slightly different results.
A number corresponding to the c-index/AUC.
Data <- ToothGrowth
Fit <- BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")
Cindex(Fit)
AUC(Fit)
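As a hedged illustration of the two computations described above, the c-index computed directly from the fitted model (Mann-Whitney U) can be compared with the one computed from a BranchGLMROC object (trapezoidal rule); small differences between the two values are expected:
Data <- ToothGrowth
Fit <- BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")
# Mann-Whitney U based calculation from the fitted model
Cindex(Fit)
# Trapezoidal rule based calculation from the ROC curve
Cindex(ROC(Fit))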
Extracts beta coefficients from BranchGLM
objects.
## S3 method for class 'BranchGLM' coef(object, type = "estimates", ...)
object |
a BranchGLM object. |
type |
"estimates" to only return the coefficient estimates or "all" to return the estimates along with SEs, test statistics, and p-values. |
... |
further arguments passed to or from other methods. |
A named vector with the corresponding coefficient estimates or a data.frame with the coefficient estimates along with SEs, test statistics, and p-values.
Data <- iris
# Linear regression model
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
Fit
# Getting only coefficient estimates
coef(Fit, type = "estimates")
# Getting coefficient estimates along with SEs, tests, and p-values
coef(Fit, type = "all")
Extracts beta coefficients from BranchGLMVS or summary.BranchGLMVS objects.
## S3 method for class 'BranchGLMVS' coef(object, which = 1, ...) ## S3 method for class 'summary.BranchGLMVS' coef(object, which = 1, ...)
object |
a BranchGLMVS or summary.BranchGLMVS object. |
which |
a numeric vector of indices or "all" to indicate which models to get coefficients from, the default is 1 which is used for the best model. For the branch and bound algorithms the number k is used for the kth best model and for the stepwise algorithms the number k is used for the model that is k - 1 steps away from the final model. |
... |
ignored. |
A numeric matrix with the corresponding coefficient estimates.
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
# Doing branch and bound selection
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC",
                        bestmodels = 10, showprogress = FALSE)
## Getting coefficients from best model
coef(VS, which = 1)
## Getting coefficients from all best models
coef(VS, which = "all")
Finds profile likelihood ratio confidence intervals for beta coefficients with the ability to calculate the intervals in parallel.
## S3 method for class 'BranchGLM' confint(object, parm, level = 0.95, parallel = FALSE, nthreads = 8, ...)
object |
a BranchGLM object. |
parm |
a specification of which parameters are to be given confidence intervals, either a vector of numbers or a vector of names. If missing, all parameters are considered. |
level |
the confidence level required. |
parallel |
a logical value to indicate if parallelization should be used. |
nthreads |
a positive integer to denote the number of threads used with OpenMP,
only used if parallel = TRUE. |
... |
further arguments passed from other methods. |
Endpoints of the confidence intervals that could not be found by the algorithm are filled in with NA. When there is a lot of multicollinearity in the data, the algorithm may have problems finding many of the intervals.
An object of class BranchGLMCIs
which is a list with the following components.
CIs |
a numeric matrix with the confidence intervals |
level |
the supplied level |
MLE |
a numeric vector of the MLEs of the coefficients |
Data <- iris
### Fitting linear regression model
mymodel <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
### Getting confidence intervals
CIs <- confint(mymodel, level = 0.95)
CIs
### Plotting CIs
plot(CIs, mary = 7, cex.y = 0.9)
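Because the profile likelihood intervals can be slow to compute for larger models, the parallel and nthreads arguments can be used; a minimal sketch (assuming OpenMP is available on the system):
Data <- iris
mymodel <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
# Computing the intervals in parallel with 2 OpenMP threads;
# the results should match the serial computation
CIs2 <- confint(mymodel, level = 0.95, parallel = TRUE, nthreads = 2)
CIs2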
Extracts the deviance from BranchGLM
objects.
## S3 method for class 'BranchGLM' deviance(object, ...)
object |
a BranchGLM object. |
... |
further arguments passed to or from other methods. |
A single number denoting the deviance.
Computes the (generalized) Akaike An Information Criterion for BranchGLM objects.
## S3 method for class 'BranchGLM' extractAIC(fit, scale, k = 2, ...)
fit |
a BranchGLM object. |
scale |
ignored. |
k |
a non-negative number specifying the ‘weight’ of the equivalent degrees of freedom (= edf) part in the AIC formula. |
... |
further arguments passed to or from other methods. |
A numeric vector of length 2, with first and second elements giving
edf |
the ‘equivalent degrees of freedom’ for the fitted model fit |
AIC |
the (generalized) Akaike Information Criterion for fit |
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
extractAIC(Fit)
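Since k controls the weight on the equivalent degrees of freedom, supplying k = log(n) gives a BIC-style criterion rather than AIC; a brief sketch using nobs to obtain n:
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
# Default k = 2 corresponds to AIC
extractAIC(Fit, k = 2)
# k = log(n) corresponds to a BIC-style penalty
extractAIC(Fit, k = log(nobs(Fit)))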
Extracts family from BranchGLM
objects.
## S3 method for class 'BranchGLM' family(object, ...)
object |
a BranchGLM object. |
... |
further arguments passed to or from other methods. |
An object of class family
.
Only the family
and link
components of family
objects are used by the
BranchGLM
package. The other components of the family
object may not
be the same as what is used in the fitting process for BranchGLM
.
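A minimal sketch of extracting the family object from a fitted model:
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
# Returns a family object; only its family and link components
# are guaranteed to match what BranchGLM used during fitting
family(Fit)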
Extracts model formula from BranchGLM
objects.
## S3 method for class 'BranchGLM' formula(x, ...)
x |
a BranchGLM object. |
... |
further arguments passed to or from other methods. |
A formula representing the model used to obtain x
.
Creates histograms of approximate null distributions for the modified variable importance values.
## S3 method for class 'BranchGLMVI.boot' hist( x, which = "all", linecol = "red", linelwd = 2, xlim = NULL, xlab = "Modified Variable Importance", main = NULL, ... )
x |
a BranchGLMVI.boot object. |
which |
which approximate null distributions to plot, can use a numeric vector of indices, a character vector of names, or "all" for all variables. The default is to create histograms for each set of variables that are not kept in each model. |
linecol |
the color of the line which indicates the observed modified variable importance values. |
linelwd |
the width of the line which indicates the observed modified variable importance values. |
xlim |
a numeric vector of length 2, giving the x coordinates range. |
xlab |
a label for the x axis. |
main |
a main title for the plot. |
... |
further arguments passed to hist.default. |
This only produces a plot, nothing is returned.
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
# Doing branch and bound selection
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC", showprogress = FALSE)
# Getting approximate null distributions
set.seed(40174)
myBoot <- VariableImportance.boot(VS, showprogress = FALSE)
# Plotting histograms of second set of variables
hist(myBoot, which = 2)
# Plotting histograms of third set of variables
hist(myBoot, which = 3, linecol = "blue", linelwd = 5)
Extracts log-likelihood from BranchGLM
objects.
## S3 method for class 'BranchGLM' logLik(object, ...)
object |
a BranchGLM object. |
... |
further arguments passed to or from other methods. |
An object of class logLik
which is a number corresponding to
the log-likelihood with the following attributes: "df" (degrees of freedom)
and "nobs" (number of observations).
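A short sketch showing the returned object and its attributes; since the value is a standard logLik object, generics from the stats package such as AIC can also be applied to it:
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
ll <- logLik(Fit)
ll
# Degrees of freedom and number of observations are stored as attributes
attr(ll, "df")
attr(ll, "nobs")
# Standard generics that work on logLik objects can be used
AIC(ll)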
Extracts model frame from BranchGLM
objects.
## S3 method for class 'BranchGLM' model.frame(formula, ...)
formula |
a BranchGLM object. |
... |
further arguments passed to or from other methods. |
A data.frame used for the fitted model.
Plotting Multiple ROC Curves
MultipleROCCurves( ..., legendpos = "bottomright", title = "ROC Curves", colors = NULL, names = NULL, lty = 1, lwd = 1 )
... |
any number of BranchGLMROC objects. |
legendpos |
a keyword to describe where to place the legend, such as "bottomright". The default is "bottomright". |
title |
title for the plot. |
colors |
vector of colors to be used on the ROC curves. |
names |
vector of names used to create a legend for the ROC curves. |
lty |
vector of linetypes used to create the ROC curves or a single linetype to be used for all ROC curves. |
lwd |
vector of linewidths used to create the ROC curves or a single linewidth to be used for all ROC curves. |
This only produces a plot, nothing is returned.
Data <- ToothGrowth
### Logistic ROC
LogisticFit <- BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")
LogisticROC <- ROC(LogisticFit)
### Probit ROC
ProbitFit <- BranchGLM(supp ~ ., data = Data, family = "binomial", link = "probit")
ProbitROC <- ROC(ProbitFit)
### Cloglog ROC
CloglogFit <- BranchGLM(supp ~ ., data = Data, family = "binomial", link = "cloglog")
CloglogROC <- ROC(CloglogFit)
### Plotting ROC curves
MultipleROCCurves(LogisticROC, ProbitROC, CloglogROC,
                  names = c("Logistic ROC", "Probit ROC", "Cloglog ROC"))
Extracts number of observations from BranchGLM
objects.
## S3 method for class 'BranchGLM' nobs(object, ...)
object |
a BranchGLM object. |
... |
further arguments passed to or from other methods. |
A single number indicating the number of observations used to fit the model.
Creates a plot to help visualize fitted values from BranchGLM
objects.
## S3 method for class 'BranchGLM' plot(x, ...)
x |
a BranchGLM object. |
... |
further arguments passed to plot.default. |
This only produces a plot, nothing is returned.
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
plot(Fit)
Creates a plot to visualize confidence intervals from BranchGLMCIs objects.
## S3 method for class 'BranchGLMCIs' plot(x, which = "all", mary = 5, ...)
x |
a BranchGLMCIs object. |
which |
which intervals to plot, can use a numeric vector of indices, a character vector of names of desired variables, or "all" to plot all intervals. |
mary |
a numeric value used to determine how large to make the margin of the y-axis. If variable names are cut off, consider increasing this from the default value of 5. |
... |
further arguments passed to plotCI. |
This only produces a plot, nothing is returned.
Data <- iris
### Fitting linear regression model
mymodel <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
### Getting confidence intervals
CIs <- confint(mymodel, level = 0.95)
CIs
### Plotting CIs
plot(CIs, mary = 7, cex.y = 0.9)
This plots a ROC curve.
## S3 method for class 'BranchGLMROC' plot(x, xlab = "1 - Specificity", ylab = "Sensitivity", type = "l", ...)
x |
a BranchGLMROC object. |
xlab |
label for the x-axis. |
ylab |
label for the y-axis. |
type |
what type of plot to draw, see more details at plot.default. |
... |
further arguments passed to plot.default. |
This only produces a plot, nothing is returned.
Data <- ToothGrowth
Fit <- BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")
MyROC <- ROC(Fit)
plot(MyROC)
Creates plots to help visualize variable selection results from BranchGLMVS or summary.BranchGLMVS objects.
## S3 method for class 'BranchGLMVS'
plot(x, ptype = "both", marnames = 7, addLines = TRUE, type = "b",
  horiz = FALSE, cex.names = 1, cex.lab = 1, cex.axis = 1, cex.legend = 1,
  cols = c("deepskyblue", "indianred", "forestgreen"), ...)

## S3 method for class 'summary.BranchGLMVS'
plot(x, ptype = "both", marnames = 7, addLines = TRUE, type = "b",
  horiz = FALSE, cex.names = 1, cex.lab = 1, cex.axis = 1, cex.legend = 1,
  cols = c("deepskyblue", "indianred", "forestgreen"), ...)
x |
a BranchGLMVS or summary.BranchGLMVS object. |
ptype |
the type of plot to produce, look at details for more explanation. |
marnames |
a numeric value used to determine how large to make the margin of the axis with the variable names; this is only used for the "variables" plot. If variable names are cut off, consider increasing this from the default value of 7. |
addLines |
a logical value to indicate whether or not to add black lines to separate the models for the "variables" plot. This is typically useful for smaller numbers of models, but can be annoying if there are many models. |
type |
what type of plot to draw for the "metrics" plot, see more details at plot.default. |
horiz |
a logical value to indicate whether models should be displayed horizontally for the "variables" plot. |
cex.names |
how big to make variable names in the "variables" plot. |
cex.lab |
how big to make axis labels. |
cex.axis |
how big to make axis annotation. |
cex.legend |
how big to make legend labels. |
cols |
the colors used to create the "variables" plot. Should be a character vector of length 3, the first color will be used for included variables, the second color will be used for excluded variables, and the third color will be used for kept variables. |
... |
further arguments passed to plot.default for the "metrics" plot and image.default for the "variables" plot. |
The different values for ptype are as follows
"metrics" for a plot that displays the metric values ordered by rank for the branch and bound algorithms or a plot which displays the metric values in the path taken by the stepwise algorithms
"variables" for a plot that displays which variables are in each of the top models for the branch and bound algorithms or a plot which displays the path taken by the stepwise algorithms
"both" for both plots
If there are so many models that the "variables" plot appears to be entirely black, then set addLines to FALSE.
This only produces plots, nothing is returned.
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
# Doing branch and bound selection
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC",
                        bestmodels = 10, showprogress = FALSE)
VS
## Getting summary of the process
Summ <- summary(VS)
Summ
## Plotting the BIC of the best models
plot(Summ, type = "b", ptype = "metrics")
## Plotting the variables in the best models
plot(Summ, ptype = "variables")
### Alternative colors
plot(Summ, ptype = "variables", cols = c("yellowgreen", "purple1", "grey50"))
### Smaller text size for names
plot(Summ, ptype = "variables", cex.names = 0.75)
Creates a plot to display confidence intervals.
plotCI( CIs, points = NULL, ylab = "", ylas = 2, cex.y = 1, decreasing = FALSE, ... )
CIs |
a numeric matrix of confidence intervals, must have exactly 2 columns. The variable names displayed in the plot are taken from the column names. |
points |
points to be plotted in the middle of the CIs, typically means or medians. The default is to plot the midpoints of the intervals. |
ylab |
a label for the y-axis. |
ylas |
the style of the y-axis label, see more about this at par. |
cex.y |
font size used for variable names on y-axis. |
decreasing |
a logical value indicating if confidence intervals should be displayed in decreasing or increasing order according to points. Can use NA if no ordering is desired. |
... |
further arguments passed to plot.default. |
This only produces a plot, nothing is returned.
Data <- iris
### Fitting linear regression model
mymodel <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
### Getting confidence intervals
CIs <- confint.default(mymodel, level = 0.95)
xlim <- c(min(CIs), max(CIs))
### Plotting CIs
par(mar = c(5, 7, 3, 1) + 0.1)
plotCI(CIs, main = "95% Confidence Intervals", xlim = xlim, cex.y = 0.9,
       xlab = "Beta Coefficients")
abline(v = 0)
Obtains predictions from BranchGLM
objects.
## S3 method for class 'BranchGLM' predict( object, newdata = NULL, offset = NULL, type = "response", na.action = na.pass, ... )
object |
a BranchGLM object. |
newdata |
a data.frame, if not specified then the data the model was fit on is used. |
offset |
a numeric vector containing the offset variable, this is ignored if newdata is not supplied. |
type |
one of "linpreds" which is on the scale of the linear predictors or "response" which is on the scale of the response variable. If not specified, then "response" is used. |
na.action |
a function which indicates what should happen when the data
contains NAs. The default is na.pass. |
... |
further arguments passed to or from other methods. |
A numeric vector of predictions.
Data <- airquality
# Example without offset
Fit <- BranchGLM(Temp ~ ., data = Data, family = "gaussian", link = "identity")
## Using default na.action
predict(Fit)
## Using na.omit
predict(Fit, na.action = na.omit)
## Using new data
predict(Fit, newdata = Data[1:20, ], na.action = na.pass)
# Using offset
FitOffset <- BranchGLM(Temp ~ . - Day, data = Data, family = "gaussian",
                       link = "identity", offset = Data$Day * -0.1)
## Getting predictions for data used to fit model
### Don't need to supply offset vector
predict(FitOffset)
## Getting predictions for new dataset
### Need to include new offset vector since we are
### getting predictions for new dataset
predict(FitOffset, newdata = Data[1:20, ], offset = Data$Day[1:20] * -0.1)
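The type argument switches between the two prediction scales; with a non-identity link the two differ, as in this brief sketch with a gamma model and log link:
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gamma", link = "log")
# Predictions on the scale of the response (inverse link applied)
head(predict(Fit, type = "response"))
# Predictions on the scale of the linear predictors
head(predict(Fit, type = "linpreds"))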
Obtains predictions from BranchGLMVS or summary.BranchGLMVS objects.
## S3 method for class 'BranchGLMVS' predict(object, which = 1, ...) ## S3 method for class 'summary.BranchGLMVS' predict(object, which = 1, ...)
object |
a BranchGLMVS or summary.BranchGLMVS object. |
which |
a positive integer to indicate which model to get predictions from, the default is 1 which is used for the best model. For the branch and bound algorithms the number k is used for the kth best model and for the stepwise algorithms the number k is used for the model that is k - 1 steps away from the final model. |
... |
further arguments passed to predict.BranchGLM. |
A numeric vector of predictions.
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gamma", link = "log")
# Doing branch and bound selection
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC",
                        bestmodels = 10, showprogress = FALSE)
## Getting predictions from best model
predict(VS, which = 1)
## Getting linear predictors from 5th best model
predict(VS, which = 5, type = "linpreds")
Print method for BranchGLM
objects.
## S3 method for class 'BranchGLM' print(x, coefdigits = 4, digits = 4, ...)
x |
a BranchGLM object. |
coefdigits |
number of digits to display for coefficients table. |
digits |
number of significant digits to display for information after table. |
... |
further arguments passed to or from other methods. |
The supplied BranchGLM
object.
Print method for BranchGLMCIs objects.
## S3 method for class 'BranchGLMCIs' print(x, digits = 4, ...)
x |
a BranchGLMCIs object. |
digits |
number of significant digits to display. |
... |
further arguments passed from other methods. |
The supplied BranchGLMCIs
object.
Print method for BranchGLMROC objects.
## S3 method for class 'BranchGLMROC' print(x, ...)
x |
a BranchGLMROC object. |
... |
further arguments passed to other methods. |
The supplied BranchGLMROC
object.
Print method for BranchGLMTable objects.
## S3 method for class 'BranchGLMTable' print(x, digits = 4, ...)
x |
a BranchGLMTable object. |
digits |
number of digits to display. |
... |
further arguments passed to other methods. |
The supplied BranchGLMTable
object.
Print method for BranchGLMVI objects.
## S3 method for class 'BranchGLMVI' print(x, digits = 3, ...)
x |
a BranchGLMVI object. |
digits |
number of significant digits to display. |
... |
further arguments passed to other methods. |
The supplied BranchGLMVI
object.
Print method for BranchGLMVI.boot objects.
## S3 method for class 'BranchGLMVI.boot' print(x, digits = 3, ...)
x |
a BranchGLMVI.boot object. |
digits |
number of significant digits to display. |
... |
further arguments passed to other methods. |
The supplied BranchGLMVI.boot
object.
Print method for BranchGLMVS objects.
## S3 method for class 'BranchGLMVS' print(x, digits = 2, ...)
x |
a BranchGLMVS object. |
digits |
number of digits to display. |
... |
further arguments passed to other methods. |
The supplied BranchGLMVS
object.
Print method for summary.BranchGLMVS objects.
## S3 method for class 'summary.BranchGLMVS' print(x, digits = 2, ...)
x |
a summary.BranchGLMVS object. |
digits |
number of digits to display. |
... |
further arguments passed to other methods. |
The supplied summary.BranchGLMVS
object.
Extracts the Pearson residuals from BranchGLM
objects.
## S3 method for class 'BranchGLM' residuals(object, ...)
object |
a BranchGLM object. |
... |
further arguments passed to or from other methods. |
A vector of the Pearson residuals from the supplied BranchGLM
object.
Creates an ROC curve.
ROC(object, ...) ## S3 method for class 'numeric' ROC(object, y, ...) ## S3 method for class 'BranchGLM' ROC(object, ...)
object |
a BranchGLM object or a numeric vector of predicted probabilities. |
... |
further arguments passed to other methods. |
y |
observed values, can be a numeric vector of 0s and 1s, a two-level factor vector, or a logical vector. |
A BranchGLMROC
object which can be plotted with plot()
. The AUC can also
be calculated using AUC()
.
Data <- ToothGrowth
Fit <- BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")
MyROC <- ROC(Fit)
plot(MyROC)
Extracts the square root of the dispersion parameter estimates from BranchGLM
objects.
## S3 method for class 'BranchGLM' sigma(object, ...)
object |
a BranchGLM object. |
... |
further arguments passed to or from other methods. |
A numeric vector of length 2 with first and second elements giving
mle |
the MLE of the dispersion parameter |
pearson |
the Pearson estimator of the dispersion parameter |
The dispersion parameter for binomial and Poisson regression is always fixed to be 1. The MLE of the dispersion parameter is used in the calculation of the log-likelihood while the Pearson estimator of the dispersion parameter is used to calculate standard errors for the coefficient estimates.
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
sigma(Fit)
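Since these values are the square roots of the two dispersion estimates stored in the fitted model, they should agree (up to numerical error) with the square root of the dispersion component; a hedged check:
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
# sigma() should match the square roots of the stored dispersion estimates
sigma(Fit)
sqrt(Fit$dispersion)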
Summary method for BranchGLMVS objects.
## S3 method for class 'BranchGLMVS' summary(object, ...)
object |
a BranchGLMVS object. |
... |
further arguments passed to or from other methods. |
An object of class summary.BranchGLMVS
which is a list with the
following components
results |
a data.frame which has the metric values for the best models along with the sets of variables included in each model |
VS |
the supplied BranchGLMVS object |
formulas |
a list containing the formulas of the best models |
metric |
the metric used to perform variable selection |
plot.summary.BranchGLMVS, coef.summary.BranchGLMVS, predict.summary.BranchGLMVS
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
# Doing branch and bound selection
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC",
                        bestmodels = 10, showprogress = FALSE)
VS
## Getting summary of the process
Summ <- summary(VS)
Summ
## Plotting the BIC of the best models
plot(Summ, type = "b")
## Plotting the variables in the best models
plot(Summ, ptype = "variables")
## Getting coefficients
coef(Summ)
Creates a confusion matrix and calculates related measures.
Table(object, ...) ## S3 method for class 'numeric' Table(object, y, cutoff = 0.5, ...) ## S3 method for class 'BranchGLM' Table(object, cutoff = 0.5, ...)
object |
a BranchGLM object or a numeric vector of predicted probabilities. |
... |
further arguments passed to other methods. |
y |
observed values, can be a numeric vector of 0s and 1s, a two-level factor vector, or a logical vector. |
cutoff |
cutoff for predicted values, the default is 0.5. |
A BranchGLMTable
object which is a list with the following components
table |
a matrix corresponding to the confusion matrix |
accuracy |
a number corresponding to the accuracy |
sensitivity |
a number corresponding to the sensitivity |
specificity |
a number corresponding to the specificity |
PPV |
a number corresponding to the positive predictive value |
levels |
a vector corresponding to the levels of the response variable |
Data <- ToothGrowth
Fit <- BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")
Table(Fit)
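The numeric method can also be used with a vector of predicted probabilities and observed values, with an adjustable classification cutoff; a minimal sketch (the 0.4 cutoff is illustrative only):
Data <- ToothGrowth
Fit <- BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")
# Using the numeric method with predicted probabilities and a custom cutoff
probs <- predict(Fit, type = "response")
Table(probs, Data$supp, cutoff = 0.4)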
Gets exact or approximate L0-penalization based variable importance values for generalized linear models (GLMs). More details about what the variable importance values are can be found in the details section.
VariableImportance( object, VIMethod = "simultaneous", parallel = FALSE, nthreads = 8, showprogress = TRUE )
object |
an object of class BranchGLMVS. |
VIMethod |
one of "separate" or "simultaneous" to denote the method used to find
the variable importance values. This is ignored if the type of variable selection
employed in object was one of the heuristic (stepwise) methods. |
parallel |
a logical value to indicate if parallelization should be used. |
nthreads |
number of threads used with OpenMP, only used if parallel = TRUE. |
showprogress |
a logical value to indicate whether or not to show progress updates. |
Note that variable importance values can only be found for sets of variables that are not kept through the model selection process. More details about the variable importance values will be made available in an upcoming paper.
When a branch and bound algorithm is used in object
, then the exact variable
importance values are computed. When a heuristic method is used, then approximate
variable importance values are computed based on the specified heuristic method.
A BranchGLMVI
object which is a list with the following components
results |
a data.frame with the variable importance values and degrees of freedom |
metric |
metric used to select the best models |
numchecked |
number of models fit |
VS |
the supplied BranchGLMVS object |
with |
a numeric matrix with the best models that include each set of variables |
withmetrics |
a numeric vector with the metric values for the best models with each set of variables |
without |
a numeric matrix with the best models that exclude each set of variables |
withoutmetrics |
a numeric vector with the metric values for the best models without each set of variables |
Seedorff J, Cavanaugh JE. Assessing Variable Importance for Best Subset Selection. Entropy. 2024; 26(9):801. doi:10.3390/e26090801
VariableImportance.boot, barplot.BranchGLMVI
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
# Doing branch and bound selection
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC", showprogress = FALSE)
# Getting variable importance
VI <- VariableImportance(VS, showprogress = FALSE)
VI
# Plotting variable importance
oldmar <- par("mar")
par(mar = c(4, 6, 3, 1) + 0.1)
barplot(VI)
par(mar = oldmar)
Performs a version of the parametric bootstrap to create an approximate null distribution for the modified variable importance values in order to get approximate p-values.
VariableImportance.boot(object, ...)

## S3 method for class 'BranchGLMVS'
VariableImportance.boot(object, nboot = 100, parallel = FALSE,
  nthreads = 8, showprogress = TRUE, ...)

## S3 method for class 'BranchGLMVI'
VariableImportance.boot(object, nboot = 100, parallel = FALSE,
  nthreads = 8, showprogress = TRUE, ...)
object |
a BranchGLMVS or BranchGLMVI object. |
... |
further arguments passed to VariableImportance when object is a BranchGLMVS object. |
nboot |
the number of bootstrap replications to perform. |
parallel |
a logical value to indicate if parallelization should be used. |
nthreads |
number of threads used with OpenMP, only used if parallel = TRUE. |
showprogress |
a logical value to indicate if a progress bar should be displayed. |
This performs a version of the parametric bootstrap with the modified variable importance values to generate approximate p-values for the sets of variables. We are currently working on a paper that describes this function in further detail.
a BranchGLMVI.boot
object which is a list with the following components
summary |
a data.frame with the observed modified variable importance values and approximate p-values |
results |
a numeric matrix with the modified variable importance values for each set of bootstrap replications |
pvals |
a numeric vector with the approximate p-values based on modified variable importance |
nboot |
the number of bootstrap replications performed |
metric |
the metric used to calculate the modified variable importance values |
VI |
the supplied BranchGLMVI object |
Seedorff J, Cavanaugh JE. Assessing Variable Importance for Best Subset Selection. Entropy. 2024; 26(9):801. doi:10.3390/e26090801
hist.BranchGLMVI.boot, boxplot.BranchGLMVI.boot, VariableImportance
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
# Doing branch and bound selection
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC", showprogress = FALSE)
# Getting approximate null distributions
set.seed(40174)
myBoot <- VariableImportance.boot(VS, showprogress = FALSE)
myBoot
# Plotting histogram of results for second set of variables
hist(myBoot, which = 2)
# Plotting boxplots of results
oldmar <- par("mar")
par(mar = c(4, 6, 3, 1) + 0.1)
boxplot(myBoot, las = 1)
par(mar = oldmar)
Performs forward selection, multiple different variants of backward elimination, and efficient best subset variable selection with information criterion for generalized linear models (GLMs). Best subset selection is performed with branch and bound algorithms to greatly speed up the process and backward elimination can be performed with bounding algorithms to speed it up.
VariableSelection(object, ...)

## S3 method for class 'formula'
VariableSelection(object, data = NULL, family, link, offset = NULL,
  method = "Fisher", type = "switch branch and bound", metric = "AIC",
  bestmodels = NULL, cutoff = NULL, keep = NULL, keepintercept = TRUE,
  maxsize = NULL, grads = 10, parallel = FALSE, nthreads = 8,
  tol = 1e-06, maxit = NULL, contrasts = NULL, showprogress = TRUE, ...)

## S3 method for class 'BranchGLM'
VariableSelection(object, type = "switch branch and bound", metric = "AIC",
  bestmodels = NULL, cutoff = NULL, keep = NULL, keepintercept = TRUE,
  maxsize = NULL, parallel = FALSE, nthreads = 8, showprogress = TRUE, ...)
object |
a formula or a BranchGLM object. |
... |
further arguments. |
data |
a data.frame, list or environment (or object coercible by
as.data.frame to a data.frame), containing the variables in formula.
Neither a matrix nor an array will be accepted. If not found in data, the variables are taken from the environment of the formula. |
family |
the distribution used to model the data, one of "gaussian", "gamma", "binomial", or "poisson". A family object may also be supplied. |
link |
the link used to link the mean structure to the linear predictors. One of
"identity", "logit", "probit", "cloglog", "sqrt", "inverse", or "log". This only needs to be
supplied if a string is supplied for family. |
offset |
the offset vector, by default the zero vector is used. |
method |
one of "Fisher", "BFGS", or "LBFGS". Fisher's scoring is recommended for forward selection and the branch and bound algorithms since they will typically fit many models with a small number of covariates. |
type |
one of "forward", "backward", "fast backward", "double backward", "fast double backward", "branch and bound", "backward branch and bound", or "switch branch and bound" to indicate the type of variable selection to perform. The default value is "switch branch and bound". See more about these algorithms in details |
metric |
the metric used to choose the best models, the default is "AIC", but "BIC" and "HQIC" are also available. AIC is the Akaike information criterion, BIC is the Bayesian information criterion, and HQIC is the Hannan-Quinn information criterion. |
bestmodels |
a positive integer to indicate the number of the best models to find according to the chosen metric or NULL. If this is NULL, then cutoff is used instead. This is only used for the branch and bound algorithms. |
cutoff |
a non-negative number which indicates that the function should return all models that have a metric value within cutoff of the best metric value or NULL. Only one of this or bestmodels should be specified and when both are NULL a cutoff of 0 is used. This is only used for the branch and bound algorithms. |
keep |
a character vector of names to denote variables that must be in the models. |
keepintercept |
a logical value to indicate whether to keep the intercept in all models, only used if an intercept is included in the formula. |
maxsize |
a positive integer to denote the maximum number of variables to
consider in a single model, the default is the total number of variables.
This number adds onto any variables specified in keep. This argument only works
for |
grads |
a positive integer to denote the number of gradients used to
approximate the inverse information with, only for |
parallel |
a logical value to indicate if parallelization should be used. |
nthreads |
a positive integer to denote the number of threads used with OpenMP,
only used if |
tol |
a positive number to denote the tolerance used to determine model convergence. |
maxit |
a positive integer to denote the maximum number of iterations performed. The default for Fisher's scoring is 50 and for the other methods the default is 200. |
contrasts |
see |
showprogress |
a logical value to indicate whether to show progress updates for branch and bound algorithms. |
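As a quick illustrative sketch of the difference between bestmodels and cutoff (this is not one of the package's own examples, just an illustration using the documented arguments): bestmodels requests a fixed number of top models, while cutoff returns every model whose metric value is within the given amount of the best value found.

Fit <- BranchGLM(Sepal.Length ~ ., data = iris, family = "gaussian", link = "identity")
# All models within 2 BIC units of the best model found by the default
# switch branch and bound search
cutVS <- VariableSelection(Fit, metric = "BIC", cutoff = 2, showprogress = FALSE)
cutVS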
The supplied formula or the formula from the fitted model is treated as the upper model. The variables specified in keep, along with an intercept (if included in the formula and keepintercept = TRUE), form the lower model. Factor variables are either kept in their entirety or removed entirely, and interaction terms are handled properly. All observations that have any missing values in the upper model are removed.
There are 5 stepwise variable selection algorithms available: forward selection, backward elimination, fast backward elimination, double backward elimination, and fast double backward elimination. All of these are heuristic algorithms, so the best model they find may not be the optimal model.
Fast backward elimination should give the same results as backward elimination, but it uses the bounding techniques from the branch and bound algorithms to make it faster. Its results can differ slightly from those of backward elimination if the GLM solver has difficulty fitting some of the larger models.
Double backward elimination and fast double backward elimination are variants of backward elimination in which up to 2 variables can be removed in a single step instead of just 1. This typically results in higher-quality models, but can also be much slower. The bounding algorithm used in fast double backward elimination makes it much faster.
The branch and bound algorithm is an efficient algorithm used to find the optimal models. The backward branch and bound algorithm is very similar to the branch and bound algorithm, except it tends to be faster when the best models contain most of the variables. The switch branch and bound algorithm is a combination of the two algorithms and is typically the fastest of the 3 branch and bound algorithms. All of the branch and bound algorithms are guaranteed to find the optimal models (up to numerical precision).
Fisher's scoring is recommended for branch and bound selection and forward selection. L-BFGS may be faster for the backward elimination and double backward elimination algorithms, especially when there are many variables.
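The following is a small sketch (not taken from the package's examples) contrasting two of the heuristic algorithms described above; fast backward elimination is expected to select the same model as backward elimination here while fitting fewer models.

Fit <- BranchGLM(Sepal.Length ~ ., data = iris, family = "gaussian", link = "identity")
# Heuristic searches: ordinary and fast backward elimination with AIC
backVS <- VariableSelection(Fit, type = "backward", metric = "AIC", showprogress = FALSE)
fastVS <- VariableSelection(Fit, type = "fast backward", metric = "AIC", showprogress = FALSE)
backVS
fastVS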
A BranchGLMVS object, which is a list with the following components (a short sketch of accessing them follows this list):
initmodel | the BranchGLM object that the variable selection was performed on
numchecked | number of models fit
names | character vector of the names of the predictor variables
order | the order in which the variables were added to or removed from the model; this is only included for the stepwise algorithms
type | type of variable selection employed
optType | whether the specified type used a heuristic or exact algorithm
metric | metric used to select the best models
bestmodels | numeric matrix describing the best models for the branch and bound algorithms, or a numeric matrix describing the models along the path taken for the stepwise algorithms
bestmetrics | numeric vector with the best metrics found in the search for the branch and bound algorithms, or a numeric vector with the metric values along the path taken for the stepwise algorithms
beta | numeric matrix of beta coefficients for the models in bestmodels
cutoff | the cutoff that was used; this is set to -1 if bestmodels was used instead or if a stepwise algorithm was used
keep | vector of which variables were kept through the selection process
keepintercept | a logical value denoting whether the intercept was kept through the selection process
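Since the returned BranchGLMVS object is a list with the components above, individual pieces can be inspected directly; a minimal sketch (not from the package's examples, assuming ordinary $ access on the returned list):

Fit <- BranchGLM(Sepal.Length ~ ., data = iris, family = "gaussian", link = "identity")
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC",
                        bestmodels = 5, showprogress = FALSE)
VS$numchecked   # number of models fit during the search
VS$bestmetrics  # metric values of the best models found
VS$optType      # whether an exact or heuristic algorithm was used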
plot.BranchGLMVS, coef.BranchGLMVS, predict.BranchGLMVS, summary.BranchGLMVS
Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")

# Doing branch and bound selection
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC",
                        bestmodels = 10, showprogress = FALSE)
VS

## Plotting the BIC of the best models
plot(VS, type = "b")

## Getting the coefficients of the best model according to BIC
FinalModel <- coef(VS, which = 1)
FinalModel

# Now doing it in parallel (although it isn't necessary for this dataset)
parVS <- VariableSelection(Fit, type = "branch and bound", parallel = TRUE,
                           metric = "BIC", bestmodels = 10, showprogress = FALSE)

## Getting the coefficients of the best model according to BIC
FinalModel <- coef(parVS, which = 1)
FinalModel

# Using a formula
formVS <- VariableSelection(Sepal.Length ~ ., data = Data, family = "gaussian",
                            link = "identity", metric = "BIC",
                            type = "branch and bound", bestmodels = 10,
                            showprogress = FALSE)

## Getting the coefficients of the best model according to BIC
FinalModel <- coef(formVS, which = 1)
FinalModel

# Using the keep argument
keepVS <- VariableSelection(Fit, type = "branch and bound",
                            keep = c("Species", "Petal.Width"), metric = "BIC",
                            bestmodels = 4, showprogress = FALSE)
keepVS

## Getting the coefficients from the fourth best model according to BIC when
## keeping Petal.Width and Species in every model
FinalModel <- coef(keepVS, which = 4)
FinalModel

# Treating categorical variable beta parameters separately
## This function automatically groups together parameters from a categorical variable
## to avoid this, you need to create the indicator variables yourself
x <- model.matrix(Sepal.Length ~ ., data = iris)
Sepal.Length <- iris$Sepal.Length
Data <- cbind.data.frame(Sepal.Length, x[, -1])
VSCat <- VariableSelection(Sepal.Length ~ ., data = Data, family = "gaussian",
                           link = "identity", metric = "BIC", bestmodels = 10,
                           showprogress = FALSE)
VSCat

## Plotting results
plot(VSCat, cex.names = 0.75)
Extracts the covariance matrix of the beta coefficients from BranchGLM objects.
## S3 method for class 'BranchGLM'
vcov(object, ...)
object | a BranchGLM object.
... | further arguments passed to or from other methods.
A numeric matrix which is the covariance matrix of the beta coefficients.
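For instance, the returned covariance matrix can be used to obtain standard errors for the coefficient estimates; a minimal sketch (not one of the package's own examples):

Fit <- BranchGLM(Sepal.Length ~ ., data = iris, family = "gaussian", link = "identity")
vcov(Fit)                # covariance matrix of the beta coefficients
sqrt(diag(vcov(Fit)))    # standard errors of the coefficient estimates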