Title: Bayesian Logistic Regression with Heavy-Tailed Priors
Description: Efficient Bayesian multinomial logistic regression based on heavy-tailed (hyper-LASSO, non-convex) priors. The posterior of coefficients and hyper-parameters is sampled with restricted Gibbs sampling for leveraging the high dimensionality and Hamiltonian Monte Carlo for handling the high correlation among coefficients. A detailed description of the method: Li and Yao (2018), Journal of Statistical Computation and Simulation, 88:14, 2827-2851, <arXiv:1405.3319>.
Authors: Longhai Li [aut, cre], Steven Liu [aut]
Maintainer: Longhai Li <[email protected]>
License: GPL-3
Version: 0.4-4
Built: 2024-10-30 06:58:16 UTC
Source: CRAN
Efficient Bayesian multinomial logistic regression based on heavy-tailed priors. This package is suitable for classification and feature selection with high-dimensional features, such as gene expression profiles. Heavy-tailed priors can impose stronger shrinkage (compared to Gaussian and Laplace priors) on the coefficients associated with a large number of useless features, while still allowing the coefficients of a small number of useful features to stand out without punishment. They can also automatically select features within a large number of correlated features. The posterior of the coefficients and hyper-parameters is sampled with restricted Gibbs sampling, which leverages the high dimensionality, and Hamiltonian Monte Carlo, which handles the high correlations among coefficients.
Longhai Li and Weixin Yao (2018). Fully Bayesian Logistic Regression with Hyper-Lasso Priors for High-dimensional Feature Selection. Journal of Statistical Computation and Simulation, 88:14, 2827-2851.
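The following end-to-end sketch is not part of the original manual; it chains the functions documented below (split_data, htlr, predict, evaluate_pred) on the bundled colon data, with a demo-length chain:

library(HTLR)
set.seed(12345)
data("colon")
## hold out 30% of the tissues for testing
dat <- split_data(colon$X, colon$y, p.train = 0.7)
## fit with the default t prior on the first 100 genes (chain length for demo only)
fit <- htlr(X = dat$x.tr, y = dat$y.tr, fsel = 1:100, iter = 200, warmup = 100)
## compare predicted probabilities on the held-out tissues with the truth
y.prob <- predict(fit, dat$x.te, type = "response")
evaluate_pred(y.prob, dat$y.te, showplot = FALSE)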
The Markov chain samples (excluding warmup) included in an htlr.fit object are coerced to a matrix.
## S3 method for class 'htlr.fit'
as.matrix(x, k = NULL, include.warmup = FALSE, ...)
x: An object of S3 class htlr.fit.
k: Coefficients associated with class k are extracted.
include.warmup: Whether or not to include warmup samples.
...: Not used.
A matrix with p + 1 columns and i rows, where p is the number of features excluding the intercept, and i is the number of iterations after burnin.
## No. of features used: 100; No. of iterations after burnin: 15
fit <- htlr(X = colon$X, y = colon$y, fsel = 1:100, iter = 20, warmup = 5)
dim(as.matrix(fit))
In this dataset, expression levels of 40 tumor and 22 normal colon tissue samples were measured for 6500 human genes using Affymetrix technology. A selection of the 2000 genes with the highest minimal intensity across the samples was made by Alon et al. (1999). The data are preprocessed by a base-10 logarithmic transformation and by standardizing each tissue sample to zero mean and unit variance across genes.
data("colon")
data("colon")
A list containing the data matrix X and the response vector y:
X: A matrix with 62 rows (observations) and 2000 columns (features).
y: A binary vector where 0 indicates normal colon tissue and 1 indicates tumor colon tissue.
Dettling, Marcel, and Peter Bühlmann (2002). Supervised clustering of genes. Genome Biology, 3(12), research0069.1.
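A quick check of the documented format (not part of the original help page):

data("colon")
dim(colon$X)    ## 62 2000
table(colon$y)  ## 22 normal (0) and 40 tumor (1) tissues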
This dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database; in particular, all patients here are females at least 21 years old of Pima Indian heritage. Unlike the original UCI version, this dataset has been preprocessed: rows with missing values are removed and features are scaled.
data("diabetes392")
data("diabetes392")
A list containing the data matrix X and the response vector y:
X: A matrix with 392 rows (observations) and 8 columns (features).
y: A binary vector where 1 indicates diabetes patients and 0 otherwise.
https://www.kaggle.com/uciml/pima-indians-diabetes-database
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.
https://avehtari.github.io/modelselection/diabetes.html
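As with colon, a quick check of the documented format (not part of the original help page):

data("diabetes392")
dim(diabetes392$X)    ## 392 8
table(diabetes392$y)  ## 0 = no diabetes, 1 = diabetes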
This function compares the prediction results returned by a classifier with the ground truth and produces a summary of the evaluation.
evaluate_pred(y.pred, y.true, caseid = names(y.true), showplot = TRUE, ...)
y.pred: A matrix of predicted probabilities, as returned by a classifier.
y.true: Vector of ground truth labels.
caseid: The names of the test cases to take into account. By default all test cases are used for evaluation.
showplot: Logical; if TRUE, a plot of the evaluation results is displayed.
...: Not used.
A summary of the evaluation results.
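No example ships with this function; the sketch below combines it with the simulation and fitting helpers documented on this page (chain length for demo only):

set.seed(12345)
dat <- gendata_MLR(n = 100, p = 10)
dat <- split_data(dat$X, dat$y, p.train = 0.7)
fit <- htlr(X = dat$x.tr, y = dat$y.tr, iter = 40, warmup = 10)
y.prob <- predict(fit, dat$x.te, type = "response")
evaluate_pred(y.prob, dat$y.te, showplot = FALSE)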
This function generates inputs X given the response variable y, using a multivariate normal model.
gendata_FAM(n, muj, A, sd_g = 0, stdx = FALSE)
n: Number of observations.
muj: C by p matrix, with row c representing class y = c and column j representing the mean of feature j.
A: Factor loading matrix of size p by p; see details.
sd_g: Numeric value indicating the noise level.
stdx: Logical; if TRUE, the generated X is standardized.
The means of each covariate depend on y, as specified by the matrix muj; the covariance matrix of the multivariate normal is equal to AA^T + sd_g^2 I, where A is the factor loading matrix and sd_g is the noise level.
A list containing the input matrix X, the response variables y, the covariance matrix SGM, and muj (standardized if stdx = TRUE).
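The following sketch (not from the original manual) checks the covariance structure stated above, assuming Cov(X) = AA^T + sd_g^2 I within each class; a square zero-mean muj keeps the example independent of the matrix orientation:

set.seed(1)
p <- 3
A <- diag(1, p); A[2, 1] <- 0.8   ## simple loading matrix
muj <- matrix(0, p, p)            ## equal (zero) means for all classes
dat <- gendata_FAM(n = 5000, muj, A, sd_g = 0.5)
## empirical covariance should be close to AA^T + sd_g^2 * I (assumed form)
round(cov(dat$X) - (A %*% t(A) + 0.5^2 * diag(p)), 2)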
## feature #1: marginally related feature
## feature #2: marginally unrelated feature, but correlated with feature #1
## feature #3-5: marginally related features that are also internally correlated
## feature #6-10: noise features with no relationship to y
set.seed(12345)
n <- 100
p <- 10
means <- rbind(
  c(0, 1, 0),
  c(0, 0, 0),
  c(0, 0, 1),
  c(0, 0, 1),
  c(0, 0, 1)
) * 2
means <- rbind(means, matrix(0, p - 5, 3))
A <- diag(1, p)
A[1:5, 1:3] <- rbind(
  c(1, 0, 0),
  c(2, 1, 0),
  c(0, 0, 1),
  c(0, 0, 1),
  c(0, 0, 1)
)
dat <- gendata_FAM(n, means, A, sd_g = 0.5, stdx = TRUE)
ggplot2::qplot(dat$y, bins = 6)
corrplot::corrplot(cor(dat$X))
This function generates the response variables y given an optionally supplied X, using a multinomial logistic regression model.
gendata_MLR(n, p, NC = 3, nu = 2, w = 1, X = NULL, betas = NULL)
n: Number of observations.
p: Number of features.
NC: Number of classes for the response variables.
nu, w: If betas is not supplied, the regression coefficients are generated from a t distribution with degrees of freedom nu and scale sqrt(w); ignored when betas is supplied.
X: The design matrix; generated from the standard normal distribution if not supplied.
betas: User-supplied regression coefficients.
A list containing the input matrix X, the response variables y, and the regression coefficients deltas.
set.seed(12345)
dat <- gendata_MLR(n = 100, p = 10)
ggplot2::qplot(dat$y, bins = 6)
corrplot::corrplot(cor(dat$X))
This function trains linear logistic regression models using Hamiltonian Monte Carlo (HMC) within restricted Gibbs sampling.
htlr(
  X,
  y,
  fsel = 1:ncol(X),
  stdx = TRUE,
  prior = "t",
  df = 1,
  iter = 2000,
  warmup = floor(iter/2),
  thin = 1,
  init = "lasso",
  leap = 50,
  leap.warm = floor(leap/10),
  leap.stepsize = 0.3,
  cut = 0.05,
  verbose = FALSE,
  rep.legacy = FALSE,
  keep.warmup.hist = FALSE,
  ...
)
X: Input matrix, of dimension nobs by nvars; each row is an observation vector.
y: Vector of response variables. Must be coded as non-negative integers, e.g., 1, 2, ..., C for C classes; label 0 is also allowed.
fsel: Subset of features selected before fitting, such as by univariate screening.
stdx: Logical; if TRUE, the features are standardized before fitting (see std).
prior: The prior to be applied to the model. Either a list of hyperparameter settings returned by htlr_prior, or a character string among "t" (default), "ghs", and "neg".
df: The degrees of freedom of the t/ghs/neg prior for coefficients. Ignored if a configuration list from htlr_prior is supplied to prior.
iter: A positive integer specifying the number of iterations (including warmup).
warmup: A positive integer specifying the number of warmup (aka burnin) iterations, which should not be larger than iter; the default is iter/2.
thin: A positive integer specifying the period for saving samples.
init: The initial state of the Markov chain; it accepts three forms, including a character string naming an initialization method ("lasso" by default; "bcbc" is used in the examples below) and a matrix of user-supplied initial values.
leap: The length of the leapfrog trajectory in the sampling phase.
leap.warm: The length of the leapfrog trajectory in the burnin phase.
leap.stepsize: The integrator step size used in the Hamiltonian simulation.
cut: Coefficients smaller than this threshold are kept fixed in each HMC updating step.
verbose: Logical; set to TRUE to print the progress of sampling.
rep.legacy: Logical; if TRUE, results from legacy versions of the package are replicated.
keep.warmup.hist: Warmup iterations are not recorded by default; set to TRUE to enable recording.
...: Other optional parameters for advanced use.
An object with S3 class htlr.fit.
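A short sketch (not in the original manual) of how iter, warmup, and thin determine the number of saved draws, assuming thinning applies to the post-warmup iterations:

data("colon")
## demo-length chain: 40 iterations, 10 warmup, keep every 2nd post-warmup draw
fit <- htlr(X = colon$X, y = colon$y, fsel = 1:50, iter = 40, warmup = 10, thin = 2)
nrow(as.matrix(fit))  ## expected: (40 - 10) / 2 = 15 saved draws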
Longhai Li and Weixin Yao (2018). Fully Bayesian Logistic Regression with Hyper-Lasso Priors for High-dimensional Feature Selection. Journal of Statistical Computation and Simulation, 88:14, 2827-2851.
set.seed(12345)
data("colon")
## fit HTLR models with selected features; the chain length setting is for demo only
## using t prior with 1 df and log-scale fixed to -10
fit.t <- htlr(X = colon$X, y = colon$y, fsel = 1:100,
              prior = htlr_prior("t", df = 1, logw = -10),
              init = "bcbc", iter = 20, thin = 1)
## using NEG prior with 1 df and log-scale fixed to -10
fit.neg <- htlr(X = colon$X, y = colon$y, fsel = 1:100,
                prior = htlr_prior("neg", df = 1, logw = -10),
                init = "bcbc", iter = 20, thin = 1)
## using horseshoe prior with 1 df and auto-selected log-scale
fit.ghs <- htlr(X = colon$X, y = colon$y, fsel = 1:100,
                prior = "ghs", df = 1, init = "bcbc", iter = 20, thin = 1)
Configure prior hyper-parameters for HTLR model fitting
htlr_prior(
  ptype = c("t", "ghs", "neg"),
  df = 1,
  logw = -(1/df) * 10,
  eta = ifelse(df > 1, 3, 0),
  sigmab0 = 2000
)
ptype: The prior to be applied to the model: "t" (Student-t, default), "ghs" (horseshoe), or "neg" (normal-exponential-gamma).
df: The degrees of freedom (aka alpha) of the t/ghs/neg prior for coefficients.
logw: The log scale of the prior for coefficients.
eta: The standard deviation of the prior on logw; when eta is 0, logw is fixed.
sigmab0: The scale of the Gaussian prior for the intercept.
The output is a configuration list to be passed to the prior argument of htlr. Most users only need to specify the prior type and degrees of freedom; the remaining hyper-parameters are then chosen automatically. Advanced users can supply each hyper-parameter themselves. For suggestions on choosing hyper-parameters, see the references.
A configuration list containing ptype, alpha, logw, eta, and sigmab0.
Longhai Li and Weixin Yao (2018). Fully Bayesian Logistic Regression with Hyper-Lasso Priors for High-dimensional Feature Selection. Journal of Statistical Computation and Simulation, 88:14, 2827-2851.
Get the indices of non-zero coefficients from fitted HTLR model objects.
nzero_idx(fit, cut = 0.1)
fit: An object of S3 class htlr.fit.
cut: Threshold on the relative SDB used to distinguish zero coefficients.
Vector of indices of the non-zero coefficients in the model.
set.seed(12345)
data("colon")
fit <- htlr(X = colon$X, y = colon$y, fsel = 1:100, iter = 20)
nzero_idx(fit)
Similar to other predict methods, this function returns predictions from a fitted htlr.fit object.
## S3 method for class 'htlr.fit'
predict(object, newx, type = c("response", "class"), ...)
object: A fitted model object with S3 class htlr.fit.
newx: A matrix of values at which predictions are to be made.
type: Type of prediction required. Type "response" gives the fitted probabilities; type "class" produces the class label corresponding to the maximum probability.
...: Advanced options for specifying the Markov chain iterations used for inference; see summary.htlr.fit.
The object returned depends on type.
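No example ships with this method; a sketch with a demo-length chain:

set.seed(12345)
data("colon")
dat <- split_data(colon$X, colon$y, p.train = 0.7)
fit <- htlr(X = dat$x.tr, y = dat$y.tr, fsel = 1:100, iter = 20)
head(predict(fit, dat$x.te, type = "response"))  ## fitted probabilities
predict(fit, dat$x.te, type = "class")           ## hard class labels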
This function splits the input data and response variables into training and testing parts.
split_data(X, y, p.train = 0.7, n.train = round(nrow(X) * p.train))
X: Input matrix, of dimension nobs by nvars.
y: Vector of response variables.
p.train: Proportion of cases used for training.
n.train: Number of cases used for training; overrides p.train if supplied.
List of training data x.tr, y.tr and testing data x.te, y.te.
dat <- gendata_MLR(n = 100, p = 10)
dat <- split_data(dat$X, dat$y, p.train = 0.7)
dim(dat$x.tr)
dim(dat$x.te)
This function accepts a design matrix and returns a standardized version of that matrix; the statistics of each column used for standardization, such as the median and sd, are also provided.
std(X, tol = 1e-06)
X: Design matrix, of dimension nobs by nvars.
tol: The tolerance value; a column of X is treated as singular if its standard deviation is below tol.
For each column of X, the standardization is done by first subtracting its median, then dividing by its sample standard deviation, whereas the original version in ncvreg uses the mean and the population standard deviation. This function is slower than ncvreg's because of the complexity of median finding, but still substantially faster than scale() from base R.
The standardized design matrix, with the following attributes:
nonsingular: Indices of the non-singular columns.
center: Median of each non-singular column, used for centering.
scale: Standard deviation of each non-singular column, used for scaling.
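A quick sketch (not in the original help page) confirming the median/sample-sd standardization described above:

set.seed(123)
X <- matrix(rnorm(50), 10, 5)
Xs <- std(X)
apply(Xs, 2, median)  ## all 0: each column is centered at its median
apply(Xs, 2, sd)      ## all 1: each column is scaled by its sample sd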
Patrick Breheny (original)
Steven Liu (modification)
http://pbreheny.github.io/ncvreg/reference/std.html
set.seed(123)
mat <- matrix(rnorm(n = 80 * 90, mean = 100, sd = 50), 80, 90)
mat %>% as.numeric() %>% ggplot2::qplot(bins = 30, xlab = '')
mat %>% std() %>% as.numeric() %>% ggplot2::qplot(bins = 30, xlab = '')
This function gives a summary of the posterior of the parameters.
## S3 method for class 'htlr.fit'
summary(
  object,
  features = 1L:object$p,
  method = median,
  usedmc = get_sample_indice(dim(object$mcdeltas)[3], object$mc.param$iter.rmc),
  ...
)
object: An object of S3 class htlr.fit.
features: A vector of indices (int) or names (char) specifying the parameters to look at. By default all parameters are selected.
method: A function used to aggregate the MCMC samples. The default is median.
usedmc: Indices of the Markov chain iterations used for inference. By default all iterations are used.
...: Not used.
A point summary of MCMC samples.
set.seed(12345)
data("colon")
fit <- htlr(X = colon$X, y = colon$y, fsel = 1:100, iter = 20)
summary(fit, features = 1:16)
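Continuing from the fit object created above, aggregation is not limited to the default median; any summary function can be supplied via method, e.g. the posterior mean:

## reuse `fit` from the example above
summary(fit, features = 1:16, method = mean)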