Title: | Bayesian Variable Selection in High Dimensional Settings using Nonlocal Priors |
---|---|
Description: | Variable/Feature selection in high or ultra-high dimensional settings has gained a lot of attention recently specially in cancer genomic studies. This package provides a Bayesian approach to tackle this problem, where it exploits mixture of point masses at zero and nonlocal priors to improve the performance of variable selection and coefficient estimation. product moment (pMOM) and product inverse moment (piMOM) nonlocal priors are implemented and can be used for the analyses. This package performs variable selection for binary response and survival time response datasets which are widely used in biostatistic and bioinformatics community. Benefiting from parallel computing ability, it reports necessary outcomes of Bayesian variable selection such as Highest Posterior Probability Model (HPPM), Median Probability Model (MPM) and posterior inclusion probability for each of the covariates in the model. The option to use Bayesian Model Averaging (BMA) is also part of this package that can be exploited for predictive power measurements in real datasets. |
Authors: | Amir Nikooienejad [aut, cre], Valen E. Johnson [ths] |
Maintainer: | Amir Nikooienejad <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1.9 |
Built: | 2024-11-27 06:29:12 UTC |
Source: | CRAN |
This function performs Bayesian variable selection for high dimensional design matrix using iMOM prior for non-zero coefficients. It also performs adaptive hyperparameter selection for iMOM prior. Cleaning the data in a preprocessing step and before any data analysis is done by user preference. This function is for binary and survival time response datasets. In the former, MCMC is used to search in the model space while for the latter a stochastic search does that job. This function has the option to do all the mentioned tasks in a parallel fashion, exploiting hundreds of CPUs. It is highly recommended to use a cluster for this purpose. This function also supports fixing covariates in variable selection process, thus making them included in the final selected model with probability 1. Categorical variable are also supported by this function as input covariates to the selection process. They need to be well defined factor variables as part of the input data frame. For the output, this function reports necessary measurements that is common in Bayesian variable selection algorithms. They include Highest Posterior Probability model, median probability model and posterior inclusion probability for each of the covariates in the design matrix.
bvs( X, resp, prep = TRUE, logT = FALSE, fixed_cols = NULL, eff_size = 0.5, family = c("logistic", "survival"), hselect = TRUE, nlptype = "piMOM", r = 1, tau = 0.25, niter = 30, mod_prior = c("unif", "beta"), inseed = NULL, cplng = FALSE, ncpu = 4, parallel.MPI = FALSE )
bvs( X, resp, prep = TRUE, logT = FALSE, fixed_cols = NULL, eff_size = 0.5, family = c("logistic", "survival"), hselect = TRUE, nlptype = "piMOM", r = 1, tau = 0.25, niter = 30, mod_prior = c("unif", "beta"), inseed = NULL, cplng = FALSE, ncpu = 4, parallel.MPI = FALSE )
X |
The |
resp |
For logistic regression models it is the binary response vector which could be either numeric or factor variable in R. For the Cox proportional hazard models this is a two column matrix where the first column contains survival time vector and the second column is the censoring status for each observation. |
prep |
A boolean variable determining if the preprocessing step should
be performed on the design matrix or not. That step contains removing
columns that have |
logT |
A boolean variable determining if log transform should be done on continuous columns before scaling them in the preprocessing step. Note that those columns should not contain any zeros or negative values. |
fixed_cols |
A vector of indices showing the columns in the input
data frame that are not subject to the the selection procedure. These
columns are always in the final selected model. Note that if any of these
columns contain |
eff_size |
This is the expected effect size in the model for a standardized design matrix, which is basically the coefficient value that is expected to occur the most based on some prior knowledge. |
family |
Determines the type of data analysis. |
hselect |
A boolean variable indicating whether the automatic procedure
for hyperparameter selection should be run or not. The default value is
|
nlptype |
Determines the type of nonlocal prior that is used in the analyses. It can be "piMOM" for product inverse moment prior, or "pMOM" for product moment prior. The default is set to piMOM prior. |
r |
The paramter |
tau |
The paramter |
niter |
Number of iterations. For binary response data, this determines the number of MCMC iterations per CPU. For survival response data this is the number of iterations per temperature schedule in the stochastic search algorithm. |
mod_prior |
Type of prior used for the model space. |
inseed |
The input seed for making the parallel processing
reproducible. This parameter is ignored in logistic regression models when
|
cplng |
This parameter is only used in logistic regression models, and indicating if coupling algorithm for MCMC output should be performed or not. |
ncpu |
This is the number of cpus used in parallel processing. For logistic regression models this is the number of parallel coupled chains run at the same time. For survival outcome data this is the number of cpus doing stochastic search at the same time to increase th enumber of visited models. |
parallel.MPI |
A boolean variable determining if MPI is used for
parallel processing or not. Note that in order to use this feature, your
system should support MPI and |
It returns a list containing different objects that depend on the
family of the model and the coupling flag for logistic regression models.
The following describes the objects in the output list based on different
combinations of those two input arguments.
1) family = logistic && cplng = FALSE
num_vis_models |
Number of unique models visited throughout the search of the model space. |
max_prob |
Maximum unnormalized probability among all visited models |
HPM |
The indices of the model with highest posterior
probability among all visited models, with respect to the columns in
the output |
beta_hat |
The coefficient vector for the selected model. The first component is always for the intercept. |
MPM |
The indices of median probability model. According to the paper
Barbieri et. al., this is defined to be the model consisting of those
variables whose posterior inclusion probability is at least 1/2. The order
of columns is similar to that is explained for |
max_prob_vec |
A |
max_models |
A list containing models corresponding to
|
inc_probs |
A vector of length |
nlptype |
The type of nonlocal prior used in the analyses. |
des_mat |
The design matrix used in the analysis where fixed columns
are moved to the beginning of the matrix and if |
gene_names |
Names of the genes extracted from the design matrix. |
r |
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. |
tau |
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. |
2) family = logistic && cplng = TRUE
cpl_percent |
Shows what percentage of pairs of chains are coupled. |
margin_probs |
A |
chains |
A |
nlptype |
The type of nonlocal prior used in the analyses. |
cpl_flags |
A |
beta_hat |
A |
uniq_models |
A list showing unique models with the indices of the included covariates at each model. |
freq |
Frequency of each of the unique models. It is used to find the highest frquency model. |
probs |
Unnormalized probability of each of the unique models. |
des_mat |
The design matrix used in the analysis where fixed columns
are moved to the beginning of the matrix and if |
gene_names |
Names of the genes extracted from the design matrix. |
r |
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. |
tau |
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. |
3) family = survival
num_vis_models |
Number of visited models during the whole process. |
max_prob |
The unnormalized probability of the maximum model among all visited models. |
HPM |
The indices of the model with highest posterior
probability among all visited models, with respect to the columns in
|
MPM |
The indices of median probability model. According to the paper
Barbieri et. al., this is defined to be the model consisting of those
variables whose posterior inclusion probability is at least 1/2. The order
of columns is similar to that is explained for |
beta_hat |
The coefficient vector for the selected model reported in
|
max_prob_vec |
A |
max_models |
A list containing models corresponding to
|
inc_probs |
A |
nlptype |
The type of nonlocal prior used in the analyses. |
des_mat |
The design matrix used in the analysis where fixed columns
are moved to the beginning of the matrix and if |
start_models |
A |
gene_names |
Names of the genes extracted from the design matrix. |
r |
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. |
tau |
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. |
Amir Nikooienejad
Nikooienejad, A., Wang, W., and Johnson, V. E. (2016). Bayesian
variable selection for binary outcomes in high dimensional genomic studies
using nonlocal priors. Bioinformatics, 32(9), 1338-1345.
Nikooienejad, A., Wang, W., & Johnson, V. E. (2020). Bayesian variable
selection for survival data using inverse moment priors. Annals of Applied
Statistics, 14(2), 809-828.
Johnson, V. E. (1998). A coupling-regeneration scheme for
diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of
the American Statistical Association, 93(441), 238-248.
Shin, M., Bhattacharya, A., and Johnson, V. E. (2017). Scalable Bayesian
variable selection using nonlocal prior densities in ultrahigh dimensional
settings. Statistica Sinica.
Johnson, V. E., and Rossell, D. (2010). On the use of non-local prior
densities in Bayesian hypothesis tests. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 72(2), 143-170.
Barbieri, M. M., and Berger, J. O. (2004). Optimal predictive model
selection. The annals of statistics, 32(3), 870-897.
### Simulating Logistic Regression Data n <- 200 p <- 40 set.seed(123) Sigma <- diag(p) full <- matrix(c(rep(0.5, p*p)), ncol=p) Sigma <- full + 0.5*Sigma cholS <- chol(Sigma) Beta <- c(-1.9,1.3,2.2) X <- matrix(rnorm(n*p), ncol=p) X <- X%*%cholS beta <- numeric(p) beta[c(1:length(Beta))] <- Beta XB <- X%*%beta probs <- as.vector(exp(XB)/(1+exp(XB))) y <- rbinom(n,1,probs) colnames(X) <- paste("gene_",c(1:p),sep="") X <- as.data.frame(X) ### Running 'bvs' function without coupling and with hyperparamter selection ### procedure bout <- bvs(X, y, family = "logistic", nlptype = "piMOM", mod_prior = "beta", niter = 50) ### Highest Posterior Model bout$HPM ### Estimated Coefficients: bout$beta_hat ### Number of Visited Models: bout$num_vis_models
### Simulating Logistic Regression Data n <- 200 p <- 40 set.seed(123) Sigma <- diag(p) full <- matrix(c(rep(0.5, p*p)), ncol=p) Sigma <- full + 0.5*Sigma cholS <- chol(Sigma) Beta <- c(-1.9,1.3,2.2) X <- matrix(rnorm(n*p), ncol=p) X <- X%*%cholS beta <- numeric(p) beta[c(1:length(Beta))] <- Beta XB <- X%*%beta probs <- as.vector(exp(XB)/(1+exp(XB))) y <- rbinom(n,1,probs) colnames(X) <- paste("gene_",c(1:p),sep="") X <- as.data.frame(X) ### Running 'bvs' function without coupling and with hyperparamter selection ### procedure bout <- bvs(X, y, family = "logistic", nlptype = "piMOM", mod_prior = "beta", niter = 50) ### Highest Posterior Model bout$HPM ### Estimated Coefficients: bout$beta_hat ### Number of Visited Models: bout$num_vis_models
This function estimates coefficient vector for a given set of covariates in a logistic regression and Cox proportional hazard models. It uses the inverse moment nonlocal prior (iMOM) for non zero coefficients.
CoefEst( X, resp, mod_cols, nlptype = "piMOM", tau, r, family = c("logistic", "survival") )
CoefEst( X, resp, mod_cols, nlptype = "piMOM", tau, r, family = c("logistic", "survival") )
X |
The design matrix. It is assumed that the preprocessing steps have
been done on this matrix. It is recommended that to use the output of
|
resp |
For logistic regression models, this variable is the binary response vector. For Cox proportional hazard models this is a two column matrix where the first column contains the survival time vector and the second column is the censoring status for each observation. |
mod_cols |
A vector of column indices of the design matrix, representing the selected model. |
nlptype |
Determines the type of nonlocal prior that is used in the analyses. It can be "piMOM" for product inverse moment prior, or "pMOM" for product moment prior. The default is set to piMOM prior. |
tau |
Hyperparameter |
r |
Hyperparameter |
family |
Determines the type of data analysis. |
It returns the vector of coefficients for the given model.
Amir Nikooienejad
Nikooienejad, A., Wang, W., and Johnson, V. E. (2016). Bayesian
variable selection for binary outcomes in high dimensional genomic studies
using non-local priors. Bioinformatics, 32(9), 1338-1345.
Nikooienejad, A., Wang, W., & Johnson, V. E. (2020). Bayesian variable
selection for survival data using inverse moment priors. Annals of Applied
Statistics, 14(2), 809-828.
### Simulating Survival Data n <- 400 p <- 1000 lambda <- 0.8 cens_rate <- 0.27 set.seed(123) Sigma <- diag(p) full <- matrix(c(rep(0.5, p*p)), ncol=p) Sigma <- full + 0.5*Sigma cholS <- chol(Sigma) Beta <- c(-1.8, 1.2, -1.7, 1.4, -1.4, 1.3) X = matrix(rnorm(n*p), ncol=p) X = X%*%cholS X <- scale(X) beta <- numeric(p) beta[c(1:length(Beta))] <- Beta XB <- X%*%beta uvector <- -log(runif(n)); times <- uvector/(lambda*exp(XB)) cens_time <- quantile(times,1-cens_rate) status <- as.numeric(times < cens_time) TS <- cbind(times,status) ### Estimating coeffcients of the true model and an arbitrary hyper ### parameter for the iMOM prior density mod <- c(1:6) coef <- CoefEst(X, TS, mod, tau = 1.8, r = 2, family = "survival") coef
### Simulating Survival Data n <- 400 p <- 1000 lambda <- 0.8 cens_rate <- 0.27 set.seed(123) Sigma <- diag(p) full <- matrix(c(rep(0.5, p*p)), ncol=p) Sigma <- full + 0.5*Sigma cholS <- chol(Sigma) Beta <- c(-1.8, 1.2, -1.7, 1.4, -1.4, 1.3) X = matrix(rnorm(n*p), ncol=p) X = X%*%cholS X <- scale(X) beta <- numeric(p) beta[c(1:length(Beta))] <- Beta XB <- X%*%beta uvector <- -log(runif(n)); times <- uvector/(lambda*exp(XB)) cens_time <- quantile(times,1-cens_rate) status <- as.numeric(times < cens_time) TS <- cbind(times,status) ### Estimating coeffcients of the true model and an arbitrary hyper ### parameter for the iMOM prior density mod <- c(1:6) coef <- CoefEst(X, TS, mod, tau = 1.8, r = 2, family = "survival") coef
This function performs Bayesian variable selection for
survival data in a non-parallel fashion. It runs modified S5 algorithm to
search the model space but since this is only on one CPU, the number of
visited models will not be large and therefore is NOT recommended for
high dimensional datasets. This function is called by bvs
function in a parllel fashion and therefore that function is recommended
to be used.
cox_bvs(exmat, cur_cols, nf, tau, r, nlptype, a, b, d, L, J, temps)
cox_bvs(exmat, cur_cols, nf, tau, r, nlptype, a, b, d, L, J, temps)
exmat |
An extended matrix where the first two columns are survival
times and status, respectively and the rest is the design matrix which is
produced by |
cur_cols |
A vector containing indices of the initial model for
variable selector to start the S5 algorithm from. Note that the first
|
nf |
The number of fixed covariates that do not enter the selection procedure. |
tau |
The paramter |
r |
The paramter |
nlptype |
Determines the type of nonlocal prior that is used in the
analyses. |
a |
The first parameter in beta distribution used as prior on model size. This parameter is equal to 1 when uinform-binomial prior is used. |
b |
The second paramter in beta distribution used as prior on model size. This parameter is equal to 1 when uinform-binomial prior is used. |
d |
This is the number of candidate covariates picked from top variables with highest utility function value and used in S5 algorithm. |
L |
Number of temperatures in S5 algorithm. |
J |
Number of iterations at each temperature in S5 algorithm. |
temps |
Vector of temperatuers used in S5 algorithm. |
It returns a list containing following objects:
max_model |
A |
hash_key |
A column vector indicating the generated key for each model that is used to track visited models and growing dictionary. |
max_prob |
The unnormalized probability of the model with highest posterior probability. |
all_probs |
A vector containing unnormalized probabilities of all visited models. |
vis_covs_list |
A list containing the covariates in each visited model in the stochastic search process. |
Amir Nikooienejad
Nikooienejad, A., Wang, W., and Johnson, V. E. (2017). Bayesian
Variable Selection in High Dimensional Survival Time Cancer Genomic
Datasets using Nonlocal Priors. arXiv preprint, arXiv:1712.02964.
Shin, M., Bhattacharya, A., and Johnson, V. E. (2017). Scalable
Bayesian variable selection using nonlocal prior densities in ultrahigh
dimensional settings. Statistica Sinica.
Johnson, V. E., and Rossell, D. (2010). On the use of non-local prior
densities in Bayesian hypothesis tests. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 72(2), 143-170.
### Initializing the parameters n <- 100 p <- 40 set.seed(123) Sigma <- diag(p) full <- matrix(c(rep(0.5, p*p)), ncol=p) Sigma <- full + 0.5*Sigma cholS <- chol(Sigma) Beta <- c(-1.8, 1.2, -1.7, 1.4, -1.4, 1.3) X = matrix(rnorm(n*p), ncol=p) X = X%*%cholS X <- scale(X) beta <- numeric(p) beta[c(1:length(Beta))] <- Beta XB <- X%*%beta sur_times <- rexp(n,exp(XB)) cens_times <- rexp(n,0.2) times <- pmin(sur_times,cens_times) status <- as.numeric(sur_times <= cens_times) exmat <- cbind(times,status,X) L <- 10; J <- 10 d <- 2 * ceiling(log(p)) temps <- seq(3, 1, length.out = L) tau <- 0.5; r <- 1; a <- 6; b <- p-a nlptype <- 0 ### PiMOM nonlocal prior cur_cols <- c(1,2,3) ### Starting model for the search algorithm nf <- 0 ### No fixed columns ### Running the Function coxout <- cox_bvs(exmat,cur_cols,nf,tau,r,nlptype,a,b,d,L,J,temps) ### The number of visited model for this specific run: length(coxout$hash_key) ### The selected model: which(coxout$max_model>0) ### The unnormalized probability of the selected model: coxout$max_prob
### Initializing the parameters n <- 100 p <- 40 set.seed(123) Sigma <- diag(p) full <- matrix(c(rep(0.5, p*p)), ncol=p) Sigma <- full + 0.5*Sigma cholS <- chol(Sigma) Beta <- c(-1.8, 1.2, -1.7, 1.4, -1.4, 1.3) X = matrix(rnorm(n*p), ncol=p) X = X%*%cholS X <- scale(X) beta <- numeric(p) beta[c(1:length(Beta))] <- Beta XB <- X%*%beta sur_times <- rexp(n,exp(XB)) cens_times <- rexp(n,0.2) times <- pmin(sur_times,cens_times) status <- as.numeric(sur_times <= cens_times) exmat <- cbind(times,status,X) L <- 10; J <- 10 d <- 2 * ceiling(log(p)) temps <- seq(3, 1, length.out = L) tau <- 0.5; r <- 1; a <- 6; b <- p-a nlptype <- 0 ### PiMOM nonlocal prior cur_cols <- c(1,2,3) ### Starting model for the search algorithm nf <- 0 ### No fixed columns ### Running the Function coxout <- cox_bvs(exmat,cur_cols,nf,tau,r,nlptype,a,b,d,L,J,temps) ### The number of visited model for this specific run: length(coxout$hash_key) ### The selected model: which(coxout$max_model>0) ### The unnormalized probability of the selected model: coxout$max_prob
This function finds data specific hyperparameters for inverse
moment prior density so that the overlap between the iMOM prior and null
MLE density is . In this algorithm
r
is always chosen
to be equal to 1 and tau
is found based on the mentioned overlap.
HyperSelect( X, resp, eff_size = 0.7, nlptype = "piMOM", iter = 10000, mod_prior = c("unif", "beta"), family = c("logistic", "survival") )
HyperSelect( X, resp, eff_size = 0.7, nlptype = "piMOM", iter = 10000, mod_prior = c("unif", "beta"), family = c("logistic", "survival") )
X |
The design matrix. |
resp |
For logistic regression models, it is the binary response vector. For Cox proportional hazard model, this is a two column matrix where the first column contains survival time vector and the second column is the censoring status for each observation. |
eff_size |
This is the expected effect size in the model for a standardized design matrix, which is basically the coefficient value that is expected to occur the most based on some prior knowledge. |
nlptype |
Determines the type of nonlocal prior that is used in the analyses. It can be "piMOM" for product inverse moment prior, or "pMOM" for product moment prior. The default is set to piMOM prior. |
iter |
The number of iteration needed to simulate from null model in order to approximate the null MLE density. |
mod_prior |
Type of prior used for model space. |
family |
Determines the type of data analysis. |
It returns a list having following object:
tau |
The hyperparameter for iMOM prior density function, calculated using the proposed algorithm for the given dataset. |
Amir Nikooienejad
Nikooienejad, A., Wang, W., and Johnson, V. E. (2016). Bayesian
variable selection for binary outcomes in high dimensional genomic studies
using nonlocal priors. Bioinformatics, 32(9), 1338-1345.
Johnson, V. E., and Rossell, D. (2010). On the use of nonlocal prior
densities in Bayesian hypothesis tests. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 72(2), 143-170.
### Simulating Logistic Regression Data n <- 20 p <- 100 set.seed(321) Sigma <- diag(p) full <- matrix(c(rep(0.5, p * p)), ncol = p) Sigma <- full + 0.5 * Sigma cholS <- chol(Sigma) Beta <- c(1, 1.8, 2.5) X = matrix(rnorm(n*p, 1, 2), ncol = p) X = X%*%cholS beta <- numeric(p) beta[c(1:length(Beta))] <- Beta XB <- X%*%beta probs <- as.vector(exp(XB)/(1+exp(XB))) y <- rbinom(n,1,probs) colnames(X) <- paste("gene_",c(1:p),sep="") Xout <- PreProcess(X) XX <- Xout$X hparam <- HyperSelect(XX, y, iter = 1000, mod_prior = "beta", family = "logistic") hparam$tau
### Simulating Logistic Regression Data n <- 20 p <- 100 set.seed(321) Sigma <- diag(p) full <- matrix(c(rep(0.5, p * p)), ncol = p) Sigma <- full + 0.5 * Sigma cholS <- chol(Sigma) Beta <- c(1, 1.8, 2.5) X = matrix(rnorm(n*p, 1, 2), ncol = p) X = X%*%cholS beta <- numeric(p) beta[c(1:length(Beta))] <- Beta XB <- X%*%beta probs <- as.vector(exp(XB)/(1+exp(XB))) y <- rbinom(n,1,probs) colnames(X) <- paste("gene_",c(1:p),sep="") Xout <- PreProcess(X) XX <- Xout$X hparam <- HyperSelect(XX, y, iter = 1000, mod_prior = "beta", family = "logistic") hparam$tau
This function performs Bayesian variable selection for
logistic regression data in a non-parallel fashion. It does not contain any
pre-processing step or variable initialization. Moreover it does not have
the feature to be run in parallel for performing the coupling algorithm.
Therefore in general, it is NOT recommended to be used unless the user
knows how to initialize all the parameters. However, this function is
called by bvs
function, the recommended way to run Bayesian
variable selection for such datasets.
logreg_bvs( exmat, chain1, nf, tau, r, nlptype, a, b, in_cons, loopcnt, cplng, chain2 )
logreg_bvs( exmat, chain1, nf, tau, r, nlptype, a, b, in_cons, loopcnt, cplng, chain2 )
exmat |
An extended matrix where the first column is binary resonse
vector and the rest is the design matrix which has its first column all 1
to account for the intercept in the model and is the output of
|
chain1 |
The first chain or initial model where the MCMC algorithm
starts from. Note that the first |
nf |
The number of fixed covariates that do not enter the selection procedure. |
tau |
The paramter |
r |
The paramter |
nlptype |
Determines the type of nonlocal prior that is used in the
analyses. |
a |
The first parameter in beta distribution used as prior on model size. This parameter is equal to 1 when uinform-binomial prior is used. |
b |
The second paramter in beta distribution used as prior on model size. This parameter is equal to 1 when uinform-binomial prior is used. |
in_cons |
The average model size. This value under certain conditions
and when |
loopcnt |
Number of iterations for MCMC procedure. |
cplng |
A boolean variable indicating the coupling algorithm to be performed or not. |
chain2 |
Second chain or model for starting the MCMC procedure. This
parameter is only used when |
It returns a list containing following objects:
max_chain |
A |
beta_hat |
The coefficient vector for the selected model. The first one is always for the intercept. |
max_prop |
The unnormalized probability of the model with highest posterior probability. |
num_iterations |
The number of MCMC iterations that are executed.
This is used when |
cplng_flag |
This is used when |
num_vis_models |
Number of visited models in search for the highest probability model. This contains redundant models too and is not the number of unique models. |
hash_key |
This is only used when |
hash_prob |
This is only used when |
vis_covs |
This is only used when |
Amir Nikooienejad
Nikooienejad, A., Wang, W., and Johnson, V. E. (2016). Bayesian
variable selection for binary outcomes in high dimensional genomic studies
using nonlocal priors. Bioinformatics, 32(9), 1338-1345.
Nikooienejad, A., Wang, W., and Johnson, V. E. (2017). Bayesian Variable
Selection in High Dimensional Survival Time Cancer Genomic Datasets using
Nonlocal Priors. arXiv preprint, arXiv:1712.02964.
Johnson, V. E., and Rossell, D. (2010). On the use of non-local prior
densities in Bayesian hypothesis tests. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 72(2), 143-170.
Johnson, V. E. (1998). A coupling-regeneration scheme for
diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of
the American Statistical Association, 93(441), 238-248.
### Initializing parameters n <- 200 p <- 40 set.seed(123) Sigma <- diag(p) full <- matrix(c(rep(0.5, p*p)), ncol=p) Sigma <- full + 0.5*Sigma cholS <- chol(Sigma) Beta <- c(-1.7,1.8,2.5) X <- matrix(rnorm(n*p), ncol=p) X <- X%*%cholS colnames(X) <- paste("gene_",c(1:p),sep="") beta <- numeric(p) beta[c(1:length(Beta))] <- Beta XB <- X%*%beta probs <- as.vector(exp(XB)/(1+exp(XB))) y <- rbinom(n,1,probs) exmat <- cbind(y,X) tau <- 0.5; r <- 1; a <- 3; b <- p-a; in_cons <- a; loopcnt <- 100; cplng <- FALSE; initProb <- in_cons/p ### Initializing Chains schain <- p while (schain > in_cons || schain == 0) { chain1 <- rbinom(p, 1, initProb) schain <- sum(chain1) } chain1 <- as.numeric(c(1, chain1)) chain2 <- chain1 nlptype <- 0 ## PiMOM nonlocal prior nf <- 0 ### No fixed columns ### Running the function bvsout <- logreg_bvs(exmat,chain1,nf,tau,r,nlptype,a,b,in_cons,loopcnt,cplng,chain2) ### Number of visited models for this specific run: bvsout$num_vis_models ### The selected model: which(bvsout$max_chain > 0) ### Estimated coefficients: bvsout$beta_hat ### The unnormalized probability of the selected model: bvsout$max_prob
### Initializing parameters n <- 200 p <- 40 set.seed(123) Sigma <- diag(p) full <- matrix(c(rep(0.5, p*p)), ncol=p) Sigma <- full + 0.5*Sigma cholS <- chol(Sigma) Beta <- c(-1.7,1.8,2.5) X <- matrix(rnorm(n*p), ncol=p) X <- X%*%cholS colnames(X) <- paste("gene_",c(1:p),sep="") beta <- numeric(p) beta[c(1:length(Beta))] <- Beta XB <- X%*%beta probs <- as.vector(exp(XB)/(1+exp(XB))) y <- rbinom(n,1,probs) exmat <- cbind(y,X) tau <- 0.5; r <- 1; a <- 3; b <- p-a; in_cons <- a; loopcnt <- 100; cplng <- FALSE; initProb <- in_cons/p ### Initializing Chains schain <- p while (schain > in_cons || schain == 0) { chain1 <- rbinom(p, 1, initProb) schain <- sum(chain1) } chain1 <- as.numeric(c(1, chain1)) chain2 <- chain1 nlptype <- 0 ## PiMOM nonlocal prior nf <- 0 ### No fixed columns ### Running the function bvsout <- logreg_bvs(exmat,chain1,nf,tau,r,nlptype,a,b,in_cons,loopcnt,cplng,chain2) ### Number of visited models for this specific run: bvsout$num_vis_models ### The selected model: which(bvsout$max_chain > 0) ### Estimated coefficients: bvsout$beta_hat ### The unnormalized probability of the selected model: bvsout$max_prob
This function calculates the logarithm of unnormalized probability of a given set of covariates for both survival and binary response data. It uses the inverse moment nonlocal prior (iMOM) for non zero coefficients and beta binomial prior for the model space.
ModProb( X, resp, mod_cols, nlptype = "piMOM", tau, r, a, b, family = c("logistic", "survival") )
ModProb( X, resp, mod_cols, nlptype = "piMOM", tau, r, a, b, family = c("logistic", "survival") )
X |
The design matrix. It is assumed that the design matrix has
standardized columns. It is recommended that to use the output of
|
resp |
For logistic regression models, this variable is the binary response vector. For Cox proportional hazard models this is a two column matrix where the first column contains the survival time vector and the second column is the censoring status for each observation. |
mod_cols |
A vector of column indices of the design matrix, representing the model. |
nlptype |
Determines the type of nonlocal prior that is used in the analyses. It can be "piMOM" for product inverse moment prior, or "pMOM" for product moment prior. The default is set to piMOM prior. |
tau |
Hyperparameter |
r |
Hyperparameter |
a |
First parameter in the beta binomial prior. |
b |
Second parameter in the beta binomial prior. |
family |
Determines the type of data analysis. |
It returns the unnormalized probability for the selected model.
Amir Nikooienejad
Nikooienejad, A., Wang, W., and Johnson, V. E. (2016). Bayesian
variable selection for binary outcomes in high dimensional genomic studies
using nonlocal priors. Bioinformatics, 32(9), 1338-1345.
Nikooienejad, A., Wang, W., & Johnson, V. E. (2020). Bayesian variable
selection for survival data using inverse moment priors. Annals of Applied
Statistics, 14(2), 809-828.
### Simulating Logistic Regression Data n <- 400 p <- 1000 set.seed(123) Sigma <- diag(p) full <- matrix(c(rep(0.5, p*p)), ncol=p) Sigma <- full + 0.5*Sigma cholS <- chol(Sigma) Beta <- c(1,1.8,2.5) X = matrix(rnorm(n*p), ncol=p) X = X%*%cholS beta <- numeric(p) beta[c(1:length(Beta))] <- Beta XB <- X%*%beta probs <- as.vector(exp(XB)/(1+exp(XB))) y <- rbinom(n,1,probs) ### Calling the function for a subset of the true model, with an arbitrary ### parameters for prior densities mod <- c(1:3) Mprob <- ModProb(X, y, mod, tau = 0.7, r = 1, a = 7, b = 993, family = "logistic") Mprob
### Simulating Logistic Regression Data n <- 400 p <- 1000 set.seed(123) Sigma <- diag(p) full <- matrix(c(rep(0.5, p*p)), ncol=p) Sigma <- full + 0.5*Sigma cholS <- chol(Sigma) Beta <- c(1,1.8,2.5) X = matrix(rnorm(n*p), ncol=p) X = X%*%cholS beta <- numeric(p) beta[c(1:length(Beta))] <- Beta XB <- X%*%beta probs <- as.vector(exp(XB)/(1+exp(XB))) y <- rbinom(n,1,probs) ### Calling the function for a subset of the true model, with an arbitrary ### parameters for prior densities mod <- c(1:3) Mprob <- ModProb(X, y, mod, tau = 0.7, r = 1, a = 7, b = 993, family = "logistic") Mprob
This function is used for predictive accuracy measurement of
the selected models using Bayesian Model Averaging methodology. The Occam's
window with cut out threshold of thr
is used. That means only models
that have posterior probability of at least thr
* posteior
probability of the highest posterior probability model are considered in
model averaging. For survival time response datasets, the predictive Area
Under Curve (AUC) at each given time point is computed as the output. In
this case, the predictive AUC is obtained using Uno's method for
observations in the test set. For binary outcome data, only one AUC is
reported which is from the ROC computed on the test set. The training set is
used to find the selected model and relevant probabilities.
predBMA( bvsobj, X, resp, prep, logT, nlptype = "piMOM", train_idx, test_idx, thr = 0.05, times = NULL, family = c("logistic", "survival") )
predBMA( bvsobj, X, resp, prep, logT, nlptype = "piMOM", train_idx, test_idx, thr = 0.05, times = NULL, family = c("logistic", "survival") )
bvsobj |
An object that is generated by |
X |
The same |
resp |
For logistic regression models, this variable is the binary
response vector. For the Cox proportional hazard models this is a two column
matrix where the first column contains survival times and the second column
is the censoring status for each observation. Note that for survival times,
the time section of this variable should be in the same scale and unit
(year, days, etc.) as |
prep |
A boolean variable determining if the preprocessing step has
been done on the original data frame in |
logT |
A boolean variable determining if log transform should has been
done on continuous columns of the data frame in |
nlptype |
Determines the type of nonlocal prior that is used in the analyses. It can be "piMOM" for product inverse moment prior, or "pMOM" for product moment prior. The default is set to piMOM prior. |
train_idx |
An integer vector containing the indices of the training set. |
test_idx |
An integer vector containing the indices of the test set. The set of observations that prediction will be performed on. |
thr |
The threshold used for Occam's window as explained in the description. The default value for this variable is 0.05. |
times |
A vector of times at which predictive AUC is to be computed. This input is only used for prediction in survival data analysis. |
family |
Determines the type of data analysis. |
The output is different based on the family for the anlysis of data
1) family = logistic
The output is a list with the two following objects:
auc |
This is the area under the ROC curve after Bayesian model averaging is used to obtain ROC for the test data. |
roc_curve |
This is a two column matrix representing points on the ROC curve and can be used to plot the curve. The first column is FPR and the second column is TPR which represent x-axis and y-axis in the ROC curve, respectively. |
2) family = survival
auc |
A vector with the same length as |
Amir Nikooienejad
Raftery, A. E., Madigan, D., & Hoeting, J. A. (1997). Bayesian
model averaging for linear regression models. Journal of the American
Statistical Association, 92(437), 179-191.
Nikooienejad, A., Wang, W., & Johnson, V. E. (2020). Bayesian variable
selection for survival data using inverse moment priors. Annals of Applied
Statistics, 14(2), 809-828.
Uno, H., Cai, T., Tian, L., & Wei, L. J. (2007). Evaluating
prediction rules for t-year survivors with censored regression models.
Journal of the American Statistical Association, 102(478), 527-537.
### Simulating Logistic Regression Data n <- 200 p <- 40 set.seed(123) Sigma <- diag(p) full <- matrix(c(rep(0.5, p*p)), ncol=p) Sigma <- full + 0.5*Sigma cholS <- chol(Sigma) Beta <- c(-1.7,1.8,2.5) X <- matrix(rnorm(n*p), ncol=p) X <- X%*%cholS colnames(X) <- c(paste("gene_",c(1:p),sep="")) beta <- numeric(p) beta[c(1:length(Beta))] <- Beta Xout <- PreProcess(X) X <- Xout$X XB <- X%*%beta probs <- as.vector(exp(XB)/(1+exp(XB))) y <- rbinom(n,1,probs) X <- as.data.frame(X) train_idx <- sample(1:n,0.8*n) test_idx <- setdiff(1:n,train_idx) X_train <- X[train_idx,] y_train <- y[train_idx] bout <- bvs(X_train, y_train, prep=FALSE, family = "logistic", mod_prior = "beta",niter = 50) BMAout <- predBMA(bout, X, y, prep = FALSE, logT = FALSE, train_idx = train_idx, test_idx = test_idx, family="logistic") ### AUC for the prediction: BMAout$auc ### Plotting ROC Curve roc <- BMAout$roc_curve plot(roc,lwd=2,type='l',col='blue')
### Simulating Logistic Regression Data n <- 200 p <- 40 set.seed(123) Sigma <- diag(p) full <- matrix(c(rep(0.5, p*p)), ncol=p) Sigma <- full + 0.5*Sigma cholS <- chol(Sigma) Beta <- c(-1.7,1.8,2.5) X <- matrix(rnorm(n*p), ncol=p) X <- X%*%cholS colnames(X) <- c(paste("gene_",c(1:p),sep="")) beta <- numeric(p) beta[c(1:length(Beta))] <- Beta Xout <- PreProcess(X) X <- Xout$X XB <- X%*%beta probs <- as.vector(exp(XB)/(1+exp(XB))) y <- rbinom(n,1,probs) X <- as.data.frame(X) train_idx <- sample(1:n,0.8*n) test_idx <- setdiff(1:n,train_idx) X_train <- X[train_idx,] y_train <- y[train_idx] bout <- bvs(X_train, y_train, prep=FALSE, family = "logistic", mod_prior = "beta",niter = 50) BMAout <- predBMA(bout, X, y, prep = FALSE, logT = FALSE, train_idx = train_idx, test_idx = test_idx, family="logistic") ### AUC for the prediction: BMAout$auc ### Plotting ROC Curve roc <- BMAout$roc_curve plot(roc,lwd=2,type='l',col='blue')
This function preprocesses the design matrix by removing
columns that contain NA
's or are all zero. It also standardizes
non-binary columns to have mean zero and variance one. The user has the
choice of log transforming continuous covariates before scaling them.
PreProcess(X, logT = FALSE)
PreProcess(X, logT = FALSE)
X |
The |
logT |
A boolean variable determining if log transform should be done on continuous columns before scaling them. Note that those columns should not contain any zeros or negative values. |
It returns a list having the following objects:
X |
The filtered design matrix which can be used in variable selection procedure. Binary columns are moved to the end of the design matrix. |
gnames |
Gene names read from the column names of the filtered design matrix. |
Amir Nikooienejad
### Constructing a synthetic design matrix for the purpose of preprocessing ### imposing columns with different scales n <- 40 p1 <- 50 p2 <- 150 p <- p1 + p2 X1 <- matrix(rnorm(n*p1, 1, 2), ncol = p1) X2 <- matrix(rnorm(n*p2), ncol = p2) X <- cbind(X1, X2) ### putting NA elements in the matrix X[3,85] <- NA X[25,85] <- NA X[35,43] <- NA X[15,128] <- NA colnames(X) <- paste("gene_",c(1:p),sep="") ### Running the function. Note the intercept column that is added as the ### first column in the "logistic" family Xout <- PreProcess(X) dim(Xout$X)[2] == (p + 1) ## 1 is added because intercept column is included ## This is FALSE because of the removal of columns with NA elements
### Constructing a synthetic design matrix for the purpose of preprocessing ### imposing columns with different scales n <- 40 p1 <- 50 p2 <- 150 p <- p1 + p2 X1 <- matrix(rnorm(n*p1, 1, 2), ncol = p1) X2 <- matrix(rnorm(n*p2), ncol = p2) X <- cbind(X1, X2) ### putting NA elements in the matrix X[3,85] <- NA X[25,85] <- NA X[35,43] <- NA X[15,128] <- NA colnames(X) <- paste("gene_",c(1:p),sep="") ### Running the function. Note the intercept column that is added as the ### first column in the "logistic" family Xout <- PreProcess(X) dim(Xout$X)[2] == (p + 1) ## 1 is added because intercept column is included ## This is FALSE because of the removal of columns with NA elements