Title: | Bias-Corrected Bayesian Classification with Selected Features |
---|---|
Description: | Fully Bayesian Classification with a subset of high-dimensional features, such as expression levels of genes. The data are modeled with a hierarchical Bayesian models using heavy-tailed t distributions as priors. When a large number of features are available, one may like to select only a subset of features to use, typically those features strongly correlated with the response in training cases. Such a feature selection procedure is however invalid since the relationship between the response and the features has be exaggerated by feature selection. This package provides a way to avoid this bias and yield better-calibrated predictions for future cases when one uses F-statistic to select features. |
Authors: | Longhai Li <[email protected]> |
Maintainer: | Longhai Li <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0-1 |
Built: | 2024-10-31 06:22:36 UTC |
Source: | CRAN |
These examples demonstrate how to use BCBCSF package. They use
all prior and Markov chain sampling settings by default (except
no_rmc
as noted below). The methods for setting others can be found
from documents for specific functions. However, the default settings may
work well for a wide range of gene expression data.
Li, L. (2012), Bias-corrected Hierarchical Bayesian Classification with a Selected Subset of High-dimensional Features, Journal of American Statistical Association,107:497,120-134
bcbcsf_fitpred
, bcbcsf_pred
,
cross_vld
, eval_pred
,
reload_fit_bcbcsf
, bcbcsf_sumfit
,
bcbcsf_plotsumfit
##\dontrun{ ## load lymphoma microarray data data (lymphoma) ## select some cases as testing data set ts <- c (sort(sample (1:42,5)), 43:44, 61:62) ## training data X_tr <- lymph.X[-ts,] y_tr <- lymph.y[-ts] ## test data X_ts <- lymph.X[ts,] y_ts <- lymph.y[ts] ########################################################################## ######################## training and prediction ######################### ########################################################################## ## fitting training data with top features selected by F-statistic out_fit <- bcbcsf_fitpred (X_tr = X_tr, y_tr = y_tr, nos_fsel = c(20, 50), no_rmc = 100) ## note 1: if 'X_ts' is given above, prediction is made after fitting ## note 2: no_rmc = 100 is too small, omit it and use the default ## predicting class labels of test cases out_pred <- bcbcsf_pred (X_ts = X_ts, out_fit = out_fit) ## evaluate prediction given true labels eval_pred (out_pred = out_pred, y_ts = y_ts) ########################################################################## ####################### visualizing prediction results ################### ########################################################################## ## reload one bcbcsf fit result from hardrive fit_bcbcsf <- reload_fit_bcbcsf (out_fit$fitfiles[1]) ## the fitting result for no_fsel = 50 can be retrieved directly from ## out_fit: fit_bcbcsf_fsel50 <- out_fit$fit_bcbcsf ## summarize the fitting result sum_fit <- bcbcsf_sumfit (fit_bcbcsf) ## visualize fitting result bcbcsf_plotsumfit (sum_fit) ########################################################################## ############################ cross validation ############################ ########################################################################## ## doing cross validation with bcbcsf_fitpred on lymphoma data cv_pred <- cross_vld ( ##################### classifier, data, and fold ################### fitpred_func = bcbcsf_fitpred, X = lymph.X, y = lymph.y, nfold = 2, ################ all other arguments passed classifier ############ nos_fsel = c(20,50), no_rmc = 100 ) ## note: no_rmc = 100 is too small, omit it and use the default in practice ## evaluate prediction given true labels eval_pred (out_pred = cv_pred, y_ts = lymph.y) ## warning: this function is slow if nfold is large; if you have a ## computer cluster, you better parallel the cross validation folds. ##}
##\dontrun{ ## load lymphoma microarray data data (lymphoma) ## select some cases as testing data set ts <- c (sort(sample (1:42,5)), 43:44, 61:62) ## training data X_tr <- lymph.X[-ts,] y_tr <- lymph.y[-ts] ## test data X_ts <- lymph.X[ts,] y_ts <- lymph.y[ts] ########################################################################## ######################## training and prediction ######################### ########################################################################## ## fitting training data with top features selected by F-statistic out_fit <- bcbcsf_fitpred (X_tr = X_tr, y_tr = y_tr, nos_fsel = c(20, 50), no_rmc = 100) ## note 1: if 'X_ts' is given above, prediction is made after fitting ## note 2: no_rmc = 100 is too small, omit it and use the default ## predicting class labels of test cases out_pred <- bcbcsf_pred (X_ts = X_ts, out_fit = out_fit) ## evaluate prediction given true labels eval_pred (out_pred = out_pred, y_ts = y_ts) ########################################################################## ####################### visualizing prediction results ################### ########################################################################## ## reload one bcbcsf fit result from hardrive fit_bcbcsf <- reload_fit_bcbcsf (out_fit$fitfiles[1]) ## the fitting result for no_fsel = 50 can be retrieved directly from ## out_fit: fit_bcbcsf_fsel50 <- out_fit$fit_bcbcsf ## summarize the fitting result sum_fit <- bcbcsf_sumfit (fit_bcbcsf) ## visualize fitting result bcbcsf_plotsumfit (sum_fit) ########################################################################## ############################ cross validation ############################ ########################################################################## ## doing cross validation with bcbcsf_fitpred on lymphoma data cv_pred <- cross_vld ( ##################### classifier, data, and fold ################### fitpred_func = bcbcsf_fitpred, X = lymph.X, y = lymph.y, nfold = 2, ################ all other arguments passed classifier ############ nos_fsel = c(20,50), no_rmc = 100 ) ## note: no_rmc = 100 is too small, omit it and use the default in practice ## evaluate prediction given true labels eval_pred (out_pred = cv_pred, y_ts = lymph.y) ## warning: this function is slow if nfold is large; if you have a ## computer cluster, you better parallel the cross validation folds. ##}
bcbcsf_fitpred
trains models with Gibbs sampling for each number of retained features. The results are saved in files. This function also makes predictions for test cases if they are provided.
bcbcsf_pred
uses the posterior samples saved by bcbcsf_fitpred
to predict the class labels of test cases. Prediction results are an array of predictive probabilities array_probs_pred
, whose rows for test cases, columns for classes, and the 3rd dimension for different numbers of retained features.
cross_vld
uses cross-validation to obtain predictive probabilities for all cases of a data set. This generic function can be used with bcbcsf_fitpred
and other classifiers.
bcbcsf_fitpred ( ## arguments specifying info of data sets X_tr, y_tr, nos_fsel = ncol (X_tr), X_ts = NULL, standardize = FALSE, rankf = FALSE, ## arguments for prediction burn = NULL, thin = 1, offset_sdxj = 0.5, ## arguments for Markov chain sampling no_rmc = 1000, no_imc = 5, no_mhwmux = 10, fit_bcbcsf_filepre = ".fitbcbcsf_", ## arguments specifying priors for parameters and hyerparameters w0_mu = 0.05, alpha0_mu = 0.5, alpha1_mu = 3, w0_x = 1.00, alpha0_x = 0.5, alpha1_x = 10, w0_nu = 0.05, alpha0_nu = 0.5, prior_psi = NULL, ## arguments for metropolis sampling for wmu, wx stepadj_mhwmux = 1, diag_mhwmux = FALSE, ## arguments for computing adjustment factor bcor = 1, cut_qf = exp (-10), cut_dpoi = exp (-10), nos_sim = 1000, ## whether look at progress monitor = TRUE) bcbcsf_pred (X_ts, out_fit, burn = NULL, thin = 1, offset_sdxj = 0.5) cross_vld (X, y, nfold = 10, folds = NULL, fitpred_func = bcbcsf_fitpred, ...)
bcbcsf_fitpred ( ## arguments specifying info of data sets X_tr, y_tr, nos_fsel = ncol (X_tr), X_ts = NULL, standardize = FALSE, rankf = FALSE, ## arguments for prediction burn = NULL, thin = 1, offset_sdxj = 0.5, ## arguments for Markov chain sampling no_rmc = 1000, no_imc = 5, no_mhwmux = 10, fit_bcbcsf_filepre = ".fitbcbcsf_", ## arguments specifying priors for parameters and hyerparameters w0_mu = 0.05, alpha0_mu = 0.5, alpha1_mu = 3, w0_x = 1.00, alpha0_x = 0.5, alpha1_x = 10, w0_nu = 0.05, alpha0_nu = 0.5, prior_psi = NULL, ## arguments for metropolis sampling for wmu, wx stepadj_mhwmux = 1, diag_mhwmux = FALSE, ## arguments for computing adjustment factor bcor = 1, cut_qf = exp (-10), cut_dpoi = exp (-10), nos_sim = 1000, ## whether look at progress monitor = TRUE) bcbcsf_pred (X_ts, out_fit, burn = NULL, thin = 1, offset_sdxj = 0.5) cross_vld (X, y, nfold = 10, folds = NULL, fitpred_func = bcbcsf_fitpred, ...)
X_tr , X_ts , X
|
matrices containing gene expression data; rows should be for the cases, and columns for different genes; |
y_tr , y
|
class labels in training or test data set, or just a data set. |
nos_fsel |
a vector of numbers of features to be retained. |
burn , thin
|
|
offset_sdxj |
a value between 0 and 1; 100* |
no_rmc , no_imc
|
|
fit_bcbcsf_filepre |
a string added to the names of files saving Markov chain fitting results; the actual file names contain also the data dimension and number of retained features; when |
w0_mu , alpha0_mu , alpha1_mu , w0_x , alpha0_x , alpha1_x , w0_nu , alpha0_nu
|
settings of priors for means and variances of genes; they are denoted by |
prior_psi |
a vector of length the number of classes, specifying the Dirichlet prior distribution for probabilities of classes; it is denoted by |
no_mhwmux , stepadj_mhwmux , diag_mhwmux
|
arguments specifying Metropolis sampling for |
bcor |
taking value 0 or 1, indicating whether bias-correction is to be applied. |
cut_qf , cut_dpoi , nos_sim
|
arguments specifying approximation of adjustment factor; |
nfold , folds
|
|
out_fit |
a list returned by |
standardize |
if it is set to TRUE, the original gene expression values are centralized and divided by the pooled standard deviation; by default, it is FALSE. |
rankf |
if it is set to TRUE, the original features will be re-ordered by F-statistic; by default, it is FALSE. |
monitor |
if it is set to TRUE, progress of fitting is shown on screen |
fitpred_func |
an R function that can fit with training data, and predict for test data; the arguments of |
... |
arguments passed to classifier |
nos_fsel |
a vector of numbers of features retained. |
fitfiles |
a string vector of length |
array_probs_pred |
an array of predictive probabilities, whose rows for test cases, columns for classes, and the 3rd dimension for different numbers of retained features. |
fit_bcbcsf |
a list of Markov chain sampling results from the fitting with number of retained features equal to the last number in |
This function is used to find error rate, amlp, loss and predictive probabilities at true labels.
eval_pred (out_pred, y_ts, Mloss = NULL)
eval_pred (out_pred, y_ts, Mloss = NULL)
out_pred |
a list returned by function |
y_ts |
a vector of true class labels. |
Mloss |
a matrix indicting loss function, with element |
; by default, it is NULL.
probs_at_truelabels |
a matrix of predictive probabilities at true labels, with rows for cases, and columns for different numbers of retained features |
summary |
a data frame, with rows for different numbers of retained features, and columns: |
These functions are used to look at the fitting results, especially plot the gene signals.
reload_fit_bcbcsf (fit_bcbcsf_afile) bcbcsf_sumfit (fit_bcbcsf = NULL, fit_bcbcsf_afile = NULL, burn = NULL, thin = 1) bcbcsf_plotsumfit (sum_fit)
reload_fit_bcbcsf (fit_bcbcsf_afile) bcbcsf_sumfit (fit_bcbcsf = NULL, fit_bcbcsf_afile = NULL, burn = NULL, thin = 1) bcbcsf_plotsumfit (sum_fit)
fit_bcbcsf_afile |
a string of name of a file saving a Markov chain fitting result; it can be found from the value |
fit_bcbcsf |
a list of Markov chain fitting result, returned by function |
burn , thin
|
|
sum_fit |
a list returned by function |
reload_fit_bcbcsf
returns a list of Markov chain fitting results, including how to do feature selection and data preprocessing.
bcbcsf_sumfit
returns a list of point estimates of means and variances.
bcbcsf_plotsumfit
returns nothing; it plots the normalized means (for each gene, original expression means substracted by their means and divided by the common standard deviation), and overall signals (Euclid distance of normalized means) for the selected features.
This is one of the microarray data sets used to demonstrate BCBCSF in the reference article. Information about this data set can be found from the reference.
data(lymphoma)
data(lymphoma)
lymph.X |
a matrix of gene expression data for p = 4026 genes on n = 62 cases in G=3 classes |
lymph.y |
a vector of class labels coded by 1,2,3. |