Title: | Latent Effect Adjustment After Primary Projection |
---|---|
Description: | These functions take a gene expression value matrix, a primary covariate vector, an additional known covariates matrix. A two stage analysis is applied to counter the effects of latent variables on the rankings of hypotheses. The estimation and adjustment of latent effects are proposed by Sun, Zhang and Owen (2011). "leapp" is developed in the context of microarray experiments, but may be used as a general tool for high throughput data sets where dependence may be involved. |
Authors: | Yunting Sun <[email protected]> , Nancy R.Zhang <[email protected]>, Art B.Owen <[email protected]> |
Maintainer: | Yunting Sun <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.3 |
Built: | 2024-12-04 07:32:25 UTC |
Source: | CRAN |
These functions take a gene expression value matrix, a primary covariate vector, an additional known covariates matrix. A two stage analysis is applied to counter the effects of latent variables on the rankings of hypotheses. The estimation and adjustment of latent effects are proposed by Sun, Zhang and Owen (2011). "leapp" is developed in the context of microarray experiments, but may be used as a general tool for high throughput data sets where dependence may be involved.
Package: | leapp |
Type: | Package |
Version: | 1.1 |
Date: | 2013-01-05 |
License: | What license is it under? |
LazyLoad: | yes |
Maintainer: Yunting Sun <[email protected]>
Sun, Zhang and Owen (2011), "Multiple hypothesis testing, adjusting for latent variables"
## Not run: library(sva) library(MASS) library(leapp) data(simdat) model <- cbind(rep(1,60),simdat$g) model0 <- cbind(rep(1,60)) p.raw <- f.pvalue(simdat$data,model,model0) p.oracle <-f.pvalue(simdat$data - simdat$u p.leapp <- leapp(simdat$data,pred.prim = simdat$g)$p p = cbind(p.raw,p.oracle, p.leapp) topk = seq(0,0.5,length.out = 50)*1000 null.set = which(simdat$gamma !=0) fpr= apply(p,2,FindFpr,null.set,topk) tpr= apply(p,2,FindTpr,null.set,topk) ROCplot(fpr,tpr, main = "ROC Comparison", name.method = c("raw","oracle","leapp"), save = FALSE ) ## End(Not run)
## Not run: library(sva) library(MASS) library(leapp) data(simdat) model <- cbind(rep(1,60),simdat$g) model0 <- cbind(rep(1,60)) p.raw <- f.pvalue(simdat$data,model,model0) p.oracle <-f.pvalue(simdat$data - simdat$u p.leapp <- leapp(simdat$data,pred.prim = simdat$g)$p p = cbind(p.raw,p.oracle, p.leapp) topk = seq(0,0.5,length.out = 50)*1000 null.set = which(simdat$gamma !=0) fpr= apply(p,2,FindFpr,null.set,topk) tpr= apply(p,2,FindTpr,null.set,topk) ROCplot(fpr,tpr, main = "ROC Comparison", name.method = c("raw","oracle","leapp"), save = FALSE ) ## End(Not run)
The algorithm alternates between 1) computing latent loadings u and latent variable v and 2) estimating noise standard deviation for each of the N genes.
AlternateSVD(x, r, pred = NULL, max.iter = 10, TOL = 1e-04)
AlternateSVD(x, r, pred = NULL, max.iter = 10, TOL = 1e-04)
x |
an N by n data matrix |
r |
a numeric, number of latent factors to estimate |
pred |
an n by s matrix, each column is a vector of known covariates for n samples, s covariates are considered, default to |
max.iter |
a numeric, maximum number of iteration allowed, default to 10 |
TOL |
a numeric, tolerance level for the algorithm to converge, default to 1e-04 |
sigma |
a vector of length N, noise standard deviations for N genes |
coef |
an N by s matrix, estimated coefficients for known covariates |
uest |
an N by r matrix, estimated latent loadings |
vest |
an n by r matrix, estiamted latent factors |
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]
Given a vector of p values for m genes and a set of null genes, compute the area under ROC curve using the Wilcoxin statistics
FindAUC(pvalue, ind)
FindAUC(pvalue, ind)
pvalue |
A vector of p values, one for each gene, with length m |
ind |
A vector of indices that the corresponding gene are true positive |
auc |
A numeric, area under the ROC curve for the given gene list |
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]
Given a vector of sizes of retrieved genes, for each size k, select the top k genes with smallest p values and compute the false positive rate from the retrieved genes and the true positive genes.
FindFpr(pvalue, ind,topk)
FindFpr(pvalue, ind,topk)
pvalue |
A vector of p values, one for each gene, with length m |
ind |
A vector of indices that the corresponding gene are true positive |
topk |
A vector of integers ranging from 1 to m , length of retrieved gene list |
fpr |
A vector of false positive rates at given sizes of retrieval. |
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]
Given a vector of sizes of retrieved genes, for each size k, select the top k genes with smallest p values and compute the precision from the retrieved genes and the true positive genes.
FindPrec(pvalue, ind,topk)
FindPrec(pvalue, ind,topk)
pvalue |
A vector of p values, one for each gene, with length m |
ind |
A vector of indices that the corresponding gene are true positive |
topk |
A vector of integers ranging from 1 to m , length of retrieved gene list |
prec |
A vector of precisions at given sizes of retrieval. |
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]
Given a vector of sizes of retrieved genes, for each size k, select the top k genes with smallest p values and compute the recall from the retrieved genes and the true positive genes.
FindRec(pvalue, ind, topk)
FindRec(pvalue, ind, topk)
pvalue |
A vector of p values, one for each gene, with length m |
ind |
A vector of indices that the corresponding gene are true positive |
topk |
A vector of integers ranging from 1 to m , length of retrieved gene list |
rec |
A vector of precisions at given sizes of retrieval. |
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]
Given a vector of sizes of retrieved genes, for each size k, select the top k genes with smallest p values and compute the true positive rate from the retrieved genes and the true positive genes.
FindTpr(pvalue, ind,topk)
FindTpr(pvalue, ind,topk)
pvalue |
A vector of p values, one for each gene, with length m |
ind |
A vector of indices that the corresponding gene are true positive |
topk |
A vector of integers ranging from 1 to m , length of retrieved gene list |
tpr |
A vector of True positive rates at given sizes of retrieval. |
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]
Outlier detection and robust regression through an iterative penalized regression with tuning parameter chosen by modified BIC
IPOD(X, Y, H, method = "hard", TOL = 1e-04, length.out = 50)
IPOD(X, Y, H, method = "hard", TOL = 1e-04, length.out = 50)
X |
an N by k design matrix |
Y |
an N by 1 response |
H |
an N by N projection matrix |
method |
a string, if method = "hard", hard thresholding is applied; if method = "soft", soft thresholding is applied |
TOL |
relative iterative converence tolerance, default to 1e-04 |
length.out |
A numeric, number of candidate tuning parameter lambda under consideration for further modified BIC model selection, default to 50. |
If there is no predictors, set X = NULL
.
Y = X beta + gamma + sigma epsilon
Y is N by 1 reponse vector, X is N by k design matrix, beta is k by 1 coefficients, gamma is N by 1 outlier indicator, sigma is a scalar and the noise standard deviation and epsilon is N by 1 vector with components independently distributed as standard normal N(0,1).
gamma |
a vector of length N, estimated outlier indicator gamma |
resOpt.scale |
a vector of length N, test statistics for each of the N genes |
p |
a vector of length N, p-values for each of the N genes |
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]
Y = X beta + gamma + sigma epsilon estimate k by 1 coefficients vector beta and N by 1 outlier indicator vector gamma from (Y,X).
IPODFUN(X, Y, H, sigma, betaInit, method = "hard", TOL = 1e-04)
IPODFUN(X, Y, H, sigma, betaInit, method = "hard", TOL = 1e-04)
X |
an N by k design matrix |
Y |
an N by 1 response vector |
H |
an N by N projection matrix X(X'X)^-1X' |
sigma |
a numeric, noise standard deviation |
betaInit |
a k by 1 initial value for coeffient beta |
method |
a string, if "hard", conduct hard thresholding, if "soft", conduct soft thresholding, default to "hard" |
TOL |
a numeric, tolerance of convergence, default to 1e-04 |
The initial estimator for the coefficient beta can be chosen to be the estimator from a robust linear regression
gamma |
an N by 1 vector of estimated outlier indicator |
ress |
an N by 1 vector of residual Y - X beta - gamma |
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]
She, Y. and Owen, A.B. "Outlier detection using nonconvex penalized regression" 2010
Adjust for latent factors and conduct multiple hypotheses testing from gene expression data using the algorithm of Sun,Zhang and Owen (2011). Number of latent factors can be chosen by Buja and Eyuboglu (1992).
leapp (data,pred.prim,pred.covar, O = NULL, num.fac = "buja", method = "hard", sparse = TRUE, centered = FALSE, verbose = FALSE, perm.num = 50, TOL = 1e-4, length.out = 50)
leapp (data,pred.prim,pred.covar, O = NULL, num.fac = "buja", method = "hard", sparse = TRUE, centered = FALSE, verbose = FALSE, perm.num = 50, TOL = 1e-4, length.out = 50)
data |
An N genes by n arrays matrix of expression data |
pred.prim |
An n by 1 primary predictor |
pred.covar |
An n by s known covariate matrix not of primary interest |
O |
An n by n rotation matrix such that O pred.prim = (1, 0,...,0) |
num.fac |
A numeric or string, number of latent factors chosen. it has default value "buja" which uses Buja and Eyuboglu (1992) to pick the number of factors |
method |
A string which takes values in ("hard","soft"). "hard": hard thresholding in the IPOD algorithm; "soft": soft thresholding in the IPOD algorithm |
sparse |
A logical value, if |
centered |
A logical value, indicates whether the data has been centered at zero, default to |
verbose |
A logical value, if |
perm.num |
A numeric, number of permutation performed using algorithm of Buja and Eyuboglu (1992), default to 50 |
TOL |
A numeric, convergence tolerance level, default to 1e-4 |
length.out |
A numeric, number of candidate tuning parameter lambda under consideration for further modified BIC model selection, default to 50. |
The data for test i should be in the ith row of data. If the rotation matrix O is set to NULL, the function will compute one rotation from primary predictor pred.prim.
p |
A vector of p-values one for each row of data. |
vest |
An n by num.fac matrix, estimated latent factors |
uest |
An N by num.fac matrix, estimated latent loadings |
gamma |
An N by 1 vector, estimated primary effect |
sigma |
An N by 1 vector, estimated noise standard deviation one for each row of data |
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]
## Not run: ## Load data data(simdat) #Calculate the p-values p <- leapp(simdat$data,pred.prim = simdat$g,method = "hard")$p auc <- FindAUC(p, which(simdat$gamma!=0)) ## End(Not run)
## Not run: ## Load data data(simdat) #Calculate the p-values p <- leapp(simdat$data,pred.prim = simdat$g,method = "hard")$p auc <- FindAUC(p, which(simdat$gamma!=0)) ## End(Not run)
Calculate F-statistics, t-statistics and corresponding p-values given multiple regression models under the null and alternative hypotheses.
Pvalue(dat, mod, mod0)
Pvalue(dat, mod, mod0)
dat |
An N genes by n arrays matrix of expression data |
mod |
An n by (s+1) design matrix under the full model (alternative), the first column is the primary predictor, s>=0 and the rest of the columns are additional covariates |
mod0 |
An n by s design matrix under the null hypothesis, s>=0,
should be the same as the 2:(s+1) columns of mod. If s=0, please set mod0 = |
p |
An N by 1 vector of p-values one for each row of data. |
tstat |
An N by 1 vector of t-statistics for primary parameters. |
fstat |
An N by 1 vector of F-statistics for primary parameters. |
coef |
An N by (s+1) matrix of coefficients with respect to design matrix mod under the full model. |
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]
## Not run: data(simdat) n = ncol(simdat$data) mod = cbind(simdat$g, rep(1,n)) mod0 = cbind(rep(1,n)) result = Pvalue(dimdat$data,mod = mod, mod0 = mod0) ## End(Not run)
## Not run: data(simdat) n = ncol(simdat$data) mod = cbind(simdat$g, rep(1,n)) mod0 = cbind(rep(1,n)) result = Pvalue(dimdat$data,mod = mod, mod0 = mod0) ## End(Not run)
Outlier detection and robust regression with a ridge type penalty on the outlier indicator gamma. Allow non sparse outliers and require known noise standard deviation.
ridge(X, Y, H, sigma)
ridge(X, Y, H, sigma)
X |
an N by k design matrix |
Y |
an N by 1 response vector |
H |
an N by N projection matrix |
sigma |
a numeric, noise standard deviation |
p |
an N by 1 vector of p-values for each of the N genes |
gamma |
an N by 1 vector of estimated primary variable gamma |
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]
Input an p by d matrix, each column of which contains false positive rates(FPR) computed from each of the d methods and p significance levels and a matrix of corresponding true positive rates(TPR) at the same significance levels. Plot ROC curve for each of the d methods.
ROCplot(fpr,tpr,main, name.method, xlim = c(0,0.2),ylim = c(0.4,1), save = TRUE, name.file = NULL)
ROCplot(fpr,tpr,main, name.method, xlim = c(0,0.2),ylim = c(0.4,1), save = TRUE, name.file = NULL)
fpr |
A matrix of false positive rates for increasing sizes of retrieved significant genes |
tpr |
A vector of corresponding true positive rates at the same significance levels |
main |
a string, title of the figure |
name.method |
a string vector of length d containing names of the d methods |
xlim |
the range of the x axis(FPR), default to c(0,0.2) |
ylim |
the range of the y axis (TPR), default to c(0.4,1) |
save |
a logical value, if |
name.file |
a string giving the name of the png file to save the plot |
The order of the name.method should be the same as that in the fpr and tpr.
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]
## Not run: library(sva) library(MASS) library(leapp) data(simdat) model <- cbind(rep(1,60),simdat$g) model0 <- cbind(rep(1,60)) p.raw <- f.pvalue(simdat$data,model,model0) p.oracle <-f.pvalue(simdat$data - simdat$u p.leapp <- leapp(simdat$data,pred.prim = simdat$g, method = "hard")$p p = cbind(p.raw,p.oracle, p.leapp) topk = seq(0,0.5,length.out = 50)*1000 null.set = which(simdat$gamma !=0) fpr= apply(p,2,FindFpr,null.set,topk) tpr= apply(p,2,FindTpr,null.set,topk) ROCplot(fpr,tpr, main = "ROC Comparison", name.method = c("raw","oracle","leapp"), save = FALSE ) ## End(Not run)
## Not run: library(sva) library(MASS) library(leapp) data(simdat) model <- cbind(rep(1,60),simdat$g) model0 <- cbind(rep(1,60)) p.raw <- f.pvalue(simdat$data,model,model0) p.oracle <-f.pvalue(simdat$data - simdat$u p.leapp <- leapp(simdat$data,pred.prim = simdat$g, method = "hard")$p p = cbind(p.raw,p.oracle, p.leapp) topk = seq(0,0.5,length.out = 50)*1000 null.set = which(simdat$gamma !=0) fpr= apply(p,2,FindFpr,null.set,topk) tpr= apply(p,2,FindTpr,null.set,topk) ROCplot(fpr,tpr, main = "ROC Comparison", name.method = c("raw","oracle","leapp"), save = FALSE ) ## End(Not run)
This data set is a simulated gene expression matrix with N(0,1) background noise and affected by two variables. gene expression values of 1000 genes from 60 samples are simulated. First 30 samples are from case group and last 30 samples are from control group. The primary treatment variable g affects ten percent of the genes and the latent variable affects uniformly on the genes. The correlation between primary treatment variable g and the latent variable is 0.5 and the SNR = 1, SLR = 0.5. The variances of noise across genes are distributed as inverse gamma. Also included in the data are a vector of normalized primary variable g, a vector of primary parameter gamma,a vector of latent factor v, a vector of latent loadings u and a vector of noise standard deviation for all genes sigma.
data(simdat)
data(simdat)
A list of 6 components
data |
A 1000 x 60 gene expression value matrix with genes in rows and arrays in columns |
g |
A vector of length 60, primary variable |
gamma |
A vector of length 1000, primary parameter |
v |
A vector of length 60, latent variable |
u |
A vector of length 1000, latent loadings |
sigma |
A vector of length 1000, noise standard deviation for each of the 1000 genes |
Yunting Sun [email protected], Nancy R.Zhang [email protected], Art B.Owen [email protected]