Title: | Canonical Variate Regression |
---|---|
Description: | Perform canonical variate regression (CVR) for two sets of covariates and a univariate response, with regularization and weight parameters tuned by cross validation. |
Authors: | Chongliang Luo, Kun Chen. |
Maintainer: | Chongliang Luo <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.1 |
Built: | 2024-12-07 06:29:55 UTC |
Source: | CRAN |
Perform canonical variate regression (CVR) for two sets of covariates and a univariate response, with regularization and weight parameters tuned by cross validation.
Index of help topics:
CVR Fit canonical variate regression with tuning parameters selected by cross validation. CVR-package Canonical Variate Regression SimulateCVR Generate simulation data. SparseCCA Sparse canonical correlation analysis. alcohol Data sets for the alcohol dependence example cvrsolver Canonical Variate Regression. mouse Data sets for the mouse body weight example plot.CVR Plot a CVR object.
functions: cvrsolver
, SparseCCA
, SimulateCVR
, CVR
, plot.CVR
.
Chongliang Luo, Kun Chen.
Maintainer: Chongliang Luo <[email protected]>
Chongliang Luo, Jin Liu, Dipak D. Dey and Kun Chen (2016) Canonical variate regression. Accepted by Biostatistics, doi: 10.1093/biostatistics/kxw001.
Daniela M. Witten, Robert Tibshirani and Trevor Hastie (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515-534.
PMA.
A list of 3 data frames that contains the gene expression, DNA methylation and AUD (alcohol use disorder) of 46 human subjects. The data is already screened for quality control. For the raw data see the link below. For more details see the reference.
alcohol
alcohol
A list of 3 data frames:
Human gene expression. A data frame of 46 rows and 300 columns.
Human DNA methylation. A data frame of 46 rows and 500 columns.
Human AUD indicator. A data frame of 46 rows and 1 column. The first 23 subjects are AUDs and the others are matched controls.
Alcohol dependence: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49393.
Chongliang Luo, Jin Liu, Dipak D. Dey and Kun Chen (2016) Canonical variate regression. Biostatistics, doi: 10.1093/biostatistics/kxw001.
############## Alcohol dependence example ###################### data(alcohol) gene <- scale(as.matrix(alcohol$gene)) meth <- scale(as.matrix(alcohol$meth)) disorder <- as.matrix(alcohol$disorder) alcohol.X <- list(X1 = gene, X2 = meth) ## Not run: foldid <- c(rep(1:5, 4), c(3,4,5), rep(1:5, 4), c(1,2,5)) ## table(foldid, disorder) ## there maybe warnings due to the glm refitting with small sample size alcohol.cvr <- CVR(disorder, alcohol.X, rankseq = 2, etaseq = 0.02, family = "b", penalty = "L1", foldid = foldid ) plot(alcohol.cvr) plot(gene %*% alcohol.cvr$solution$W[[1]][, 1], meth %*% alcohol.cvr$solution$W[[2]][, 1]) cor(gene %*% alcohol.cvr$solution$W[[1]], meth %*% alcohol.cvr$solution$W[[2]]) ## End(Not run)
############## Alcohol dependence example ###################### data(alcohol) gene <- scale(as.matrix(alcohol$gene)) meth <- scale(as.matrix(alcohol$meth)) disorder <- as.matrix(alcohol$disorder) alcohol.X <- list(X1 = gene, X2 = meth) ## Not run: foldid <- c(rep(1:5, 4), c(3,4,5), rep(1:5, 4), c(1,2,5)) ## table(foldid, disorder) ## there maybe warnings due to the glm refitting with small sample size alcohol.cvr <- CVR(disorder, alcohol.X, rankseq = 2, etaseq = 0.02, family = "b", penalty = "L1", foldid = foldid ) plot(alcohol.cvr) plot(gene %*% alcohol.cvr$solution$W[[1]][, 1], meth %*% alcohol.cvr$solution$W[[2]][, 1]) cor(gene %*% alcohol.cvr$solution$W[[1]], meth %*% alcohol.cvr$solution$W[[2]]) ## End(Not run)
This function fits the solution path of canonical variate regression,
with tuning parameters selected by cross validation. The tuning parameters
include the rank, the and the
.
CVR(Y, Xlist, rankseq = 2, neta = 10, etaseq = NULL, nlam = 50, Lamseq = NULL, family = c("gaussian", "binomial", "poisson"), Wini = NULL, penalty = c("GL1", "L1"), nfold = 10, foldid = NULL, opts = list(), type.measure = NULL)
CVR(Y, Xlist, rankseq = 2, neta = 10, etaseq = NULL, nlam = 50, Lamseq = NULL, family = c("gaussian", "binomial", "poisson"), Wini = NULL, penalty = c("GL1", "L1"), nfold = 10, foldid = NULL, opts = list(), type.measure = NULL)
Y |
A univariate response variable. |
Xlist |
A list of two covariate matrices as in |
rankseq |
A sequence of candidate ranks. The default is a single value 2. |
neta |
Number of |
etaseq |
A sequence of length |
nlam |
Number of |
Lamseq |
A matrix of |
family |
Type of response as in |
Wini |
A list of initial loading W's. The default is from the SparseCCA solution. See |
penalty |
Type of penalty on loading matrices W's as in |
nfold |
Number of folds in cross validation. The default is 10. |
foldid |
Specifying training and testing sets in cross validation; random generated if not supplied.
It remains the same across different rank and |
opts |
A list of options for controlling the algorithm. The default of |
type.measure |
Type of measurement used in cross validation. |
In this function, the rank, and
are tuned by cross validation. CVR then is refitted with
all data using the selected tuning parameters. The
plot
function shows the tuning of ,
with selected rank and
.
An object with S3 class "CVR" containing the following components
cverror |
A matrix containing the CV errors. The number of rows is the length
of |
etahat |
Selected |
rankhat |
Selected rank. |
Lamhat |
Selected |
Alphapath |
An array containing the fitted paths of the intercept term |
Betapath |
An array containing the fitted paths of the regression coefficient |
W1path , W2path
|
Arrays containing the fitted paths of W1 and W2. |
foldid |
|
cvout |
Cross validation results using selected |
solution |
A list including the solutions of |
Chongliang Luo, Kun Chen.
Chongliang Luo, Jin Liu, Dipak D. Dey and Kun Chen (2016) Canonical variate regression. Biostatistics, doi: 10.1093/biostatistics/kxw001.
cvrsolver
, SparseCCA
, SimulateCVR
.
############## Gaussian response ###################### set.seed(42) mydata <- SimulateCVR(family = "g", n = 100, rank = 4, p1 = 50, p2 = 70, pnz = 10, beta = c(2, 1, 0, 0)) X1 <- mydata$X1; X2 <- mydata$X2 Xlist <- list(X1 = X1, X2 = X2); Y <- mydata$y ## fix rank = 4, tune eta and lambda ##out_cvr <- CVR(Y, Xlist, rankseq = 4, neta = 5, nlam = 25, ## family = "g", nfold = 5) ## out_cvr$solution$W[[1]]; ## out_cvr$solution$W[[2]]; ### uncomment to see plots ## plot.CVR(out_cvr) ## ## Distance of subspaces ##U <- mydata$U ##Pj <- function(U) U %*% solve(t(U) %*% U, t(U)) ##sum((Pj(U) - (Pj(X1 %*% out_cvr$sol$W[[1]]) + Pj(X2 %*% out_cvr$sol$W[[2]]))/2)^2) ## Precision/Recall rate ## the first 10 rows of the true W1 and W2 are set to be nonzero ##W12 <- rbind(out_cvr$sol$W[[1]], out_cvr$sol$W[[1]]) ##W12norm <- apply(W12, 1, function(a)sqrt(sum(a^2))) ##prec <- sum(W12norm[c(1:10, 51:60)] != 0)/sum(W12norm != 0); prec ##rec <- sum(W12norm[c(1:10, 51:60)] != 0)/20; rec ## sequential SparseCCA, compare the Distance of subspaces and Prec/Rec ##W12s <- SparseCCA(X1, X2, 4) ## Distance larger than CVR's ##sum((Pj(U) - (Pj(X1 %*% W12s$W1) + Pj(X2 %*% W12s$W2))/2)^2) ##W12snorm <- apply(rbind(W12s$W1, W12s$W2), 1, function(a)sqrt(sum(a^2))) ## compare Prec/Rec ##sum(W12snorm[c(1:10, 51:60)] != 0)/sum(W12snorm != 0); ##sum(W12snorm[c(1:10, 51:60)] != 0)/20; ############## binary response ######################## set.seed(12) mydata <- SimulateCVR(family = "binomial", n = 300, rank = 4, p1 = 50, p2 = 70, pnz = 10, beta = c(2, 1, 0, 0)) X1 <- mydata$X1; X2 <- mydata$X2 Xlist <- list(X1 = X1, X2 = X2); Y <- mydata$y ## out_cvr <- CVR(Y, Xlist, 4, neta = 5, nlam=25, family = "b", nfold = 5) ## out_cvr$sol$W[[1]]; ## out_cvr$sol$W[[2]]; ## plot.CVR(out_cvr) ############## Poisson response ###################### set.seed(34) mydata <- SimulateCVR(family = "p", n = 100, rank = 4, p1 = 50, p2 = 70, pnz = 10, beta = c(0.2, 0.1, 0, 0)) X1 <- mydata$X1; X2 <- mydata$X2 Xlist <- list(X1 = X1, X2 = X2); Y <- mydata$y ## etaseq <- 10^seq(-3, log10(0.95), len = 10) ## out_cvr <- CVR(Y, Xlist, 4, neta = 5, nlam = 25, family = "p", nfold = 5) ## out_cvr$sol$W[[1]]; ## out_cvr$sol$W[[2]]; ## plot.CVR(out_cvr)
############## Gaussian response ###################### set.seed(42) mydata <- SimulateCVR(family = "g", n = 100, rank = 4, p1 = 50, p2 = 70, pnz = 10, beta = c(2, 1, 0, 0)) X1 <- mydata$X1; X2 <- mydata$X2 Xlist <- list(X1 = X1, X2 = X2); Y <- mydata$y ## fix rank = 4, tune eta and lambda ##out_cvr <- CVR(Y, Xlist, rankseq = 4, neta = 5, nlam = 25, ## family = "g", nfold = 5) ## out_cvr$solution$W[[1]]; ## out_cvr$solution$W[[2]]; ### uncomment to see plots ## plot.CVR(out_cvr) ## ## Distance of subspaces ##U <- mydata$U ##Pj <- function(U) U %*% solve(t(U) %*% U, t(U)) ##sum((Pj(U) - (Pj(X1 %*% out_cvr$sol$W[[1]]) + Pj(X2 %*% out_cvr$sol$W[[2]]))/2)^2) ## Precision/Recall rate ## the first 10 rows of the true W1 and W2 are set to be nonzero ##W12 <- rbind(out_cvr$sol$W[[1]], out_cvr$sol$W[[1]]) ##W12norm <- apply(W12, 1, function(a)sqrt(sum(a^2))) ##prec <- sum(W12norm[c(1:10, 51:60)] != 0)/sum(W12norm != 0); prec ##rec <- sum(W12norm[c(1:10, 51:60)] != 0)/20; rec ## sequential SparseCCA, compare the Distance of subspaces and Prec/Rec ##W12s <- SparseCCA(X1, X2, 4) ## Distance larger than CVR's ##sum((Pj(U) - (Pj(X1 %*% W12s$W1) + Pj(X2 %*% W12s$W2))/2)^2) ##W12snorm <- apply(rbind(W12s$W1, W12s$W2), 1, function(a)sqrt(sum(a^2))) ## compare Prec/Rec ##sum(W12snorm[c(1:10, 51:60)] != 0)/sum(W12snorm != 0); ##sum(W12snorm[c(1:10, 51:60)] != 0)/20; ############## binary response ######################## set.seed(12) mydata <- SimulateCVR(family = "binomial", n = 300, rank = 4, p1 = 50, p2 = 70, pnz = 10, beta = c(2, 1, 0, 0)) X1 <- mydata$X1; X2 <- mydata$X2 Xlist <- list(X1 = X1, X2 = X2); Y <- mydata$y ## out_cvr <- CVR(Y, Xlist, 4, neta = 5, nlam=25, family = "b", nfold = 5) ## out_cvr$sol$W[[1]]; ## out_cvr$sol$W[[2]]; ## plot.CVR(out_cvr) ############## Poisson response ###################### set.seed(34) mydata <- SimulateCVR(family = "p", n = 100, rank = 4, p1 = 50, p2 = 70, pnz = 10, beta = c(0.2, 0.1, 0, 0)) X1 <- mydata$X1; X2 <- mydata$X2 Xlist <- list(X1 = X1, X2 = X2); Y <- mydata$y ## etaseq <- 10^seq(-3, log10(0.95), len = 10) ## out_cvr <- CVR(Y, Xlist, 4, neta = 5, nlam = 25, family = "p", nfold = 5) ## out_cvr$sol$W[[1]]; ## out_cvr$sol$W[[2]]; ## plot.CVR(out_cvr)
Perform canonical variate regression with a set of fixed tuning parameters.
cvrsolver(Y, Xlist, rank, eta, Lam, family, Wini, penalty, opts)
cvrsolver(Y, Xlist, rank, eta, Lam, family, Wini, penalty, opts)
Y |
A response matrix. The response can be continuous, binary or Poisson. |
Xlist |
A list of covariate matrices. Cannot contain missing values. |
rank |
Number of pairs of canonical variates. |
eta |
Weight parameter between 0 and 1. |
Lam |
A vector of penalty parameters |
family |
Type of response. |
Wini |
A list of initial loading matrices W's. It must be provided. See |
penalty |
Type of penalty on W's. "GL1" for rowwise sparsity and "L1" for entrywise sparsity. |
opts |
A list of options for controlling the algorithm. Some of the options are:
|
CVR is used for extracting canonical variates and also predicting the response for multiple sets of covariates (Xlist = list(X1, X2)) and response (Y). The covariates can be, for instance, gene expression, SNPs or DNA methylation data. The response can be, for instance, quantitative measurement or binary phenotype. The criterion minimizes the objective function
s.t. for
.
are general loss functions with intercept
and coefficients
.
is the weight parameter and
are the regularization parameters.
is the rank, i.e. the number of canonical pairs.
By adjusting
, one can change the weight of the first correlation term and the second prediction term.
is reduced rank regression and
is sparse CCA (with orthogonal constrained W's). By choosing appropriate
one can induce sparsity of
's to select useful variables for predicting Y.
's with
's and (
) are iterated using an ADMM algorithm. See the reference for details.
An object containing the following components
iter |
The number of iterations the algorithm takes. |
W |
A list of fitted loading matrices. |
B |
A list of fitted |
Z |
A list of fitted |
alpha |
Fitted intercept term in the general loss term. |
beta |
Fitted regression coefficients in the general loss term. |
objvals |
A sequence of the objective values. |
Chongliang Luo, Kun Chen.
Chongliang Luo, Jin Liu, Dipak D. Dey and Kun Chen (2016) Canonical variate regression. Biostatistics, doi: 10.1093/biostatistics/kxw001.
## see SimulateCVR for simulation examples, see CVR for parameter tuning.
## see SimulateCVR for simulation examples, see CVR for parameter tuning.
A list of 3 data frames that contains the genotype, gene expression and body-mass index of 294 mice. The data is already screened for quality control. For the raw data see the links below. For more details see the reference.
mouse
mouse
A list of 3 data frames:
Mouse genotype. A data frame of 294 rows and 163 columns.
Mouse gene expression. A data frame of 294 rows and 215 columns.
Mouse body-mass index. A data frame of 294 rows and 1 column.
Mouse genotype: http://www.genetics.org/cgi/content/full/genetics.110.116087/DC1 Mouse gene expression: ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/SeriesMatrix/GSE2814/ Mouse body-mass index: http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/MouseWeight/.
Chongliang Luo, Jin Liu, Dipak D. Dey and Kun Chen (2016) Canonical variate regression. Biostatistics, doi: 10.1093/biostatistics/kxw001.
############## Mouse body weight example ###################### data(mouse) expr <- scale(as.matrix(mouse$expr)) geno <- scale(as.matrix(mouse$geno)) bmi <- as.matrix(mouse$bmi) mouse.X <- list(X1 = expr, X2 = geno) ## Not run: mouse.cvr <- CVR(bmi, mouse.X, rankseq = 2, etaseq = 0.04, family = "g", penalty = "L1") plot(mouse.cvr) plot(expr %*% mouse.cvr$solution$W[[1]][, 2], geno %*% mouse.cvr$solution$W[[2]][, 2]) cor(expr %*% mouse.cvr$solution$W[[1]], geno %*% mouse.cvr$solution$W[[2]]) ## End(Not run)
############## Mouse body weight example ###################### data(mouse) expr <- scale(as.matrix(mouse$expr)) geno <- scale(as.matrix(mouse$geno)) bmi <- as.matrix(mouse$bmi) mouse.X <- list(X1 = expr, X2 = geno) ## Not run: mouse.cvr <- CVR(bmi, mouse.X, rankseq = 2, etaseq = 0.04, family = "g", penalty = "L1") plot(mouse.cvr) plot(expr %*% mouse.cvr$solution$W[[1]][, 2], geno %*% mouse.cvr$solution$W[[2]][, 2]) cor(expr %*% mouse.cvr$solution$W[[1]], geno %*% mouse.cvr$solution$W[[2]]) ## End(Not run)
Plot the tuning of CVR
## S3 method for class 'CVR' plot(x, ...)
## S3 method for class 'CVR' plot(x, ...)
x |
A CVR object. |
... |
Other graphical parameters used in plot. |
The first plot is mean cv error vs log(). The type of mean cv error is
decided by
type.measure
(see parameters of CVR
). The selected is marked
by a vertical line in the plot. The second plot is sparsity vs log(
).
Sparsity is the proportion of non-zero elements in fitted W1 and W2.
The threshold is marked by a horizontal line.
Press ENTER to see the second plot, which shows the tuning of
.
Generate two sets of covariates and an univariate response driven by several latent factors.
SimulateCVR(family = c("gaussian", "binomial", "poisson"), n = 100, rank = 4, p1 = 50, p2 = 70, pnz = 10, sigmax = 0.2, sigmay = 0.5, beta = c(2, 1, 0, 0), standardization = TRUE)
SimulateCVR(family = c("gaussian", "binomial", "poisson"), n = 100, rank = 4, p1 = 50, p2 = 70, pnz = 10, sigmax = 0.2, sigmay = 0.5, beta = c(2, 1, 0, 0), standardization = TRUE)
family |
Type of response. |
n |
Number of rows. The default is 100. |
rank |
Number of latent factors generating the covariates. The default is 4. |
p1 |
Number of variables in X1. The default is 50. |
p2 |
Number of variables in X2. The default is 70. |
pnz |
Number of variables in X1 and X2 related to the signal. The default is 10. |
sigmax |
Standard deviation of normal noise in X1 and X2. The default is 0.2. |
sigmay |
Standard deviation of normal noise in Y. Only used when the response is Gaussian. The default is 0.5. |
beta |
Numeric vector, the coefficients used to generate respose from the latent factors. The default is c(2, 1, 0, 0). |
standardization |
Logical. If TRUE, standardize X1 and X2 before output. The default is TRUE. |
The latent factors in U are randomly generated normal vectors,
are N(0,1) noise matrices.
The nonzero entries of and
are generated from Uniform([-1,-0.5]U[0.5,1]).
For Gaussian response,
is N(0,1) noise vector,
for binary response,
,
and for Poisson response,
.
See the reference for more details.
X1 , X2
|
The two sets of covariates with dimensions n*p1 and n*p2 respectively. |
y |
The response vector with length n. |
U |
The true latent factor matrix with dimension n*rank. |
beta |
The coefficients used to generate response from |
V1 , V2
|
The true loading matrices for X1 and X2 with dimensions p1*rank and p2*rank. The first |
Chongliang Luo, Kun Chen.
Chongliang Luo, Jin Liu, Dipak D. Dey and Kun Chen (2016) Canonical variate regression. Biostatistics, doi: 10.1093/biostatistics/kxw001.
set.seed(42) mydata <- SimulateCVR(family = "g", n = 100, rank = 4, p1 = 50, p2 = 70, pnz = 10, beta = c(2, 1, 0, 0)) X1 <- mydata$X1 X2 <- mydata$X2 Xlist <- list(X1 = X1, X2 = X2); Y <- mydata$y opts <- list(standardization = FALSE, maxIters = 300, tol = 0.005) ## use sparse CCA solution as initial values, see SparseCCA() Wini <- SparseCCA(X1, X2, 4, 0.7, 0.7) ## perform CVR with fixed eta and lambda, see cvrsolver() fit <- cvrsolver(Y, Xlist, rank = 4, eta = 0.5, Lam = c(1, 1), family = "gaussian", Wini, penalty = "GL1", opts) ## check sparsity recovery fit$W[[1]]; fit$W[[2]]; ## check orthogonality X1W1 <- X1 %*% fit$W[[1]]; t(X1W1) %*% X1W1
set.seed(42) mydata <- SimulateCVR(family = "g", n = 100, rank = 4, p1 = 50, p2 = 70, pnz = 10, beta = c(2, 1, 0, 0)) X1 <- mydata$X1 X2 <- mydata$X2 Xlist <- list(X1 = X1, X2 = X2); Y <- mydata$y opts <- list(standardization = FALSE, maxIters = 300, tol = 0.005) ## use sparse CCA solution as initial values, see SparseCCA() Wini <- SparseCCA(X1, X2, 4, 0.7, 0.7) ## perform CVR with fixed eta and lambda, see cvrsolver() fit <- cvrsolver(Y, Xlist, rank = 4, eta = 0.5, Lam = c(1, 1), family = "gaussian", Wini, penalty = "GL1", opts) ## check sparsity recovery fit$W[[1]]; fit$W[[2]]; ## check orthogonality X1W1 <- X1 %*% fit$W[[1]]; t(X1W1) %*% X1W1
Get sparse CCA solutions of X1 and X2. Use CCA
and CCA.permute
from PMA package.
See PMA package for details.
SparseCCA(X1, X2, rank = 2, penaltyx1 = NULL, penaltyx2 = NULL, nperms = 25, ifplot = 0)
SparseCCA(X1, X2, rank = 2, penaltyx1 = NULL, penaltyx2 = NULL, nperms = 25, ifplot = 0)
X1 , X2
|
Numeric matrices representing the two sets of covariates. They should have the same number of rows and cannot contain missing values. |
rank |
The number of canonical variate pairs wanted. The default is 2. |
penaltyx1 , penaltyx2
|
Numeric vectors as the penalties to be applied to X1 and X2. The defaults are seq(0.1, 0.7, len = 20). See PMA package for details. |
nperms |
Number of times the data should be permuted. The default is 25. Don't use too small value. See PMA package for details. |
ifplot |
0 or 1. The default is 0 which means don't plot the result. See PMA package for details. |
This function is generally used for tuning the penalties in sparse CCA if penaltyx1
and penaltyx2
are supplied as vectors.
The CCA solution is based on PMD algorithm and the tuning is based on permutation.
The fitted W1 and W2 are scaled so that the diagonals of W1'X1'X1W1 and W2'X2'X2W2 are all 1's.
Specifically, if a single value of mild penalty is provided for both penaltyx1
and penaltyx2
, the result can be used as
initial values of W's in CVR. For instance, with penaltyx1 = 0.7
and penaltyx2 = 0.7
,
the fitted W1 and W2 are only shrinked, but mostly not zero yet.
W1 |
Loading matrix corresponding to X1. X1*W1 gives the canonical variates from X1. |
W2 |
Loading matrix corresponding to X2. X2*W2 gives the canonical variates from X2. |
Chongliang Luo, Kun Chen.
Daniela M. Witten, Robert Tibshirani and Trevor Hastie (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515-534.
Chongliang Luo, Jin Liu, Dipak D. Dey and Kun Chen (2016) Canonical variate regression. Biostatistics, doi: 10.1093/biostatistics/kxw001.
CVR
, CCA
, CCA.permute
.