Title: | Random Forest with Canonical Correlation Analysis |
---|---|
Description: | Random Forest with Canonical Correlation Analysis (RFCCA) is a random forest method for estimating the canonical correlations between two sets of variables depending on the subject-related covariates. The trees are built with a splitting rule specifically designed to partition the data so as to maximize the canonical correlation heterogeneity between child nodes. The method is described in Alakus et al. (2021) <doi:10.1093/bioinformatics/btab158>. 'RFCCA' uses the 'randomForestSRC' package (Ishwaran and Kogalur, 2020), frozen at version 2.9.3; its custom splitting rule feature is used to apply the proposed splitting rule. The 'randomForestSRC' package implements 'OpenMP' by default, contingent upon the support provided by the target architecture and operating system. In this package, the 'LAPACK' and 'BLAS' libraries are used for matrix decompositions. |
Authors: | Cansu Alakus [aut, cre], Denis Larocque [aut], Aurelie Labbe [aut], Hemant Ishwaran [ctb] (Author of included randomForestSRC codes), Udaya B. Kogalur [ctb] (Author of included randomForestSRC codes), Intel Corporation [cph] (Copyright holder of included LAPACKE codes), Keita Teranishi [ctb] (Author of included cblas_dgemm.c codes) |
Maintainer: | Cansu Alakus <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.0.0 |
Built: | 2024-11-05 06:21:35 UTC |
Source: | CRAN |
RFCCA is a random forest method for estimating the canonical correlations between two sets of variables depending on the subject-related covariates. The trees are built with a splitting rule specifically designed to partition the data so as to maximize the canonical correlation heterogeneity between child nodes. RFCCA uses the 'randomForestSRC' package (Ishwaran and Kogalur, 2020), frozen at version 2.9.3; its custom splitting rule feature is used to apply the proposed splitting rule. The method is described in Alakus et al. (2021).
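The idea of splitting to maximize canonical correlation heterogeneity can be illustrated with base R alone. The sketch below is not the package's splitting rule: it uses simulated data, a single covariate z, a hypothetical helper split_heterogeneity(), and an assumed criterion (the absolute difference between the first canonical correlations of the two child nodes, computed with stats::cancor). The exact rule used by RFCCA is given in Alakus et al. (2021).

```r
## Illustrative sketch only: simulated data, assumed criterion, base R.
set.seed(42)
n <- 300
z <- runif(n)                          # one subject-related covariate
x <- matrix(rnorm(n * 2), n, 2)        # first set of variables
y <- matrix(rnorm(n * 2), n, 2)        # second set of variables

## make X and Y strongly correlated only when z > 0.6
strong <- z > 0.6
y[strong, 1] <- x[strong, 1] + rnorm(sum(strong), sd = 0.3)

## first (largest) canonical correlation between two matrices
first_cancor <- function(a, b) cancor(a, b)$cor[1]

## hypothetical helper: heterogeneity of a candidate split "z <= cut"
split_heterogeneity <- function(cut, x, y, z) {
  left <- z <= cut
  rho_left  <- first_cancor(x[left, , drop = FALSE],  y[left, , drop = FALSE])
  rho_right <- first_cancor(x[!left, , drop = FALSE], y[!left, , drop = FALSE])
  abs(rho_left - rho_right)
}

## score a grid of candidate split points and keep the best one
cuts <- seq(0.2, 0.8, by = 0.1)
scores <- vapply(cuts, function(cut) split_heterogeneity(cut, x, y, z),
                 numeric(1))
best_cut <- cuts[which.max(scores)]
```

A tree grown this way separates observations whose X-Y correlation differs, which is why the covariates Z end up driving the estimated canonical correlation.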
rfcca
predict.rfcca
global.significance
vimp.rfcca
plot.vimp.rfcca
print.rfcca
Alakus, C., Larocque, D., Jacquemont, S., Barlaam, F., Martin, C.-O., Agbogba, K., Lippe, S., and Labbe, A. (2021). Conditional canonical correlation estimation based on covariates with random forests. Bioinformatics, 37(17), 2714-2721.
Ishwaran, H., Kogalur, U. (2020). Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC). R package version 2.9.3, https://cran.r-project.org/package=randomForestSRC.
A generated data set containing three sets of variables: X, Y, Z. The canonical correlation between X and Y depends on some of the Z variables. The sample size is 300. Z1-Z5 are the important variables for the varying correlation between X and Y. Z6-Z7 are the noise variables.
data
A list with three elements, namely X, Y and Z. Each element has 300 rows. X has 2 columns, Y has 2 columns and Z has 7 columns.
## load generated example data
data(data, package = "RFCCA")
This function runs a permutation test to evaluate the global effect of the subject-related covariates (Z). It returns an estimated p-value.
global.significance(
  X,
  Y,
  Z,
  ntree = 200,
  mtry = NULL,
  nperm = 500,
  nodesize = NULL,
  nodedepth = NULL,
  nsplit = 10,
  Xcenter = TRUE,
  Ycenter = TRUE
)
X |
The first multivariate data set which has |
Y |
The second multivariate data set which has |
Z |
The set of subject-related covariates which has |
ntree |
Number of trees. |
mtry |
Number of z-variables randomly selected as candidates for
splitting a node. The default is |
nperm |
Number of permutations. |
nodesize |
Forest average number of unique data points in a terminal
node. The default is the |
nodedepth |
Maximum depth to which a tree should be grown. In the default, this parameter is ignored. |
nsplit |
Non-negative integer value for the number of random splits to
consider for each candidate splitting variable. When zero or |
Xcenter |
Should the columns of X be centered? The default is TRUE. |
Ycenter |
Should the columns of Y be centered? The default is TRUE. |
An object of class (rfcca,globalsignificance), which is a list with the following components:
call |
The original call to |
pvalue |
p-value, see below for details. |
n |
Sample size of the data ( |
ntree |
Number of trees grown. |
nperm |
Number of permutations. |
mtry |
Number of variables randomly selected for splitting at each node. |
nodesize |
Minimum forest average number of unique data points in a terminal node. |
nodedepth |
Maximum depth to which a tree is allowed to be grown. |
nsplit |
Number of randomly selected split points. |
xvar |
Data frame of x-variables. |
xvar.names |
A character vector of the x-variable names. |
yvar |
Data frame of y-variables. |
yvar.names |
A character vector of the y-variable names. |
zvar |
Data frame of z-variables. |
zvar.names |
A character vector of the z-variable names. |
predicted.oob |
OOB predicted canonical correlations for training observations based on the selected final canonical correlation estimation method. |
predicted.perm |
Predicted canonical correlations for the permutations. A matrix of predictions with observations on the rows and permutations on the columns. |
We perform a hypothesis test to evaluate the global effect of the subject-related covariates on distinguishing between canonical correlations. Define the unconditional canonical correlation between $X$ and $Y$ as $\rho_{CCA}(X, Y)$, which is found by computing CCA with all $X$ and $Y$, and the conditional canonical correlation between $X$ and $Y$ given $Z$ as $\rho(X, Y \mid Z)$, which is found by rfcca(). If there is a global effect of $Z$ on correlations between $X$ and $Y$, $\rho(X, Y \mid Z)$ should be significantly different from $\rho_{CCA}(X, Y)$. We conduct a permutation test for the null hypothesis

$$H_0 : \rho(X, Y \mid Z) = \rho_{CCA}(X, Y).$$

We estimate a p-value with the permutation test. If the p-value is less than the pre-specified significance level $\alpha$, we reject the null hypothesis.
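The logic of such a permutation test can be sketched in base R. This is a simplified stand-in, not the procedure implemented in global.significance(): it assumes a single binary covariate z, simulated data, and a hypothetical statistic (the absolute difference between the first canonical correlations in the two groups defined by z). Permuting z breaks any link between the covariate and the X-Y correlation, which gives the null distribution of the statistic.

```r
## Illustrative sketch only: simulated data, simplified statistic, base R.
set.seed(1)
n <- 200
z <- rbinom(n, 1, 0.5)                 # binary subject-related covariate
x <- matrix(rnorm(n * 2), n, 2)
y <- matrix(rnorm(n * 2), n, 2)

## the X-Y correlation is strong only when z == 1
y[, 1] <- ifelse(z == 1, x[, 1] + rnorm(n, sd = 0.3), y[, 1])

## first (largest) canonical correlation between two matrices
first_cancor <- function(a, b) cancor(a, b)$cor[1]

## hypothetical test statistic: correlation difference between the z groups
stat <- function(x, y, z) {
  abs(first_cancor(x[z == 1, ], y[z == 1, ]) -
      first_cancor(x[z == 0, ], y[z == 0, ]))
}

obs <- stat(x, y, z)                   # observed statistic

## null distribution: recompute the statistic with z permuted
nperm <- 200
perm <- replicate(nperm, stat(x, y, sample(z)))

## estimated p-value: share of permuted statistics at least as large as observed
pvalue <- mean(perm >= obs)
```

A small pvalue indicates that the observed dependence of the correlation on z is unlikely under the null hypothesis. Larger values of nperm give a finer-grained estimate at a higher computational cost.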
rfcca
predict.rfcca
print.rfcca
## load generated example data
data(data, package = "RFCCA")
set.seed(2345)

global.significance(X = data$X, Y = data$Y, Z = data$Z, ntree = 40, nperm = 5)
Plots variable importance measures (VIMP) for subject-related z-variables for training data.
## S3 method for class 'rfcca'
plot.vimp(x, sort = TRUE, ndisp = NULL, ...)
x |
An object of class (rfcca,grow) or (rfcca,predict). |
sort |
Should the z-variables be sorted according to their variable importance measures in the plot? The default is TRUE. |
ndisp |
Number of z-variables to display in the plot. If |
... |
Optional arguments to be passed to other methods. |
Invisibly, the variable importance measures that were plotted.
## load generated example data
data(data, package = "RFCCA")
set.seed(2345)

## train rfcca
rfcca.obj <- rfcca(X = data$X, Y = data$Y, Z = data$Z, ntree = 100,
                   importance = TRUE)

## plot vimp
plot.vimp(rfcca.obj)
Obtain predicted canonical correlations using an rfcca forest for training or new data.
## S3 method for class 'rfcca'
predict(
  object,
  newdata,
  membership = FALSE,
  finalcca = c("cca", "scca", "rcca"),
  ...
)
object |
An object of class |
newdata |
Test data of the set of subject-related covariates (Z). A
data.frame with numeric values and factors. If missing, the out-of-bag
predictions in |
membership |
Should terminal node membership information be returned? |
finalcca |
Which CCA should be used for final canonical correlation estimation? Choices are "cca", "scca" and "rcca". |
... |
Optional arguments to be passed to other methods. |
An object of class (rfcca,predict), which is a list with the following components:
call |
The original grow call to |
n |
Sample size of the test data ( |
ntree |
Number of trees grown. |
xvar |
Data frame of x-variables. |
xvar.names |
A character vector of the x-variable names. |
yvar |
Data frame of y-variables. |
yvar.names |
A character vector of the y-variable names. |
zvar |
Data frame of test z-variables. If |
zvar.names |
A character vector of the z-variable names. |
forest |
The |
membership |
A matrix recording terminal node membership for the test data where each cell represents the node number that an observation falls in for that tree. |
predicted |
Test set predicted canonical correlations based on the
selected final canonical correlation estimation method. If |
predicted.coef |
Predicted canonical weight vectors for x- and y- variables. |
finalcca |
The selected CCA used for final canonical correlation estimations. |
## load generated example data
data(data, package = "RFCCA")
set.seed(2345)

## define train/test split
smp <- sample(1:nrow(data$X), size = round(nrow(data$X) * 0.7),
              replace = FALSE)
train.data <- lapply(data, function(x) {x[smp, ]})
test.Z <- data$Z[-smp, ]

## train rfcca
rfcca.obj <- rfcca(X = train.data$X, Y = train.data$Y, Z = train.data$Z,
                   ntree = 100)

## predict without new data (OOB predictions will be returned)
pred.obj <- predict(rfcca.obj)
pred.oob <- pred.obj$predicted

## predict with new test data
pred.obj2 <- predict(rfcca.obj, newdata = test.Z)
pred <- pred.obj2$predicted

## print predict objects
print(pred.obj)
print(pred.obj2)
Print summary output of an RFCCA analysis. This is the default print method for the package.
## S3 method for class 'rfcca'
print(x, ...)
x |
An object of class |
... |
Optional arguments to be passed to other methods. |
No return value, called for side effects.
## load generated example data
data(data, package = "RFCCA")
set.seed(2345)

## train rfcca
rfcca.obj <- rfcca(X = data$X, Y = data$Y, Z = data$Z, ntree = 100,
                   importance = TRUE)

## print the grow object
print(rfcca.obj)
Estimates the canonical correlations between two sets of variables depending on the subject-related covariates.
rfcca(
  X,
  Y,
  Z,
  ntree = 200,
  mtry = NULL,
  nodesize = NULL,
  nodedepth = NULL,
  nsplit = 10,
  importance = FALSE,
  finalcca = c("cca", "scca", "rcca"),
  bootstrap = TRUE,
  samptype = c("swor", "swr"),
  sampsize = if (samptype == "swor") function(x) { x * 0.632 } else function(x) { x },
  forest = TRUE,
  membership = FALSE,
  bop = TRUE,
  Xcenter = TRUE,
  Ycenter = TRUE,
  ...
)
X |
The first multivariate data set which has |
Y |
The second multivariate data set which has |
Z |
The set of subject-related covariates which has |
ntree |
Number of trees. |
mtry |
Number of z-variables randomly selected as candidates for
splitting a node. The default is |
nodesize |
Forest average number of unique data points in a terminal
node. The default is the |
nodedepth |
Maximum depth to which a tree should be grown. In the default, this parameter is ignored. |
nsplit |
Non-negative integer value for the number of random splits to
consider for each candidate splitting variable. When zero or |
importance |
Should variable importance of z-variables be assessed? The default is FALSE. |
finalcca |
Which CCA should be used for final canonical correlation estimation? Choices are "cca", "scca" and "rcca". |
bootstrap |
Should the data be bootstrapped? The default value is TRUE. |
samptype |
Type of bootstrap. Choices are "swor" (sampling without replacement) and "swr" (sampling with replacement). |
sampsize |
Size of sample to draw. For sampling without replacement, by default it is .632 times the sample size. For sampling with replacement, it is the sample size. |
forest |
Should the forest object be returned? It is used for prediction on new data. The default is TRUE. |
membership |
Should terminal node membership and inbag information be returned? |
bop |
Should the Bag of Observations for Prediction (BOP) for training observations be returned? The default is TRUE. |
Xcenter |
Should the columns of X be centered? The default is TRUE. |
Ycenter |
Should the columns of Y be centered? The default is TRUE. |
... |
Optional arguments to be passed to other methods. |
An object of class (rfcca,grow), which is a list with the following components:
call |
The original call to |
n |
Sample size of the data ( |
ntree |
Number of trees grown. |
mtry |
Number of variables randomly selected for splitting at each node. |
nodesize |
Minimum forest average number of unique data points in a terminal node. |
nodedepth |
Maximum depth to which a tree is allowed to be grown. |
nsplit |
Number of randomly selected split points. |
xvar |
Data frame of x-variables. |
xvar.names |
A character vector of the x-variable names. |
yvar |
Data frame of y-variables. |
yvar.names |
A character vector of the y-variable names. |
zvar |
Data frame of z-variables. |
zvar.names |
A character vector of the z-variable names. |
leaf.count |
Number of terminal nodes for each tree in the forest.
Vector of length |
bootstrap |
Was the data bootstrapped? |
forest |
If |
membership |
A matrix recording terminal node membership where each cell represents the node number that an observation falls in for that tree. |
importance |
Variable importance measures (VIMP) for each z-variable. |
inbag |
A matrix recording inbag membership where each cell represents whether the observation is in the bootstrap sample in the corresponding tree. |
predicted.oob |
OOB predicted canonical correlations for training observations based on the selected final canonical correlation estimation method. |
predicted.coef |
Predicted canonical weight vectors for x- and y- variables. |
bop |
If |
finalcca |
The selected CCA used for final canonical correlation estimations. |
rfsrc.grow |
An object of class |
The final canonical correlation can be computed with CCA (Hotelling, 1936), Sparse CCA (Witten et al., 2009) or Regularized CCA (Vinod, 1976; Leurgans et al., 1993). If Regularized CCA is used, its two regularization parameters should be specified.
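To show what ridge-type regularization does to a canonical correlation, the following base R sketch computes the first regularized canonical correlation in the spirit of Vinod (1976). It is an illustration under stated assumptions, not the package's implementation: the function ridge_cancor() and its arguments ridge_x and ridge_y are hypothetical names, and the data are simulated.

```r
## Illustrative sketch only: hypothetical function and argument names.
ridge_cancor <- function(x, y, ridge_x = 0, ridge_y = 0) {
  ## center both sets of variables
  x <- scale(x, center = TRUE, scale = FALSE)
  y <- scale(y, center = TRUE, scale = FALSE)
  n <- nrow(x)

  ## covariance blocks, with ridge penalties added to the diagonals
  sxx <- crossprod(x) / (n - 1) + ridge_x * diag(ncol(x))
  syy <- crossprod(y) / (n - 1) + ridge_y * diag(ncol(y))
  sxy <- crossprod(x, y) / (n - 1)

  ## eigenvalues of Sxx^{-1} Sxy Syy^{-1} Syx are squared canonical correlations
  m <- solve(sxx, sxy) %*% solve(syy, t(sxy))
  rho2 <- max(Re(eigen(m, only.values = TRUE)$values))
  sqrt(max(rho2, 0))
}

## simulated data with one correlated pair of columns
set.seed(7)
n <- 150
x <- matrix(rnorm(n * 2), n, 2)
y <- matrix(rnorm(n * 2), n, 2)
y[, 1] <- x[, 1] + rnorm(n, sd = 0.5)

rho_plain <- ridge_cancor(x, y)                       # no penalty
rho_ridge <- ridge_cancor(x, y, ridge_x = 1, ridge_y = 1)
```

With both penalties at zero the result coincides with the first value of stats::cancor(x, y)$cor. Positive penalties shrink the estimate, which is the usual motivation for regularization when the covariance matrices are poorly conditioned.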
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321-377.
Leurgans, S. E., Moyeed, R. A., and Silverman, B. W. (1993). Canonical correlation analysis when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological), 55(3), 725-740.
Vinod, H. D. (1976). Canonical ridge and econometrics of joint production. Journal of Econometrics, 4(2), 147-166.
Witten, D. M., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515-534.
predict.rfcca
global.significance
vimp.rfcca
print.rfcca
## load generated example data
data(data, package = "RFCCA")
set.seed(2345)

## define train/test split
smp <- sample(1:nrow(data$X), size = round(nrow(data$X) * 0.7),
              replace = FALSE)
train.data <- lapply(data, function(x) {x[smp, ]})
test.Z <- data$Z[-smp, ]

## train rfcca
rfcca.obj <- rfcca(X = train.data$X, Y = train.data$Y, Z = train.data$Z,
                   ntree = 100, importance = TRUE)

## print the grow object
print(rfcca.obj)

## get the OOB predictions
pred.oob <- rfcca.obj$predicted.oob

## predict with new test data
pred.obj <- predict(rfcca.obj, newdata = test.Z)
pred <- pred.obj$predicted

## get the variable importance measures
z.vimp <- rfcca.obj$importance

## train rfcca and estimate the final canonical correlations with "scca"
rfcca.obj2 <- rfcca(X = train.data$X, Y = train.data$Y, Z = train.data$Z,
                    ntree = 100, finalcca = "scca")
Calculates variable importance measures (VIMP) for subject-related z-variables for training data.
## S3 method for class 'rfcca'
vimp(object, ...)
object |
An object of class (rfcca,grow). |
... |
Optional arguments to be passed to other methods. |
An object of class (rfcca,predict), which is a list with the following components:
call |
The original grow call to |
n |
Sample size of the data ( |
ntree |
Number of trees grown. |
zvar |
Data frame of z-variables. |
zvar.names |
A character vector of the z-variable names. |
predicted.oob |
OOB predicted canonical correlations for training observations based on the selected final canonical correlation estimation method. |
finalcca |
The selected CCA used for final canonical correlation estimations. |
importance |
Variable importance measures (VIMP) for each z-variable. |
## load generated example data
data(data, package = "RFCCA")
set.seed(2345)

## train rfcca
rfcca.obj <- rfcca(X = data$X, Y = data$Y, Z = data$Z, ntree = 100)

## get variable importance measures
vimp.obj <- vimp(rfcca.obj)
vimp.z <- vimp.obj$importance