This R package implements the Covariance Regression with Random Forests (CovRegRF) method described in Alakus et al. (2023) <doi:10.1186/s12859-023-05377-y>. The theoretical details of the proposed method and its significance test are presented first, followed by a data analysis that applies the method.

Proposed method

Most existing multivariate regression analyses focus on estimating the conditional mean of the response variable given its covariates. However, in many areas it is also crucial to capture how the conditional covariances or correlations among the elements of a multivariate response vector depend on the covariates. We consider the following setting: let Y be an n × q matrix of q response variables measured on n observations, where y_i represents the ith row of Y. Similarly, let X be an n × p matrix of p covariates available for all n observations, where x_i represents the ith row of X. We assume that the observation y_i with covariates x_i has a conditional covariance matrix Σ_{x_i}.

We propose a novel method called Covariance Regression with Random Forests (CovRegRF) to estimate the covariance matrix of a multivariate response Y given a set of covariates X, using a random forest framework. Random forest trees are built with a specialized splitting criterion

$$\sqrt{n_L n_R} \, d(\Sigma^L, \Sigma^R)$$

where Σ^L and Σ^R are the covariance matrix estimates of the left and right nodes, and n_L and n_R are the left and right node sizes, respectively. Here d(Σ^L, Σ^R) is the Euclidean distance between the upper triangular parts of the two matrices, computed as

$$d(\mathbf{A}, \mathbf{B}) = \sqrt{\sum_{i=1}^{q}\sum_{j=i}^{q} (\mathbf{A}_{ij} - \mathbf{B}_{ij})^2}$$

where A and B are symmetric q × q matrices. For a new observation, the random forest provides the set of nearest neighbour out-of-bag observations, which is used to estimate the conditional covariance matrix for that observation.
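
The splitting criterion can be illustrated with a minimal base R sketch. It is not the package's internal implementation: the function names `cov_distance` and `split_statistic`, the argument `left`, and the use of the sample covariance `cov()` as the node estimate are assumptions made for illustration, and the data are simulated.

```r
# Euclidean distance between the upper triangular parts (diagonal included)
# of two symmetric q x q matrices, following the definition of d(A, B).
cov_distance <- function(A, B) {
  stopifnot(all(dim(A) == dim(B)))
  idx <- upper.tri(A, diag = TRUE)
  sqrt(sum((A[idx] - B[idx])^2))
}

# Split statistic sqrt(n_L * n_R) * d(Sigma^L, Sigma^R) for one candidate
# split of the rows of Y. `left` is a logical vector marking the left node.
# Assumption: node covariances are estimated with the sample covariance,
# so each node needs at least two observations.
split_statistic <- function(Y, left) {
  Y_left <- Y[left, , drop = FALSE]
  Y_right <- Y[!left, , drop = FALSE]
  n_left <- nrow(Y_left)
  n_right <- nrow(Y_right)
  sqrt(n_left * n_right) * cov_distance(cov(Y_left), cov(Y_right))
}

# Toy data: the scale of both responses grows with x, so a split on x
# separates low-variance rows from high-variance rows.
set.seed(1)
n <- 200
x <- runif(n)
Y <- matrix(rnorm(n * 2), ncol = 2) * (1 + 3 * x)

stat_on_x <- split_statistic(Y, left = x <= 0.5)
stat_on_x
```

In a tree, a statistic of this kind would be evaluated for many candidate splits, and the split with the largest value would be chosen.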

Significance test

We propose a hypothesis test to evaluate the effect of a subset of covariates on the covariance matrix estimates while controlling for the other covariates. Let Σ_X be the conditional covariance matrix of Y given all X variables and Σ_X^c be the conditional covariance matrix of Y given only the set of controlling X variables. If a subset of covariates has an effect on the covariance matrix estimates obtained with the proposed method, then Σ_X should be significantly different from Σ_X^c. We conduct a permutation test of the null hypothesis H₀ : Σ_X = Σ_X^c and use it to estimate a p-value. If the p-value is less than the pre-specified significance level α, we reject the null hypothesis.
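
The decision rule can be sketched in a few lines of base R. This is a hedged illustration rather than the package's code: it assumes a test statistic that grows as Σ_X and Σ_X^c move apart, and it uses one common convention for the permutation p-value (the proportion of permuted statistics at least as large as the observed one). The names `perm_pvalue`, `t_obs` and `t_perm`, and the numbers, are hypothetical.

```r
# Permutation p-value: share of permuted statistics that are at least as
# large as the observed statistic (assumed convention, see lead-in).
perm_pvalue <- function(t_obs, t_perm) {
  mean(t_perm >= t_obs)
}

# Hypothetical observed statistic and four hypothetical permuted statistics.
t_obs <- 2.5
t_perm <- c(1.0, 2.0, 3.0, 4.0)

p_value <- perm_pvalue(t_obs, t_perm)

# Reject the null hypothesis when the p-value is below alpha.
alpha <- 0.05
reject_null <- p_value < alpha

p_value
reject_null
```

Under this convention the estimated p-value is 0 whenever no permuted statistic reaches the observed one.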

Data analysis

We show how to use the CovRegRF package on a generated data set. The data set consists of two matrices, an n × 3 covariate matrix X and an n × 3 response matrix Y, with sample size n = 200. The covariance matrix of Y depends on X₁ and X₂ only (i.e. X₃ is a noise variable). We load the data and split it into training and test sets:

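library(CovRegRF)
data(data)
# Keep the covariate and response names for later use
xvar.names <- colnames(data$X)
yvar.names <- colnames(data$Y)
# Combine covariates and responses into a single data frame
data1 <- data.frame(data$X, data$Y)

# Use 60% of the observations for training and the rest for testing
set.seed(4567)
smp <- sample(seq_len(nrow(data1)), size = round(nrow(data1) * 0.6), replace = FALSE)
traindata <- data1[smp, , drop = FALSE]
# The test set keeps only the covariates
testdata <- data1[-smp, xvar.names, drop = FALSE]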

First, we check the global effect of X on the covariance matrix estimates by applying the significance test to all three covariates.

formula <- as.formula(paste(paste(yvar.names, collapse="+"), ".", sep=" ~ "))
globalsig.obj <- significance.test(formula, traindata, params.rfsrc = list(ntree = 200), 
                                   nperm = 10, test.vars = NULL)
globalsig.obj$pvalue
#> [1] 0

Using 10 permutations, the estimated p-value is 0, which is smaller than the significance level (α) of 0.05, so we reject the null hypothesis and conclude that the conditional covariance matrices vary significantly with the set of covariates. Ten permutations are too few for a reliable p-value estimate; in practice, more permutations are needed. Using 500 permutations, the estimated p-value is 0.012. The computational time increases with the number of permutations.
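
The effect of the number of permutations can be made concrete with a small sketch. It assumes that the p-value is estimated as a proportion of the permutations, so it can only take values that are multiples of 1/nperm; the helper `pvalue_resolution` is a hypothetical name, not a package function.

```r
# Smallest nonzero p-value that can be observed with `nperm` permutations,
# assuming the p-value is a proportion of the permutations.
pvalue_resolution <- function(nperm) {
  1 / nperm
}

res_10 <- pvalue_resolution(10)    # 0.1
res_500 <- pvalue_resolution(500)  # 0.002

# Under the same assumption, the number of permutations that a reported
# p-value corresponds to is p * nperm.
count_global <- round(0.012 * 500)

res_10
res_500
count_global
```

With 10 permutations, the only values available below α = 0.05 is the single value 0, which is why a larger number of permutations gives a more informative estimate.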

Next, we apply the proposed method with covregrf() and get the out-of-bag (OOB) covariance matrix estimates for the training observations.

covregrf.obj <- covregrf(formula, traindata, params.rfsrc = list(ntree = 200))
pred.oob <- covregrf.obj$predicted.oob
head(pred.oob, 2)
#> [[1]]
#>           y1        y2       y3
#> y1 1.1286879 0.8387699 1.101836
#> y2 0.8387699 1.9878143 1.507416
#> y3 1.1018365 1.5074164 3.591839
#> 
#> [[2]]
#>          y1       y2       y3
#> y1 1.353311 1.050629 1.761897
#> y2 1.050629 2.153400 2.142313
#> y3 1.761897 2.142313 4.453531
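
Because correlations are often easier to read than covariances, an estimated covariance matrix can be converted with `cov2cor()` from base R. The sketch below rebuilds the first OOB estimate from the values printed above (the off-diagonal entries are typed symmetrically); the object names `sigma_1` and `corr_1` are chosen for illustration.

```r
# First OOB covariance matrix estimate, typed in from the printed output.
resp <- c("y1", "y2", "y3")
sigma_1 <- matrix(
  c(1.1286879, 0.8387699, 1.1018365,
    0.8387699, 1.9878143, 1.5074164,
    1.1018365, 1.5074164, 3.591839),
  nrow = 3, byrow = TRUE,
  dimnames = list(resp, resp)
)

# Convert the covariance matrix to a correlation matrix.
corr_1 <- cov2cor(sigma_1)
round(corr_1, 3)
```

Since the printed `pred.oob` is a list of matrices, the same conversion could presumably be applied to every training observation with `lapply(pred.oob, cov2cor)`.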

Then, we get the variable importance (VIMP) measures for the covariates. VIMP measures reflect the predictive power of X on the estimated covariance matrices. Also, we can plot the VIMP measures.

vimp.obj <- vimp(covregrf.obj)
vimp.obj$importance
#>         x1         x2         x3 
#> 1.38012921 0.51122499 0.07159317
plot.vimp(vimp.obj)

[Figure: plot of the VIMP measures for the covariates]

From the VIMP measures, we see that X₃ has smaller importance than X₁ and X₂. We apply the significance test to evaluate the effect of X₃ on the covariance matrices while controlling for X₁ and X₂.

partialsig.obj <- significance.test(formula, traindata, params.rfsrc = list(ntree = 200), 
                                    nperm = 10, test.vars = "x3")
partialsig.obj$pvalue
#> [1] 0.3

Using 10 permutations, the estimated p-value is 0.3 and we fail to reject the null hypothesis: we do not have enough evidence to conclude that X₃ has an effect on the estimated covariance matrices while X₁ and X₂ are in the model. Using 500 permutations, the estimated p-value is 0.218.

Finally, we can get the covariance matrix predictions for the test observations.

pred.obj <- predict(covregrf.obj, testdata)
pred <- pred.obj$predicted
head(pred, 2)
#> [[1]]
#>           y1        y2        y3
#> y1 1.0710647 0.5039413 0.6859008
#> y2 0.5039413 1.6077176 1.1050398
#> y3 0.6859008 1.1050398 2.7710218
#> 
#> [[2]]
#>          y1       y2       y3
#> y1 1.294158 1.306257 1.854186
#> y2 1.306257 2.183658 2.424231
#> y3 1.854186 2.424231 4.387248
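
To study how one entry of the covariance matrix changes across observations, the same element can be pulled out of every predicted matrix. The sketch below is self-contained and uses a stand-in list, `pred_toy`, built from the two matrices printed above; the names `pred_toy`, `cov_y1_y2` and `var_y3` are hypothetical.

```r
# Stand-in for the list of predicted covariance matrices, using the two
# matrices printed above.
resp <- c("y1", "y2", "y3")
pred_toy <- list(
  matrix(c(1.0710647, 0.5039413, 0.6859008,
           0.5039413, 1.6077176, 1.1050398,
           0.6859008, 1.1050398, 2.7710218),
         nrow = 3, byrow = TRUE, dimnames = list(resp, resp)),
  matrix(c(1.294158, 1.306257, 1.854186,
           1.306257, 2.183658, 2.424231,
           1.854186, 2.424231, 4.387248),
         nrow = 3, byrow = TRUE, dimnames = list(resp, resp))
)

# Covariance between y1 and y2, one value per observation.
cov_y1_y2 <- vapply(pred_toy, function(S) S["y1", "y2"], numeric(1))

# Variance of y3, one value per observation.
var_y3 <- vapply(pred_toy, function(S) S["y3", "y3"], numeric(1))

cov_y1_y2
var_y3
```

Applied to the full `pred` list, vectors like these could be plotted against a covariate in `testdata` to see how the estimated covariance varies with it.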

References

Alakus, C., Larocque, D., and Labbe, A. (2023). Covariance regression with random forests. BMC Bioinformatics 24, 258.

Session info

sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Monterey 12.6.3
#> 
#> Matrix products: default
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] CovRegRF_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] visNetwork_2.1.0   digest_0.6.29      R6_2.5.1           jsonlite_1.8.0     magrittr_2.0.3     evaluate_0.15     
#>  [7] highr_0.9          rlang_1.0.6        stringi_1.7.8      cli_3.6.0          data.table_1.14.2  rstudioapi_0.13   
#> [13] DiagrammeR_1.0.9   rmarkdown_2.14     data.tree_1.0.0    RColorBrewer_1.1-3 tools_4.2.0        stringr_1.4.0     
#> [19] htmlwidgets_1.5.4  glue_1.6.2         parallel_4.2.0     xfun_0.31          yaml_2.3.5         fastmap_1.1.0     
#> [25] compiler_4.2.0     htmltools_0.5.3    knitr_1.39
