CovRegRF: Covariance Regression with Random Forests

This R package implements the Covariance Regression with Random Forests (CovRegRF) method described in Alakus et al. (2023) <doi:10.1186/s12859-023-05377-y>. The theoretical details of the proposed method are presented in Section 1 followed by a data analysis using this method in Section 2.

Proposed method

Most of the existing multivariate regression analyses focus on estimating the conditional mean of the response variable given its covariates. However, it is also crucial in various areas to capture the conditional covariances or correlations among the elements of a multivariate response vector based on covariates. We consider the following setting: let Yn × q be a matrix of q response variables measured on n observations, where yi represents the ith row of Y. Similarly, let Xn × p be a matrix of p covariates available for all n observations, where xi represents the ith row of X. We assume that the observation yi with covariates xi has a conditional covariance matrix Σxi. We propose a novel method called Covariance Regression with Random Forests (CovRegRF) to estimate the covariance matrix of a multivariate response Y given a set of covariates X, using a random forest framework. Random forest trees are built with a specialized splitting criterion $$\sqrt{n_Ln_R}*d(\Sigma^L, \Sigma^R)$$ where ΣL and ΣR are the covariance matrix estimates of left and right nodes, and nL and nR are the left and right node sizes, respectively, d(ΣL, ΣR) is the Euclidean distance between the upper triangular part of the two matrices and computed as follows: $$d(A, B) = \sqrt{\sum_{i=1}^{q}\sum_{j=i}^{q} (\mathbf{A}_{ij} - \mathbf{B}_{ij})^2}$$ where Aq × q and Bq × q are symmetric matrices. For a new observation, the random forest provides the set of nearest neighbour out-of-bag observations which is used to estimate the conditional covariance matrix for that observation.

Significance test

We propose a hypothesis test to evaluate the effect of a subset of covariates on the covariance matrix estimates while controlling for the other covariates. Let ΣX be the conditional covariance matrix of Y given all X variables and ΣXc is the conditional covariance matrix of Y given only the set of controlling X variables. If a subset of covariates has an effect on the covariance matrix estimates obtained with the proposed method, then ΣX should be significantly different from ΣXc. We conduct a permutation test for the null hypothesis H0 : ΣX = ΣXc We estimate a p-value with the permutation test. If the p-value is less than the pre-specified significance level α, we reject the null hypothesis.

Data analysis

We will show how to use the CovRegRF package on a generated data set. The data set consists of two multivariate data sets: Xn × 3 and Yn × 3. The sample size (n) is 200. The covariance matrix of Y depends on X1 and X2 (i.e. X3 is a noise variable). We load the data and split it into train and test sets:

library(CovRegRF)
data(data)
xvar.names <- colnames(data$X)
yvar.names <- colnames(data$Y)
data1 <- data.frame(data$X, data$Y)

set.seed(4567)
smp <- sample(1:nrow(data1), size = round(nrow(data1)*0.6), replace = FALSE)
traindata <- data1[smp,, drop=FALSE]
testdata <- data1[-smp, xvar.names, drop=FALSE]

Firstly, we check the global effect of X on the covariance matrix estimates by applying the significance test for the three covariates.

formula <- as.formula(paste(paste(yvar.names, collapse="+"), ".", sep=" ~ "))
globalsig.obj <- significance.test(formula, traindata, params.rfsrc = list(ntree = 200), 
                                   nperm = 10, test.vars = NULL)
globalsig.obj$pvalue
#> [1] 0

Using 10 permutations, the estimated p-value is 0 which is smaller than the significance level (α) of 0.05 and we reject the null hypothesis indicating the conditional covariance matrices significantly vary with the set of covariates. When performing a permutation test to estimate a p-value, we need more than 10 permutations. Using 500 permutations, the estimated p-value is 0.012. The computational time increases with the number of permutations.

Next, we apply the proposed method with covregrf() and get the out-of-bag (OOB) covariance matrix estimates for the training observations.

covregrf.obj <- covregrf(formula, traindata, params.rfsrc = list(ntree = 200))
pred.oob <- covregrf.obj$predicted.oob
head(pred.oob, 2)
#> [[1]]
#>           y1        y2       y3
#> y1 1.1286879 0.8387699 1.101836
#> y2 0.8387699 1.9878143 1.507416
#> y3 1.1018365 1.5074164 3.591839
#> 
#> [[2]]
#>          y1       y2       y3
#> y1 1.353311 1.050629 1.761897
#> y2 1.050629 2.153400 2.142313
#> y3 1.761897 2.142313 4.453531

Then, we get the variable importance (VIMP) measures for the covariates. VIMP measures reflect the predictive power of X on the estimated covariance matrices. Also, we can plot the VIMP measures.

vimp.obj <- vimp(covregrf.obj)
vimp.obj$importance
#>         x1         x2         x3 
#> 1.38012921 0.51122499 0.07159317
plot.vimp(vimp.obj)

plot of chunk vimp_plot

From the VIMP measures, we see that X3 has smaller importance than X1 and X2. We apply the significance test to evaluate the effect of X3 on the covariance matrices while controlling for X1 and X2.

partialsig.obj <- significance.test(formula, traindata, params.rfsrc = list(ntree = 200), 
                                    nperm = 10, test.vars = "x3")
partialsig.obj$pvalue
#> [1] 0.3

Using 10 permutations, the estimated p-values is 0.3 and we fail to reject the null hypothesis, indicating that we do not have enough evidence to prove that X3 has an effect on the estimated covariance matrices while X1 and X2 are in the model. Using 500 permutations, the estimated p-value is 0.218.

Finally, we can get the covariance matrix predictions for the test observations.

pred.obj <- predict(covregrf.obj, testdata)
pred <- pred.obj$predicted
head(pred, 2)
#> [[1]]
#>           y1        y2        y3
#> y1 1.0710647 0.5039413 0.6859008
#> y2 0.5039413 1.6077176 1.1050398
#> y3 0.6859008 1.1050398 2.7710218
#> 
#> [[2]]
#>          y1       y2       y3
#> y1 1.294158 1.306257 1.854186
#> y2 1.306257 2.183658 2.424231
#> y3 1.854186 2.424231 4.387248

References

Alakus, C., Larocque, D., and Labbe, A. (2023). Covariance regression with random forests. BMC Bioinformatics 24, 258.

Session info

sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Monterey 12.6.3
#> 
#> Matrix products: default
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] CovRegRF_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] visNetwork_2.1.0   digest_0.6.29      R6_2.5.1           jsonlite_1.8.0     magrittr_2.0.3     evaluate_0.15     
#>  [7] highr_0.9          rlang_1.0.6        stringi_1.7.8      cli_3.6.0          data.table_1.14.2  rstudioapi_0.13   
#> [13] DiagrammeR_1.0.9   rmarkdown_2.14     data.tree_1.0.0    RColorBrewer_1.1-3 tools_4.2.0        stringr_1.4.0     
#> [19] htmlwidgets_1.5.4  glue_1.6.2         parallel_4.2.0     xfun_0.31          yaml_2.3.5         fastmap_1.1.0     
#> [25] compiler_4.2.0     htmltools_0.5.3    knitr_1.39