---
title: "Covsim: Simulating non-normal data with given covariance matrix"
author: "Njål Foldnes and Steffen Grønneberg"
date: "2024-05-31"
output: html_document
vignette: >
%\VignetteIndexEntry{Covsim: Simulating non-normal data with given covariance matrix}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
citation_package: natbib
bibliography: bibcovsim.bib
biblio-style: apalike
---
The covsim package includes three methods for simulating non-normal data from distributions with given covariance matrix. The most flexible method is VIne-To-Anything (VITA) (@gronneberg2017covariance), and its implementation and use in covsim is explained in @jss. VITA allows complete specification of the marginal distributions, and partial specification of bivariate dependencies.
In addition to VITA, two more simple methods are available in covsim, which allows specification of univariate skewness and kurtosis. The independent generator (IG) was proposed by @foldnes2016simple, while the piecewise linear approach (PLSIM) was proposed by @foldnes2021plsim.
# VITA
The function vita() returns a regular vine object as defined by package rvinecopulib (@rvinecopulib).
This object may be used to simulate data from the corresponding vine distribution.
The function takes a list of marginal distributions and a covariance matrix as essential input.
In addition, the user may provide details on the vine structure and the bivariate copulae to be considered at each node in the vine.
## Bivariate case
Let us first consider the bivariate case, where we want a distribution with standard normal marginals coupled by a Clayton copula, whose correlation is $\rho=.8$
```r
library("covsim")
mnorm <- list(list(distr = "norm"), list(distr = "norm"))
sigma.target <- matrix(c(1, 0.8, 0.8, 1), 2)
set.seed(1)
calibrated.vita <- vita(mnorm, sigma.target, family_set = "clayton")
#> Tree 1
#> 1 - 2 ( 1 of 1 )
summary(calibrated.vita)
#> $margins
#> # A data.frame: 2 x 2
#> margin distr
#> 1 norm
#> 2 norm
#>
#> $copula
#> # A data.frame: 1 x 10
#> tree edge conditioned conditioning var_types family rotation parameters df tau
#> 1 1 1, 2 c,c clayton 0 3.4 0 0.63
```
The calibrated vita object is made from a Clayton copula with dependence parameter $\theta=3.4$.
You can confirm that the vine distribution has correlation $\rho=.8$, by simulating a $n=10^5$ sample from the vita object:
```r
library("rvinecopulib")
large.sample <- rvine(10^5, calibrated.vita)
cov(large.sample)
#> [,1] [,2]
#> [1,] 0.9994719 0.8017954
#> [2,] 0.8017954 0.9985791
```
## Trivariate case
The motivation for this example is given in @jss, and we here restrict ourselves to calibration of the vita object. Our target covariance matrix is
$$ \Sigma = \begin{pmatrix} 1 & 0.4 & 0.3 \\
0.4 & 1 & 0.4 \\
0.3 & 0.4 & 1 \\ \end{pmatrix}, $$
and we want the marginals to be the standard
normal distribution, a scaled chi-squared
distribution with one degree of freedom, and a scaled Student's $t$ distribution with five DFs.
The scalings ensure unit variance. In addition, we want to specify the vine structure explicitly, and also the copula family for each bivariate dependency as follows.
Note: We set Nmax=10^5 to speed things up, the default, Nmax=10^6 should be used for serious work.
```r
sigma.target <- matrix(c(1, 0.4, 0.3, 0.4, 1, 0.4, 0.3, 0.4, 1), 3)
margins <- list(list(distr = "norm"), list(distr = "chisq", df = 1),
list(distr = "t", df = 5))
pcs <- list(list(bicop_dist("clayton"), bicop_dist("joe")),
list(bicop_dist("frank")))
vine_cop <- vinecop_dist(pcs, structure = dvine_structure(1:3))
margin.variances <- c(1, 2, 5/3)
pre <- diag(sqrt(margin.variances/diag(sigma.target)))
vita.target <- pre %*% sigma.target %*% pre
set.seed(1)
calibrated.vita <- vita(margins, vita.target, vc = vine_cop,
verbose = TRUE, Nmax = 10^5)#FAST CALIBRATION
#> Tree 1
#> 1 - 2 ( 1 of 3 )
#> 2 - 3 ( 2 of 3 )
#> Tree 2
#> 1 - 3 ( 3 of 3 )
post <- diag(1/diag(pre))
vita.sample <- rvine(10^5, calibrated.vita) %*% post
round(cov(vita.sample) - sigma.target, 3)
#> [,1] [,2] [,3]
#> [1,] 0.001 0.003 -0.004
#> [2,] 0.003 0.002 -0.002
#> [3,] -0.004 -0.002 -0.003
library(GGally)
GGally::ggpairs(data.frame(head(vita.sample, 2*10^3)))
```
Note that if the vc option is not included in the vita() call, the default will be the simple structure given by a d-vine. The copula families in each node (each bivariate association) may be specified by option family_set, where the default ensures that first a Clayton copula is tentatively calibrated. If unsuccesful, next a gauss, joe, gumbel or frank copula are calibrated, in that order. If none of these families can produce the required covariance, vita() returns an error message indicating that no vine distributional class can match the given marginals and covariance matrix.
## SEM example in 20 dimensions
![SEM population model, from @li](sem.jpg)
```r
sem.pop <- 'Ksi1 =~ start(.8)*x1 +start(.7)*x2 +start(.6)*x3 +start(.5)*x4
Ksi2 =~ start(.8)*x5 + start(.7)*x6 + start(.6)*x7 +start(.5)*x8
Eta1 =~ start(.8)*y1 + start(.7)*y2 + start(.6)*y3 +start(.5)*y4
Eta2 =~ start(.8)*y5 + start(.7)*y6 + start(.6)*y7 +start(.5)*y8
Eta3 =~ start(.8)*y9 + start(.7)*y10 + start(.6)*y11 +start(.5)*y12
Eta1 ~ start(.4)*Ksi1 + start(.6)*Ksi2
Eta2 ~ start(.4)*Ksi1 + start(.2)*Ksi2 + start(.3)*Eta1
Eta3 ~ start(.1)*Ksi1 + start(.1)*Ksi2 + start(.2)*Eta1 +start(.5)*Eta2
Ksi1 ~~ start(.3)*Ksi2; Eta1 ~~ start(.5)*Eta1;
Eta2 ~~ start(.5)*Eta2; Eta3 ~~ start(.5)*Eta3
x1 ~~ start(.5)*x1; x2 ~~ start(.5)*x2
x3 ~~ start(.5)*x3; x4 ~~ start(.5)*x4; x5 ~~ start(.5)*x5
x6 ~~ start(.5)*x6; x7 ~~ start(.5)*x7; x8 ~~ start(.5)*x8
y1 ~~ start(.5)*y1; y2 ~~ start(.5)*y2; y3 ~~ start(.5)*y3
y4 ~~ start(.5)*y4; y5 ~~ start(.5)*y5; y6 ~~ start(.5)*y6
y7 ~~ start(.5)*y7; y8 ~~ start(.5)*y8; y9 ~~ start(.5)*y9
y10 ~~ start(.5)*y10; y11 ~~ start(.5)*y11; y12 ~~ start(.5)*y12'
library(lavaan)
sigma.target <- lavInspect(sem(sem.pop, data = NULL), "sigma.hat")
```
Next, we fit a VITA distribution with normal marginals to the target
covariance matrix. This is a variant of a data generating distribution
used in the simulation study of @foldnes2021plsim. First, the
margins are scaled to match the target variances. Then, we calibrate a
VITA distribution. Note that we do not specify which family of copulae
to use, so the default Clayton copula is used. Finally, a list of 1000
samples, each of sample size 1000, is drawn from the calibrated vita
distribution.
```r
marginsnorm <- lapply(X = sqrt(diag(sigma.target)),
function(X) list(distr = "norm", sd = sqrt(X)))
vitadist <- vita(marginsnorm, sigma.target)
#> Tree 1
#> 1 - 2 ( 1 of 190 )
#> 2 - 3 ( 2 of 190 )
#> 3 - 4 ( 3 of 190 )
#> 4 - 5 ( 4 of 190 )
#> 5 - 6 ( 5 of 190 )
#> 6 - 7 ( 6 of 190 )
#> 7 - 8 ( 7 of 190 )
#> 8 - 9 ( 8 of 190 )
#> 9 - 10 ( 9 of 190 )
#> 10 - 11 ( 10 of 190 )
#> 11 - 12 ( 11 of 190 )
#> 12 - 13 ( 12 of 190 )
#> 13 - 14 ( 13 of 190 )
#> 14 - 15 ( 14 of 190 )
#> 15 - 16 ( 15 of 190 )
#> 16 - 17 ( 16 of 190 )
#> 17 - 18 ( 17 of 190 )
#> 18 - 19 ( 18 of 190 )
#> 19 - 20 ( 19 of 190 )
#> Tree 2
#> 1 - 3 ( 20 of 190 )
#> 2 - 4 ( 21 of 190 )
#> 3 - 5 ( 22 of 190 )
#> 4 - 6 ( 23 of 190 )
#> 5 - 7 ( 24 of 190 )
#> 6 - 8 ( 25 of 190 )
#> 7 - 9 ( 26 of 190 )
#> 8 - 10 ( 27 of 190 )
#> 9 - 11 ( 28 of 190 )
#> 10 - 12 ( 29 of 190 )
#> 11 - 13 ( 30 of 190 )
#> 12 - 14 ( 31 of 190 )
#> 13 - 15 ( 32 of 190 )
#> 14 - 16 ( 33 of 190 )
#> 15 - 17 ( 34 of 190 )
#> 16 - 18 ( 35 of 190 )
#> 17 - 19 ( 36 of 190 )
#> 18 - 20 ( 37 of 190 )
#> Tree 3
#> 1 - 4 ( 38 of 190 )
#> 2 - 5 ( 39 of 190 )
#> 3 - 6 ( 40 of 190 )
#> 4 - 7 ( 41 of 190 )
#> 5 - 8 ( 42 of 190 )
#> 6 - 9 ( 43 of 190 )
#> 7 - 10 ( 44 of 190 )
#> 8 - 11 ( 45 of 190 )
#> 9 - 12 ( 46 of 190 )
#> 10 - 13 ( 47 of 190 )
#> 11 - 14 ( 48 of 190 )
#> 12 - 15 ( 49 of 190 )
#> 13 - 16 ( 50 of 190 )
#> 14 - 17 ( 51 of 190 )
#> 15 - 18 ( 52 of 190 )
#> 16 - 19 ( 53 of 190 )
#> 17 - 20 ( 54 of 190 )
#> Tree 4
#> 1 - 5 ( 55 of 190 )
#> 2 - 6 ( 56 of 190 )
#> 3 - 7 ( 57 of 190 )
#> 4 - 8 ( 58 of 190 )
#> 5 - 9 ( 59 of 190 )
#> 6 - 10 ( 60 of 190 )
#> 7 - 11 ( 61 of 190 )
#> 8 - 12 ( 62 of 190 )
#> 9 - 13 ( 63 of 190 )
#> 10 - 14 ( 64 of 190 )
#> 11 - 15 ( 65 of 190 )
#> 12 - 16 ( 66 of 190 )
#> 13 - 17 ( 67 of 190 )
#> 14 - 18 ( 68 of 190 )
#> 15 - 19 ( 69 of 190 )
#> 16 - 20 ( 70 of 190 )
#> Tree 5
#> 1 - 6 ( 71 of 190 )
#> 2 - 7 ( 72 of 190 )
#> 3 - 8 ( 73 of 190 )
#> 4 - 9 ( 74 of 190 )
#> 5 - 10 ( 75 of 190 )
#> 6 - 11 ( 76 of 190 )
#> 7 - 12 ( 77 of 190 )
#> 8 - 13 ( 78 of 190 )
#> 9 - 14 ( 79 of 190 )
#> 10 - 15 ( 80 of 190 )
#> 11 - 16 ( 81 of 190 )
#> 12 - 17 ( 82 of 190 )
#> 13 - 18 ( 83 of 190 )
#> 14 - 19 ( 84 of 190 )
#> 15 - 20 ( 85 of 190 )
#> Tree 6
#> 1 - 7 ( 86 of 190 )
#> 2 - 8 ( 87 of 190 )
#> 3 - 9 ( 88 of 190 )
#> 4 - 10 ( 89 of 190 )
#> 5 - 11 ( 90 of 190 )
#> 6 - 12 ( 91 of 190 )
#> 7 - 13 ( 92 of 190 )
#> 8 - 14 ( 93 of 190 )
#> 9 - 15 ( 94 of 190 )
#> 10 - 16 ( 95 of 190 )
#> 11 - 17 ( 96 of 190 )
#> 12 - 18 ( 97 of 190 )
#> 13 - 19 ( 98 of 190 )
#> 14 - 20 ( 99 of 190 )
#> Tree 7
#> 1 - 8 ( 100 of 190 )
#> 2 - 9 ( 101 of 190 )
#> 3 - 10 ( 102 of 190 )
#> 4 - 11 ( 103 of 190 )
#> 5 - 12 ( 104 of 190 )
#> 6 - 13 ( 105 of 190 )
#> 7 - 14 ( 106 of 190 )
#> 8 - 15 ( 107 of 190 )
#> 9 - 16 ( 108 of 190 )
#> 10 - 17 ( 109 of 190 )
#> 11 - 18 ( 110 of 190 )
#> 12 - 19 ( 111 of 190 )
#> 13 - 20 ( 112 of 190 )
#> Tree 8
#> 1 - 9 ( 113 of 190 )
#> 2 - 10 ( 114 of 190 )
#> 3 - 11 ( 115 of 190 )
#> 4 - 12 ( 116 of 190 )
#> 5 - 13 ( 117 of 190 )
#> 6 - 14 ( 118 of 190 )
#> 7 - 15 ( 119 of 190 )
#> 8 - 16 ( 120 of 190 )
#> 9 - 17 ( 121 of 190 )
#> 10 - 18 ( 122 of 190 )
#> 11 - 19 ( 123 of 190 )
#> 12 - 20 ( 124 of 190 )
#> Tree 9
#> 1 - 10 ( 125 of 190 )
#> 2 - 11 ( 126 of 190 )
#> 3 - 12 ( 127 of 190 )
#> 4 - 13 ( 128 of 190 )
#> 5 - 14 ( 129 of 190 )
#> 6 - 15 ( 130 of 190 )
#> 7 - 16 ( 131 of 190 )
#> 8 - 17 ( 132 of 190 )
#> 9 - 18 ( 133 of 190 )
#> 10 - 19 ( 134 of 190 )
#> 11 - 20 ( 135 of 190 )
#> Tree 10
#> 1 - 11 ( 136 of 190 )
#> 2 - 12 ( 137 of 190 )
#> 3 - 13 ( 138 of 190 )
#> 4 - 14 ( 139 of 190 )
#> 5 - 15 ( 140 of 190 )
#> 6 - 16 ( 141 of 190 )
#> 7 - 17 ( 142 of 190 )
#> 8 - 18 ( 143 of 190 )
#> 9 - 19 ( 144 of 190 )
#> 10 - 20 ( 145 of 190 )
#> Tree 11
#> 1 - 12 ( 146 of 190 )
#> 2 - 13 ( 147 of 190 )
#> 3 - 14 ( 148 of 190 )
#> 4 - 15 ( 149 of 190 )
#> 5 - 16 ( 150 of 190 )
#> 6 - 17 ( 151 of 190 )
#> 7 - 18 ( 152 of 190 )
#> 8 - 19 ( 153 of 190 )
#> 9 - 20 ( 154 of 190 )
#> Tree 12
#> 1 - 13 ( 155 of 190 )
#> 2 - 14 ( 156 of 190 )
#> 3 - 15 ( 157 of 190 )
#> 4 - 16 ( 158 of 190 )
#> 5 - 17 ( 159 of 190 )
#> 6 - 18 ( 160 of 190 )
#> 7 - 19 ( 161 of 190 )
#> 8 - 20 ( 162 of 190 )
#> Tree 13
#> 1 - 14 ( 163 of 190 )
#> 2 - 15 ( 164 of 190 )
#> 3 - 16 ( 165 of 190 )
#> 4 - 17 ( 166 of 190 )
#> 5 - 18 ( 167 of 190 )
#> 6 - 19 ( 168 of 190 )
#> 7 - 20 ( 169 of 190 )
#> Tree 14
#> 1 - 15 ( 170 of 190 )
#> 2 - 16 ( 171 of 190 )
#> 3 - 17 ( 172 of 190 )
#> 4 - 18 ( 173 of 190 )
#> 5 - 19 ( 174 of 190 )
#> 6 - 20 ( 175 of 190 )
#> Tree 15
#> 1 - 16 ( 176 of 190 )
#> 2 - 17 ( 177 of 190 )
#> 3 - 18 ( 178 of 190 )
#> 4 - 19 ( 179 of 190 )
#> 5 - 20 ( 180 of 190 )
#> Tree 16
#> 1 - 17 ( 181 of 190 )
#> 2 - 18 ( 182 of 190 )
#> 3 - 19 ( 183 of 190 )
#> 4 - 20 ( 184 of 190 )
#> Tree 17
#> 1 - 18 ( 185 of 190 )
#> 2 - 19 ( 186 of 190 )
#> 3 - 20 ( 187 of 190 )
#> Tree 18
#> 1 - 19 ( 188 of 190 )
#> 2 - 20 ( 189 of 190 )
#> Tree 19
#> 1 - 20 ( 190 of 190 )
randomsamples <- replicate(10^3, rvine(10^3, vitadist))
```
The above calibration step is time-consuming, since it is high-dimensional.
With 20 variables, the calibration step
required 1.8 hours (again using a 2.3 GHz 8-Core Intel Core i9
CPU). This step is only performed once. When completed, random
samples can be drawn at a relatively fast rate. Producing 1000 samples
each of size 1000 took one minute to complete. Finally, we note that
the calibration step may be performed faster by specifying option
Nmax=$10^5$ when calling vita(), at the expense of
reduced precision in covariance matching.
## Ordinal-categorical data
We assume that the underlying correlation in a
continuous bivariate distribution with standard normal marginals is
$\rho=0.5$, and we discretize into three categories using thresholds
$\tau_1=0$ and $\tau_2=1$. This means that we consider simulated data
of the form
\[
X_i =
\left\{ \begin{matrix}
1, & \text{if } \xi_i \leq \tau_1 \\
2, & \text{if } \tau_1 < \xi_i \leq \tau_2 \\
3, & \text{if } \xi_i > \tau_2 \\
\end{matrix}
\right.
=
\left\{ \begin{matrix}
1, & \text{if } \xi_i \leq 0 \\
2, & \text{if } 0 < \xi_i \leq 1 \\
3, & \text{if } \xi_i > 1 \\
\end{matrix}
\right.
\]
for $i=1,2$, where $(\xi_1,\xi_2)$ is a continuous random vector
simulated using VITA. Both ordinal variables have proportions
$0.5, 0.34,$ and $0.16$. We inquire whether the polychoric correlation
estimator used in ordinal SEM becomes biased when we replace the
bivariate normal with a Clayton or a Joe copula. So first, we
determine parameters for the latter two copulas such that, when
marginals are standard normal, the Pearson correlation is $0.5$.
We then discretize a large sample from each of the VITA distributions and estimate the polychoric correlation.
The estimates turn out to be very biased:
```r
sigma.target <- matrix(c(1, 0.5, 0.5, 1), 2)
set.seed(1)
vita_clayton <- vita(list(list(distr = "norm"), list(distr = "norm")),
sigma.target, family_set = "clayton")
#> Tree 1
#> 1 - 2 ( 1 of 1 )
set.seed(1)
vita_joe <- vita(list(list(distr = "norm"), list(distr = "norm")),
sigma.target, family_set = "joe")
#> Tree 1
#> 1 - 2 ( 1 of 1 )
clayton.disc <- apply(rvine(10^5, vita_clayton), 2, cut,
breaks = c(-Inf, 0, 1, Inf), labels = FALSE)
joe.disc <- apply(rvine(10^5, vita_joe), 2, cut,
breaks = c(-Inf, 0, 1, Inf), labels = FALSE)
library(psych)
#polychoric correlation (based on underlying normality) is severely biased (downards for Clayton, and upwards for Joe)
polychoric(clayton.disc)$rho
#> [,1] [,2]
#> [1,] 1.000000 0.419161
#> [2,] 0.419161 1.000000
polychoric(joe.disc)$rho
#> [,1] [,2]
#> [1,] 1.0000000 0.5918562
#> [2,] 0.5918562 1.0000000
```
# IG and Piecewise linear methods
The IG and PLSIM methods require just specifying the kurtosis and skewness of each marginal distribution, together with the covariance matrix. Hence, the marginals are not fully controlled, in contrast to VITA.
Let us require skewness=2 and excess kurtosis =7 in each of three marginal distributions.
The covariance matrix is specified as
$$ \Sigma = \begin{pmatrix} 1 & 0.4 & 0.3 \\
0.4 & 1 & 0.4 \\
0.3 & 0.4 & 1 \\ \end{pmatrix}, $$
## IG
Function rIG() returns a list of simulated samples, of length specified by option reps.
The default is reps=1 (one sample only), so we need to append [[1]]:
```r
skewness=rep(2, 3)
excesskurtosis=rep(7,3)
sigma.target <- matrix(c(1, 0.4, 0.3, 0.4, 1, 0.4, 0.3, 0.4, 1), 3)
set.seed(1)
ig.sample <- rIG(10^4, sigma.target, skewness, excesskurtosis )[[1]]
round(cov(ig.sample)-sigma.target,3)
#> [,1] [,2] [,3]
#> [1,] -0.069 -0.016 -0.005
#> [2,] -0.016 -0.052 -0.025
#> [3,] -0.005 -0.025 -0.017
psych::skew(ig.sample)
#> [1] 1.814969 1.890223 1.925510
psych::kurtosi(ig.sample)
#> [1] 5.400986 5.717628 6.272962
```
## PLSIM
The piecewise linear approach is run similarly:
```r
set.seed(1)
pl.sample <- rPLSIM(10^4, sigma.target, skewness, excesskurtosis )[[1]]
pl.sample <- data.frame(pl.sample)
round(cov(pl.sample)-sigma.target,3)
#> X1 X2 X3
#> X1 0.011 0.012 0.036
#> X2 0.012 0.051 0.034
#> X3 0.036 0.034 0.044
psych::skew(pl.sample)
#> [1] 1.985208 2.047052 1.923874
psych::kurtosi(pl.sample)
#> X1 X2 X3
#> 6.545025 6.978239 6.029132
```
Note that rPLSIM() is slower that rIG(), since it requires numerical optimization for each pair of variables.
The distributions stemming from PLSIM are different from the IG distributions, even if the covariance matrix and the marginal skewness and kurtosi are identical.
```r
GGally::ggpairs(data.frame(ig.sample)[2:10^3,])
```
```r
GGally::ggpairs(data.frame(pl.sample)[2:10^3,])
```
# References