scDECO-Copula

library(scDECO)

Quick Start

n <- 2500

x.use <- rnorm(n)
w.use <- runif(n,-1,1)
marginals.use <- c("ZINB", "ZIGA")

# simulate data
y.use <- scdeco.sim.cop(marginals=marginals.use, x=x.use,
                    eta1.true=c(-2, 0.8), eta2.true=c(-2, 0.8),
                    beta1.true=c(1, 0.5), beta2.true=c(1, 1),
                    alpha1.true=7, alpha2.true=3,
                    tau.true=c(-0.2, .3), w=w.use)

Parameters:

marginals: The two marginals. Options are NB, ZINB, GA, ZIGA, Beta, and ZIBeta.
x: The vector (or matrix) of covariate values used to model the mean and correlation (rho) parameters.
eta1.true: The coefficients of the 1st marginal’s zero-inflation parameter.
eta2.true: The coefficients of the 2nd marginal’s zero-inflation parameter.
beta1.true: The coefficients of the 1st marginal’s mean parameter.
beta2.true: The coefficients of the 2nd marginal’s mean parameter.
alpha1.true: The coefficient of the 1st marginal’s second parameter.
alpha2.true: The coefficient of the 2nd marginal’s second parameter.
tau.true: The coefficients of the correlation parameter.
w: The vector (or matrix) of covariate values used to model the zero-inflation parameters.

This simulates a 2-column matrix with NROW(x) rows of observations from the scdeco.cop model.

# fit the model
mcmc.out <- scdeco.cop(y=y.use, x=x.use, marginals=marginals.use, w=w.use,
                       n.mcmc=10, burn=0, thin=1)  # short run for speed; in practice use e.g. n.mcmc=5000, burn=1000, thin=10

Parameters:

y: 2-column matrix with the dependent variable observations.
n.mcmc: The number of MCMC iterations to run.
burn: The number of MCMC iterations to discard from the beginning of the chain (burn-in).
thin: The thinning interval applied to the MCMC iterations that remain after burn-in.

This returns a matrix whose columns correspond to the parameters of the model and whose rows correspond to MCMC samples, with burn-in and thinning already applied.
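As a rough illustration of what applying burn-in and thinning means, the sketch below post-processes a raw chain by hand. It assumes the common convention that the first `burn` iterations are dropped and every `thin`-th remaining iteration is kept; the exact indexing used inside scdeco.cop may differ, and `raw.chain` is a made-up stand-in for unprocessed draws.

```r
# Hypothetical raw chain: 5000 iterations of 2 parameters (not scDECO output)
set.seed(1)
n.mcmc <- 5000; burn <- 1000; thin <- 10
raw.chain <- matrix(rnorm(n.mcmc * 2), ncol = 2,
                    dimnames = list(NULL, c("par1", "par2")))

# Assumed convention: drop the first `burn` rows, then keep every `thin`-th row
keep <- seq(from = burn + 1, to = n.mcmc, by = thin)
kept.chain <- raw.chain[keep, , drop = FALSE]

nrow(kept.chain)  # (n.mcmc - burn) / thin = 400 retained samples
```

With the Quick Start settings (n.mcmc=10, burn=0, thin=1) nothing is discarded under this convention, so the summary rests on only 10 draws.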

Point estimates and credible intervals for each parameter can be obtained from quantiles of these MCMC samples, for example the posterior median together with the 2.5% and 97.5% quantiles.

# extract estimates and 95% credible intervals
lowerupper <- t(apply(mcmc.out, 2, quantile, c(0.025, 0.5, 0.975)))
estmat <- cbind(lowerupper[,1],
                c(c(-2, 0.8), c(-2, 0.8), c(1, 0.5), c(1, 1), 7, 3, c(-0.2, .3)),
                lowerupper[,c(2,3)])
colnames(estmat) <- c("lower", "trueval", "estval", "upper")
estmat
#>               lower trueval      estval       upper
#> eta10  -0.280164392    -2.0 -0.12756307 -0.01123481
#> eta11  -0.271563688     0.8 -0.13292294 -0.01359985
#> eta20  -0.195761578    -2.0 -0.12832756  0.00000000
#> eta21   0.000000000     0.8  0.16404571  0.27336665
#> beta10  0.007189396     1.0  0.08593311  0.24862243
#> beta11  0.027144918     0.5  0.21671489  0.32932538
#> beta20  0.000000000     1.0  0.01312043  0.04983955
#> beta21  0.000000000     1.0  0.17356108  0.17356108
#> alpha1  0.852584613     7.0  0.96647467  1.11782672
#> alpha2  0.718347355     3.0  0.75444712  0.95926971
#> tau0    0.000000000    -0.2  0.07896686  0.24849958
#> tau1    0.000000000     0.3  0.08875574  0.33252516

Model Details

Let Y₁, …, Y_n be n independent bivariate random vectors, with Y_i = (Y_i1, Y_i2)^′. For j = 1, 2, we assume the marginal CDF of Y_ij is F_j(⋅; θ_j, x_i), where θ_j is the set of parameters associated with F_j and x_i = (1, x_i1, …, x_ip)^′ is the vector of covariates for the ith cell. We construct the joint CDF of Y_i via a Gaussian copula with covariate-dependent parameters as follows. Let Z_i = (Z_i1, Z_i2)^′ be such that

$$ \boldsymbol{Z}_i \sim N_2\left(\boldsymbol{0} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \boldsymbol{R}_i = \begin{bmatrix} 1 & \rho_i \\ \rho_i & 1 \end{bmatrix}\right) $$

with

$$ \rho_i = \text{corr}(Z_{i1}, Z_{i2}) = \frac{\exp(\boldsymbol{x}_i^{\prime}\boldsymbol{\tau}) - 1}{\exp(\boldsymbol{x}_i^{\prime}\boldsymbol{\tau}) + 1} $$

where τ = (τ₀, τ₁, …, τ_p)^′.
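To make the correlation link concrete, the following sketch evaluates ρ_i for a single covariate plus intercept, using the τ values from the Quick Start simulation. The covariate values in `x` are illustrative only. The link is algebraically the same as tanh(x_i^′τ / 2), which is why ρ_i always stays strictly between -1 and 1.

```r
# Correlation link: rho = (exp(x'tau) - 1) / (exp(x'tau) + 1)
tau <- c(-0.2, 0.3)              # values used in the Quick Start simulation
x   <- c(-2, -1, 0, 1, 2)        # a few illustrative covariate values
lin <- tau[1] + tau[2] * x       # linear predictor x_i' tau (intercept + slope)
rho <- (exp(lin) - 1) / (exp(lin) + 1)

round(rho, 3)
all.equal(rho, tanh(lin / 2))    # the link equals tanh(x'tau / 2), so |rho| < 1
```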

For the marginal distributions supported here (Negative Binomial, Gamma, and Beta), we model the mean parameter as a function of covariates through the log link:

μ_ij = E[Y_ij] = exp {x_i^′β^(j)}

where β^(j) = (β₀^(j), β₁^(j), …, β_p^(j))^′. We allow the second parameter of each distribution, which we call α_j, to be free of covariates.

This formulation incorporates dynamic association between Y_i1 and Y_i2, that is, association that depends on covariates, through the correlation between Z_i1 and Z_i2. We denote the joint CDF of Z_i by Φ_τ to reflect its dependence on the parameter τ. For both discrete and continuous marginals, the general form of the joint CDF of Y_i is given by

F_Y(y_i; θ₁, θ₂, τ, x_i) = Φ_τ(Φ⁻¹[F₁(y_i1; θ₁, x_i)],Φ⁻¹[F₂(y_i2; θ₂, x_i)])

where Φ⁻¹ represents the inverse CDF of N(0, 1).
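One way to read this construction is as a three-step recipe for generating data: draw correlated normals, push them through Φ to get uniforms, then apply each marginal's inverse CDF. The sketch below does this in base R with a fixed ρ and covariate-free means. All numeric values are assumptions chosen for illustration, and this is not the code used by scdeco.sim.cop.

```r
set.seed(42)
n     <- 2000
rho   <- 0.6                 # illustrative fixed correlation (no covariates here)
mu1   <- 5; alpha1 <- 7      # NB margin: mean and second parameter (assumed values)
mu2   <- 3; alpha2 <- 3      # Gamma margin: mean and second parameter (assumed values)

# Step 1: draw Z_i from a bivariate normal with unit variances and correlation rho
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# Step 2: map to uniforms with the standard normal CDF
u1 <- pnorm(z1)
u2 <- pnorm(z2)

# Step 3: apply each marginal's inverse CDF
y1 <- qnbinom(u1, size = alpha1, mu = mu1)              # NB with mean mu1
y2 <- qgamma(u2, shape = mu2 * alpha2, rate = alpha2)   # Gamma with mean mu2

c(mean(y1), mean(y2))                # close to mu1 and mu2
cor(y1, y2, method = "spearman")     # positive rank correlation induced by rho
```

The marginal means are preserved regardless of ρ, while the dependence between the two columns comes entirely from the latent normal pair.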

Marginal Parameterizations

We use the following parameterization of the negative binomial for use in the marginals:

$$ f_{\text{NB}}(y_{ij};\mu_{ij}, \alpha_{j}) = \frac{\Gamma(y_{ij} + \alpha_{j})}{\Gamma(y_{ij}+1)\Gamma(\alpha_{j})}\left(\frac{\alpha_{j}}{\alpha_{j}+\mu_{ij}}\right)^{\alpha_{j}}\left(\frac{\mu_{ij}}{\alpha_{j}+\mu_{ij}}\right)^{y_{ij}} $$

This has mean μ_ij and variance μ_ij + μ_ij²/α_j.
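As a quick check, the sketch below codes the pmf exactly as written and compares it with base R's `dnbinom` called with `size = alpha` and `mu = mu`, which uses the same parameterization. The helper name `dnb.formula` and the values of `mu` and `alpha` are illustrative.

```r
# NB pmf written term by term as in the formula above (on the log scale)
dnb.formula <- function(y, mu, alpha) {
  exp(lgamma(y + alpha) - lgamma(y + 1) - lgamma(alpha) +
        alpha * log(alpha / (alpha + mu)) +
        y * log(mu / (alpha + mu)))
}

mu <- 4; alpha <- 7; y <- 0:200
p  <- dnb.formula(y, mu, alpha)

all.equal(p, dnbinom(y, size = alpha, mu = mu))  # matches base R's (size, mu) form
m <- sum(y * p)                 # numerical mean over the (truncated) support
v <- sum((y - m)^2 * p)         # numerical variance
c(mean = m, variance = v, formula.variance = mu + mu^2 / alpha)
```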

For the gamma distribution, we use the following parameterization:

$$ f(y_{ij};\mu_{ij},\alpha_{j}) = \frac{\alpha_{j}^{\mu_{ij}\alpha_{j}}}{\Gamma(\mu_{ij}\alpha_{j})}\,y_{ij}^{\mu_{ij}\alpha_{j}-1}e^{-\alpha_{j} y_{ij}} $$

This has mean μ_ij and variance μ_ij/α_j.
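In base R terms this is a Gamma distribution with shape μ_ij α_j and rate α_j. The sketch below confirms that mapping and recovers the stated mean and variance by numerical integration. The helper name `dga.formula` and the parameter values are illustrative.

```r
mu <- 3; alpha <- 2   # illustrative values

# Density written as in the formula above (on the log scale)
dga.formula <- function(y, mu, alpha) {
  exp(mu * alpha * log(alpha) - lgamma(mu * alpha) +
        (mu * alpha - 1) * log(y) - alpha * y)
}

y <- seq(0.1, 10, by = 0.1)
all.equal(dga.formula(y, mu, alpha),
          dgamma(y, shape = mu * alpha, rate = alpha))   # shape = mu*alpha, rate = alpha

# Mean and variance by numerical integration
m <- integrate(function(y) y * dga.formula(y, mu, alpha), 0, Inf)$value
v <- integrate(function(y) (y - m)^2 * dga.formula(y, mu, alpha), 0, Inf)$value
c(mean = m, variance = v, formula.variance = mu / alpha)
```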

For the beta distribution, we use the following parameterization:

$$ f(y_{ij};\mu_{ij},\alpha_{j}) = \frac{\Gamma(\alpha_j)}{\Gamma(\mu_{ij}\alpha_{j})\Gamma((1-\mu_{ij})\alpha_j)}y_{ij}^{\mu_{ij}\alpha_j-1}(1-y_{ij})^{(1-\mu_{ij})\alpha_j-1} $$

This has mean μ_ij and variance μ_ij(1 − μ_ij)/(α_j + 1).
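This corresponds to the usual two-shape Beta distribution with shape1 = μ_ij α_j and shape2 = (1 − μ_ij) α_j. A small simulation in base R, with illustrative parameter values, shows the mean and variance matching the expressions above.

```r
mu <- 0.3; alpha <- 5             # illustrative values; mu must lie in (0, 1)
shape1 <- mu * alpha              # first shape parameter
shape2 <- (1 - mu) * alpha        # second shape parameter

set.seed(7)
draws <- rbeta(1e5, shape1, shape2)

c(sample.mean      = mean(draws),
  sample.variance  = var(draws),
  formula.variance = mu * (1 - mu) / (alpha + 1))
```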

Incorporating Zero-inflation

In multi-omics data, dropout events occur when observations for certain molecules are not detected and are therefore recorded as zeros. For this reason, we incorporate zero-inflation into the above model by including two additional covariate-dependent parameters, p₁ and p₂, each of which is the probability that an observation from the corresponding marginal is zeroed out by a dropout event.

Thus, for a Gamma or Beta marginal, the zero-inflated PDF is given by

f_j^zinf(y_ij; θ_j, x_i) = (1 − p_j)f_j(y_ij; θ_j, x_i)1(y_ij > 0) + p_j1(y_ij = 0)

and for an NB marginal, which can itself produce zeros, the zero-inflated PMF is given by:

f_j^zinf(y_ij; θ_j, x_i) = (1 − p_j)f_j(y_ij; θ_j, x_i) + p_j1(y_ij = 0)
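The two cases can be written as small density functions in base R. This is a minimal sketch with hypothetical helper names (`dzinb`, `dziga`) and illustrative parameter values, not functions exported by scDECO. It highlights the difference between the two forms: a zero from an NB marginal may be either a dropout or a genuine count, whereas a zero from a Gamma marginal can only come from dropout.

```r
# Zero-inflated NB pmf: the NB term also contributes at y = 0
dzinb <- function(y, mu, alpha, p) {
  (1 - p) * dnbinom(y, size = alpha, mu = mu) + p * (y == 0)
}

# Zero-inflated Gamma: point mass p at zero, scaled Gamma density for y > 0
dziga <- function(y, mu, alpha, p) {
  ifelse(y > 0,
         (1 - p) * dgamma(y, shape = mu * alpha, rate = alpha),
         p * (y == 0))
}

p <- 0.2
total.zinb <- sum(dzinb(0:500, mu = 4, alpha = 7, p = p))        # total mass
extra.zero <- dzinb(0, 4, 7, p) - dnbinom(0, size = 7, mu = 4)   # added mass at zero
cont.ziga  <- integrate(function(y) dziga(y, 3, 2, p), 0, Inf)$value

c(total.zinb = total.zinb, extra.zero = extra.zero,
  ziga.at.zero = dziga(0, 3, 2, p), ziga.continuous.part = cont.ziga)
```

The ZINB masses sum to 1, while the zero-inflated Gamma splits its mass into p at zero and 1 − p on the positive half-line.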

We allow p₁ and p₂ to depend on a different set of covariates than the x used in the previous section, because zero-inflation is often driven by different covariates than the marginal means and the correlation parameter. We call this new set of covariates w_i = (1, w_i1, …, w_ik)^′ and link it to p_ij, the zero-inflation probability for observation i in marginal j, as follows:

$$ p_{ij} = \frac{1}{1+\exp\left\{\boldsymbol{w}_i^{\prime}\boldsymbol{\eta}^{(j)}\right\}} $$

where η^(j) = (η₀^(j), η₁^(j), …, η_k^(j))^′.
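The sketch below evaluates this link for a single covariate plus intercept, using the η values from the Quick Start simulation and a few illustrative values of w. Taking the formula as written, p_ij is a logistic function of −w_i^′η^(j), so a larger linear predictor means a lower dropout probability.

```r
eta <- c(-2, 0.8)                 # values used in the Quick Start simulation
w   <- c(-1, 0, 1)                # illustrative covariate values in [-1, 1]
lin <- eta[1] + eta[2] * w        # linear predictor w_i' eta
p   <- 1 / (1 + exp(lin))         # formula as written above

all.equal(p, plogis(-lin))        # equivalent to a logistic function of -w'eta
round(p, 3)
```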

Parameter Estimation

Parameters are estimated with an adaptive MCMC approach based on a Metropolis scheme, applied one margin at a time. For more details, please refer to the paper cited below.
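For readers unfamiliar with adaptive Metropolis sampling, the toy sketch below shows the general idea on a one-parameter problem: a random-walk proposal whose step size is tuned during an initial adaptation phase toward a target acceptance rate. This is a generic illustration under assumed settings (toy data, flat prior, a 0.44 target rate, a decaying adaptation gain) and is not the sampler implemented in scDECO.

```r
set.seed(123)
y <- rnorm(50, mean = 2, sd = 1)            # toy data (not from scDECO)

# Log posterior for the mean under a flat prior and known sd = 1
log.post <- function(theta) sum(dnorm(y, mean = theta, sd = 1, log = TRUE))

n.iter <- 4000; n.adapt <- 1000
theta  <- numeric(n.iter)
cur    <- 0; cur.lp <- log.post(cur)
step   <- 1                                  # proposal sd, tuned during adaptation
target <- 0.44                               # assumed target acceptance rate (1-D)

for (t in seq_len(n.iter)) {
  prop    <- cur + rnorm(1, sd = step)       # symmetric random-walk proposal
  prop.lp <- log.post(prop)
  accept  <- log(runif(1)) < (prop.lp - cur.lp)   # Metropolis acceptance rule
  if (accept) { cur <- prop; cur.lp <- prop.lp }
  # Adapt only during the first n.adapt iterations, with a decaying gain
  if (t <= n.adapt) step <- step * exp((accept - target) / sqrt(t))
  theta[t] <- cur
}

post <- theta[(n.adapt + 1):n.iter]          # discard the adaptation phase
c(posterior.mean = mean(post), sample.mean = mean(y))
```

Freezing the step size after the adaptation phase keeps the retained part of the chain a standard Metropolis sampler.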

Citations

Zichen Ma, Shannon W. Davis, Yen-Yi Ho, Flexible Copula Model for Integrating Correlated Multi-Omics Data from Single-Cell Experiments, Biometrics, Volume 79, Issue 2, June 2023, Pages 1559–1572, https://doi.org/10.1111/biom.13701