Single-cell differential expression analysis with FLASHMM

Method

LMM estimation and inference

Consider the linear mixed-effects model (LMM) as expressed below (Searle, Casella, and McCulloch 2006) where y is an n × 1 vector of observed responses, X is an n × p design matrix for fixed effects β, Z is an n × q design matrix for random effects b, and ϵ is an n × 1 vector of residual errors. The term of random effects may be a combination of various components, Zb = Z1b1 + ⋯ + ZKbK, where Z = [Z1, …, ZK], b = [b1T, …, bKT]T, K is the number of random-effect components, and Zk is an n × qk design matrix for the k-th random-effect component. The superscript T denotes a transpose of vector or matrix. The basic assumptions are as follows:

  1. The design matrix X is of full rank, satisfying conditions of estimability for the parameters;
  2. The random vectors bk and ϵ are independent and follow a normal distribution, bk ∼ N(0, σk2Iqk)  and  ϵ ∼ N(0, σ2In).

Here 0 is a vector or matrix of zero elements, In is an n × n identity matrix, and σk2 and σ2 are unknown parameters, called variance components.

Hartley and Rao (1967) developed the maximum likelihood estimation (MLE) method for estimating the LMM parameters (fixed effects and variance components). Patterson and Thompson (1971) proposed a modified maximum likelihood procedure which partitions the data into two mutually uncorrelated parts, one being free of the fixed effects used for estimating variance components, called restricted maximum likelihood (REML) estimator. The REML estimator is unbiased. The MLE of variance components is biased in general. Both methods are asymptotically identical for estimating variance components. With variance components, θ, estimated, the fixed effects estimated by either MLE or REML are given as follows β̂ = (XTVθ−1X)−1XTVθ−1y, with covariance matrix var(β̂) = (XTVθ−1X)−1, where θ = (θ0, θ1, …, θK), θ0 = σ2, θk = σk2, and Vθ = θ0In + θ1Z1Z1T + … + θKZKZKT.

Estimating the variance components by either MLE or REML is a difficult numerical problem. Various iterative methods based on the log likelihood, called gradient methods, have been proposed (Searle, Casella, and McCulloch 2006). The gradient methods are represented by the iteration equation where l(θ)/∂θ is the gradient of the log likelihood function, and Γ(θ) is a modifier matrix of the gradient direction, which can be specified by Newton–Raphson, Fisher scoring or average information, see Supplementary Materials of the manuscript for the details.

The hypotheses for testing fixed effects and variance components can be respectively defined as H0, i : βi = 0  versus  H1, i : βi ≠ 0, H0, k : θk ≤ 0  versus  H1, k : θk > 0, where θk, k = 1, …, K, represent the parameters of variance components σk2 but are allowed to be negative. The lower boundary of the parameters, θ, can be the negative value such that the variance-covariance matrix, Vθ, remains definable (positive definite). Note that the negative lower boundary exists and can be less than σ2/λmax, where λmax > 0, is the largest singular value of ZZT. If θk > 0, then σk2 = θk is definable and the mixed model is well-specified. Otherwise, if θk ≤ 0, the mixed model is miss-specified and no random effects are needed.

Allowing the parameters of variance components to take negative value avoids the zero boundary problem in hypothesis testing for the variance components. Consequently, the asymptotic properties of the maximum likelihood estimation hold under regularity conditions, which enables us to use z-statistics or t-statistics for hypothesis testing of fixed effects and variance components. The t-statistics for fixed effects are given by The t-statistic for a contrast, a linear combination of the estimated fixed effects, cTβ̂, is The z-statistics for the parameters of variance components are given by where I(θ) is the Fisher information matrix.

Fast and scalable algorithm

We developed a summary statistics based algorithm for implementing the gradient method @ref(eq:gradient). Let XX, XY, ZX, ZY, and ZZ denote a matrix, respectively, which define the summary statistics that are computed from cell-level data X, Y and Z as follows: where Y = [y1T, …, ymT]T is a m-by-n matrix of gene expression profile with each row yi corresponding to the expression of gene i, i = 1, …, m. Once the summary statistics are precomputed, the summary-level data based algorithm has a complexity of O(m(p3 + q3)), which doesn’t depend on number of cells (sample size n). In single-cell DE analysis, the numbers of fixed and random effects, p and q, are relatively small. Therefore, the algorithm is fast and scalable, and requires less computer memory. See Supplemental Materials of the manuscript for details.

FLASHMM

FLASHMM package provides two functions, lmm and lmmfit, for fitting LMM. The lmm function uses summary statistics as arguments. The lmmfit function is a wrapper function of lmm, which directly uses cell-level data and computes the summary statistics inside the function. The lmmfit function is simple to be operated but it has a limitation of memory use. For large scale data, we should use lmm instead of lmmfit. That is, we precompute the summary-level data by @ref(eq:sdata) or sslmm function in FLASHMM package, and then use lmm function to fit LMM. FLASHMM provides the lmmtest function to perform statistical test on the fixed effects and the contrasts of fixed effects.

In summary, FLASHMM package provides the following functions.

  • lmm: fit LMM using summary-level data.
  • lmmfit: fit LMM using cell-level data.
  • lmmtest: perform statistical tests on the fixed effects and their contrasts.
  • sslmm: compute the summary-level data using cell-level data.
  • simuRNAseq: simulate multi-sample multi-cell-type scRNA-seq dataset based on a negative binomial distribution.

Quick start

#Install FLASHMM from Github.
devtools::install_github("https://github.com/Baderlab/FLASHMM")

#Load the package.
library(FLASHMM)

Simulating a scRNA-seq dataset by simuRNAseq

Simulate a multi-sample multi-cell-cluster scRNA-seq dataset comprising 25 samples and 4 clusters (cell-types) with 2 treatments.

set.seed(2412)
dat <- simuRNAseq(nGenes = 50, nCells = 1000, 
                  nsam = 25, ncls = 4, ntrt = 2, nDEgenes = 6)

names(dat)
#> [1] "ref.mean.dispersion" "metadata"            "counts"             
#> [4] "DEgenes"             "treatment"
##

#counts and meta data
counts <- dat$counts
metadata <- dat$metadata
rm(dat)

DE analysis using LMM

##(1) Model design
##Y: gene expression profile (log-transformed counts)
##X: design matrix for fixed effects
##Z: design matrix for random effects

Y <- log(counts + 1) 
X <- model.matrix(~ 0 + log(libsize) + cls + cls:trt, data = metadata)
Z <- model.matrix(~ 0 + sam, data = metadata)
d <- ncol(Z)
##

##(2) LMM fitting
##Fit LMM by lmmfit using cell-level data.
fit <- lmmfit(Y, X, Z, d = d)

##Fit LMM by lmm using summary-level data computed as follows.
##- Computing summary statistics
n <- nrow(X)
XX <- t(X)%*%X; XY <- t(Y%*%X)
ZX <- t(Z)%*%X; ZY <- t(Y%*%Z); ZZ <- t(Z)%*%Z
Ynorm <- rowSums(Y*Y)
##- Fitting LMM
fitss <- lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d)

identical(fit, fitss)
#> [1] TRUE
##

##Fit LMM by lmm using summary-level data computed by sslmm.
##- Computing summary statistics
ss <- sslmm(X, Y, Z)
##- Fitting LMM
fitss <- lmm(summary.stats = ss, d = d)

identical(fit, fitss)
#> [1] TRUE
##

##(3) Hypothesis tests
test <- lmmtest(fit)
#head(test)

##t-values
all(t(fit$t) == test[, grep("_t", colnames(test))])
#> [1] TRUE
fit$t[, 1:5]
#>                   Gene1      Gene2      Gene3      Gene4      Gene5
#> log(libsize)  3.5564382  6.1353128  3.6098989  4.1239445  2.8179995
#> clsC1        -2.4158938 -4.8697164 -2.3122737 -2.6709945 -1.3220206
#> clsC2        -2.5957105 -4.9918921 -2.2467924 -2.5108410 -1.2387859
#> clsC3        -2.5576263 -5.0249733 -2.1484996 -2.5856150 -1.2535998
#> clsC4        -2.4466344 -5.0065480 -2.2202673 -2.6983076 -1.2058758
#> clsC1:trtB    0.8697778  0.3659413  0.1515357  0.9707563 -0.9577914
#> clsC2:trtB    2.0700456  1.0976568 -0.1112106  0.7423556 -0.5997369
#> clsC3:trtB    1.5066385  1.5001771 -0.3659261  0.8883383 -1.0410003
#> clsC4:trtB   -0.3263183  1.6810047 -0.4559326  0.6462646 -1.7515738
##

##p-values
all(t(fit$p) == test[, grep("_p", colnames(test))])
#> [1] TRUE
fit$p[, 1:5]
#>                     Gene1        Gene2        Gene3        Gene4       Gene5
#> log(libsize) 0.0003936946 1.226867e-09 0.0003216502 4.036515e-05 0.004928509
#> clsC1        0.0158766072 1.300111e-06 0.0209667982 7.686618e-03 0.186466367
#> clsC2        0.0095791682 7.060831e-07 0.0248727531 1.220264e-02 0.215718133
#> clsC3        0.0106867912 5.971329e-07 0.0319158551 9.862323e-03 0.210283172
#> clsC4        0.0145925607 6.556356e-07 0.0266262016 7.087769e-03 0.228153191
#> clsC1:trtB   0.3846324624 7.144869e-01 0.8795840262 3.319065e-01 0.338401544
#> clsC2:trtB   0.0387066712 2.726210e-01 0.9114719020 4.580478e-01 0.548818677
#> clsC3:trtB   0.1322220329 1.338870e-01 0.7144983040 3.745743e-01 0.298129310
#> clsC4:trtB   0.7442524470 9.307711e-02 0.6485383571 5.182577e-01 0.080156444

Example

We use a simulated multi-sample multi-cell-type scRNA-seq dataset to illustrate how to utilize FLASHMM to perform single-cell differential expression analysis. We are interested in identifying the genes differentially expressed between two treatments (conditions) within a cell type.

Simulating multi-sample multi-cell-type scRNA-seq dataset

We generate a multi-sample multi-cell-type scRNA-seq dataset using a reference dataset by simuRNAseq function in FLASHMM package. We use PBMC 10X droplet-based scRNA-seq dataset (Kang et al. 2018) as the reference dataset, which contains count data and metadata. The metadata should include 3 columns: individuals (subjects or samples), cell types (clusters), and treatments (conditions) which are named as ‘sam’, ‘cls’, and ‘trt’, respectively.

Note that the simuRNAseq function can also generate the scRNA-seq dataset without a reference dataset. In this case, the following step for preparing the reference dataset can be skipped.

Preparing reference dataset

To load the PBMC dataset, we need ExperimentHub package.

library(ExperimentHub)

##Load data.
eh <- ExperimentHub()
#query(eh, "Kang")
sce <- eh[["EH2259"]]

##Remove undetected genes.
sce <- sce[rowSums(counts(sce) > 0) > 0, ]
##Remove cells with few or many detected genes (outliers).
nGenes <- colSums(counts(sce) > 0)
bx <- boxplot(log(nGenes), plot = FALSE)
sce <- sce[, nGenes >= exp(bx$stats[1]) & nGenes <= exp(bx$stats[5])]
##Remove lowly expressed genes.
##counts per cell (cpc) 
cpc <- rowSums(counts(sce))/ncol(sce)
sce <- sce[(rowSums(counts(sce) > 1) >= 10) & (cpc > 0.005), ]

##counts and metadata
counts <- assay(sce, "counts")
coldata <- as.data.frame(colData(sce))
head(coldata)
#>                   ind stim cluster            cell multiplets
#> AAACATACAATGCC-1  107 ctrl       5     CD4 T cells    doublet
#> AAACATACATTTCC-1 1016 ctrl       9 CD14+ Monocytes    singlet
#> AAACATACCAGAAA-1 1256 ctrl       9 CD14+ Monocytes    singlet
#> AAACATACCAGCTA-1 1256 ctrl       9 CD14+ Monocytes    doublet
#> AAACATACCATGCA-1 1488 ctrl       3     CD4 T cells    singlet
#> AAACATACCTCGCT-1 1256 ctrl       9 CD14+ Monocytes    singlet
all(colnames(counts) == rownames(coldata))
#> [1] TRUE
dim(counts)
#> [1]  7483 28721
rm(eh, sce)

Generating dataset

We use the reference dataset to generate a scRNA-seq dataset by simuRNAseq function. First we need to specify which columns represent samples (individuals), cell clusters (types), and treatments (experimental conditions) in the reference metadata by ‘sam’, ‘cls’, and ‘trt’, respectively. We also specify the numbers of genes, cells and differentially expressed (DE) genes to be generated. We use default settings for other arguments in simuRNAseq function. The generated dataset contains count data, metadata, and the DE genes specific to a cell cluster. The DE genes can be considered as the positive controls and the others the negative controls. Without confusion, DE can represent either ‘differentially expressed’ or ‘differential expression’.

##Specify which columns represent samples, treatments, and cell-types.
colnames(coldata)[colnames(coldata) == "ind"] <- "sam"  #samples
colnames(coldata)[colnames(coldata) == "stim"] <- "trt" #treatments
colnames(coldata)[colnames(coldata) == "cell"] <- "cls" #cell-types
coldata <- coldata[, c("sam", "trt", "cls")]
head(coldata)
#>                   sam  trt             cls
#> AAACATACAATGCC-1  107 ctrl     CD4 T cells
#> AAACATACATTTCC-1 1016 ctrl CD14+ Monocytes
#> AAACATACCAGAAA-1 1256 ctrl CD14+ Monocytes
#> AAACATACCAGCTA-1 1256 ctrl CD14+ Monocytes
#> AAACATACCATGCA-1 1488 ctrl     CD4 T cells
#> AAACATACCTCGCT-1 1256 ctrl CD14+ Monocytes
##

##Generate the dataset by simuRNAseq function.
set.seed(2412)
dat <- simuRNAseq(counts, nGenes = 100, nCells = 120000, metadata = coldata, 
                  nsam = 25, ncls = 10, ntrt = 2, nDEgenes = 10, 
                  minbeta = 0.5, maxbeta = 1, var.randomeffects = 0.1)
str(dat)
#> List of 5
#>  $ ref.mean.dispersion:'data.frame': 7483 obs. of  2 variables:
#>   ..$ mu        : num [1:7483] 0.0718 0.2008 19.708 0.0725 0.0755 ...
#>   ..$ dispersion: num [1:7483] 0.5183 0.0814 0.1902 0.0963 0.0483 ...
#>  $ metadata           :'data.frame': 120000 obs. of  4 variables:
#>   ..$ sam    : int [1:120000] 1256 101 1488 1016 1244 1244 107 1256 101 1256 ...
#>   ..$ trt    : Factor w/ 2 levels "ctrl","stim": 1 2 2 1 1 2 2 1 2 2 ...
#>   ..$ cls    : Factor w/ 8 levels "B cells","CD14+ Monocytes",..: 3 3 3 4 3 3 3 4 3 3 ...
#>   ..$ libsize: num [1:120000] 909 1459 1030 1680 856 ...
#>  $ counts             : num [1:100, 1:120000] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#>   .. ..$ : chr [1:120000] "GCAGGGCTATCGTG-1" "AGTCGAACAGAGTA-1" "CCTAAACTAACAGA-1" "AAATGGGAATTCCT-1" ...
#>  $ DEgenes            :'data.frame': 10 obs. of  3 variables:
#>   ..$ gene   : chr [1:10] "RCE1" "CD68" "FUNDC2_ENSG00000165775" "PAIP2" ...
#>   ..$ beta   : num [1:10] 0.714 0.689 -0.77 -0.716 -0.932 ...
#>   ..$ cluster: chr [1:10] "B cells" "B cells" "CD14+ Monocytes" "CD4 T cells" ...
#>  $ treatment          : chr "stim"

##Remove the reference dataset that is no longer needed in the following analysis.
rm(counts, coldata)

DE analysis of the simulated scRNA-seq data

We perform differential expression analysis of the simulated dataset using FLASHMM package. We are to identify the significantly differentially expressed genes between two treatments in a cell-type. The analyses involve following steps: LMM design, LMM fitting, and exploring LMM fit and the DE genes.

LMM design: construct design matrices for fixed and random effects described as @ref(eq:lmm), and compute the gene expression profile. The gene expression is taken as log-transformed count matrix, Y = log (1 + counts), in which each row corresponds to the expression profile for a gene. The design matrix for the fixed effects can be created by the function model.matrix, as follows X = model.matrix( ∼ 0 + log(library.size) + cell.type + cell.type : treatment), where the interaction term cell.type : treatment represents the treatment effect in a specific cell-type. library.size is defined as the total sum of counts across all genes for each cell, a normalization factor of the scRNA-seq counts. We consider the subjects (samples) as random effects that reflect the intra-subject correlation and inter-subject variability. The design matrix for the random effects is given by Z = model.matrix( ∼ 0 + subject).

LMM fitting: We use either lmm or lmmfit function to fit LMMs for all genes. With the cell-level data matrices Y, X and Z, the LMMs can be fit by lmmfit(Y, X, Z, d = d), where d is the number of random-effects. For K components of random-effects, d = (q1, …, qK), where qk is the number of columns of the design matrix Zk for the k-th random-effect component. For a large-scale data, the lmmfit function has a problem of computer memory limit. In this case, it is recommended to pre-compute the summary-level data: XX, XY, ZX, ZY, ZZ, and Ynorm, defined in @ref(eq:sdata), and then use the lmm function to fit the LMMs as follows: lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d). The summary-level data doesn’t depend on the sample size n. This makes lmm memory-efficient. Both lmm and lmmfit functions also perform hypothesis testing on the fixed effects. We can use lmmtest function to further perform statistical test on the contrasts of fixed effects.

LMM fitting returns a list of estimates of LMM parameters (coefficients and variance components), standard errors of the estimates, covariance matrix of the coefficients (fixed effects), and t-values and p-values for hypothesis testing on the coefficients, for each gene. The LMM fitting also returns the numbers of iterations and the first partial derivatives of log likelihood at the termination of iterations for each gene.

Exploring LMM fit and the DE genes: We check if the LMM fitting is convergent, and perform hypothesis testing on the variance components of random effects. Then we identify the DE genes based on hypothesis testing p-values for the coefficients (fixed effects). If the absolute first partial derivatives of log likelihood are all less than the convergence tolerance, the LMM fitting converges, otherwise it doesn’t converge. The genes for which LMM fitting doesn’t converge should be excluded in the subsequent analysis because the estimated coefficients for these genes are not reliable. The DE genes can be identified by adjusting p-values obtained by false discovery rate (FDR) or family-wise error rate (Bonferroni correction). We might exclude the genes with small coefficients (effect size or log-fold-change).

LMM design

##Gene expression matrix, Y = log2(1 + counts)
Y <- log2(1 + dat$counts)
dat$counts <- NULL #Remove the counts.

##Design matrix for fixed effects
X <- model.matrix(~ 0 + log(libsize) + cls + cls:trt, data = dat$meta)
colnames(X) <- gsub("\\+", "p", colnames(X))
colnames(X) <- gsub(" ", "_", colnames(X))

##Design matrix for random effects
Z <- model.matrix(~ 0 + as.factor(sam), data = dat$meta)
##Dimension of random effects
d <- ncol(Z)

LMM fitting

  
##(1) Fit LMM by cell-level data.
max.iter <- 100
epsilon <- 1e-5
fit <- lmmfit(Y, X, Z, d = d, max.iter = max.iter, epsilon = epsilon)
str(fit)
#> List of 12
#>  $ method  : chr "REML-FS"
#>  $ dlogL   : num [1:2, 1:100] 6.60e-06 -8.85e-09 7.68e-06 -1.01e-08 3.24e-08 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:100] "dl" "dl" "dl" "dl" ...
#>  $ niter   : num [1:100] 7 9 6 6 13 11 9 8 7 8 ...
#>  $ coef    : num [1:17, 1:100] 0.021 -0.134 -0.13 -0.132 -0.132 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:17] "log(libsize)" "clsB_cells" "clsCD14p_Monocytes" "clsCD4_T_cells" ...
#>   .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#>  $ se      : num [1:17, 1:100] 0.00109 0.00817 0.00862 0.00799 0.00811 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:17] "log(libsize)" "clsB_cells" "clsCD14p_Monocytes" "clsCD4_T_cells" ...
#>   .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#>  $ t       : num [1:17, 1:100] 19.3 -16.4 -15.1 -16.6 -16.3 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:17] "log(libsize)" "clsB_cells" "clsCD14p_Monocytes" "clsCD4_T_cells" ...
#>   .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#>  $ p       : num [1:17, 1:100] 5.62e-83 2.39e-60 2.89e-51 1.47e-61 8.16e-60 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:17] "log(libsize)" "clsB_cells" "clsCD14p_Monocytes" "clsCD4_T_cells" ...
#>   .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#>  $ cov     : num [1:17, 1:17, 1:100] 1.18e-06 -8.37e-06 -9.02e-06 -8.35e-06 -8.27e-06 ...
#>   ..- attr(*, "dimnames")=List of 3
#>   .. ..$ : chr [1:17] "log(libsize)" "clsB_cells" "clsCD14p_Monocytes" "clsCD4_T_cells" ...
#>   .. ..$ : chr [1:17] "log(libsize)" "clsB_cells" "clsCD14p_Monocytes" "clsCD4_T_cells" ...
#>   .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#>  $ df      : int 119983
#>  $ theta   : num [1:2, 1:100] 3.08e-05 2.16e-02 1.05e-04 8.07e-02 3.31e-04 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:2] "var1" "var0"
#>   .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#>  $ se.theta: num [1:2, 1:100] 1.75e-05 8.84e-05 6.02e-05 3.29e-04 1.80e-04 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:2] "var1" "var0"
#>   .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#>  $ RE      : NULL
##

##(2) Fit LMM by summary-level data.
##- Compute the summary-level data.
n <- nrow(X)
XX <- t(X)%*%X
XY <- t(Y%*%X)
ZX <- t(Z)%*%X
ZY <- t(Y%*%Z)
ZZ <- t(Z)%*%Z
Ynorm <- rowSums(Y*Y)

rm(X, Y, Z) #release the memory.

##- Fit LMM.
fitss <- lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d, 
             max.iter = max.iter, epsilon = epsilon)
identical(fit, fitss)
#> [1] TRUE
rm(fitss)

##Test the treatment effect within all cell-types.
##- Construct a contrast by summing the treatment effects across cell-types.
contrast <- cbind("trt" = numeric(nrow(fit$coef)))
contrast[grep(":", rownames(fit$coef)), ] <- 1

##- Test the contrast.
test <- lmmtest(fit, contrast = contrast)
head(test)
#>               trt_t     trt_p
#> EPC2     -0.9250580 0.3549376
#> TNFRSF18 -0.4655622 0.6415298
#> TMEM87A   0.1783232 0.8584694
#> SIRT6     0.9348075 0.3498894
#> IL12RB1  -0.4848279 0.6277993
#> RBFA     -1.3544274 0.1756026

Exploring LMM fit and the DE genes


##(1) Check which LMM fittings converge.
cvg <- (apply(abs(fit$dlogL), 2, max) < epsilon) 
sum(cvg) 
#> [1] 100
##

##(2) Hypothesis testing for variance components:
##    H0, theta </= 0 vs H1, theta > 0.
z <- fit$theta["var1", ]/fit$se.theta["var1", ]
p <- pnorm(z, lower.tail = FALSE)
sum(p <= 0.05)
#> [1] 90
##

##(3) The DE genes specific to a cell-type
##Coefficients and p-values for the genes specific to a cell-type.
index <- grep(":", rownames(fit$coef))
beta <- t(fit$coef[index, cvg])
p <- t(fit$p[index, cvg])

##Adjust p-values by FDR.
padj <- matrix(p.adjust(c(p), method = "fdr"), nrow = nrow(p), ncol = ncol(p))

##The DE genes specific to a cell cluster with FDR < 0.05.
degenes <- NULL
for (j in 1:ncol(p)){
  i <- (padj[, j] < 0.05)
  if (sum(i) > 0) degenes <- rbind(degenes, data.frame(gene = rownames(p)[i], cluster = j, coef = beta[i, j], p = p[i, j], FDR = padj[i, j]))
}
rownames(degenes) <- NULL
degenes
#>                      gene cluster         coef             p           FDR
#> 1                    RCE1       1  0.023144076  1.285161e-10  2.056257e-08
#> 2                    CD68       1  0.169924579  3.359959e-40  8.959890e-38
#> 3                   N4BP1       2  0.013871752  1.377216e-04  1.001611e-02
#> 4                   BEST1       2  0.011185860  1.056919e-03  4.973737e-02
#> 5                   MYADM       2  0.011748386  2.521124e-07  2.881284e-05
#> 6                C1orf174       2  0.010434876  5.622747e-04  2.998798e-02
#> 7  FUNDC2_ENSG00000165775       2 -0.074345865  1.513410e-92  6.053640e-90
#> 8                   PAIP2       3 -0.162167570 2.257004e-218 1.805603e-215
#> 9                  CAMK2D       3 -0.004915633  6.519633e-07  6.519633e-05
#> 10                  NOC4L       5  0.026286349  4.995860e-04  2.854777e-02
#> 11                 MALSU1       5 -0.033691602  8.530013e-05  6.824010e-03
#> 12                 FAM53B       5 -0.017410267  4.704017e-04  2.854777e-02
#> 13                    HGS       6 -0.013463985  6.440482e-04  3.220241e-02
#> 14                   UTRN       6  0.018717417  1.546949e-04  1.031299e-02
#> 15                 INTS10       6 -0.065494690  3.951917e-27  7.903833e-25
#> 16                 ZNF770       7  0.058368842  1.127945e-05  1.002618e-03
#> 17                  RBBP7       8  0.044868013  1.457681e-09  1.943575e-07

##The simulated DE genes
dat$DEgenes
#>                       gene       beta           cluster
#> 91                    RCE1  0.7138792           B cells
#> 92                    CD68  0.6893330           B cells
#> 93  FUNDC2_ENSG00000165775 -0.7704019   CD14+ Monocytes
#> 94                   PAIP2 -0.7157209       CD4 T cells
#> 95                  CAMK2D -0.9319393       CD4 T cells
#> 96                   DMXL1 -0.7755333   Dendritic cells
#> 97                 SLC35F6 -0.7766885   Dendritic cells
#> 98                  INTS10 -0.8749181 FCGR3A+ Monocytes
#> 99                  ZNF770  0.8655967    Megakaryocytes
#> 100                  RBBP7  0.5337121          NK cells

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
#>  [3] Biobase_2.67.0              GenomicRanges_1.59.1       
#>  [5] GenomeInfoDb_1.43.2         IRanges_2.41.2             
#>  [7] S4Vectors_0.45.2            MatrixGenerics_1.19.1      
#>  [9] matrixStats_1.5.0           ExperimentHub_2.15.0       
#> [11] AnnotationHub_3.15.0        BiocFileCache_2.15.0       
#> [13] dbplyr_2.5.0                BiocGenerics_0.53.3        
#> [15] generics_0.1.3              FLASHMM_1.0.0              
#> [17] bookdown_0.42              
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1        dplyr_1.1.4             blob_1.2.4             
#>  [4] filelock_1.0.3          Biostrings_2.75.3       fastmap_1.2.0          
#>  [7] promises_1.3.2          digest_0.6.37           mime_0.12              
#> [10] lifecycle_1.0.4         ellipsis_0.3.2          processx_3.8.5         
#> [13] KEGGREST_1.47.0         RSQLite_2.3.9           magrittr_2.0.3         
#> [16] compiler_4.4.2          rlang_1.1.4             sass_0.4.9             
#> [19] tools_4.4.2             yaml_2.3.10             knitr_1.49             
#> [22] S4Arrays_1.7.1          htmlwidgets_1.6.4       bit_4.5.0.1            
#> [25] pkgbuild_1.4.5          curl_6.1.0              DelayedArray_0.33.3    
#> [28] abind_1.4-8             pkgload_1.4.0           miniUI_0.1.1.1         
#> [31] withr_3.0.2             purrr_1.0.2             sys_3.4.3              
#> [34] desc_1.4.3              grid_4.4.2              urlchecker_1.0.1       
#> [37] profvis_0.4.0           xtable_1.8-4            MASS_7.3-64            
#> [40] cli_3.6.3               rmarkdown_2.29          crayon_1.5.3           
#> [43] remotes_2.5.0           httr_1.4.7              sessioninfo_1.2.2      
#> [46] DBI_1.2.3               cachem_1.1.0            AnnotationDbi_1.69.0   
#> [49] BiocManager_1.30.25     XVector_0.47.2          vctrs_0.6.5            
#> [52] devtools_2.4.5          Matrix_1.7-1            jsonlite_1.8.9         
#> [55] callr_3.7.6             bit64_4.5.2             maketools_1.3.1        
#> [58] jquerylib_0.1.4         glue_1.8.0              ps_1.8.1               
#> [61] BiocVersion_3.21.1      later_1.4.1             UCSC.utils_1.3.0       
#> [64] tibble_3.2.1            pillar_1.10.1           rappdirs_0.3.3         
#> [67] htmltools_0.5.8.1       GenomeInfoDbData_1.2.13 R6_2.5.1               
#> [70] evaluate_1.0.3          shiny_1.10.0            lattice_0.22-6         
#> [73] png_0.1-8               memoise_2.0.1           httpuv_1.6.15          
#> [76] bslib_0.8.0             Rcpp_1.0.14             SparseArray_1.7.2      
#> [79] xfun_0.50               fs_1.6.5                buildtools_1.0.0       
#> [82] usethis_3.1.0           pkgconfig_2.0.3

Remarks

Building design matrices for fixed and random effects is a key step for LMM-based DE analysis. Including library size, a normalization factor for scRNA-seq, as a fixed effect can help reduce p-value inflation. If needed, we can add principal components (PCs) as fixed effects to further remove the unknown batch effect.

References

Hartley, H. O., and J. N. K. Rao. 1967. “Maximum-Likelihood Estimation for the Mixed Analysis of Variance Model.” Biometrika 54 (1/2): 93–108. http://www.jstor.org/stable/2333854.
Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell RNA-Sequencing Using Natural Genetic Variation.” Nature Biotechnology 36 (89–94). https://doi.org/10.1038/nbt.4042.
Patterson, H. D., and R. Thompson. 1971. “Recovery of Inter-Block Information When Block Sizes Are Unequal.” Biometrika 58 (3): 545–54. https://doi.org/10.2307/2334389.
Searle, Shayle R., George Casella, and Charles E. McCulloch. 2006. Variance Components. New Jersey: John Wiley & Sons, Inc.