--- title: "CRISPR Screen and Gene Expression Differential Analysis" author: "Lianbo Yu, Yue Zhao, Kevin R. Coombes, and Lang Li" date: "`r Sys.Date()`" output: pdf_document: latex_engine: xelatex number_sections: yes toc: yes vignette: > %\VignetteIndexEntry{CRISPR Screen and Gene Expression Differential Analysis} %\VignetteEncoding{UTF-8} %\VignetteDepends{CEDA} %\VignettePackage{CEDA} %\VignetteEngine{knitr::rmarkdown} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( warning = FALSE, message = FALSE, collapse = TRUE, comment = "#>" ) ``` # Introduction We developed CEDA to analyze read counts of single guide RNAs (sgRNAs) from CRISPR screening experiments. Except for non-targeting sgRNAs (can be used as negative controls), each sgRNA is targeting one gene, and each gene have multiple sgRNAs targeting it. CEDA models the sgRNA counts at different levels of gene expression by multi-component normal mixtures, with the model fit by an EM algorithm. Posterior estimates at sgRNA level are then summarized for each gene. In this document, we use data from an experiment with the MDA231 cell line to illustrate how to use CEDA to perform CRISPR screen data analysis. # Overview CEDA analysis follows a workflow that is typical for most omics level experiments. 1. Put the data into an appropriate format for input to CEDA. 2. Filter out sgRNA with low read counts (optional). 3. Normalize the raw counts. 4. Fit a linear model to the data. 5. Summarize and view the results. ## Data Format In our experiment, three samples of MDA231 cells were untreated at time T=0, and another three samples of MDA231 cells were treated with DMSO at time T=0. We are interested in detecting sgRNAs that are differentially changed by a treatment. The sgRNA read counts, along with a list of non-essential genes, are stored in the dataset `mda231` that we have included in the `CEDA` package. We read that dataset and explore its structure. ```{r data} library(CEDA) library(dplyr) library(ggplot2) library(ggsci) library(ggprism) library(ggridges) set.seed(1) data("mda231") class(mda231) length(mda231) names(mda231) ``` As you can see, this is a list containing two components 1. `sgRNA`, the observed count data of six samples, and 2. `neGene`, the set of non-essential genes. ```{r sgRNA} dim(mda231$sgRNA) length(mda231$neGene$Gene) head(mda231$sgRNA) ``` Notice that the `sgRNA` component includes an extra column, "`exp.level.log2`", that are the expression level (in log2 scale) of genes and was computed from gene expression data in the unit of FPKM. The second component of the list `mda231` is a data frame `neGene`, which is the gene names of the non-essential genes(Ref 1): ```{r neGene} dim(mda231$neGene) head(mda231$neGene) ``` ## Filter out sgRNAs with low read counts Due to the nature of drop-out screen, some sgRNAs targeting essential genes have low read counts at the end time point samples. Therefore, we must be cautious when applying strict filtering criteria to remove a large portion of sgRNAs before analysis. Gentle filtering criteria (i.e., removal of sgRNAs with 0 count in ≥ 75% samples) is recommended. ```{r filter} count_df <- mda231$sgRNA mx.count = apply(count_df[,c(3:8)],1,function(x) sum(x>=1)) table(mx.count) # keep sgRNA with none zero count in at least one sample count_df2 = count_df[mx.count>=1,] ``` ## Normalization The sgRNA read counts needs to be normalized across sample replicates before formal analysis. The non-essential genes are assumed to have no change after DMSO treatment. So, our recommended procedure is to perform median normalization based on the set of non-essential genes. ```{r normalization} mda231.ne <- count_df2[count_df2$Gene %in% mda231$neGene$Gene,] cols <- c(3:8) mda231.norm <- medianNormalization(count_df2[,cols], mda231.ne[,cols])[[2]] ``` ## Analysis Our primary goal is to detect essential sgRNAs that have different count levels between conditions. We rely on the R package `limma` to calculate log2 ratios (i.e., log fold changes or LFCs) between three untreated and three treated samples. ### Calculating fold ratios First, we have to go through the usual `limma` steps to describe the design of the study. There were two groups of replicate samples. We will call these groups "Control" and "Baseline" (although "Treated" and Untreated" would work just as well). Our main interest is determining the differences between the groups. And we have to record this information in a "contrast matrix" so limma knows what we want to compare. ```{r design} group <- gl(2, 3, labels=c("Control","Baseline")) design <- model.matrix(~ 0 + group) colnames(design) <- sapply(colnames(design), function(x) substr(x, 6, nchar(x))) contrast.matrix <- makeContrasts("Control-Baseline", levels=design) ``` Finally, we can run the limma algorithm. ```{r limfit} limma.fit <- runLimma(log2(mda231.norm+1),design,contrast.matrix) ``` We merge the results from our limma analysis with the post filtering sgRNA count data. ```{r merge} mda231.limma <- data.frame(count_df2, limma.fit) head(mda231.limma) ``` ### Fold ratios under the null hypotheses Under the null hypothses, all sgRNAs levels are unchanged between the two conditions. To obtain fold ratios under the null, samples were permuted between two conditions, and log ratios were obtained from limma analysis under each permutation. ```{r betanull} betanull <- permuteLimma(log2(mda231.norm + 1), design, contrast.matrix, 10) theta0 <- sd(betanull) theta0 ``` ### Fitting three-component mixture models A three-component mixture model (unchanged, overexpresssed, and underexpressed) is assumed for log ratios at different level of gene expression. Empirical Bayes method was employed to estimate parematers of the mixtures and posterior means were obtained for estimating actual log ratios between the two conditions. P-values of sgRNAs were then calculated by permutation method. ```{r mm, results='hide'} nmm.fit <- normalMM(mda231.limma, theta0, n.b=3, d=5) ``` Results from the mixture model were shown in Figure $1$. False discovery rate of $0.05$ was used for declaring significant changes in red color between the two conditions for sgRNAs. The vertical lines are dividing sgRNAs into bins accroding to the gene expression levels of their targetted genes. ```{r fig1, fig.cap = "Log fold ratios of sgRNAs vs. gene expression level"} scatterPlot(nmm.fit$data,fdr=0.05,xlim=c(-0.5,12),ylim=c(-8,5)) ``` ### Gene level summarization From the p-values of sgRNAs, gene level p-values were obtained by using modified robust rank aggregation method (alpha-RRA). Log ratios were also summarized at gene level. ```{r pval} mda231.nmm <- nmm.fit[[1]] p.gene <- calculateGenePval(exp(mda231.nmm$log_p), mda231.nmm$Gene, 0.05, nperm=10) gene_fdr <- stats::p.adjust(p.gene$pvalue, method = "fdr") gene_lfc <- calculateGeneLFC(mda231.nmm$lfc, mda231.nmm$Gene) gene_summary <- data.frame(gene_pval=unlist(p.gene$pvalue), gene_fdr=as.matrix(gene_fdr), gene_lfc = as.matrix(gene_lfc)) gene_summary$gene <- rownames(gene_summary) gene_summary <- gene_summary[,c(4,1:3)] ``` ### Gene level summarization From the p-values of sgRNAs, gene level p-values were obtained by using modified robust rank aggregation method (alpha-RRA). Log ratios were also summarized at gene level. ```{r summary} #extract gene expression data gene.express <- mda231.nmm %>% group_by(Gene) %>% summarise_at(vars(exp.level.log2), max) #merge gene summary with gene expression gdata <- left_join(gene_summary, gene.express, by = c("gene" = "Gene")) gdata <- gdata %>% filter(is.na(exp.level.log2)==FALSE) # density plot and ridge plot gdata$gene.fdr <- gdata$gene_fdr data <- preparePlotData(gdata, gdata$gene.fdr) ``` Results from CEDA were shown in Figure $2$. The points in the scatter plot were stratified into five color groups based on FDR. The 2D contour lines showed how the points distributed in 3D space. ```{r fig2, fig.cap = "2D density plot of gene log fold ratios vs. gene expression level for different FDR groups"} densityPlot(data) ``` The density plot of CEDA results shows that the groups of genes selected by CEDA (FDR<0.05, yellow, blue, and red) showed higher expression median values compared to the rest of genes (purple, green).