CRISPRBetaBinomial (CB2) is a package for designing a statistical hypothesis test for robust target identification, developing an accurate mapping algorithm to quantify sgRNA abundances, and minimizing the parameters necessary for CRISPR pooled screen data analysis. This document shows how to use CB2 for the CRISPR pooled screen data analysis.
First, import CB2
package using library(),
and it will be helpful to import other packages as follows:
There are three different basic functions in CB2. The
first function provides a quantification of counts of sgRNAs from the
NGS samples. It requires a library file (.fasta
or
.fa
) and a list of samples (.fastq
). The
library file must contain an annotation of sgRNAs in the library used in
the screen. A sgRNA annotation consists of a barcode sequence (20nt
sequence where sgRNA would target) and a name of a gene which the sgRNA
suppose to target.
Here is an example of the loading data for the screen analysis. Files
in the example are contained in the CB2
package.
# load the file path of the annotation file.
FASTA <- system.file("extdata", "toydata",
"small_sample.fasta",
package = "CB2")
system("tail -6 {FASTA}" %>% glue)
The first two lines of the annotation file indicate the annotation of
the first sgRNA, and the next two lines are the annotation of the second
sgRNA, and so on. The first line of an annotation is formatted as
><genename>_<id>
, where
<genename>
is an id of a symbol of the target gene
for the sgRNA and <id>
is the unique identifier for
the sgRNA. ><genename>_<id>
is the completed
identifier of the sgRNA and a completed identifier should not appear
more than once. The second line of an annotation is the 20nt sequence,
and it indicates which part of the target gene will be targeted by the
sgRNA.
The first annotation indicates the library contains a sgRNA and
RAB_3
is the identifier of the sgRNA. This sgRNA is
supposed to target RAB
gene and the intended target locus
is CTGTAGAAGCTACATCGGCT
We also have an example of the NGS sample file. The following snippet will display the contents of an NGS sample file.
FASTQ <- system.file("extdata", "toydata",
"Base1.fastq",
package = "CB2")
system("head -8 {FASTQ}" %>% glue)
The NGS sample file contains multiple reads, and each read consists of four sequential lines. The first line is the id of the reads, and the second line includes a sequence of the read, and we assume that a read contains a nucleotide sequence of the sgRNA as a substring of the read. The third line only includes ‘+’ and the last line includes the quality of each nucleotide of the read (Phread quality score).
Let’s get start the analysis. Before running the analysis we will see
the list of FASTQ
files we can use from the
toydata
example.
ex_path <- system.file("extdata", "toydata", package = "CB2")
Sys.glob("{ex_path}/*.fastq" %>% glue) %>% basename()
## [1] "Base1.fastq" "Base2.fastq" "High1.fastq" "High2.fastq" "Low1.fastq"
## [6] "Low2.fastq"
From the example directly above, we can recognize there are three
groups (Base, High, and Low) in the example data, and each of them has
two replicates. We will perform an analysis between Base
and High.
The first thing we need to do is creating a
design matrix. The below code shows how to build it.
df_design <- tribble(
~group, ~sample_name,
"Base", "Base1",
"Base", "Base2",
"High", "High1",
"High", "High2"
) %>% mutate(
fastq_path = glue("{ex_path}/{sample_name}.fastq")
)
df_design
## # A tibble: 4 × 3
## group sample_name fastq_path
## <chr> <chr> <glue>
## 1 Base Base1 /tmp/RtmpPz75fp/Rinst13cd1e072a66/CB2/extdata/toydata/Base1…
## 2 Base Base2 /tmp/RtmpPz75fp/Rinst13cd1e072a66/CB2/extdata/toydata/Base2…
## 3 High High1 /tmp/RtmpPz75fp/Rinst13cd1e072a66/CB2/extdata/toydata/High1…
## 4 High High2 /tmp/RtmpPz75fp/Rinst13cd1e072a66/CB2/extdata/toydata/High2…
df_design
contains three columns and each row contains
information of a sample. The first column is group
where
the sample belongs to, and the sample_name
is the name of
the sample for a convenience. fastq_path
is the place where
you will have the NGS sample file.
After creating the df_design,
and we can run a sgRNA
quantification by calling run_sgrna_quant.
## Base1 Base2 High1 High2
## RAB_5 15 21 30 42
## RAB_3 7 21 21 63
## RAB_1 47 14 94 28
## TRIM28_5 42 19 1 0
## TRIM28_3 24 44 2 4
## TRIM28_1 23 47 0 0
After running run_sgrna_quant,
we will have a data frame
(cb2_count$count
) and a numeric vector
(cb2_count$total
). The data frame contains sgRNA counts for
each sample, and the numeric vector contains the number of reads for
each sample. In the data frame, each row corresponds to a sgRNA and each
column belongs to a sample. Each value in the data frame indicates read
counts of the corresponded sgRNA and sample, and it implies how many
reads have been aligned to the sgRNA from the sample file. We assume the
number will be used to approximate the number of knock-out cells of the
target gene of the sgRNA.
## Base1 Base2 High1 High2
## 688 608 703 659
We are also able to lookup CPM (Count Per Million mapped read counts)
using get_CPM.
## Base1 Base2 High1 High2
## RAB_5 21802.326 34539.474 42674.253 63732.929
## RAB_3 10174.419 34539.474 29871.977 95599.393
## RAB_1 68313.953 23026.316 133712.660 42488.619
## TRIM28_5 61046.512 31250.000 1422.475 0.000
## TRIM28_3 34883.721 72368.421 2844.950 6069.803
## TRIM28_1 33430.233 77302.632 0.000 0.000
## PARK2_5 58139.535 74013.158 174964.438 209408.194
## PARK2_4 34883.721 52631.579 149359.886 212443.096
## POSCTRL_1 66860.465 77302.632 12802.276 13657.056
## PARK2_2 56686.047 42763.158 120910.384 84977.238
## RAB_2 50872.093 14802.632 49786.629 13657.056
## TRIM28_4 20348.837 67434.211 1422.475 3034.901
## POSCTRL_2 30523.256 67434.211 2844.950 6069.803
## POSCTRL_3 69767.442 14802.632 0.000 0.000
## NEGCTRL_5 40697.674 23026.316 0.000 0.000
## POSCTRL_4 52325.581 27960.526 2844.950 1517.451
## RAB_4 34883.721 8223.684 17069.701 4552.352
## TRIM28_2 20348.837 41118.421 1422.475 1517.451
## NEGCTRL_1 63953.488 19736.842 0.000 0.000
## NEGCTRL_2 34883.721 9868.421 1422.475 0.000
## NEGCTRL_3 5813.953 44407.895 0.000 0.000
## POSCTRL_5 58139.535 46052.632 0.000 0.000
## NEGCTRL_4 8720.930 18092.105 0.000 1517.451
## PARK2_1 36337.209 6578.947 172119.488 28831.563
## PARK2_3 26162.791 70723.684 82503.556 210925.645
There are four functions we can use to check the quality of the input
data. The first function (plot_count_distribution
) will
give the mappability (the success rate of sgRNA identification from
reads) for each sample.
## Warning: `mutate_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `mutate()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the CB2 package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the CB2 package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We can also check the mappability (The proportion of the number of
reads successfully aligned to a sgRNA in the library among the entire
reads) using calc_mappability
function.
## Warning: `select_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `select()` instead.
## ℹ The deprecated feature was likely used in the CB2 package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 4 × 5
## group sample_name total_reads mapped_reads mappability
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Base Base1 688 688 100
## 2 Base Base2 608 608 100
## 3 High High1 703 703 100
## 4 High High2 659 659 100
plot_PCA
can be a way of checking data quality.
The last function (plot_corr_heatmap
) display a
sgRNA-level correlation heatmap of NGS samples. We assume that samples
in the same group clustered together if the data quality is good.
After we find the data quality is good to move to the next step, then
we can perform an analysis for a sgRNA-level using
measure_sgrna_stats
## # A tibble: 25 × 19
## sgRNA gene n_a n_b phat_a[,1] vhat_a[,1] phat_b[,1] vhat_b[,1] cpm_a
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RAB_5 RAB 2 2 0.0531 9.72e-5 0.0278 0.0000208 5.32e4
## 2 RAB_3 RAB 2 2 0.0627 1.07e-3 0.0223 0.000145 6.27e4
## 3 RAB_1 RAB 2 2 0.0881 2.09e-3 0.0458 0.000521 8.81e4
## 4 TRIM28_5 TRIM… 2 2 0.000734 5.39e-7 0.0463 0.000234 7.11e2
## 5 TRIM28_3 TRIM… 2 2 0.00441 3.22e-6 0.0535 0.000337 4.46e3
## 6 TRIM28_1 TRIM… 2 2 0 0 0.0553 0.000468 0
## 7 PARK2_5 PARK2 2 2 0.192 1.14e-4 0.0656 0.0000473 1.92e5
## 8 PARK2_4 PARK2 2 2 0.181 9.53e-4 0.0432 0.0000319 1.81e5
## 9 POSCTRL… POSC… 2 2 0.0132 9.58e-6 0.0718 0.0000514 1.32e4
## 10 PARK2_2 PARK2 2 2 0.103 3.46e-4 0.0499 0.0000707 1.03e5
## # ℹ 15 more rows
## # ℹ 10 more variables: cpm_b <dbl>, logFC <dbl>, t_value <dbl[,1]>,
## # df <dbl[,1]>, p_ts <dbl[,1]>, p_pa <dbl[,1]>, p_pb <dbl[,1]>, fdr_ts <dbl>,
## # fdr_pa <dbl>, fdr_pb <dbl>
As you can see above, we need four different parameters for the function. The first is a matrix of the read count, and the second parameter is the design data frame. The last two are the groups we are interested in performing differential abundance test for each sgRNA.
Here is the information of each column in the data.frame of the sgRNA-level statistics:
sgRNA
: The sgRNA identifier.gene
: The gene is the target of the sgRNAn_a
: The number of replicates of the first group.n_b
: The number of replicates of the second group.phat_a
: The proportion value of the sgRNA for the first
group.phat_b
: The proportion value of the sgRNA for the
second group.vhat_a
: The variance of the sgRNA for the first
group.vhat_b
: The variance of the sgRNA for the second
group.cpm_a
: The mean CPM of the sgRNA within the first
group.cpm_b
: The mean CPM of the sgRNA within the second
group.logFC
: The log fold change of sgRNA between two groups,
and calculated by $log_{2}\frac{CPM_{b}+1}{CPM_{a}+1}$t_value
: The value for the t-statistics.df
: The value of the degree of freedom, and will be
used to calculate the p-value of the sgRNA.p_ts
: The p-value indicates a difference between the
two groups.p_pa
: The p-value indicates enrichment of the first
group.p_pb
: The p-value indicates enrichment of the second
group.fdr_ts
: The adjusted P-value of p_ts
.fdr_pa
: The adjusted P-value of p_pa
.fdr_pb
: The adjusted P-value of p_pb
.Once we finish the sgRNA-level test, we can perform a gene-level test
using measure_gene_stats.
## # A tibble: 5 × 11
## gene n_sgrna cpm_a cpm_b logFC p_ts p_pa p_pb fdr_ts fdr_pa fdr_pb
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NEGC… 5 2.94e2 26920. 10.8 0.227 0.999 0.0304 0.283 1.00 0.0507
## 2 PARK2 5 1.45e5 45892. -1.69 0.0352 0.00329 1.00 0.102 0.0164 1.00
## 3 POSC… 5 3.97e3 51117. 8.23 0.0410 1.00 0.00391 0.102 1.00 0.0195
## 4 RAB 5 4.93e4 30118. -0.462 0.678 0.256 0.874 0.678 0.639 1.00
## 5 TRIM… 5 1.77e3 45953. 6.81 0.155 0.999 0.0187 0.258 1.00 0.0469
Here is the information of each column in the data.frame of the gene-level statistics:
gene
: The gene name to be tested.n_sgrna
: The number of sgRNA targets the gene in the
library.cpm_a
: The mean of CPM of sgRNAs within the first
group.cpm_b
: The mean of CPM of sgRNAs within the second
group.logFC
: The log fold change of the gene between two
groups, and calculated by $log_{2}\frac{CPM_{b}+1}{CPM_{a}+1}$p_ts
: The p-value indicates a difference between the
two groups at the gene-level.p_pa
: The p-value indicates enrichment of the first
group at the gene-level.p_pb
: The p-value indicates enrichment of the second
group at the gene-level.fdr_ts
: The adjusted P-value of p_ts
.fdr_pa
: The adjusted P-value of p_pa
.fdr_pb
: The adjusted P-value of p_pb
.After we have a result of the gene-level test, we can filter out a
list of genes using different measures. For example, if you are
considering to find genes has a differential abundance between two
groups, you can use the value fdr_ts
for the hit selection.
If you want to see some genes has enrichment of abundance in the first
group (i.e., depiction in the opposite group), you lookup
fdr_pa
value, and fdr_pb
can be used to see an
enrichment of abundance in the second group. Here, we use
fdr_ts
to identify the hit genes.
## # A tibble: 0 × 11
## # ℹ 11 variables: gene <chr>, n_sgrna <int>, cpm_a <dbl>, cpm_b <dbl>,
## # logFC <dbl>, p_ts <dbl>, p_pa <dbl>, p_pb <dbl>, fdr_ts <dbl>,
## # fdr_pa <dbl>, fdr_pb <dbl>
CB2 also supports a useful dot plot function to lookup the read counts for a gene, and this function can be used to clarify an interesting hit is valid.
## Warning: `filter_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `filter()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the CB2 package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.