If an analysis starts with an input matrix, it has to be
appropriately pre-proceed before it is used any functions of
CB2
. CB2
allows two different types of input:
a numeric matrix/data frame with row.names
and a data.frame
contains columns of counts and columns of sgRNA IDs and target genes.
Either of them will work. This document explains how the input should be
formed and how to process the input using CB2
. In the
entire document, (Evers et al. 2016)’s
CRISPR-RT112 screen data are used.
The following code imports required packages which are required to run below codes.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The following code block shows an example of the first type of input
which CB2
can handle. Each column of
Evers_CRISPRn_RT112$count
contains counts of guide RNAs of
a sample (that was initially extracted from NGS data). A count of the
input shows that how many guide RNA barcodes were observed from a given
NGS sample. Each row of the matrix has a row name (e.g.,
RPS19_sg10
), and the name is the ID of a guide RNA. For
example, RPS19_sg10
, which is the first-row name in the
example, indicates that the first row contains the counts of
RPS_sg10
guide RNA. Every guide RNA ID must have
exactly one _
character, and it is used to be a
separator of two strings. The first string displays the name of a gene
whose gene is targeted by the guide RNA, and the second string is used
as an identifier among guide RNAs that targets the same gene. For
example, RPS_sg10
indicates that the guide RNA is designed
to target the RPS
gene, and sg10
is the unique
identifier.
NOTE : If the input contains multiple _
characters, CB2
is not able to run. In particular, if
Entrez gene IDs are used as the gene names, CB2
does not
handle the input. One of the solutions for this case is changing the
gene names to another identifier (e.g., HGNC symbol) or using another
type of input, which will explain below.
## B1 B2 B3 A1 A2 A3
## RPS19_sg10 10180 9768 9406 1005 1186 911
## NUP93_sg4 9073 8598 8363 688 479 581
## PSMD11_sg2 9408 9573 9384 1014 1196 797
## RPS3A_sg3 8922 8965 8779 1266 1303 1176
## PSMB3_sg3 7434 6958 6871 450 628 398
## POLR2A_sg5 7672 7537 7259 647 786 683
In addition, CB2
requires experiment design information
which is formed as a data.frame and contains sample names and groups of
each sample. In Evers_CRISPRn_RT112
data,
Evers_CRISPRn_RT112$design
is the data.frame.
## # A tibble: 6 × 2
## group sample_name
## <chr> <chr>
## 1 before B1
## 2 before B2
## 3 before B3
## 4 after A1
## 5 after A2
## 6 after A3
With the two variables, CB2
can perform the hypothesis
test with measure_sgrna_stats
and
measure_gene_stats
functions.
sgrna_stats <- measure_sgrna_stats(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design, "before", "after")
gene_stats <- measure_gene_stats(sgrna_stats)
head(gene_stats)
## # A tibble: 6 × 11
## gene n_sgrna cpm_a cpm_b logFC p_ts p_pa p_pb fdr_ts fdr_pa
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABCG8 10 1101. 1540. 0.481 1.03e-18 1 e+ 0 2.41e-21 1.56e-18 1 e+ 0
## 2 ADH7 10 799. 1173. 0.522 4.88e-20 1 e+ 0 1.10e-22 8.55e-20 1 e+ 0
## 3 CABP5 10 1171. 1662. 0.493 2.10e-21 1 e+ 0 4.55e-24 4.64e-21 1 e+ 0
## 4 COPA 10 860. 413. -1.51 4.41e-33 5.48e-33 8.12e- 1 1.37e-31 1.70e-31
## 5 COPB1 10 1054. 498. -1.57 3.29e-25 3.04e-24 5.31e- 1 1.26e-24 1.29e-23
## 6 COPS2 10 1055. 842. -0.728 6.56e-19 4.85e-12 1.43e- 4 1.03e-18 1.13e-11
## # ℹ 1 more variable: fdr_pb <dbl>
Another input type is a data.frame that contains two additional
columns, which contain the guide RNA information (target gene and guide
RNA identifier). A CSV file which was used in the CB2
publication ((Jeong et al. 2019)). The CSV
file contains the additional columns, the first is the gene
column, and the second is the sgRNA
column.
df <- read_csv("https://raw.githubusercontent.com/hyunhwaj/CB2-Experiments/master/01_gene-level-analysis/data/Evers/CRISPRn-RT112.csv")
## Rows: 961 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gene, sgRNA
## dbl (6): B1, B2, B3, A1, A2, A3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 961 × 8
## gene sgRNA B1 B2 B3 A1 A2 A3
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RPS19 RPS19-10 10180 9768 9406 1005 1186 911
## 2 NUP93 NUP93-4 9073 8598 8363 688 479 581
## 3 PSMD11 PSMD11-2 9408 9573 9384 1014 1196 797
## 4 RPS3A RPS3A-3 8922 8965 8779 1266 1303 1176
## 5 PSMB3 PSMB3-3 7434 6958 6871 450 628 398
## 6 POLR2A POLR2A-5 7672 7537 7259 647 786 683
## 7 RPS8 RPS8-8 6999 6966 6775 673 665 474
## 8 RPL36 RPL36-3 5988 5893 5498 266 276 201
## 9 RPS13 RPS13-4 5763 5546 5131 187 191 196
## 10 RPS11 RPS11-2 9444 9621 8835 1900 2213 1705
## # ℹ 951 more rows
Two additional parameters have to be set to the
measure_sgrna_stats
function if an input matrix is this
type. The first parameter is ge_id
, which specifies the
column of genes, and the second parameter is sg_id
, which
specifies the column of IDs. In the following code, gene_id
sets as gene
and sg_id
sets as
sgRNA
.
head(measure_sgrna_stats(df, Evers_CRISPRn_RT112$design, "before", "after", ge_id = 'gene', sg_id = 'sgRNA'))
## gene sgRNA n_a n_b phat_a vhat_a phat_b vhat_b
## 1 RPS19 RPS19-10 3 3 0.002323257 5.369646e-10 0.0003257504 4.174644e-10
## 2 NUP93 NUP93-4 3 3 0.002060560 7.828460e-10 0.0001838979 3.300861e-10
## 3 PSMD11 PSMD11-2 3 3 0.002244954 1.772784e-10 0.0003147378 8.295566e-10
## 4 RPS3A RPS3A-3 3 3 0.002110486 1.666823e-10 0.0003935102 4.133227e-11
## 5 PSMB3 PSMB3-3 3 3 0.001683100 8.591075e-10 0.0001547065 3.831946e-10
## 6 POLR2A POLR2A-5 3 3 0.001778234 1.404884e-10 0.0002229180 2.109228e-10
## cpm_a cpm_b logFC t_value df p_ts p_pa
## 1 2323.275 325.7290 -2.830614 -64.65714 3.938262 4.148366e-07 2.074183e-07
## 2 2060.586 183.9145 -3.478824 -56.25377 3.432003 3.271576e-06 1.635788e-06
## 3 2246.609 314.6832 -2.831842 -60.83125 2.817477 1.761248e-05 8.806242e-06
## 4 2111.801 393.9205 -2.419522 -119.04667 2.934424 1.685127e-06 8.425636e-07
## 5 1683.145 154.6872 -3.435294 -43.36323 3.488096 6.833235e-06 3.416617e-06
## 6 1778.610 222.9897 -2.990057 -82.96805 3.845513 2.121616e-07 1.060808e-07
## p_pb fdr_ts fdr_pa fdr_pb
## 1 0.9999998 4.714939e-05 2.357470e-05 0.9999999
## 2 0.9999984 6.164676e-05 3.273760e-05 0.9999999
## 3 0.9999912 1.071240e-04 7.693454e-05 0.9999999
## 4 0.9999992 5.772537e-05 2.886268e-05 0.9999999
## 5 0.9999966 8.042437e-05 4.512440e-05 0.9999999
## 6 0.9999999 4.714939e-05 2.357470e-05 0.9999999