Title: | CRISPR Pooled Screen Analysis using Beta-Binomial Test |
---|---|
Description: | Provides functions for hit gene identification and quantification of sgRNA (single-guide RNA) abundances for CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) pooled screen data analysis. Details are in Jeong et al. (2019) <doi:10.1101/gr.245571.118> and Baggerly et al. (2003) <doi:10.1093/bioinformatics/btg173>. |
Authors: | Hyun-Hwan Jeong [aut, cre] |
Maintainer: | Hyun-Hwan Jeong <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.3.4 |
Built: | 2024-11-27 06:43:46 UTC |
Source: | CRAN |
A function to calculate the mappability of each NGS sample.
calc_mappability(count_obj, df_design)
count_obj | A list object created by 'run_sgrna_quant'. |
df_design | A table containing the study design. |
library(CB2)
library(magrittr)
library(tibble)
library(dplyr)
library(glue)
FASTA <- system.file("extdata", "toydata", "small_sample.fasta", package = "CB2")
ex_path <- system.file("extdata", "toydata", package = "CB2")
df_design <- tribble(
  ~group, ~sample_name,
  "Base", "Base1",
  "Base", "Base2",
  "High", "High1",
  "High", "High2"
) %>%
  mutate(fastq_path = glue("{ex_path}/{sample_name}.fastq"))
cb2_count <- run_sgrna_quant(FASTA, df_design)
calc_mappability(cb2_count, df_design)
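As a rough illustration of what a mappability figure expresses, the base R sketch below computes the percentage of reads in each sample that were assigned to an sgRNA. It assumes that mappability means mapped reads divided by total reads. The toy 'count' matrix, the 'total' vector, and the column names of the result are made up for this example and are not taken from the package.

```r
# Toy count matrix: rows are sgRNAs, columns are samples (hypothetical values).
count <- matrix(
  c(10, 20, 30,
    40, 50, 60),
  nrow = 3,
  dimnames = list(c("g1_sg1", "g1_sg2", "g2_sg1"), c("Base1", "High1"))
)
# Total reads sequenced per sample, including reads that matched no sgRNA.
total <- c(Base1 = 100, High1 = 200)

# Reads assigned to any sgRNA, per sample.
mapped <- colSums(count)

# Assumption: mappability is the percentage of mapped reads among all reads.
mappability <- data.frame(
  sample_name = names(total),
  total_reads = unname(total),
  mapped_reads = unname(mapped[names(total)]),
  mappability = unname(100 * mapped[names(total)] / total)
)
print(mappability)
```

A low percentage for one sample relative to the others would usually suggest a problem with that library or with the reference FASTA file.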
A benchmark CRISPRn pooled screen dataset from Evers et al.
data(Evers_CRISPRn_RT112)
The data object is a list that contains the following information:
The count matrix from Evers et al.'s paper, containing the CRISPRn screening result for the RT112 cell line. It contains three replicates for T0 (before) and three replicates for T1 (after).
The list of 46 essential genes used in Evers et al.'s study.
The list of 47 non-essential genes used in Evers et al.'s study.
A data.frame containing the study design.
A data.frame containing the sgRNA-level statistics.
A data.frame containing the gene-level statistics.
https://www.ncbi.nlm.nih.gov/pubmed/27111720
A C++ function to perform parameter estimation for the sgRNA-level test. It estimates two parameters, 'phat' and 'vhat', under the assumption that the input count data follow the beta-binomial distribution. Dr. Keith Baggerly initially implemented this code in Matlab, and it has been rewritten in C++ for speed.
fit_ab(xvec, nvec)
xvec | A matrix containing sgRNA read counts. |
nvec | A vector containing the library sizes. |
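To give a feel for what a proportion estimate and its variance look like, the base R sketch below computes an equal-weight moment estimate for each sgRNA. This is deliberately simplified and is not the weighted beta-binomial fit that 'fit_ab' performs. The function name 'moment_sketch', the toy values, and the layout (sgRNAs in rows, replicates in columns) are assumptions made for illustration.

```r
# Simplified, equal-weight moment estimates (NOT the package's algorithm).
# xvec: matrix of read counts, one row per sgRNA, one column per replicate.
# nvec: library size of each replicate.
moment_sketch <- function(xvec, nvec) {
  p <- sweep(xvec, 2, nvec, "/")            # per-replicate proportions
  phat <- rowMeans(p)                       # mean proportion per sgRNA
  vhat <- apply(p, 1, var) / ncol(p)        # variance of that mean
  data.frame(phat = phat, vhat = vhat)
}

xvec <- matrix(
  c(20, 30, 25,
    5, 5, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sg1", "sg2"), c("rep1", "rep2", "rep3"))
)
nvec <- c(1e5, 1e5, 1e5)

est <- moment_sketch(xvec, nvec)
print(est)
```

In this toy input the second sgRNA has identical proportions in every replicate, so its estimated variance is zero, while the first sgRNA shows replicate-to-replicate spread.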
A function to normalize sgRNA read counts by calculating the CPM (Counts Per Million).
get_CPM(sgcount)
sgcount | The input table containing read counts of sgRNAs for each sample (required). |
A normalized CPM table will be returned.
library(CB2)
data(Evers_CRISPRn_RT112)
get_CPM(Evers_CRISPRn_RT112$count)
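The normalization itself can be sketched in base R as follows. The helper 'cpm_sketch' is a hypothetical name, and the sketch assumes the usual definition of CPM (each count divided by its sample's column total, multiplied by one million); it is not a copy of the package's implementation.

```r
# Assumed CPM definition: count / library size * 1e6, computed per sample.
cpm_sketch <- function(sgcount) {
  sgcount <- as.matrix(sgcount)
  lib_size <- colSums(sgcount)              # total reads per sample
  sweep(sgcount, 2, lib_size, "/") * 1e6    # scale each column
}

# Toy counts: rows are sgRNAs, columns are samples (hypothetical values).
count <- matrix(
  c(1, 3, 6,
    20, 30, 50),
  nrow = 3,
  dimnames = list(c("sg1", "sg2", "sg3"), c("S1", "S2"))
)

cpm <- cpm_sketch(count)
print(cpm)
print(colSums(cpm))  # every column sums to one million after normalization
```

Because each column is rescaled to the same total, samples sequenced at different depths become directly comparable.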
A function to join a count table and a design table.
join_count_and_design(sgcount, df_design)
sgcount | The input matrix containing read counts of sgRNAs for each sample. |
df_design | A table containing the study design. |
A tall-thin and combined table of the sgRNA read counts and study design will be returned.
library(CB2)
data(Evers_CRISPRn_RT112)
head(join_count_and_design(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design))
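The "tall-thin" shape of the result can be illustrated with base R alone. In the sketch below, the toy matrix, the design table, and the output column names ('sgRNA', 'sample_name', 'count', 'group') are assumptions chosen for the example; the columns returned by the package may be named differently.

```r
# Toy count matrix: 2 sgRNAs x 4 samples (hypothetical values).
count <- matrix(
  1:8, nrow = 2,
  dimnames = list(c("RPS7_sg1", "RPS7_sg2"), c("B1", "B2", "A1", "A2"))
)
design <- data.frame(
  sample_name = c("B1", "B2", "A1", "A2"),
  group = c("before", "before", "after", "after"),
  stringsAsFactors = FALSE
)

# Wide matrix -> one row per (sgRNA, sample) pair.
long <- as.data.frame(as.table(count), stringsAsFactors = FALSE)
names(long) <- c("sgRNA", "sample_name", "count")

# Attach the design columns to every row by sample name.
joined <- merge(long, design, by = "sample_name")
print(joined)
```

Each sgRNA now appears once per sample, with the sample's group alongside, which is the form most plotting and grouping functions expect.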
A function to perform a gene-level test using sgRNA-level statistics.
measure_gene_stats(sgrna_stat, logFC_level = "sgRNA")
sgrna_stat | A data frame created by ‘measure_sgrna_stats’. |
logFC_level | The level of the ‘logFC’ value. It can be ‘gene’ or ‘sgRNA’. |
A table containing the gene-level test result, with these columns:
‘gene’: The name of the gene to be tested.
‘n_sgrna’: The number of sgRNAs in the library that target the gene.
‘cpm_a’: The mean CPM of the sgRNAs within the first group.
‘cpm_b’: The mean CPM of the sgRNAs within the second group.
‘logFC’: The log fold change of the gene between the two groups. By default, it is the mean of the sgRNA ‘logFC’ values; if the ‘logFC_level’ parameter is set to ‘gene’, it is calculated as ‘log2(cpm_b+1) - log2(cpm_a+1)’.
‘p_ts’: The p-value indicating a difference between the two groups at the gene level.
‘p_pa’: The p-value indicating enrichment of the first group at the gene level.
‘p_pb’: The p-value indicating enrichment of the second group at the gene level.
‘fdr_ts’: The adjusted p-value of ‘p_ts’.
‘fdr_pa’: The adjusted p-value of ‘p_pa’.
‘fdr_pb’: The adjusted p-value of ‘p_pb’.
library(CB2)
data(Evers_CRISPRn_RT112)
measure_gene_stats(Evers_CRISPRn_RT112$sg_stat)
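The two ways of obtaining a gene-level ‘logFC’ can give different answers, which the base R sketch below illustrates on toy numbers. The gene-level formula is the one stated in the column description; using the same formula for each individual sgRNA is an assumption made for this example, as are all values in the toy table.

```r
# Toy sgRNA-level table (hypothetical CPM values).
sg <- data.frame(
  gene  = c("G1", "G1", "G2", "G2"),
  cpm_a = c(99, 199, 49, 49),
  cpm_b = c(24, 49, 99, 199),
  stringsAsFactors = FALSE
)

# Assumption: per-sgRNA logFC uses the same +1 pseudocount formula.
sg$logFC <- log2(sg$cpm_b + 1) - log2(sg$cpm_a + 1)

# logFC_level = "sgRNA": mean of the per-sgRNA logFC values.
logfc_sgrna <- tapply(sg$logFC, sg$gene, mean)

# logFC_level = "gene": formula applied to the mean CPM of each group.
mean_a <- tapply(sg$cpm_a, sg$gene, mean)
mean_b <- tapply(sg$cpm_b, sg$gene, mean)
logfc_gene <- log2(mean_b + 1) - log2(mean_a + 1)

print(rbind(sgRNA_level = logfc_sgrna, gene_level = logfc_gene))
```

For G1 both approaches agree because its sgRNAs change by the same factor; for G2 they differ because averaging CPM first gives more weight to the sgRNA with the larger counts.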
A function to perform a statistical test at the sgRNA level.
measure_sgrna_stats(sgcount, design, group_a, group_b, delim = "_", ge_id = NULL, sg_id = NULL)
sgcount | A data frame containing read counts of sgRNAs for the samples. |
design | A table containing the study design. It must contain a 'group' column. |
group_a | The first group to be tested. |
group_b | The second group to be tested. |
delim | The delimiter between a gene name and an sgRNA ID. It is used only if the row names contain the sgRNA IDs. |
ge_id | The column name of the gene column. |
sg_id | The column or columns of sgRNA identifiers. |
A table containing the sgRNA-level test result, with these columns:
‘sgRNA’: The sgRNA identifier.
‘gene’: The gene targeted by the sgRNA.
‘n_a’: The number of replicates of the first group.
‘n_b’: The number of replicates of the second group.
‘phat_a’: The proportion value of the sgRNA for the first group.
‘phat_b’: The proportion value of the sgRNA for the second group.
‘vhat_a’: The variance of the sgRNA for the first group.
‘vhat_b’: The variance of the sgRNA for the second group.
‘cpm_a’: The mean CPM of the sgRNA within the first group.
‘cpm_b’: The mean CPM of the sgRNA within the second group.
‘logFC’: The log fold change of the sgRNA between the two groups.
‘t_value’: The value of the t-statistic.
‘df’: The degrees of freedom, used to calculate the p-value of the sgRNA.
‘p_ts’: The p-value indicating a difference between the two groups.
‘p_pa’: The p-value indicating enrichment of the first group.
‘p_pb’: The p-value indicating enrichment of the second group.
‘fdr_ts’: The adjusted p-value of ‘p_ts’.
‘fdr_pa’: The adjusted p-value of ‘p_pa’.
‘fdr_pb’: The adjusted p-value of ‘p_pb’.
library(CB2)
data(Evers_CRISPRn_RT112)
measure_sgrna_stats(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design, "before", "after")
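To show how the ‘t_value’, ‘df’, and p-value columns can relate to one another, the base R sketch below works through a single hypothetical sgRNA. Several things here are assumptions rather than facts about the package: that ‘vhat’ is the variance of the ‘phat’ estimate, that the degrees of freedom follow the Welch-Satterthwaite approximation, that a positive t-statistic corresponds to enrichment of the first group, and that the adjustment is Benjamini-Hochberg.

```r
# Hypothetical estimates for one sgRNA in two groups of three replicates.
n_a <- 3; n_b <- 3
phat_a <- 2e-4; phat_b <- 1e-4
vhat_a <- 4e-10; vhat_b <- 5e-10

# t-statistic for the difference in proportions.
t_value <- (phat_a - phat_b) / sqrt(vhat_a + vhat_b)

# Assumption: Welch-Satterthwaite degrees of freedom.
df <- (vhat_a + vhat_b)^2 /
  (vhat_a^2 / (n_a - 1) + vhat_b^2 / (n_b - 1))

# Two-sided and one-sided p-values from the t distribution.
p_ts <- 2 * pt(-abs(t_value), df)
p_pa <- pt(t_value, df, lower.tail = FALSE)  # assumed: first group higher
p_pb <- pt(t_value, df, lower.tail = TRUE)   # assumed: second group higher

# Assumption: Benjamini-Hochberg adjustment across all tested sgRNAs.
p_values <- c(sgA = p_ts, sgB = 0.20, sgC = 0.50)
fdr_ts <- p.adjust(p_values, method = "BH")

print(c(t_value = t_value, df = df, p_ts = p_ts, p_pa = p_pa, p_pb = p_pb))
print(fdr_ts)
```

The two one-sided p-values always sum to one, so a small ‘p_pa’ implies a large ‘p_pb’ and vice versa.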
A function to show a heatmap of sgRNA-level correlations of the NGS samples.
plot_corr_heatmap(sgcount, df_design, cor_method = "pearson")
sgcount | The input matrix containing read counts of sgRNAs for each sample. |
df_design | A table containing the study design. |
cor_method | A string naming the correlation measure. It must be one of "pearson", "kendall", or "spearman". |
A pheatmap object containing the correlation heatmap.
library(CB2)
data(Evers_CRISPRn_RT112)
plot_corr_heatmap(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design)
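The matrix that such a heatmap displays can be sketched in base R. The helper name 'corr_sketch' and the log2(CPM + 1) transformation applied before correlating are assumptions for this example; the package may prepare the counts differently.

```r
# Sample-by-sample correlation of sgRNA abundances.
corr_sketch <- function(sgcount, cor_method = c("pearson", "kendall", "spearman")) {
  cor_method <- match.arg(cor_method)       # reject unsupported method names
  sgcount <- as.matrix(sgcount)
  cpm <- sweep(sgcount, 2, colSums(sgcount), "/") * 1e6
  cor(log2(cpm + 1), method = cor_method)   # assumed transformation
}

# Simulated counts: 10 sgRNAs x 4 samples (hypothetical data).
set.seed(1)
count <- matrix(
  rpois(40, lambda = 50), nrow = 10,
  dimnames = list(paste0("sg", 1:10), c("B1", "B2", "A1", "A2"))
)

corr <- corr_sketch(count, "spearman")
print(round(corr, 2))
```

The result is a symmetric samples-by-samples matrix with ones on the diagonal; replicates of the same group would normally be expected to correlate more strongly with each other than with samples from the other group.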
A function to plot read count distribution.
plot_count_distribution(sgcount, df_design, add_dots = FALSE)
sgcount | The input matrix containing read counts of sgRNAs for each sample. |
df_design | A table containing the study design. |
add_dots | The function will display dots of sgRNA counts if it is set to 'TRUE'. |
A ggplot2 object containing a read count distribution plot for 'sgcount'.
library(CB2)
data(Evers_CRISPRn_RT112)
cpm <- get_CPM(Evers_CRISPRn_RT112$count)
plot_count_distribution(cpm, Evers_CRISPRn_RT112$design)
A function to visualize dot plots for a gene.
plot_dotplot(sgcount, df_design, gene, ge_id = NULL, sg_id = NULL)
sgcount | The input matrix containing read counts of sgRNAs for each sample. |
df_design | A table containing the study design. |
gene | The gene to be shown. |
ge_id | The name of the column that contains gene names. |
sg_id | The name of the column that contains sgRNA IDs. |
A ggplot2 object containing dot plots of sgRNA read counts for a gene.
library(CB2)
data(Evers_CRISPRn_RT112)
plot_dotplot(get_CPM(Evers_CRISPRn_RT112$count), Evers_CRISPRn_RT112$design, "RPS7")
This function will perform a principal component analysis, and it returns a ggplot object of the PCA plot.
plot_PCA(sgcount, df_design)
sgcount | The input matrix containing read counts of sgRNAs for each sample. |
df_design | A table containing the study design. |
A ggplot2 object containing a PCA plot for the input.
library(CB2)
data(Evers_CRISPRn_RT112)
plot_PCA(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design)
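The coordinates behind a sample-level PCA plot can be sketched with base R's 'prcomp'. The simulated counts, the log2(CPM + 1) transformation, and the choice to center without scaling are assumptions for this example and may not match what the package does internally.

```r
# Simulated counts: 50 sgRNAs x 4 samples (hypothetical data).
set.seed(1)
count <- matrix(
  rpois(200, lambda = 100), nrow = 50,
  dimnames = list(paste0("sg", 1:50), c("B1", "B2", "A1", "A2"))
)

# Assumed preprocessing: CPM normalization followed by log2(x + 1).
cpm <- sweep(count, 2, colSums(count), "/") * 1e6
log_cpm <- log2(cpm + 1)

# Samples must be rows for prcomp, so the matrix is transposed.
pca <- prcomp(t(log_cpm), center = TRUE, scale. = FALSE)

# First two principal components, one row per sample.
scores <- data.frame(
  sample_name = rownames(pca$x),
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2]
)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(scores)
print(round(var_explained, 3))
```

Plotting 'PC1' against 'PC2' and coloring the points by group gives the familiar PCA scatter plot, in which replicates of one condition would be expected to cluster together.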
A C++ function to quantify sgRNA abundance from NGS samples.
quant(ref_path, fastq_path, verbose = FALSE)
ref_path | The path of the annotation file, which must be in FASTA format. |
fastq_path | A list of the FASTQ files. |
verbose | Display some logs during the quantification if it is set to 'TRUE'. |
A function to perform a statistical test at the sgRNA level (deprecated).
run_estimation(sgcount, design, group_a, group_b, delim = "_", ge_id = NULL, sg_id = NULL)
sgcount | A data frame containing read counts of sgRNAs for the samples. |
design | A table containing the study design. It must contain a 'group' column. |
group_a | The first group to be tested. |
group_b | The second group to be tested. |
delim | The delimiter between a gene name and an sgRNA ID. It is used only if the row names contain the sgRNA IDs. |
ge_id | The column name of the gene column. |
sg_id | The column or columns of sgRNA identifiers. |
A table containing the sgRNA-level test result, with these columns:
‘sgRNA’: The sgRNA identifier.
‘gene’: The gene targeted by the sgRNA.
‘n_a’: The number of replicates of the first group.
‘n_b’: The number of replicates of the second group.
‘phat_a’: The proportion value of the sgRNA for the first group.
‘phat_b’: The proportion value of the sgRNA for the second group.
‘vhat_a’: The variance of the sgRNA for the first group.
‘vhat_b’: The variance of the sgRNA for the second group.
‘cpm_a’: The mean CPM of the sgRNA within the first group.
‘cpm_b’: The mean CPM of the sgRNA within the second group.
‘logFC’: The log fold change of the sgRNA between the two groups.
‘t_value’: The value of the t-statistic.
‘df’: The degrees of freedom, used to calculate the p-value of the sgRNA.
‘p_ts’: The p-value indicating a difference between the two groups.
‘p_pa’: The p-value indicating enrichment of the first group.
‘p_pb’: The p-value indicating enrichment of the second group.
‘fdr_ts’: The adjusted p-value of ‘p_ts’.
‘fdr_pa’: The adjusted p-value of ‘p_pa’.
‘fdr_pb’: The adjusted p-value of ‘p_pb’.
A function to run the sgRNA quantification algorithm on NGS samples.
run_sgrna_quant(lib_path, design, map_path = NULL, ncores = 1, verbose = FALSE)
lib_path | The path of the FASTA file. |
design | A table containing the study design. It must contain 'fastq_path' and 'sample_name' columns. |
map_path | The path of the file that contains the gene-sgRNA mapping. |
ncores | The number of processors to use for parallelization. Parallelization is enabled if the parameter is set to '-1' (all physical cores will be used) or to a value greater than '1'. |
verbose | Display some logs during the quantification if it is set to 'TRUE'. |
It will return a list that contains three elements. The first element (‘count’) is a data frame that contains the result of the quantification for each sample. The second element (‘total’) is a numeric vector that contains the total number of reads of each sample. The last element (‘sequence’) is a data frame that contains the sequence of each sgRNA in the library.
library(CB2)
library(magrittr)
library(tibble)
library(dplyr)
library(glue)
FASTA <- system.file("extdata", "toydata", "small_sample.fasta", package = "CB2")
ex_path <- system.file("extdata", "toydata", package = "CB2")
df_design <- tribble(
  ~group, ~sample_name,
  "Base", "Base1",
  "Base", "Base2",
  "High", "High1",
  "High", "High2"
) %>%
  mutate(fastq_path = glue("{ex_path}/{sample_name}.fastq"))
cb2_count <- run_sgrna_quant(FASTA, df_design)
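The shape of the returned list (‘count’, ‘total’, ‘sequence’) can be mimicked with a toy quantifier written in base R. This sketch assumes a read is counted for an sgRNA when it contains the guide sequence as an exact substring, which is a simplification and not a description of the package's C++ matching routine. The guide sequences, reads, and the column names of the ‘sequence’ table are invented for illustration.

```r
# Hypothetical sgRNA library: names are sgRNA IDs, values are guide sequences.
library_seqs <- c(
  GENE1_sg1 = "ACGTACGTAC",
  GENE1_sg2 = "TTTTGGGGCC",
  GENE2_sg1 = "GATTACAGAT"
)
# Hypothetical reads per sample (in practice these come from FASTQ files).
reads <- list(
  Base1 = c("NNACGTACGTACNN", "GATTACAGATAA", "CCCCCCCCCCCC"),
  High1 = c("TTTTGGGGCCAA", "AATTTTGGGGCC", "ACGTACGTACGG", "GGGGGGGGGGGG")
)

# Assumed rule: count reads that contain the guide as an exact substring.
count_one <- function(sample_reads) {
  vapply(
    library_seqs,
    function(s) sum(grepl(s, sample_reads, fixed = TRUE)),
    integer(1)
  )
}

result <- list(
  count = as.data.frame(sapply(reads, count_one)),   # sgRNAs x samples
  total = vapply(reads, length, integer(1)),         # all reads per sample
  sequence = data.frame(
    sgRNA = names(library_seqs),
    sequence = unname(library_seqs),
    stringsAsFactors = FALSE
  )
)
print(result$count)
print(result$total)
```

Reads that match no guide still contribute to ‘total’, which is why the column sums of ‘count’ can be smaller than the totals.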
A benchmark CRISPRn pooled screen dataset from Sanson et al.
data(Sanson_CRISPRn_A375)
The data object is a list that contains the following information:
The count matrix from Sanson et al.'s paper, containing the CRISPRn screening result for the A375 cell line. It contains one plasmid sample and three biological replicates taken after three weeks.
The list of 1,580 essential genes used in Sanson et al.'s study.
The list of 927 non-essential genes used in Sanson et al.'s study.
A data.frame containing the study design.
https://www.ncbi.nlm.nih.gov/pubmed/30575746