Title: | Visualization of Viral Protein Sequence Diversity Dynamics |
---|---|
Description: | To ease the visualization of outputs from Diversity Motif Analyser ('DiMA'; <https://github.com/BVU-BILSAB/DiMA>). 'vDiveR' allows visualization of the diversity motifs (index and its variants: major, minor and unique) for elucidation of the underlying inherent dynamics. Please refer to <https://vdiver-manual.readthedocs.io/en/latest/> for more information. |
Authors: | Pendy Tok [aut, cre], Li Chuin Chong [aut], Evgenia Chikina [aut], Yin Cheng Chen [aut], Mohammad Asif Khan [aut] |
Maintainer: | Pendy Tok <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.0.1 |
Built: | 2024-11-25 16:49:18 UTC |
Source: | CRAN |
This function concatenates completely conserved (index incidence = 100%) or highly conserved (90% <= index incidence < 100%) k-mer positions that overlap or are adjacent to each other, and generates the CCS/HCS sequence in either CSV or FASTA format.
concat_conserved_kmer( data, conservation_level = "HCS", kmer = 9, threshold_pct = NULL )
data | DiMA JSON converted csv file data |
conservation_level | CCS (completely conserved) / HCS (highly conserved) |
kmer | size of the k-mer window |
threshold_pct | manually set threshold of index.incidence for HCS |
A list with csv and fasta dataframes
csv <- concat_conserved_kmer(proteins_1host)$csv
csv_2hosts <- concat_conserved_kmer(protein_2hosts, conservation_level = "CCS")$csv
fasta <- concat_conserved_kmer(protein_2hosts, conservation_level = "HCS")$fasta
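The idea of concatenating overlapping or adjacent conserved k-mers can be pictured with a small base R sketch. This is not the package's implementation: `merge_conserved_kmers()` and its arguments are hypothetical names, and the sketch assumes that each input element is the start position and sequence of an already-selected conserved k-mer of fixed length.

```r
# Hypothetical helper (not part of vDiveR): merge conserved k-mers whose
# windows overlap or sit directly next to each other into longer regions.
merge_conserved_kmers <- function(position, sequence, kmer = 9) {
  stopifnot(length(position) == length(sequence), length(position) > 0)
  ord <- order(position)
  position <- position[ord]
  sequence <- sequence[ord]

  regions <- list()
  start <- position[1]
  end <- start + kmer - 1   # last residue covered by the current region
  merged <- sequence[1]

  for (i in seq_along(position)[-1]) {
    window_end <- position[i] + kmer - 1
    if (position[i] <= end + 1) {
      # Overlapping or adjacent window: append only the residues not yet covered.
      new_chars <- window_end - end
      if (new_chars > 0) {
        merged <- paste0(merged, substr(sequence[i], kmer - new_chars + 1, kmer))
        end <- window_end
      }
    } else {
      # Gap found: close the current region and start a new one.
      regions[[length(regions) + 1]] <- data.frame(
        start = start, end = end, sequence = merged, stringsAsFactors = FALSE
      )
      start <- position[i]
      end <- window_end
      merged <- sequence[i]
    }
  }
  regions[[length(regions) + 1]] <- data.frame(
    start = start, end = end, sequence = merged, stringsAsFactors = FALSE
  )
  do.call(rbind, regions)
}

# Toy 3-mers taken from the made-up sequence "ABCDEFGHIJKL"
toy <- merge_conserved_kmers(
  position = c(1, 2, 3, 8, 9),
  sequence = c("ABC", "BCD", "CDE", "HIJ", "IJK"),
  kmer = 3
)
print(toy)
```

In this toy input, positions 1 to 3 collapse into one region and positions 8 to 9 into another, because the gap between residue 5 and residue 8 is neither an overlap nor an adjacency.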
This function gets the metadata from each header of a GISAID FASTA file.
extract_from_GISAID(file_path)
file_path | path of fasta file |
This function gets the metadata from each header of an NCBI FASTA file.
extract_from_NCBI(file_path)
file_path | path of fasta file |
A sample DiMA JSON Output File which acts as the input for JSON2CSV()
JSON_sample
A Diversity Motif Analyzer (DiMA) tool JSON file
This function converts DiMA (v5.0.9) JSON output file to a dataframe with 17 predefined columns which further acts as the input for other functions provided in this vDiveR package.
json2csv( json_data, host_name = "unknown host", protein_name = "unknown protein" )
json_data | DiMA JSON output dataframe |
host_name | name of the host species |
protein_name | name of the protein |
A dataframe which acts as input for the other functions in vDiveR package
inputdf<-json2csv(JSON_sample)
A dummy dataset that acts as an input for plot_world_map() and plot_time()
metadata
A data frame with 1000 rows and 3 variables:
ID | unique identifier of the sequence |
region | geographical region of the sequence collection |
date | collection date of the sequence |
This function retrieves metadata (ID, region, date) from the input FASTA file, whose source is either NCBI or GISAID (in both cases with the default FASTA header). The function returns a dataframe with three columns: ID, collected region and collected date. Records that lack region or date information are excluded from the output dataframe.
metadata_extraction(file_path, source)
file_path | path of fasta file |
source | the source of fasta file, either "NCBI" or "GISAID" |
A dataframe with three columns: ID, collected region and collected date
filepath <- system.file('extdata', 'GISAID_EpiCoV.faa', package = 'vDiveR')
meta_gisaid <- metadata_extraction(filepath, 'GISAID')
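The extract-then-filter behaviour described for this function can be sketched in base R. The header layout used here (`>ID|region|date`) is a made-up illustration, not the actual NCBI or GISAID default header, and `parse_fasta_headers()` is a hypothetical name rather than a vDiveR function. The sketch only shows how records without a region or a date might be dropped.

```r
# Hypothetical helper (not part of vDiveR): pull ID, region and date out of
# pipe-delimited FASTA headers and drop records missing region or date.
# ASSUMPTION: headers look like ">ID|region|date"; real NCBI/GISAID headers differ.
parse_fasta_headers <- function(lines) {
  headers <- sub("^>", "", lines[startsWith(lines, ">")])
  fields <- strsplit(headers, "|", fixed = TRUE)

  # Return the i-th field of every header, or NA when the field is absent.
  get_field <- function(i) {
    vapply(
      fields,
      function(f) if (length(f) >= i) trimws(f[i]) else NA_character_,
      character(1)
    )
  }

  out <- data.frame(
    ID = get_field(1),
    region = get_field(2),
    date = get_field(3),
    stringsAsFactors = FALSE
  )
  keep <- !is.na(out$region) & nzchar(out$region) &
    !is.na(out$date) & nzchar(out$date)
  out[keep, , drop = FALSE]
}

fasta_lines <- c(
  ">S1|Asia|2020-01-05", "MKVLA",
  ">S2||2020-02-01", "MKVLT",   # region missing
  ">S3|Europe|", "MKVLS"        # date missing
)
meta <- parse_fasta_headers(fasta_lines)
print(meta)
```

Only the first record survives the filter in this toy input, since the other two lack either a region or a date.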
This function plots the distribution of conservation levels across k-mer positions: completely conserved (black; index incidence = 100%), highly conserved (blue; 90% <= index incidence < 100%), mixed variable (green; 20% < index incidence < 90%), highly diverse (purple; 10% < index incidence <= 20%) and extremely diverse (pink; index incidence <= 10%).
plot_conservation_level( df, protein_order = NULL, conservation_label = 1, host = 1, base_size = 11, line_dot_size = 2, label_size = 2.6, alpha = 0.6 )
df | DiMA JSON converted csv file data |
protein_order | order of proteins displayed in plot |
conservation_label | 0 (partial; show present conservation labels only) or 1 (full; show ALL conservation labels) in plot |
host | number of hosts (1 or 2) |
base_size | base font size in plot |
line_dot_size | lines and dots size |
label_size | conservation labels font size |
alpha | any number from 0 (transparent) to 1 (opaque) |
A plot
plot_conservation_level(proteins_1host, conservation_label = 1, alpha = 0.8, base_size = 15)
plot_conservation_level(protein_2hosts, conservation_label = 0, host = 2)
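The five conservation levels can be expressed as a simple classification of index incidence. The following base R sketch is an illustration only: `classify_conservation()` is a hypothetical name, not a vDiveR function, and a value of exactly 90% is assigned to "highly conserved" here because that range is stated with an inclusive lower bound.

```r
# Hypothetical helper (not part of vDiveR): map index incidence (in percent)
# to the five conservation levels described for plot_conservation_level().
classify_conservation <- function(index_incidence) {
  stopifnot(all(index_incidence >= 0 & index_incidence <= 100))
  levels_order <- c(
    "completely conserved", "highly conserved", "mixed variable",
    "highly diverse", "extremely diverse"
  )
  level <- character(length(index_incidence))
  level[index_incidence == 100] <- "completely conserved"
  level[index_incidence >= 90 & index_incidence < 100] <- "highly conserved"
  level[index_incidence > 20 & index_incidence < 90] <- "mixed variable"
  level[index_incidence > 10 & index_incidence <= 20] <- "highly diverse"
  level[index_incidence <= 10] <- "extremely diverse"
  factor(level, levels = levels_order)
}

incidence <- c(100, 95, 90, 50, 20, 10.5, 10, 0)
conservation <- classify_conservation(incidence)
print(data.frame(incidence, conservation))
```

Returning a factor with a fixed level order keeps all five labels available even when some levels are absent from the data, which mirrors the "full" labelling option of the plot.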
This function plots the correlation between entropy and total variant incidence of all the provided protein(s).
plot_correlation( df, host = 1, alpha = 1/3, line_dot_size = 3, base_size = 11, ylabel = "k-mer entropy (bits)\n", xlabel = "\nTotal variants (%)", ymax = ceiling(max(df$entropy)), ybreak = 0.5 )
df | DiMA JSON converted csv file data |
host | number of hosts (1 or 2) |
alpha | any number from 0 (transparent) to 1 (opaque) |
line_dot_size | dot size in scatter plot |
base_size | base font size in plot |
ylabel | y-axis label |
xlabel | x-axis label |
ymax | maximum y-axis value |
ybreak | y-axis breaks |
A scatter plot
plot_correlation(proteins_1host)
plot_correlation(protein_2hosts, base_size = 2, ybreak = 1, ymax = 10, host = 2)
This function compactly displays the dynamics of diversity motifs (index and its variants: major, minor and unique) in the form of dot plot(s) as well as violin plots for all the provided individual protein(s).
plot_dynamics_protein( df, host = 1, protein_order = NULL, base_size = 8, alpha = 1/3, line_dot_size = 3, bw = "nrd0", adjust = 1 )
df | DiMA JSON converted csv file data |
host | number of hosts (1 or 2) |
protein_order | order of proteins displayed in plot |
base_size | base font size in plot |
alpha | any number from 0 (transparent) to 1 (opaque) |
line_dot_size | dot size in scatter plot |
bw | smoothing bandwidth of violin plot (default: nrd0) |
adjust | adjust the width of violin plot (default: 1) |
A plot
plot_dynamics_protein(proteins_1host)
This function compactly displays the dynamics of diversity motifs (index and its variants: major, minor and unique) in the form of a dot plot as well as a violin plot for all the provided proteins at proteome level.
plot_dynamics_proteome( df, host = 1, line_dot_size = 2, base_size = 10, alpha = 1/3, bw = "nrd0", adjust = 1 )
df | DiMA JSON converted csv file data |
host | number of hosts (1 or 2) |
line_dot_size | size of dot in plot |
base_size | base font size in plot |
alpha | any number from 0 (transparent) to 1 (opaque) |
bw | smoothing bandwidth of violin plot (default: nrd0) |
adjust | adjust the width of violin plot (default: 1) |
A plot
plot_dynamics_proteome(proteins_1host)
This function plots the entropy (black) and total variant incidence (red) of each k-mer position across the studied proteins and highlights region(s) with zero entropy in yellow. k-mer positions with low support are marked with a red triangle underneath the x-axis line.
plot_entropy( df, host = 1, protein_order = "", kmer_size = 9, ymax = 10, line_size = 2, base_size = 8, all = TRUE, highlight_zero_entropy = TRUE )
df | DiMA JSON converted csv file data |
host | number of hosts (1 or 2) |
protein_order | order of proteins displayed in plot |
kmer_size | size of the k-mer window |
ymax | maximum y-axis value |
line_size | size of the horizontal (reference) line in plot |
base_size | base font size in plot |
all | plot both the entropy and total variants (pass FALSE to plot only the entropy) |
highlight_zero_entropy | highlight region with zero entropy (default: TRUE) |
A plot
plot_entropy(proteins_1host)
plot_entropy(protein_2hosts, host = 2)
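To make the link between entropy and conservation concrete, the sketch below computes plain Shannon entropy (in bits) for the k-mers observed at a single position and flags positions with zero entropy. It is an assumption-laden illustration: `kmer_entropy()` is a hypothetical name, the input is assumed to contain no missing values, and any sample-size correction that DiMA may apply to its entropy values is not reproduced here.

```r
# Hypothetical helper (not part of vDiveR or DiMA): plain Shannon entropy,
# in bits, of the k-mer sequences observed at one alignment position.
kmer_entropy <- function(kmers) {
  p <- as.numeric(table(kmers)) / length(kmers)  # relative frequency of each distinct k-mer
  -sum(p * log2(p))
}

# Three toy positions, four sequences each
positions <- list(
  c("AAA", "AAA", "AAA", "AAA"),  # one distinct k-mer: completely conserved
  c("AAB", "AAB", "ABB", "ABB"),  # two equally frequent k-mers
  c("ABC", "ABD", "ABE", "ABF")   # four distinct k-mers
)
entropy <- vapply(positions, kmer_entropy, numeric(1))

# Positions that would fall in a zero-entropy region
zero_entropy <- which(entropy == 0)
print(entropy)
print(zero_entropy)
```

A position where every sequence carries the same k-mer has zero entropy, and entropy grows as more distinct k-mers share the position more evenly.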
This function plots the time distribution of the provided sequences as a bar plot with 'Month' on the x-axis and 'Number of Sequences' on the y-axis. Aside from the plot, this function also returns a dataframe with 2 columns: 'Date' and 'Number of sequences'. The input dataframe of this function is obtainable from metadata_extraction(), with an NCBI Protein / GISAID (EpiFlu/EpiCoV/EpiPox/EpiArbo) FASTA file as input.
plot_time( metadata, date_format = "%Y-%m-%d", base_size = 8, date_break = "2 month", scale = "count" )
metadata | a dataframe with 3 columns, 'ID', 'region', and 'date' |
date_format | date format of the input dataframe |
base_size | base font size in plot |
date_break | date break for scale_x_date |
scale | plot counts or log-scale the data |
A single plot or a list with 2 elements (a plot followed by a dataframe, default)
time_plot <- plot_time(metadata, date_format = "%d/%m/%Y")$plot
time_df <- plot_time(metadata, date_format = "%d/%m/%Y")$df
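The returned dataframe is essentially a tally of sequences per month. A minimal base R sketch of such a tally follows. `count_by_month()` is a hypothetical name, not the package's code, and the sketch assumes that dates which fail to parse with the given `date_format` are simply skipped.

```r
# Hypothetical helper (not part of vDiveR): count sequences per calendar month.
count_by_month <- function(dates, date_format = "%Y-%m-%d") {
  parsed <- as.Date(dates, format = date_format)  # unparseable strings become NA
  month <- format(parsed[!is.na(parsed)], "%Y-%m")
  tab <- table(month)
  data.frame(
    Date = as.character(names(tab)),
    `Number of sequences` = as.integer(tab),
    check.names = FALSE,        # keep the space in the column name
    stringsAsFactors = FALSE
  )
}

toy_dates <- c("05/01/2020", "17/01/2020", "02/03/2020", "not a date")
monthly <- count_by_month(toy_dates, date_format = "%d/%m/%Y")
print(monthly)
```

The `date_format` string follows the same `strptime` conventions as the `date_format` argument above, so "%d/%m/%Y" matches day/month/year input.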
This function plots a world map and colors the affected geographical region(s) from light (lower) to dark (higher), depending on the cumulative number of sequences. Aside from the plot, this function also returns a dataframe with 2 columns: 'Region' and 'Number of Sequences'. The input dataframe of this function is obtainable from metadata_extraction(), with an NCBI Protein / GISAID (EpiFlu/EpiCoV/EpiPox/EpiArbo) FASTA file as input.
plot_world_map(metadata, base_size = 8)
metadata | a dataframe with 3 columns, 'ID', 'region', and 'date' |
base_size | base font size in plot |
A list with 2 elements (a plot followed by a dataframe)
geographical_plot <- plot_world_map(metadata)$plot
geographical_df <- plot_world_map(metadata)$df
A dummy dataset with 1 protein (Core) from two hosts, human and bat
protein_2hosts
A data frame with 200 rows and 17 variables:
name of the protein
starting position of the aligned, overlapping k-mer window
number of k-mer sequences at the given position
k-mer positions with fewer sequences than the minimum support threshold are considered to be of low support in terms of sample size (TRUE)
level of variability at the k-mer position, with zero representing completely conserved
the predominant sequence (index motif) at the given k-mer position
the fraction (in percentage) of the index sequences at the k-mer position
the fraction (in percentage) of the major sequence (the predominant variant to the index) at the k-mer position
the fraction (in percentage) of minor sequences (of frequency lower than the major variant, but not singletons) at the k-mer position
the fraction (in percentage) of unique sequences (singletons, observed only once) at the k-mer position
the fraction (in percentage) of sequences at the k-mer position that are variants to the index (includes: major, minor and unique variants)
incidence of the distinct k-mer peptides at the k-mer position
presence of more than one index sequence of equal incidence
species name of the organism host to the virus
k-mer position that has the highest entropy value
highest entropy values observed in the studied protein
average entropy values across all the k-mer positions
A dummy dataset with two proteins (A and B) from one host, human
proteins_1host
A data frame with 806 rows and 17 variables:
name of the protein
starting position of the aligned, overlapping k-mer window
number of k-mer sequences at the given position
k-mer positions with fewer sequences than the minimum support threshold are considered to be of low support in terms of sample size (TRUE)
level of variability at the k-mer position, with zero representing completely conserved
the predominant sequence (index motif) at the given k-mer position
the fraction (in percentage) of the index sequences at the k-mer position
the fraction (in percentage) of the major sequence (the predominant variant to the index) at the k-mer position
the fraction (in percentage) of minor sequences (of frequency lower than the major variant, but not singletons) at the k-mer position
the fraction (in percentage) of unique sequences (singletons, observed only once) at the k-mer position
the fraction (in percentage) of sequences at the k-mer position that are variants to the index (includes: major, minor and unique variants)
incidence of the distinct k-mer peptides at the k-mer position
presence of more than one index sequence of equal incidence
species name of the organism host to the virus
k-mer position that has the highest entropy value
highest entropy values observed in the studied protein
average entropy values across all the k-mer positions
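The incidence variables described above (index, major, minor, unique and total variants) can be illustrated for a single k-mer position with a short base R sketch. This is a simplified reading of the definitions, not DiMA's algorithm: `motif_incidence()` is a hypothetical name, ties between equally frequent sequences are broken arbitrarily, and the most frequent variant is always treated as the major variant, even if it occurs only once.

```r
# Hypothetical helper (not part of vDiveR or DiMA): percentage incidence of the
# diversity motifs among the k-mer sequences observed at one position.
motif_incidence <- function(kmers) {
  counts <- as.integer(sort(table(kmers), decreasing = TRUE))
  n <- sum(counts)
  variants <- counts[-1]   # everything except the index sequence
  major <- if (length(variants) > 0) variants[1] else 0L
  rest <- variants[-1]     # variants other than the major one
  pct <- function(x) 100 * x / n
  c(
    index = pct(counts[1]),
    major = pct(major),
    minor = pct(sum(rest[rest > 1])),    # repeated, but rarer than the major variant
    unique = pct(sum(rest[rest == 1])),  # singletons
    total_variants = pct(sum(variants))
  )
}

# Toy position with 20 sequences
kmers <- c(rep("AAA", 10), rep("AAB", 5), rep("ABA", 3), "BAA", "BBA")
inc <- motif_incidence(kmers)
print(inc)
```

In this toy position the index accounts for half of the sequences, and the major, minor and unique incidences add up to the total variant incidence.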