| Title: | Longitudinal Integration Site Analysis Toolkit |
|---|---|
| Description: | A comprehensive toolkit for the analysis of longitudinal integration site data, including data cleaning, quality control, statistical modeling, and visualization. It streamlines the entire workflow of integration site analysis, supports simple input formats, and provides user-friendly functions for researchers in virus integration site analysis. Ni et al. (2025) <doi:10.64898/2025.12.20.695672>. |
| Authors: | Shuai Ni [aut, cre] |
| Maintainer: | Shuai Ni <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.2 |
| Built: | 2026-05-26 06:39:16 UTC |
| Source: | https://github.com/cran/lisat |
Plot chromosome distribution of integration sites (IS)
chr_distribution(IS_raw, ref_version = "random")

| Argument | Description |
|---|---|
| IS_raw | Data frame containing raw integration site data (must have a Chr column) |
| ref_version | Reference version for simulation (options: 'random' or 'LV'; default = 'random') |

Returns a ggplot object of the chromosome distribution (percentage of IS per chromosome).
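The percentage of IS per chromosome can be illustrated in base R. This is a minimal sketch on a hypothetical toy data frame, not the package's implementation; it omits the plotting step and the reference distribution selected by ref_version.

```r
# Hypothetical toy input: only the Chr column matters for this sketch.
IS_raw <- data.frame(
  Chr   = c("chr1", "chr1", "chr2", "chrX"),
  Locus = c(1500, 98000, 40210, 77000),
  stringsAsFactors = FALSE
)

# Count IS per chromosome, then convert the counts to percentages.
chr_counts <- table(IS_raw$Chr)
chr_pct <- 100 * as.numeric(chr_counts) / sum(chr_counts)
names(chr_pct) <- names(chr_counts)

print(chr_pct)
```

The resulting named vector holds the kind of per-chromosome percentages that the plot displays.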
Visualize and analyze network of common integration sites (CIS)
CIS(IS_raw, connect_distance = 50000)

| Argument | Description |
|---|---|
| IS_raw | Data frame containing integration site data (must have Locus, Chr, and nearest_gene_name columns) |
| connect_distance | Numeric threshold for connecting IS (default = 50000 bp) |

Returns a data frame with the top 10 CIS network metrics (Chr, Locus, Gene, Total_dots, etc.).
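To make the role of connect_distance concrete, the sketch below groups sites on the same chromosome whenever the gap to the previous site is at most connect_distance. It is an assumption about how such a threshold can be applied (simple one-dimensional linkage on toy data), not the network construction that CIS() actually performs.

```r
connect_distance <- 50000

# Hypothetical toy input with the two columns this sketch needs.
IS_raw <- data.frame(
  Chr   = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  Locus = c(100000, 120000, 400000, 50000, 90000),
  stringsAsFactors = FALSE
)

# Sort by chromosome and position so that neighbours are adjacent rows.
IS_sorted <- IS_raw[order(IS_raw$Chr, IS_raw$Locus), ]

# A new cluster starts when the chromosome changes or when the gap to
# the previous site exceeds connect_distance.
n <- nrow(IS_sorted)
new_cluster <- c(
  TRUE,
  IS_sorted$Chr[-1] != IS_sorted$Chr[-n] |
    diff(IS_sorted$Locus) > connect_distance
)
IS_sorted$cluster <- cumsum(new_cluster)

print(IS_sorted)
```

Sites that share a cluster id are within reach of each other through a chain of neighbours, which is the intuition behind a common integration site.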
Generate colored GT table for CIS overlap across samples/timepoints
CIS_overlap(CIS_data, IS_raw, Timelevels = NULL)

| Argument | Description |
|---|---|
| CIS_data | Data frame of CIS metrics (must have Chr and Locus columns) |
| IS_raw | Data frame of raw integration site data (must have Sample, Chr, and Locus columns) |
| Timelevels | Optional vector of sample/timepoint levels for ordered display (default = NULL) |

Returns a gt table object with colored CIS overlap status (TRUE/FALSE).
Calculate regional distribution percentages of integration sites (IS)
Count_regions(IS_raw, Patient_timepoint)

| Argument | Description |
|---|---|
| IS_raw | Data frame of raw integration site data (must have a Sample column plus regional annotation columns) |
| Patient_timepoint | Data frame mapping Sample_ID to Time_Point (columns: Sample_ID, Time_Point) |

Returns a list of data frames (one per sample) with regional IS percentages (Exonic, Intronic, Enhancer, etc.).
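The Patient_timepoint mapping and a regional percentage can be sketched in base R as follows. The toy values and the use of a single logical Enhancer column are assumptions for illustration; the actual regional annotation columns and the output layout of Count_regions() may differ.

```r
# Mapping from sample identifiers to time points (hypothetical values).
Patient_timepoint <- data.frame(
  Sample_ID  = c("S1", "S2"),
  Time_Point = c("M1", "M6"),
  stringsAsFactors = FALSE
)

# Toy IS data with one assumed logical annotation column.
IS_raw <- data.frame(
  Sample   = c("S1", "S1", "S2"),
  Enhancer = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Attach the time point to every integration site.
annotated <- merge(IS_raw, Patient_timepoint,
                   by.x = "Sample", by.y = "Sample_ID")

# Percentage of IS located in enhancers, per time point.
enhancer_pct <- 100 * tapply(annotated$Enhancer, annotated$Time_Point, mean)

print(enhancer_pct)
```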
Plot cumulative curve and perform statistical analysis
Cumulative_curve(IS_ratio)

| Argument | Description |
|---|---|
| IS_ratio | A numeric vector of integration site ratios (output of fit_cum_simple) |

Returns a list containing the ggplot object, the t-test results, and the Wilcoxon test result.
Check if integration sites (IS) are located in enhancer regions
Enhancer_check(IS_raw)

| Argument | Description |
|---|---|
| IS_raw | Data frame containing raw integration site data (must have Chr and Locus columns) |

Returns a data frame with an added Enhancer column (TRUE = located in an enhancer, FALSE = not located in an enhancer).
Calculate normalized cumulative sum for top N elements of a numeric vector
fit_cum_simple(x)

| Argument | Description |
|---|---|
| x | Non-empty numeric vector (integration site ratio data) |

Returns a named vector of cumulative sums at predefined target indices, plus the total sum (all = 1).
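A normalized cumulative sum over the top N elements can be sketched as below. The function name cum_share_sketch, its targets argument, and the output names are hypothetical; the predefined target indices used by fit_cum_simple() are not stated here, so this shows only the general technique.

```r
# Hypothetical helper: cumulative share covered by the top-k elements.
cum_share_sketch <- function(x, targets = c(1, 10, 100)) {
  stopifnot(is.numeric(x), length(x) > 0)
  # Largest values first, so index k covers the top k elements.
  shares <- cumsum(sort(x, decreasing = TRUE)) / sum(x)
  # Target indices beyond the vector length fall back to the last element.
  idx <- pmin(targets, length(shares))
  out <- c(shares[idx], 1)
  names(out) <- c(paste0("top", targets), "all")
  out
}

res <- cum_share_sketch(c(5, 1, 3, 1), targets = c(1, 2))
print(res)
```

For the toy vector, the largest element accounts for half of the total and the two largest for 80 percent, with the final entry fixed at 1.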
This function adds genomic feature annotations (gene/exon/intron overlap, nearest gene info) to raw integration site data, standardizes chromosome naming, and calculates clone contribution.
get_feature(IS_raw)

| Argument | Description |
|---|---|
| IS_raw | Data frame containing raw IS data with columns: Sample, Chr, Locus, SCount, Strand |

Returns a data frame with annotated genomic features and clone contribution.
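Two of the steps described above, chromosome name standardization and clone contribution, can be illustrated in base R. Both rules are assumptions made for this sketch (a "chr" prefix is added when missing, and contribution is taken as each site's share of SCount within its sample); get_feature() may define them differently, and the gene annotation step is omitted.

```r
# Hypothetical toy input with the documented columns.
IS_raw <- data.frame(
  Sample = c("S1", "S1", "S2", "S2", "S2"),
  Chr    = c("1", "chr2", "X", "chr1", "chr1"),
  Locus  = c(1500, 98000, 40210, 77000, 88000),
  SCount = c(30, 10, 5, 5, 10),
  Strand = c("+", "-", "+", "+", "-"),
  stringsAsFactors = FALSE
)

# Assumed naming rule: prepend "chr" when the prefix is missing.
IS_raw$Chr <- ifelse(grepl("^chr", IS_raw$Chr),
                     IS_raw$Chr,
                     paste0("chr", IS_raw$Chr))

# Assumed definition: share of sequence counts within each sample.
IS_raw$Clone_contribution <-
  IS_raw$SCount / ave(IS_raw$SCount, IS_raw$Sample, FUN = sum)

print(IS_raw)
```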
This function generates a chromosome ideogram plot showing the density and position of integration sites (IS) using the RIdeogram package.
ideogram_plot(IS_raw, output_dir)

| Argument | Description |
|---|---|
| IS_raw | Data frame containing integration site data (columns: Chr, Locus required) |
| output_dir | Character, path to the output directory for the PDF plot |

Returns nothing; a PDF file is generated in output_dir.
This function filters integration site data for AE-associated genes (within specified distance/threshold) and generates a dot plot of clone contribution percentages for these genes.
is_in_AE_gene(IS_raw, Distance = 1e+05, threashold = 0.001)

| Argument | Description |
|---|---|
| IS_raw | Data frame with annotated integration site data (columns: nearest_gene_name, nearest_distance, Clone_contribution, Sample required) |
| Distance | Numeric, maximum distance to an AE gene (default: 100000 bp) |
| threashold | Numeric, minimum clone contribution threshold (default: 0.001) |

Returns a ggplot object (dot plot of clone contribution for AE-associated genes).
This function filters integration site data for cancer-associated genes (within specified distance/threshold) and generates a dot plot of clone contribution percentages for these genes.
is_in_CG_gene(IS_raw, Distance = 1e+05, threashold = 0.001)

| Argument | Description |
|---|---|
| IS_raw | Data frame with annotated integration site data (columns: nearest_gene_name, nearest_distance, Clone_contribution, Sample required) |
| Distance | Numeric, maximum distance to a cancer gene (default: 100000 bp) |
| threashold | Numeric, minimum clone contribution threshold (default: 0.001) |

Returns a ggplot object (dot plot of clone contribution for cancer-associated genes).
This function filters integration site data for immune-associated genes (within specified distance/threshold) and generates a dot plot of clone contribution percentages for these genes.
is_in_immune_gene(IS_raw, Distance = 1e+05, threashold = 0.001)

| Argument | Description |
|---|---|
| IS_raw | Data frame with annotated integration site data (columns: nearest_gene_name, nearest_distance, Clone_contribution, Sample required) |
| Distance | Numeric, maximum distance to an immune gene (default: 100000 bp) |
| threashold | Numeric, minimum clone contribution threshold (default: 0.001) |

Returns a ggplot object (dot plot of clone contribution for immune-associated genes).
This function creates a treemap visualization of the top 1000 integration site (IS) clone contributions, grouped by patient time points with custom color perturbation.
IS_treemap(IS_raw = IS_raw, Patient_timepoint = Patient_timepoint, Timelevels = NULL)

| Argument | Description |
|---|---|
| IS_raw | Data frame containing IS data (columns: Sample, Locus, Clone_contribution required) |
| Patient_timepoint | Data frame mapping Sample_ID to Time_Point (columns: Sample_ID, Time_Point required) |
| Timelevels | Character vector, optional custom order of time points (default: NULL, natural sort) |

Returns a ggplot object (treemap of IS clone contributions).
Creates a highly customizable combined Sankey-flow + stacked bar chart to visualize clonal proportion changes across timepoints, with manual control over flow polygon shapes and precise formatting of top integration sites (top 10 + "Others" category). All core logic and data-processing steps are identical to the original code; only namespace prefixes (::) were added and lag() was fixed.
Linked_timepoints(IS_raw, Patient_timepoint, Timelevels = NULL)

| Argument | Description |
|---|---|
| IS_raw | Data frame containing integration site data (required columns: Clone_contribution, Sample, nearest_gene_name, Chr, Locus) |
| Patient_timepoint | Data frame mapping Sample_ID to Time_Point (columns: Sample_ID, Time_Point required) |
| Timelevels | Character vector (optional). Custom ordered levels for time points (overrides natural sort). Default = NULL. |

Returns a ggplot object: a combined Sankey-flow + stacked bar chart of the top 10 integration site proportions across timepoints.
Plot Region-wise Donut Charts
plot_regions(Region_data, Timelevels = NULL)

| Argument | Description |
|---|---|
| Region_data | Named list of data frames with Product, Share, Percentage, and Time columns |
| Timelevels | Character vector to subset time levels (optional) |

Returns an arranged ggplot object of donut charts.
Creates a dual Y-axis line chart to visualize clonal richness and evenness over time, with automatic scaling between axes, customizable styling, and optional data labels. All core functionality and parameters are identical to the original code; only namespace prefixes (::) were added.
plot_richness_evenness(PMD_data, time_col = "Time", richness_col = "Richness", evenness_col = "Eveness", plot_title = "Clonal eveness over time", subtitle = NULL, richness_color = "#3366CC", evenness_color = "#CC6677", show_labels = TRUE, Timelevels = NULL)

| Argument | Description |
|---|---|
| PMD_data | Data frame containing time, richness, and evenness data (required columns are specified by time_col, richness_col, and evenness_col) |
| time_col | Character (default = "Time"). Name of the column containing time points. |
| richness_col | Character (default = "Richness"). Name of the column containing richness values. |
| evenness_col | Character (default = "Eveness"). Name of the column containing evenness values (note: the spelling intentionally matches the original code). |
| plot_title | Character (default = "Clonal eveness over time"). Main plot title (spelling preserved from the original). |
| subtitle | Character (optional). Plot subtitle (default = NULL). |
| richness_color | Character (default = "#3366CC"). Hex color code for the richness line, points, and labels. |
| evenness_color | Character (default = "#CC6677"). Hex color code for the evenness line, points, and labels. |
| show_labels | Logical (default = TRUE). Whether to display numeric labels on data points. |
| Timelevels | Character vector (optional). Custom ordered levels for the time factor (overrides default ordering). |

Returns a ggplot object: a dual Y-axis line chart of richness (primary axis) and evenness (secondary axis) over time.
This function computes UIS count, top clone contribution percentage, and PMD metrics (Richness/Eveness/PMD) for integration site data, and maps samples to patient time points.
pmd_analysis(IS_raw, Patient_timepoint)

| Argument | Description |
|---|---|
| IS_raw | Data frame containing integration site data (columns: Sample, Clone_contribution required) |
| Patient_timepoint | Data frame mapping Sample_ID to Time_Point (columns: Sample_ID, Time_Point required) |

Returns a data frame with PMD metrics (UIS, TOP_P, Richness, Eveness, PMD, Sample, Time).
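The simpler metrics can be sketched for a single sample in base R. The UIS count and top clone percentage follow directly from the description; the Shannon index and Pielou evenness shown here are generic diversity measures used as stand-ins, since the exact Richness, Eveness, and PMD formulas of pmd_analysis() are not given in this manual.

```r
# Hypothetical clone contributions for one sample (they sum to 1).
contrib <- c(0.5, 0.25, 0.25)

# Number of unique integration sites and top clone percentage.
uis   <- length(contrib)
top_p <- 100 * max(contrib)

# Stand-in diversity measures (assumption, not the package formulas):
# Shannon index and Pielou evenness, which lies between 0 and 1.
shannon <- -sum(contrib * log(contrib))
pielou  <- shannon / log(uis)

print(c(UIS = uis, TOP_P = top_p, Shannon = shannon, Pielou = pielou))
```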
Creates a scatter plot of Richness vs. Eveness for PMD (Proportional Modular Diversity) analysis results, including reference lines, time point labels, and an inset directional legend for polyclonal/monoclonal classification. All core logic and parameters are identical to the original code; only namespace prefixes (::) were added.
pmd_plot(PMD_data, Timelevels = NULL)

| Argument | Description |
|---|---|
| PMD_data | Data frame output from the pmd_analysis() function (required columns: Richness, Eveness, Time) |
| Timelevels | Character vector (optional). Custom ordered levels for the Time factor. Default = NULL (uses natural sort). |

Returns a ggplot object: a combined plot (main Richness-Eveness plot + inset legend).
Check if integration sites (IS) are located in promoter regions
Promotor_check(IS_raw)

| Argument | Description |
|---|---|
| IS_raw | Data frame containing raw integration site data (must have Chr and Locus columns) |

Returns a data frame with an added Promotor column (TRUE = located in a promoter, FALSE = not located in a promoter).
Check if integration sites (IS) are located in safe harbor regions
Safeharbor_check(IS_raw)

| Argument | Description |
|---|---|
| IS_raw | Data frame containing raw integration site data (must have Chr and Locus columns) |

Returns a data frame with an added Safeharbor column (TRUE = located in a safe harbor, FALSE = not located in a safe harbor).
Validate and standardize integration site (IS) raw data frame
validate_IS_raw(IS_raw)

| Argument | Description |
|---|---|
| IS_raw | Data frame containing IS data (expected columns: Sample, SCount, Chr, Locus) |

Returns a list with validation results:

| Element | Type | Description |
|---|---|---|
| valid | logical | TRUE if the data passes validation, FALSE otherwise |
| errors | character | Validation messages/errors |
| converted_data | data.frame | Original/cleaned data with numeric conversions (if applicable) |
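The shape of such a validation result can be sketched as follows. The function validate_sketch and its specific checks (required column names, numeric conversion of SCount and Locus) are assumptions for illustration and are not the checks performed by validate_IS_raw().

```r
# Hypothetical validator returning a list with the three documented elements.
validate_sketch <- function(IS_raw) {
  required <- c("Sample", "SCount", "Chr", "Locus")
  errors <- character(0)

  # Report any required column that is absent.
  missing_cols <- setdiff(required, names(IS_raw))
  if (length(missing_cols) > 0) {
    errors <- c(errors, paste("Missing columns:",
                              paste(missing_cols, collapse = ", ")))
  }

  # Try to convert count and position columns to numeric.
  converted <- IS_raw
  for (col in intersect(c("SCount", "Locus"), names(converted))) {
    values <- suppressWarnings(as.numeric(as.character(converted[[col]])))
    if (anyNA(values)) {
      errors <- c(errors, paste("Non-numeric values in column:", col))
    } else {
      converted[[col]] <- values
    }
  }

  list(valid = length(errors) == 0,
       errors = errors,
       converted_data = converted)
}

ok <- validate_sketch(data.frame(Sample = "S1", SCount = "12",
                                 Chr = "chr1", Locus = "100",
                                 stringsAsFactors = FALSE))
bad <- validate_sketch(data.frame(Sample = "S1", SCount = 3))
print(ok$valid)
print(bad$errors)
```

Returning the messages alongside the converted data lets a caller decide whether to stop or to continue with the cleaned data frame.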