RCytoGPS: Working With LGF-Models of Karyotype in R

Introduction

Modern biological experiments increasingly produce binary matrices. These may represent the presence or absence of specific gene mutations, copy number variants, microRNAs, or other molecular or clinical phenomena. We recently developed a tool, CytoGPS (Abrams and colleagues), that converts conventional karyotypes from the standard text-based notation (the International System for Human Cytogenetic/Cytogenomic Nomenclature; ISCN) into a binary vector with three bits (loss, gain, or fusion) per cytoband, which we call the “LGF model”.
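
As a minimal sketch of the idea, the following base R code builds a toy LGF vector by hand. The three cytoband names are only illustrative, and the ordering of the bits (loss, gain, fusion within each band) is an assumption made for this example, not necessarily the layout that CytoGPS uses.

```r
# Toy illustration of an LGF vector (not produced by CytoGPS).
bands  <- c("1p36", "1p35", "1p34")       # illustrative cytoband names
events <- c("Loss", "Gain", "Fusion")

# One bit per (band, event) pair; start with no abnormalities.
lgf <- matrix(0L, nrow = length(bands), ncol = length(events),
              dimnames = list(bands, events))
lgf["1p36", "Loss"]   <- 1L               # a loss affecting band 1p36
lgf["1p34", "Fusion"] <- 1L               # a fusion breakpoint in band 1p34

# Flatten to a single binary vector with three bits per cytoband.
lgfVector <- as.vector(t(lgf))
names(lgfVector) <- paste(rep(bands, each = 3), events, sep = ".")
lgfVector
```

Stacking one such vector per clone gives the kind of binary matrix discussed in the rest of this document.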

The CytoGPS tool is available at the web site http://cytogps.org, where the LGF results of processing karyotype data are returned in JSON format. To complement the web site, we have developed RCytoGPS, an R package to extract, format, and visualize genetic data at the resolution of cytobands. RCytoGPS can parse any JSON file (or set of files) produced by CytoGPS.org.

Setup

In order to extract LGF data from JSON files, you must first load the package.

library(RCytoGPS)

Extracting JSON data and formatting to LGF model

We have included a pair of JSON files produced at CytoGPS.org as examples in the package. These are found in the following directory:

wd <-  system.file("Examples/JSONfiles", package = "RCytoGPS")
dir(wd)
## [1] "CytoGPS_Result1.json" "CytoGPS_Result2.json" "input1.txt"          
## [4] "input2.txt"

The two text files contain the inputs that were uploaded to the web site; the two JSON files contain the outputs. You can specify both the files and the folder that you want to read. The simplest approach is to omit the files argument and read all files in the specified folder (which defaults to the current working directory).

temp <- readLGF(folder = wd)
## Reading 2 file(s) from '/tmp/RtmpAmYA3k/Rinst12b62a13fe22/RCytoGPS/Examples/JSONfiles'.
rm(wd)
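
Because the example folder holds both inputs (.txt) and outputs (.json), it can help to list only the JSON files before deciding which ones to read. The following sketch uses base R on a temporary folder filled with empty stand-in files, so it does not touch the package at all. How the resulting names should be passed through the files argument of readLGF is left as a comment, because the exact form that argument expects is an assumption here and should be checked on the readLGF manual page.

```r
# Create a temporary folder with empty stand-in files (names mimic the example).
demoDir <- file.path(tempdir(), "JSONdemo")
dir.create(demoDir, showWarnings = FALSE)
standIns <- c("CytoGPS_Result1.json", "CytoGPS_Result2.json",
              "input1.txt", "input2.txt")
invisible(file.create(file.path(demoDir, standIns)))

# Keep only the JSON outputs, ignoring the text inputs.
jsonFiles <- list.files(demoDir, pattern = "\\.json$")
jsonFiles

# On a real folder of CytoGPS results, a subset could then be read with
# something like (untested assumption about the argument format):
#   readLGF(files = jsonFiles[1], folder = demoDir)

# Clean up the temporary folder.
unlink(demoDir, recursive = TRUE)
```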

The return value is a list of five elements.

class(temp)
## [1] "list"
names(temp)
## [1] "source"    "raw"       "frequency" "size"      "CL"

The source element documents which JSON file(s) were read.

temp$source
## [1] "CytoGPS_Result1.json" "CytoGPS_Result2.json"

The size element lists the number of rows returned from each file; each row represents a distinct clone.

temp$size
## CytoGPS_Result1 CytoGPS_Result2 
##               6               4

The CL element is a data frame describing the chromosomal locations of each cytoband.

summary(temp$CL)
##    Chromosome    loc.start            loc.end               Band    
##  chr1   : 65   Min.   :        0   Min.   :   300000   p11.1  : 20  
##  chr2   : 64   1st Qu.: 29275000   1st Qu.: 33350000   q11.1  : 17  
##  chr3   : 62   Median : 62950000   Median : 66400000   q21.2  : 17  
##  chr6   : 50   Mean   : 72922120   Mean   : 76480034   q22.2  : 17  
##  chr4   : 47   3rd Qu.:106700000   3rd Qu.:110725000   q22.1  : 16  
##  chr5   : 47   Max.   :243500000   Max.   :248956422   p11.2  : 15  
##  (Other):533                                           (Other):766  
##      Stain    
##  gneg   :417  
##  gpos50 :122  
##  gpos25 : 89  
##  gpos75 : 89  
##  gpos100: 81  
##  acen   : 48  
##  (Other): 22

The raw element is itself a list, containing the binary LGF data for each JSON file processed. Each file produces a “Status” output along with the LGF data. The Status includes both the input karyotype (in ISCN format) and an indicator of whether CytoGPS could successfully process it. In this example, the first karyotype contained an error. As a result, the LGF component does not contain any rows derived from that karyotype. It does, however, contain three rows derived from the second karyotype, since the forward slashes separate the descriptions of three different clones that were detected in that sample.

names(temp$raw)
## [1] "CytoGPS_Result1" "CytoGPS_Result2"
R <- temp$raw[[2]]
names(R)
## [1] "Status" "LGF"
R$Status
##                Status
## RN01 Validation error
## RN02          Success
## RN03          Success
##                                                                         Karyotype
## RN01                                                         46,XY,-8,+12,der(14)
## RN02 48,XY,t(10;13)(q26;q14),+12,+19/47,XY,t(9;13)(p24;q13),-10,+12,+19/45,XY,-13
## RN03                                                    47,XY,+12,dup(14)(q32q32)
dim(R$LGF)
## [1]    4 2750
rownames(R$LGF)
## [1] "2.1.1" "2.2.1" "2.3.1" "3.1.1"
rm(R)
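
To see where the three clones of the second karyotype come from, the following base R sketch splits that karyotype string on forward slashes. This is only a naive illustration: the actual ISCN parsing performed by CytoGPS is far more involved, and the assumption that every "/" separates two clones holds for this example but not for every ISCN string.

```r
# Karyotype RN02, copied from the Status table above.
karyotype <- "48,XY,t(10;13)(q26;q14),+12,+19/47,XY,t(9;13)(p24;q13),-10,+12,+19/45,XY,-13"

# Naive split on "/" (one piece per clone in this example).
clones <- strsplit(karyotype, "/", fixed = TRUE)[[1]]
length(clones)
clones

# The leading field of each clone description is its chromosome count.
chromCount <- as.integer(sub(",.*$", "", clones))
chromCount
```

The three pieces presumably correspond to the rows named "2.1.1", "2.2.1", and "2.3.1" above, with the remaining row coming from the third karyotype.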

Finally, the frequency element contains summary data from each file read. These summaries consist of the frequencies of loss, gain, and fusion events. Each row of this data frame represents a cytoband. There are three columns from each JSON file, one each for loss, gain, and fusion.

freq <- temp$frequency
class(freq)
## [1] "data.frame"
dim(freq)
## [1] 868   6
colnames(freq)
## [1] "CytoGPS_Result1.Loss"   "CytoGPS_Result1.Gain"   "CytoGPS_Result1.Fusion"
## [4] "CytoGPS_Result2.Loss"   "CytoGPS_Result2.Gain"   "CytoGPS_Result2.Fusion"
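
To make the meaning of such a summary concrete, here is a small sketch that computes per-column event frequencies from a toy binary matrix with one row per clone. It assumes that a frequency is simply the fraction of clones carrying the event. The matrix, the band names, and the column layout are all made up, and this is not necessarily how readLGF computes its summaries.

```r
# Toy binary matrix: 4 clones (rows) by 2 cytobands x 3 event types (columns).
toy <- matrix(c(1, 0, 0,  0, 1, 0,
                1, 0, 0,  0, 0, 0,
                0, 0, 1,  0, 1, 0,
                1, 0, 0,  0, 1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("clone", 1:4),
                              c("bandA.Loss", "bandA.Gain", "bandA.Fusion",
                                "bandB.Loss", "bandB.Gain", "bandB.Fusion")))

# Fraction of clones showing each event.
toyFreq <- colMeans(toy)

# Rearrange into one row per cytoband and one column per event type.
freqByBand <- matrix(toyFreq, ncol = 3, byrow = TRUE,
                     dimnames = list(c("bandA", "bandB"),
                                     c("Loss", "Gain", "Fusion")))
freqByBand
```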

Extracting the cytoband locations, and the frequency data

In order to be able to work with the cytoband-level frequency data, we must combine it with the cytoband location data. Here we assemble them into a single data frame.

cytoData <- data.frame(temp[["CL"]], temp[["frequency"]])
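
Note that data.frame() binds its arguments column-wise by position, so this step relies on both tables listing the cytobands in the same order. The toy sketch below illustrates the behavior together with a cheap sanity check. The location column names mirror those shown earlier, but the "Demo" frequency columns and all of the values are purely illustrative.

```r
# Illustrative location and frequency tables with matching row order.
loc <- data.frame(Chromosome = c("chr1", "chr1"),
                  loc.start  = c(0, 2300000),
                  loc.end    = c(2300000, 5300000),
                  Band       = c("p36.33", "p36.32"))
frq <- data.frame(Demo.Loss   = c(0.1, 0.2),
                  Demo.Gain   = c(0.0, 0.3),
                  Demo.Fusion = c(0.0, 0.1))

# Column-wise binding only makes sense when the row counts agree.
stopifnot(nrow(loc) == nrow(frq))
combined <- data.frame(loc, frq)
dim(combined)
colnames(combined)
```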

Turning cytoData into an S4 Object

Next, we transform the cytoData data frame into an S4 object using the function CytobandData. The resulting object will then be used to generate plots and will be available for further analyses.

bandData <- CytobandData(cytoData)

Generating Graphs

Plotting Cytoband Data Along the Genome

The first graph (using barplot) summarizes the frequency data from one data column along the genome. This provides a broad overview of the changes, and can be used to visually contrast the locations of changes in different data sets. Here we use barplot twice, showing losses and gains from the first file.

opar <- par(mfrow=c(2,1))
barplot(bandData, what = "CytoGPS_Result1.Loss", col = "forestgreen")
barplot(bandData, what = "CytoGPS_Result1.Gain", col = "orange")
[Figure: Cytoband-level data along the genome.]

par(opar)
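
The code above follows a common base graphics idiom: par() returns the previous settings when it changes them, so they can be restored afterwards. The sketch below demonstrates this with the default barplot method on made-up numbers, drawing to a null PDF device so that no window or file is created. It does not use the CytobandData method from the package.

```r
# Draw to a null PDF device so the sketch runs without opening a window.
pdf(NULL)
before <- par("mfrow")
opar <- par(mfrow = c(2, 1))      # par() returns the previous settings
during <- par("mfrow")

# Two stacked panels with made-up frequencies.
barplot(c(0.2, 0.5, 0.1), col = "forestgreen", main = "Toy loss frequencies")
barplot(c(0.4, 0.0, 0.3), col = "orange", main = "Toy gain frequencies")

par(opar)                         # restore the earlier panel layout
after <- par("mfrow")
invisible(dev.off())
```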

Plotting Cytoband-Level Data Along One Chromosome

The next graph allows you to simultaneously compare multiple cytogenetic events one chromosome at a time.

datacolumns <- names(temp[["frequency"]])
datacolumns
## [1] "CytoGPS_Result1.Loss"   "CytoGPS_Result1.Gain"   "CytoGPS_Result1.Fusion"
## [4] "CytoGPS_Result2.Loss"   "CytoGPS_Result2.Gain"   "CytoGPS_Result2.Fusion"
image(bandData, what = datacolumns[1:3], chr = 2, labels = TRUE)
[Figure: Vertical stacked barplot of LGF frequencies on chromosome 2 for the first file (CytoGPS_Result1).]

By adding the parameter horiz=TRUE, you can rotate this graph 90 degrees. For more details about the parameters of the image method, see the manual pages and the “gallery” vignette.

Idiograms

We can assemble all of the single-chromosome plots into a single “idiogram” graph that shows all chromosomes at once.

One Data Column

The purpose of this graph is to show the chromosomes alongside a barplot of the cytogenetic abnormalities, in order to observe and possibly identify patterns.

image(bandData, what = datacolumns[1], chr = "all", pal = "orange")
[Figure: Idiogram for one data column.]

More Data Columns

This graph allows the user to compare and contrast two or more cytogenetic events simultaneously. Here we show loss (orange), gain (green), and fusion (purple) events from the first file (CytoGPS_Result1).

image(bandData, what = datacolumns[1:3], chr = "all", 
      pal=c("orange", "forestgreen", "purple"), horiz=TRUE)
[Figure: Idiogram contrasting three data columns.]

Appendix

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RCytoGPS_1.2.7 rmarkdown_2.28
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.5.1          fastmap_1.2.0     xfun_0.49        
##  [5] rjson_0.2.23      maketools_1.3.1   cachem_1.1.0      knitr_1.48       
##  [9] htmltools_0.5.8.1 buildtools_1.0.0  lifecycle_1.0.4   cli_3.6.3        
## [13] sass_0.4.9        jquerylib_0.1.4   compiler_4.4.1    highr_0.11       
## [17] sys_3.4.3         tools_4.4.1       evaluate_1.0.1    bslib_0.8.0      
## [21] yaml_2.3.10       jsonlite_1.8.9    rlang_1.1.4