Usage of GeRnika

Introduction

This is a demo for using the GeRnika R package. This document contains examples to help any user to understand the usage of the functionalities offered by the package, which include the simulation of tumor clonal data and the visualization and comparison of tumor phylogenies.

Simulating tumor clonal data

The simulation of tumor clonal data consists of simulating a matrix F that contains the mutation frequency values of a series of mutations in a collection of tumor biopsies or samples. The matrix F is calculated as the product of a matrix B that represents the phylogeny of the tumor, and a matrix U, which contains the clone proportions in each particular sample of that tumor.

Tumor data can be simulated through the create_instance function. The information about its parameters and their usage may be checked in the following table:

Parameter Description Type
n Number of mutations/clones (n). Natural number
m Number of samples (s). Natural number
k Topology parameter that controls for the linearity of the topology. Positive rational number
selection Evolution model followed by the tumor. Categorical: “positive” or “neutral”
noisy Add sequencing noise to the values in the F matrix. Boolean
depth (only if noise = TRUE) Average number of reads that map to the same locus, also known as sequencing depth. Natural number
seed Seed for the pseudo-random topology generator Real number

The following is an example of the generation of a noise-free instance of a tumor that is composed of 5 clones/mutations that has evolved under neutral evolution and has a k value of 0.5. 5 samples have been taken from it:

I <- create_instance(n = 5, m = 4, k = 0.5, selection = "neutral", seed = 1)

This method returns the previously mentioned F, B and U matrices and an additional Ftrue matrix, which we will describe later.

Once we have shown an example of the instantiation of a tumor, we will analyze the effect of varying the values of the parameters.

Topology parameter k

k is the parameter that controls for the linearity of the topology of the tumor. As a result, increasing values of k lead to rather branched phylogenies, while lower values of k produce trees that tend to linearity.

# Simulate a tumor with k=0:
I1 <- create_instance(n = 5, m = 4, k = 0, selection = "neutral", seed = 1)

# Simulate a tumor with k=1:
I2 <- create_instance(n = 5, m = 4, k = 8, selection = "neutral", seed = 1)

# Create a `Phylotree` class object for each tumor:
tree1 <- B_to_phylotree(B = I1$B)
tree2 <- B_to_phylotree(B = I2$B)

# Plot both trees
DiagrammeR::render_graph(DiagrammeR::combine_graphs(data.tree::ToDiagrammeRGraph(tree1@tree), data.tree::ToDiagrammeRGraph(tree2@tree)))

On the left the plot of tree1 (k=0), on the right the plot of tree2 (k=8).

Following the above, the tree on the left is fully branched as it is composed by a root connected to all the leaves of the tree. On the right side we can see a more linear tree, with just two main branches.

After analyzing the effect of parameter k in the generation of tumor data, we will proceed to analyze the effect of the tumor evolution model.

Tumor evolution model

With this parameter we control for the evolution model the tumor follows, either positive selection or neutral evolution. A positive selection-driven evolution model assumes that a few mutations provide cell growth advantage, whereas the remaining mutations do not. Conversely, neutral evolution models assume that, in general, none of the mutations provide significant fitness advantage. As a consequence, tumors under positive selection are dominated by a few clones whereas the rest of clones are present in small proportions. Instead, in tumors with a neutral evolution, all the clones are present in similar proportions.

The clone proportions are well observed in the U matrix, as this indeed contains the fraction of each clone in each sample of the tumor. Below, we can see this effect with a few examples:

# Function to create the heatmap of the U matrix
U_to_heatmap <- function(U, values = TRUE, col_names = c("samples", "clones", "proportion")){
  Upos <-reshape2::melt(U)
  colnames(Upos) <- col_names
  Var1 <- col_names[1]
  Upos<-ggplot(Upos, aes(x = samples, y = clones, fill = proportion)) + geom_tile(col = "black") +    
    theme(
         panel.background = element_rect(fill = 'transparent'),
         plot.background = element_rect(fill = 'transparent', color = NA),
         panel.grid.major = element_blank(),
         panel.grid.minor = element_blank(),
         legend.background = element_rect(fill = 'transparent'),
         legend.box.background = element_rect(fill = 'transparent')
    ) + scale_fill_gradient(limits = c(0.0000000000001, 1)) + theme(axis.title.x = element_blank(),
        axis.ticks.x = element_blank(), 
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank())
  if(values){
    Upos <- Upos + geom_text(aes(label = proportion), size = 4)  
  }
  Upos
}

# simulate a tumor with neutral selection:
Ipos <- create_instance(n = 5, m = 8, k = 0.5, selection = "positive", seed = 1)

# simulate a tumor with positive selection:
Ineu <- create_instance(n = 5, m = 8, k = 0.5, selection = "neutral", seed = 1)

Upos <- U_to_heatmap(Ipos$U)
Uneu <- U_to_heatmap(Ineu$U)

ggarrange(plotlist = list(Upos, Uneu), ncol = 1, nrow = 2)
Heatmaps of the $oldsymbol{U}$ matrices of an instance of a tumor under positive selection (top) and neutral evolution (bottom).

Heatmaps of the $oldsymbol{U}$ matrices of an instance of a tumor under positive selection (top) and neutral evolution (bottom).

In that figure, we may see that in the tumor instance that evolves under neutral evolution, the majority of clones are present in all the samples, with only a few exceptions. Most of the clones have fraction values between 0.05 and 0.4. Conversely, the biggest fraction of the tumor instance under positive selection is taken by a few clones, in particular, by clone 3, and by clones 1 and 2 in a lower proportion. In addition, more clones than in the neutral evolution instance are absent; for instance, clone 5 is even missing in all the samples.

Once we have analyzed the difference between the neutral evolution and positive selection-driven evolution models, we will show how can noisy instances be generated and compare them to noise-free instances.

Sequencing noise

GeRnika has a functionality to add sequencing noise to the simulated instances. In practice, this is done by adding a few parameters to the create_instance function. Sequencing noise is added on top of the noise-free F matrix, and the U and B matrices suffer no changes. The F slot in the resulting object contains the noisy F matrix, while the F_true slot consists of the original noise-free F matrix. Note that these two slots contain equal matrices when no sequencing noise is added.

Down below we compare an instance we have added sequencing noise to, with its noise-free counterpart.

# Function to create a heatmap of F
F_to_heatmap <- function(U, values = TRUE, col_names = c("samples", "mutations", "VAF")){
  Upos <-reshape2::melt(U)
  colnames(Upos) <- col_names
  Var1 <- col_names[1]
  Upos<-ggplot(Upos, aes(x = samples, y = mutations, fill = VAF)) + geom_tile(col = "black") +    
    theme(
         panel.background = element_rect(fill = 'transparent'),
         plot.background = element_rect(fill = 'transparent', color = NA),
         panel.grid.major = element_blank(),
         panel.grid.minor = element_blank(),
         legend.background = element_rect(fill = 'transparent'),
         legend.box.background = element_rect(fill = 'transparent')
    ) + scale_fill_gradient(limits = c(0.00000000000000000001, 1)) + theme(axis.title.x = element_blank(),
        axis.ticks.x = element_blank(), 
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank()) 
  if (values) {
    Upos <- Upos + geom_text(aes(label = round(VAF, digits = 2)), size = 4)  
  }
  Upos
}

# Simulate a tumor with sequencing noise added:
Inoisy <- create_instance(m = 5, n = 8, k = 0.5, selection = "neutral", noisy = TRUE, depth = 5, seed = 1)

Fnoise <- F_to_heatmap(abs(Inoisy$F - Inoisy$F_true))

ggarrange(Fnoise)
The effect of noise.

The effect of noise.

The heatmap from above shows the difference between the F matrix and the Ftrue matrix of a tumor instance, i.e. the noise added to the original VAF values of our samples. The amount of noise added is controlled by the depth parameter, which replicates the effect that the sequencing depth has on the noise level. This is explained in more detail in the following subsection.

Finally, the depth parameter represents the sequencing read depth, that is, the average number of reads that map to the same locus (section of the genome). Therefore, the higher the sequencing depth, the most accurate the VAF values and thus, the lower the noise will be.

After analyzing the parameters for simulating tumor clonal data, we will present the basis of the Phylotree S4 class.

The Phylotree S4 class

In this section, we will be using the Phylotree class for the purpose of visualizing phylogenetic trees on the basis of simulated tumor clonal data. The Phylotree S4 class is a structure that provides facilities for constructing tumor phylogenetic trees. As every S4 R class, the Phylotree class is composed of various attributes that are essential for building a tumor phylogenetic tree in an eficient way. Beware that the users of this package will not need to manipulate the parameters of the Phylotree class as some of these exist only to reduce the computational cost of the visualization of phylogenetic trees and their comparison.

Instantiation of Phylotree class objects

The mutations of each clone in a tumor sample are represented in an n x n binary genotype B matrix. Each bi row of the B matrix represents the mutations in clone vi. Note that we just need a B matrix of a tumor simulation in order to instantiate a new Phylotree class object.

Now, we will show an example of the instantiation of a Phylotree class object:

# Simulate a tumor with 5 clones, 4 samples, k=0.5:
instance <- create_instance(n = 5, m = 4, k = 0.5, selection = "neutral", noisy = FALSE, seed = 1)

# The creation of the Phylotree class object using the previously generated B matrix:
phylotree <- B_to_phylotree(B = instance$B)

The B_to_phylotree method takes the B matrix of a tumor simulation as its main argument and calculates the values for the other attributes present in Phylotree class objects. After instantiating the new Phylotree object, we can visualize it using the generic plot method for this class:

plot(phylotree)

Phylogenetic tree composed by 5 nodes.

Instead of clone numbers, the user can use any other labelling for the nodes in the tree. The example below illustrates how this can be done:

# Create a vector with the tags we want to use (the length of the 
# vector must be equal to the number of clones in the phylogenetic tree):
tags <- c("mut1", "mut2", "mut3", "mut4", "mut5")

# Create the Phylotree class object and include the node labels:
phylotree <- B_to_phylotree(B = instance$B, labels = tags)

After creating the Phylotree class object, we may render it with the custom labels in the following way:


plot(phylotree, labels = TRUE)

Phylogenetic tree of 5 clones with custom tags. The tags in the vector are assigned to the nodes in the tree according to their order. For instance, the first clone in the previous plot is now represented with the first label of the tag vector mut1.

Users are encouraged to use only the B_to_phylotree method for instantiating Phylotree class objects, as the parameters of the class must fulfill various restrictions to work properly.

Resizing nodes to clone proportions

When plotting a Phylotree class object, its nodes can be resized according to some given proportions of the clones that compose the tumor samples. To do this, the plot_proportions function takes a phylotree class object and an additional U matrix determining the proportions of the clones.

# Simulate a tumor
instance <- create_instance(n = 5, m = 4, k = 0.5, selection = "neutral", noisy = FALSE, seed = 1)

# Use the B matrix to instantiate a Phylotree class object
tree <- B_to_phylotree(B = instance$B)

# Plot the phylogenetic tree and resize the nodes according to the U matrix
plot_proportions(tree, instance$U)

Phylogenetic tree of 5 clones using proportions. In this case, there are 4 plots because we have generated an instance based on 4 samples. Then, each tree represents the proportions of the clones in each sample.

Again, in this case we can also use custom labels for the nodes:

tags <- c("A", "B", "C", "D", "E")

# Simulate a tumor
instance <- create_instance(n = 5, m = 4, k = 0.5, selection = "neutral", noisy = FALSE, seed = 1)

# Use the B matrix of the previously generated tumor for creating a Phylotree class object
tree<-B_to_phylotree(B = instance$B, labels = tags)

# Plot the phylogenetic tree, resize the nodes according to the U matrix and include the node labels
plot_proportions(tree, instance$U, labels = TRUE)

Phylogenetic tree of 5 clones using proportions and tags.

Note that the proportions parameter can be a matrix or a vector. If a matrix is given to the function, this method will generate one plot per row in the matrix. Conversely, if a vector is given, a single plot will be generated.

Once we have shown the usage of the methods for instantiating Phylotree class objects and the procedures these can be visualized by, we will present the functions for comparing different tumor phylogenies.

Comparing and combining phylogenetic trees

GeRnika presents different functionalities for comparing tumor phylogenies. In order to show them, we will use the B_mats object included in the GeRnika package. This contains 10 B matrix trios. Each trio corresponds to a different instance, and the B matrices arise in the course of solving the Clonal Deconvolution and Evolution Problem (CDEP) for that instance, as follows:

These matrices are based on the solution of various instances of the Clonal Deconvolution and Evolution Problem given by the ILS and GRASP methods. This trios consist of the following slots:

  • B_true: The real B matrix of a tumor.
  • B_Grasp: A B matrix generated througha greedy, randomized adaptive heuristic strategy (GRASP) given an observed F matrix for the Btrue matrix
  • B_opt: The optimal B matrix found by an iterated local search approach given an observed F matrix for the Btrue matrix.

The example below loads and plots the 3 B matrices of an instance in B_mats:

# Load the predefined B matrices of the package:
B_mats <- GeRnika::B_mats

# Get one of the B matrix trios to compare them:
B_real <- B_mats[[2]]$B_real
B_opt <- B_mats[[2]]$B_opt
B_grasp <- B_mats[[2]]$B_grasp

#Create the list of the tags for the clones that compose the phylogenetic trees:
tags <- c("mut1", "mut2", "mut3", "mut4", "mut5", "mut6", 
            "mut7", "mut8", "mut9", "mut10")

#Create the Phylotree class objects using the previously loaded B matrices:
phylotree_real <- B_to_phylotree(B_real, labels = tags)
phylotree_opt <- B_to_phylotree(B_opt, labels = tags)
phylotree_grasp <- B_to_phylotree(B=B_grasp, labels=tags)


#Render both trees:
DiagrammeR::render_graph(DiagrammeR::combine_graphs(data.tree::ToDiagrammeRGraph(phylotree_real@tree),DiagrammeR::combine_graphs(data.tree::ToDiagrammeRGraph(phylotree_grasp@tree),data.tree::ToDiagrammeRGraph(phylotree_opt@tree))))
Phylogenetic trees according to a trio of B matrices in B_mats.

As these trees above relate to the same tumor instance, it is reasonable that they may present some commonalities. Now, we will show the three different methods offered by the GeRnika package for comparing tumor phylogenies.

The equals method

The three phylogenetic trees above are not equal. For example, phylotree_real and phylotree_opt are not equal as some of the edges of phylotree_real do not exist in phylotree_opt and the other way around.

The equivalence between two phylogenetic trees may be checked by using the equals method as follows:

# Checking if phylotree_real is equal to itself:
equals(phylotree_1 = phylotree_real, phylotree_2 = phylotree_real)
#> [1] TRUE

# Checking if phylotree_real and phylotree_opt are equal:
equals(phylotree_1 = phylotree_real, phylotree_2 = phylotree_opt)
#> [1] FALSE

Equal phylogenetic trees, by definition, are composed by the same nodes, connected by the same edges. As a result, this method returns TRUE when phylotree_real is compared with itself. Likewise, as phylotree_real and phylotree_opt are not equal, this method returns FALSE when we check whether they are equal or not.

Nevertheless, even though two trees may not be equal, they may have common subtrees.

The find_common_subtrees method

The find_common_subtrees function calculates and plots all the maximal common subtrees between two phylogenetic trees. Moreover, it also outputs the number of common and independent (the ones not shared by both trees) edges of the trees. Finally, the method outputs the distance between the trees, which is computed as the sum of their independent edges.

find_common_subtrees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_grasp)
#> Independent edges of tree1: 6
#> Independent edges of tree2: 6
#> Common edges: 3
#> Distance: 12
The maximal common subtrees between phylotree_real and phylotree_grasp.
find_common_subtrees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_opt)
#> Independent edges of tree1: 3
#> Independent edges of tree2: 3
#> Common edges: 6
#> Distance: 6
The maximal common subtrees between phylotree_real and phylotree_opt.

We observe that there are two maximal common subtrees between phylotree_real and phylotree_grasp. Contrariwise, phylotree_real and phylotree_opt present a single maximal common subtree that covers the biggest part of both trees.

As before, we can add custom labels to the trees that result from this function call:

find_common_subtrees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_grasp, labels = TRUE)
#> Independent edges of tree1: 6
#> Independent edges of tree2: 6
#> Common edges: 3
#> Distance: 12
The maximal common subtrees between phylotree_real and phylotree_grasp labelled using custom tags.
find_common_subtrees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_opt, labels = TRUE)
#> Independent edges of tree1: 3
#> Independent edges of tree2: 3
#> Common edges: 6
#> Distance: 6
The common subtrees between phylotree_real and phylotree_opt labelled using custom tags.

According to the plots from above, we conclude that phylotree_real shares more edges with phylotree_opt than it does with phylotree_grasp.

GeRnika offers an additional method to compare two phylogenetic trees: a consensus tree that shows the union of the nodes and edges of both trees.

The combine_trees method

The combine_trees function creates a tree that results from the union of the nodes and edges of two trees. We name this the consensus tree. In this, the nodes and the edges that compose the common subtrees between the original trees are blue. In addition, yellow edges denote to the independent edges of the tree passed as the first parameter of the method, while orange edges represent the independent edges of the second tree. Note that the independent edges of both trees are presented with more translucent colors.

# Creating the consensus tree between phylotree_real and phylotree_grasp
consensus_real_grasp <- combine_trees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_grasp)


# Creating the consensus tree between phylotree_real and phylotree_opt
consensus_real_opt <- combine_trees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_opt)

The consensus tree can be plotted by using the function render_graph of the DiagrammeR R package.

# Rendering the consensus between phylotree_real and phylotree_grasp
DiagrammeR::render_graph(consensus_real_grasp)
The consensus tree between phylotree_real and phylotree_grasp.
# Rendering the consensus between phylotree_real and phylotree_opt
DiagrammeR::render_graph(consensus_real_opt)
The consensus tree between phylotree_real and phylotree_opt.

The above figures present the consensus tree between phylotree_real and phylotree_opt and the consensus tree between phylotree_real and phylotree_grasp, respectively.

Users can customize both the labels and the colors of the nodes. The latter can be done by providing a custom palette –a vector containing the hexadecimal code of various colors– of 3 elements such as c("#AAAAAAAA","#BBBBBBBB","#CCCCCCCC"). Additionally, the user can use one of the color palettes included in GeRnika: Lancet, NEJM and Simpsons (default).

# Load one of the default palettes of the package:
palette <- GeRnika::palettes$Lancet

# Create the consensus tree between phylotree_real and phylotree_opt 
# using clone tags and the custom palette:
consensus <- combine_trees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_opt, 
                           labels = TRUE, palette = palette)

# Render the new consensus phylogenetic tree:
DiagrammeR::render_graph(consensus)
The consensus tree between phylotree_real and phylotree_opt using tags and a selected color palette.

Session information

This is the information of the system this document was compiled on:

sessionInfo(package = NULL)
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] knitr_1.49    ggpubr_0.6.0  ggplot2_3.5.1 GeRnika_1.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6         xfun_0.49            bslib_0.8.0         
#>  [4] htmlwidgets_1.6.4    visNetwork_2.1.2     rstatix_0.7.2       
#>  [7] vctrs_0.6.5          tools_4.4.2          generics_0.1.3      
#> [10] tibble_3.2.1         fansi_1.0.6          RefManageR_1.4.0    
#> [13] pkgconfig_2.0.3      RColorBrewer_1.1-3   lifecycle_1.0.4     
#> [16] farver_2.1.2         compiler_4.4.2       stringr_1.5.1       
#> [19] munsell_0.5.1        data.tree_1.1.0      knitcitations_1.0.12
#> [22] carData_3.0-5        htmltools_0.5.8.1    sys_3.4.3           
#> [25] buildtools_1.0.0     sass_0.4.9           yaml_2.3.10         
#> [28] Formula_1.2-5        pillar_1.9.0         car_3.1-3           
#> [31] jquerylib_0.1.4      tidyr_1.3.1          cachem_1.1.0        
#> [34] abind_1.4-8          tidyselect_1.2.1     digest_0.6.37       
#> [37] stringi_1.8.4        dplyr_1.1.4          reshape2_1.4.4      
#> [40] purrr_1.0.2          labeling_0.4.3       maketools_1.3.1     
#> [43] cowplot_1.1.3        bibtex_0.5.1         fastmap_1.2.0       
#> [46] grid_4.4.2           colorspace_2.1-1     cli_3.6.3           
#> [49] magrittr_2.0.3       DiagrammeR_1.0.11    utf8_1.2.4          
#> [52] broom_1.0.7          withr_3.0.2          scales_1.3.0        
#> [55] backports_1.5.0      lubridate_1.9.3      timechange_0.3.0    
#> [58] rmarkdown_2.29       httr_1.4.7           ggsignif_0.6.4      
#> [61] evaluate_1.0.1       rlang_1.1.4          Rcpp_1.0.13-1       
#> [64] glue_1.8.0           xml2_1.3.6           rstudioapi_0.17.1   
#> [67] jsonlite_1.8.9       R6_2.5.1             plyr_1.8.9