--- title: "Analyzing non-slendr tree sequences" output: rmarkdown::html_vignette vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{Analyzing non-slendr tree sequences} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include = FALSE} env_present <- slendr:::is_slendr_env_present() #if (Sys.getenv("RUNNER_OS") == "Windows") { # bash_path <- "D:/a/_temp/msys64/usr/bin/bash.exe" #} else { # bash_path <- NULL #} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 5, dpi = 60, eval = slendr:::is_slim_present() && env_present && Sys.getenv("RUNNER_OS") != "Windows"#, #engine.path = list(bash = bash_path) ) nonspatial_slim <- normalizePath(tempfile(), winslash = "/", mustWork = FALSE) nonspatial_trees_file <-normalizePath(tempfile(), winslash = "/", mustWork = FALSE) spatial_slim <- normalizePath(tempfile(), winslash = "/", mustWork = FALSE) spatial_trees_file <- normalizePath(tempfile(), winslash = "/", mustWork = FALSE) msprime_trees_file <- normalizePath(tempfile(), winslash = "/", mustWork = FALSE) Sys.setenv(NONSPATIAL_SLIM = nonspatial_slim) Sys.setenv(NONSPATIAL_TREES = nonspatial_trees_file) Sys.setenv(SPATIAL_SLIM = spatial_slim) Sys.setenv(SPATIAL_TREES = spatial_trees_file) Sys.setenv(MSPRIME_TREES = msprime_trees_file) ``` In previous vignettes we have demonstrated the *slendr* R interface for defining and executing msprime and SLiM population genetic models. We have also provided an overview for its interface to the tree-sequence processing library [tskit](https://tskit.dev). However, although its model-definition interface is quite convenient, *slendr* cannot (and never will) support *every* possible msprime or SLiM. model The array of features provided by these simulation frameworks is simply to big and implementing an R interface to every single one of them would not make much sense. That said, the tree-sequence outputs produced by "pure" (i.e. non-slendr) msprime and SLiM scripts are no different from those generated by *slendr* models themselves. For users who would rather use their own simulation scripts but find *slendr*'s tskit R interface (not its simulation interface) appealing, the R package provides a possibility to load, process, visualize and analyze standard tree sequences which were not generated by *slendr* itself. In this vignette we give a brief overview of how this works, using example tree sequences produced by two very simple simulation scripts written in pure SLiM and msprime (i.e. scripts which don't utilize *slendr*'s spatial features, symbolic names of simulated individuals, and automated translation of time units). We won't be going into detail explaining how those scripts work, because we assume this functionality will be used by those already familiar with msprime or SLiM. Similarly, we won't cover the R-tskit functionality of *slendr* in detail either, simply because the contents of this vignette is already covered by [other tutorials provided by *slendr*](index.html). ```{r, message = FALSE} library(slendr) library(dplyr) library(magrittr) library(ggplot2) init_env() SEED <- 42 set.seed(SEED) ``` #### Non-spatial SLiM tree sequences Consider the following SLiM script, which creates a couple of populations (with different $N_e$) splitting from an ancestral population `p1` (lets save it to `/tmp/nonspatial.slim`): ```{bash, include = FALSE} cat < ${NONSPATIAL_SLIM} initialize() { setSeed(42); initializeTreeSeq(); initializeMutationRate(0); initializeMutationType("m1", 0.5, "f", 0.0); initializeGenomicElementType("g1", m1, 1.0); initializeGenomicElement(g1, 0, 1e6); initializeRecombinationRate(1e-8); } 1 early() { sim.addSubpop("p1", 10); } 1000 early() { sim.addSubpopSplit("p2", 500, p1); } 3000 early() { sim.addSubpopSplit("p3", 2500, p1); } 5000 early() { sim.addSubpopSplit("p4", 10000, p1); } 6000 late() { sim.treeSeqOutput("${NONSPATIAL_TREES}"); } EOF ``` ```{bash, echo = FALSE, comment = ""} less ${NONSPATIAL_SLIM} ``` ```{bash, include = FALSE} slim ${NONSPATIAL_SLIM} ``` After we run this script in SLiM, we can use *slendr* to load the output tree sequence (saved to `/tmp/nonspatial-slim.trees`), simplify it, and overlay mutations on it using the [standard functionality](vignette-05-tree-sequences.html#loading-and-processing-tree-sequence-output-files) originally developed for *slendr* tree sequences. Note that this is the same command we would use for loading *slendr* tree sequences, except we direct the `ts_read()` function straight to the tree-sequence output file rather than using the `ts_read()` format used when working with standard *slendr* simulations. This way, *slendr* ```{r} ts <- ts_read(nonspatial_trees_file) %>% ts_simplify() %>% ts_mutate(mutation_rate = 1e-7, random_seed = SEED) ``` We can extract information about individual's names, nodes, population assignments, etc. just as with any *slendr* tree sequence with the function `ts_nodes()`. As with standard *slendr* models, this function loads the "raw" node and individual [tree-sequences tables](https://tskit.dev/tutorials/tables_and_editing.html), performs a couple of join operations, and presents the whole thing in a nice unified form for interactive data analysis (which can also include spatial information—see below): ```{r} data <- ts_nodes(ts) %>% dplyr::filter(sampled) data ``` Moving on to tskit statistics, we can use the data table above to extract a list of nodes belonging to each population (this is what various tskit tree-sequence statistics operate on, and *slendr* follows that design). Here we are computing the nucleotide diversity in each of the four populations using the `ts_diversity()` function, by first creating a list of lists with node IDs (i.e. chromosomes) of individuals in each population: ```{r} sample_sets <- split(data$node_id, data$pop) # compute nucleotide diversity in each population # (any other ts_*() tskit R interface function should work) ts_diversity(ts, sample_sets) ``` Just as with *slendr* tree sequences (as demonstrated in our [paper](https://www.slendr.net/articles/vignette-09-paper.html#example-4-figure-5)) we can get a individual trees too, extracted in the in the phylogenetic format provided by the [ape](https://CRAN.R-project.org/package=ape) R package. Here we first simplify the tree sequence even further to just 10 nodes to make things manageable: ```{r} samples <- sample(data$node_id, 10) ts_small <- ts_simplify(ts, simplify_to = samples) # extract the 42nd tree in the genealogy to an R 'phylo' format tree <- ts_phylo(ts_small, 42 - 1) tree ``` Once we have that R tree object, we can use packages like [ggtree](https://yulab-smu.top/treedata-book/) to visualize the tree (any other phylogenetic package would work too). Note that because nodes of 'ape phylo' trees must conform to a strict format (they must be labelled `1...N`), we will extract the information about the node IDs in the tskit tree-sequence data to be able to plot them in the tree. ```{r, nonslendr_tree, eval = slendr:::is_slim_present() && env_present && Sys.getenv("R_HAS_GGTREE") == TRUE && Sys.getenv("RUNNER_OS") != "Windows"} library(ggtree) labels <- ts_nodes(tree) %>% select(node = phylo_id, tskit_id = node_id) ggtree(tree, branch.length = "none") %<+% labels + geom_label(aes(label = tskit_id)) ``` ```{r, eval = slendr:::is_slim_present() && env_present && Sys.getenv("R_HAS_GGTREE") != TRUE && Sys.getenv("RUNNER_OS") != "Windows"} library(ape) plot(tree, show.tip.label = FALSE) nodelabels() tiplabels() ``` #### msprime (non-slendr) tree sequences The same as above applies also to msprime tree sequences (which is really not that surprising, given that it's all tskit under the hood). We can start with Python: ```{python, eval = FALSE} import msprime ts = msprime.sim_ancestry(100) ts.dump() ``` ```{r, include = FALSE} reticulate::py_run_string(sprintf("import msprime; msprime.simulate(100).dump('%s')", msprime_trees_file)) ``` And then we can proceed with loading the msprime tree sequence into R and analyze it with the slendr functionality: ```{r} ts <- ts_read(msprime_trees_file) ts_nodes(ts) ``` #### Spatial SLiM (non-slendr) tree sequences Furthermore, the generalized interface also supports *slendr*'s [spatial tree-sequence features](https://www.slendr.net/articles/vignette-06-locations.html), with all the [bells and whistles](https://www.slendr.net/articles/vignette-09-paper.html#example-4-figure-5). For instance, lets take the following spatial SLiM script (modified from the SLiM manual) and execute it with SLiM the usual way: ```{bash, include = FALSE} cat < ${SPATIAL_SLIM} initialize() { setSeed(42); initializeSLiMOptions(keepPedigrees=T, dimensionality="xy"); initializeTreeSeq(); initializeMutationRate(1e-7); initializeMutationType("m1", 0.5, "f", 0.0); initializeGenomicElementType("g1", m1, 1.0); initializeGenomicElement(g1, 0, 1e6); initializeRecombinationRate(1e-8); } 1 early() { sim.addSubpop("p1", 500); // initial positions are random in ([0,1], [0,1]) p1.individuals.x = runif(p1.individualCount); p1.individuals.y = runif(p1.individualCount); } modifyChild() { // draw a child position near the first parent, within bounds do child.x = parent1.x + rnorm(1, 0, 0.02); while ((child.x < 0.0) | (child.x > 1.0)); do child.y = parent1.y + rnorm(1, 0, 0.02); while ((child.y < 0.0) | (child.y > 1.0)); return T; } 1: late() { sim.treeSeqRememberIndividuals(sim.subpopulations.individuals, permanent = F); } 10000 late() { sim.treeSeqOutput("${SPATIAL_TREES}"); } EOF ``` ```{bash, echo = FALSE, comment = ""} less ${SPATIAL_SLIM} ``` ```{bash, include = FALSE} slim ${SPATIAL_SLIM} ``` We can then load and simplify the output tree sequence in just as we did above in this vignette (or anywhere in the *slendr* documentation): ```{r} ts <- ts_read(spatial_trees_file) %>% ts_simplify() ``` Finally, we can access the spatio-temporal data embedded in the output tree sequence in the standard *slendr* way (note the spatial [*sf*](https://r-spatial.github.io/sf/) column `location` with the `POINT` data type): ```{r} data <- ts_nodes(ts) data ``` Because we get the tree sequence converted to the spatial [_sf_](https://r-spatial.github.io/sf/) data format, we can use standard geospatial packages to use any spatial data analysis methods that those packages provide. To briefly demonstrate what this means, we can trivially plot the location of each recorded node: ```{r, nonslendr_locations} ggplot() + geom_sf(data = data, aes(color = time), alpha = 0.5) ``` We can also collect spatio-temporal ancestry information of a particular node (i.e. the times and locations of all of its ancestors all the way to the root, with each "link" in the plot signifying parent-child edge somewhere along the tree sequence) and plot it on a 2D surface (x and y dimensions [0, 1]). The plot is a little chaotic, but hopefully conveys the idea (the "focal node" 0 is highlighted in red). This is essentially the same plot we have in [the last figure of our paper](https://www.slendr.net/articles/vignette-09-paper.html#example-4-figure-5). ```{r, nonslendr_ancestries} ancestral_links <- ts_ancestors(ts, 0) ggplot() + geom_sf(data = ancestral_links, size = 0.5, aes(alpha = parent_time)) + geom_sf(data = sf::st_set_geometry(ancestral_links, "parent_location"), aes(color = parent_time)) + geom_sf(data = data[data$node_id == 0, ], size = 3, color = "red") ``` ## Conclusion In this vignette we gave a brief overview of using *slendr*'s [R-tskit interface](../reference/index.html#tree-sequence-loading-and-processing) for loading, processing, and analyzing "pure" non-*slendr* tree sequences produced by msprime and SLiM scripts. Although we have only touched upon the most basic features of its R-tskit interface for standard tree sequences, it is important to note that as far as *slendr* is concerned, it does not matter how a tree sequence was produced, as long as it conforms to the tskit specification. This means that regardless of the source of your tree sequence data, you should be able to use *slendr*'s tskit functionality to run your analyses.