--- title: "Functions for package developers" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Functions for package developers} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} library(strollur) set.seed(19760620) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` *strollur* includes several functions designed for package developers. These functions provide additional access to the back end data and allow developers to merge, remove and set data directly. The function names begin with *'xdev_'* or *'xint_'* to clearly indicate that they are designed for development or for internal use, and not for the general user. ### Under The Hood The `strollur` object is an R6 object to keep the memory usage low. An R6 class is inherently passed by reference rather than by value. You can make a deep copy of your dataset using the `copy_dataset()` function. Note, if you use an assignment operator to copy it's a shallow copy. The `strollur` object has several public fields, but I'd like to bring special attention to the *data* field. *data* is an Rcpp external pointer (safe pointer) to 'Dataset' c++ class (class definitions found in stroller.h). This format allows package developers an easy access point to the underlying C++ code with additional functionality. While many of the most frequently used functions have exported Rcpp functions, you can also write your own functions to access any public function in the 'Dataset' class. ### Setting functions Over the course of an analysis your package may want to change the abundances of sequences, modify sequence nucleotide strings due to alignment, screening or filtering, and change the data set name. Let's take a close look at functions you will need to do so using the `miseq_sop_example()` data set. ```{r} miseq <- miseq_sop_example() miseq ``` #### Setting the dataset name To change the data set name you can use the `xdev_set_dataset_name()` function. ```{r} names(miseq, type = "dataset") xdev_set_dataset_name(miseq, dataset_name = "modified_miseq") names(miseq, type = "dataset") ``` #### Setting nucleotide sequence strings The miseq example data set is already aligned, filtered and screened, but for the sake of example let's set the nucleotide strings of sequences *exclusive* to sample 'F3D0' to 'NNNN'. First we will get the names of the sequences only present in sample 'F3D0'. ```{r} f3d0_names <- names( miseq, type = "sequence", samples = c("F3D0"), distinct = TRUE ) length(f3d0_names) ``` Now we will assign nucleotide strings of the sequences *exclusive* to sample 'F3D0' to 'NNNN'. Then we can use the `xdev_get_by_sample()` get all the sequences in present in sample 'F3D0' for closer inspection. ```{r} nnnn_neucleotides <- rep("NNNN", length(f3d0_names)) comments <- rep("example_set_sequences", length(f3d0_names)) xdev_set_sequences( miseq, sequence_names = f3d0_names, sequences = nnnn_neucleotides, comments = comments ) f3d0_sequences <- xdev_get_by_sample( miseq, type = "sequence", samples = c("F3D0") ) f3d0_sequences[[1]][300:305] ``` #### Setting sequence abundances Now that we know how to set sequence strings, let's learn how to set sequence abundances. In most cases you can use the `assign()` function to set sequence abundances, but there may be cases where you want more flexibility. The assign function requires abundances to be provided for each sequence. If you only need to update a subset of the sequences, there are two functions to that allow you to do that, `xdev_set_abundances()` and `xdev_set_abundance()`. *xdev_set_abundances* is used with data sets that include samples, *xdev_set_abundance* is used for data sets without samples. Let's set the abundances of all sequences *exclusive* to sample F3D0 to 0, in essence removing them. First we will get the abundance table for the whole data set and create the inputs for `xdev_set_abundances()`. ```{r} abundance_table <- abundance( miseq, type = "sequence", by_sample = TRUE ) head(abundance_table, n = 10) num_samples <- count(miseq, type = "sample") new_abunds <- rep(list(rep(0, num_samples)), length(f3d0_names)) xdev_set_abundances( miseq, sequence_names = f3d0_names, abundances = new_abunds, reason = "F3D0_exclusive_sequences" ) miseq ``` You can see that not all of the sequences in 'F3D0' are removed. That is because the majority of sequences in the data set are present in multiple samples. mothur2 uses xdev_set_abundances when running the pre_cluster() function, sequences are clustered by sample, and the abundances are set accordingly. ### Merging Functions In the course of analyzing your data, there may be times when you want to merge sequence abundances. When sequence nucleotide string are identical you can save time processing by merging the identical read's abundances. mothur2 merges sequences abundances in the unique_seqs function. The miseq example includes merged sequences that have been assigned to bins. In practice, you would would not merge sequences after assigning them to bins. Note, the code below is set to 'eval=FALSE' because you can only merge sequences assigned to the same bin and it will throw errors accordingly. Let's take a look at how to use the `xdev_merge_sequences()` function for your reference. ```{r, eval=FALSE} random_sequence_names <- sample(names(miseq), size = 100, replace = FALSE ) xdev_merge_sequences( miseq, sequence_names = random_sequence_names, reason = "merge_sequences_example" ) ``` Similarly, you may want to merge bins and the `xdev_merge_bins()` function will allow you to do so. For this example we will randomly select 100 'otu' bins to merge. ```{r} random_bin_names <- sample( names( miseq, type = "bin", bin_type = "otu" ), size = 100, replace = FALSE ) xdev_merge_bins( miseq, bin_names = random_bin_names, reason = "merge_bins_example", bin_type = "otu" ) miseq ``` You can see that the sample totals, total sequence and unique sequences remain the same but the number of otu bins is reduced by 99. ### Removing Functions Over the course of your analysis, you may want to remove sequences, bins, samples or contaminants. strollur has several functions to help with that, namely `xdev_remove_sequences()`, `xdev_remove_bins()`, `xdev_remove_samples()`, and `xdev_remove_lineages()`. #### Removing Sequences There are several reasons you may want to remove sequences including removing chimeras, removing sequence without good overlap and removing sequences with ambiguous bases. The `xdev_remove_sequences()` function allows you to easily do that. The `miseq_sop_example()` has already been screened for chimeras, overlap, sequence length and ambiguous bases, so let's randomly select 100 sequences to remove. ```{r} random_sequence_names <- sample(names(miseq), size = 100, replace = FALSE ) xdev_remove_sequences( miseq, sequence_names = random_sequence_names, trash_tags = rep("remove_sequences_example", 100) ) miseq ``` Note, sequences can also be removed by setting their abundance to 0. #### Removing Bins Similarly, lets randomly select and remove 10 phylotype bins with the `xdev_remove_bins()` function. Note, removing the bins also removes sequences. The removal of the sequences from the data set also effects the bins in the 'otu' and 'asv' clusters. ```{r} random_bin_names <- sample( names( miseq, type = "bin", bin_type = "phylotype" ), size = 10, replace = FALSE ) xdev_remove_bins( miseq, bin_names = random_bin_names, trash_tags = rep("remove_bins_example", 10), bin_type = "phylotype" ) miseq ``` Looking closer at the scrap summary we can see that removing 10 phylotype bins, removed 851 unique sequences that represented 21948 reads. After the removal of the 851 unique sequences the 'otu' cluster removed 134 bins and the 'asv' cluster removed 851 bins. #### Removing Samples If you included a mock community in your data set, you will want to remove it after assessing your error rates in preparation for the rest of your analysis. The `miseq_sop_example()` already has the mock community removed so for the sake of example we will remove sample 'F3D142'. ```{r} xdev_remove_samples( miseq, samples = c("F3D142"), reason = "remove_samples_example" ) miseq ``` Lastly, we can remove contaminants from the data set using the `xdev_remove_lineages()` function. The miseq example has already had the contaminants removed after classification and before bin assignment. For the sake of example we will remove all sequences assigned to 'Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;'. ```{r} bad_tax <- paste0( "Bacteria;Firmicutes;Clostridia;Clostridiales;", "Clostridiales_unclassified;" ) xdev_remove_lineages( miseq, contaminants = c(bad_tax), reason = "remove_contaminants_example" ) miseq ``` Thanks for following along. To explore more *xdev_* functions you can look at the rcpp_xint_xdev_functions.h and rcpp_xint_xdev_functions.cpp files located in the src folder of the package.