---
title: "General Importing"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{General Importing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
library(strollur)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The strollur package stores data associated with your amplicon sequence analysis. This tutorial will familiarize you with some of the functions available in the strollur package. If you haven’t reviewed the “Getting Started” tutorial, we recommend you start there.

## Creating a new dataset

First let's create an empty data set named my_data.

```{r}
data <- new_dataset(dataset_name = "my_data")
```

## Importing Data

The *strollur* package includes two functions for importing data into your data set.

`add()` - The add function allows you to add sequences, reports, metadata, and resource references to your data set.

`assign()` - The assign function allows you to assign sequence abundances, sequence classifications, bins, bin representative sequences, bin classifications, samples and treatments to your data set.

### add
The add function allows you to add sequences, reports, metadata, and resource references to your data set.

#### Adding FASTA sequences

First, let's add some [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/) data. strollur has a function for reading FASTA files named `read_fasta()`. We will use it to read the sequence data into a data.frame.

```{r}
fasta_data <- strollur::read_fasta(strollur_example("final.fasta.gz"))
str(fasta_data)

add(
  data,
  table = fasta_data,
  type = "sequence"
)
data
```
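To make the shape of such a table concrete, here is a minimal base R sketch that parses FASTA text into a data.frame with one row per sequence. It is an illustration only: the column names `sequence_name` and `sequence` are assumptions and may not match what `read_fasta()` actually returns, and the sketch assumes every header is followed by at least one line of sequence.

```r
# Two FASTA records as in-memory text; a real file could be read with readLines().
fasta_lines <- c(
  ">seq1",
  "ACGTAC",
  "GTAA",
  ">seq2",
  "TTGACA"
)

# Header lines start with ">"; every other line is sequence text.
is_header <- startsWith(fasta_lines, ">")

# cumsum() over the header flags labels each line with its record index.
record_id <- cumsum(is_header)

# Hypothetical column names, chosen for illustration only.
fasta_sketch <- data.frame(
  sequence_name = sub("^>", "", fasta_lines[is_header]),
  # Join the sequence lines that belong to the same record.
  sequence = as.vector(tapply(
    fasta_lines[!is_header], record_id[!is_header], paste,
    collapse = ""
  )),
  stringsAsFactors = FALSE
)
str(fasta_sketch)
```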

If you want to include a resource reference about your FASTA data, you can use the `new_reference()` function and the `reference` parameter. strollur does not allow you to add sequences with the same name, so let's first use the `clear()` function to remove all data from our data set.

```{r}
clear(data)

documentation_url <- "https://mothur.org/wiki/silva_reference_files/"
method_url <- "https://mothur.org/blog/2024/SILVA-v138_2-reference-files/"

silva_resource <- new_reference(
  vendor = "SILVA",
  name = "silva.bacteria.fasta",
  version = "138.2",
  usage = "alignment of sequences",
  note = "reference trimmed to V4 region",
  documentation_url = documentation_url,
  method_url = method_url
)

add(
  data,
  table = fasta_data,
  type = "sequence",
  reference = silva_resource
)
data
```
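Because sequences with the same name cannot be added twice, it can help to check a table for repeated names before importing it. The sketch below uses only base R on a made-up vector of names; it is not part of strollur.

```r
# Hypothetical sequence names; "seq2" appears twice.
sequence_names <- c("seq1", "seq2", "seq3", "seq2")

# duplicated() flags the second and later occurrences of each value.
repeated <- unique(sequence_names[duplicated(sequence_names)])

if (length(repeated) > 0) {
  message("Repeated sequence names: ", paste(repeated, collapse = ", "))
}
```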

#### Adding Custom Reports

You may want to add custom reports to your data set, such as a contigs assembly report, chimera report or alignment report. You can do so by setting `type = "report"`. You must also provide a `report_type`.

This is also a good time to explain what the `table_names` parameter does for you. strollur expects the columns in custom reports to have specific names. If your table's names differ from what strollur is expecting, you will see an error like the one below.

```{r, eval=FALSE} 
contigs_report <- readRDS(strollur_example("miseq_contigs_report.rds"))

add(
  data,
  table = contigs_report,
  type = "report",
  report_type = "contigs_report"
)

# Error: The report must include a column containing sequence names.
# sequence_names is not a named column in your report.

# Called from: xdev_add_report(data, table = table, type = report_type,
#    sequence_name = table_names[["sequence_name"]], verbose)
```

You can use the `table_names` parameter to tell strollur what a specific column is called in your custom report table. In the contigs_report the *sequence_name* column is called *Name*, so we will add one more line to the `add()` call.

```{r} 
contigs_report <- readRDS(strollur_example("miseq_contigs_report.rds"))
str(contigs_report)

add(
  data,
  table = contigs_report,
  type = "report",
  report_type = "contigs_report",
  table_names = list(sequence_name = "Name")
)
data
```
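Conceptually, the `table_names` list maps an expected column name to the name actually used in your table. The base R sketch below applies the same kind of mapping by renaming columns directly. It only illustrates the idea and is not how strollur handles the parameter internally; the `Length` column is made up.

```r
# A stand-in for a report whose name column is called "Name".
report_sketch <- data.frame(
  Name = c("seq1", "seq2"),
  Length = c(253L, 252L),
  stringsAsFactors = FALSE
)

# Mapping in the same shape as table_names: expected name -> actual name.
column_map <- list(sequence_name = "Name")

# Rename each actual column to the name that is expected.
for (expected in names(column_map)) {
  actual <- column_map[[expected]]
  names(report_sketch)[names(report_sketch) == actual] <- expected
}
names(report_sketch)
```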


#### Adding Metadata

Now that we have added our custom contigs assembly report, let's learn how to add metadata. We can add metadata to our data set by setting `type = "metadata"`.

```{r}
metadata <- readRDS(strollur_example("miseq_metadata.rds"))
str(metadata)

add(
  data,
  table = metadata,
  type = "metadata"
)
```


#### Adding Resource References

We can add additional resource references to our data set by setting `type = "resource_reference"`.

```{r}
reference <- readr::read_csv(strollur_example("references.csv"),
  col_names = TRUE, show_col_types = FALSE
)

add(
  data,
  table = reference,
  type = "resource_reference"
)
```


### assign

The `assign()` function allows you to assign sequence abundances, sequence classifications, bins, bin representative sequences, bin classifications, samples and treatments to your data set.

#### Assigning Abundances

After adding your FASTA sequences, you can assign abundance and sample data using the `assign()` function with `type = "sequence_abundance"`.
    
```{r}
abundance_table <- readRDS(strollur_example("miseq_abundance_by_sample.rds"))
str(abundance_table)

assign(data, table = abundance_table, type = "sequence_abundance")

data
```
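An abundance-by-sample table is essentially a long table with one row per sequence and sample pair. The sketch below builds a small made-up table in base R and totals it two ways. The column names are assumptions for illustration and may differ from those in `miseq_abundance_by_sample.rds`.

```r
# Hypothetical long-format table: one row per sequence/sample pair.
abundance_sketch <- data.frame(
  sequence_name = c("seq1", "seq1", "seq2", "seq3"),
  sample = c("A", "B", "A", "B"),
  abundance = c(10, 5, 3, 7),
  stringsAsFactors = FALSE
)

# Total abundance per sample.
sample_totals <- tapply(
  abundance_sketch$abundance, abundance_sketch$sample, sum
)
sample_totals

# Total abundance per sequence across all samples.
sequence_totals <- tapply(
  abundance_sketch$abundance, abundance_sketch$sequence_name, sum
)
sequence_totals
```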


#### Assigning Bins

As you can see, we now have abundances, samples and treatments added to the data set. Next, let's assign the sequences to bins using `type = "bin"`. When you assign sequences to bins you must provide a *bin_type*. The bin_type is a tag of your choosing used to reference the bin clusters you are adding. Let's add some [Operational Taxonomic Unit](https://en.wikipedia.org/wiki/Operational_taxonomic_unit) clusters and set `bin_type = "otu"`.

```{r}
bin_table <- readRDS(strollur_example("miseq_list_otu.rds"))
str(bin_table)

assign(data, table = bin_table, type = "bin", bin_type = "otu")

data
```


You can see from the summary that we now have 531 OTUs in our data set.

Note, if you are importing data from packages that preprocess the Amplicon Sequence data into features, you can assign the feature table abundances as sequence abundances and then assign the features to [Amplicon Sequence Variant](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) clusters or *asv* bins. 
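Feature tables are often stored wide, with one row per feature and one column per sample. A minimal base R sketch of reshaping such a table into a long table of abundances follows. The feature names, sample names, and output column names are all made up for illustration; check the column names strollur expects before importing.

```r
# Hypothetical feature table: rows are features (ASVs), columns are samples.
# matrix() fills column by column: sample A = 12, 0, 4 and sample B = 3, 8, 0.
feature_table <- matrix(
  c(12, 0, 4, 3, 8, 0),
  nrow = 3,
  dimnames = list(c("asv1", "asv2", "asv3"), c("A", "B"))
)

# as.data.frame() on a table object yields one row per feature/sample pair.
long_table <- as.data.frame(as.table(feature_table), stringsAsFactors = FALSE)
names(long_table) <- c("sequence_name", "sample", "abundance")

# Drop feature/sample pairs with a zero count.
long_table <- long_table[long_table$abundance > 0, ]
long_table
```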

#### Assigning Taxonomic Classifications

Now that we have assigned our sequences to bins, let's assign taxonomy to our sequences. 

```{r}
sequence_classification_data <- read_mothur_taxonomy(
  taxonomy = strollur_example("final.taxonomy.gz")
)
str(sequence_classification_data)

assign(
  data,
  table = sequence_classification_data,
  type = "sequence_taxonomy"
)
```

Note, when you assign taxonomy to sequences that are assigned to bins, strollur will automatically set each bin's taxonomy to the consensus taxonomy of the sequences in that bin. You can also set bin taxonomies independently by setting `type = "bin_taxonomy"`.

```{r}
otu_taxonomy_data <- read_mothur_cons_taxonomy(strollur_example(
  "final.cons.taxonomy"
))
str(otu_taxonomy_data)

assign(
  data,
  table = otu_taxonomy_data,
  type = "bin_taxonomy",
  bin_type = "otu"
)
```
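To give a feel for what a consensus taxonomy is, the sketch below takes a simple majority vote at each rank for the sequences in one made-up bin. This is an assumption-laden illustration: strollur's actual consensus rule may differ (for example, it may apply a confidence cutoff), and the sketch assumes every classification has the same number of ranks.

```r
# Hypothetical classifications for three sequences in one bin,
# written as semicolon-separated ranks.
taxonomies <- c(
  "Bacteria;Firmicutes;Clostridia",
  "Bacteria;Firmicutes;Bacilli",
  "Bacteria;Firmicutes;Clostridia"
)

# Split each string into ranks; rows are ranks, columns are sequences.
ranks <- do.call(cbind, strsplit(taxonomies, ";", fixed = TRUE))

# At each rank keep the most frequent name
# (ties go to the alphabetically first name).
consensus <- apply(ranks, 1, function(x) names(which.max(table(x))))
consensus_taxonomy <- paste(consensus, collapse = ";")
consensus_taxonomy
```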


#### Assigning Bin Representatives

strollur allows you to assign representative sequences to the bins in your clusters. Let's assign bin representatives to our *otu* bins.

```{r}
bin_reps <- readRDS(strollur_example("miseq_representative_sequences.rds"))
str(bin_reps)

assign(
  data,
  table = bin_reps,
  type = "bin_representative"
)
```
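If you need to build a table of representatives yourself, one simple rule is to pick the most abundant sequence in each bin. The base R sketch below does that on made-up data. The rule and the column names are assumptions for illustration, not the method used to produce `miseq_representative_sequences.rds`.

```r
# Hypothetical bin assignments with a total abundance per sequence.
bins <- data.frame(
  bin_name = c("otu1", "otu1", "otu2", "otu2", "otu2"),
  sequence_name = c("seq1", "seq2", "seq3", "seq4", "seq5"),
  abundance = c(4, 9, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Sort by bin, then by decreasing abundance within each bin.
ordered_bins <- bins[order(bins$bin_name, -bins$abundance), ]

# Keep the first (most abundant) row of each bin.
representatives <- ordered_bins[
  !duplicated(ordered_bins$bin_name),
  c("bin_name", "sequence_name")
]
representatives
```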


#### Assigning Treatments

In our case the abundance_table included treatment assignments, but you can also assign samples to treatments by setting type = "treatment".

```{r}
sample_assignments <- readRDS(strollur_example("miseq_sample_design.rds"))
str(sample_assignments)

assign(
  data,
  table = sample_assignments,
  type = "treatment"
)
```
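A design table of this kind pairs each sample with one treatment. The sketch below builds a small made-up table in base R, checks that no sample is listed twice, and groups samples by treatment. The column names `sample` and `treatment` are assumptions and may differ from those in `miseq_sample_design.rds`.

```r
# Hypothetical design table: one row per sample.
design <- data.frame(
  sample = c("A", "B", "C", "D"),
  treatment = c("early", "early", "late", "late"),
  stringsAsFactors = FALSE
)

# Each sample should appear only once; anyDuplicated() returns 0 if so.
stopifnot(anyDuplicated(design$sample) == 0)

# List the samples that belong to each treatment.
samples_by_treatment <- split(design$sample, design$treatment)
samples_by_treatment
```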


## Sample Trees and Sequence Trees

Lastly, strollur allows you to add trees that relate your samples or sequences. Let's look at some examples together.

```{r}
#| fig.alt: >
#|   Plot of Miseq_SOP's sample relationship tree
sample_tree <- ape::read.tree(strollur_example("final.opti_mcc.jclass.ave.tre"))
sequence_tree <- ape::read.tree(strollur_example("final.phylip.tre.gz"))

data$add_sample_tree(sample_tree)
data$add_sequence_tree(sequence_tree)

# Plot the sample tree on a white background, then restore the old settings.
old_par <- par(bg = "white")
ape::plot.phylo(data$get_sample_tree(),
  no.margin = TRUE,
  cex = 0.5, edge.color = "maroon", tip.color = "navy"
)
par(old_par)
```

Thanks for following along. To learn more about the functions used to access the data in your data set, take a look at the [Accessing Data](https://mothur.org/strollur/articles/Accessing_Dataset.html) tutorial.