---
title: "Accessing Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Accessing Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
library(strollur)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

The *strollur* package stores data associated with your Amplicon Sequence Analysis. This tutorial will familiarize you with some of the functions available in the *strollur* package. If you haven't reviewed the "Getting Started" tuturial, we recommend you start there. 

## Loading the example dataset

We can use the `miseq_sop_example()` function to create a dataset object from the [Miseq SOP Example](https://mothur.org/wiki/miseq_sop/). 

```{r}
miseq <- miseq_sop_example()
```

To print a summary of the miseq dataset, run the following:

```{r}
miseq
```

## Accessing Your Data

The *strollur* package includes several functions to access your sequence data.

`names()` - The names function is an extension of base R's names function. It allows you to get the name of your dataset or the names of the sequences, bins, samples, treatments and reports in your dataset.

`count()` - The count function allows you to get the number of sequences, bins, samples and treatments in your dataset.

`abundance()` - The abundance function can be used to access the abundance data for sequences, bins, samples, and treatments.

`report()` - The report function can be used to access the various data associated with your dataset. 

`summary()` - The summary function allows you to summarize sequences, your custom reports, and scrapped data. 


### names

The `names()` function allows you to get the names of sequences, bins, samples, treatments and reports. Let's take a closer look at how use it.

To get the name of the dataset, set the type parameter to 'dataset':

```{r}
names(miseq, type = "dataset")
```


To get the names of the sequences in your dataset, set the type parameter to 'sequence':

```{r}
all_sequences <- names(
  miseq,
  type = "sequence",
  distinct = FALSE
)
length(all_sequences)
head(all_sequences, n = 5)
```


To get the names of the sequences *present* in sample 'F3D0': 

```{r}
include_f3d0 <- names(
  miseq,
  type = "sequence",
  samples = c("F3D0"),
  distinct = FALSE
)
length(include_f3d0)
head(include_f3d0, n = 5)
```


To get the names of the sequences *exclusive* to sample 'F3D0':

```{r}
exclusive_f3d0 <- names(
  miseq,
  type = "sequence",
  samples = c("F3D0"),
  distinct = TRUE
)
length(exclusive_f3d0)
head(exclusive_f3d0, n = 5)
```


To get the names of the bins in your dataset, you need to specify the bin_type. The miseq example contains 3 bin types: *otu*, *asv* and *phylotype*. Let's find the bin names for the 'otu' bin type.

```{r}
otu_bins <- names(
  miseq,
  type = "bin",
  bin_type = "otu"
)
length(otu_bins)
head(otu_bins, n = 5)
```


To get the names of the 'otu' bins that include sequences *present* in sample 'F3D0':
```{r}
include_f3d0 <- names(
  miseq,
  type = "bin",
  samples = c("F3D0"),
  distinct = FALSE
)
length(include_f3d0)
head(include_f3d0, n = 5)
```


To get the names of the "otu" bins that are *exclusive* to sample 'F3D0':

```{r}
exclusive_f3d0 <- names(
  miseq,
  type = "bin",
  samples = c("F3D0"),
  distinct = TRUE
)
length(exclusive_f3d0)
head(exclusive_f3d0, n = 5)
```


To get the names of the samples
```{r}
names(miseq, type = "sample")
```


To get the names of the treatments
```{r}
names(miseq, type = "treatment")
```

To get the names of the custom reports
```{r}
names(miseq, type = "report")
```

### count

The `count()` function allows you to get the number of sequences, bins, samples and treatments in your dataset. 

To find the *total* number of sequences in the dataset, run the following:
```{r}
count(
  miseq,
  type = "sequence",
  distinct = FALSE
)
```

To find the number of *unique* sequences in the dataset, set 'distinct' to TRUE:
```{r}
count(
  miseq,
  type = "sequence",
  distinct = TRUE
)
```

To get total number of sequences *present* in sample 'F3D0', you can set the samples parameter. Note, these sequences will be present in the sample but may be be present in other samples as well.

```{r}
count(
  miseq,
  type = "sequence",
  samples = c("F3D0"),
  distinct = FALSE
)
```

To get number of *unique* sequences *exclusive* to sample 'F3D0':

```{r}    
count(
  miseq,
  type = "sequence",
  samples = c("F3D0"),
  distinct = TRUE
)
```

To get the number of the bins in your dataset, you need to specify the bin_type. The miseq example contains 3 bin types: *otu*, *asv* and *phylotype*. Let's find the number of bins for the 'otu' bin type.
```{r}
count(
  miseq,
  type = "bin",
  bin_type = "otu"
)
```

To get number of *otu* bins with sequences *present* in sample 'F3D0':
*Note these bins will have sequences from sample 'F3D0' and may contain * 
*sequences from other samples as well.*
```{r}
count(
  miseq,
  type = "bin",
  bin_type = "otu",
  samples = c("F3D0"),
  distinct = FALSE
)
```

To get number of *otu* bins *exclusive* to sample 'F3D0':
*Note these bins will have sequences from sample 'F3D0' and NO other samples * 
*will be present in the bins.*
```{r}
count(
  miseq,
  type = "bin",
  bin_type = "otu",
  samples = c("F3D0"),
  distinct = TRUE
)
```

To get the number of samples in the dataset:
```{r}
count(miseq, type = "sample")
```

To get the number of treatments in the dataset:
```{r}
count(miseq, type = "treatment")
```


### abundance

Now that we are familiar with the `names()` and `count()` functions. Let's learn how we can use the `abundance()` function. The abundance function can be used to access the abundance data for sequences, bins, samples, and treatments. It returns a data.frame containing the requested abundance data.

To the total abundance for each sequence, set the type = 'sequence'. This will return a 2 column data.frame containing sequence_names and abundances.
```{r}
sequence_abundance <- abundance(
  miseq,
  type = "sequence",
  by_sample = FALSE
)
head(sequence_abundance, n = 10)
```

To the abundance for each sequence parsed by sample, set by_sample = TRUE. This will return a 3 or 4 column data.frame containing sequence_names, abundances, samples and treatments (if assigned).

```{r}
sequence_abundance_by_sample <- abundance(
  miseq,
  type = "sequence",
  by_sample = TRUE
)
head(sequence_abundance_by_sample, n = 10)
```

To get the total abundance of the bins in your dataset, you need to specify the bin_type. The miseq example contains 3 bin types: *otu*, *asv* and *phylotype*. Let's find the abundance data of bins for the *otu* bin type. This will return a 2 column data.frame containing bin_names and abundances.

```{r}
bin_abundance <- abundance(
  miseq,
  type = "bin",
  bin_type = "otu",
  by_sample = FALSE
)
head(bin_abundance, n = 10)
```

To the abundance for each bin parsed by sample, set by_sample = TRUE. This will return a 3 or 4 column data.frame containing bin_names, abundances, samples and treatments (if assigned).

```{r}
bin_abundance_by_sample <- abundance(
  miseq,
  type = "bin",
  bin_type = "otu",
  by_sample = TRUE
)
head(bin_abundance_by_sample, n = 10)
```


To access the distribution of sequences across the samples in your dataset, set the type = 'sample'. This will return a 2 column data.frame containing sample names and abundances.

```{r}
abundance(
  miseq,
  type = "sample"
)
```

To access the distribution of sequences across the treatments in your dataset, set the type = 'treatment'. This will return a 2 column data.frame containing treatment names and abundances.

```{r}
abundance(miseq, type = "treatment")
```

### report

The `report()` function allows you access you to access FASTA sequences, sequence and classification reports, bin assignments, sample assignments, metadata, sequence data reports, custom reports, resource references and scrapped data reports. It returns a data.frame containing the requested report data. Let's look at some examples together.

To get a report containing the sequences [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/) data, set the type = "fasta". This will be a 2 or 3 column data.frame containing sequence names, sequence nucleotide strings, and comments (if provided).

```{r}
fasta_report <- report(
  miseq,
  type = "fasta"
)
head(fasta_report, n = 5)
```

If you want to take a closer look at the sequences, you can request a report containing the sequence names, starts, ends, lengths, ambiguous bases, longest homopolymers, and number of N's. To get a sequence report, set type = "sequence".

```{r}
sequence_report <- report(
  miseq,
  type = "sequence"
)
head(sequence_report, n = 5)
```

The report function also allows you to access classification reports about your dataset. There are 2 types of classification reports: *sequence_taxonomy* and *bin_taxonomy*. To get a report about the sequence classifications, set type = "sequence_taxonomy". This will create a 3 or 4 column report containing sequence names, taxonomic levels, taxons and confidence scores(if provided).

```{r}
sequence_classification <- report(
  miseq,
  type = "sequence_taxonomy"
)
head(sequence_classification, n = 10)
```

To get the consensus taxonomies assigned to the bins in your dataset, you need to specify the bin_type. The miseq example contains 3 bin types: *otu*, *asv* and *phylotype*. Let's find the consensus taxonomies of bins for the *otu* bin type. 

```{r}
bin_classification <- report(
  miseq,
  type = "bin_taxonomy",
  bin_type = "otu"
)
head(bin_classification, n = 10)
```

To get the sequence bin assignments for the *otu* bins, run the following:
```{r}
sequence_bin_assignments <- report(
  miseq,
  type = "sequence_bin_assignment",
  bin_type = "otu"
)
head(sequence_bin_assignments, n = 10)
```

To get the sample treatment assignments, set type = "sample_assignment".

```{r}
sample_treatment_assignments <- report(
  miseq,
  type = "sample_assignment"
)
head(sample_treatment_assignments, n = 5)
```

If you assigned bin representative sequences, you can access the report by setting type = "bin_representative". The miseq example includes bin representatives for the *otu* bins. Let's take a look:
```{r}
otu_bin_representatives <- report(
  miseq,
  type = "bin_representative",
  bin_type = "otu"
)
head(otu_bin_representatives, n = 5)
```

The miseq example contains a custum report. To access the custom reports, first let's find the names. 

```{r}
names(miseq, type = "report")
```

To access the custom contigs assembly report, set type = "contigs_report".

```{r}
contigs_assembly_report <- report(
  miseq,
  type = "contigs_report"
)
head(contigs_assembly_report, n = 5)
```

To get the metadata associated with your dataset, set type = "metadata".
```{r}
metadata <- report(
  miseq,
  type = "metadata"
)
head(metadata, n = 5)
```

To get the resource references associated with your dataset, set type = "resource_reference".
```{r}
report(
  miseq,
  type = "resource_reference"
)
```

Lastly, if sequences or bins have been removed over the course of your analysis you can see a report about the scrapped data by setting type = "sequence_scrap" or "bin_scrap". The miseq example does not include scrapped data, but let's take a look at how to access it.

```{r}
report(
  miseq,
  type = "sequence_scrap"
)
report(
  miseq,
  type = "bin_scrap"
)
```

### summary

The `summary()` function allows you to summarize sequences, custom reports, and scrapped data. 

To get a summary for our custom reports, first let's find the report names.

```{r}
names(miseq, type = "report")
```

You can see we have a custom report so let get the summary by setting report_type = "contigs_report".

```{r}
summary(
  miseq,
  type = "report",
  report_type = "contigs_report"
)
```

Lastly, if sequences or bins have been removed over the course of your analysis you can see a summary about the scrapped data by setting type = "scrap". The miseq example does not include scrapped data, but let's take a look at how to access it.

```{r}
summary(miseq, type = "scrap")
```    

## Sample Trees and Sequence Trees

strollur allows you to include trees relating the sequences and samples within your dataset. You can access and plot them using the following functions.

```{r}
miseq_sequence_tree <- miseq$get_sequence_tree()
miseq_sample_tree <- miseq$get_sample_tree()

#| fig.alt: >
#|   Plot of Miseq_SOP's sample relationship tree
old_par <- par(bg = "white")
ape::plot.phylo(miseq_sample_tree,
  no.margin = TRUE,
  cex = 0.5, edge.color = "maroon", tip.color = "navy"
)
par(old_par)
```    

Thanks for following along. Next, find out how to import your own data using the links below:

* [General Importing](https://mothur.org/strollur/articles/General_Importing.html)
* [Importing data from mothur](https://mothur.org/strollur/articles/Importing_from_mothur.html)
* [Importing data from qiime2](https://mothur.org/strollur/articles/Importing_from_qiime2.html)
* [Importing data from phyloseq](https://mothur.org/strollur/articles/Importing_from_phyloseq.html)