---
title: "ProActive"
author: "Jessie Maier"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
description: >
  The `ProActive` R package automatically detects regions of gapped and elevated 
  read coverage using a pattern-matching algorithm. `ProActive` can detect, characterize 
  and visualize read coverage patterns in both genomes and metagenomes. Optionally, 
  users may provide gene annotations associated with their genome or metagenome 
  in the form of a .gff file. In this case, `ProActive` will generate an additional 
  output table containing the annotations found within the detected regions of gapped 
  and elevated read coverage.
vignette: >
  %\VignetteIndexEntry{ProActive}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  out.width = "100%",
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, echo=FALSE, warning=FALSE, include=FALSE}
library(ProActive)
library(kableExtra)
library(ggplot2)
library(stringr)
library(dplyr)
```

# Introduction

`ProActive` automatically detects regions of gapped and elevated read coverage 
using a 2D pattern-matching algorithm. `ProActive` can detect, characterize and 
visualize read coverage patterns in both genomes and metagenomes. Optionally, 
users may provide gene predictions associated with their genome or metagenome 
in the form of a .gff file. In this case, `ProActive` will generate an additional 
output table containing the gene predictions found within the detected regions of 
gapped and elevated read coverage.

Analyzing coverage data is important because gaps or elevations in coverage can 
provide insight into a community's (metagenome) or organism's (genome) genetic 
activity, for example-

Elevations in read coverage can be an indication of highly
active and/or abundant mobile genetic elements (MGEs). MGEs that are actively 
replicating or are highly abundant may generate more sequencing reads than the 
rest of their host bacterium's genome thus creating a region of elevated 
read coverage at the element's insertion site(s).   

Gaps in read coverage can indicate genetic heterogeneity in a bacterial population. 
Sub-populations of bacteria with and without specific gene sequences may form 
gaps or dips in mapped read coverages. Genetic regions with high mutation rates 
may also generate gaps in read coverage.

**Note:** The read coverage pattern-matching employed by `ProActive` is only as 
good as the provided data. There are other non-biological situations that can 
create gaps or elevations in read coverage. For example, chimeric assemblies or 
contaminants can create odd read coverage artifacts. Because of this, ProActive 
is best used as a screening method to identify genetic regions for further 
investigation.


# Installation
## CRAN install
```{r, eval=FALSE}
install.packages("ProActive")
library(ProActive)
```

## GitHub install
```{r, eval=FALSE}
if (!require("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

devtools::install_github("jlmaier12/ProActive")
library(ProActive)
```

# Input data

## Pileups

ProActive detects read coverage patterns using a pattern-matching algorithm that 
operates on pileup files. A pileup file is a file format where each row 
summarizes the 'pileup' of reads at specific genomic locations. Pileup files 
can be used to generate a rolling mean of read coverages and associated base 
pair positions which reduces data size while 
preserving read coverage patterns. **ProActive requires that input pileups files** 
**be generated using a 100 bp window/bin size.**  

Pileup files can be generated by mapping sequencing reads to a 
metagenome or genome fasta file. **Read mapping should be performed using a high** 
**minimum identity (0.97 or higher) and random mapping of ambiguous reads.** The 
pileup files needed for ProActive are generated using the .bam files produced 
during read mapping. 

Some read mappers, like 
[BBMap](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/), 
allow for the generation of pileup files in the 
[`bbmap.sh`](https://github.com/BioInfoTools/BBMap/blob/master/sh/bbmap.sh) 
command with use of the `bincov` output with the `covbinsize=100` 
parameter/argument. **Otherwise, BBMap's** 
**[`pileup.sh`](https://github.com/BioInfoTools/BBMap/blob/master/sh/pileup.sh)** 
**can convert .bam files produced by any read mapper to pileup files compatible** 
**with ProActive using the `bincov` output with `binsize=100`.**

**Pileup files must use a 100 bp window/bin size for the rolling mean!** 

**The input pileup file for metagenomes must have the following format:**

Dataframe with four columns:

* V1: Contig accession
* V2: Mapped read coverage values averaged over 100 bp windows
* V3: Starting position (bp) of each 100 bp window. Restarts from 100 at the 
start of each new contig.
* V4: Starting position (bp) of each 100 bp window. Does NOT restart at the 
start of each new contig.

```{r echo=FALSE}
kable(head(sampleMetagenomePileup), row.names = FALSE) %>%
  kable_styling(latex_options = "HOLD_position")
```

**Note that the format for a genome pileup will be slightly different! The third** 
**column (V3) does not restart and the fourth column (V4) starts from 0. ProActive**
**accounts for the differences in pileup formats between genomes and metagenomes.** 

Users may also use the 'sampleMetagenomePileup' and 'sampleGenomePileup' files that 
come pre-loaded with ProActive as references for proper input file format.

## gff TSV
Optionally, ProActive will accept a .gff file as additional input. The .gff file must be 
associated with the same metagenome or genome used to create your pileup file. 
The .gff file should be a TSV and should follow the same general format described [here](https://en.wikipedia.org/wiki/General_feature_format#:~:text=In%20bioinformatics%2C%20the%20general%20feature,DNA%2C%20RNA%20and%20protein%20sequences.).

**The input .gff file must have the following format exactly:**
```{r echo=FALSE}
kable(head(sampleMetagenomegffTSV), row.names = FALSE) %>%
  kable_styling(latex_options = "HOLD_position")
```

(**Hint**- if you are using a gff file output by [PROKKA](https://github.com/tseemann/prokka),
you may need to remove some unnecessary (for ProActive) lines of text at the top 
of the file. There are various ways one can remove these additional lines, however, a
nice command-line solution is:)
```{bash eval=FALSE}
grep ^COMMONID metagenomeAnnots.gff > metagenomeAnnotsForProActive.gff
```

The 'COMMONID' should be a value that your contig or genome accessions start with 
that is the same across all the rows. For example, the 'COMMONID' for the contig 
accessions in the sampleMetagenomegffTSV displayed above could be "NODE" since all 
the accessions start with "NODE".

# ProActive()
`ProActive()` is the main function in the ProActive R package. This function 
filters contigs/chunks based on length and read coverage, performs pattern-matching 
to detect gaps and elevations in read coverage, and identifies start and stop positions 
and sizes of pattern-matches.    

## Function components

### Chunking
Currently, `ProActive()` can only detect one gap or elevation pattern per contig. 
Until `ProActive()` is able to detect multiple read coverage patterns per contig, we 
rely on 'chunking' large contig into smaller subsets (defined by `chunkSize`) so 
that pattern-matching can be performed on each chunk as if it were an individual 
contig. The chunking mechanism is what allows `ProActive()` to pattern-match on 
genomes. When contigs/genomes are chunked, they are assigned a sequential value 
to link chunks back together (i.e. "NODE_1_chunk_1, NODE_1_chunk_2, NODE_1_chunk_3, ...).
Note that the remaining 'chunk' of a contig/genome may not be long enough to 
perform pattern-matching on. Chunks too small for pattern-matching will be put 
in the output FilteredOut table. If a chunk splits a gap or elevation pattern in half,
`ProActive()` will attempt to detect this and report it to the user as a 'possible 
pattern-match continuity' between contig/genome chunks. Pattern-match continuity is
detected when two sequential chunks have a partial elevation/gap pattern going off the 
right and left side of the chunks, respectively.

### Filtering
Contigs/chunks that are too short or have little to no read coverage are filtered out 
prior to pattern-matching. `ProActive()` filters out contigs/chunks that do 
not have at least 10x coverage on a total of 5,000 bp across the whole contig/chunk. 
The read coverage filtering was done in this way to avoid filtering out long 
contigs/chunks with small elevations in read coverage that might get removed if filtering 
was done with read coverage averages or medians. Additionally, contigs/chunks less 
than 30,000 bp are filtered out by default, however this can be changed with 
the `minContigLength` parameter which can be set to a minimum of 25,000 bp. 
**If you would like to reduce the size of your input metagenome pileup file for** 
**`ProActive()`, consider pre-filtering your assembly for contigs greater than** 
**25,000 bp prior to read mapping!** 

### Changing pileup windowSize
The input pileup files have 100 bp windows in which the mapped read coverage is 
averaged over. `ProActive()` increases the window size prior to pattern-matching by 
averaging the read coverages over a value specified with `windowSize`. In 
many cases, read coverage patterns don't require the resolution that 100 bp windows 
provide, however, starting with a 100 bp windowSize means the higher resolution is 
available if needed. While users can use the 100 bp `windowSize` for `ProActive()`, 
the processing time will be increased **significantly** and noisy data may interfere 
with pattern-matching. We find that the default 1,000 bp `windowSize` provides a 
nice balance between processing time and read coverage pattern resolution. If you'd 
like more resolution than the 1,000 bp `windowSize` provides, consider dropping the
`windowSize` to 500. If you'd like fine scale read coverage resolution, consider 
viewing the contigs/genome with a software like Integrative Genomics Viewer 
[IGV](https://github.com/igvteam/igv).

### Pattern-matching
`ProActive()` detects read coverage patterns using a 2D
pattern-matching algorithm. Several predefined patterns, described below, are 
built using the specific length and read coverage values of the contig/chunk being 
assessed. Patterns are translated across each contig/chunk in 1,000 bp sliding 
windows and at each translation, a pattern-match score is calculated by taking 
the mean absolute difference of the read coverage and the pattern 
values. The smaller the match-score, the better the pattern-match. After a 
pattern is fully translated across a contig/chunk, certain aspects of the pattern are 
changed (i.e. height, base, width) and translation is repeated. This process 
of translation and pattern re-scaling is repeated until a large number of 
pattern variations are tested. After pattern-matching is complete, the pattern 
associated with the best match-score is used for contig/chunk classification. 
Contigs/chunks are classified as ‘Elevation’, ‘Gap’, or 'NoPattern' during 
pattern-matching. 

#### Elevation pattern:

The 'elevation' class is defined by a 'block' pattern which represents the read 
covarage pattern formed when reads from a highly active/abundant MGE map back to 
its respective integration site in the host/originating genome. During 
pattern-matching, the height (max), base (min) and width are altered and all 
pattern variations are translated across the contig/chunk. The block 
pattern width never get smaller than 10,000 bp by default, however this can be 
changed with the `minSize` parameter. 

```{r echo=FALSE, out.width = "50%"}
dataframe <- cbind.data.frame(c(1:100), c(rep(0, 20), rep(100, 60), rep(0, 20)))
colnames(dataframe) <- c("mockpos", "mockcov")
plot1 <- ggplot(dataframe, aes(x = mockpos, y = mockcov)) +
  geom_line(linewidth = 1) +
  labs(x = NULL, y = NULL) +
  theme_classic() +
  ggplot2::theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    plot.title = element_text(size = 10),
    panel.border = element_rect(colour = "black", fill = NA, linewidth = 2)
  )

plot1
```


#### Gap pattern:
The 'gap' class is essentially the reverse of the 'elevation' class. It is defined 
by a gapped block pattern which represents a depression in read coverage caused by 
genetic heterogeneity at a specific region leading to poor read mapping. The same 
pattern-matching steps (alteration of pattern and pattern translation) used for the
elevation pattern are used for the gap pattern.

```{r echo=FALSE, out.width = "50%"}
dataframe <- cbind.data.frame(c(1:100), c(rep(100, 20), rep(5, 60), rep(100, 20)))
colnames(dataframe) <- c("mockpos", "mockcov")
plot1 <- ggplot(dataframe, aes(x = mockpos, y = mockcov)) +
  geom_line(linewidth = 1) +
  labs(x = NULL, y = NULL) +
  theme_classic() +
  ggplot2::theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    plot.title = element_text(size = 10),
    panel.border = element_rect(colour = "black", fill = NA, linewidth = 2)
  )

plot1
```

#### Elevation/Gap pattern:
Elevations and gaps that trail off one side of a contig/chunk are hard to classify
as the read coverage could be seen as a gap or elevation depending on how you're 
looking at it. Our solution to this problem was to classify the contig/chunk as 
'gap' if the elevated region was greater than 50% of the length of the contig/chunk.
Otherwise the classification is 'elevated'.

```{r echo=FALSE, out.width = "50%"}
dataframe <- cbind.data.frame(c(1:100), c(rep(100, 50), rep(5, 50)))
colnames(dataframe) <- c("mockpos", "mockcov")
plot1 <- ggplot(dataframe, aes(x = mockpos, y = mockcov)) +
  geom_line(linewidth = 1) +
  labs(x = NULL, y = NULL) +
  theme_classic() +
  ggplot2::theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    plot.title = element_text(size = 10),
    panel.border = element_rect(colour = "black", fill = NA, linewidth = 2)
  )

plot1
```

#### noPattern:
Since the best pattern-match for each contig/chunk is determined by comparing 
match-scores amongst all pattern-variations from all pattern classes, we needed 
a ‘negative control’ pattern to compare against. The 'NoPattern' 'pattern' 
serves as a negative control by matching to contigs/chunks with no read coverage 
patterns. We made two NoPattern patterns which consist of a horizontal line the 
same length as the contig/chunk being assessed at either the contig/chunk's 
average or median read coverage. This pattern is not re-scaled or translated in 
any way. 

```{r echo=FALSE, out.width = "50%"}
dataframe <- cbind.data.frame(c(1:100), rep(10, 100))
colnames(dataframe) <- c("mockpos", "mockcov")
plot1 <- ggplot(dataframe, aes(x = mockpos, y = mockcov)) +
  geom_line(linewidth = 1) +
  ylim(0, 100) +
  labs(x = NULL, y = NULL) +
  theme_classic() +
  ggplot2::theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    plot.title = element_text(size = 10),
    panel.border = element_rect(colour = "black", fill = NA, linewidth = 2)
  )
plot1
```

### Calculating elevation ratios
Every gap and elevation classification receives an elevation ratio which is simply 
the pattern-match's maximum value divided by the minimum value. For elevation classifications,
you can think of the elevation ratio as 'the elevated region has X times greater 
read coverage than the non-elevated region'. Conversely, for gap classifications, the elevation ratio
can be thought of as 'the gapped region has X times less read coverage than the 
non-gapped region'.

### Extracting gene annotations in elevated/gapped regions
If a .gff file is provided, then `ProActive()` will extract the gene annotations
found within the gapped and elevated pattern-match regions and provide them to the user in 
an output table (GeneAnnotTable). An additional column will be added with the classification 
information (gap or elevation) associated with the gene annotations. If the input .gff file 
contains a gene 'product' field in the attributes column (9th column in the dataframe), then 
`ProActive()` will extract the product information into a separate column for easy 
visualization and filtering of annotations of interest. 

## Usage

**Default arguments in metagenome mode:**
```{r}
ProActiveOutputMetagenome <- ProActive(
  pileup = sampleMetagenomePileup,
  mode = "metagenome",
  gffTSV = sampleMetagenomegffTSV
)
```

**Default arguments in genome mode:**
```{r}
ProActiveOutputGenome <- ProActive(
  pileup = sampleGenomePileup,
  mode = "genome",
  gffTSV = sampleGenomegffTSV
)
```

**Note that ProActive *can* be run without the gffTSV file!** 

## Arguments/parameters
```{r, eval=FALSE}
ProActive(pileup,
  mode,
  gffTSV,
  windowSize = 1000,
  minSize = 10000,
  maxSize = Inf,
  minContigLength = 30000,
  chunkSize = 50000,
  chunkContigs = FALSE,
  IncludeNoPatterns = FALSE,
  verbose = TRUE,
  saveFilesTo
)
```

- **`pileup`**: A .txt file containing mapped sequencing read coverages averaged over 
100 bp windows/bins.
- **`mode`**: Either "genome" or "metagenome" 
- **`gffTSV`**: Optional, a .gff file (TSV) containing gene predictions associated with 
the .fasta file used to generate the pileup. 
- **`windowSize`**: The number of basepairs to average read coverage values over. 
Options are 100, 200, 500, 1000 ONLY.  Default is 1000.
- **`minSize`**: The minimum size (in bp) of elevation or gap patterns. Default is 10000.
- **`maxSize`**: The maximum size (in bp) of elevation or gap patterns. Default is NA
(i.e. no maximum).
- **`minContigLength`**: The minimum contig/chunk size (in bp) to perform pattern-matching 
on. Default is 25000.
- **`chunkSize`**: If `mode`="genome" OR if `mode`="metagenome" and `chunkContigs`=TRUE, 
chunk the genome or contigs, respectively, into smaller subsets for pattern-matching. 
`chunkSize` determines the size (in bp) of each 'chunk'. Default is 100000.
- **`chunkContigs`**: TRUE or FALSE, If TRUE and `mode`="metagenome", contigs longer 
than the `chunkSize` will be 'chunked' into smaller subsets and pattern-matching 
will be performed on each subset. Default is FALSE.
- **`IncludeNoPatterns`**: TRUE or FALSE, If TRUE the noPattern pattern-matches will 
be included in the ProActive PatternMatches output list. If you would like to visualize 
the noPattern pattern-matches in `plotProActiveResults()`, this should be set to TRUE. 
- **`verbose`**: TRUE or FALSE. Print progress messages to console. Default is TRUE.
- **`saveFilesTo`**: Optional, Provide a path to the directory you wish to save 
output to. A folder will be made within the provided directory to store 
results.

## Output

The output of `ProActive` is a list containing six objects:

1. SummaryTable: A table containing all pattern-matching classifications
2. CleanSummaryTable: A table containing only gap and elevation pattern-match
classifications (i.e. noPattern classifications removed)
3. PatternMatches: A list object containing information needed to visualize the 
pattern-matches in `plotProActiveResults()`
4. FilteredOut: A table containing contigs/chunks that were filtered out for being 
too small or having too low read coverage
5. Arguments: A list object containing arguments used for pattern-matching (windowSize, 
mode, chunkSize, chunkContigs)
6. GeneAnnotTable: A table containing gene predictions associated with elevated 
or gapped regions in pattern-matches

Save the desired list item to a new variable using its associated name.

**Metagenome results summary table:**
```{r}
MetagenomeCleanSummaryTable <- ProActiveOutputMetagenome$CleanSummaryTable
```

```{r, echo=FALSE}
kable(MetagenomeCleanSummaryTable) %>%
  kable_styling(latex_options = "HOLD_position")
```

**Subset of genome results summary table:**
```{r}
GenomeCleanSummaryTable <- head(ProActiveOutputGenome$CleanSummaryTable)
```

```{r, echo=FALSE}
kable(GenomeCleanSummaryTable) %>%
  kable_styling(latex_options = "HOLD_position")
```

**Subset of GeneAnnotTable for metagenome results:**
```{r}
MetagenomeResultsGenePredictTable <- head(ProActiveOutputMetagenome$GeneAnnotTable)
```

```{r, echo=FALSE}
kable(MetagenomeResultsGenePredictTable) %>%
  kable_styling(latex_options = "HOLD_position")
```

**Subset of GeneAnnotTable for genome results:**
```{r}
GenomeResultsGenePredictTable <- head(ProActiveOutputGenome$GeneAnnotTable)
```

```{r, echo=FALSE}
kable(head(GenomeResultsGenePredictTable)) %>%
  kable_styling(latex_options = "HOLD_position")
```


# plotProActiveResults()

`plotProActiveResults()` allows users to visualize both the read coverage and 
the pattern-match associated with each gap or elevation classification.

## Function components

### Re-building pattern-matches
The `ProActive()` output contains information needed to re-build each 
pattern-match used for classification. To re-build a complete 
pattern-match for visualization, `plotProActiveResults()` uses the 
pattern-match's minimum and maximum values and the start and stop positions.

### Plotting read coverage and associated pattern-matches
By default, the read coverage is plotted for each contig/chunk
classified as having a gap or elevation in read coverage. If you wish to see 
the read coverage for noPattern classifications, be sure to set
`IncludeNoPatterns = TRUE` when running `ProActive()`. The pattern-match
associated with each classification is overlaid on the coverage plot.

## Usage

**Default arguments:**
```{r}
MetagenomeResultsPlots <- plotProActiveResults(
  pileup = sampleMetagenomePileup,
  ProActiveResults = ProActiveOutputMetagenome
)

GenomeResultsPlots <- plotProActiveResults(
  pileup = sampleGenomePileup,
  ProActiveResults = ProActiveOutputGenome
)
```

**Note**- There is no need to set 'genome' or 'metagenome' mode. `plotProActiveResults()`
will get this information from the `ProActive()` output.

## Arguments/parameters
```{r, eval=FALSE}
plotProActiveResults(pileup,
  ProActiveResults,
  elevFilter,
  saveFilesTo
)
```

- **`pileup`**: A .txt file containing mapped sequencing read coverages averaged over 
100 bp windows/bins.
- **`ProActiveResults`**: The output from `ProActive()`.
- **`elevFilter`**: Optional, only plot results with pattern-matches that achieved an 
elevation ratio (max/min) greater than the specified value. Default is no filter.
- **`saveFilesTo`**: Optional, Provide a path to the directory you wish to save 
output to. A folder will be made within the provided directory to store 
results.

## Output

The output of `plotProActiveResults` is a list of ggplot objects. 

### View select metagenome plots
```{r eval=FALSE}
MetagenomeResultsPlots
```

```{r echo=FALSE, fig.width=6}
MetagenomeResultsPlots$NODE_1884
MetagenomeResultsPlots$NODE_368
MetagenomeResultsPlots$NODE_617
```

### View select genome plots

**Notice the 'chunk' information in the plot titles**
```{r fig.width=6}
GenomeResultsPlots$NC_003197.2_chunk_36
GenomeResultsPlots$NC_003197.2_chunk_8
```

# Session Information
```{r}
sessionInfo()
```