---
title: "Using the neonSoilFlux package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using the neonSoilFlux package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction
Welcome to the `neonSoilFlux` package! This vignette will guide you through using this package to acquire environmental data and compute soil CO~2~ fluxes at different sites in the National Ecological Observatory Network (NEON).

You can think about this package working in two primary phases:

1. Acquiring the environmental data for a given month at a NEON site (`acquire_neon_data`). This includes:
    a. Soil temperature at different depths.
    b. Soil water content at different depths.
    c. Soil CO~2~ concentration.
    d. Atmospheric pressure.
    e. Soil properties (bulk density, others).

2. Computing, from those data, the soil surface fluxes and their associated uncertainty using a variety of methods (`compute_neon_flux`).

We split these two steps into separate functions both to save time and because they are fundamentally different processes.  Acquiring the NEON data makes use of the `neonUtilities` package.

This package takes the guesswork out of which data products to collect, with the aim of simplifying the workflow.  We rely heavily on the `tidyverse` philosophy for computation and coding here.

An overview of the package is also presented in `neonSoilFlux`: An R Package for Continuous Sensor-Based Estimation of Soil CO~2~ Fluxes, published in [*Methods in Ecology and Evolution*](https://doi.org/10.1111/2041-210X.70216).

![Model diagram of the data workflow for the `neonSoilFlux` R package.  a) **Acquire:** Data are obtained for a given NEON location and horizontal sensor location, which includes soil water content, soil temperature, CO$_{2}$ concentration, and atmospheric pressure. All data are screened for quality assurance; if gap-filling of missing data occurs, it is flagged for the user. b) **Harmonize:** Any belowground data are then harmonized to the same depth as CO~2~ concentrations using linear regression. c) **Compute:** The flux across a given depth is computed via Fick's law, denoted with F~ijk~, where $i$, $j$, or $k$ are either 0 or 1 denoting the layers the flux is computed across ($i$ = closest to surface, $k$ = deepest). F~000~ represents a flux estimate where the gradient $dC/dz$ is the slope of a linear regression of CO~2~ with depth.](model-diagram.pdf){width=100%}


## Get a NEON API token
NEON is now requiring an API token to access their data.  You can find information about acquiring a token at [https://www.neonscience.org/resources/learning-hub/tutorials/api-token-setup](https://www.neonscience.org/resources/learning-hub/tutorials/api-token-setup).

Once you have a NEON API token, you can set it with the function `neon_api_token`:

```{r, eval = FALSE}
neonSoilFlux::neon_api_token("YOUR_TOKEN_HERE", install = TRUE)
```


## Acquiring NEON environmental data
Load up the relevant libraries:
```{r, eval = FALSE}
library(tidyverse)
library(neonSoilFlux)
library(neonUtilities)
```

Let's say we want to acquire the NEON soil data at the `SJER` [site](https://www.neonscience.org/field-sites/sjer) during June 2022:


```{r, eval=FALSE}
out_env_data <- acquire_neon_data(site_name = 'SJER',
                  download_date = '2022-06'
                  )
```

Two inputs are required to run the function `acquire_neon_data`:

- NEON site name (a four-letter code defined by NEON)
- Download date, a string in the YYYY-MM format

Optional arguments are detailed in the section titled "Advanced options".


As the data are acquired various messages from the `loadByProduct` function from the `neonUtilities` package are shown - this is normal.  Products are acquired from each spatial location (`horizontalPosition`) or vertical depth (`verticalPosition`) at a NEON site.

Outputs for `acquire_neon_data` are two nested data frames:

- `site_data`: This contains three variables:
    - `measurement`: the measurement name, one of `soilCO2concentration`, `VSWC` (soil water content), `soilTemp` (soil temperature), and `staPres` (atmospheric pressure).
    - `monthly_mean`: the mean value of the measurement at each horizontal position and vertical depth.  We compute the monthly mean using a bootstrap technique.
    - `data`: the stacked variables acquired from `neonUtilities`: the horizontal and vertical positions, timestamp (in UTC), associated values, and the QF flag (0 = pass, 1 = fail; see [LINK](https://www.neonscience.org/data-samples/data-management/data-quality-program)).
- `site_megapit`: the nested data frame of the soil sampling data, found here [LINK](https://doi.org/10.48443/S6ND-Q840).  This data table is essentially what is reported back from acquiring the data product from NEON.


### Data preparation
For each data product, the `acquire_neon_data` function also performs three additional steps:

- The soil water content data product requires an additional correction of both the sensor depth and the sensor calibration, which is done in the function `swc_correct`.  Information about this correction is found here: [LINK](https://data.neonscience.org/data-products/DP1.00094.001). Once updated sensors are installed in the future we will deprecate this function.
- The actual measurement depth (in meters) is extracted for each position.
- The monthly mean for each measurement at each depth is computed, described in the section titled "Computing the monthly mean".

### Advanced options
The function `acquire_neon_data` has additional input options that may be useful for your work:

- `token`: The string of the NEON API token. The default is `NULL`, but you can supply an API token directly. Information on acquiring a NEON token is at [https://www.neonscience.org/resources/learning-hub/tutorials/api-token-setup](https://www.neonscience.org/resources/learning-hub/tutorials/api-token-setup).
- `time_frequency`: Will you be using 30 minute (`"30_minute"`) or 1 minute (`"1_minute"`) recorded data? The current default is 30 minutes.  1 minute data is implemented, but has not been sufficiently tested (and it also requires a lot of memory).
- `provisional`: Should you use provisional data when downloading? This option is useful if you are accessing data that is not part of the most current [NEON data release](https://www.neonscience.org/data-samples/data-management/data-revisions-releases) (i.e. the current year). Defaults to FALSE.
- `depth_chop`: This is useful if you want to only compute fluxes with measurement levels to a certain depth. There are typically 8 measurement levels below ground.  Currently set to `NULL` (all levels).  The provided integer must be greater than 4 (top 4 levels).

### Visualizing outputs
With the resulting output from `acquire_neon_data`, you can then unnest the different data frames to make plots. The following code plots the timeseries of volumetric soil water content across all spatial locations at SJER:

```{r, eval=FALSE}
library(tidyverse)

# Extract data
VSWC_data <- out_env_data$site_data |>
  filter(measurement == 'VSWC') |>
  unnest(cols=c("data"))

# Plot data
VSWC_data |>
  ggplot(aes(x=startDateTime,y=VSWCMean)) +
  geom_point(aes(color=as.factor(VSWCFinalQF))) +
  facet_grid(verticalPosition~horizontalPosition)
```


### Computing the monthly mean

The monthly mean is used when a given measurement fails final QF checks. The computation is based on [code](https://github.com/zoey-rw/microbialForecasts/blob/caa7b1a8aa8a131a5ff9340f1562cd3a3cb6667b/data_construction/covariate_prep/soil_moisture/clean_NEON_sensor_moisture_data.r) from [Zoey Werbin](https://github.com/zoey-rw).  At each replicate location (`horizontalPosition`) and soil depth, a monthly mean is computed when there are at least 15 days of measurements.

Assume you have a vector of measurements $\vec{y}$, standard errors $\vec{\sigma}$, and expanded uncertainties $\vec{\epsilon}$ (all of length $M$) that pass the QF checks in a given month. By definition, the expanded uncertainty $\vec{\epsilon}$ includes a [95% confidence interval](https://www.neonscience.org/data-samples/data-management/data-quality-program), so $\vec{\sigma}_{i}\leq\vec{\epsilon}_{i}$. Additionally, we define the bias $\vec{b}=\sqrt{\left(\vec{\epsilon}\right)^{2}-\left(\vec{\sigma}\right)^{2}}$ to be the quadrature difference between the expanded uncertainty and the standard error.


We generate a bootstrap sample of the mean $\overline{y}$ and standard error $\overline{s}$ as follows. In our case we set the number of bootstrap samples $N$ to be 5000. Individual entries $\overline{y}_{i}$ and $s_{i}$ are determined by the following:

1. Randomly and independently sample one standard error $\vec{\sigma}_{j}$ and one bias $\vec{b}_{k}$ (the indices $j$ and $k$ are not necessarily the same).
2. Generate $N$ random samples from a normal distribution with mean $\vec{y}$ and standard deviation $\vec{\sigma}_{j}$. Since $M<N$, `R` will recycle the vector $\vec{y}$ so that this sample is of length $N$. We call this sample $\vec{x}$.
3. With these $N$ random samples, $\overline{y}_{i}=\overline{\vec{x}}+\vec{b}_{k}$ and $s_{i}$ is the sample standard deviation of $\vec{x}$. We expect that $s_{i} \approx \vec{\sigma}_{j}$.

Once that is complete, the reported monthly mean and standard deviation are $\overline{\overline{y}}$ and $\overline{s}$, the averages of the $\overline{y}_{i}$ and $s_{i}$ across the $N$ bootstrap samples.
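
The steps above can be sketched in base R. This is only an illustration under stated assumptions, not the code used inside the package: the vectors `y`, `sigma`, and `epsilon` are synthetic values invented for this example, and all variable names are hypothetical.

```{r, eval = FALSE}
set.seed(42)  # make the sketch reproducible

# Synthetic inputs (assumed for illustration): M values that passed QF checks
M <- 20
y <- rnorm(M, mean = 0.25, sd = 0.02)  # measurements
sigma <- rep(0.005, M)                 # standard errors
epsilon <- rep(0.013, M)               # expanded uncertainties (sigma <= epsilon)

# Bias: quadrature difference between expanded uncertainty and standard error
b <- sqrt(epsilon^2 - sigma^2)

N <- 5000           # number of bootstrap samples
y_bar <- numeric(N) # bootstrap means
s <- numeric(N)     # bootstrap standard deviations

for (i in seq_len(N)) {
  # Step 1: sample one standard error and one bias, independently
  sigma_j <- sample(sigma, 1)
  b_k <- sample(b, 1)
  # Step 2: N normal draws; rnorm() recycles y as the vector of means
  x <- rnorm(N, mean = y, sd = sigma_j)
  # Step 3: mean shifted by the sampled bias, and the sample standard deviation
  y_bar[i] <- mean(x) + b_k
  s[i] <- sd(x)
}

# Reported monthly mean and standard deviation
monthly_mean <- mean(y_bar)
monthly_sd <- mean(s)
```

Because each iteration draws $N$ values, the loop generates $N^2$ random numbers in total, so it can take a few seconds to run.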


## Computing soil CO2 fluxes
Once we have `out_env_data` from `acquire_neon_data`, we then compute the fluxes at this site:
```{r, eval=FALSE}
out_fluxes <- compute_neon_flux(
  input_site_env = out_env_data$site_data,
  input_site_megapit = out_env_data$site_megapit
  )
```

The resulting data frame `out_fluxes` has the following variables:

- `startDateTime`: Time period of measurement (as POSIXct)
- `horizontalPosition`: Sensor location where flux is computed
- `flux_compute`: A nested tibble with soil flux gradients computed via different diffusivities at different measurement depths. See below.
- `surface_diffusivity`: Computation of surface diffusivity (see below)
- `soilCO2concentrationMeanQF`: QF flag for soil CO2 concentration across all vertical depths at the given horizontal position: 0 = no issues, 1 = monthly mean used in measurement, 2 = QF fail
- `VSWCMeanQF`: QF flag for volumetric soil water content (VSWC) across all vertical depths at the given horizontal position: 0 = no issues, 1 = monthly mean used in measurement, 2 = QF fail
- `soilTempMeanQF`: QF flag for soil temperature across all vertical depths at the given horizontal position: 0 = no issues, 1 = monthly mean used in measurement, 2 = QF fail
- `staPresMeanQF`: QF flag for atmospheric pressure at the given horizontal position: 0 = no issues, 1 = monthly mean used in measurement, 2 = QF fail

A measurement receives a QF fail when a monthly mean could not be computed for it. If any of the input variables (soil CO2, VSWC, soil temperature, and atmospheric pressure) have a QF fail, then **all** flux calculations fail at that given horizontal position.
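
As a minimal sketch of how these flags might be combined, the example below builds a toy data frame that borrows only the QF column names listed above (the values and the helper column `worst_qf` are hypothetical) and keeps the rows where every input was directly measured.

```{r, eval = FALSE}
# Toy stand-in for the QF columns of out_fluxes (values are hypothetical)
toy_fluxes <- data.frame(
  horizontalPosition = c("001", "001", "002", "002"),
  soilCO2concentrationMeanQF = c(0, 1, 0, 2),
  VSWCMeanQF = c(0, 0, 0, 0),
  soilTempMeanQF = c(0, 0, 1, 0),
  staPresMeanQF = c(0, 0, 0, 0)
)

qf_cols <- c("soilCO2concentrationMeanQF", "VSWCMeanQF",
             "soilTempMeanQF", "staPresMeanQF")

# Worst flag per row: 0 = all measured, 1 = a monthly mean was used, 2 = a QF fail
toy_fluxes$worst_qf <- apply(toy_fluxes[qf_cols], 1, max)

# Rows where every environmental input was directly measured
measured_only <- toy_fluxes[toy_fluxes$worst_qf == 0, ]

# Rows where at least one input failed, so no flux is expected
failed <- toy_fluxes[toy_fluxes$worst_qf == 2, ]
```

The same idea should carry over to the real `out_fluxes` data frame, since the QF columns sit alongside the nested columns.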

The nested data frame `flux_compute` has the following structure:

- `diffus_method`: The type of diffusivity used to compute fluxes. Currently implemented are "Marshall" or "Millington-Quirk"
- `flux`: The calculated soil flux ($\mu$mol m^-2^ s^-1^)
- `flux_err`: The calculated flux error (by quadrature)
- `gradient`: The computed CO~2~ flux gradient ($\mu$mol m^-3^ m^-1^)
- `gradient_error`: The computed CO~2~ flux gradient error (by quadrature)
- `method`: Each site has three measurement layers, so we denote the flux with a three-digit subscript $F_{ijk}$, where the indicator variables $i$, $j$, and $k$ show whether a given layer was used (written in order of increasing depth), according to the following:

  | -   F~000~ ("000") is a surface flux estimate using the intercept of the linear regression of $D_{a}$ with depth and the slope from the linear regression of CO~2~ with depth (which represents $\displaystyle \frac{dC}{dz}$ in Fick's Law).
  | -   F~110~ ("110") is a flux estimate across the two shallowest measurement layers.
  | -   F~011~ ("011") is a flux estimate across the two deepest measurement layers.
  | -   F~101~ ("101") is a flux estimate across the shallowest and deepest measurement layers.

  | For F~110~, F~011~, and F~101~, the diffusivity used in Fick's Law is always at the deeper measurement layer. When used as a surface flux estimate we assume CO~2~ remains constant above this flux depth.

- `r2`: The R^2^ value from the linear regression for F~000~.  Otherwise it is `NA`.
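
The four flux estimates can be illustrated with a short base R sketch of Fick's law, $F = -D_{a} \, dC/dz$. All numbers below are hypothetical, the helper `two_layer_flux` is invented for this example, and the sign convention (depth $z$ negative below the surface, so a positive flux is directed out of the soil) is an assumption of this sketch rather than a statement about the package internals.

```{r, eval = FALSE}
# Hypothetical values at three measurement layers (shallowest to deepest)
z <- c(-0.02, -0.06, -0.16)        # depth in m, negative below the surface (assumed)
co2 <- c(25000, 60000, 110000)     # CO2 molar concentration, umol m^-3
D_a <- c(4.0e-6, 3.5e-6, 3.0e-6)   # diffusivity, m^2 s^-1

# Two-layer estimate: finite-difference gradient, diffusivity at the deeper layer
two_layer_flux <- function(shallow, deep) {
  gradient <- (co2[shallow] - co2[deep]) / (z[shallow] - z[deep])
  -D_a[deep] * gradient
}

F_110 <- two_layer_flux(1, 2)  # two shallowest layers
F_011 <- two_layer_flux(2, 3)  # two deepest layers
F_101 <- two_layer_flux(1, 3)  # shallowest and deepest layers

# F_000: slope of CO2 with depth, and D_a extrapolated to the surface (z = 0)
co2_fit <- lm(co2 ~ z)
diffus_fit <- lm(D_a ~ z)
F_000 <- -unname(coef(diffus_fit)["(Intercept)"]) * unname(coef(co2_fit)["z"])

# R^2 of the CO2 regression, reported only for the F_000 method
r2 <- summary(co2_fit)$r.squared
```

With these units the resulting fluxes are in $\mu$mol m^-2^ s^-1^, matching the `flux` column described above.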

The nested data frame `surface_diffusivity` has the following structure:

- `zOffset`: The depth that diffusivity is computed at.
- `diffusivity`: The calculated diffusivity (m^2^ s^-1^)
- `diffusExpUncert`: The calculated diffusivity uncertainty (by quadrature)
- `diffus_method`: The type of diffusivity used to compute fluxes. Currently implemented are "Marshall" or "Millington-Quirk"

### Assessing Environmental QF flags
You can see the distribution of the QF flags for each environmental measurement with `env_fingerprint_plot`:

```{r, eval = FALSE}
env_fingerprint_plot(out_fluxes)
```

The resulting plot has rows corresponding to the replicate plots (`horizontalPosition`), and columns corresponding to the different environmental measurements used when computing fluxes.


#### Explanation of QF check values:
- "Pass" means that for the given timepoint the sensor was online and the monthly mean was not used.  This is the highest quality measurement.
- "Monthly Mean" means that for the given timepoint the measurement value was replaced by the monthly mean.
- "Fail" means that no measurement was available. This occurs if there is not sufficient data to compute the monthly mean.  When a measurement fails it usually will be for the entire month.

### Assessing flux QF flags
Similarly, you can see the distribution of QF flags for each diffusivity and flux computation with `flux_fingerprint_plot`. Because there are two different diffusivities implemented ("Marshall" or "Millington-Quirk"), that option needs to be passed to `flux_fingerprint_plot`:

```{r, eval = FALSE}
# Fingerprint plot for Marshall method:
flux_fingerprint_plot(
  input_fluxes = out_fluxes,
  input_diffus_method = "Marshall")

# Fingerprint plot for Millington-Quirk method:
flux_fingerprint_plot(
  input_fluxes = out_fluxes,
  input_diffus_method = "Millington-Quirk")

```

(The default method is `"Marshall"`).  The resulting plot has rows corresponding to the replicate plots (`horizontalPosition`), and columns corresponding to the vertical levels used when computing the gradient for the soil flux.


#### Explanation of QF check values:
- "Pass" means that for the given timepoint, the computed flux was not NA and was positive (the sign of the derived flux conformed to expectations). Monthly means may have been used in the computation.
- "Fail" means that the flux was not computed.  This occurs if there is not sufficient data to compute the monthly mean (one environmental measurement was "Fail"), or the computed flux was negative.
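
The pass/fail rule described above can be expressed in a couple of lines of base R. This is a sketch of the rule as stated here, applied to hypothetical flux values; it is not the package's own implementation.

```{r, eval = FALSE}
# Hypothetical flux values (umol m^-2 s^-1); NA marks a flux that was not computed
flux <- c(1.2, NA, -0.3, 0.8)

# "Pass" requires a flux that is both available and positive
flux_qf <- ifelse(!is.na(flux) & flux > 0, "Pass", "Fail")
```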


### Visualizing outputs
To plot the flux results: 

```{r, eval=FALSE}
out_fluxes |>
  select(-surface_diffusivity) |>
  unnest(cols=c(flux_compute)) |>
  ggplot(aes(x=startDateTime,y=flux,color=method)) +
    geom_line() +
    facet_wrap(~horizontalPosition,scales = "free_y")
```

The diffusivity can be plotted similarly:

```{r, eval=FALSE}
out_fluxes |>
  select(-flux_compute) |>
  unnest(cols=c(surface_diffusivity)) |>
  ggplot(aes(x=startDateTime,y=diffusivity,color=as.factor(zOffset))) +
  geom_line() +
  facet_wrap(~horizontalPosition,scales = "free_y")  
```