---
title: "Quality Control of IFCB Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Quality Control of IFCB Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

This vignette provides an overview of quality control (QC) methods for Imaging FlowCytobot (IFCB) data using the `iRfcb` package. The package offers tools to analyze Particle Size Distribution (PSD) following Hayashi et al. *in prep*, verify geographical positions, and integrate contextual data from sources like ferrybox systems. These QC workflows ensure high-quality datasets for phytoplankton and microzooplankton monitoring in marine ecosystems.

You'll learn how to:

- Set up the `iRfcb` package and `Python` environment.
- Analyze particle size distributions for data quality.
- Check spatial metadata for proximity to land, basin classification, and missing positions.

Follow this tutorial to streamline the QC process and ensure reliable IFCB data.

## Getting Started

### Installation

You can install the package from GitHub using the `remotes` package:

```{r, eval=FALSE}
# install.packages("remotes")
remotes::install_github("EuropeanIFCBGroup/iRfcb")
```

Some functions from the `iRfcb` package used in this tutorial require `Python` to be installed. You can download `Python` from the official website: [python.org/downloads](https://www.python.org/downloads/).

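
Before setting up the Python environment, it can help to confirm that a Python interpreter is visible from R. The following is a minimal sketch using only base R. The helper name `find_python` is hypothetical and not part of `iRfcb`, and the interpreter names it searches for are assumptions that may differ on your system.

```{r, eval=FALSE}
# Hypothetical helper (not part of iRfcb): look for a Python executable on the PATH
find_python <- function(candidates = c("python3", "python")) {
  # Sys.which() returns a named character vector, with "" for names not found
  paths <- Sys.which(candidates)
  found <- paths[!is.na(paths) & nzchar(paths)]
  if (length(found) == 0) {
    return(NA_character_)
  }
  unname(found[1])
}

python_path <- find_python()

if (is.na(python_path)) {
  message("No Python interpreter found on the PATH; install Python before continuing.")
} else {
  message("Python found at: ", python_path)
}
```

If no interpreter is reported, install `Python` first and restart R so that the updated PATH is picked up.
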

Load the `iRfcb` and `ggplot2` libraries:

```{r, eval=FALSE}
library(iRfcb)
library(ggplot2)
```

```{r, include=FALSE}
library(iRfcb)
library(ggplot2)
```

```{r, include=FALSE}
library(reticulate)

# Define path to virtual environment
env_path <- file.path(tempdir(), "iRfcb") # Or your preferred venv path

# Install python virtual environment
tryCatch({
  ifcb_py_install(envname = env_path)
}, error = function(e) {
  warning("Python environment could not be installed.")
})
```

```{r, echo=FALSE}
# Check if Python is available
if (!py_available(initialize = TRUE)) {
  knitr::opts_chunk$set(eval = FALSE)
  warning("Python is not available. Skipping vignette evaluation.")
} else {
  # List available packages
  available_packages <- py_list_packages(python = reticulate::py_discover_config()$python)

  # Check if pandas and matplotlib are available
  if (!"pandas" %in% available_packages$package || !"matplotlib" %in% available_packages$package) {
    knitr::opts_chunk$set(eval = FALSE)
    warning("Required python modules are not available. Skipping vignette evaluation.")
  }
}
```

### Download Sample Data

To get started, download sample data from the [SMHI IFCB Plankton Image Reference Library](https://doi.org/10.17044/scilifelab.25883455.v3) (Torstensson et al. 2024) with the following function:

```{r, eval=FALSE}
# Define data directory
data_dir <- "data"

# Download and extract test data in the data folder
ifcb_download_test_data(dest_dir = data_dir,
                        max_retries = 10,
                        sleep_time = 30,
                        verbose = FALSE)
```

```{r, include=FALSE}
# Define data directory
data_dir <- "data"

# Download and extract test data in the data folder
if (!dir.exists(data_dir)) {
  # Download and extract test data if the folder does not exist
  ifcb_download_test_data(dest_dir = data_dir,
                          max_retries = 10,
                          sleep_time = 30,
                          verbose = FALSE)
}
```

## Particle Size Distribution

IFCB data can be quality controlled by analyzing the particle size distribution (PSD) (Hayashi et al. in prep).
`iRfcb` uses the code available at [https://github.com/kudelalab/PSD](https://github.com/kudelalab/PSD), which is effective at detecting samples with bubbles, beads, incomplete runs, etc. Before running the PSD quality check, ensure the necessary Python environment is set up and activated:

```{r, eval=FALSE}
# Define path to virtual environment
env_path <- file.path(tempdir(), "iRfcb") # Or your preferred venv path

# Install python virtual environment
ifcb_py_install(envname = env_path)

# Run PSD quality control
psd <- ifcb_psd(feature_folder = "data/features/2023",
                hdr_folder = "data/data/2023",
                save_data = FALSE,
                output_file = NULL,
                plot_folder = NULL,
                use_marker = FALSE,
                start_fit = 10,
                r_sqr = 0.5,
                beads = 10 ** 12,
                bubbles = 150,
                incomplete = c(1500, 3),
                missing_cells = 0.7,
                biomass = 1000,
                bloom = 5,
                humidity = 70)
```

```{r, include=FALSE}
# Run PSD quality control
psd <- ifcb_psd(feature_folder = "data/features/2023",
                hdr_folder = "data/data/2023",
                save_data = FALSE,
                output_file = NULL,
                plot_folder = NULL,
                use_marker = FALSE,
                start_fit = 10,
                r_sqr = 0.5,
                beads = 10 ** 12,
                bubbles = 150,
                incomplete = c(1500, 3),
                missing_cells = 0.7,
                biomass = 1000,
                bloom = 5,
                humidity = 70)
```

The results can be printed and visualized through plots:

```{r}
# Print output from PSD
head(psd$fits)
head(psd$flags)

# Plot PSD of the first sample
plot <- ifcb_psd_plot(sample_name = psd$data$sample[1],
                      data = psd$data,
                      fits = psd$fits,
                      start_fit = 10) +
  # Set white background and ensure plot background is white
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background = element_rect(fill = "white", color = NA)
  )

# Print the plot
print(plot)
```

## Geographical QC/QA

### Check If IFCB Is Near Land

To determine if the IFCB is near land (i.e.
ship in harbor), examine the position data in the `.hdr` files (or from vectors of latitudes and longitudes):

```{r}
# Read HDR data and extract GPS position (when available)
gps_data <- ifcb_read_hdr_data("data/data/",
                               gps_only = TRUE,
                               verbose = FALSE) # Do not print progress bar

# Create new column with the results
gps_data$near_land <- ifcb_is_near_land(gps_data$gpsLatitude,
                                        gps_data$gpsLongitude,
                                        distance = 100, # 100 meters from shore
                                        shape = NULL) # Using the default NE 1:10m Land Polygon

# Print output
head(gps_data)
```

For a more accurate determination, a detailed coastline `.shp` file may be required (e.g. the [EEA Coastline Polygon](https://www.eea.europa.eu/data-and-maps/data/eea-coastline-for-analysis-2/gis-data/eea-coastline-polygon)). Refer to the help pages of [`ifcb_is_near_land`](../reference/ifcb_is_near_land.html) for further information.

### Check Which Sub-Basin an IFCB Sample Is From

To identify the specific sub-basin of the Baltic Sea (or the region of a custom shapefile) from which an IFCB sample was collected, analyze the position data:

```{r}
# Define example latitude and longitude vectors
latitudes <- c(55.337, 54.729, 56.311, 57.975)
longitudes <- c(12.674, 14.643, 12.237, 10.637)

# Check which Baltic Sea basin each point is in
points_in_the_baltic <- ifcb_which_basin(latitudes, longitudes, shape_file = NULL)

# Print output
print(points_in_the_baltic)

# Plot the points and the basins
ifcb_which_basin(latitudes, longitudes, plot = TRUE, shape_file = NULL) +
  # Set white background and ensure plot background is white
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background = element_rect(fill = "white", color = NA)
  )
```

This function reads a pre-packaged shapefile of the Baltic Sea, Kattegat, and Skagerrak basins from the `iRfcb` package by default, or a user-supplied shapefile if provided. The shapefiles provided in `iRfcb` originate from [SHARK](https://shark.smhi.se/hamta-data/).

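
Conceptually, assigning a position to a basin is a point-in-polygon test. The sketch below illustrates that idea in base R with a ray-casting routine and two hypothetical rectangular regions. It is not how `ifcb_which_basin` is implemented, and the region names, outlines, and helper functions (`point_in_polygon`, `which_region`) are assumptions made for illustration only.

```{r, eval=FALSE}
# Ray-casting test: is the point (x, y) inside the polygon with vertices (px, py)?
# Assumption: longitude/latitude are treated as planar coordinates for simplicity.
point_in_polygon <- function(x, y, px, py) {
  n <- length(px)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does the edge between vertex j and vertex i straddle the horizontal line at y?
    if ((py[i] > y) != (py[j] > y)) {
      # x coordinate where the edge crosses that horizontal line
      x_cross <- px[i] + (y - py[i]) * (px[j] - px[i]) / (py[j] - py[i])
      if (x < x_cross) inside <- !inside
    }
    j <- i
  }
  inside
}

# Hypothetical rectangular regions (NOT the real basin outlines)
regions <- list(
  "Region A" = list(lon = c(10, 13, 13, 10), lat = c(55.5, 55.5, 58.5, 58.5)),
  "Region B" = list(lon = c(12, 16, 16, 12), lat = c(54, 54, 55.5, 55.5))
)

# Return the name of the first region containing each point, or NA if none does
which_region <- function(lat, lon, regions) {
  vapply(seq_along(lat), function(k) {
    hit <- vapply(regions, function(r) {
      point_in_polygon(lon[k], lat[k], r$lon, r$lat)
    }, logical(1))
    if (any(hit)) names(regions)[which(hit)[1]] else NA_character_
  }, character(1))
}

# Example positions, plus one point that lies outside both regions
example_lat <- c(55.337, 54.729, 56.311, 57.975, 60.000)
example_lon <- c(12.674, 14.643, 12.237, 10.637, 20.000)

region_labels <- which_region(example_lat, example_lon, regions)
print(region_labels)
```

Positions that fall outside every polygon are returned as `NA` in this sketch, which makes unmatched samples easy to filter or inspect.
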

### Check If Positions Are Within the Baltic Sea or Elsewhere

This check is useful if you want to apply a classifier only to phytoplankton from the Baltic Sea.

```{r}
# Define example latitude and longitude vectors
latitudes <- c(55.337, 54.729, 56.311, 57.975)
longitudes <- c(12.674, 14.643, 12.237, 10.637)

# Check if the points are in the Baltic Sea Basin
points_in_the_baltic <- ifcb_is_in_basin(latitudes, longitudes)

# Print results
print(points_in_the_baltic)

# Plot the points and the basin
ifcb_is_in_basin(latitudes, longitudes, plot = TRUE) +
  # Set white background and ensure plot background is white
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background = element_rect(fill = "white", color = NA)
  )
```

This function reads a land-buffered shapefile of the Baltic Sea Basin from the `iRfcb` package by default, or a user-supplied shapefile if provided.

### Find Missing Positions from RV Svea Ferrybox

This function is used by SMHI to collect and match stored ferrybox positions when they are not available in the `.hdr` files. An example ferrybox data file is provided in `iRfcb` with data matching sample **D20220522T000439_IFCB134**.

```{r}
# Print available coordinates from .hdr files
head(gps_data, 4)

# Define path where ferrybox data are located
ferrybox_folder <- "data/ferrybox_data"

# Get GPS position from ferrybox data
positions <- ifcb_get_ferrybox_data(gps_data$timestamp, ferrybox_folder)

# Print result
head(positions)
```

### Find Contextual Ferrybox Data from RV Svea

The [`ifcb_get_ferrybox_data`](../reference/ifcb_get_ferrybox_data.html) function can also be used to extract additional ferrybox parameters, such as temperature (parameter number **8180**) and salinity (parameter number **8181**).


```{r}
# Get salinity and temperature from ferrybox data
ferrybox_data <- ifcb_get_ferrybox_data(gps_data$timestamp,
                                        ferrybox_folder,
                                        parameters = c("8180", "8181"))

# Print result
head(ferrybox_data)
```

This concludes this tutorial for the `iRfcb` package. For more detailed information, refer to the package documentation or the other [tutorials](../articles/index.html). See how data pipelines can be constructed using `iRfcb` in the following [Example Project](https://github.com/nodc-sweden/ifcb-data-pipeline). Happy analyzing!

```{r, include=FALSE}
# Clean up
unlink(data_dir, recursive = TRUE)
unlink(env_path, recursive = TRUE)
```

## Citation

```{r, echo=FALSE}
# Print citation
citation("iRfcb")
```

## References

- Hayashi, K., Walton, J., Lie, A., Smith, J. and Kudela, M. Using particle size distribution (PSD) to automate imaging flow cytobot (IFCB) data quality in coastal California, USA. In prep.
- Torstensson, A., Skjevik, A-T., Mohlin, M., Karlberg, M. and Karlson, B. (2024). SMHI IFCB Plankton Image Reference Library. SciLifeLab. Dataset. https://doi.org/10.17044/scilifelab.25883455.v3