Title: | Genetic Analysis of Pooled Samples |
---|---|
Description: | Analyzing genetic data obtained from pooled samples. This package can read in Fragment Analysis output files, process the data, and score peaks, as well as facilitate various analyses, including cluster analysis, calculation of genetic distances and diversity indices, as well as bootstrap resampling for statistical inference. Specifically tailored to handle genetic data efficiently, researchers can explore population structure, genetic differentiation, and genetic relatedness among samples. We updated some functions from Covarrubias-Pazaran et al. (2016) <doi:10.1186/s12863-016-0365-6> to allow for the use of new file formats and referenced the following to write our genetic analysis functions: Long et al. (2022) <doi:10.1038/s41598-022-04776-0>, Jost (2008) <doi:10.1111/j.1365-294x.2008.03887.x>, Nei (1973) <doi:10.1073/pnas.70.12.3321>, Foulley et al. (2006) <doi:10.1016/j.livprodsci.2005.10.021>, Chao et al. (2008) <doi:10.1111/j.1541-0420.2008.01010.x>. |
Authors: | Kathleen Kuesters [aut, cre, cph], Jeffrey Long [aut], Jessica Blanton [aut], Walter Blank [ctb], Jeffrey Kovach [ctb], Ronald Blanton [ctb] |
Maintainer: | Kathleen Kuesters <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.7 |
Built: | 2024-11-25 16:23:05 UTC |
Source: | CRAN |
This function calculates allelic richness based on provided genetic data.
AlRich(datafile = data.frame, n = matrix)
datafile |
A data frame containing the data as read in by LoadData |
n |
A matrix giving the number of markers successfully genotyped, such as the output of the TypedLoci function. |
A vector containing the allelic richness for each locus.
genetic_data <- data.frame(
  Locus = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Locus_allele = c("Marker1", "n", 1, 2, 3, "Marker2", "n", 1, 2, 3),
  Sample1 = c(NA, 10, 0.5, 0.5, 0, NA, 10, 0.2, 0.3, 0.5),
  Sample2 = c(NA, 20, 0.1, 0.2, 0.7, NA, 20, 0.3, 0.4, 0.3),
  Sample3 = c(NA, 30, 0.3, 0.4, 0.3, NA, 30, 0.4, 0.2, 0.4)
)
n_alleles <- matrix(
  c(3, 3, 3,
    3, 3, 3,
    3, 3, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("Sample", 1:3), paste0("Sample", 1:3))
)
AlRich(datafile = genetic_data, n = n_alleles)
This function associates dye information with Fragman channel names. It is designed to be run on any .fsa format version once the final columns have been imported correctly.
associate_dyes(x, y)
x |
The output list of data frames from fsa_batch_imp. |
y |
The path to the folder from the current directory where the .fsa files that will be analyzed are stored. |
The input dataframe with an added column assigning fluorescent dye colors.
y <- system.file("extdata", package = "pooledpeaks")
x <- fsa_batch_imp(y, channels = 5, fourier = FALSE, saturated = FALSE,
                   lets.pullup = FALSE, plotting = FALSE, rawPlot = FALSE,
                   llength = 3000, ulength = 80000)
associate_dyes(x, y)
This function performs bootstrap analysis on genetic data.
BootStrap3(A = data.frame, Rep = 20, Stat = 1)
A |
Data frame containing data as read in by LoadData |
Rep |
Number of bootstrap replicates. |
Stat |
Type of statistic to compute (1 for AlRich, 2 for TwoLevelGST) |
Either a matrix of AlRich statistics or a list containing various statistics computed using TwoLevelGST.
genetic_data <- data.frame(
  Locus = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Locus_allele = c("Marker1", "n", 1, 2, 3, "Marker2", "n", 1, 2, 3),
  Sample1 = c(NA, 10, 0.5, 0.5, 0, NA, 10, 0.2, 0.3, 0.5),
  Sample2 = c(NA, 20, 0.1, 0.2, 0.7, NA, 20, 0.3, 0.4, 0.3),
  Sample3 = c(NA, 30, 0.3, 0.4, 0.3, NA, 30, 0.4, 0.2, 0.4)
)
BootStrap3(A = genetic_data, Rep = 10, Stat = 1)
BootStrap3(A = genetic_data, Rep = 10, Stat = 2)
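To illustrate what bootstrap resampling over loci involves, here is a minimal base-R sketch. It does not call BootStrap3 or reproduce its internals; the per-locus values, the names `locus_stat`, `boot_means`, and `boot_se`, and the choice of the mean as the statistic are all assumptions made for illustration.

```r
set.seed(42)  # make the resampling reproducible

# Hypothetical per-locus values of some statistic (not package output)
locus_stat <- c(2.1, 3.4, 2.8, 3.0, 2.5)
Rep <- 20     # number of bootstrap replicates

# Resample loci with replacement and recompute the mean each time
boot_means <- vapply(
  seq_len(Rep),
  function(i) mean(sample(locus_stat, replace = TRUE)),
  numeric(1)
)

# The spread of the replicates estimates the standard error of the mean
boot_se <- sd(boot_means)
```

Increasing Rep generally gives a more stable estimate of the spread, at the cost of run time.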
This function analyzes .fsa files in a specified folder, providing a summary of their version and batch information.
check_fsa_v_batch(x)
x |
The path to the folder from the current directory where the .fsa files that will be analyzed are stored. |
A written summary of how many .fsa files are in the folder and which version they are.
file_path <- system.file("extdata", package = "pooledpeaks")
check_fsa_v_batch(x = file_path)
This function cleans the score_markers_rev3 data by applying specified patterns and replacements to the ID and filename columns.
clean_scores( scores_data, pattern1 = NULL, replacement1 = NULL, pattern2 = NULL, replacement2 = NULL, pattern3 = NULL, replacement3 = NULL )
scores_data |
The list containing the output scores data from score_markers_rev3. |
pattern1 |
The first pattern to replace in the ID. This is intended to clean up ID names when the machine adds substrings to them. For example, 104.1a_FA060920_2020-06-09_C05.fsa.1 becomes 104.1a using pattern1 = "_FA.*" and replacement1 = "". |
replacement1 |
Replacement for the first pattern. |
pattern2 |
The second pattern to replace in the ID. See pattern1 for more details. |
replacement2 |
Replacement for the second pattern. |
pattern3 |
The pattern to replace in the file name. This is intended to clean up file names when the machine adds substrings to them. For example, 104.1a_FA060920_2020-06-09_C05.fsa.1 becomes 104.1a_FA060920_2020-06-09_C05.fsa using pattern3 = "\\.1*$" (the regular expression \.1*$ written as an R string) and replacement3 = "". |
replacement3 |
Replacement for the file name pattern. |
A cleaned long format data frame
scores_data <- list(
  data.frame(Score = c(90, 85, 70), stringsAsFactors = FALSE),
  data.frame(Score = c(80, 75, 60), stringsAsFactors = FALSE)
)
rownames(scores_data[[1]]) <- c("104.1a_FA060920_2020-06-09_C05.fsa_Sa.1",
                                "105.2b_FA060920_2020-06-09_C05.fsa_Sa.1",
                                "106.3c_FA060920_2020-06-09_C05.fsa_Fa.1")
rownames(scores_data[[2]]) <- c("107.4d_FA060920_2020-06-09_C05.fsa_Sa.1",
                                "108.5e_FA060920_2020-06-09_C05.fsa_Sa.1",
                                "109.6f_SA060920_2020-06-09_C05.fsa_Fa.1")
clean_scores(scores_data, pattern1 = "_SA.*", replacement1 = "",
             pattern2 = "_FA.*", replacement2 = "")
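The effect of the pattern and replacement arguments can be previewed with base R's `sub()` before running clean_scores. This is only a sketch of the regular expressions described for pattern1 and pattern3; it assumes clean_scores applies them in a comparable way, and the variable names `raw_name`, `id`, and `file_name` are illustrative.

```r
# Instrument-generated name from the argument descriptions
raw_name <- "104.1a_FA060920_2020-06-09_C05.fsa.1"

# pattern1: drop everything from "_FA" onward to recover the sample ID
id <- sub("_FA.*", "", raw_name)

# pattern3: drop the trailing ".1" suffix to recover the file name.
# The backslash is doubled because it sits inside an R string.
file_name <- sub("\\.1*$", "", raw_name)
```

Testing a pattern on one or two names this way makes it easier to spot an expression that removes more than intended.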
K-means Clustering
cluster(RawData = data.frame, K = 2)
RawData |
A data frame containing the raw data as read in by LoadData |
K |
An integer specifying the number of clusters. |
A list containing the results of the K-means cluster analysis, including cluster assignments and original data.
genetic_data <- data.frame(
  Locus = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Locus_allele = c("Marker1", "n", 1, 2, 3, "Marker2", "n", 1, 2, 3),
  Sample1 = c(NA, 10, 0.5, 0.5, 0, NA, 10, 0.2, 0.3, 0.5),
  Sample2 = c(NA, 20, 0.1, 0.2, 0.7, NA, 20, 0.3, 0.4, 0.3),
  Sample3 = c(NA, 30, 0.3, 0.4, 0.3, NA, 30, 0.4, 0.2, 0.4)
)
cluster(RawData = genetic_data, K = 2)
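For readers unfamiliar with K-means, the following base-R sketch clusters samples by their allele frequencies with `stats::kmeans()`. It is not the implementation of cluster(); the layout of `freqs` (one row per sample, one column per allele, values taken from the example above) and the object name `fit` are assumptions for illustration.

```r
set.seed(1)  # K-means starts from random centers

# Allele frequencies per sample: three alleles at each of two loci
freqs <- rbind(
  Sample1 = c(0.5, 0.5, 0.0, 0.2, 0.3, 0.5),
  Sample2 = c(0.1, 0.2, 0.7, 0.3, 0.4, 0.3),
  Sample3 = c(0.3, 0.4, 0.3, 0.4, 0.2, 0.4)
)

# Partition the samples into K = 2 groups
fit <- kmeans(freqs, centers = 2)
fit$cluster  # cluster assignment for each sample
```

Because the starting centers are random, cluster labels can swap between runs unless a seed is set.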
Perform clustering on samples of loci from a data frame and calculate statistics.
ClusterFromSamples(datafile = data.frame, numloci = 5, reps = 100)
datafile |
A data frame containing the input data; it must be in the format produced by LoadData. |
numloci |
An integer specifying the number of loci to sample. |
reps |
An integer specifying the number of repetitions. |
A matrix containing statistics calculated from the clustering results.
genetic_data <- data.frame(
  Locus = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Locus_allele = c("Marker1", "n", 1, 2, 3, "Marker2", "n", 1, 2, 3),
  Sample1 = c(NA, 10, 0.5, 0.5, 0, NA, 10, 0.2, 0.3, 0.5),
  Sample2 = c(NA, 20, 0.1, 0.2, 0.7, NA, 20, 0.3, 0.4, 0.3),
  Sample3 = c(NA, 30, 0.3, 0.4, 0.3, NA, 30, 0.4, 0.2, 0.4)
)
ClusterFromSamples(datafile = genetic_data, numloci = 5, reps = 10)
This function ensures that at least one peak for each sample is greater than a specified threshold (default: 500) and then formats the data frame for the next steps in the analysis.
data_manipulation(marker, threshold = 500)
marker |
A data frame containing marker data, where each row represents a marker and each column represents a sample. |
threshold |
The threshold value for peak height. Peaks below this threshold will be replaced with 0. |
A formatted data frame where at least one peak for each sample is greater than the specified threshold.
marker_data <- data.frame(
  Sample1 = c(400, 600, 700, 0),
  Sample2 = c(450, 550, 480, 0),
  Sample3 = c(300, 200, 400, 200),
  Sample4 = c(0, 0, 0, 0),
  row.names = c(185, 188, 191, 194)
)
data_manipulation(marker_data, threshold = 500)
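The per-sample check described above can be previewed in base R. The sketch below only reports which samples have at least one peak above the threshold; it does not reproduce how data_manipulation reformats the data or what it does with samples that fail the check. The name `has_strong_peak` is illustrative.

```r
# Peak heights: one row per allele size, one column per sample
marker_data <- data.frame(
  Sample1 = c(400, 600, 700, 0),
  Sample2 = c(450, 550, 480, 0),
  Sample3 = c(300, 200, 400, 200),
  Sample4 = c(0, 0, 0, 0),
  row.names = c("185", "188", "191", "194")
)
threshold <- 500

# TRUE for samples with at least one peak taller than the threshold
has_strong_peak <- vapply(
  marker_data,
  function(heights) any(heights > threshold),
  logical(1)
)
has_strong_peak
```

Here Sample1 and Sample2 pass the check, while Sample3 and Sample4 have no peak above 500.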
Calculate the correlation between expected and realized genetic distances and plot them.
DistCor(GD = matrix)
GD |
A matrix containing the genetic distance data. |
A plot showing the Expected Genetic Distance versus Realized Genetic Distance
genetic_distance_matrix <- matrix(
  c(0.316455,  0.2836333, 0.2760485, 0.2685221, 0.2797302, 0.3202661,
    0.2836333, 0.3106084, 0.2867215, 0.2687472, 0.2596309, 0.2957862,
    0.2760485, 0.2867215, 0.3338663, 0.297918,  0.3057039, 0.3153261,
    0.2685221, 0.2687472, 0.297918,  0.3107094, 0.2753477, 0.3042383,
    0.2797302, 0.2596309, 0.3057039, 0.2753477, 0.3761386, 0.3398558,
    0.3202661, 0.2957862, 0.3153261, 0.3042383, 0.3398558, 0.4402125),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("Sample1", "Sample2", "Sample3", "Ind1", "Ind2", "Ind3"),
                  c("Sample1", "Sample2", "Sample3", "Ind1", "Ind2", "Ind3"))
)
DC <- DistCor(genetic_distance_matrix)
This function calculates the empirical standard error based on repeated sampling.
EmpiricalSE(datafile = data.frame, NLoci = 10)
datafile |
A data frame containing genetic data as read in by LoadData. |
NLoci |
Number of loci to sample in each iteration. |
A numeric vector containing the empirical standard error estimates.
genetic_data <- data.frame(
  Locus = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Locus_allele = c("Marker1", "n", 1, 2, 3, "Marker2", "n", 1, 2, 3),
  Sample1 = c(NA, 10, 0.5, 0.5, 0, NA, 10, 0.2, 0.3, 0.5),
  Sample2 = c(NA, 20, 0.1, 0.2, 0.7, NA, 20, 0.3, 0.4, 0.3),
  Sample3 = c(NA, 30, 0.3, 0.4, 0.3, NA, 30, 0.4, 0.2, 0.4)
)
EmpiricalSE(datafile = genetic_data, NLoci = 3)
This function imports and extracts all of the information out of the .fsa files and combines them into one list type object. fsa_batch_imp is a modification of the original Fragman import script function, storing.inds. This revised script accommodates ABI's .fsa file format up to version 3. It retains Fragman functions for Fourier transformation, saturated peaks, and pull-up correction. Notable adjustments include updating channel parameters, utilizing Dyechannel count from the file directory, and streamlining the script by extracting data only from "DATA" tags. Major changes involve column selection for v3 formats and modifications to the "channel" parameter. Minor changes include allowing relative paths for the data directory, importing only .fsa files, and renaming channels with dye names. This revision ensures successful execution for any format version up to 3.
fsa_batch_imp( folder, channels = NULL, fourier = TRUE, saturated = TRUE, lets.pullup = FALSE, plotting = FALSE, rawPlot = FALSE, llength = 3000, ulength = 80000 )
folder |
The path to the folder from the current directory where the .fsa files that will be analyzed are stored. |
channels |
The number of dye channels expected, including the ladder. |
fourier |
TRUE/FALSE: whether a Fourier transformation should be applied. |
saturated |
TRUE/FALSE: whether to check and correct for saturated peaks. |
lets.pullup |
TRUE/FALSE: whether to apply pull-up correction to the samples to decrease noise from channel to channel. The default is FALSE; please do not change this. |
plotting |
TRUE/FALSE: whether plots of all channels should be drawn after data cleaning. |
rawPlot |
TRUE/FALSE: whether a plot of all vectors should be drawn. |
llength |
A numeric value for the minimum number of indexes in each channel. |
ulength |
A numeric value for the maximum number of indexes in each channel. |
A list in which each element is a data frame, one per .fsa file, with the channels in columns.
file_path <- system.file("extdata", package = "pooledpeaks")
fsa_batch_imp(file_path, channels = 5, fourier = FALSE, saturated = FALSE,
              lets.pullup = FALSE, plotting = FALSE, rawPlot = FALSE)
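The description mentions that only .fsa files are imported and that relative paths are accepted. A rough base-R sketch of that kind of filtering is shown below, using a throwaway folder; how fsa_batch_imp selects files internally is not shown in this manual, so the use of `list.files()` and the folder and file names here are assumptions for illustration.

```r
# Build a temporary folder containing two .fsa files and one other file
folder <- file.path(tempdir(), "fsa_demo")
dir.create(folder, showWarnings = FALSE)
file.create(file.path(folder, c("a.fsa", "b.fsa", "notes.txt")))

# Keep only files whose names end in ".fsa"
fsa_files <- list.files(folder, pattern = "\\.fsa$", full.names = TRUE)
basename(fsa_files)
```

A check like this before importing can confirm that the folder path resolves and contains the expected number of runs.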
Retrieves basic information about the sample and run from .fsa files, and aggregates multiple samples in a single object.
fsa_metadata(x)
x |
The path to the folder from the current directory where the .fsa files that will be analyzed are stored. |
A data frame that contains the metadata of the machine and run extracted from the .fsa file. One row for each .fsa file in directory x and the following columns: retrieved_sample_name, batch_container_name, fsa_version, user, run_start_date, run_start_time, machine_type, machineN_serial.
file_path <- system.file("extdata", package = "pooledpeaks")
fsa_metadata(x = file_path)
Using the number of typed loci, this function calculates the gene identity for every pairwise combination of individuals across all markers and returns the results as a matrix.
GeneIdentityMatrix(RawData = data.frame, LociGenotyped = matrix)
RawData |
A data frame containing the input data; it must be in the format produced by LoadData. |
LociGenotyped |
The output of the TypedLoci function. |
The Gene Identity Matrix
genetic_data <- data.frame(
  Locus = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Locus_allele = c("Marker1", "n", 1, 2, 3, "Marker2", "n", 1, 2, 3),
  Sample1 = c(NA, 10, 0.5, 0.5, 0, NA, 10, 0.2, 0.3, 0.5),
  Sample2 = c(NA, 20, 0.1, 0.2, 0.7, NA, 20, 0.3, 0.4, 0.3),
  Sample3 = c(NA, 30, 0.3, 0.4, 0.3, NA, 30, 0.4, 0.2, 0.4)
)
n_alleles <- matrix(
  c(3, 3, 3,
    3, 3, 3,
    3, 3, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("Sample", 1:3), paste0("Sample", 1:3))
)
GeneIdentityMatrix(RawData = genetic_data, LociGenotyped = n_alleles)
This function calculates the genetic distance matrix from a given gene identity matrix.
GeneticDistanceMatrix(J = matrix)
J |
The Gene Identity Matrix created using GeneIdentityMatrix |
The Genetic Distance Matrix
gene_identity_matrix <- matrix(
  c(0.3164550, 0.2836333, 0.2760485,
    0.2836333, 0.3106084, 0.2867215,
    0.2760485, 0.2867215, 0.3338663),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("Sample", 1:3), paste0("Sample", 1:3))
)
GeneticDistanceMatrix(gene_identity_matrix)
This function calculates the GST measure (Nei's coefficient of gene differentiation) from a gene identity matrix.
GST(J = matrix, pairwise = TRUE)
J |
A square matrix representing a gene identity matrix. |
pairwise |
Logical indicating whether to calculate pairwise GST. If FALSE, J must not contain any missing data. |
If pairwise = TRUE, returns a matrix of pairwise GST values. If pairwise = FALSE, returns the overall GST value.
gene_identity_matrix <- matrix(
  c(0.3164550, 0.2836333, 0.2760485,
    0.2836333, 0.3106084, 0.2867215,
    0.2760485, 0.2867215, 0.3338663),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("Sample", 1:3), paste0("Sample", 1:3))
)
GST(J = gene_identity_matrix, pairwise = TRUE)
GST(J = gene_identity_matrix, pairwise = FALSE)
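As a rough guide to how an overall GST relates to a gene identity matrix, the sketch below applies the textbook definition GST = (H_T - H_S) / H_T, with equal weight given to each sample and no sample-size correction. These choices are assumptions; the package's GST function may weight samples or correct for bias differently, so its output need not match `gst_sketch`.

```r
# Gene identity matrix copied from the example above
J <- matrix(
  c(0.3164550, 0.2836333, 0.2760485,
    0.2836333, 0.3106084, 0.2867215,
    0.2760485, 0.2867215, 0.3338663),
  nrow = 3, byrow = TRUE
)

J_S <- mean(diag(J))  # average gene identity within samples
J_T <- mean(J)        # gene identity in the pooled (total) population

H_S <- 1 - J_S        # within-sample gene diversity
H_T <- 1 - J_T        # total gene diversity

# Share of total diversity attributable to differences among samples
gst_sketch <- (H_T - H_S) / H_T
gst_sketch
```

Averaging every entry of J gives the pooled identity because the pooled allele frequencies are the mean of the per-sample frequencies when samples are weighted equally.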
This function calculates Jost's D measure from a gene identity matrix.
JostD(J = matrix, pairwise = TRUE)
J |
A gene identity matrix. |
pairwise |
Logical indicating whether to calculate pairwise Jost's D. If FALSE, J must not contain any missing data. |
If pairwise = TRUE, returns a matrix of pairwise Jost's D values. If pairwise = FALSE, returns the overall Jost's D value.
gene_identity_matrix <- matrix(
  c(0.3164550, 0.2836333, 0.2760485,
    0.2836333, 0.3106084, 0.2867215,
    0.2760485, 0.2867215, 0.3338663),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("Sample", 1:3), paste0("Sample", 1:3))
)
JostD(J = gene_identity_matrix, pairwise = TRUE)
JostD(J = gene_identity_matrix, pairwise = FALSE)
This function calculates Jost's D between two columns, specifically designed for comparing duplicate samples based on allele frequencies.
JostD_KK(Ni1, Ni2)
Ni1 |
Vector containing the allele frequencies of the first duplicate sample. |
Ni2 |
Vector containing the allele frequencies of the second duplicate sample. |
The calculated Jost's D value.
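No example accompanies this function, so here is a base-R sketch of the textbook Jost's D for two equally weighted samples, which simplifies to D = 1 - 2*J12 / (J11 + J22), where J11 and J22 are the within-sample gene identities and J12 is the between-sample identity. The helper `jost_d_pair` is hypothetical and is not JostD_KK itself, which may apply corrections not shown here.

```r
# Hypothetical helper: Jost's D between two allele-frequency vectors
jost_d_pair <- function(p1, p2) {
  stopifnot(length(p1) == length(p2))
  J11 <- sum(p1^2)      # gene identity within sample 1
  J22 <- sum(p2^2)      # gene identity within sample 2
  J12 <- sum(p1 * p2)   # gene identity between the two samples
  1 - 2 * J12 / (J11 + J22)
}

# Identical duplicates share every allele in the same proportion
d_same <- jost_d_pair(c(0.5, 0.5, 0), c(0.5, 0.5, 0))

# Samples with no alleles in common are completely differentiated
d_disjoint <- jost_d_pair(c(1, 0), c(0, 1))
```

Under this formula identical frequency vectors give 0 and vectors with no shared alleles give 1, which is the range within which a duplicate comparison is judged.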
This function transforms a data frame from LF (long format) to TDF (table format), performing various data manipulation steps: spreading data across columns, removing NA and/or 0 columns, merging ID allele heights within each replicate, transposing the table, converting from character to numeric class, and replacing empty data with "0".
lf_to_tdf(x)
x |
A data frame in LF format ideally coming out of the clean_scores function. |
A transformed data frame in TDF format.
scores <- data.frame(
  ID = c("104.1a", "105.2b", "106.3c", "107.4d", "108.5e", "109.6f"),
  filename = c("104.1a_FA060920_2020-06-09_C05.fsa_Sa.1",
               "105.2b_FA060920_2020-06-09_C05.fsa_Sa.1",
               "106.3c_FA060920_2020-06-09_C05.fsa_Fa.1",
               "107.4d_FA060920_2020-06-09_C05.fsa_Sa.1",
               "108.5e_FA060920_2020-06-09_C05.fsa_Sa.1",
               "109.6f_SA060920_2020-06-09_C05.fsa_Fa.1"),
  hei = c(2000, 3000, 4000, 5000, 2500, 1000),
  pos = c(2000, 3000, 4000, 5000, 2500, 1000),
  wei = c(290, 285, 280, 275, 270, 260),
  row.names = c("104.1a_FA060920_2020-06-09_C05.fsa_Sa.1",
                "105.2b_FA060920_2020-06-09_C05.fsa_Sa.1",
                "106.3c_FA060920_2020-06-09_C05.fsa_Fa.1",
                "107.4d_FA060920_2020-06-09_C05.fsa_Sa.1",
                "108.5e_FA060920_2020-06-09_C05.fsa_Sa.1",
                "109.6f_SA060920_2020-06-09_C05.fsa_Fa.1")
)
lf_to_tdf(scores)
This function imports data for genetic analysis.
LoadData(datafile = NULL)
datafile |
The path to your datafile. The format of your data should be .txt or .csv. |
A data frame containing the imported data formatted in the way necessary for downstream population genetic functions.
file <- system.file("extdata", "Multiplex_frequencies.txt", package = "pooledpeaks")
LoadData(file)
Generate a multidimensional scaling (MDS) plot from genetic distance data.
MDSplot( distance = matrix, pcs = c(1, 2), PF = NULL, y = c("dodgerblue", "red", "turquoise3", "purple", "olivedrab3") )
distance |
A matrix containing the genetic distance data. |
pcs |
A numeric vector specifying the principal coordinates to plot. |
PF |
A factor vector specifying population labels. |
y |
A character vector specifying colors for population labels. |
The output is the MDS plot for the samples for the specified principal coordinates.
genetic_distance_matrix <- matrix(
  c(0,         0.2836333, 0.2760485, 0.2685221, 0.2797302, 0.3202661,
    0.2836333, 0,         0.2867215, 0.2687472, 0.2596309, 0.2957862,
    0.2760485, 0.2867215, 0,         0.297918,  0.3057039, 0.3153261,
    0.2685221, 0.2687472, 0.297918,  0,         0.2753477, 0.3042383,
    0.2797302, 0.2596309, 0.3057039, 0.2753477, 0,         0.3398558,
    0.3202661, 0.2957862, 0.3153261, 0.3042383, 0.3398558, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("Sample1", "Sample2", "Sample3", "Ind1", "Ind2", "Ind3"),
                  c("Sample1", "Sample2", "Sample3", "Ind1", "Ind2", "Ind3"))
)
MDSplot(distance = genetic_distance_matrix, pcs = c(1, 3))
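To show what principal coordinates are, the sketch below runs classical multidimensional scaling with base R's `cmdscale()` on a small set of made-up points. Whether MDSplot uses `cmdscale()` internally is not stated in this manual, so treat this as an illustration of the technique rather than of the function; `pts`, `d`, and `coords` are illustrative names.

```r
# Four hypothetical points at the corners of a 3 x 4 rectangle
pts <- rbind(A = c(0, 0), B = c(3, 0), C = c(0, 4), D = c(3, 4))

# Pairwise Euclidean distances, the only input MDS needs
d <- dist(pts)

# Recover coordinates on the first two principal coordinates
coords <- cmdscale(d, k = 2)
coords
```

The recovered coordinates may be rotated or reflected relative to `pts`, but the distances between points are preserved; passing the two columns of `coords` to `plot()` gives the kind of scatter that an MDS plot displays.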
This function manipulates consolidated marker data and egg count data to prepare them for further analysis.
PCDM(consolidated_marker = data.frame, eggcount = data.frame, marker_name)
consolidated_marker |
A data frame containing consolidated marker data. |
eggcount |
A data frame containing egg count data. |
marker_name |
A string specifying the marker name. |
A dataframe containing the allele frequencies and eggcounts for each sample.
marker_data <- data.frame(
  Sample1 = c(400, 600, 700),
  Sample2 = c(450, 550, 480),
  Sample3 = c(300, 200, 500),
  row.names = c(185, 188, 191)
)
eggs <- data.frame(ID = c("Sample1", "Sample2", "Sample3"), n = c(3000, 400, 50))
PCDM(consolidated_marker = marker_data, eggcount = eggs, "SMMS2")
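The step from peak heights to allele frequencies can be sketched in base R by dividing each sample's peak heights by that sample's total. This is an assumption about the kind of normalization involved, not a reproduction of PCDM, whose output also carries the egg counts and marker name; `heights` and `freqs` are illustrative names.

```r
# Peak heights: one row per allele size, one column per sample
marker_data <- data.frame(
  Sample1 = c(400, 600, 700),
  Sample2 = c(450, 550, 480),
  Sample3 = c(300, 200, 500),
  row.names = c("185", "188", "191")
)

# Divide every column by its own sum so each sample's frequencies add to 1
heights <- as.matrix(marker_data)
freqs <- prop.table(heights, margin = 2)
round(freqs, 3)
```

For Sample3, for instance, the heights 300, 200, and 500 become frequencies 0.3, 0.2, and 0.5.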
This function calculates the GST from a gene identity matrix.
preGST(G = matrix)
G |
The gene identity matrix |
The GST value.
This function calculates the pre-Jost's D measure from a gene identity matrix.
preJostD(G = matrix)
G |
A square matrix representing a gene identity matrix. |
The Jost's D value.
This function checks for duplicate samples in the input data frame and calculates the average peak heights for each sample. If the Jost's D between duplicate samples exceeds 0.05, it flags those samples.
Rep_check(df)
df |
The input data frame containing peak heights for each sample. |
A data frame containing the average peak heights for each sample, with flagged samples where duplicates have a Jost's D exceeding 0.05.
marker_data <- data.frame(
  Sample.1a = c(400, 600, 700),
  Sample.1b = c(420, 606, 710),
  Sample.2a = c(450, 550, 480),
  Sample.2b = c(500, 540, 480),
  Sample.3a = c(300, 200, 500),
  Sample.3b = c(290, 100, 400),
  row.names = c(185, 188, 191)
)
Rep_check(marker_data)
This function calculates the RWC (Random Walk Covariance) distance matrix from a given matrix of genetic distances.
RWCDistanceMatrix(J = matrix)
J |
The Genetic Distance Matrix calculated using GeneticDistanceMatrix |
A matrix representing the distance matrix calculated using the Random Walk Covariance method.
genetic_distance_matrix <- matrix(
  c(0.316455,  0.2836333, 0.2760485, 0.2685221, 0.2797302, 0.3202661,
    0.2836333, 0.3106084, 0.2867215, 0.2687472, 0.2596309, 0.2957862,
    0.2760485, 0.2867215, 0.3338663, 0.297918,  0.3057039, 0.3153261,
    0.2685221, 0.2687472, 0.297918,  0.3107094, 0.2753477, 0.3042383,
    0.2797302, 0.2596309, 0.3057039, 0.2753477, 0.3761386, 0.3398558,
    0.3202661, 0.2957862, 0.3153261, 0.3042383, 0.3398558, 0.4402125),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("Sample1", "Sample2", "Sample3", "Ind1", "Ind2", "Ind3"),
                  c("Sample1", "Sample2", "Sample3", "Ind1", "Ind2", "Ind3"))
)
RWCDistanceMatrix(genetic_distance_matrix)
An internal function that supports ClusterFromSamples. Sample loci from a dataset based on the number of loci specified.
SampleOfLoci(aaax = data.frame, NLoci = max(aaax[, 1]))
aaax |
A data frame containing the input data; it must be in the format produced by LoadData. |
NLoci |
An integer specifying the number of loci to sample. |
A data frame containing the sampled loci.
This is a revision of the Fragman script score.markers. For the original instructions and parameters, run '?score.markers'. This revision designates separate parameters for the left and right search windows.
score_markers_rev3( my.inds, channel = 1, n.inds = NULL, panel = NULL, shift = 0.8, ladder, channel.ladder = NULL, ploidy = 2, left.cond = c(0.6, 3), right.cond = 0.35, warn = FALSE, windowL = 0.5, windowR = 0.5, init.thresh = 200, ladd.init.thresh = 200, method = "iter2", env = parent.frame(), my.palette = NULL, plotting = FALSE, plotdir = "plots_scoring", pref = 3 )
my.inds |
The list output from the fsa_batch_imp or storing.inds function that contains the channel information from the individuals that you want to score. |
channel |
The number of the channel you wish to analyze. Typically 1 is blue, 2 is green, 3 yellow, and 4 red. |
n.inds |
(optional) A vector specifying which fsa files to score. |
panel |
A vector containing the expected allele sizes for this marker. |
shift |
All peaks at that distance from the tallest peak will be ignored and be considered noise. |
ladder |
A vector containing the expected peaks for your ladder. |
channel.ladder |
The channel number where your ladder can be found. |
ploidy |
The name is a relic of the fact that Fragman::score.markers was originally written for plants. In the context of pooled egg samples it is used to specify the number of possible alleles in the marker. |
left.cond |
The first part is a percentile (0-1) that corresponds to the height that a peak to the left of the tallest peak must be in order to be considered real. The second argument is a number of base pairs that a peak to the left of the tallest peak must be away to be considered as real. |
right.cond |
A percentile (0-1) that corresponds to the height that a peak to the right of the tallest peak must be in order to be real. |
warn |
TRUE/FALSE Do you want to receive warnings when detecting the ladder? |
windowL |
The left search window: all peaks within this distance to the left of a panel peak will be counted as peaks. |
windowR |
The right search window: all peaks within this distance to the right of a panel peak will be counted as peaks. |
init.thresh |
A value that sets a minimum intensity in order for a peak to be called. |
ladd.init.thresh |
We don't recommend changing this parameter unless your ladder has special circumstances. See Fragman::score.markers. |
method |
In cases where samples weren't sized using the info.ladder.attach function, this technique steps in to identify ladder peaks. You have three method options using an argument: "cor" explores all potential peak combinations and thoroughly searches for correlations to identify the correct peaks corresponding to expected DNA weights; "ci" constructs confidence intervals to identify peaks meeting specified conditions from earlier arguments; "iter2" applies an iterative strategy to identify the most likely peaks aligning with your ladder expectations. The default method is "iter2." |
env |
Please do not change this parameter; it is used to detect the user's environment. |
my.palette |
(optional) A character vector specifying which colors to use for the output RFU plots. |
plotting |
TRUE/FALSE Do you want to create pdf output plots? |
plotdir |
The name of the directory where output pdf plots should be stored. |
pref |
The number of plots to be drawn in the output plot. |
The score_markers_rev3 function will return a list containing three variables: $pos, $hei, and $wei. These correspond to the index position for the intensities, the intensity of each peak, and the weight in base pairs based on the ladder respectively. If plotting = TRUE, a pdf file will also have been created in the specified directory. This pdf file allows you to visually inspect how all of the peaks were scored.
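The separate left and right search windows described for windowL and windowR can be illustrated with a small base R sketch. It is a simplified illustration, not the function's actual scoring code: the helper in_window is hypothetical, and the peak sizes and panel values are invented for the example.

```r
# Hypothetical sketch of asymmetric search windows around panel alleles.
# A peak is kept if it lies at most windowL to the left, or at most
# windowR to the right, of any expected panel allele size.
in_window <- function(peak_bp, panel, windowL = 0.5, windowR = 0.5) {
  vapply(peak_bp, function(p) {
    offset <- p - panel  # negative values: peak is left of the panel allele
    any(offset >= -windowL & offset <= windowR)
  }, logical(1))
}

panel <- c(176, 179, 182)                     # expected allele sizes (bp)
peaks <- c(175.4, 176.3, 177.5, 181.8, 182.9) # made-up observed peak sizes

keep <- in_window(peaks, panel, windowL = 0.7, windowR = 0.4)
peaks[keep]
```

With a wider left window than right window, the peak at 175.4 (0.6 bp left of 176) is kept, while the peak at 182.9 (0.9 bp right of 182) is not.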
file_path <- system.file("extdata", package = "pooledpeaks")
mock_fsa_batch_imp_output <- fsa_batch_imp(file_path, channels = 5, fourier = FALSE, saturated = FALSE, lets.pullup = FALSE, plotting = FALSE, rawPlot = FALSE)
panel <- c(176, 179, 182, 185, 188, 191, 194, 197, 200, 203, 206)
ladder <- c(140, 160, 180, 200, 214, 220, 240, 250, 260, 280, 300, 314)
mock_fsa_batch_imp_output <- associate_dyes(mock_fsa_batch_imp_output, file_path)
score_markers_rev3(my.inds = mock_fsa_batch_imp_output, channel = 1, channel.ladder = 5, panel = "panel", ladder = ladder, init.thresh = 200, ploidy = length(panel), plotting = FALSE)
This function calculates the two-level GST (Nei's standard gene identity) measure from a gene identity matrix.
TwoLevelGST(G = matrix)
G |
A square matrix representing a gene identity matrix. |
A list containing the components of two-level GST including within-group gene identity, between-group gene identity, and GST values.
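For orientation, Nei's (1973) single-level GST can be written in terms of gene identities as (J_S - J_T) / (1 - J_T), where J_S is the mean within-sample identity (the diagonal of the matrix) and J_T is the mean over all entries. The sketch below implements only that textbook formula with equal sample weights. The helper name nei_gst_sketch is hypothetical, and the two-level decomposition returned by TwoLevelGST may be computed differently.

```r
# Hypothetical sketch: single-level Nei GST from a gene identity matrix,
# assuming equal weights for all samples.
nei_gst_sketch <- function(G) {
  stopifnot(is.matrix(G), nrow(G) == ncol(G))
  J_S <- mean(diag(G))  # mean within-sample gene identity
  J_T <- mean(G)        # mean gene identity over all sample pairs
  c(J_S = J_S, J_T = J_T, GST = (J_S - J_T) / (1 - J_T))
}

gene_identity_matrix <- matrix(c(
  0.3164550, 0.2836333, 0.2760485,
  0.2836333, 0.3106084, 0.2867215,
  0.2760485, 0.2867215, 0.3338663
), nrow = 3, byrow = TRUE,
dimnames = list(paste0("Sample", 1:3), paste0("Sample", 1:3)))

res <- nei_gst_sketch(gene_identity_matrix)
res
```

Because within-sample identity exceeds the overall mean identity here, the resulting GST is small and positive.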
gene_identity_matrix <- matrix(c(
  0.3164550, 0.2836333, 0.2760485,
  0.2836333, 0.3106084, 0.2867215,
  0.2760485, 0.2867215, 0.3338663
), nrow = 3, byrow = TRUE, dimnames = list(paste0("Sample", 1:3), paste0("Sample", 1:3)))
TwoLevelGST(G = gene_identity_matrix)
This function calculates the number of loci successfully genotyped for each individual included in the data set.
TypedLoci(datafile = data.frame)
datafile |
A data frame containing the input data. It must be in the format produced by LoadData. |
A matrix representing the number of loci successfully genotyped, suitable for use as the n argument of AlRich.
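One way to picture this kind of count is sketched below in base R. It rests on assumptions that are not confirmed by the package documentation: the rows labelled "n" are taken to hold a per-locus count for each sample, a locus is treated as typed when that count is positive and not missing, and the pairwise layout of the result is a guess. The actual TypedLoci output may be defined differently.

```r
# Hypothetical sketch: count typed loci per sample and per pair of samples.
genetic_data <- data.frame(
  Locus = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Locus_allele = c("Marker1", "n", 1, 2, 3, "Marker2", "n", 1, 2, 3),
  Sample1 = c(NA, 10, 0.5, 0.5, 0, NA, 10, 0.2, 0.3, 0.5),
  Sample2 = c(NA, 20, 0.1, 0.2, 0.7, NA, 20, 0.3, 0.4, 0.3),
  Sample3 = c(NA, 30, 0.3, 0.4, 0.3, NA, 30, 0.4, 0.2, 0.4)
)

# Keep only the "n" rows: one row per locus, one column per sample
n_rows <- genetic_data[genetic_data$Locus_allele == "n", ]
counts <- as.matrix(n_rows[, -(1:2)])
rownames(counts) <- n_rows$Locus

# Assumption: a locus is typed when its count is present and positive
typed <- !is.na(counts) & counts > 0

per_sample <- colSums(typed)     # typed loci for each sample
pairwise <- crossprod(typed * 1) # loci typed in both samples of each pair
```

Multiplying the logical matrix by 1 converts it to numeric while keeping the sample names, so the pairwise matrix is labelled by sample on both dimensions.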
genetic_data <- data.frame(
  Locus = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Locus_allele = c("Marker1", "n", 1, 2, 3, "Marker2", "n", 1, 2, 3),
  Sample1 = c(NA, 10, 0.5, 0.5, 0, NA, 10, 0.2, 0.3, 0.5),
  Sample2 = c(NA, 20, 0.1, 0.2, 0.7, NA, 20, 0.3, 0.4, 0.3),
  Sample3 = c(NA, 30, 0.3, 0.4, 0.3, NA, 30, 0.4, 0.2, 0.4)
)
TypedLoci(datafile = genetic_data)