Title: | Fragment Analysis in R |
---|---|
Description: | Performs fragment analysis using genetic data coming from capillary electrophoresis machines. These are files with FSA extension which stands for FASTA-type file, and .txt files from Beckman CEQ 8000 system, both contain DNA fragment intensities read by machinery. In addition to visualization, it performs automatic scoring of SSRs (Sample Sequence Repeats; a type of genetic marker very common across the genome) and other type of PCR markers (standing for Polymerase Chain Reaction) in biparental populations such as F1, F2, BC (backcross), and diversity panels (collection of genetic diversity). |
Authors: | Giovanny Covarrubias-Pazaran, Luis Diaz-Garcia, Brandon Schlautman, Walter Salazar, Juan Zalapa. |
Maintainer: | Giovanny Covarrubias-Pazaran <[email protected]> |
License: | GPL-3 |
Version: | 1.0.9 |
Built: | 2024-12-11 06:57:23 UTC |
Source: | CRAN |
Fragman is a package designed for Fragment analysis and automatic scoring of biparental populations (such as F1, F2, BC types) and populations for diversity studies. The program is designed to read files with FSA extension (which stands for FASTA-type file and contains lectures for DNA fragments), and .txt files from Beckman CEQ 8000 system, and extract the DNA intensities from the channels/colors where they are located, based on ABi machine plattforms to perform sizing and allele scoring.
The core of the package and the workflow of the fragment analysis rely in the following 4 functions;
1) storing.inds
(function in charge of reading the FSA or txt(CQS) files and storing them with a list structure)
2) ladder.info.attach
(uses the information read from the FSA files and a vector containing the ladder information (DNA size of the fragments) and matches the peaks from the channel where the ladder was run with the DNA sizes for all samples. Then loads such information in the R environment for the use of posterior functions)
3) overview2
(create friendly plots for any number of individuals specified and can be used to design panels (overview2
) for posterior automatic scoring (like licensed software does), or make manual scoring (overview
) of individuals such as parents of biparental populations or diversity populations)
4) The score.markers
(function score the alleles by finding the peaks provided in the panel (if provided), otherwise returns all peaks present in the channel). Thisfinal function can be automatized if several markers are located in the same channel by creating lists of panels taking advantage of R capabilities and data structures.
** Sometimes during the ladder sizing process some samples can go wrong for several reasons related to the sample quality (low intensity in ladder channel, extreme number of noisy peaks, etc.), because of that we have introduced ladder.corrector
function which allows the user to correct the bad samples by clicking over the real peaks, by default the ladder.info.attach
function returns the names of the samples that had a low correlation with the expected peaks.
When automatic scoring is not desired the function overview
can be used for getting an interactive session and click over the peaks (using the locator
function) in order to get the allele sizes.
Feel free to contact us with questions and improvement suggestions at:
Just send a sample file with your question to recreate the issue or bug reported along with vector for your ladder.
We have spent valuable time developing this package, please cite it in your publication:
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Giovanny Covarrubias-Pazaran, Luis Diaz-Garcia, Brandon Schlautman, Walter Salazar, Juan Zalapa.
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
http://cggl.horticulture.wisc.edu/home-page/
## ================================= ## ## ================================= ## ## Fragment analysis requires ## 1) loading your data ## 2) matching your ladder ## 3) define a panel for scoring ## 4) score the samples ## ================================= ## ## ================================= ## ##################### ## 1) Load your data ##################### ### you would use something like: # folder <- "~/myfolder" # my.plants <- storing.inds(folder) ### here we just load our sample data and use the first 2 plants ?my.plants data(my.plants) my.plants <- my.plants[1:2] class(my.plants) <- "fsa_stored" # plot(my.plants) # to visualize the raw data ####################### ## 2) Match your ladder ####################### ### create a vector indicating the sizes of your ladder and do the match my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) ladder.info.attach(stored=my.plants, ladder=my.ladder) ### matching your ladder is a critical step and should only happen once per batch of ### samples read ###****************************************************************************************### ### OPTIONAL: ### If the ladder.info attach function detects some bad samples ### that you can correct them manually using ### the ladder.corrector() function ### For example to correct one sample in the previous data ### ladder.corrector(stored=my.plants, #to.correct="FHN152-CPN01_01A_GH1x35_152-148-209_717-704-793_367-382-381.fsa", #ladder=my.ladder) ###****************************************************************************************### ####################### ## 3) Define a panel ####################### ### In fragment analysis you usually design a panel where you indicate ### which peaks are real. You may use the overview2 function which plots all the ### plants in the channel you want in the base pair range you want overview2(my.inds=my.plants, channel = 2:3, ladder=my.ladder, init.thresh=5000) ### You can click on the peaks you think are real, given that the ones ### suggested by the program may not be correct. This can be done by using the ### 'locator' function and press 'Esc' when you're done, i.e.: # my.panel <- locator(type="p", pch=20, col="red")$x ### That way you can click over the peaks and get the sizes ### in base pairs stored in a vector named my.panel ### Just for demonstration purposes I will use the suggested peaks by ### the program using overview2, which will return a vector with ### expected DNA sizes to be used in the next step for scoring ### we'll do it in the 160-190 bp region my.panel <- overview2(my.inds=my.plants, channel = 3, ladder=my.ladder, init.thresh=7000, xlim=c(160,190)); my.panel ########################## ## 4) Score the samples ########################## ### When a panel is created is time to score the samples by providing the initial ### data we read, the ladder vector, the panel vector, and our specifications ### of channel to score (other arguments are available) ### Here we will score our samples for channel 3 with our panel created previously res <- score.markers(my.inds=my.plants, channel = 3, panel=my.panel$channel_3, ladder=my.ladder, electro=FALSE) ### Check the plots and make sure they were scored correctly. In case some samples ### are wrong you might want to use the locator function again and figure out ### the size of your peaks. To extract your peaks in a data.frame do the following: final.results <- get.scores(res) final.results
## ================================= ## ## ================================= ## ## Fragment analysis requires ## 1) loading your data ## 2) matching your ladder ## 3) define a panel for scoring ## 4) score the samples ## ================================= ## ## ================================= ## ##################### ## 1) Load your data ##################### ### you would use something like: # folder <- "~/myfolder" # my.plants <- storing.inds(folder) ### here we just load our sample data and use the first 2 plants ?my.plants data(my.plants) my.plants <- my.plants[1:2] class(my.plants) <- "fsa_stored" # plot(my.plants) # to visualize the raw data ####################### ## 2) Match your ladder ####################### ### create a vector indicating the sizes of your ladder and do the match my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) ladder.info.attach(stored=my.plants, ladder=my.ladder) ### matching your ladder is a critical step and should only happen once per batch of ### samples read ###****************************************************************************************### ### OPTIONAL: ### If the ladder.info attach function detects some bad samples ### that you can correct them manually using ### the ladder.corrector() function ### For example to correct one sample in the previous data ### ladder.corrector(stored=my.plants, #to.correct="FHN152-CPN01_01A_GH1x35_152-148-209_717-704-793_367-382-381.fsa", #ladder=my.ladder) ###****************************************************************************************### ####################### ## 3) Define a panel ####################### ### In fragment analysis you usually design a panel where you indicate ### which peaks are real. You may use the overview2 function which plots all the ### plants in the channel you want in the base pair range you want overview2(my.inds=my.plants, channel = 2:3, ladder=my.ladder, init.thresh=5000) ### You can click on the peaks you think are real, given that the ones ### suggested by the program may not be correct. This can be done by using the ### 'locator' function and press 'Esc' when you're done, i.e.: # my.panel <- locator(type="p", pch=20, col="red")$x ### That way you can click over the peaks and get the sizes ### in base pairs stored in a vector named my.panel ### Just for demonstration purposes I will use the suggested peaks by ### the program using overview2, which will return a vector with ### expected DNA sizes to be used in the next step for scoring ### we'll do it in the 160-190 bp region my.panel <- overview2(my.inds=my.plants, channel = 3, ladder=my.ladder, init.thresh=7000, xlim=c(160,190)); my.panel ########################## ## 4) Score the samples ########################## ### When a panel is created is time to score the samples by providing the initial ### data we read, the ladder vector, the panel vector, and our specifications ### of channel to score (other arguments are available) ### Here we will score our samples for channel 3 with our panel created previously res <- score.markers(my.inds=my.plants, channel = 3, panel=my.panel$channel_3, ladder=my.ladder, electro=FALSE) ### Check the plots and make sure they were scored correctly. In case some samples ### are wrong you might want to use the locator function again and figure out ### the size of your peaks. To extract your peaks in a data.frame do the following: final.results <- get.scores(res) final.results
This function converts a data frame containing the joinmap code into the readable file for joinmap. This format still needs some extra information, specifically the header indicating the population type, no.loci and no. of individuals, please check file examples included in JoinMap software.
arrange.jm(x, par=FALSE)
arrange.jm(x, par=FALSE)
x |
A data frame containing the scores in jm coding. |
par |
A TRUE/FALSE value indicating if the data returned should include the parents or not, the dafault value is FALSE, indicating that the first 2 individuals will not be included. |
No major details.
If arguments are correct the function returns a new data frame
A new data frame with markers in joinmap readable format
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
xx <- data.frame(cbind(a=rep(150, 96), b=c(rep(100,48), rep(150,48)))); xx[1,] <- c(150,150) xx2 <- cbind(jm.conv(xx), jm.conv(xx), jm.conv(xx)) xx3 <- arrange.jm(xx2, par=FALSE) xx3[,1:10]
xx <- data.frame(cbind(a=rep(150, 96), b=c(rep(100,48), rep(150,48)))); xx[1,] <- c(150,150) xx2 <- cbind(jm.conv(xx), jm.conv(xx), jm.conv(xx)) xx3 <- arrange.jm(xx2, par=FALSE) xx3[,1:10]
This function just find the best layout fit for a number of plots desired.
best.layout(x)
best.layout(x)
x |
A scalar value indicating the number of plots desired |
No major details
Returns the best layout
the number of rows and columns giving the best fit
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
best.layout(9)
best.layout(9)
This function find all peaks by taking the first derivative based on 'rle' function. Is used in different Fragma functions.
big.peaks.col(x,tre)
big.peaks.col(x,tre)
x |
A vector of heights or intensities |
tre |
A scalar value deciding when peaks will be ignored |
No major details.
Retuns the biggest peaks for a vector of intensities.
a vector of positions where the derivative is zero and therefore a peak was found
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) big.peaks.col(my.plants[[1]][,1],100)#for any color
data(my.plants) big.peaks.col(my.plants[[1]][,1],100)#for any color
This function takes a vector of color heights/intensities from the fragment analysis containing the ladder/standard channel, and detects the biggest peaks where the derivative is equal zero and uses the information from the expected weights for the ladder to construct confidence intervals in order to detect the ladder peaks.
Please! if using the confidence interval method ("ci"), which is NOT the default, once you have found the best parameters for the arguments to match your ladder using this function, please pass those values to all the posterior functions, making sure the 'dev' argument is passed to the new functions. If using the correlation method ("cor"), don't worry about it.
detect.ladder(stored, ind=1, ladder, channel.ladder=dim(stored[[1]])[2], ci.upp=1.96, ci.low=1.96, draw=TRUE, dev=50, warn=TRUE, init.thresh=250,sep.index=8, method="cor", avoid=1500, who="sample")
detect.ladder(stored, ind=1, ladder, channel.ladder=dim(stored[[1]])[2], ci.upp=1.96, ci.low=1.96, draw=TRUE, dev=50, warn=TRUE, init.thresh=250,sep.index=8, method="cor", avoid=1500, who="sample")
stored |
Lis of dataframes obtained by using the |
ind |
The individual that you wish to analyze for assessing that the ladder was correctly detected. |
ladder |
Vector containing the expected weights of the dna fragments of the ladder in use |
channel.ladder |
A scalar value indicating in which channel or color the ladder was read |
ci.upp |
A scalar value indicating how many standar errors will be used to detect peaks when checking the height of the ladder peaks(upper bound). To be used in the |
ci.low |
A scalar value indicating how many standar errors will be used to detect peaks when checking the height of the ladder peaks(lower bound). To be used in the 'find.ladder' function |
draw |
A TRUE/FALSE value indicating if the plot for the ladder found should be printed or not |
dev |
A scalar value indicating the number of indexes to be used as peak separation when deciding the ladder peaks. Some ladders contain dna fragments of very closed weights and modifying this parameter helps to detect them correctly |
warn |
A TRUE/FALSE value indicating if warnings should be provided when detecting the ladder |
init.thresh |
An initial value of color intensity to be used when detecting the ladder, could be really important for the correlation method |
sep.index |
A scalar value indicating how many indexes should be allowed to considered a true peak from noisy peaks |
method |
An argument indicating one of the 2 methods available; "cor" makes all possible combination of peaks and searches exhaustive correlations to find the right peaks corresponsding to the expected DNA weights, or "ci" constructing confidence intervals to look for peaks meeting the conditions specified in the previous arguments |
who |
A name to indicate which sample is being analyzed |
avoid |
A scalar value indicating how many indexes should be avoided when the method of correlation fails to find peaks and a random sample will be drawn from the existing peaks. The default is 1500 indexes which will samples peaks avoiding the first 1500 indexes which is usually related to noisy area in some ladders. |
The peaks are detected by default using a correlation method bu the user can use confidence intervals if desired.
If parameters are indicated correctly the function returns:
the index positions for the intensities
the intensities for the fragments found
the putative weights in base pairs based on the ladder provided
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) my.ladder <- c(120, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) # looking at the first individual detect.ladder(stored=my.plants, ind=1, ladder=my.ladder)
data(my.plants) my.ladder <- c(120, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) # looking at the first individual detect.ladder(stored=my.plants, ind=1, ladder=my.ladder)
This function takes a vector of color heights/intensities from the fragment analysis containing the ladder/standard channel, and detects the biggest peaks where the derivative is equal zero and uses the information from the expected weights for the ladder to construct confidence intervals in order to detect the ladder peaks.
Please! if using the confidence interval method ("ci"), which is not the default, once you have found the best parameters for the arguments to match your ladder using this function, please pass those values to all the posterior functions, please make sure the 'dev' argument is passed to the new functions.
find.ladder(x, ladder, draw=TRUE, dev=50, warn=TRUE, init.thresh=NULL, sep.index=8, method=NULL, reducing=NULL, who="sample", attempt=10, cex.title=0.8)
find.ladder(x, ladder, draw=TRUE, dev=50, warn=TRUE, init.thresh=NULL, sep.index=8, method=NULL, reducing=NULL, who="sample", attempt=10, cex.title=0.8)
x |
Vector of heights from the ladder channel. See example to see how to access to it. |
ladder |
Vector containing the expected weights of the dna fragments of the ladder in use |
draw |
A TRUE/FALSE value indicating if the plot for the ladder found should be printed or not |
dev |
A scalar value indicating the number of indexes to be used as peak separation when deciding the ladder peaks. Some ladders contain dna fragments of very closed weights and modifying this parameter helps to detect them correctly |
warn |
A TRUE/FALSE value indicating if warnings should be provided when detecting the ladder |
init.thresh |
An initial value of color intensity to be used when detecting the ladder |
sep.index |
A scalar value indicating how many indexes should be allowed to considered a true peak from noisy peaks |
method |
An argument indicating one of the 2 methods available; "cor" makes all possible combination of peaks and searches exhaustive correlations to find the right peaks corresponsding to the expected DNA weights, or "ci" constructing confidence intervals to look for peaks meeting the conditions specified in the previous arguments |
who |
A name to indicate which sample is being analyzed |
attempt |
A scalar value indicating how many attempts should be made to find the real ladder peaks. By default is 7 attempts, which means that will try to build the model assuming that the first peak found in the ladder is the corresponding first peak of the expected ladder, then moves to the 2nd peak until the 7th and the seven models are compared picking the most likely model based on the R2 value for each of the models. |
reducing |
A vector of values to reduce the search of peaks to certain indexes in the x axis. Default is NULL so it looks for all peaks for matching the ladder. |
cex.title |
A scalar value indicating how big the title (name of the sample) in the plot should be. |
We have implemented 3 methods for sizing the ladder, each with their advantages and disadvantages. The default method named "red" which stands for "reduction" detect the region where peaks exist (in indexes) in the ladder channel and assumes that your ladder should have some equivalence in indexes and creates an 'expected ladder', then the putative ladder moves along the peak region and correlations and squared distances to the closest peaks are calculated. We have define the coefficient of similarity (CS) as cor(x,y)/var(z), where:
cor(x,y) are the correlations between expected and observed peaks, and var(z) is the sum of squares between the differences of expected and observed peaks.
This value usually let us identify the most likely peaks and then all possible combinations for those peaks are computed followed by exhaustive correlations of those combinations with the actual ladder. The highest correlation usually points to the right peaks, which is selected.
In addition the method "cor" is the previous version to "red" which doesn't reduce the search of peaks and computes all possible combinations of peaks from the beggining, with the drawback that slows down the detection process especially when the ladder intensities are low and noisy peaks exist in abundance.
The last method that has been superseded by the previous 2 is the "ci" method based on confidence intervals, which assumes that real ladder peaks have more or less the same intensity and a they can be found by finding the median intensity and computing a 90 percent confidence interval to find the rest of the peaks. This method has been proved to fail when the first condition is broken and ladder have real peaks with intensities greater than the expected.
If parameters are indicated correctly the function returns:
the index positions for the intensities
the intensities for the fragments found
the putative weights in base pairs based on the ladder provided
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) find.ladder(my.plants[[1]][,4], ladder=my.ladder)
data(my.plants) my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) find.ladder(my.plants[[1]][,4], ladder=my.ladder)
This function extracts the information from auto.score and score.easy functions and fits everything in a dataframe.
get.scores(my.scores, mark = "mark")
get.scores(my.scores, mark = "mark")
my.scores |
List of individuals which contains at the same time a list of positions, heights and weights for the fragment analysis |
mark |
A vector of names of the markers scored in the panel or the parents |
Nomajor details.
If arguments are correct the function returns a data frame containing
A dataframe containing the ssr calls
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
## here "a" is a similar ouptput to the `score.easy` function par1 <- list(pos=c(3100, 3240), hei=c(22917,20563), wei=c(202,212)) par2 <- list(pos=c(3100, 3240), hei=c(22917,20563), wei=c(202,214)) a <- list(i1=par1, i2=par2) get.scores(a)
## here "a" is a similar ouptput to the `score.easy` function par1 <- list(pos=c(3100, 3240), hei=c(22917,20563), wei=c(202,212)) par2 <- list(pos=c(3100, 3240), hei=c(22917,20563), wei=c(202,214)) a <- list(i1=par1, i2=par2) get.scores(a)
This functions takes a list of positions, heights and weights for ssr calls of a certain plant and uses panel information and a window to homogenize the weights of base pairs to the panel calls provided.
homo.panel(x, panel, window)
homo.panel(x, panel, window)
x |
list of positions, heights and weights |
panel |
different dna sizes usually obtained by using overview and locator functions |
window |
window in base pairs indicating when should be considered the same panel allele |
No major details.
If arguments are correct the function returns a list containing
the index positions for the intensities
the intensities for the fragments found
the putative weights in base pairs based on the panel provided
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
#No example provided, internally working to round to the closest parent ssr call. x <- 1:10
#No example provided, internally working to round to the closest parent ssr call. x <- 1:10
This functions takes a list of positions, heights and weights for ssr calls of a certain plant and uses parental information and a window to homogenize to the parent calls.
homogenize.to.parentals(x, parents, window)
homogenize.to.parentals(x, parents, window)
x |
list of positions, heights and weights |
parents |
different parental calls |
window |
window in base pairs indicating when should be considered the same allele |
No major details.
If arguments are correct the function returns a list containing
the index positions for the intensities
the intensities for the fragments found
the putative weights in base pairs based on the ladder provided
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
#No example provided, internally working to round to the closest parent ssr call. x <- 1:10
#No example provided, internally working to round to the closest parent ssr call. x <- 1:10
This function converts a data frame containing the scores from score.easy to a new data frame with joinmap calls. The parents need to be provided in the first and second row. Currently only works for CP type of crosses. From CP type to F2 and BC2 is straight forward.
jm.conv(a)
jm.conv(a)
a |
A data frame containing the scores from score.easy function extracted from get.scores function containing the parental calls in the first 2 rows. |
No major details.
If arguments are correct the function returns a new data frame
A new data frame with markers in joinmap format
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
xx <- data.frame(cbind(a=rep(150, 96), b=c(rep(100,48), rep(150,48)))) jm.conv(xx) # try using apply to a dataframe
xx <- data.frame(cbind(a=rep(150, 96), b=c(rep(100,48), rep(150,48)))) jm.conv(xx) # try using apply to a dataframe
This function was designed to correct manually samples that could not be correctly detected by the ladder.info.attach
function and allows the user to select manually the peaks he knows are the correct peaks. This function uses the output of the ladder.info.attach
function which is basically a vector with the names of the samples that were too dificult for the algorithm to find. The console will draw a plot and will ask the user to click over the peaks expected for a ladder provided, once the user is done should press the 'esc' key and continue to the next sample. This process is repeated until the all samples with the names provided are adjusted.
ladder.corrector(stored, to.correct, ladder, thresh=200, env = parent.frame(),...)
ladder.corrector(stored, to.correct, ladder, thresh=200, env = parent.frame(),...)
stored |
List with the channels information from the individuals specified, usually coming from the |
to.correct |
Vector containing the names ofthe samples to be corrected and usually the output of the |
ladder |
Vector containing the expected weights of the dna fragments of the ladder in use. |
thresh |
A scalar value indicating the minimum value in RFUs to look for peaks to match the user-selected peaks with the program recognized peaks. |
env |
this is used to detect the environment of the user and load the result in the same environment. |
... |
Further arguments to be passed |
Once the user has selected the right peaks the function will fix the ladder and attach such information to the R environment.
This function does not produce any output.
the program will attach the information to the R environment
We have spent valuable time developing this package, please cite it in your publication:
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) my.plants <- my.plants[1:2] my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) ladder.info.attach(stored=my.plants, ladder=my.ladder, ladd.init.thresh=300) ## now if something goes wrong use the corrector: #ladder.corrector(stored=my.plants, #to.correct="FHN152-CPN01_01A_GH1x35_152-148-209_717-704-793_367-382-381.fsa", #ladder=my.ladder)
data(my.plants) my.plants <- my.plants[1:2] my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) ladder.info.attach(stored=my.plants, ladder=my.ladder, ladd.init.thresh=300) ## now if something goes wrong use the corrector: #ladder.corrector(stored=my.plants, #to.correct="FHN152-CPN01_01A_GH1x35_152-148-209_717-704-793_367-382-381.fsa", #ladder=my.ladder)
This function uses the information stored by the storing.inds
function and a vector specifying the ladder/standard and finds the real peaks corresponding to the expected weights. The user may use this function to be able to load the ladder information in the global environment of R, so when using the overview
or score.markers
functions calculations will be performed faster, if the function is not used the program will calculate the ladder information each time overview
or score.markers
functions are used.
NOTE: THE STEP OF MATCHING THE LADDER WITH YOUR SAMPLES USING THE ‘ladder.info.attach' FUNCTION IS CRITICAL. IF YOU HAVE ANY PROBLEM TRY MODIFYING THE ARGUMENT ’method', WITH THE 2 MOST EFFECTIVE METHODS method="iter" OR method="iter2"
ladder.info.attach(stored, ladder, channel.ladder=NULL, method="iter2", ladd.init.thresh=NULL, env = parent.frame(), prog=TRUE, draw=TRUE, attempt=10)
ladder.info.attach(stored, ladder, channel.ladder=NULL, method="iter2", ladd.init.thresh=NULL, env = parent.frame(), prog=TRUE, draw=TRUE, attempt=10)
stored |
List with the channels information from the individuals specified, usually coming from the |
ladder |
Vector containing the expected weights of the dna fragments of the ladder in use. |
channel.ladder |
A scalar value indicating in which channel or color the ladder was read |
method |
An argument indicating one of the methods available; "iter" makes an iterative procedure to find the ladder, "iter2" the same but backwards, "cor" makes all possible combination of peaks and searches exhaustive correlations to find the right peaks corresponsding to the expected DNA weights, or "ci" constructing confidence intervals to look for peaks meeting the conditions specified in the previous arguments. The default allows the program to pick among "iter" and "iter2". Older methods have been depreciated. |
ladd.init.thresh |
A value of intensity to detect peaks in the internal use of the |
env |
this is used to detect the environment of the user and load the result in the same environment. |
prog |
A TRUE/FALSE value indicating if a progress bar should be drawn while processing the samples in order to assess the time it takes to find the ladder. The dafault value is TRUE but usually this makes the process slower. Please feel free to set it equal FALSE if the number of samples is quite large and speed is a concern. |
draw |
A TRUE/FALSE value indicating if a plot showing the peaks matched with your ladder should be drawn. The dafault value is FALSE to avoid extra delay. Please feel free to set it equal TRUE if you prefer to assess the sizing process. |
attempt |
A scalar value indicating how many attempts should be made to find the real ladder peaks when using the "iter" and "iter2" methods. By default is 7 attempts, which means that will try to build the model assuming that the first peak found in the ladder is the corresponding first peak of the expected ladder, then moves to the 2nd peak until the 7th and the seven models are compared picking the most likely model based on the R2 value for each of the models. |
Method "ci" has been depreciated, currently the method "iter2" is the default and uses the ladder provided and observed peaks to match them using an iterative procedure based on least squares.
If parameters are indicated correctly the function returns:
the index positions for the intensities
the intensities for the fragments found
the putative weights in base pairs based on the ladder provided
We have spent valuable time developing this package, please cite it in your publication:
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) my.plants <- my.plants[1:2] my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) ladder.info.attach(stored=my.plants, ladder=my.ladder)
data(my.plants) my.plants <- my.plants[1:2] my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) ladder.info.attach(stored=my.plants, ladder=my.ladder)
This function is a wrapper of lapply function that allows the drawing of a progress bar to assess the speed of the process.
lapply_pb (X, FUN, ...)
lapply_pb (X, FUN, ...)
X |
a vector (atomic or list) or an expression object. See see |
FUN |
the function to be applied to each element of X: see |
... |
passes another arguments to the typical lapply function. |
No major details
Performs lapply drawing a progress bar
the same result than using lapply
See see lapply
l <- sapply(1:200, function(x) list(rnorm(1000))) lapply_pb(l, mean)
l <- sapply(1:200, function(x) list(rnorm(1000))) lapply_pb(l, mean)
This function converts a vector of ssr calls in letter format to joinmap code using information from mother and father provided in first and second row respectively
letter.to.jm(x)
letter.to.jm(x)
x |
A vector of ssr calls in letter format or snp types, mother and father of the population should be in 1st and 2nd position respectively |
If numeric data exists first needs to be converted to letter code in order to use this function.
If arguments are correct the function returns a list containing
A vector with ssr calls in joinmap format
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
xx <- data.frame(cbind(a=rep(150, 96), b=c(rep(100,48), rep(150,48)))) xx1 <- num.to.lett(xx) letter.to.jm(unlist(xx1)) # try using apply to a dataframe
xx <- data.frame(cbind(a=rep(150, 96), b=c(rep(100,48), rep(150,48)))) xx1 <- num.to.lett(xx) letter.to.jm(unlist(xx1)) # try using apply to a dataframe
This dataset are 60 individuals from a progeny coming from the cross of 2 cranberry plants. Six SSR markers were run, 2 in the first channel (blue), 2 in the second channel (green), 2 in the third channel (yellow) and the Roxtrash375 ladder was run in the fourth channel (red).
data("my.plants")
data("my.plants")
The format is: chr "my.plants"
The data is basically the raw FSA files coming from the ABi machine. No more details for this data.
This data was generated by the Cranberry Genomics Lab.
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragma: An R package for fragment analysis. http://horticulture.wisc.edu/cggl/ZalapaLab/People.html. 2015.
data(my.plants) ## look at the list structure str(my.plants)
data(my.plants) ## look at the list structure str(my.plants)
This function converts dataframes with rounded calls (numeric format in 2 cells), to letter format based on the GBS pipeline developed by Elshire et al. (2011) which can be used as intermediate step to transform to joinmap and Onemap formats, it requires mother and father in first and second row
num.to.lett(xx)
num.to.lett(xx)
xx |
matrix with numbers, every 2 columns is a marker and each row is an individual, parents are located in the first 2 rows |
No major details.
If arguments are correct the function returns a list containing
matrix coded in letter format where each column is a marker and each row is an individual
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
xx <- data.frame(cbind(a=rep(150, 96), b=c(rep(100,48), rep(150,48)))) num.to.lett(xx)
xx <- data.frame(cbind(a=rep(150, 96), b=c(rep(100,48), rep(150,48)))) num.to.lett(xx)
This function uses information from the FSA files read from storing.inds
function and creates a plot to assess graphically the peaks of several plants in certain channel in order to score manually or assess the parental fragments in the case of biparentla ppulations. If you desire to create a panel you may want to take a look at overview2
. The function contains several defaults in most of the arguments, please check arguments but in general.
overview(my.inds, channel = 1, n.inds = c(1:length(my.inds)), xlimi=c(min(ladder),max(ladder)), ladder, channel.ladder=dim(my.inds[[1]])[2], ploidy=2, dev=50, method="iter", init.thresh=200, ladd.init.thresh=200, warn=TRUE, my.palette=NULL, env = parent.frame())
overview(my.inds, channel = 1, n.inds = c(1:length(my.inds)), xlimi=c(min(ladder),max(ladder)), ladder, channel.ladder=dim(my.inds[[1]])[2], ploidy=2, dev=50, method="iter", init.thresh=200, ladd.init.thresh=200, warn=TRUE, my.palette=NULL, env = parent.frame())
my.inds |
List with the channels information from the individuals specified, usually coming from the |
channel |
The channel you wish to analyze, usually 1 is blue, 2 is green, 3 is yellow, 4 is red and so on |
n.inds |
Vector specifying the plants to be scored |
xlimi |
A vector containing the base pair interval where the plot should be drawn |
ladder |
A vector containing the expected weights for the ladder peaks that will be found the using the |
channel.ladder |
A scalar value indicating in which channel or color the ladder was read |
ploidy |
A scalar value indicating the ploidy of the organism to be scored |
dev |
A scalar value indicating the number of indexes to be used as peak separation when deciding the ladder peaks, for more details check |
method |
An argument indicating one of the 2 methods available; "cor" makes all possible combination of peaks and searches exhaustive correlations to find the right peaks corresponsding to the expected DNA weights, or "ci" constructing confidence intervals to look for peaks meeting the conditions specified in the previous arguments |
init.thresh |
An initial value of intensity to detect peaks. We recommend not to deal to much with unless you have highly controlled dna concentrations in your experiment |
ladd.init.thresh |
A value of intensity to detect peaks in the internal use of the |
warn |
A TRUE/FALSE value indicating if warnings should be provided when detecting the ladder |
my.palette |
A character vector with the colors to be used when drawing the RFU plots. If NULL it will use the programmed palette. |
env |
this is used to detect the environment of the user and load the result in the same environment. |
No major details.
If rarguments are correct the function returns a list containing
Returns a plot joining the channel for the plants specified for the color desired and the peaks found by the function using the parameters specified
Returns a vector with the names of the plants specified in the function
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) my.plants <- my.plants[1:10] my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) overview(my.inds=my.plants, channel = 1, n.inds = c(1:5), ladder=my.ladder, xlim=c(200,220)) # now use: # locator(type="p", pch=20, col="red")$x # to click over the peaks and get the sizes in base pairs # when you are done make sure you press the "Esc" key, # do not push the stop button, some versions of R usually crash # by stopping instead of pressing 'Esc'.
data(my.plants) my.plants <- my.plants[1:10] my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) overview(my.inds=my.plants, channel = 1, n.inds = c(1:5), ladder=my.ladder, xlim=c(200,220)) # now use: # locator(type="p", pch=20, col="red")$x # to click over the peaks and get the sizes in base pairs # when you are done make sure you press the "Esc" key, # do not push the stop button, some versions of R usually crash # by stopping instead of pressing 'Esc'.
This function uses information from the FSA files read from storing.inds
function and creates an overlapping plot to assess graphically the peaks of several plants in certain channel in order to create a panel for the scoring functions score.markers
. The function contains several defaults in most of the arguments, please check arguments but in general you only need the first 4 arguments to create a panel.
overview2(my.inds, channel = 1, ladder, xlim = NULL, ylim = NULL, n.inds = NULL, channel.ladder = NULL, ploidy = 2, method="iter2", init.thresh=NULL, ladd.init.thresh=200, lwd=.25, warn=TRUE, min.panel=100, suggested=TRUE, env = parent.frame(), my.palette=NULL, verbose=TRUE)
overview2(my.inds, channel = 1, ladder, xlim = NULL, ylim = NULL, n.inds = NULL, channel.ladder = NULL, ploidy = 2, method="iter2", init.thresh=NULL, ladd.init.thresh=200, lwd=.25, warn=TRUE, min.panel=100, suggested=TRUE, env = parent.frame(), my.palette=NULL, verbose=TRUE)
my.inds |
List with the channels information from the individuals specified, usually coming from the |
channel |
The channel/color you wish to analyze, usually 1 is blue, 2 is green, 3 is yellow, 4 is red and so on |
ladder |
A vector containing the expected weights for the ladder peaks that will be found the using the |
xlim |
A vector containing the base pair interval where the plot should be drawn |
ylim |
A vector containing the intensity interval where the plot should be drawn |
n.inds |
Vector specifying the plants to be scored |
channel.ladder |
A scalar value indicating in which channel or color the ladder was read |
ploidy |
A scalar value indicating the ploidy of the organism to be scored |
method |
An argument indicating one of the 2 methods available; "cor" makes all possible combination of peaks and searches exhaustive correlations to find the right peaks corresponsding to the expected DNA weights, or "ci" constructing confidence intervals to look for peaks meeting the conditions specified in the previous arguments |
init.thresh |
An initial value of intensity to detect peaks. We recommend not to deal to much with unless you have highly controlled dna concentrations in your experiment |
ladd.init.thresh |
A value of intensity to detect peaks in the internal use of the |
lwd |
The width of the line |
warn |
A TRUE/FALSE value indicating if warnings should be provided when detecting the ladder |
min.panel |
A scalar value indicating which peak values should be ignored when creating a panel. If 'xlim' values are specified the 'min.panel' value is ignored and instead the panel peaks provided by the program are based in the region where you are zooming in. |
suggested |
a TRUE/FALSE value statement declaring if you want the program to return suggested peaks for your panel. The default is TRUE but can be anoying if the program draws too many peaks. |
env |
this is used to detect the environment of the user and load the result in the same environment. |
my.palette |
A character vector with the colors to be used when drawing the RFU plots. If NULL it will use the programmed palette. |
verbose |
A TRUE/FALSE statement indicating if the function should return informative messages. |
No major details.
If rarguments are correct the function returns a list containing
Returns a plot joining the channel for the plants specified for the color desired and the peaks found by the function using the parameters specified
Returns a vector with the names of the plants specified in the function
We have spent valuable time developing this package, please cite it in your publication:
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) my.plants <- my.plants[1] my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) overview2(my.inds=my.plants, channel = 1, ladder=my.ladder, lwd=1) # now use: # my.panel <- locator(type="p", pch=20, col="red")$x # to click over the peaks and get the sizes in base pairs # when you are done make sure you press the "Esc" key, do not push the stop button ## to look at many channels at the same time you ## can use the par(new=TRUE) and a for loop for(u in 1:4){ overview2(my.inds=my.plants, channel = u, ladder=my.ladder, lwd=1, xlim=c(240,350), ylim=c(0,30000)) par(new=TRUE) }
data(my.plants) my.plants <- my.plants[1] my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) overview2(my.inds=my.plants, channel = 1, ladder=my.ladder, lwd=1) # now use: # my.panel <- locator(type="p", pch=20, col="red")$x # to click over the peaks and get the sizes in base pairs # when you are done make sure you press the "Esc" key, do not push the stop button ## to look at many channels at the same time you ## can use the par(new=TRUE) and a for loop for(u in 1:4){ overview2(my.inds=my.plants, channel = u, ladder=my.ladder, lwd=1, xlim=c(240,350), ylim=c(0,30000)) par(new=TRUE) }
plot
method for class "fsa_stored"
.
## S3 method for class 'fsa_stored' plot(x, lay=c(2,1), channel=NULL, cex.legend=.5, ncol.legend=4,lims=NULL, color=NULL, ...)
## S3 method for class 'fsa_stored' plot(x, lay=c(2,1), channel=NULL, cex.legend=.5, ncol.legend=4,lims=NULL, color=NULL, ...)
x |
an object of class |
lay |
layout for the numberof plots to visualize. |
channel |
if preferred a single channel can be specified. |
cex.legend |
value of the size of the legend. |
ncol.legend |
number of columns to divide the legend. |
lims |
equivalen to ylim argument in plot, xlim still can be used. |
color |
specific colors to use in case the user don't want the default palete. |
... |
Further arguments to be passed |
vector of plot
Giovanny Covarrubias [email protected]
This function takes a matrix of DNA intensities and merge all the channels (columns) to identify overall peaks and then creates a window moving from peak to peak looking for the channel where this peak is real and adjust the intensities in the other channels.
pullup(mati, plotting=FALSE, channel=4)
pullup(mati, plotting=FALSE, channel=4)
mati |
matrix of intensities where each column is a channel/color for a given sample. |
plotting |
a TRUE/FALSE value indicating if the results from adjusting the intensities should be drawn or not. |
channel |
a numeric value indicating which of the channles/color (column) allocates the ladder intensties. |
No major details.
If arguments are correctly specified the function returns:
A new matrix of DNA intensities corrected for overlapping of wavelenth readings in different channels.
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) layout(matrix(1:2,2,1)) # without pull up adjustment plot(my.plants[[1]][,1], type="l", col="blue", xlim=c(2750,2850)) lines(my.plants[[1]][,2], col="green") lines(my.plants[[1]][,3], col="gold") ## adjusted yy <- pullup(my.plants[[1]]) plot(yy[,1], type="l", col="blue", xlim=c(2750,2850)) lines(yy[,2], col="green") lines(yy[,3], col="gold") # general view yy1 <- pullup(my.plants[[1]], plotting=TRUE)
data(my.plants) layout(matrix(1:2,2,1)) # without pull up adjustment plot(my.plants[[1]][,1], type="l", col="blue", xlim=c(2750,2850)) lines(my.plants[[1]][,2], col="green") lines(my.plants[[1]][,3], col="gold") ## adjusted yy <- pullup(my.plants[[1]]) plot(yy[,1], type="l", col="blue", xlim=c(2750,2850)) lines(yy[,2], col="green") lines(yy[,3], col="gold") # general view yy1 <- pullup(my.plants[[1]], plotting=TRUE)
ABIF stands for Applied Biosystem Inc. Format, a binary fromat modeled after TIFF format. Corresponding files usually have an .ab1 or .fsa extension.
read.abif(filename, max.bytes.in.file = file.info(filename)$size, pied.de.pilote = 1.2, verbose = FALSE)
read.abif(filename, max.bytes.in.file = file.info(filename)$size, pied.de.pilote = 1.2, verbose = FALSE)
filename |
The name of the file. |
max.bytes.in.file |
The size in bytes of the file, defaulting to what is returned by file.info |
pied.de.pilote |
Safety factor: the argument readBin is set as pied.de.pilote*max.bytes.in.file. |
verbose |
logical [FALSE]. If TRUE verbose mode is on. |
All data are imported into memory, there is no attempt to read items on the fly.
A list with three components: Header which is a list that contains various low-level information, among which numelements is the number of elements in the directory and dataoffset the offset to find the location of the directory. Directory is a data.frame for the directory of the file with the number of row being the number of elements in the directory and the 7 columns describing various low-level information about the elements.
J.R. Lobry
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Anonymous (2006) Applied Biosystem Genetic Analysis Data File Format. Available at http://www.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf. Last visited on 03-NOV-2008.
The figure in the example section is an attempt to reproduce figure 1A from:
Krawczyk, J., Goesmann, A., Nolte, R., Werber, M., Weisshaar, B. (2009) Trace2PS and FSA2PS: two software toolkits for converting trace and fsa files to PostScript format. Source Code for Biology and Medicine, 4:4.
#No examples provided, please download seqinr package for going deeper in this function.
#No examples provided, please download seqinr package for going deeper in this function.
This function takes a list with the information of positions, heights and weights for an individual and using the panel information finds the real peaks by using the separate function and getting the tallest peaks in the confidence interval constructed for the heights in the inteval of interest.
reals(x, panel=c(100:400), shi=1, ploidy=2, left.cond=c(0.4,3), right.cond=0.2, window=0.5)
reals(x, panel=c(100:400), shi=1, ploidy=2, left.cond=c(0.4,3), right.cond=0.2, window=0.5)
x |
List with 3 elements; the information of positions, heights and weights for an individual in certain channel |
panel |
A vector containing the base pair interval where the peaks should be searched for |
shi |
The number of base pairs to be used for discarding neighboring peaks to the tallest peaks, i.e. if 2 peaks are 0.3 bp together the smalles will be discarded |
ploidy |
A scalar value indicating the ploidy of the organism to be scored |
left.cond |
A percentage value indicating when peaks to the leaft of the tallest peaks should be considered real based on the height, i.e. a very close peak right before the tallest peak if smaller than the tallest (half the size of the tallest one will be real or not) |
right.cond |
A percentage value indicating when peaks to the right of the tallest peaks should be considered real based on the height, i.e. a very close peak right after the tallest peak if smaller than the tallest (half the size of the tallest one will be real or not) |
window |
A value in base pairs indicating how much is the error for detecting a peak in a sample that was provided in the panelas a real peak |
No major details.
If arguments are correct the function returns a list containing
the index positions for the intensities
the intensities for the fragments found
the putative weights in base pairs based on the ladder provided
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) x <- big.peaks.col(my.plants[[1]][,1],100)#for any color #reals(x, panel=c(260,280), shi=1, ploidy=2) #still needs weight information in order to find the reals, #works internally of score.easy function
data(my.plants) x <- big.peaks.col(my.plants[[1]][,1],100)#for any color #reals(x, panel=c(260,280), shi=1, ploidy=2) #still needs weight information in order to find the reals, #works internally of score.easy function
This function takes a vector of intensities and looks for peaks above 8000 RFUs and correct for possible splits at the top of the peaks by inverting the vally between splitted peaks and correcting the peak.
saturate(y)
saturate(y)
y |
a vector containing the DNA intensities for the capillary electrophoresis. |
No major details.
If arguments are correctly specified the function returns:
A new vector of DNA intensities adjusted for saturated peaks over 8000 RFUs.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) y <- my.plants[[1]][,3] layout(matrix(1:2,2,1)) plot(y, type="l", xlim=c(2750,2850)) y2 <- saturate(y=y) plot(y2, type="l", xlim=c(2750,2850))
data(my.plants) y <- my.plants[[1]][,3] layout(matrix(1:2,2,1)) plot(y, type="l", xlim=c(2750,2850)) y2 <- saturate(y=y) plot(y2, type="l", xlim=c(2750,2850))
This function uses information from the fsa files read from storing.inds
function and does the ssr calling in the channel specified and returns the index position, height and base pair position.
score.markers(my.inds, channel = 1, n.inds = NULL, panel=NULL, shift=0.8, ladder, channel.ladder=NULL, ploidy=2, left.cond=c(0.6,3), right.cond=0.35, warn=FALSE, window=0.5, init.thresh=200, ladd.init.thresh=200, method="iter2", env = parent.frame(), my.palette=NULL, plotting=TRUE, electro=FALSE, pref=3)
score.markers(my.inds, channel = 1, n.inds = NULL, panel=NULL, shift=0.8, ladder, channel.ladder=NULL, ploidy=2, left.cond=c(0.6,3), right.cond=0.35, warn=FALSE, window=0.5, init.thresh=200, ladd.init.thresh=200, method="iter2", env = parent.frame(), my.palette=NULL, plotting=TRUE, electro=FALSE, pref=3)
my.inds |
List with the channels information from the individuals specified, usually coming from the |
channel |
The channel you wish to analyze, usually 1 is blue, 2 is green, 3 is yellow, 4 is red and so on |
n.inds |
Vector specifying the plants to be scored |
panel |
A vector containing the base pair interval where the peaks should be searched for |
shift |
The number of base pairs to be used for discarding neighboring peaks to the tallest peaks, i.e. if 2 peaks are 0.3 bp together the smalles will be discarded |
ladder |
A vector containing the expected weights for the ladder peaks that will be found the using the |
channel.ladder |
A scalar value indicating in which channel or color the ladder was read |
ploidy |
A scalar value indicating the ploidy of the organism to be scored to decide the maximum number of peaks the program should look for. TO BE IMPLEMENTED SOON. STILL NOT FUNCTIONAL. |
left.cond |
A percentage value (0-1) indicating when peaks to the left of the tallest peaks should be considered real based on the height, i.e. a value of 0.5 would mean that a close peak (to the left of the tallest peak) will be picked only if such peak is at least 50 percent as tall with respect to the tallest peak. The second argument is the number of base pair indicating when peaks to the left of the tallest peaks should be considered real based on the distance, i.e. a value of 3 would mean that a close peak (to the left of the tallest peak) will be picked only if such peak is at least 3 base pairs far away from the tallest peak |
right.cond |
A percentage value (0-1) indicating when peaks to the right of the tallest peaks should be considered real based on the height, i.e. a value of 0.5 would mean that a close peak (to the right of the tallest peak) will be picked only if such peak is at least 50 percent as tall with respect to the tallest peak. |
warn |
A TRUE/FALSE value indicating if warnings should be provided when detecting the ladder |
window |
A value in base pairs indicating how much is the error for detecting a peak in a sample when providing a panel with expected peaks. |
init.thresh |
An initial value of intensity to detect peaks. We recommend not to deal to much with it unless you have highly controlled dna concentrations in your experiment. |
ladd.init.thresh |
If samples were not sized using the info.ladder.attach function this value will be used to detect ladder peaks. Internally the program will use the |
method |
If samples were not sized using the info.ladder.attach function this method will be used to detect ladder peaks. An argument indicating one of the 3 methods available; "cor" makes all possible combination of peaks and searches exhaustive correlations to find the right peaks corresponsding to the expected DNA weights, or "ci" constructing confidence intervals to look for peaks meeting the conditions specified in the previous arguments, "iter2" an iterative procedure looking for the most likely peaks meeting your ladder expectation. Default is "iter2". |
env |
this is used to detect the environment of the user and load the result in the same environment. Don't mess with it please. |
my.palette |
A character vector with the colors to be used when drawing the RFU plots. If NULL it will use the programmed palette. |
plotting |
a TRUE/FALSE value indicating if the plots should be drawn or not. The default value is TRUE. |
electro |
A TRUE/FALSE value indicating if the electrogram/gel should be drawn or not. The default value is FALSE. |
pref |
A scalar value indicating how many plots should be drawn in the output plotting. The dafault is 3. |
Method "ci" has been depreciated, currently the method "iter2" is the default and uses the ladder provided and observed peaks to match them using an iterative procedure based on least squares.
If arguments are correct the function returns a plot and a list containing
the index positions for the intensities
the intensities for the fragments found
the putative weights in base pairs based on the ladder provided
We have spent valuable time developing this package, please cite it in your publication:
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
## ================================= ## ## ================================= ## ## Fragment analysis requires ## 1) loading your data ## 2) matching your ladder ## 3) define a panel for scoring ## 4) score the samples ## ================================= ## ## ================================= ## ##################### ## 1) Load your data ##################### ### you would use something like: # folder <- "~/myfolder" # my.plants <- storing.inds(folder) ### here we just load our sample data and use the first 2 plants ?my.plants data(my.plants) my.plants <- my.plants[1:2] class(my.plants) <- "fsa_stored" ####################### ## 2) Match your ladder ####################### ### create a vector indicating the sizes of your ladder and do the match my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) ladder.info.attach(stored=my.plants, ladder=my.ladder) ### matching your ladder is a critical step and should only happen once per batch of ### samples read ###****************************************************************************************### ### OPTIONAL: ### If the ladder.info attach function detects some bad samples ### that you can correct them manually using ### the ladder.corrector() function ### For example to correct one sample in the previous data ### ladder.corrector(stored=my.plants, #to.correct="FHN152-CPN01_01A_GH1x35_152-148-209_717-704-793_367-382-381.fsa", #ladder=my.ladder) ###****************************************************************************************### ####################### ## 3) Define a panel ####################### ### In fragment analysis you usually design a panel where you indicate ### which peaks are real. You may use the overview2 function which plots all the ### plants in the channel you want in the base pair range you want overview2(my.inds=my.plants, channel = 2:3, ladder=my.ladder, init.thresh=5000) ### You can click on the peaks you think are real, given that the ones ### suggested by the program may not be correct. This can be done by using the ### 'locator' function and press 'Esc' when you're done, i.e.: # my.panel <- locator(type="p", pch=20, col="red")$x ### That way you can click over the peaks and get the sizes ### in base pairs stored in a vector named my.panel ### Just for demonstration purposes I will use the suggested peaks by ### the program using overview2, which will return a vector with ### expected DNA sizes to be used in the next step for scoring ### we'll do it in the 160-190 bp region my.panel <- overview2(my.inds=my.plants, channel = 3, ladder=my.ladder, init.thresh=7000, xlim=c(160,190)); my.panel ########################## ## 4) Score the samples ########################## ### When a panel is created is time to score the samples by providing the initial ### data we read, the ladder vector, the panel vector, and our specifications ### of channel to score (other arguments are available) ### Here we will score our samples for channel 3 with our panel created previously res <- score.markers(my.inds=my.plants, channel = 3, panel=my.panel$channel_3, ladder=my.ladder, electro=FALSE) ### Check the plots and make sure they were scored correctly. In case some samples ### are wrong you might want to use the locator function again and figure out ### the size of your peaks. To extract your peaks in a data.frame do the following: final.results <- get.scores(res) final.results
## ================================= ## ## ================================= ## ## Fragment analysis requires ## 1) loading your data ## 2) matching your ladder ## 3) define a panel for scoring ## 4) score the samples ## ================================= ## ## ================================= ## ##################### ## 1) Load your data ##################### ### you would use something like: # folder <- "~/myfolder" # my.plants <- storing.inds(folder) ### here we just load our sample data and use the first 2 plants ?my.plants data(my.plants) my.plants <- my.plants[1:2] class(my.plants) <- "fsa_stored" ####################### ## 2) Match your ladder ####################### ### create a vector indicating the sizes of your ladder and do the match my.ladder <- c(50, 75, 100, 125, 129, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375) ladder.info.attach(stored=my.plants, ladder=my.ladder) ### matching your ladder is a critical step and should only happen once per batch of ### samples read ###****************************************************************************************### ### OPTIONAL: ### If the ladder.info attach function detects some bad samples ### that you can correct them manually using ### the ladder.corrector() function ### For example to correct one sample in the previous data ### ladder.corrector(stored=my.plants, #to.correct="FHN152-CPN01_01A_GH1x35_152-148-209_717-704-793_367-382-381.fsa", #ladder=my.ladder) ###****************************************************************************************### ####################### ## 3) Define a panel ####################### ### In fragment analysis you usually design a panel where you indicate ### which peaks are real. You may use the overview2 function which plots all the ### plants in the channel you want in the base pair range you want overview2(my.inds=my.plants, channel = 2:3, ladder=my.ladder, init.thresh=5000) ### You can click on the peaks you think are real, given that the ones ### suggested by the program may not be correct. This can be done by using the ### 'locator' function and press 'Esc' when you're done, i.e.: # my.panel <- locator(type="p", pch=20, col="red")$x ### That way you can click over the peaks and get the sizes ### in base pairs stored in a vector named my.panel ### Just for demonstration purposes I will use the suggested peaks by ### the program using overview2, which will return a vector with ### expected DNA sizes to be used in the next step for scoring ### we'll do it in the 160-190 bp region my.panel <- overview2(my.inds=my.plants, channel = 3, ladder=my.ladder, init.thresh=7000, xlim=c(160,190)); my.panel ########################## ## 4) Score the samples ########################## ### When a panel is created is time to score the samples by providing the initial ### data we read, the ladder vector, the panel vector, and our specifications ### of channel to score (other arguments are available) ### Here we will score our samples for channel 3 with our panel created previously res <- score.markers(my.inds=my.plants, channel = 3, panel=my.panel$channel_3, ladder=my.ladder, electro=FALSE) ### Check the plots and make sure they were scored correctly. In case some samples ### are wrong you might want to use the locator function again and figure out ### the size of your peaks. To extract your peaks in a data.frame do the following: final.results <- get.scores(res) final.results
This function takes a list with positions, heights and weights called "g" and using a shift in base pairs determines when 2 neighboring peaks should be considered only one by getting the tallest peak. For example two peaks found at 173 and 173.5 base pairs are unlikely to be 2 different peaks, therefore only the tallest peak will pe chosen.
separate(g, shift=1, type="bp")
separate(g, shift=1, type="bp")
g |
List with 3 elements; the information of positions, heights and weights for an individual in certain channel |
shift |
The number of base pairs to be used for discarding neighboring peaks to the tallest peaks, i.e. if 2 peaks are 0.3 bp together the smalles will be discarded |
type |
A word indicating if the shift to be used should be used in base pairs or in index. The use is "bp" or "ind" |
No major details.
If arguments are correct the function returns a new list containing
the index positions for the intensities
the intensities for the fragments found
the putative weights in base pairs based on the ladder provided
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) x <- big.peaks.col(my.plants[[1]][,1],100)#for any color #separate(x, shift=1, type="bp") #still needs weight information
data(my.plants) x <- big.peaks.col(my.plants[[1]][,1],100)#for any color #separate(x, shift=1, type="bp") #still needs weight information
This function reads the FSA files using a function named 'read.abif' from another R package called seqinr. This will extract the information of the DNA intensities of the capillary electrophoresis and will store it in a data structure know in R as a list. The usage of the function and the arguments it takes are as follows:
storing.inds(folder, channels=NULL, fourier=TRUE, saturated=TRUE, lets.pullup=TRUE, plotting=FALSE, rawPlot=FALSE, llength=3000, ulength=80000)
storing.inds(folder, channels=NULL, fourier=TRUE, saturated=TRUE, lets.pullup=TRUE, plotting=FALSE, rawPlot=FALSE, llength=3000, ulength=80000)
folder |
A path/directory where the FSA files are located. We recommend to use the Seession tab -> Set working directory and provide that folder as an argument |
channels |
A scalar value indicating how many channels/colors should be found in the FSA files, usually people using the rox375 ladder in the red channel have 4 channels, people using the LIZ ladder have 5 channels. The default is the last channel. |
fourier |
A FALSE/TRUE value indicating if data should be smooth aplying a Fourier transormation using 40 percent of the lowest frequencies. The dafault is TRUE |
saturated |
A FALSE/TRUE value indicating if data should be checked and treated for saturated peaks above 8000 RFU which usually split at the top in 2 different peaks. The dafault is TRUE |
lets.pullup |
A FALSE/TRUE value indicating if data should be treated for noise from channel to channel known as pull up or pull down peaks since wavelengths where the dyes are read usually overlap (blue->green->yellow->red->orange. The dafault is TRUE |
plotting |
A FALSE/TRUE value indicating if results after data cleaning steps should be plotted to asses graphically how data was handled. The dafault is FALSE |
rawPlot |
A FALSE/TRUE value indicating if a plot drawing all vectors read should be plotted. The dafault is FALSE since this consumes a lot of memory. |
llength |
A numeric value indicating how small can be the number of indexes in each channel. Default is 3000 indexes for small runs. |
ulength |
A numeric value indicating how big can be the number of indexes in each channel. Default is 80000 indexes for long runs. |
No major details.
If arguments are correct the function returns a list containing
A list where each element is a data frame containing the "n" channels of an individual
We have spent valuable time developing this package, please cite it in your publication:
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) ### the correct way to do it for a population of inds with ### 4 colors + ladder= 5 would be: # my.plants <- storing.inds(folder)
data(my.plants) ### the correct way to do it for a population of inds with ### 4 colors + ladder= 5 would be: # my.plants <- storing.inds(folder)
This function takes data contained in a list with 2 elements, the first containing the position in base pairs and the heights of the positions and given a panel finds the best minimum threshold by creating a confidence interval in order to get only the real peaks. Implemented internally of auto,score and score.easy function.
threshs(my.plant, min.thre=200, panel, ci=1.9)
threshs(my.plant, min.thre=200, panel, ci=1.9)
my.plant |
List with 2 elements; the position in base pairs and the heights associated with them |
min.thre |
Starting threshold to find the peaks and refine the search |
panel |
Two scalar values indicatin the interval where the peaks should be found |
ci |
Scalar value indicating the number of standard errors to be used when creating the confidence interval for ssr calling |
No major details.
If arguments are correctly specified the function returns a list containing
A scalar value indicating the new threshold to be used when calling the real peaks
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) # implemented internally of auto,score and score.easy function # threshs(my.plant, min.thre=200, panel=(260,290), ci=1.9)
data(my.plants) # implemented internally of auto,score and score.easy function # threshs(my.plant, min.thre=200, panel=(260,290), ci=1.9)
This function takes a vector and applies a fourier transformation in order to smooth the peaks usinf the fft function in the base package. Use only top 40 percent of the lowest frequencies.
transfft(sn, top=0.3)
transfft(sn, top=0.3)
sn |
a numeric vector containing the DNA intensities for the capillary electrophoresis. |
top |
percent of lowest frequencies that should be used for the fourier transformation. |
No major details.
If arguments are correctly specified the function returns:
A new vector of DNA intensities smoothed to avoid extra noisy peaks.
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
data(my.plants) g1 <- transfft(my.plants[[1]][,4], top=0.8) g2 <- transfft(my.plants[[1]][,4], top=0.4) g3 <- transfft(my.plants[[1]][,4], top=0.1) layout(matrix(1:3,3,1)) plot(g1, type="l") lines(g2, col="red") lines(g3, col="blue") par1 <- c("top=0.8", "top=0.4", "top=0.1") par2 <- c("black", "red", "blue") par3 <- c(1,1,1) legend("topright", legend=par1, col=par2, bty = "n", lty=par3, lwd=par3, cex=0.75)
data(my.plants) g1 <- transfft(my.plants[[1]][,4], top=0.8) g2 <- transfft(my.plants[[1]][,4], top=0.4) g3 <- transfft(my.plants[[1]][,4], top=0.1) layout(matrix(1:3,3,1)) plot(g1, type="l") lines(g2, col="red") lines(g3, col="blue") par1 <- c("top=0.8", "top=0.4", "top=0.1") par2 <- c("black", "red", "blue") par3 <- c(1,1,1) legend("topright", legend=par1, col=par2, bty = "n", lty=par3, lwd=par3, cex=0.75)
This function takes a color and returns the same with a certain alpha grade transparency.
transp(col, alpha=0.5)
transp(col, alpha=0.5)
col |
Color to be used for transparency |
alpha |
Grade of transparency desired |
No major details.
If arguments are correctly specified the function returns:
A new color with certain grade of transparency
Covarrubias-Pazaran G, Diaz-Garcia L, Schlautman B, Salazar W, Zalapa J. Fragman: An R package for fragment analysis. 2016. BMC Genetics 17(62):1-8.
Robert J. Henry. 2013. Molecular Markers in Plants. Wiley-Blackwell. ISBN 978-0-470-95951-0.
Ben Hui Liu. 1998. Statistical Genomics. CRC Press LLC. ISBN 0-8493-3166-8.
transp("red", alpha=0.5)
transp("red", alpha=0.5)