Title: | Manipulate and Visualize VCF Data |
---|---|
Description: | Facilitates easy manipulation of variant call format (VCF) data. Functions are provided to rapidly read from and write to VCF files. Once VCF data is read into R a parser function extracts matrices of data. This information can then be used for quality control or other purposes. Additional functions provide visualization of genomic data. Once processing is complete data may be written to a VCF file (*.vcf.gz). It also may be converted into other popular R objects (e.g., genlight, DNAbin). VcfR provides a link between VCF data and familiar R software. |
Authors: | Brian J. Knaus [cre, aut] , Niklaus J. Grunwald [aut] , Eric C. Anderson [ctb] , David J. Winter [ctb], Zhian N. Kamvar [ctb] , Javier F. Tabima [ctb] |
Maintainer: | Brian J. Knaus <[email protected]> |
License: | GPL-3 |
Version: | 1.15.0 |
Built: | 2024-11-02 06:39:48 UTC |
Source: | CRAN |
Create allele frequencies from matrices of allelic depths (AD)
AD_frequency(ad, delim = ",", allele = 1L, sum_type = 0L, decreasing = 1L)
AD_frequency(ad, delim = ",", allele = 1L, sum_type = 0L, decreasing = 1L)
ad |
a matrix of allele depths (e.g., "7,2") |
delim |
character that delimits values |
allele |
which (1-based) allele to report frequency for |
sum_type |
type of sum to calculate, see details |
decreasing |
should the values be sorted decreasing (1) or increasing (0)? |
Files containing VCF data frequently include data on allelic depth (e.g., AD). This is the number of times each allele has been sequenced. Our naive assumption for diploids is that these alleles should be observed at a frequency of 1 or zero for homozygous positions and near 0.5 for heterozygous positions. Deviations from this expectation may indicate allelic imbalance or ploidy differences. This function is intended to facilitate the exploration of allele frequencies for all positions in a sample.
The alleles are sorted by their frequency within the function. The user can then specify is the would like to calculate the frequency of the most frequent allele (allele = 1), the second most frequent allele (allele = 2) and so one. If an allele is requested that does not exist it should result in NA for that position and sample.
There are two methods to calculate a sum for the denominator of the frequency. When sum_type = 0 the alleles are sorted decendingly and the first two allele counts are used for the sum. This may be useful when a state of diploidy may be known to be appropriate and other alleles may be interpreted as erroneous. When sum_type = 1 a sum is taken over all the observed alleles for a variant.
A numeric matrix of frequencies
set.seed(999) x1 <- round(rnorm(n=9, mean=10, sd=2)) x2 <- round(rnorm(n=9, mean=20, sd=2)) ad <- matrix(paste(x1, x2, sep=","), nrow=3, ncol=3) colnames(ad) <- paste('Sample', 1:3, sep="_") rownames(ad) <- paste('Variant', 1:3, sep="_") ad[1,1] <- "9,23,12" AD_frequency(ad=ad)
set.seed(999) x1 <- round(rnorm(n=9, mean=10, sd=2)) x2 <- round(rnorm(n=9, mean=20, sd=2)) ad <- matrix(paste(x1, x2, sep=","), nrow=3, ncol=3) colnames(ad) <- paste('Sample', 1:3, sep="_") rownames(ad) <- paste('Variant', 1:3, sep="_") ad[1,1] <- "9,23,12" AD_frequency(ad=ad)
Populate the ID column of VCF data by concatenating the chromosome, position and optionally an index.
addID(x, sep = "_")
addID(x, sep = "_")
x |
an object of class vcfR or chromR. |
sep |
a character string to separate the terms. |
Variant callers typically leave the ID column empty in VCF data. However, the VCF data may potentially include variants with IDs as well as variants without. This function populates the missing elements by concatenating the chromosome and position. When this concatenation results in non-unique names, an index is added to force uniqueness.
data(vcfR_test) head(vcfR_test) vcfR_test <- addID(vcfR_test) head(vcfR_test)
data(vcfR_test) head(vcfR_test) vcfR_test <- addID(vcfR_test) head(vcfR_test)
The INFO and FORMAT columns contain information in key-value pairs. If for some reason a key is not unique it will create issues in retrieving this information. This function checks the keys defined in the meta section to make sure they are unique. Note that it does not actually check the INFO and FORMAT columns, just their definitions in the meta section. This is because each variant can have different information in their INFO and META cells. Checking these on large files will therefore come with a performance cost.
check_keys(x)
check_keys(x)
x |
an oblect of class vcfR |
queryMETA()
data(vcfR_test) check_keys(vcfR_test) queryMETA(vcfR_test) queryMETA(vcfR_test, element = 'DP') # Note that DP occurs as unique in INFO and FORMAT but they may be different.
data(vcfR_test) check_keys(vcfR_test) queryMETA(vcfR_test) queryMETA(vcfR_test, element = 'DP') # Note that DP occurs as unique in INFO and FORMAT but they may be different.
plot chromR objects
chromo( chrom, boxp = TRUE, dp.alpha = TRUE, chrom.s = 1, chrom.e = NULL, drlist1 = NULL, drlist2 = NULL, drlist3 = NULL, ... ) chromoqc(chrom, boxp = TRUE, dp.alpha = 255, ...)
chromo( chrom, boxp = TRUE, dp.alpha = TRUE, chrom.s = 1, chrom.e = NULL, drlist1 = NULL, drlist2 = NULL, drlist3 = NULL, ... ) chromoqc(chrom, boxp = TRUE, dp.alpha = 255, ...)
chrom |
an object of class chrom. |
boxp |
logical specifying whether marginal boxplots should be plotted [T/F]. |
dp.alpha |
degree of transparency applied to points in dot plots [0-255]. |
chrom.s |
start position for the chromosome. (Deprecated. use xlim) |
chrom.e |
end position for the chromosome. (Deprecated. use xlim) |
drlist1 |
a named list containing elements to create a drplot |
drlist2 |
a named list containing elements to create a drplot |
drlist3 |
a named list containing elements to create a drplot |
... |
arguments to be passed to other methods. |
Each drlist parameter is a list containing elements necessarry to plot a dr.plot. This list should contain up to seven elements named title, dmat, rlist, dcol, rcol, rbcol and bwcol. These elements are documented in the dr.plot page where they are presented as individual parameters. The one exception is bwcol which is a vector of colors for the marginal box and whisker plot. This is provided so that different colors may be used in the dot plot and the box and whisker plot. For example, transparency may be desired in the dot plot but not the box and whisker plot. When one (or more) of these elements is omitted an attempt to use default values is made.
Returns an invisible NULL.
Functions which act on chromR objects
masker( x, min_QUAL = 1, min_DP = 1, max_DP = 10000, min_MQ = 20, max_MQ = 100, preserve = FALSE, ... ) variant.table(x) win.table(x)
masker( x, min_QUAL = 1, min_DP = 1, max_DP = 10000, min_MQ = 20, max_MQ = 100, preserve = FALSE, ... ) variant.table(x) win.table(x)
x |
object of class chromR |
min_QUAL |
minimum variant quality |
min_DP |
minimum cumulative depth |
max_DP |
maximum cumulative depth |
min_MQ |
minimum mapping quality |
max_MQ |
maximum mapping quality |
preserve |
a logical indicating whether or not to preserve the state of
the current mask field. Defaults to |
... |
arguments to be passed to methods |
The function masker creates a logical vector that determines which variants are masked. By masking certain variants, instead of deleting them, it preserves the dimensions of the data structure until a change needs to be committed. Variants can be masked based on the value of the QUAL column of the vcf object. Experience seems to show that this value is either at its maximum (999) or a rather low value. The maximum and minimum sequence depth can also be used (mindp and maxdp). The default is to mask all variants with depths of less than the 0.25 quantile and greater than the 0.75 quantile (these are also known as the lower and upper quartile). The minimum and maximum mapping qualities (minmq, maxmq) can also be used.
This vector is stored in the var.info$mask slot of a chromR object.
The function variant.table creates a data.frame containing information about variants.
The funciton win.table
An example chromR object containing parts of the *Phytophthora infestans* genome.
A chromR object
This data is a subset of the pinfsc50 dataset. It has been subset to positions between 500 and 600 kbp. The coordinate systems of the vcf and gff file have been altered by subtracting 500,000. This results in a 100 kbp section of supercontig_1.50 that has positional data ranging from 1 to 100 kbp.
data(chromR_example)
data(chromR_example)
A class for representing chromosomes (or supercontigs, contigs, scaffolds, etc.).
Defines a class for chromosomal or contig data. This
This object has a number of slots.
name name of the object (character)
len length of the sequence (integer)
window_size window size for windowing analyses (integer)
seq object of class ape::DNAbin
vcf object of class vcfR
ann annotation data in a gff-like data.frame
var.info a data.frame containing information on variants
win.info a data.frame containing information on windows
seq.info a list containing information on the sequence
The seq slot contains an object of class ape::DNAbin.
A DNAbin object is typically either a matrix or list of DNAbin objects.
The matrix form appears to be better behaved than the list form.
Because of this behavior this slot should be the matrix form.
When this slot is not populated it is of class "NULL" instead of "DNAbin".
Note that characters need to be lower case when inserted into an object of class DNAbin.
The function tolower
can facilitate this.
The vcf slot is an object of class vcfR vcfR-class
.
The ann slot is a data.frame containing gff format data. When this slot is not populated it has nrows equal to zero.
The var.info slot contains a data.frame containing information about variants. Every row of this data.frame is a variant. Columns will typically contain the chromosome name, the position of the variant (POS), the mask as well as any other per variant information.
The win.info slot contains a data.frame containing information about windows. For example, window, start, end, length, A, C, G, T, N, other, variants and genic fields are stored here.
The seq.info slot is a list containing two matrices. The first matrix describes rectangles for called nucleotides and the second describes rectangles for 'N' calls. Within each matrix, the first column indicates the start position and the second column indicates the end position of each rectangle.
vcfR-class
, DNAbin
,
VCF specification
gff3 format
Convert chrom objects to vcfR objects.
chromR2vcfR(x, use.mask = FALSE)
chromR2vcfR(x, use.mask = FALSE)
x |
Object of class chrom |
use.mask |
Logical, determine if mask from chrom object should be used to subset vcf data |
The chrom object is subset and recast as a vcfR object. When use.mask is set to TRUE (the default), the object is subset to only the variants (rows) indicated to include by the mask. When use.mask is set to FALSE, all variants (rows) from the chrom object are included in the new vcfR object.
Returns an object of class vcfR.
Convert the information in a vcfR object to a long-format data frame suitable for analysis or use with Hadley Wickham's packages, dplyr, tidyr, and ggplot2. These packages have been optimized for operation on large data frames, and, though they can bog down with very large data sets, they provide a good framework for handling and filtering large variant data sets. For some background on the benefits of such "tidy" data frames, see doi:10.18637/jss.v059.i10.
For some filtering operations, such as those where one wants to filter genotypes
upon GT fields in combination with INFO fields, or more complex
operations in which one wants to filter
loci based upon the number of individuals having greater than a certain quality score,
it will be advantageous to put all the information into a long format data frame
and use dplyr
to perform the operations. Additionally, a long data format is
required for using ggplot2
. These functions convert vcfR objects to long format
data frames.
vcfR2tidy( x, info_only = FALSE, single_frame = FALSE, toss_INFO_column = TRUE, ... ) extract_info_tidy(x, info_fields = NULL, info_types = TRUE, info_sep = ";") extract_gt_tidy( x, format_fields = NULL, format_types = TRUE, dot_is_NA = TRUE, alleles = TRUE, allele.sep = "/", gt_column_prepend = "gt_", verbose = TRUE ) vcf_field_names(x, tag = "INFO")
vcfR2tidy( x, info_only = FALSE, single_frame = FALSE, toss_INFO_column = TRUE, ... ) extract_info_tidy(x, info_fields = NULL, info_types = TRUE, info_sep = ";") extract_gt_tidy( x, format_fields = NULL, format_types = TRUE, dot_is_NA = TRUE, alleles = TRUE, allele.sep = "/", gt_column_prepend = "gt_", verbose = TRUE ) vcf_field_names(x, tag = "INFO")
x |
an object of class vcfR |
info_only |
if TRUE return a list with only a |
single_frame |
return a single tidy data frame in list component
|
toss_INFO_column |
if TRUE (the default) the INFO column will be removed from output as its consituent parts will have been parsed into separate columns. |
... |
more options to pass to |
info_fields |
names of the fields to be extracted from the INFO column into a long format data frame. If this is left as NULL (the default) then the function returns a column for every INFO field listed in the metadata. |
info_types |
named vector of "i" or "n" if you want the fields extracted from the INFO column to be converted to integer or numeric types, respectively.
When set to NULL they will be characters.
The names have to be the exact names of the fields.
For example |
info_sep |
the delimiter used in the data portion of the INFO fields to separate different entries. By default it is ";", but earlier versions of the VCF standard apparently used ":" as a delimiter. |
format_fields |
names of the fields in the FORMAT column to be extracted from each individual in the vcfR object into a long format data frame. If left as NULL, the function will extract all the FORMAT columns that were documented in the meta section of the VCF file. |
format_types |
named vector of "i" or "n" if you want the fields extracted according to the FORMAT column to be converted to integer or numeric types, respectively.
When set to TRUE an attempt to determine their type will be made from the meta information.
When set to NULL they will be characters.
The names have to be the exact names of the format_fields.
Works equivalently to the |
dot_is_NA |
if TRUE then a single "." in a character field will be set to NA. If FALSE no conversion is done. Note that "." in a numeric or integer field (according to format_types) with Number == 1 is always going to be set to NA. |
alleles |
if TRUE (the default) then this will return a column, |
allele.sep |
character which delimits the alleles in a genotype (/ or |) to be passed to
|
gt_column_prepend |
string to prepend to the names of the FORMAT columns |
verbose |
logical to specify if verbose output should be produced in the output so that they do not conflict with any INFO columns in the output. Default is "gt_". Should be a valid R name. (i.e. don't start with a number, have a space in it, etc.) |
tag |
name of the lines in the metadata section of the VCF file to parse out. Default is "INFO". The only other one tested and supported, currently is, "FORMAT". |
The function vcfR2tidy is the main function in this series. It takes a vcfR
object and converts the information to a list of long-format data frames. The user can
specify whether only the INFO or both the INFO and the FORMAT columns should be extracted, and also
which INFO and FORMAT fields to extract. If no specific INFO or FORMAT fields are asked
for, then they will all be returned. If single_frame == FALSE
and
info_only == FALSE
(the default),
the function returns a list with three components: fix
, gt
, and meta
as follows:
fix
A data frame of the fixed information columns and the parsed INFO columns, and
an additional column, ChromKey
—an integer identifier
for each locus, ordered by their appearance in the original data frame—that serves
together with POS as a key back to rows in gt
.
gt
A data frame of the genotype-related fields. Column names are the names of the
FORMAT fields with gt_column_prepend
(by default, "gt_") prepended to them. Additionally
there are columns ChromKey
, and POS
that can be used to associate
each row in gt
with a row in fix
.
meta
The meta-data associated with the columns that were extracted from the INFO and FORMAT
columns in a tbl_df-ed data frame.
This is the default return object because it might be space-inefficient to
return a single tidy data frame if there are many individuals and the CHROM names are
long and/or there are many INFO fields. However, if
single_frame = TRUE
, then the results are returned as a list with component meta
as before, but rather than having fix
and gt
as before, both those data frames
have been joined into component dat
and a ChromKey column is not returned, because
the CHROM column is available.
If info_only == FALSE
, then just the fixed columns and the parsed INFO columns are
returned, and the FORMAT fields are not parsed at all. The return value is a list with
components fix
and meta
. No column ChromKey appears.
The following functions are called by vcfR2tidy but are documented below because they may be useful individually.
The function extract_info_tidy let's you pass in a vector of the INFO fields that
you want extracted to a long format data frame. If you don't tell it which fields to
extract it will extract all the INFO columns detailed in the VCF meta section.
The function returns a tbl_df data frame of the INFO fields along with with an additional
integer column Key
that associates
each row in the output data frame with each row (i.e. each CHROM-POS combination)
in the original vcfR object x
.
The function extract_gt_tidy let's you pass in a vector of the FORMAT fields that
you want extracted to a long format data frame. If you don't tell it which fields to
extract it will extract all the FORMAT columns detailed in the VCF meta section.
The function returns a tbl_df data frame of the FORMAT fields with an additional
integer column Key
that associates
each row in the output data frame with each row (i.e. each CHROM-POS combination),
in the original vcfR object x
, and an additional column Indiv
that gives
the name of the individual.
The function vcf_field_names is a helper function that
parses information from the metadata section of the
VCF file to return a data frame with the metadata information about either the INFO
or FORMAT tags. It
returns a tbl_df
-ed data frame with column names: "Tag", "ID", "Number","Type",
"Description", "Source", and "Version".
An object of class tidy::data_frame or a list where every element is of class tidy::data_frame.
To run all the examples, you can issue this:
example("vcfR2tidy")
Eric C. Anderson <[email protected]>
# load the data data("vcfR_test") vcf <- vcfR_test # extract all the INFO and FORMAT fields into a list of tidy # data frames: fix, gt, and meta. Here we don't coerce columns # to integer or numeric types... Z <- vcfR2tidy(vcf) names(Z) # here is the meta data in a table Z$meta # here is the fixed info Z$fix # here are the GT fields. Note that ChromKey and POS are keys # back to Z$fix Z$gt # Note that if you wanted to tidy this data set even further # you could break up the comma-delimited columns easily # using tidyr::separate # here we put the data into a single, joined data frame (list component # dat in the returned list) and the meta data. Let's just pick out a # few fields: vcfR2tidy(vcf, single_frame = TRUE, info_fields = c("AC", "AN", "MQ"), format_fields = c("GT", "PL")) # note that the "gt_GT_alleles" column is always returned when any # FORMAT fields are extracted. # Here we extract a single frame with all fields but we automatically change # types of the columns according to the entries in the metadata. vcfR2tidy(vcf, single_frame = TRUE, info_types = TRUE, format_types = TRUE) # for comparison, here note that all the INFO and FORMAT fields that were # extracted are left as character ("chr" in the dplyr summary) vcfR2tidy(vcf, single_frame = TRUE) # Below are some examples with the vcfR2tidy "subfunctions" # extract the AC, AN, and MQ fields from the INFO column into # a data frame and convert the AN values integers and the MQ # values into numerics. extract_info_tidy(vcf, info_fields = c("AC", "AN", "MQ"), info_types = c(AN = "i", MQ = "n")) # extract all fields from the INFO column but leave # them as character vectors extract_info_tidy(vcf) # extract all fields from the INFO column and coerce # types according to metadata info extract_info_tidy(vcf, info_types = TRUE) # get the INFO field metadata in a data frame vcf_field_names(vcf, tag = "INFO") # get the FORMAT field metadata in a data frame vcf_field_names(vcf, tag = "FORMAT")
# load the data data("vcfR_test") vcf <- vcfR_test # extract all the INFO and FORMAT fields into a list of tidy # data frames: fix, gt, and meta. Here we don't coerce columns # to integer or numeric types... Z <- vcfR2tidy(vcf) names(Z) # here is the meta data in a table Z$meta # here is the fixed info Z$fix # here are the GT fields. Note that ChromKey and POS are keys # back to Z$fix Z$gt # Note that if you wanted to tidy this data set even further # you could break up the comma-delimited columns easily # using tidyr::separate # here we put the data into a single, joined data frame (list component # dat in the returned list) and the meta data. Let's just pick out a # few fields: vcfR2tidy(vcf, single_frame = TRUE, info_fields = c("AC", "AN", "MQ"), format_fields = c("GT", "PL")) # note that the "gt_GT_alleles" column is always returned when any # FORMAT fields are extracted. # Here we extract a single frame with all fields but we automatically change # types of the columns according to the entries in the metadata. vcfR2tidy(vcf, single_frame = TRUE, info_types = TRUE, format_types = TRUE) # for comparison, here note that all the INFO and FORMAT fields that were # extracted are left as character ("chr" in the dplyr summary) vcfR2tidy(vcf, single_frame = TRUE) # Below are some examples with the vcfR2tidy "subfunctions" # extract the AC, AN, and MQ fields from the INFO column into # a data frame and convert the AN values integers and the MQ # values into numerics. extract_info_tidy(vcf, info_fields = c("AC", "AN", "MQ"), info_types = c(AN = "i", MQ = "n")) # extract all fields from the INFO column but leave # them as character vectors extract_info_tidy(vcf) # extract all fields from the INFO column and coerce # types according to metadata info extract_info_tidy(vcf, info_types = TRUE) # get the INFO field metadata in a data frame vcf_field_names(vcf, tag = "INFO") # get the FORMAT field metadata in a data frame vcf_field_names(vcf, tag = "FORMAT")
Creates and populates an object of class chromR.
create.chromR(vcf, name = "CHROM", seq = NULL, ann = NULL, verbose = TRUE) vcfR2chromR(x, vcf) seq2chromR(x, seq = NULL) ann2chromR(x, gff)
create.chromR(vcf, name = "CHROM", seq = NULL, ann = NULL, verbose = TRUE) vcfR2chromR(x, vcf) seq2chromR(x, seq = NULL) ann2chromR(x, gff)
vcf |
an object of class vcfR |
name |
a name for the chromosome (for plotting purposes) |
seq |
a sequence as a DNAbin object |
ann |
an annotation file (gff-like) |
verbose |
should verbose output be printed to the console? |
x |
an object of class chromR |
gff |
a data.frame containing annotation data in the gff format |
Creates and names a chromR object from a name, a chromosome (an ape::DNAbin object), variant data (a vcfR object) and annotation data (gff-like). The function create.chromR is a wrapper which calls functions to populate the slots of the chromR object.
The function vcf2chromR is called by create.chromR and transfers the data from the slots of a vcfR object to the slots of a chromR object. It also tries to extract the 'DP' and 'MQ' fileds (when present) from the fix slot's INFO column. It is not anticipated that a user would need to use this function directly, but its placed here in case they do.
The function seq2chromR is currently defined as a generic function. This may change in the future. This function takes an object of class DNAbin and assigns it to the 'seq' slot of a chromR object.
The function ann2chromR is called by create.chromR and transfers the information from a gff-like object to the 'ann' slot of a chromR object. It is not anticipated that a user would need to use this function directly, but its placed here in case they do.
chromR-class
,
vcfR-class
,
DNAbin
,
VCF specification
gff3 format
library(vcfR) data(vcfR_example) chrom <- create.chromR('sc50', seq=dna, vcf=vcf, ann=gff) head(chrom) chrom plot(chrom) chrom <- masker(chrom, min_QUAL = 1, min_DP = 300, max_DP = 700, min_MQ = 59, max_MQ = 61) chrom <- proc.chromR(chrom, win.size=1000) plot(chrom) chromoqc(chrom)
library(vcfR) data(vcfR_example) chrom <- create.chromR('sc50', seq=dna, vcf=vcf, ann=gff) head(chrom) chrom plot(chrom) chrom <- masker(chrom, min_QUAL = 1, min_DP = 300, max_DP = 700, min_MQ = 59, max_MQ = 61) chrom <- proc.chromR(chrom, win.size=1000) plot(chrom) chromoqc(chrom)
Plot chromR objects and their components
dr.plot( dmat = NULL, rlst = NULL, chrom.s = 1, chrom.e = NULL, title = NULL, hline = NULL, dcol = NULL, rcol = NULL, rbcol = NULL, ... ) null.plot()
dr.plot( dmat = NULL, rlst = NULL, chrom.s = 1, chrom.e = NULL, title = NULL, hline = NULL, dcol = NULL, rcol = NULL, rbcol = NULL, ... ) null.plot()
dmat |
a numeric matrix for dot plots where the first column is position (POS) and subsequent columns are y-values. |
rlst |
a list containing numeric matrices containing rectangle coordinates. |
chrom.s |
start position for the chromosome |
chrom.e |
end position for the chromosome |
title |
optional string to be used for the plot title. |
hline |
vector of positions to be used for horizontal lines. |
dcol |
vector of colors to be used for dot plots. |
rcol |
vector of colors to be used for rectangle plots. |
rbcol |
vector of colors to be used for rectangle borders. |
... |
arguments to be passed to other methods. |
Plot details The parameter rlist is list of numeric matrices containing rectangle coordinates. The first column of each matrix is the left positions, the second column is the bottom coordinates, the third column is the right coordinates and the fourth column is the top coordinates.
Returns the y-axis minimum and maximum values invisibly.
Extract elements from the 'gt' slot, convert extracted genotypes to their allelic state, extract indels from the data structure or extract elements from the INFO column of the 'fix' slot.
extract.gt( x, element = "GT", mask = FALSE, as.numeric = FALSE, return.alleles = FALSE, IDtoRowNames = TRUE, extract = TRUE, convertNA = TRUE ) extract.haps(x, mask = FALSE, unphased_as_NA = TRUE, verbose = TRUE) is.indel(x) extract.indels(x, return.indels = FALSE) extract.info(x, element, as.numeric = FALSE, mask = FALSE)
extract.gt( x, element = "GT", mask = FALSE, as.numeric = FALSE, return.alleles = FALSE, IDtoRowNames = TRUE, extract = TRUE, convertNA = TRUE ) extract.haps(x, mask = FALSE, unphased_as_NA = TRUE, verbose = TRUE) is.indel(x) extract.indels(x, return.indels = FALSE) extract.info(x, element, as.numeric = FALSE, mask = FALSE)
x |
An object of class chromR or vcfR |
element |
element to extract from vcf genotype data. Common options include "DP", "GT" and "GQ" |
mask |
a logical indicating whether to apply the mask (TRUE) or return all variants (FALSE). Alternatively, a vector of logicals may be provided. |
as.numeric |
logical, should the matrix be converted to numerics |
return.alleles |
logical indicating whether to return the genotypes (0/1) or alleles (A/T) |
IDtoRowNames |
logical specifying whether to use the ID column from the FIX region as rownames |
extract |
logical indicating whether to return the extracted element or the remaining string |
convertNA |
logical indicating whether to convert "." to NA. |
unphased_as_NA |
logical specifying how to handle unphased genotypes |
verbose |
should verbose output be generated |
return.indels |
logical indicating whether to return indels or not |
The function extract.gt isolates elements from the 'gt' portion of vcf data. Fields available for extraction are listed in the FORMAT column of the 'gt' slot. Because different vcf producing software produce different fields the options will vary by software. The mask parameter allows the mask to be implemented when using a chromR object. The 'as.numeric' option will convert the results from a character to a numeric. Note that if the data is not actually numeric, it will result in a numeric result which may not be interpretable. The 'return.alleles' option allows the default behavior of numerically encoded genotypes (e.g., 0/1) to be converted to their nucleic acid representation (e.g., A/T). Note that this is not used for a regular expression as similar parameters are used in other functions. Extract allows the user to extract just the specified element (TRUE) or every element except the one specified.
Note that when 'as.numeric' is set to 'TRUE' but the data are not actually numeric, unexpected results will likely occur. For example, the genotype field will typically be populated with values such as "0/1" or "1|0". Although these may appear numeric, they contain a delimiter (the forward slash or the pipe) that is non-numeric. This means that there is no straight forward conversion to a numeric and unexpected values should be expected.
The function extract.haps uses extract.gt to isolate genotypes. It then uses the information in the REF and ALT columns as well as an allele delimiter (gt_split) to split genotypes into their allelic state. Ploidy is determined by the first non-NA genotype in the first sample.
The VCF specification allows for genotypes to be delimited with a '|' when they are phased and a '/' when unphased. This becomes important when dividing a genotype into two haplotypes. When the alleels are phased this is straight forward. When the alleles are unphased it presents a decision. The default is to handle unphased data by converting them to NAs. When unphased_as_NA is set to TRUE the alleles will be returned in the order they appear in the genotype. This does not assign each allele to it's correct chromosome. It becomes the user's responsibility to make informed decisions at this point.
The function is.indel returns a logical vector indicating which variants are indels (variants where an allele is greater than one character).
The function extract.indels is used to remove indels from SNPs. The function queries the 'REF' and 'ALT' columns of the 'fix' slot to see if any alleles are greater than one character in length. When the parameter return_indels is FALSE only SNPs will be returned. When the parameter return_indels is TRUE only indels will be returned.
The function extract.info is used to isolate elements from the INFO column of vcf data.
is.polymorphic
data(vcfR_test) gt <- extract.gt(vcfR_test) gt <- extract.gt(vcfR_test, return.alleles = TRUE) data(vcfR_test) is.indel(vcfR_test) data(vcfR_test) getFIX(vcfR_test) vcf <- extract.indels(vcfR_test) getFIX(vcf) vcf@fix[nrow(vcf@fix),'ALT'] <- ".,A" vcf <- extract.indels(vcf) getFIX(vcf) data(vcfR_test) vcfR_test@fix[1,'ALT'] <- "<NON_REF>" vcf <- extract.indels(vcfR_test) getFIX(vcf) data(vcfR_test) extract.haps(vcfR_test, unphased_as_NA = FALSE) extract.haps(vcfR_test)
data(vcfR_test) gt <- extract.gt(vcfR_test) gt <- extract.gt(vcfR_test, return.alleles = TRUE) data(vcfR_test) is.indel(vcfR_test) data(vcfR_test) getFIX(vcfR_test) vcf <- extract.indels(vcfR_test) getFIX(vcf) vcf@fix[nrow(vcf@fix),'ALT'] <- ".,A" vcf <- extract.indels(vcf) getFIX(vcf) data(vcfR_test) vcfR_test@fix[1,'ALT'] <- "<NON_REF>" vcf <- extract.indels(vcfR_test) getFIX(vcf) data(vcfR_test) extract.haps(vcfR_test, unphased_as_NA = FALSE) extract.haps(vcfR_test)
Convert vcfR objects to objects supported by other R packages
vcfR2genind(x, sep = "[|/]", return.alleles = FALSE, ...) vcfR2loci(x, return.alleles = FALSE) vcfR2genlight(x, n.cores = 1)
vcfR2genind(x, sep = "[|/]", return.alleles = FALSE, ...) vcfR2loci(x, return.alleles = FALSE) vcfR2genlight(x, n.cores = 1)
x |
an object of class chromR or vcfR |
sep |
character (to be used in a regular expression) to delimit the alleles of genotypes |
return.alleles |
should the VCF encoding of the alleles be returned (FALSE) or the actual alleles (TRUE). |
... |
pass other parameters to adegenet::df2genlight |
n.cores |
integer specifying the number of cores to use. |
After processing vcf data in vcfR, one will likely proceed to an analysis step. Within R, three obvious choices are: pegas, adegenet and poppr. The package pegas uses objects of type loci. The function vcfR2loci calls extract.gt to create a matrix of genotypes which is then converted into an object of type loci.
The packages adegenet and poppr use the genind object. The function vcfR2genind uses extract.gt to create a matrix of genotypes and uses the adegenet function df2genind to create a genind object. The package poppr additionally uses objects of class genclone which can be created from genind objects using poppr::as.genclone. A genind object can be converted to a genclone object with the function poppr::as.genclone.
The function vcfR2genlight calls the 'new' method for the genlight object.
This method implements multi-threading through calls to the function parallel::mclapply
.
Because 'forks' do not exist in the windows environment, this will only work for windows users when n.cores=1.
In the Unix environment, users may increase this number to allow the use of multiple threads (i.e., cores).
The parameter ... is used to pass parameters to other functions.
In vcfR2genind
it is used to pass parameters to adegenet::df2genind
.
For example, setting check.ploidy=FALSE
may improve the performance of adegenet::df2genind
, as long as you know the ploidy.
See ?adegenet::df2genind
to see these options.
If you wish to use vcfR2genind()
, it is strongly recommended to use it with the option return.alleles = TRUE
.
The reason for this is because the poppr package accomodates mixed-ploidy data by interpreting "0" alleles in genind objects to be NULL alleles in both poppr::poppr.amova()
and poppr::locus_table()
.
extract.gt
,
alleles2consensus
,
adegenet::df2genind
,
adegenet::genind
,
pegas,
adegenet,
and
poppr.
To convert to objects of class DNAbin see vcfR2DNAbin
.
adegenet_installed <- require("adegenet") if (adegenet_installed) { data(vcfR_test) # convert to genlight (preferred method with bi-allelic SNPs) gl <- vcfR2genlight(vcfR_test) # convert to genind, keeping information about allelic state # (slightly slower, but preferred method for use with the "poppr" package) gid <- vcfR2genind(vcfR_test, return.alleles = TRUE) # convert to genind, returning allelic states as 0, 1, 2, etc. # (not preferred, but slightly faster) gid2 <- vcfR2genind(vcfR_test, return.alleles = FALSE) }
adegenet_installed <- require("adegenet") if (adegenet_installed) { data(vcfR_test) # convert to genlight (preferred method with bi-allelic SNPs) gl <- vcfR2genlight(vcfR_test) # convert to genind, keeping information about allelic state # (slightly slower, but preferred method for use with the "poppr" package) gid <- vcfR2genind(vcfR_test, return.alleles = TRUE) # convert to genind, returning allelic states as 0, 1, 2, etc. # (not preferred, but slightly faster) gid2 <- vcfR2genind(vcfR_test, return.alleles = FALSE) }
Find density peaks in frequency data.
freq_peak(myMat, pos, winsize = 10000L, bin_width = 0.02, lhs = TRUE)
freq_peak(myMat, pos, winsize = 10000L, bin_width = 0.02, lhs = TRUE)
myMat |
a matrix of frequencies [0-1]. |
pos |
a numeric vector describing the position of variants in myMat. |
winsize |
sliding window size. |
bin_width |
Width of bins to summarize ferequencies in (0-1]. |
lhs |
logical specifying whether the search for the bin of greatest density should favor values from the left hand side. |
Noisy data, such as genomic data, lack a clear consensus. Summaries may be made in an attempt to 'clean it up.' Common summaries, such as the mean, rely on an assumption of normalicy. An assumption that frequently can be violated. This leaves a conundrum as to how to effectively summarize these data.
Here we implement an attempt to summarize noisy data through binning the data and selecting the bin containing the greatest density of data. The data are first divided into parameter sized windows. Next the data are categorized by parameterizable bin widths. Finally, the bin with the greatest density, the greatest count of data, is used as a summary. Because this method is based on binning the data it does not rely on a distributional assumption.
The parameter lhs
specifyies whether the search for the bin of greatest density should be performed from the left hand side.
The default value of TRUE starts at the left hand side, or zero, and selects a new bin as having the greatest density only if a new bin has a greater density.
If the new bin has an equal density then no update is made.
This causees the analysis to select lower frequencies.
When this parameter is set to FALSE ties result in an update of the bin of greatest density.
This causes the analysis to select higher frequencies.
It is recommended that when testing the most abundant allele (typically [0.5-1]) to use the default of TRUE so that a low value is preferred.
Similarly, when testing the less abundant alleles it is recommended to set this value at FALSE to preferentially select high values.
A freq_peak object (a list) containing:
The window size
The binwidth used for peak binning
a matrix containing window coordinates
a matrix containing peak locations
a matrix containing the counts of variants for each sample in each window
The window matrix contains start and end coordinates for each window, the rows of the original matrix that demarcate each window and the position of the variants that begin and end each window.
The matrix of peak locations contains the midpoint for the bin of greatest density for each sample and each window. Alternatively, if 'count = TRUE' the number of non-missing values in each window is reported. The number of non-mising values in each window may be used to censor windows containing low quantities of data.
peak_to_ploid, freq_peak_plot
data(vcfR_example) gt <- extract.gt(vcf) hets <- is_het(gt) # Censor non-heterozygous positions. is.na(vcf@gt[,-1][!hets]) <- TRUE # Extract allele depths. ad <- extract.gt(vcf, element = "AD") ad1 <- masplit(ad, record = 1) ad2 <- masplit(ad, record = 2) freq1 <- ad1/(ad1+ad2) freq2 <- ad2/(ad1+ad2) myPeaks1 <- freq_peak(freq1, getPOS(vcf)) is.na(myPeaks1$peaks[myPeaks1$counts < 20]) <- TRUE myPeaks2 <- freq_peak(freq2, getPOS(vcf), lhs = FALSE) is.na(myPeaks2$peaks[myPeaks2$counts < 20]) <- TRUE myPeaks1 # Visualize mySample <- "P17777us22" myWin <- 2 hist(freq1[myPeaks1$wins[myWin,'START_row']:myPeaks1$wins[myWin,'END_row'], mySample], breaks=seq(0,1,by=0.02), col="#A6CEE3", main="", xlab="", xaxt="n") hist(freq2[myPeaks2$wins[myWin,'START_row']:myPeaks2$wins[myWin,'END_row'], mySample], breaks=seq(0,1,by=0.02), col="#1F78B4", main="", xlab="", xaxt="n", add = TRUE) axis(side=1, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,'1/4','1/3','1/2','2/3','3/4',1), las=3) abline(v=myPeaks1$peaks[myWin,mySample], col=2, lwd=2) abline(v=myPeaks2$peaks[myWin,mySample], col=2, lwd=2) # Visualize #2 mySample <- "P17777us22" plot(getPOS(vcf), freq1[,mySample], ylim=c(0,1), type="n", yaxt='n', main = mySample, xlab = "POS", ylab = "Allele balance") axis(side=2, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,'1/4','1/3','1/2','2/3','3/4',1), las=1) abline(h=c(0.25,0.333,0.5,0.666,0.75), col=8) points(getPOS(vcf), freq1[,mySample], pch = 20, col= "#A6CEE3") points(getPOS(vcf), freq2[,mySample], pch = 20, col= "#1F78B4") segments(x0=myPeaks1$wins[,'START_pos'], y0=myPeaks1$peaks[,mySample], x1=myPeaks1$wins[,'END_pos'], lwd=3) segments(x0=myPeaks1$wins[,'START_pos'], y0=myPeaks2$peaks[,mySample], x1=myPeaks1$wins[,'END_pos'], lwd=3)
data(vcfR_example) gt <- extract.gt(vcf) hets <- is_het(gt) # Censor non-heterozygous positions. is.na(vcf@gt[,-1][!hets]) <- TRUE # Extract allele depths. ad <- extract.gt(vcf, element = "AD") ad1 <- masplit(ad, record = 1) ad2 <- masplit(ad, record = 2) freq1 <- ad1/(ad1+ad2) freq2 <- ad2/(ad1+ad2) myPeaks1 <- freq_peak(freq1, getPOS(vcf)) is.na(myPeaks1$peaks[myPeaks1$counts < 20]) <- TRUE myPeaks2 <- freq_peak(freq2, getPOS(vcf), lhs = FALSE) is.na(myPeaks2$peaks[myPeaks2$counts < 20]) <- TRUE myPeaks1 # Visualize mySample <- "P17777us22" myWin <- 2 hist(freq1[myPeaks1$wins[myWin,'START_row']:myPeaks1$wins[myWin,'END_row'], mySample], breaks=seq(0,1,by=0.02), col="#A6CEE3", main="", xlab="", xaxt="n") hist(freq2[myPeaks2$wins[myWin,'START_row']:myPeaks2$wins[myWin,'END_row'], mySample], breaks=seq(0,1,by=0.02), col="#1F78B4", main="", xlab="", xaxt="n", add = TRUE) axis(side=1, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,'1/4','1/3','1/2','2/3','3/4',1), las=3) abline(v=myPeaks1$peaks[myWin,mySample], col=2, lwd=2) abline(v=myPeaks2$peaks[myWin,mySample], col=2, lwd=2) # Visualize #2 mySample <- "P17777us22" plot(getPOS(vcf), freq1[,mySample], ylim=c(0,1), type="n", yaxt='n', main = mySample, xlab = "POS", ylab = "Allele balance") axis(side=2, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,'1/4','1/3','1/2','2/3','3/4',1), las=1) abline(h=c(0.25,0.333,0.5,0.666,0.75), col=8) points(getPOS(vcf), freq1[,mySample], pch = 20, col= "#A6CEE3") points(getPOS(vcf), freq2[,mySample], pch = 20, col= "#1F78B4") segments(x0=myPeaks1$wins[,'START_pos'], y0=myPeaks1$peaks[,mySample], x1=myPeaks1$wins[,'END_pos'], lwd=3) segments(x0=myPeaks1$wins[,'START_pos'], y0=myPeaks2$peaks[,mySample], x1=myPeaks1$wins[,'END_pos'], lwd=3)
Converts allele balance data produced by freq_peak()
to a copy number by assinging the allele balance data (frequencies) to its closest expected ratio.
freq_peak_plot( pos, posUnits = "bp", ab1 = NULL, ab2 = NULL, fp1 = NULL, fp2 = NULL, mySamp = 1, col1 = "#A6CEE3", col2 = "#1F78B4", alpha = 44, main = NULL, mhist = TRUE, layout = TRUE, ... )
freq_peak_plot( pos, posUnits = "bp", ab1 = NULL, ab2 = NULL, fp1 = NULL, fp2 = NULL, mySamp = 1, col1 = "#A6CEE3", col2 = "#1F78B4", alpha = 44, main = NULL, mhist = TRUE, layout = TRUE, ... )
pos |
chromosomal position of variants |
posUnits |
units ('bp', 'Kbp', 'Mbp', 'Gbp') for 'pos' to be converted to in the main plot |
ab1 |
matrix of allele balances for allele 1 |
ab2 |
matrix of allele balances for allele 2 |
fp1 |
freq_peak object for allele 1 |
fp2 |
freq_peak object for allele 2 |
mySamp |
sample indicator |
col1 |
color 1 |
col2 |
color 2 |
alpha |
sets the transparency for dot plot (0-255) |
main |
main plot title. |
mhist |
logical indicating to include a marginal histogram |
layout |
call layout |
... |
parameters passed on to other functions |
Creates a visualization of allele balance data consisting of a dot plot with position as the x-axis and frequency on the y-axis and an optional marginal histogram. The only required information is a vector of chromosomal positions, however this is probably not going to create an interesting plot.
An invisible NULL.
freq_peak, peak_to_ploid
# An empty plot. freq_peak_plot(pos=1:40) data(vcfR_example) gt <- extract.gt(vcf) hets <- is_het(gt) # Censor non-heterozygous positions. is.na(vcf@gt[,-1][!hets]) <- TRUE # Extract allele depths. ad <- extract.gt(vcf, element = "AD") ad1 <- masplit(ad, record = 1) ad2 <- masplit(ad, record = 2) freq1 <- ad1/(ad1+ad2) freq2 <- ad2/(ad1+ad2) myPeaks1 <- freq_peak(freq1, getPOS(vcf)) is.na(myPeaks1$peaks[myPeaks1$counts < 20]) <- TRUE myPeaks2 <- freq_peak(freq2, getPOS(vcf), lhs = FALSE) is.na(myPeaks2$peaks[myPeaks2$counts < 20]) <- TRUE freq_peak_plot(pos = getPOS(vcf), ab1 = freq1, ab2 = freq2, fp1 = myPeaks1, fp2=myPeaks2)
# An empty plot. freq_peak_plot(pos=1:40) data(vcfR_example) gt <- extract.gt(vcf) hets <- is_het(gt) # Censor non-heterozygous positions. is.na(vcf@gt[,-1][!hets]) <- TRUE # Extract allele depths. ad <- extract.gt(vcf, element = "AD") ad1 <- masplit(ad, record = 1) ad2 <- masplit(ad, record = 2) freq1 <- ad1/(ad1+ad2) freq2 <- ad2/(ad1+ad2) myPeaks1 <- freq_peak(freq1, getPOS(vcf)) is.na(myPeaks1$peaks[myPeaks1$counts < 20]) <- TRUE myPeaks2 <- freq_peak(freq2, getPOS(vcf), lhs = FALSE) is.na(myPeaks2$peaks[myPeaks2$counts < 20]) <- TRUE freq_peak_plot(pos = getPOS(vcf), ab1 = freq1, ab2 = freq2, fp1 = myPeaks1, fp2=myPeaks2)
Calculate measures of genetic differentiation.
genetic_diff(vcf, pops, method = "nei")
genetic_diff(vcf, pops, method = "nei")
vcf |
a vcfR object |
pops |
factor indicating populations |
method |
the method to measure differentiation |
Measures of genetic differentiation, or fixation indicies, are commonly reported population genetic parameters. This function reports genetic differentiation for all variants presented to it.
The method nei returns Nei's Gst as well as Hedrick's G'st, a correction for high alleleism (Hedrick 2005).
Here it is calculated as in equation 2 from Hedrick (2005) with the exception that the heterozygosities are weighted by the number of alleles observed in each subpopulation.
This is similar to hierfstat::pairwise.fst()
but by using the number of alleles instead of the number of individuals it avoids making an assumption about how many alleles are contributed by each individual.
G'st is calculated as in equation 4b from Hedrick (2005).
This method is based on heterozygosity where all of the alleles in a population are used to calculate allele frequecies.
This may make this a good choice when there is a mixture of ploidies in the sample.
The method jost return's Jost's D as a measure of differentiation. This is calculated as in equation 13 from Jost (2008). Examples are available at Jost's website: http://www.loujost.com.
A nice review of Fst and some of its analogues can be found in Holsinger and Weir (2009).
Hedrick, Philip W. "A standardized genetic differentiation measure." Evolution 59.8 (2005): 1633-1638.
Holsinger, Kent E., and Bruce S. Weir. "Genetics in geographically structured populations: defining, estimating and interpreting FST." Nature Reviews Genetics 10.9 (2009): 639-650.
Jost, Lou. "GST and its relatives do not measure differentiation." Molecular ecology 17.18 (2008): 4015-4026.
Whitlock, Michael C. "G'ST and D do not replace FST." Molecular Ecology 20.6 (2011): 1083-1091.
poppr.amova in poppr, amova in ade4, amova in pegas, hierfstat, DEMEtics, and, mmod.
data(vcfR_example) myPops <- as.factor(rep(c('a','b'), each = 9)) myDiff <- genetic_diff(vcf, myPops, method = "nei") colMeans(myDiff[,c(3:8,11)], na.rm = TRUE) hist(myDiff$Gprimest, xlab = expression(italic("G'"["ST"])), col='skyblue', breaks = seq(0, 1, by = 0.01))
data(vcfR_example) myPops <- as.factor(rep(c('a','b'), each = 9)) myDiff <- genetic_diff(vcf, myPops, method = "nei") colMeans(myDiff[,c(3:8,11)], na.rm = TRUE) hist(myDiff$Gprimest, xlab = expression(italic("G'"["ST"])), col='skyblue', breaks = seq(0, 1, by = 0.01))
Functions which modify a matrix or vector of genotypes.
alleles2consensus(x, sep = "/", NA_to_n = TRUE) get.alleles(x2, split = "/", na.rm = FALSE, as.numeric = FALSE)
alleles2consensus(x, sep = "/", NA_to_n = TRUE) get.alleles(x2, split = "/", na.rm = FALSE, as.numeric = FALSE)
x |
a matrix of alleles as genotypes (e.g., A/A, C/G, etc.) |
sep |
a character which delimits the alleles in a genotype (/ or |) |
NA_to_n |
logical indicating whether NAs should be scores as n |
x2 |
a vector of genotypes |
split |
character passed to strsplit to split the genotype into alleles |
na.rm |
logical indicating whether to remove NAs |
as.numeric |
logical specifying whether to convert to a numeric |
The function alleles2consensus converts genotypes to a single consensus allele using IUPAC ambiguity codes for heterozygotes. Note that some functions, such as ape::seg.sites do not recognize ambiguity characters (other than 'n'). This means that these functions, as well as functions that depend on them (e.g., pegas::tajima.test), will produce unexpected results.
Missing data are handled in a number of steps. When both alleles are missing ('.') the genotype is converted to NA. Secondly, if one of the alleles is missing ('.') the genotype is converted to NA> Lastly, NAs can be optionally converted to 'n' for compatibility with DNAbin objects.
The function get.alleles takes a vector of genotypes and returns the unique alleles.
Both chromR objects and vcfR objects contain a region with fixed variables. These accessors allow you to isolate these variables from these objects.
getFIX(x, getINFO = FALSE) getCHROM(x) getPOS(x) getQUAL(x) getALT(x) getREF(x) getID(x) getFILTER(x) getINFO(x)
getFIX(x, getINFO = FALSE) getCHROM(x) getPOS(x) getQUAL(x) getALT(x) getREF(x) getID(x) getFILTER(x) getINFO(x)
x |
a vcfR or chromR object |
getINFO |
logical specifying whether getFIX should return the INFO column |
a vector or data frame
library("vcfR") data("vcfR_example") data("chromR_example") getFIX(vcf) %>% head getFIX(chrom) %>% head getCHROM(vcf) %>% head getCHROM(chrom) %>% head getPOS(vcf) %>% head getPOS(chrom) %>% head getID(vcf) %>% head getID(chrom) %>% head getREF(vcf) %>% head getREF(chrom) %>% head getALT(vcf) %>% head getALT(chrom) %>% head getQUAL(vcf) %>% head getQUAL(chrom) %>% head getFILTER(vcf) %>% head getFILTER(chrom) %>% head getINFO(vcf) %>% head getINFO(chrom) %>% head
library("vcfR") data("vcfR_example") data("chromR_example") getFIX(vcf) %>% head getFIX(chrom) %>% head getCHROM(vcf) %>% head getCHROM(chrom) %>% head getPOS(vcf) %>% head getPOS(chrom) %>% head getID(vcf) %>% head getID(chrom) %>% head getREF(vcf) %>% head getREF(chrom) %>% head getALT(vcf) %>% head getALT(chrom) %>% head getQUAL(vcf) %>% head getQUAL(chrom) %>% head getFILTER(vcf) %>% head getFILTER(chrom) %>% head getINFO(vcf) %>% head getINFO(chrom) %>% head
Functions that make population genetics summaries
gt2popsum(x, deprecated = TRUE) gt.to.popsum(x)
gt2popsum(x, deprecated = TRUE) gt.to.popsum(x)
x |
object of class chromR or vcfR |
deprecated |
logical specifying whether to run the function (FALSE) or present deprecation message (TRUE). |
The function 'gt2popsum' was deprecated in vcfR 1.8.0. This was because it was written entirely in R and did not perform well. Users should use 'gt.to.popsum()' instead because it has similar functionality but includes calls to C++ to increase its performance.
This function creates common population genetic summaries from either a chromR or vcfR object.
The default is to return a matrix containing allele counts, He, and Ne.
Allele_counts is the a comma delimited string of counts.
The first position is the count of reference alleles, the second positions is the count of the first alternate alleles, the third is the count of second alternate alleles, and so on.
He is the gene diversity, or heterozygosity, of the population.
This is , or the probability that two alleles sampled from the population are different, following Nei (1973).
Ne is the effective number of alleles in the population.
This is
or one minus the homozygosity, from Nei (1987) equation 8.17.
Nei, M., 1973. Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences, 70(12), pp.3321-3323.
Nei, M., 1987. Molecular evolutionary genetics. Columbia University Press.
data(vcfR_test) # Check the genotypes. extract.gt(vcfR_test) # Summarize the genotypes. gt.to.popsum(vcfR_test)
data(vcfR_test) # Check the genotypes. extract.gt(vcfR_test) # Summarize the genotypes. gt.to.popsum(vcfR_test)
Heatmap of a numeric matrix with barplots summarizing columns and rows.
heatmap.bp( x, cbarplot = TRUE, rbarplot = TRUE, legend = TRUE, clabels = TRUE, rlabels = TRUE, na.rm = TRUE, scale = c("row", "column", "none"), col.ramp = viridisLite::viridis(n = 100, alpha = 1), ... )
heatmap.bp( x, cbarplot = TRUE, rbarplot = TRUE, legend = TRUE, clabels = TRUE, rlabels = TRUE, na.rm = TRUE, scale = c("row", "column", "none"), col.ramp = viridisLite::viridis(n = 100, alpha = 1), ... )
x |
a numeric matrix. |
cbarplot |
a logical indicating whether the columns should be summarized with a barplot. |
rbarplot |
a logical indicating whether the rows should be summarized with a barplot. |
legend |
a logical indicating whether a legend should be plotted. |
clabels |
a logical indicating whether column labels should be included. |
rlabels |
a logical indicating whether row labels should be included. |
na.rm |
a logical indicating whether missing values should be removed. |
scale |
character indicating if the values should be centered and scaled in either the row direction or the column direction, or none. The default is "none". |
col.ramp |
vector of colors to be used for the color ramp. |
... |
additional arguments to be passed on. |
The function heatmap.bp creates a heatmap from a numeric matrix with optional barplots to summarize the rows and columns.
heatmap
, image
, heatmap2 in gplots, pheatmap.
library(vcfR) x <- as.matrix(mtcars) heatmap.bp(x) heatmap.bp(x, scale="col") # Use an alternate color ramp heatmap.bp(x, col.ramp = colorRampPalette(c("red", "yellow", "#008000"))(100)) heatmap.bp(x) ## Not run: heatmap.bp(x, cbarplot = FALSE, rbarplot = FALSE, legend = FALSE) heatmap.bp(x, cbarplot = FALSE, rbarplot = TRUE, legend = FALSE) heatmap.bp(x, cbarplot = FALSE, rbarplot = FALSE, legend = TRUE) heatmap.bp(x, cbarplot = FALSE, rbarplot = TRUE, legend = TRUE) heatmap.bp(x, cbarplot = TRUE, rbarplot = FALSE, legend = FALSE) heatmap.bp(x, cbarplot = TRUE, rbarplot = TRUE, legend = FALSE) heatmap.bp(x, cbarplot = TRUE, rbarplot = FALSE, legend = TRUE) heatmap.bp(x, cbarplot = TRUE, rbarplot = TRUE, legend = TRUE) ## End(Not run)
library(vcfR) x <- as.matrix(mtcars) heatmap.bp(x) heatmap.bp(x, scale="col") # Use an alternate color ramp heatmap.bp(x, col.ramp = colorRampPalette(c("red", "yellow", "#008000"))(100)) heatmap.bp(x) ## Not run: heatmap.bp(x, cbarplot = FALSE, rbarplot = FALSE, legend = FALSE) heatmap.bp(x, cbarplot = FALSE, rbarplot = TRUE, legend = FALSE) heatmap.bp(x, cbarplot = FALSE, rbarplot = FALSE, legend = TRUE) heatmap.bp(x, cbarplot = FALSE, rbarplot = TRUE, legend = TRUE) heatmap.bp(x, cbarplot = TRUE, rbarplot = FALSE, legend = FALSE) heatmap.bp(x, cbarplot = TRUE, rbarplot = TRUE, legend = FALSE) heatmap.bp(x, cbarplot = TRUE, rbarplot = FALSE, legend = TRUE) heatmap.bp(x, cbarplot = TRUE, rbarplot = TRUE, legend = TRUE) ## End(Not run)
Reformat INFO data as a data.frame and handle class when possible.
INFO2df(x) metaINFO2df(x, field = "INFO")
INFO2df(x) metaINFO2df(x, field = "INFO")
x |
an object of class vcfR or chromR. |
field |
should either the INFo or FORMAT data be returned? |
The INFO column of VCF data contains descriptors for each variant. Because this column may contain many comma delimited descriptors it may be difficult to interpret. The function INFO2df converts the data into a data.frame. The function metaINFO2df extracts the information in the meta section that describes the INFO descriptors. This function is called by INFO2df to help it handle the class of the data.
A data.frame
data(vcfR_test) metaINFO2df(vcfR_test) getINFO(vcfR_test) INFO2df(vcfR_test)
data(vcfR_test) metaINFO2df(vcfR_test) getINFO(vcfR_test) INFO2df(vcfR_test)
Query a matrix of genotypes for heterozygotes
is_het(x, na_is_false = TRUE) is.het(x, na_is_false = TRUE)
is_het(x, na_is_false = TRUE) is.het(x, na_is_false = TRUE)
x |
a matrix of genotypes |
na_is_false |
should missing data be returned as NA (FALSE) or FALSE (TRUE) |
This function was designed to identify heterozygous positions in a matrix of genotypes.
The matrix of genotypes can be created with extract.gt
.
Because the goal was to identify heterozygotes it may be reasonable to ignore missing values by setting na_is_false to TRUE so that the resulting matrix will consist of only TRUE and FALSE.
In order to preserve missing data as missing na_is_false can be set to FALSE where if at least one allele is missing NA is returned.
data(vcfR_test) gt <- extract.gt(vcfR_test) hets <- is_het(gt) # Censor non-heterozygous positions. is.na(vcfR_test@gt[,-1][!hets]) <- TRUE
data(vcfR_test) gt <- extract.gt(vcfR_test) hets <- is_het(gt) # Censor non-heterozygous positions. is.na(vcfR_test@gt[,-1][!hets]) <- TRUE
Calculate the minor (or other) allele frequency.
maf(x, element = 2)
maf(x, element = 2)
x |
an object of class vcfR or chromR |
element |
specify the allele number to return |
The function maf() calculates the counts and frequency for an allele. A variant may contain more than two alleles. Rare alleles may be true rare alleles or the result of genotyping error. In an attempt to address these competing issues we sort the alleles by their frequency and the report statistics based on their position. For example, setting element=1 would return information about the major (most common) allele. Setting element=2 returns information about the second allele.
a matrix of four columns. The first column is the total number of alleles, the second is the number of NA genotypes, the third is the count and fourth the frequency.
Split a matrix of delimited strings.
masplit( myMat, delim = ",", count = 0L, record = 1L, sort = 1L, decreasing = 1L )
masplit( myMat, delim = ",", count = 0L, record = 1L, sort = 1L, decreasing = 1L )
myMat |
a matrix of delimited strings (e.g., "7,2"). |
delim |
character that delimits values. |
count |
return the count of delimited records. |
record |
which (1-based) record to return. |
sort |
should the records be sorted prior to selecting the element (0,1)? |
decreasing |
should the values be sorted decreasing (1) or increasing (0)? |
Split a matrix of delimited strings that represent numerics into numerics. The parameter count returns a matrix of integers indicating how many delimited records exist in each element. This is intended to help if you do not know how many records are in each element particularly if there is a mixture of numbers of records. The parameter record indicates which record to return (first, second, third, ...). The parameter sort indicates whether the records in each element should be sorted (1) or not (0) prior to selection. When sorting has been selected decreasing indicates if the sorting should be performed in a decreasing (1) or increasing (0) manner prior to selection.
A numeric matrix
set.seed(999) x1 <- round(rnorm(n=9, mean=10, sd=2)) x2 <- round(rnorm(n=9, mean=20, sd=2)) ad <- matrix(paste(x1, x2, sep=","), nrow=3, ncol=3) colnames(ad) <- paste('Sample', 1:3, sep="_") rownames(ad) <- paste('Variant', 1:3, sep="_") ad[1,1] <- "9,23,12" is.na(ad[3,1]) <- TRUE ad masplit(ad, count = 1) masplit(ad, sort = 0) masplit(ad, sort = 0, record = 2) masplit(ad, sort = 0, record = 3) masplit(ad, sort = 1, decreasing = 0)
set.seed(999) x1 <- round(rnorm(n=9, mean=10, sd=2)) x2 <- round(rnorm(n=9, mean=20, sd=2)) ad <- matrix(paste(x1, x2, sep=","), nrow=3, ncol=3) colnames(ad) <- paste('Sample', 1:3, sep="_") rownames(ad) <- paste('Variant', 1:3, sep="_") ad[1,1] <- "9,23,12" is.na(ad[3,1]) <- TRUE ad masplit(ad, count = 1) masplit(ad, sort = 0) masplit(ad, sort = 0, record = 2) masplit(ad, sort = 0, record = 3) masplit(ad, sort = 1, decreasing = 0)
Ordinate information from a sample's GT region and INFO column.
ordisample( x, sample, distance = "bray", plot = TRUE, alpha = 88, verbose = TRUE, ... )
ordisample( x, sample, distance = "bray", plot = TRUE, alpha = 88, verbose = TRUE, ... )
x |
an object of class vcfR or chromR. |
sample |
a sample number where the first sample (column) is 2 |
distance |
metric to be used for ordination, options are in |
plot |
logical specifying whether to plot the ordination |
alpha |
alpha channel (transparency) ranging from 0-255 |
verbose |
logical specifying whether to produce verbose output |
... |
parameters to be passed to child processes |
The INFO column of VCF data contains descriptors for each variant. Each sample typically includes several descriptors of each variant in the GT region as well. This can present an overwhelming amount of information. Ordination is used in this function to reduce this complexity.
The ordination procedure can be rather time consuming depending on how much data is used. I good recommendation is to always start with a small subset of your full dataset and slowly scale up. There are several steps in this function that attempt to eliminate variants or characters that have missing values in them. This that while starting with a small number is good, you will need to have a large enough number so that a substantial amount of the data make it to the ordination step. In the example I use 100 variants which appears to be a reasonable compromise.
The data contained in VCF files can frequently contain a large fraction of missing data. I advovate censoring data that does not meet quality control thresholds as missing which compounds the problem. An attempt is made to omit these missing data by querying the GT and INFO data for missingness and omitting the missing variants. The data may also include characters (columns) that contain all missing values which are omitted as well. When verbose == TRUE these omissions are reported as messages.
Some data may contain multiple values. For example, AD is the sequence depth for each observed allele. In these instances the values are sorted and the largest value is used.
Several of the steps of this ordination make distributional assumptions. That is, they assume the data to be normally distributed. There is no real reason to assume this assumption to be valid with VCF data. It has been my experience that this assumption is frequently violated with VCF data. It is therefore suggested to use this funciton as an exploratory tool that may help inform other decisions. These analyst may be able to address these issues through data transformation or other topics beyond the scope of this function. This function is intended to provide a rapid assessment of the data which may help determine if more elegant handling of the data may be required. Interpretation of the results of this function need to take into account that assumptions may have been violated.
A list consisting of two objects.
an object of class 'metaMDS' created by the function vegan::metaMDS
an object of class 'envfit' created by the function vegan::envfit
This list is returned invisibly.
metaMDS
,
vegdist
,
monoMDS
,
isoMDS
## Not run: # Example of normally distributed, random data. set.seed(9) x1 <- rnorm(500) set.seed(99) y1 <- rnorm(500) plot(x1, y1, pch=20, col="#8B451388", main="Normal, random, bivariate data") data(vcfR_example) ordisample(vcf[1:100,], sample = "P17777us22") vars <- 1:100 myOrd <- ordisample(vcf[vars,], sample = "P17777us22", plot = FALSE) names(myOrd) plot(myOrd$metaMDS, type = "n") points(myOrd$metaMDS, display = "sites", pch=20, col="#8B451366") text(myOrd$metaMDS, display = "spec", col="blue") plot(myOrd$envfit, col = "#008000", add = TRUE) head(myOrd$metaMDS$points) myOrd$envfit pairs(myOrd$data1) # Seperate heterozygotes and homozygotes. gt <- extract.gt(vcf) hets <- is_het(gt, na_is_false = FALSE) vcfhe <- vcf vcfhe@gt[,-1][ !hets & !is.na(hets) ] <- NA vcfho <- vcf vcfho@gt[,-1][ hets & !is.na(hets) ] <- NA myOrdhe <- ordisample(vcfhe[vars,], sample = "P17777us22", plot = FALSE) myOrdho <- ordisample(vcfho[vars,], sample = "P17777us22", plot = FALSE) pairs(myOrdhe$data1) pairs(myOrdho$data1) hist(myOrdho$data1$PL, breaks = seq(0,9000, by=100), col="#8B4513") ## End(Not run)
## Not run: # Example of normally distributed, random data. set.seed(9) x1 <- rnorm(500) set.seed(99) y1 <- rnorm(500) plot(x1, y1, pch=20, col="#8B451388", main="Normal, random, bivariate data") data(vcfR_example) ordisample(vcf[1:100,], sample = "P17777us22") vars <- 1:100 myOrd <- ordisample(vcf[vars,], sample = "P17777us22", plot = FALSE) names(myOrd) plot(myOrd$metaMDS, type = "n") points(myOrd$metaMDS, display = "sites", pch=20, col="#8B451366") text(myOrd$metaMDS, display = "spec", col="blue") plot(myOrd$envfit, col = "#008000", add = TRUE) head(myOrd$metaMDS$points) myOrd$envfit pairs(myOrd$data1) # Seperate heterozygotes and homozygotes. gt <- extract.gt(vcf) hets <- is_het(gt, na_is_false = FALSE) vcfhe <- vcf vcfhe@gt[,-1][ !hets & !is.na(hets) ] <- NA vcfho <- vcf vcfho@gt[,-1][ hets & !is.na(hets) ] <- NA myOrdhe <- ordisample(vcfhe[vars,], sample = "P17777us22", plot = FALSE) myOrdho <- ordisample(vcfho[vars,], sample = "P17777us22", plot = FALSE) pairs(myOrdhe$data1) pairs(myOrdho$data1) hist(myOrdho$data1$PL, breaks = seq(0,9000, by=100), col="#8B4513") ## End(Not run)
pairwise_genetic_diff
Calculate measures of genetic differentiation across all population pairs.
pairwise_genetic_diff(vcf, pops, method = "nei")
pairwise_genetic_diff(vcf, pops, method = "nei")
vcf |
a vcfR object |
pops |
factor indicating populations |
method |
the method to measure differentiation |
a data frame containing the pairwise population differentiation indices of interest across all pairs of populations in the population factor.
Javier F. Tabima
genetic_diff
in vcfR
data(vcfR_example) pops <- as.factor(rep(c('a','b'), each = 9)) myDiff <- pairwise_genetic_diff(vcf, pops, method = "nei") colMeans(myDiff[,c(4:ncol(myDiff))], na.rm = TRUE) pops <- as.factor(rep(c('a','b','c'), each = 6)) myDiff <- pairwise_genetic_diff(vcf, pops, method = "nei") colMeans(myDiff[,c(4:ncol(myDiff))], na.rm = TRUE)
data(vcfR_example) pops <- as.factor(rep(c('a','b'), each = 9)) myDiff <- pairwise_genetic_diff(vcf, pops, method = "nei") colMeans(myDiff[,c(4:ncol(myDiff))], na.rm = TRUE) pops <- as.factor(rep(c('a','b','c'), each = 6)) myDiff <- pairwise_genetic_diff(vcf, pops, method = "nei") colMeans(myDiff[,c(4:ncol(myDiff))], na.rm = TRUE)
Converts allele balance data produced by freq_peak()
to a copy number by assinging the allele balance data (frequencies) to its closest expected ratio.
peak_to_ploid(x)
peak_to_ploid(x)
x |
an object produced by |
Converts allele balance data produced by freq_peak()
to copy number.
See the examples section for a graphical representation of the expectations and the bins around them.
Once a copy number has called a distance from expectation (dfe) is calculated as a form of confidence.
The bins around different copy numbers are of different width, so the dfe is scaled by its respective bin width.
This results in a dfe that is 0 when it is exactly at our expectation (high confidence) and at 1 when it is half way between two expectations (low confidence).
A list consisting of two matrices containing the calls and the distance from expectation (i.e., confidence).
freq_peak
, freq_peak_plot
# Thresholds. plot(c(0.0, 1), c(0,1), type = "n", xaxt = "n", xlab = "Expectation", ylab = "Allele balance") myCalls <- c(1/5, 1/4, 1/3, 1/2, 2/3, 3/4, 4/5) axis(side = 1, at = myCalls, labels = c('1/5', '1/4', '1/3','1/2', '2/3', '3/4', '4/5'), las=2) abline(v=myCalls) abline(v=c(7/40, 9/40, 7/24, 5/12), lty=3, col ="#B22222") abline(v=c(7/12, 17/24, 31/40, 33/40), lty=3, col ="#B22222") text(x=7/40, y=0.1, labels = "7/40", srt = 90) text(x=9/40, y=0.1, labels = "9/40", srt = 90) text(x=7/24, y=0.1, labels = "7/24", srt = 90) text(x=5/12, y=0.1, labels = "5/12", srt = 90) text(x=7/12, y=0.1, labels = "7/12", srt = 90) text(x=17/24, y=0.1, labels = "17/24", srt = 90) text(x=31/40, y=0.1, labels = "31/40", srt = 90) text(x=33/40, y=0.1, labels = "33/40", srt = 90) # Prepare data and visualize data(vcfR_example) gt <- extract.gt(vcf) # Censor non-heterozygous positions. hets <- is_het(gt) is.na(vcf@gt[,-1][!hets]) <- TRUE # Extract allele depths. ad <- extract.gt(vcf, element = "AD") ad1 <- masplit(ad, record = 1) ad2 <- masplit(ad, record = 2) freq1 <- ad1/(ad1+ad2) freq2 <- ad2/(ad1+ad2) myPeaks1 <- freq_peak(freq1, getPOS(vcf)) # Censor windows with fewer than 20 heterozygous positions is.na(myPeaks1$peaks[myPeaks1$counts < 20]) <- TRUE # Convert peaks to ploidy call peak_to_ploid(myPeaks1)
# Thresholds. plot(c(0.0, 1), c(0,1), type = "n", xaxt = "n", xlab = "Expectation", ylab = "Allele balance") myCalls <- c(1/5, 1/4, 1/3, 1/2, 2/3, 3/4, 4/5) axis(side = 1, at = myCalls, labels = c('1/5', '1/4', '1/3','1/2', '2/3', '3/4', '4/5'), las=2) abline(v=myCalls) abline(v=c(7/40, 9/40, 7/24, 5/12), lty=3, col ="#B22222") abline(v=c(7/12, 17/24, 31/40, 33/40), lty=3, col ="#B22222") text(x=7/40, y=0.1, labels = "7/40", srt = 90) text(x=9/40, y=0.1, labels = "9/40", srt = 90) text(x=7/24, y=0.1, labels = "7/24", srt = 90) text(x=5/12, y=0.1, labels = "5/12", srt = 90) text(x=7/12, y=0.1, labels = "7/12", srt = 90) text(x=17/24, y=0.1, labels = "17/24", srt = 90) text(x=31/40, y=0.1, labels = "31/40", srt = 90) text(x=33/40, y=0.1, labels = "33/40", srt = 90) # Prepare data and visualize data(vcfR_example) gt <- extract.gt(vcf) # Censor non-heterozygous positions. hets <- is_het(gt) is.na(vcf@gt[,-1][!hets]) <- TRUE # Extract allele depths. ad <- extract.gt(vcf, element = "AD") ad1 <- masplit(ad, record = 1) ad2 <- masplit(ad, record = 2) freq1 <- ad1/(ad1+ad2) freq2 <- ad2/(ad1+ad2) myPeaks1 <- freq_peak(freq1, getPOS(vcf)) # Censor windows with fewer than 20 heterozygous positions is.na(myPeaks1$peaks[myPeaks1$counts < 20]) <- TRUE # Convert peaks to ploidy call peak_to_ploid(myPeaks1)
Functions which process chromR objects
Create representation of a sequence. Begining and end points are determined for stretches of nucleotides. Stretches are determined by querying each nucleotides in a sequence to determine if it is represented in the database of characters (chars).
proc.chromR(x, win.size = 1000, verbose = TRUE) regex.win(x, max.win = 1000, regex = "[acgtwsmkrybdhv]") seq2rects(x, chars = "acgtwsmkrybdhv", lower = TRUE) var.win(x, win.size = 1000)
proc.chromR(x, win.size = 1000, verbose = TRUE) regex.win(x, max.win = 1000, regex = "[acgtwsmkrybdhv]") seq2rects(x, chars = "acgtwsmkrybdhv", lower = TRUE) var.win(x, win.size = 1000)
x |
object of class chromR |
win.size |
integer indicating size for windowing processes |
verbose |
logical indicating whether verbose output should be reported |
max.win |
maximum window size |
regex |
a regular expression to indicate nucleotides to be searched for |
chars |
a vector of characters to be used as a database for inclusion in rectangles |
lower |
converts the sequence and database to lower case, making the search case insensitive |
The function proc_chromR() calls helper functions to process the data present in a chromR object into summaries statistics.
The function regex.win() is used to generate coordinates to define rectangles to represent regions of the chromosome containing called nucleotides (acgtwsmkrybdhv). It is then called a second time to generate coordinates to define rectangles to represent regions called as uncalled nucleotides (n, but not gaps).
The function gt2popsum is called to create summaries of the variant data.
The function var.win is called to create windowized summaries of the chromR object.
Each window receives a name and its coordinates. Several attempts are made to name the windows appropriately. First, the CHROM column of vcfR@fix is queried for a name. Next, the label of the sequence is queried for a name. Next, the first cell of the annotation matrix is queried. If an appropriate name was not found in the above locations the chromR object's 'name' slot is used. Note that the 'name' slot has a default value. If this default value is not updated then all of your windows may receive the same name.
Query the 'gt' slot of objects of class vcfR
is.polymorphic(x, na.omit = FALSE) is.biallelic(x)
is.polymorphic(x, na.omit = FALSE) is.biallelic(x)
x |
an object of class vcfR |
na.omit |
logical to omit missing data |
The function is_polymorphic returns a vector of logicals indicating whether a variant is polymorphic. Only variable sites are reported in vcf files. However, once someone manipulates a vcfR object, a site may become invariant. For example, if a sample is removed it may result in a site becoming invariant. This function queries the sites in a vcfR object and returns a vector of logicals (TRUE/FALSE) to indicate if they are actually variable.
The function is_bialleleic returns a vector of logicals indicating whether a variant is biallelic. Some analyses or downstream analyses only work with biallelic loci. This function can help manage this.
Note that is_bialleleic queries the ALT column in the fix slot to count alleles. If you remove samples from the gt slot you may invalidate the information in the fix slot. For example, if you remove the samples with the alternate allele you will make the position invariant and this function will provide inaccurate information. So use caution if you've made many modifications to your data.
Query the META section of VCF data for information about acronyms.
queryMETA(x, element = NULL, nice = TRUE)
queryMETA(x, element = NULL, nice = TRUE)
x |
an object of class vcfR or chromR. |
element |
an acronym to search for in the META portion of the VCF data. |
nice |
logical indicating whether to format the data in a 'nice' manner. |
The META portion of VCF data defines acronyms that are used elsewhere in the data.
In order to better understand these acronyms they should be referenced.
This function facilitates looking up of acronyms to present their relevant information.
When 'element' is 'NULL' (the default), all acronyms from the META region are returned.
When 'element' is specified an attempt is made to return information about the provided element.
The function grep
is used to perform this query.
If 'nice' is set to FALSE then the data is presented as it was in the file.
If 'nice' is set to TRUE the data is processed to make it appear more 'nice'.
data(vcfR_test) queryMETA(vcfR_test) queryMETA(vcfR_test, element = "DP")
data(vcfR_test) queryMETA(vcfR_test) queryMETA(vcfR_test, element = "DP")
Rank variants within windows.
rank.variants.chromR(x, scores)
rank.variants.chromR(x, scores)
x |
an object of class Crhom or a data.frame containing... |
scores |
a vector of scores for each variant to be used to rank the data |
Converts allele balance data produced by freq_peak()
to a copy number by assinging the allele balance data (frequencies) to its closest expected ratio.
rePOS(x, lens, ret.lens = FALSE, buff = 0)
rePOS(x, lens, ret.lens = FALSE, buff = 0)
x |
a vcfR object |
lens |
a data.frame describing the reference |
ret.lens |
logical specifying whether lens should be returned |
buff |
an integer indicating buffer length |
Each chromosome in a genome typically begins with position one.
This creates a problem when plotting the data associated with each chromosome because the information will overlap.
This function uses the information in the data.frame lens
to create a new coordinate system where chromosomes do not overlap.
The data.frame lens should have a row for each chromosome and two columns. The first column is the name of each chromosome as it appears in the vcfR object. The second column is the length of each chromosome.
The parameter buff indicates the length of a buffer to put in between each chromosome. This buffer may help distinguish chromosomes from one another.
In order to create the new coordinates the lens
data.frame is updated with the new start positions.
The parameter
Either a vector of integers that represent the new coordinate system or a list containing the vector of integers and the lens data.frame.
# Create some VCF data. data(vcfR_example) vcf1 <-vcf[1:500,] vcf2 <-vcf[500:1500,] vcf3 <- vcf[1500:2533] vcf1@fix[,'CHROM'] <- 'chrom1' vcf2@fix[,'CHROM'] <- 'chrom2' vcf3@fix[,'CHROM'] <- 'chrom3' vcf2@fix[,'POS'] <- as.character(getPOS(vcf2) - 21900) vcf3@fix[,'POS'] <- as.character(getPOS(vcf3) - 67900) vcf <- rbind2(vcf1, vcf2) vcf <- rbind2(vcf, vcf3) rm(vcf1, vcf2, vcf3) # Create lens lens <- data.frame(matrix(nrow=3, ncol=2)) lens[1,1] <- 'chrom1' lens[2,1] <- 'chrom2' lens[3,1] <- 'chrom3' lens[1,2] <- 22000 lens[2,2] <- 47000 lens[3,2] <- 32089 # Illustrate the issue. dp <- extract.info(vcf, element="DP", as.numeric=TRUE) plot(getPOS(vcf), dp, col=as.factor(getCHROM(vcf))) # Resolve the issue. newPOS <- rePOS(vcf, lens) dp <- extract.info(vcf, element="DP", as.numeric=TRUE) plot(newPOS, dp, col=as.factor(getCHROM(vcf))) # Illustrate the buffer newPOS <- rePOS(vcf, lens, buff=10000) dp <- extract.info(vcf, element="DP", as.numeric=TRUE) plot(newPOS, dp, col=as.factor(getCHROM(vcf)))
# Create some VCF data. data(vcfR_example) vcf1 <-vcf[1:500,] vcf2 <-vcf[500:1500,] vcf3 <- vcf[1500:2533] vcf1@fix[,'CHROM'] <- 'chrom1' vcf2@fix[,'CHROM'] <- 'chrom2' vcf3@fix[,'CHROM'] <- 'chrom3' vcf2@fix[,'POS'] <- as.character(getPOS(vcf2) - 21900) vcf3@fix[,'POS'] <- as.character(getPOS(vcf3) - 67900) vcf <- rbind2(vcf1, vcf2) vcf <- rbind2(vcf, vcf3) rm(vcf1, vcf2, vcf3) # Create lens lens <- data.frame(matrix(nrow=3, ncol=2)) lens[1,1] <- 'chrom1' lens[2,1] <- 'chrom2' lens[3,1] <- 'chrom3' lens[1,2] <- 22000 lens[2,2] <- 47000 lens[3,2] <- 32089 # Illustrate the issue. dp <- extract.info(vcf, element="DP", as.numeric=TRUE) plot(getPOS(vcf), dp, col=as.factor(getCHROM(vcf))) # Resolve the issue. newPOS <- rePOS(vcf, lens) dp <- extract.info(vcf, element="DP", as.numeric=TRUE) plot(newPOS, dp, col=as.factor(getCHROM(vcf))) # Illustrate the buffer newPOS <- rePOS(vcf, lens, buff=10000) dp <- extract.info(vcf, element="DP", as.numeric=TRUE) plot(newPOS, dp, col=as.factor(getCHROM(vcf)))
Methods that act on objects of class chromR
## S4 method for signature 'chromR' show(object) ## S4 method for signature 'chromR' plot(x, y, ...) ## S4 method for signature 'chromR' print(x, y, ...) ## S4 method for signature 'chromR' head(x, n = 6) ## S4 replacement method for signature 'chromR,character' names(x) <- value ## S4 method for signature 'chromR' length(x)
## S4 method for signature 'chromR' show(object) ## S4 method for signature 'chromR' plot(x, y, ...) ## S4 method for signature 'chromR' print(x, y, ...) ## S4 method for signature 'chromR' head(x, n = 6) ## S4 replacement method for signature 'chromR,character' names(x) <- value ## S4 method for signature 'chromR' length(x)
object |
an object of class chromR |
x |
an object of class chromR |
y |
not currently used |
... |
Arguments to be passed to methods |
n |
integer indicating the number of elements to be printed from an object |
value |
a character containing a name |
Methods that act on objects of class chromR.
Display a summary of a vcfR object.
head returns the first parts of an object of class vcfR.
The brackets ('[]') subset objects of class vcfR
The plot method visualizes objects of class vcfR
## S4 method for signature 'vcfR' show(object) ## S4 method for signature 'vcfR' head(x, n = 6, maxchar = 80) ## S4 method for signature 'vcfR,ANY,ANY,ANY' x[i, j, samples = NULL, ..., drop] ## S4 method for signature 'vcfR' plot(x, y, ...) ## S4 method for signature 'vcfR,missing' rbind2(x, y, ...) ## S4 method for signature 'vcfR,ANY' rbind2(x, y, ...) ## S4 method for signature 'vcfR,vcfR' rbind2(x, y, ...) ## S4 method for signature 'vcfR' dim(x) ## S4 method for signature 'vcfR' nrow(x)
## S4 method for signature 'vcfR' show(object) ## S4 method for signature 'vcfR' head(x, n = 6, maxchar = 80) ## S4 method for signature 'vcfR,ANY,ANY,ANY' x[i, j, samples = NULL, ..., drop] ## S4 method for signature 'vcfR' plot(x, y, ...) ## S4 method for signature 'vcfR,missing' rbind2(x, y, ...) ## S4 method for signature 'vcfR,ANY' rbind2(x, y, ...) ## S4 method for signature 'vcfR,vcfR' rbind2(x, y, ...) ## S4 method for signature 'vcfR' dim(x) ## S4 method for signature 'vcfR' nrow(x)
object |
a vcfR object |
x |
object of class vcfR |
n |
number of rows to print |
maxchar |
maximum number of characters to print per line |
i |
vector of rows (variants) to include |
j |
vector of columns (samples) to include |
samples |
vector (numeric, character or logical) specifying samples, see details |
... |
arguments to be passed to other methods |
drop |
delete the dimensions of an array which only has one level |
y |
not used |
The method show is used to display an object. Because vcf data are relatively large, this has been abbreviated. Here we display the first four lines of the meta section, and truncate them to no more than 80 characters. The first eight columns and six rows of the fix section are also displayed.
The method head is similar to show, but is more flexible. The number of rows displayed is parameterized by the variable n. And the maximum number of characters to print per line (row) is also parameterized. In contract to show, head includes a summary of the gt portion of the vcfR object.
The square brackets ([]) are used to subset objects of class vcfR. Rows are subset by providing a vector i to specify which rows to use. The columns in the fix slot will not be subset by j. The parameter j is a vector used to subset the columns of the gt slot. Note that it is essential to include the first column here (FORMAT) or downsream processes will encounter trouble.
The samples parameter allows another way to select samples. Because the first column of the gt section is the FORMAT column you typically need to include that column and sample numbers therefore begin at two. Use of the samples parameter allows you to select columns by a vector of numerics, logicals or characters. When numerics are used the samples can be selected starting at one. The function will then add one to this vector and include one to select the desired samples and the FORMAT column. When a vector of characters is used it should contain the desired sample names. The function will add the FORMAT column if it is not the first element. When a vector of logicals is used a TRUE will be added to the vector to ensure the FORMAT column is selected. Note that specification of samples will override specification of j.
The plot method generates a histogram from data found in the 'QUAL' column from the 'fix' slot.
Read and files in the *.vcf structured text format, as well as the compressed *.vcf.gz format. Write objects of class vcfR to *.vcf.gz.
read.vcfR( file, limit = 1e+07, nrows = -1, skip = 0, cols = NULL, convertNA = TRUE, checkFile = TRUE, check_keys = TRUE, verbose = TRUE ) write.vcf(x, file = "", mask = FALSE, APPEND = FALSE)
read.vcfR( file, limit = 1e+07, nrows = -1, skip = 0, cols = NULL, convertNA = TRUE, checkFile = TRUE, check_keys = TRUE, verbose = TRUE ) write.vcf(x, file = "", mask = FALSE, APPEND = FALSE)
file |
A filename for a variant call format (vcf) file. |
limit |
amount of memory (in bytes) not to exceed when reading in a file. |
nrows |
integer specifying the maximum number of rows (variants) to read in. |
skip |
integer specifying the number of rows (variants) to skip before beginning to read data. |
cols |
vector of column numbers to extract from file. |
convertNA |
logical specifying to convert VCF missing data to NA. |
checkFile |
test if the first line follows the VCF specification. |
check_keys |
logical determining if |
verbose |
report verbose progress. |
x |
An object of class vcfR or chromR. |
mask |
logical vector indicating rows to use. |
APPEND |
logical indicating whether to append to existing vcf file or write a new file. |
The function read.vcfR reads in files in *.vcf (text) and *.vcf.gz (gzipped text) format and returns an object of class vcfR. The parameter 'limit' is an attempt to keep the user from trying to read in a file which contains more data than there is memory to hold. Based on the dimensions of the data matrix, an estimate of how much memory needed is made. If this estimate exceeds the value of 'limit' an error is thrown and execution stops. The user may increase this limit to any value, but is encourages to compare that value to the amout of available physical memory.
It is possible to input part of a VCF file by using the parameters nrows, skip and cols. The first eight columns (the fix region) are part of the definition and will always be included. Any columns beyond eight are optional (the gt region). You can specify which of these columns you would like to input by setting the cols parameter. If you want a usable vcfR object you will want to always include nine (the FORMAT column). If you do not include column nine you may experience reduced functionality.
According to the VCF specification missing data are encoded by a period (".").
Within the R language, missing data can be encoded as NA.
The parameter 'convertNA' allows the user to either retain the VCF representation or the R representation of missing data.
Note that the conversion only takes place when the entire value can be determined to be missing.
For example, ".|.:48:8:51,51" would be retained because the missing genotype is accompanied by other delimited information.
In contrast, ".|." should be converted to NA when convertNA = TRUE
.
If file begins with http://, https://, ftp://, or ftps:// it is interpreted as a link. When this happens, file is split on the delimiter '/' and the last element is used as the filename. A check is performed to determine if this file exists in the working directory. If a local file is found it is used. If a local file is not found the remote file is downloaded to the working directory and read in.
The function write.vcf takes an object of either class vcfR or chromR and writes the vcf data to a vcf.gz file (gzipped text). If the parameter 'mask' is set to FALSE, the entire object is written to file. If the parameter 'mask' is set to TRUE and the object is of class chromR (which has a mask slot), this mask is used to subset the data. If an index is supplied as 'mask', then this index is used, and recycled as necessary, to subset the data.
Because vcfR provides the opportunity to manipulate VCF data, it also provides the opportunity for the user to create invalid VCF files. If there is a question regarding the validity of a file you have created one option is the VCF validator from VCF tools.
read.vcfR returns an object of class vcfR-class
.
See the vignette: vignette('vcf_data')
.
The function write.vcf creates a gzipped VCF file.
CRAN: pegas::read.vcf, PopGenome::readVCF, data.table::fread
Bioconductor: VariantAnnotation::readVcf
Use: browseVignettes('vcfR') to find examples.
data(vcfR_test) vcfR_test head(vcfR_test) # CRAN requires developers to us a tempdir when writing to the filesystem. # You may want to implement this example elsewhere. orig_dir <- getwd() temp_dir <- tempdir() setwd( temp_dir ) write.vcf( vcfR_test, file = "vcfR_test.vcf.gz" ) vcf <- read.vcfR( file = "vcfR_test.vcf.gz", verbose = FALSE ) vcf setwd( orig_dir )
data(vcfR_test) vcfR_test head(vcfR_test) # CRAN requires developers to us a tempdir when writing to the filesystem. # You may want to implement this example elsewhere. orig_dir <- getwd() temp_dir <- tempdir() setwd( temp_dir ) write.vcf( vcfR_test, file = "vcfR_test.vcf.gz" ) vcf <- read.vcfR( file = "vcfR_test.vcf.gz", verbose = FALSE ) vcf setwd( orig_dir )
An example dataset containing parts of the *Phytophthora infestans* genome.
A DNAbin object, a data.frame and a vcfR object
dna DNAbin object
gff gff format data.frame
vcf vcfR object
This data is a subset of the pinfsc50 dataset. It has been subset to positions between 500 and 600 kbp. The coordinate systems of the vcf and gff file have been altered by subtracting 500,000. This results in a 100 kbp section of supercontig_1.50 that has positional data ranging from 1 to 100 kbp.
Note that it is encouraged to keep package contents small to facilitate easy downloading and installation. This is why a mitochondrion was chosen as an example. In practice I've used this package on supercontigs. This package was designed for much larger datasets in mind than in this example.
data(vcfR_example)
data(vcfR_example)
A test file containing a diversity of examples intended to test functionality.
A vcfR object
vcfR_test vcfR object
This data set began as the example (section 1.1) from The Variant Call Format Specification VCFv4.3 . This data consisted of 3 samples and 5 variants. As I encounter examples that challenge the code in vcfR they can be added to this data set.
data(vcfR_test) ## Not run: # When I add data it can be saved with this command. save(vcfR_test, file="data/vcfR_test.RData") ## End(Not run)
data(vcfR_test) ## Not run: # When I add data it can be saved with this command. save(vcfR_test, file="data/vcfR_test.RData") ## End(Not run)
An S4 class for storing VCF data.
Defines a class for variant call format data. A vcfR object contains three slots. The first slot is a character vector which holds the meta data. The second slot holds an eight column matrix to hold the fixed data. The third slot is a matrix which holds the genotype data. The genotype data is optional according to the VCF definition. When it is missing the gt slot should consist of a character matrix with zero rows and columns.
See vignette('vcf_data')
for more information.
See the VCF specification for the file specification.
meta
character vector for the meta information
fix
matrix for the fixed information
gt
matrix for the genotype information
Convert objects of class vcfR to objects of class ape::DNAbin
vcfR2DNAbin( x, extract.indels = TRUE, consensus = FALSE, extract.haps = TRUE, unphased_as_NA = TRUE, asterisk_as_del = FALSE, ref.seq = NULL, start.pos = NULL, verbose = TRUE )
vcfR2DNAbin( x, extract.indels = TRUE, consensus = FALSE, extract.haps = TRUE, unphased_as_NA = TRUE, asterisk_as_del = FALSE, ref.seq = NULL, start.pos = NULL, verbose = TRUE )
x |
an object of class chromR or vcfR |
extract.indels |
logical indicating to remove indels (TRUE) or to include them while retaining alignment |
consensus |
logical, indicates whether an IUPAC ambiguity code should be used for diploid heterozygotes |
extract.haps |
logical specifying whether to separate each genotype into alleles based on a delimiting character |
unphased_as_NA |
logical indicating how to handle alleles in unphased genotypes |
asterisk_as_del |
logical indicating that the asterisk allele should be converted to a deletion (TRUE) or NA (FALSE) |
ref.seq |
reference sequence (DNAbin) for the region being converted |
start.pos |
chromosomal position for the start of the ref.seq |
verbose |
logical specifying whether to produce verbose output |
Objects of class DNAbin, from the package ape, store nucleotide sequence information. Typically, nucleotide sequence information contains all the nucleotides within a region, for example, a gene. Because most sites are typically invariant, this results in a large amount of redundant data. This is why files in the vcf format only contain information on variant sites, it results in a smaller file. Nucleotide sequences can be generated which only contain variant sites. However, some applications require the invariant sites. For example, inference of phylogeny based on maximum likelihood or Bayesian methods requires invariant sites. The function vcfR2DNAbin therefore includes a number of options in attempt to accomodate various scenarios.
The presence of indels (insertions or deletions)in a sequence typically presents a data analysis problem. Mutation models typically do not accomodate this data well. For now, the only option is for indels to be omitted from the conversion of vcfR to DNAbin objects. The option extract.indels was included to remind us of this, and to provide a placeholder in case we wish to address this in the future.
The ploidy of the samples is inferred from the first non-missing genotype. All samples and all variants within each sample are assumed to be of the same ploid.
Conversion of haploid data is fairly straight forward.
The options consensus
and extract.haps
are not relevant here.
When vcfR2DNAbin encounters missing data in the vcf data (NA) it is coded as an ambiguous nucleotide (n) in the DNAbin object.
When no reference sequence is provided (option ref.seq
), a DNAbin object consisting only of variant sites is created.
When a reference sequence and a starting position are provided the entire sequence, including invariant sites, is returned.
The reference sequence is used as a starting point and variable sitees are added to this.
Because the data in the vcfR object will be using a chromosomal coordinate system, we need to tell the function where on this chromosome the reference sequence begins.
Conversion of diploid data presents a number of scenarios.
When the option consensus
is TRUE and extract.haps
is FALSE, each genotype is split into two alleles and the two alleles are converted into their IUPAC ambiguity code.
This results in one sequence for each diploid sample.
This may be an appropriate path when you have unphased data.
Note that functions called downstream of this choice may handle IUPAC ambiguity codes in unexpected manners.
When extract.haps is set to TRUE, each genotype is split into two alleles.
These alleles are inserted into two sequences.
This results in two sequences per diploid sample.
Note that this really only makes sense if you have phased data.
The options ref.seq and start.pos are used as in halpoid data.
When a variant overlaps a deletion it may be encoded by an asterisk allele (*).
The GATK site covers this in a post on Spanning or overlapping deletions ].
This is handled in vcfR by allowing the user to decide how it is handled with the paramenter asterisk_as_del
.
When asterisk_as_del
is TRUE this allele is converted into a deletion ('-').
When asterisk_as_del
is FALSE the asterisk allele is converted to NA.
If extract.indels
is set to FALSE it should override this decision.
Conversion of polyploid data is currently not supported. However, I have made some attempts at accomodating polyploid data. If you have polyploid data and are interested in giving this a try, feel free. But be prepared to scrutinize the output to make sure it appears reasonable.
Creation of DNAbin objects from large chromosomal regions may result in objects which occupy large amounts of memory. If in doubt, begin by subsetting your data and the scale up to ensure you do not run out of memory.
library(ape) data(vcfR_test) # Create an example reference sequence. nucs <- c('a','c','g','t') set.seed(9) myRef <- as.DNAbin(matrix(nucs[round(runif(n=20, min=0.5, max=4.5))], nrow=1)) # Recode the POS data for a smaller example. set.seed(99) vcfR_test@fix[,'POS'] <- sort(sample(10:20, size=length(getPOS(vcfR_test)))) # Just vcfR myDNA <- vcfR2DNAbin(vcfR_test) seg.sites(myDNA) image(myDNA) # ref.seq, no start.pos myDNA <- vcfR2DNAbin(vcfR_test, ref.seq = myRef) seg.sites(myDNA) image(myDNA) # ref.seq, start.pos = 4. # Note that this is the same as the previous example but the variants are shifted. myDNA <- vcfR2DNAbin(vcfR_test, ref.seq = myRef, start.pos = 4) seg.sites(myDNA) image(myDNA) # ref.seq, no start.pos, unphased_as_NA = FALSE myDNA <- vcfR2DNAbin(vcfR_test, unphased_as_NA = FALSE, ref.seq = myRef) seg.sites(myDNA) image(myDNA)
library(ape) data(vcfR_test) # Create an example reference sequence. nucs <- c('a','c','g','t') set.seed(9) myRef <- as.DNAbin(matrix(nucs[round(runif(n=20, min=0.5, max=4.5))], nrow=1)) # Recode the POS data for a smaller example. set.seed(99) vcfR_test@fix[,'POS'] <- sort(sample(10:20, size=length(getPOS(vcfR_test)))) # Just vcfR myDNA <- vcfR2DNAbin(vcfR_test) seg.sites(myDNA) image(myDNA) # ref.seq, no start.pos myDNA <- vcfR2DNAbin(vcfR_test, ref.seq = myRef) seg.sites(myDNA) image(myDNA) # ref.seq, start.pos = 4. # Note that this is the same as the previous example but the variants are shifted. myDNA <- vcfR2DNAbin(vcfR_test, ref.seq = myRef, start.pos = 4) seg.sites(myDNA) image(myDNA) # ref.seq, no start.pos, unphased_as_NA = FALSE myDNA <- vcfR2DNAbin(vcfR_test, unphased_as_NA = FALSE, ref.seq = myRef) seg.sites(myDNA) image(myDNA)
Converts a vcfR object to hapmap
vcfR2hapmap(vcf)
vcfR2hapmap(vcf)
vcf |
a vcfR object. |
Converts a vcfR object to a hapmap format.
a data.frame that can be used as an input for GAPIT.
Brian J. Knaus
data(vcfR_test) myHapMap <- vcfR2hapmap(vcfR_test) class(myHapMap) ## Not run: # Example of how to create a (GAPIT compliant) HapMap file. write.table(myHapMap, file = "myHapMap.hmp.txt", sep = "\t", row.names = FALSE, col.names = FALSE) ## End(Not run)
data(vcfR_test) myHapMap <- vcfR2hapmap(vcfR_test) class(myHapMap) ## Not run: # Example of how to create a (GAPIT compliant) HapMap file. write.table(myHapMap, file = "myHapMap.hmp.txt", sep = "\t", row.names = FALSE, col.names = FALSE) ## End(Not run)
The function converts a vcfR object to a text format that can be used as an infile for MigrateN.
vcfR2migrate( vcf, pop, in_pop, out_file = "MigrateN_infile.txt", method = c("N", "H") )
vcfR2migrate( vcf, pop, in_pop, out_file = "MigrateN_infile.txt", method = c("N", "H") )
vcf |
a vcfR object. |
pop |
factor indicating population membership for each sample. |
in_pop |
vector of population names indicating which population to include in migrate output file. |
out_file |
name of output file. |
method |
should 'N' or 'H' format data be generated? |
This function converts a vcfR object to a text file which can be used as input for MigrateN. The function will remove loci with missing data, indels, and loci that are not bialleleic (loci with more than two alleles). Thus, only SNP data analysed where the length of each locus (inmutational steps) is 1 (as opposed to microsatellites or indels).
The output file should contain Unix line endings ("\n"). Note that opening the output file in a Windows text editor (just to validate number of markers, individuals or populations) might change the end of line character (eol) to a Windows line ending ("\r\n"). This may produce an error running migrate-n. Because these are typically non-printing characters, this may be a difficult problem to troubleshoot. The easiest way to circumvent the problem is to transfer the output file to Unix machine and view it there. If you do introduce Windows line endings you can convert them back to Unix with a program such as 'dos2unix' or 'fromdos' to change the line endings.
a text file that can be used as an input for MigrateN software (SNP format).
Shankar Shakya and Brian J. Knaus
Migrate-N website.
## Not run: data(vcfR_example) my_pop <- as.factor(paste("pop_", rep(c("A", "B", "C"), each = 6), sep = "")) vcfR2migrate(vcf = vcf , pop = my_pop , in_pop = c("pop_A","pop_C"), out_file = "my2pop.txt", method = 'H') ## End(Not run)
## Not run: data(vcfR_example) my_pop <- as.factor(paste("pop_", rep(c("A", "B", "C"), each = 6), sep = "")) vcfR2migrate(vcf = vcf , pop = my_pop , in_pop = c("pop_A","pop_C"), out_file = "my2pop.txt", method = 'H') ## End(Not run)
Example data to use with unit tests.
A vcfR object
vep vcfR object
Output from the VEP may include values with multiple equals signs. This does not appear to conform with the VCF specification (at the time of writing this VCF v4.3). But it appears fairly easy to accomodate. This example data can be used to make unit tests to validate functionality.
data(vep) vcfR2tidy(vep, info_only = TRUE)$fix
data(vep) vcfR2tidy(vep, info_only = TRUE)$fix
Create windows of non-overlapping data and summarize.
NM2winNM(x, pos, maxbp, winsize = 100L, depr = TRUE) z.score(x) windowize.NM(x, pos, starts, ends, summary = "mean", depr = TRUE)
NM2winNM(x, pos, maxbp, winsize = 100L, depr = TRUE) z.score(x) windowize.NM(x, pos, starts, ends, summary = "mean", depr = TRUE)
x |
A NumericMatrix |
pos |
A vector of chromosomal positions for each row of data (variants) |
maxbp |
Length of chromosome |
winsize |
Size (in bp) for windows |
depr |
logical (T/F), this function has been deprecated, set to FALSE to override. |
starts |
integer vector of starting positions for windows |
ends |
integer vector of ending positions for windows |
summary |
string indicating type of summary (mean, median, sum) |
The numeric matrix where samples are in columns and variant data are in rows.
The windowing process therefore occurs along columns of data.
This matrix could be created with extract.gt
.
The chromosome is expected to contain positions 1 though maxbp. If maxbp is not specified this can be inferred from the last element in pos.
Generate fasta format output
write.fasta( x, file = "", rowlength = 80, tolower = TRUE, verbose = TRUE, APPEND = FALSE, depr = TRUE )
write.fasta( x, file = "", rowlength = 80, tolower = TRUE, verbose = TRUE, APPEND = FALSE, depr = TRUE )
x |
object of class chromR |
file |
name for output file |
rowlength |
number of characters each row should not exceed |
tolower |
convert all characters to lowercase (T/F) |
verbose |
should verbose output be generated (T/F) |
APPEND |
should data be appended to an existing file (T/F) |
depr |
logical (T/F), this function has been deprecated, set to FALSE to override. |
The function write_fasta takes an object of class chromR and writes it to a fasta.gz (gzipped text) format file. The sequence in the seq slot of the chromR object is used to fill in the invariant sites. The parameter 'tolower', when set to TRUE, converts all the characters in teh sequence to lower case. This is important because some software, such as ape::DNAbin, requires sequences to be in lower case.
Write summary tables from chromR objects.
write.var.info(x, file = "", mask = FALSE, APPEND = FALSE) write.win.info(x, file = "", APPEND = FALSE)
write.var.info(x, file = "", mask = FALSE, APPEND = FALSE) write.win.info(x, file = "", APPEND = FALSE)
x |
An object of class chromR |
file |
A filename for the output file |
mask |
logical vector indicating rows to use |
APPEND |
logical indicating whether to append to existing file (omitting the header) or write a new file |
The function write.var.info takes the variant information table from a chromR object and writes it as a comma delimited file.
The function write.win.info takes the window information table from a chromR object and writes it as a comma delimited file.