Title: | Searching for Footprints of Selection using 'Extended Haplotype Homozygosity' Based Tests |
---|---|
Description: | Population genetic data such as 'Single Nucleotide Polymorphisms' (SNPs) is often used to identify genomic regions that have been under recent natural or artificial selection and might provide clues about the molecular mechanisms of adaptation. One approach, the concept of an 'Extended Haplotype Homozygosity' (EHH), introduced by (Sabeti 2002) <doi:10.1038/nature01140>, has given rise to several statistics designed for whole genome scans. The package provides functions to compute three of these, namely: 'iHS' (Voight 2006) <doi:10.1371/journal.pbio.0040072> for detecting positive or 'Darwinian' selection within a single population as well as 'Rsb' (Tang 2007) <doi:10.1371/journal.pbio.0050171> and 'XP-EHH' (Sabeti 2007) <doi:10.1038/nature06250>, targeted at differential selection between two populations. Various plotting functions are included to facilitate visualization and interpretation of these statistics. |
Authors: | Alexander Klassmann [aut, cre], Mathieu Gautier [aut], Renaud Vitalis [aut] |
Maintainer: | Alexander Klassmann <[email protected]> |
License: | GPL (>= 2) |
Version: | 3.2.2 |
Built: | 2024-12-02 06:50:39 UTC |
Source: | CRAN |
Population genetic data such as 'Single Nucleotide Polymorphisms' (SNPs) is often used to identify genomic regions that have been under recent natural or artificial selection and might provide clues about the molecular mechanisms of adaptation. One approach, the concept of an 'Extended Haplotype Homozygosity' (EHH), introduced by (Sabeti 2002) <doi:10.1038/nature01140>, has given rise to several statistics designed for whole genome scans. The package provides functions to compute three of these, namely: 'iHS' (Voight 2006) <doi:10.1371/journal.pbio.0040072> for detecting positive or 'Darwinian' selection within a single population as well as 'Rsb' (Tang 2007) <doi:10.1371/journal.pbio.0050171> and 'XP-EHH' (Sabeti 2007) <doi:10.1038/nature06250>, targeted at differential selection between two populations. Various plotting functions are included to facilitate visualization and interpretation of these statistics.
See vignette("rehh", package = "rehh")
for an overview of the package and
vignette("examples", package = "rehh")
for a more detailed discussion of two small example
data sets.
Maintainer: Alexander Klassmann [email protected]
Authors:
Mathieu Gautier [email protected]
Renaud Vitalis
Gautier, M. and Naves, M. (2011). Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Molecular Ecology, 20, 3128-3143.
Gautier M. and Vitalis R. (2012). rehh: An R package to detect footprints of selection in genome-wide SNP data from haplotype structure. Bioinformatics, 28(8), 1176-1177.
Gautier M., Klassmann A., and Vitalis R. (2017). rehh 2.0: a reimplementation of the R package rehh to detect positive selection from haplotype structure. Molecular Ecology Resources, 17, 78-90.
Klassmann, A. and Gautier, M. (2020). Detecting selection using Extended Haplotype Homozygosity-based statistics on unphased or unpolarized data (preprint). https://doi.org/10.22541/au.160405572.29972398/v1
Sabeti, P.C. et al. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature, 419, 832-837.
Sabeti, P.C. et al. (2007). Genome-wide detection and characterization of positive selection in human populations. Nature, 449, 913-918.
Tang, K. and Thornton, K.R. and Stoneking, M. (2007). A New Approach for Using Genome Scans to Detect Recent Positive Selection in the Human Genome. Plos Biology, 7, e171.
Voight, B.F. and Kudaravalli, S. and Wen, X. and Pritchard, J.K. (2006). A map of recent positive selection in the human genome. Plos Biology, 4, e72.
Useful links:
Report bugs at https://gitlab.com/oneoverx/rehh/-/issues
An S4 class containing the furcation trees for both sides of a focal marker for one allele.
allele
the allele of the focal marker.
description
"ancestral", "derived", "major", "minor", etc.
count
the number of chromosomes with that allele.
left
furcation tree to the left of the marker.
right
furcation tree to the right of the marker.
Convert a furcation tree into Newick format.
as.newick(furcation, allele = 0, side, hap.names = seq_len(furcation@nhap))
as.newick(furcation, allele = 0, side, hap.names = seq_len(furcation@nhap))
furcation |
an object of |
allele |
the allele to be considered (default 0). |
side |
side (either |
hap.names |
names/labels of chromosomes in haplotype data file. Per default haplotypes are numbered by their order in the input file. |
ftree-class
, calc_furcation
,
plot.furcation
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #calculate furcation for the marker "F1205400" #which displays a strong signal of selection f <- calc_furcation(haplohh_cgu_bta12, mrk = "F1205400") #get left tree of ancestral allele (coded as '0') as.newick(f, 0, "left")
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #calculate furcation for the marker "F1205400" #which displays a strong signal of selection f <- calc_furcation(haplohh_cgu_bta12, mrk = "F1205400") #get left tree of ancestral allele (coded as '0') as.newick(f, 0, "left")
Determine candidate regions of selection.
calc_candidate_regions( scan, threshold = NA, pval = FALSE, ignore_sign = FALSE, window_size = 1e+06, overlap = 0, right = TRUE, min_n_mrk = 1, min_n_extr_mrk = 1, min_perc_extr_mrk = 0, join_neighbors = TRUE )
calc_candidate_regions( scan, threshold = NA, pval = FALSE, ignore_sign = FALSE, window_size = 1e+06, overlap = 0, right = TRUE, min_n_mrk = 1, min_n_extr_mrk = 1, min_perc_extr_mrk = 0, join_neighbors = TRUE )
scan |
a data frame containing scores (output of |
threshold |
boundary score above which markers are defined as "extreme". |
pval |
logical. If |
ignore_sign |
logical. If |
window_size |
size of sliding windows. If set to 1, no windows are constructed and only the individual extremal markers are reported. |
overlap |
size of window overlap (default 0, i.e. no overlap). |
right |
logical, indicating if the windows should be closed on the right (and open on the left) or vice versa. |
min_n_mrk |
minimum number of markers per window. |
min_n_extr_mrk |
minimum number of markers with extreme value in a window. |
min_perc_extr_mrk |
minimum percentage of extremal markers among all markers. |
join_neighbors |
logical. If |
There is no generally agreed method how to determine genomic regions which might have been under recent selection. Since selection tends to yield clusters of markers with outlier values, a common approach is to search for regions with an elevated number or fraction of outlier or extremal markers. This function allows to set three conditions a window must fulfill in order to classify as candidate region:
min_n_mrk
a minimum number of (any) markers.
min_n_extr_mrk
a minimum number of markers with outlier / extreme value.
min_perc_extr_mrk
a minimum percentage of extremal markers among all markers.
"Extreme" markers are defined by having a score above the specified threshold
.
A data frame with chromosomal regions, i.e. windows that fulfill the necessary conditions to qualify as candidate regions under selection. For each region the overall number of markers, their mean and maximum, the number of markers with extremal values, their percentage of all markers and their average are reported.
Compute Extended Haplotype Homozygosity (EHH) and integrated EHH (iHH) for a given focal marker.
calc_ehh( haplohh, mrk, limhaplo = 2, limhomohaplo = 2, limehh = 0.05, include_zero_values = FALSE, include_nhaplo = FALSE, phased = TRUE, polarized = TRUE, scalegap = NA, maxgap = NA, discard_integration_at_border = TRUE, lower_y_bound = limehh, interpolate = TRUE )
calc_ehh( haplohh, mrk, limhaplo = 2, limhomohaplo = 2, limehh = 0.05, include_zero_values = FALSE, include_nhaplo = FALSE, phased = TRUE, polarized = TRUE, scalegap = NA, maxgap = NA, discard_integration_at_border = TRUE, lower_y_bound = limehh, interpolate = TRUE )
haplohh |
an object of class |
mrk |
integer representing the number of the focal marker within the haplohh object or string representing its ID/name. |
limhaplo |
if there are less than |
limhomohaplo |
if there are less than |
limehh |
limit at which EHH stops to be evaluated |
include_zero_values |
logical. If |
include_nhaplo |
logical. If |
phased |
logical. If |
polarized |
logical. |
scalegap |
scale or cap gaps larger than the specified size to the specified size (default= |
maxgap |
maximum allowed gap in bp between two markers. If exceeded, further calculation of EHH is stopped at the gap
(default= |
discard_integration_at_border |
logical. If |
lower_y_bound |
lower y boundary of the area to be integrated over (default: |
interpolate |
logical. Affects only IHH values. If |
Values for allele-specific Extended Haplotype Homozygosity (EHH) are computed upstream and downstream of the focal marker for each of its alleles. These values are integrated with respect to their genomic positions to yield an 'integrated EHH' (iHH) value for each allele.
The returned value is a list containing the following elements:
The name/identifier of the focal marker.
A vector with the frequencies of the alleles of the focal marker.
A data frame with EHH values for each allele of the focal marker.
A vector with iHH (integrated EHH) values for each allele of the focal marker.
Gautier, M. and Naves, M. (2011). Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Molecular Ecology, 20, 3128-3143.
Klassmann, A. and Gautier, M. (2020). Detecting selection using Extended Haplotype Homozygosity-based statistics on unphased or unpolarized data (preprint). https://doi.org/10.22541/au.160405572.29972398/v1
Sabeti, P.C. et al. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature, 419, 832-837.
Sabeti, P.C. et al. (2007). Genome-wide detection and characterization of positive selection in human populations. Nature, 449, 913-918.
Tang, K. and Thornton, K.R. and Stoneking, M. (2007). A New Approach for Using Genome Scans to Detect Recent Positive Selection in the Human Genome. Plos Biology, 7, e171.
Voight, B.F. and Kudaravalli, S. and Wen, X. and Pritchard, J.K. (2006). A map of recent positive selection in the human genome. Plos Biology, 4, e72.
data2haplohh
, plot.ehh
, calc_ehhs
, scan_hh
.
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #computing EHH statistics for the marker "F1205400" #which displays a strong signal of selection ehh <- calc_ehh(haplohh_cgu_bta12, mrk = "F1205400")
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #computing EHH statistics for the marker "F1205400" #which displays a strong signal of selection ehh <- calc_ehh(haplohh_cgu_bta12, mrk = "F1205400")
Compute site-specific Extended Haplotype Homozygosity (EHHS) and integrated EHHS (iES) for a given focal marker.
calc_ehhs( haplohh, mrk, limhaplo = 2, limhomohaplo = 2, limehhs = 0.05, include_zero_values = FALSE, include_nhaplo = FALSE, phased = TRUE, scalegap = NA, maxgap = NA, discard_integration_at_border = TRUE, lower_y_bound = limehhs, interpolate = TRUE )
calc_ehhs( haplohh, mrk, limhaplo = 2, limhomohaplo = 2, limehhs = 0.05, include_zero_values = FALSE, include_nhaplo = FALSE, phased = TRUE, scalegap = NA, maxgap = NA, discard_integration_at_border = TRUE, lower_y_bound = limehhs, interpolate = TRUE )
haplohh |
an object of class |
mrk |
integer representing the number of the focal marker within the haplohh object or string representing its ID/name. |
limhaplo |
if there are less than |
limhomohaplo |
if there are less than |
limehhs |
limit at which EHHS stops to be evaluated. |
include_zero_values |
logical. If |
include_nhaplo |
logical. If |
phased |
logical. If |
scalegap |
scale or cap gaps larger than the specified size to the specified size (default= |
maxgap |
maximum allowed gap in bp between two markers. If exceeded, further calculation of EHHS is stopped at the gap
(default= |
discard_integration_at_border |
logical. If |
lower_y_bound |
lower y boundary of the area to be integrated over (default: |
interpolate |
logical. Affects only IES and INES values. If |
Values for site-specific Extended Haplotype Homozygosity (EHHS) are computed at each position upstream and downstream of the focal marker. These values are integrated with respect to their genomic position to yield an 'integrated EHHS' (iES) value.
The returned value is a list containing the following elements:
The name/identifier of the focal marker.
A table containing EHHS values as used by Sabeti et al. (2007), resp. the same values normalized to 1 at the focal marker (nEHHS) as used by Tang et al. (2007).
Integrated EHHS.
Integrated normalized EHHS.
Gautier, M. and Naves, M. (2011). Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Molecular Ecology, 20, 3128-3143.
Klassmann, A. and Gautier, M. (2020). Detecting selection using Extended Haplotype Homozygosity-based statistics on unphased or unpolarized data (preprint). https://doi.org/10.22541/au.160405572.29972398/v1
Sabeti, P.C. et al. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature, 419, 832-837.
Sabeti, P.C. et al. (2007). Genome-wide detection and characterization of positive selection in human populations. Nature, 449, 913-918.
Tang, K. and Thornton, K.R. and Stoneking, M. (2007). A New Approach for Using Genome Scans to Detect Recent Positive Selection in the Human Genome. Plos Biology, 7, e171.
Voight, B.F. and Kudaravalli, S. and Wen, X. and Pritchard, J.K. (2006). A map of recent positive selection in the human genome. Plos Biology, 4, e72.
data2haplohh
, plot.ehhs
, calc_ehh
, scan_hh
.
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #computing EHHS statistics for the marker "F1205400" #which displays a strong signal of selection ehhs <- calc_ehhs(haplohh_cgu_bta12, mrk = "F1205400")
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #computing EHHS statistics for the marker "F1205400" #which displays a strong signal of selection ehhs <- calc_ehhs(haplohh_cgu_bta12, mrk = "F1205400")
Calculate furcation trees around a focal marker. A furcation tree captures in greater detail than EHH values the decrease of extended haplotype homozygosity at increasing distances from the selected focal marker.
calc_furcation( haplohh, mrk, allele = NA, limhaplo = 2, phased = TRUE, polarized = TRUE )
calc_furcation( haplohh, mrk, allele = NA, limhaplo = 2, phased = TRUE, polarized = TRUE )
haplohh |
an object of class haplohh (see |
mrk |
integer representing the number of the focal marker within the haplohh object or string representing its ID/name. |
allele |
a vector of alleles as coded internally, i.e. in case of polarized alleles,
0 represents the ancestral, 1 or higher the derived alleles.
If |
limhaplo |
if there are less than |
phased |
logical. If |
polarized |
logical. Affects only the order of furcations. If |
A haplotype furcation tree visualizes the breakdown of LD at increasing distances from the focal marker. The root of each tree is an allele of the focal marker, which in turn is identified by a vertical dashed line. Moving either to the "left" or to the "right" of the focal marker, each further marker is an opportunity for a node; the tree either divides or does not, based on whether alleles at that marker distinguish between hitherto identical extended haplotypes. The thickness of the lines corresponds to the number of chromosomes sharing an extended haplotype.
An object of class furcation, containing the furcation structure of the specified alleles at the focal marker.
Sabeti, P.C. and Reich, D.E. and Higgins, J.M. and Levine, H.Z.P and Richter, D.J. and Schaffner, S.F. and Gabriel, S.B. and Platko, J.V. and Patterson, N.J. and McDonald, G.J. and Ackerman, H.C. and Campbell, S.J. and Altshuler, D. and Cooper, R. and Kwiatkowski, D. and Ward, R. and Lander, E.S. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature, 419, 832-837.
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #plotting a furcation diagram for both ancestral and derived allele #from the marker "F1205400" #which display a strong signal of selection f <- calc_furcation(haplohh_cgu_bta12, mrk = "F1205400") plot(f)
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #plotting a furcation diagram for both ancestral and derived allele #from the marker "F1205400" #which display a strong signal of selection f <- calc_furcation(haplohh_cgu_bta12, mrk = "F1205400") plot(f)
Calculate for each chromosome the maximum length of its extended haplotype homozygosity.
calc_haplen(furcation)
calc_haplen(furcation)
furcation |
an object of class |
Extended haplotype homozygosity is defined as the region around a focal marker in which a particular chromosome shares a haplotype with (its sequence is identical to) another chromosome. The function calculates for each chromosome the boundaries of its longest shared haplotype. These correspond to the last furcations of a chromsome in a furcation diagram. Note that the calculation is performed independently upstream and downstream of the focal marker and hence upper and lower boundaries do not necessarily arise from the same chromosomal pair.
The functions returns a list containing four elements:
name/identifier of the focal marker.
position of the focal marker.
positions of left- and rightmost markers covered by extended haplotypes.
a data frame with the coordinates of extended haplotypes around the focal marker.
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #plotting haplotype lengths for both ancestral and derived allele #of the marker "F1205400" #which displays a strong signal of selection f <- calc_furcation(haplohh_cgu_bta12, mrk = "F1205400") h <- calc_haplen(f) plot(h)
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #plotting haplotype lengths for both ancestral and derived allele #of the marker "F1205400" #which displays a strong signal of selection f <- calc_furcation(haplohh_cgu_bta12, mrk = "F1205400") h <- calc_haplen(f) plot(h)
Calculate pairwise shared haplotype length between all chromosomes at a focal marker.
calc_pairwise_haplen( haplohh, mrk, phased = TRUE, maxgap = NA, max_extend = NA, side = "both" )
calc_pairwise_haplen( haplohh, mrk, phased = TRUE, maxgap = NA, max_extend = NA, side = "both" )
haplohh |
an object of class |
mrk |
integer representing the number of the focal marker within the haplohh object or string representing its ID/name. |
phased |
logical. If |
maxgap |
maximum allowed gap in bp between two markers. If exceeded, further calculation is stopped at the gap
(default= |
max_extend |
maximum distance in bp to extend shared haplotypes away from the focal marker.
(default |
side |
side to consider, either "left" (positions lower than focal position), "right" (positions higher than focal position) or "both" (default). |
The function computes the length of shared haplotypes (stretches of identical sequence) around the focal marker.
Note that the function calc_haplen
calculates for each chromosome
the boundaries of its longest shared haplotype; separately upstream and downstream of
the focal marker.
The returned value is a matrix with pairwise shared haplotype lengths.
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #computing shared haplotype lengths around the marker "F1205400" #which displays a strong signal of selection m <- calc_pairwise_haplen(haplohh_cgu_bta12, mrk = "F1205400")
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #computing shared haplotype lengths around the marker "F1205400" #which displays a strong signal of selection m <- calc_pairwise_haplen(haplohh_cgu_bta12, mrk = "F1205400")
Calculate score statistics (extremal values) for given regions. This function is intended for the comparison of different scores for the same chromosomal regions.
calc_region_stats( scan, regions, threshold = NA, pval = FALSE, ignore_sign = FALSE, right = TRUE )
calc_region_stats( scan, regions, threshold = NA, pval = FALSE, ignore_sign = FALSE, right = TRUE )
scan |
a data frame containing scores (output of |
regions |
a data frame with column names |
threshold |
boundary score above which markers are defined as "extreme". |
pval |
logical. If |
ignore_sign |
logical. If |
right |
logical, indicating if the regions should be closed on the right (and open on the left) or vice versa. |
A data frame with chromosomal regions. For each region the overall number of markers, their mean and maximum, the number of markers with extremal values, their percentage of all markers and their average are reported.
Calculate site frequency spectrum (SFS) tests Tajima's D, Fay & Wu's H and Zeng's E.
calc_sfs_tests( haplohh, polarized = TRUE, window_size = NA, overlap = 0, right = TRUE, min_n_mrk = 1, verbose = TRUE )
calc_sfs_tests( haplohh, polarized = TRUE, window_size = NA, overlap = 0, right = TRUE, min_n_mrk = 1, verbose = TRUE )
haplohh |
an object of class |
polarized |
logical. |
window_size |
size of sliding windows. If |
overlap |
size of window overlap (default 0, i.e. no overlap). |
right |
logical, indicating if the windows should be closed on the right and open on the left (default) or vice versa. |
min_n_mrk |
minimum number of (polymorphic) markers per window. |
verbose |
logical. |
Neutrality tests based on the site frequency spectrum (SFS) are largely unrelated to EHH-based methods. The tests provided here are implemented elsewhere, too (e.g. in package PopGenome).
Each test compares two estimations of the scaled mutation rate theta, which all have the same expected value under neutrality. Deviations from zero indicate violations of the neutral null model, typically population size changes, population subdivision or selection. Tajima's D and Fay & Wu's H become negative in presence of an almost completed sweep, Zeng's E becomes positive for some time after it. Significance can typically be assigned only by simulations.
The standard definition of the tests cannot cope with missing values and typically markers with missing genotypes must be discarded. Ferretti (2012) provides an extension that can handle missing values (without discarding any non-missing values). In this package, only the first moments (the theta-estimators themselves) are adapted accordingly, but not the second moments (their variances), because the latter is computationally demanding and the resulting bias relatively small. It is recommended, though, to discard markers or haplotypes with more than 20% missing values.
Multi-allelic markers are always removed since the tests rely on the "infinite sites model" which implies that all polymorphic markers are bi-allelic. Monomorphic markers can be present, but are irrelevant for the tests.
A data frame with window coordinates, the number of contained (polymorphic) markers, Watterson's, Tajima's and Zeng's estimators of theta and the test statistics of Tajima's D, Fay & Wu's H and Zeng's E.
Watterson, G.A. (1975). On the number of segregating sites in genetical models without recombination. Theoretical Population Biology 7(2) 256-276.
Tajima, F. (1983). Evolutionary relationship of DNA sequences in finite populations. Genetics 105(2) 437-60.
Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123(3) 585-95.
Fay, J. and Wu, C. (2000). Hitchhiking under positive Darwinian selection. Genetics 155(3) 1405-13.
Zeng, E. et al. (2006). Statistical tests for detecting positive selection by utilizing high-frequency variants. Genetics 174(3) 1431-9.
Ferretti, L. and Raineri, E. and Ramos-Onsins, S. (2012). Neutrality tests for sequences with missing data. Genetics 191(4) 1397-401.
make.example.files() # neutral evolution hh <- data2haplohh("example_neutral.vcf", verbose = FALSE) calc_sfs_tests(hh) # strong selective sweep hh <- data2haplohh("example_sweep.vcf", verbose = FALSE) calc_sfs_tests(hh) remove.example.files()
make.example.files() # neutral evolution hh <- data2haplohh("example_neutral.vcf", verbose = FALSE) calc_sfs_tests(hh) # strong selective sweep hh <- data2haplohh("example_sweep.vcf", verbose = FALSE) calc_sfs_tests(hh) remove.example.files()
Convert input data files to an object of haplohh-class
.
data2haplohh( hap_file, map_file = NA, min_perc_geno.hap = NA, min_perc_geno.mrk = 100, min_maf = NA, chr.name = NA, popsel = NA, recode.allele = FALSE, allele_coding = "12", haplotype.in.columns = FALSE, remove_multiple_markers = FALSE, polarize_vcf = TRUE, capitalize_AA = TRUE, vcf_reader = "data.table", position_scaling_factor = NA, verbose = TRUE )
data2haplohh( hap_file, map_file = NA, min_perc_geno.hap = NA, min_perc_geno.mrk = 100, min_maf = NA, chr.name = NA, popsel = NA, recode.allele = FALSE, allele_coding = "12", haplotype.in.columns = FALSE, remove_multiple_markers = FALSE, polarize_vcf = TRUE, capitalize_AA = TRUE, vcf_reader = "data.table", position_scaling_factor = NA, verbose = TRUE )
hap_file |
file containing haplotype data (see details below). |
map_file |
file containing map information (see details below). |
min_perc_geno.hap |
threshold on percentage of missing data for haplotypes
(haplotypes with less than |
min_perc_geno.mrk |
threshold on percentage of missing data for markers (markers genotyped on less than
|
min_maf |
threshold on the Minor Allele Frequency. Markers having a MAF lower than or equal to minmaf are discarded.
In case of multi-allelic markers the second-most frequent allele is referred to as minor allele.
Setting this value to zero eliminates monomorphic sites. Default is |
chr.name |
name of the chromosome considered (relevant if data for several chromosomes is contained in the haplotype or map file). |
popsel |
code of the population considered (relevant for fastPHASE output which can contain haplotypes from various populations). |
recode.allele |
*Deprecated*. logical. |
allele_coding |
the allele coding provided by the user. Either |
haplotype.in.columns |
logical. If |
remove_multiple_markers |
logical. If |
polarize_vcf |
logical. Only of relevance for vcf files. If |
capitalize_AA |
logical. Only of relevance for vcf files with ancestral allele information.
Low confidence ancestral alleles are usually coded by lower-case letters. If |
vcf_reader |
library used to read vcf. By default, low-level parsing is
performed using the generic package |
position_scaling_factor |
intended primarily for output of ms where positions lie in the interval [0,1]. These can be rescaled to sizes of typical markers in real data. |
verbose |
logical. If |
Five haplotype input formats are supported:
a "standard format" with haplotypes in rows and markers in columns (with no header, but a haplotype ID/name in the first column).
a "transposed format" similar to the one produced by the phasing program SHAPEIT2 (O'Connell et al., 2014) in which haplotypes are in columns and markers in rows (with neither header nor marker IDs nor haplotype IDs).
output files from the fastPHASE program (Sheet and Stephens, 2006).
If haplotypes from several different population were phased simultaneously (-u fastPHASE option
was used), it is necessary to specify the population of interest by parameter popsel
(if this parameter is not or wrongly set, the error message will provide a list of
the population numbers contained in the file).
files in variant call format (vcf). No mapfile is needed is this case. If
the file contains several chromosomes, it is necessary to choose one by parameter
chr.name
.
output of the simulation program 'ms'. No mapfile is needed in this case. If the file
contains several 'runs', a specific number has to be specified by the
parameter chr.name
.
The "transposed format" has to be explicitly set while the other formats are recognized automatically.
The map file contains marker information in three, or, if it is used for polarization (see below), five columns:
marker name/id
chromosome
position (physical or genetic)
ancestral allele encoding
derived allele encoding
The markers must be in the same order as in the haplotype file. If
several chromosomes are represented in the map file, it is necessary to choose that
which corresponds to the haplotype file by parameter chr.name
.
Haplotypes can be given either with alleles already coded as numbers (in two possible ways)
or with the actual alleles (e.g. nucleotides) which can be translated into numbers
either using the fourth and fifth column of the map file or by their alpha-numeric order.
Correspondingly, the parameter allele_coding
has to be set to either "12"
,
"01"
, "map"
or "none"
:
"12"
: 0 represents missing values, 1 the ancestral allele
and 2 (or higher integers) derived allele(s).
"01"
: NA
or '.' (a point) represent missing values, 0 the
ancestral and 1 (or higher integers) derived allele(s).
"map"
: for each marker, the fourth column of the map file
defines the ancestral allele and the fifth column derived alleles.
In case of multiple derived alleles, they must be separated by commas without space.
Alleles in the haplotype file which do not appear in neither of the two columns
of the map file are regarded as missing values (NA
).
"none"
: NA
or '.' (a point) represent missing values, otherwise for each
marker the allele that comes first in alpha-numeric
order is coded by 0, the next by 1, etc. Evidently, this coding does not convey
any information about allele status as ancestral or derived, hence the alleles
cannot be regarded as polarized.
The information of allelic ancestry is exploited only in the frequency-bin-wise
standardization of iHS (see ihh2ihs
). However, although ancestry status does
not figure in the formulas of the cross populations statistics
Rsb and XP-EHH, their values do depend on the assigned status.
The arguments min_perc_geno.hap
,
min_perc_geno.mrk
and min_maf
are evaluated in this order.
The returned value is an object of haplohh-class
.
Scheet P, Stephens M (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet, 78, 629-644.
O'Connell J, Gurdasani D, Delaneau O, et al (2014) A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet, 10, e1004234.
#copy example files into the current working directory. make.example.files() #create object using a haplotype file in "standard format" hap <- data2haplohh(hap_file = "bta12_cgu.hap", map_file = "map.inp", chr.name = 12, allele_coding = "map") #create object using fastPHASE output hap <- data2haplohh(hap_file = "bta12_hapguess_switch.out", map_file = "map.inp", chr.name = 12, popsel = 7, allele_coding = "map") #clean up demo files remove.example.files()
#copy example files into the current working directory. make.example.files() #create object using a haplotype file in "standard format" hap <- data2haplohh(hap_file = "bta12_cgu.hap", map_file = "map.inp", chr.name = 12, allele_coding = "map") #create object using fastPHASE output hap <- data2haplohh(hap_file = "bta12_hapguess_switch.out", map_file = "map.inp", chr.name = 12, popsel = 7, allele_coding = "map") #clean up demo files remove.example.files()
Plot the observed distribution of standardized iHS, Rsb or XP-EHH values together with the standard Gaussian distribution.
distribplot( data, lty = 1, lwd = 1.5, col = c("blue", "red"), qqplot = FALSE, resolution = 0.01, ... )
distribplot( data, lty = 1, lwd = 1.5, col = c("blue", "red"), qqplot = FALSE, resolution = 0.01, ... )
data |
a vector of iHS, Rsb or XPEHH values. |
lty |
line type. |
lwd |
line width. |
col |
a vector describing the colors of the observed and Gaussian distribution, respectively. |
qqplot |
logical. If |
resolution |
affects only qqplot. Rasterize data points to a quadratic grid with the specified resolution and remove duplicate points. Defaults to 0.01. |
... |
further arguments passed to |
The function returns a plot.
ihh2ihs
, ines2rsb
, ies2xpehh
, manhattanplot
.
library(rehh.data) #results from a genome scan (44,057 SNPs) see ?wgscan.cgu for details data(wgscan.cgu) #extract vector with iHS values from data frame IHS <- ihh2ihs(wgscan.cgu)$ihs[["IHS"]] distribplot(IHS, main = "iHS (CGU population)") distribplot(IHS, main = "iHS (CGU population)", qqplot = TRUE)
library(rehh.data) #results from a genome scan (44,057 SNPs) see ?wgscan.cgu for details data(wgscan.cgu) #extract vector with iHS values from data frame IHS <- ihh2ihs(wgscan.cgu)$ihs[["IHS"]] distribplot(IHS, main = "iHS (CGU population)") distribplot(IHS, main = "iHS (CGU population)", qqplot = TRUE)
Extract regions from a scan data frame.
extract_regions(scan, regions, right = TRUE)
extract_regions(scan, regions, right = TRUE)
scan |
A data frame with chromosomal positions like obtained
by |
regions |
A data frame with genomic regions like the output of |
right |
logical, indicating if the intervals should be closed on the right (and open on the left) or vice versa. |
A subset of data frame scan
, retaining only positions belonging to
the regions specified in data frame regions
.
library(rehh.data) data(wgscan.cgu) regions <- data.frame(CHR = 12, START = 2.88e+7, END = 2.92e+7) extract_regions(wgscan.cgu, regions)
library(rehh.data) data(wgscan.cgu) regions <- data.frame(CHR = 12, START = 2.88e+7, END = 2.92e+7) extract_regions(wgscan.cgu, regions)
Plot of unstandardized iHS within frequency bins.
freqbinplot( x, spectrum = FALSE, main = NA, xlab = "Derived allele frequency", ylab = NA, xlim = c(0, 1), ylim = NULL, pch = 20, ... )
freqbinplot( x, spectrum = FALSE, main = NA, xlab = "Derived allele frequency", ylab = NA, xlim = c(0, 1), ylim = NULL, pch = 20, ... )
x |
data (output of function |
spectrum |
logical. If |
main |
an overall title for the plot. |
xlab |
a title for the x axis. |
ylab |
a title for the y axis. |
xlim |
the x coordinate range of the plot. |
ylim |
the y coordinate range of the plot. |
pch |
plotting 'character' see |
... |
further arguments to be passed to plot resp. points. |
The plot shows the mean and the quantiles calculated by
function ihh2ihs
for the unstandardized iHS in each frequency bin.
Note that the standardization of iHS is performed bin-wise in order
to reduce the frequency-dependence of
iHS values (expected under neutrality).
An implicit assumption of this procedure is that each bin is dominated
by neutral markers.
library(rehh.data) data(wgscan.cgu) #results from a genome scan (44,057 SNPs) #see ?wgscan.eut and ?wgscan.cgu for details wgscan.cgu.ihs <- ihh2ihs(wgscan.cgu) freqbinplot(wgscan.cgu.ihs)
library(rehh.data) data(wgscan.cgu) #results from a genome scan (44,057 SNPs) #see ?wgscan.eut and ?wgscan.cgu for details wgscan.cgu.ihs <- ihh2ihs(wgscan.cgu) freqbinplot(wgscan.cgu.ihs)
An S4 class to represent a furcation tree on one side of one allele of a focal marker
A furcation structure consists of two trees ("left" and "right") for each allele of a focal marker. If there are only bi-allelic markers and no missing values, the trees are bifurcating.
Missing values are treated similarly to an extra allele in so far as they cause a furcation. However, the resulting daughter node is marked accordingly and the chromosomes excluded from further calculations. If all chromosomes of a parent node have missing values, the "furcation" is degenerated and yields a single daughter node.
Note that a tree with n leaves can have at most 2n-1 nodes.
In a furcation tree, the leaves do not necessarily represent single chromosomes, either due to multiple missing data or because the first/last marker was reached before all extended haplotypes were distinct.
node_parent
a vector, representing the tree structure. Each node (number) is assigned its parent node (number).
node_pos
a vector, assigning to each node (number) its position in the chromosome, i.e. at which marker position the furcation occurred.
node_with_missing_data
a vector of type logical
.
Pseudo-furcations arise due to missing data at a marker.
The daughter node (number) is marked accordingly.
label_parent
a vector, that attaches an "extra leave", representing the haplotype number (defined by the order in the haplotype data file) to leaves of the tree. This is necessary because in general not all leaves of the original tree represent a single haplotype/chromosome.
An S4 class representing the complete furcation pattern around a focal marker.
.Data
a list containing for each allele an object of allelefurcation-class
.
mrk.name
the name/identifier of the focal marker.
position
the chromosomal position of the focal marker.
xlim
the range of marker positions.
nhap
the number of haplotypes in the sample.
# copy example files into working directory make.example.files() # read first example file hh <- data2haplohh(hap_file = "example1.hap", map_file = "example1.map", allele_coding = "01") # remove example files remove.example.files() # calculate furcation structure around marker "rs6" f <- calc_furcation(hh, mrk = "rs6") # extract left side tree of ancestral allele (which is coded by '0') f[['0']]@left # the tree consists of seven nodes, '1' being the root node # nodes 2 and 3 have the root node as parent, etc. # the first chromosome is attached as a label node to node 7, etc. # For comparison, a plot of the complete furcation structure: plot(f)
# copy example files into working directory make.example.files() # read first example file hh <- data2haplohh(hap_file = "example1.hap", map_file = "example1.map", allele_coding = "01") # remove example files remove.example.files() # calculate furcation structure around marker "rs6" f <- calc_furcation(hh, mrk = "rs6") # extract left side tree of ancestral allele (which is coded by '0') f[['0']]@left # the tree consists of seven nodes, '1' being the root node # nodes 2 and 3 have the root node as parent, etc. # the first chromosome is attached as a label node to node 7, etc. # For comparison, a plot of the complete furcation structure: plot(f)
haplohh
objectThe object contains haplotype data for 140 cattle individuals (280 haplotypes) belonging to the Creole breed from Guadeloupe (CGU) and 1424 markers (mapping to chromosome BTA12).
data(haplohh_cgu_bta12)
data(haplohh_cgu_bta12)
An object of haplohh-class
.
Gautier, M. and Naves, M. (2011). Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Molecular Ecology, 20, 3128-3143.
An object of this class contains the information needed for computation of EHH based statistics.
## S4 method for signature 'haplohh' chr.name(x) ## S4 method for signature 'haplohh' positions(x) ## S4 method for signature 'haplohh' haplo(x) ## S4 method for signature 'haplohh' nmrk(x) ## S4 method for signature 'haplohh' mrk.names(x) ## S4 method for signature 'haplohh' nhap(x) ## S4 method for signature 'haplohh' hap.names(x)
## S4 method for signature 'haplohh' chr.name(x) ## S4 method for signature 'haplohh' positions(x) ## S4 method for signature 'haplohh' haplo(x) ## S4 method for signature 'haplohh' nmrk(x) ## S4 method for signature 'haplohh' mrk.names(x) ## S4 method for signature 'haplohh' nhap(x) ## S4 method for signature 'haplohh' hap.names(x)
x |
an object of this class. |
This class is the basis for all calculations done by this package.
Note that the matrix in slot haplo
has to be of type integer
, not numeric
.
Objects built by versions of rehh up to 2.0.4 coded this matrix as numeric
and used
a different coding scheme. They can be converted e.g. by
haplohh <- update_haplohh(old_haplohh)
in order be used
with the present version.
chr.name
name of the chromosome/scaffold to which the markers belong.
positions
vector of type numeric
containing the marker positions within the chromosome.
haplo
matrix of type integer
containing haplotypes in rows and markers in columns.
showClass("haplohh")
showClass("haplohh")
haplohh-class
into SweepFinder formatExtract allele frequencies of an object of class haplohh-class
and returns a table in SweepFinder input format.
haplohh2sweepfinder(haplohh, polarized = TRUE, verbose = TRUE)
haplohh2sweepfinder(haplohh, polarized = TRUE, verbose = TRUE)
haplohh |
object of class |
polarized |
logical. If |
verbose |
logical. If |
SweepFinder and SweeD are two stand-alone programs which
implement the same method to detect selective sweeps using the
allele frequency at each site. This function calculates these frequencies
from a haplohh-class
and returns a table which
can be saved into a file (with tabs as separators, without row names and quotes) that can
be used as input for the two programs.
Sites with less than two haplotypes genotyped or with more than two alleles are removed.
If polarized
, sites monomorphic for the ancestral allele are removed, too.
A dataframe with four columns:
position marker position
x (absolute) frequency of the alternative (derived) variant
n number of non-missing genotypes
folded a flag marking polarization
DeGiorgio, M., and, Huber, CD and Hubisz, MJ and, Hellmann, I. and Nielsen, R. (2016) SweepFinder2: increased robustness and flexibility. Bioinformatics 32:1895-1897
Pavlidis, P., D. Zivkovic, A. Stamatakis, and N. Alachiotis, (2013) SweeD: likelihood-based detection of selective sweeps in thousands of genomes. Molecular Biology and Evolution 30: 2224-34.
#example # sweepfinder example from vignette make.example.files() hh <- data2haplohh("example_sweep_with_recombination.vcf") haplohh2sweepfinder(hh) remove.example.files()
#example # sweepfinder example from vignette make.example.files() hh <- data2haplohh("example_sweep_with_recombination.vcf") haplohh2sweepfinder(hh) remove.example.files()
Compute XP-EHH (standardized ratio of iES of two populations).
ies2xpehh( scan_pop1, scan_pop2, popname1 = NA, popname2 = NA, min_nhaplo = NA, standardize = TRUE, include_freq = FALSE, p.side = NA, p.adjust.method = "none", verbose = TRUE )
ies2xpehh( scan_pop1, scan_pop2, popname1 = NA, popname2 = NA, min_nhaplo = NA, standardize = TRUE, include_freq = FALSE, p.side = NA, p.adjust.method = "none", verbose = TRUE )
scan_pop1 |
a data frame with markers in rows and columns with chromosome name, position of the
marker, frequency of the ancestral allele and iES as obtained by |
scan_pop2 |
a data frame with markers in rows and columns with chromosome name, position of the
marker, frequency of the ancestral allele and iES as obtained by |
popname1 |
short ID/name of the first population; to be added to an output column name. |
popname2 |
short ID/name of the second population; to be added to an output column name. |
min_nhaplo |
discard positions where in at least one of the populations fewer than |
standardize |
logical. If |
include_freq |
logical. If |
p.side |
side to which refers the p-value. Default |
p.adjust.method |
method passed to function |
verbose |
logical. If |
Log ratio of iES (population 1 over population 2) computed as described in Sabeti et al. (2007). Note that the two data frames are merged on the basis of chromosome and position. Marker names are kept, if they are identical and unique in both data frames.
Since the standardized XP-EHH values follow, if markers evolve predominantly neutrally, approximately
a standard Gaussian distribution, it is practical to assign to the values a p-value relative
to the null-hypothesis of neutral evolution. The parameter p.side
determines
if the p-value is assigned to both sides of the distribution or to one side of interest.
The returned value is a data frame with markers in rows and columns for chromosome name, marker position, XP-EHH and, if standardized, p-value in a negative log10 scale. Optionally, allele frequencies are included.
Gautier, M. and Naves, M. (2011). Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Molecular Ecology, 20, 3128-3143.
Sabeti, P.C. et al. (2007). Genome-wide detection and characterization of positive selection in human populations. Nature, 449, 913-918.
scan_hh
, distribplot
, manhattanplot
library(rehh.data) data(wgscan.cgu) ; data(wgscan.eut) ## results from a genome scan (44,057 SNPs) ##see ?wgscan.eut and ?wgscan.cgu for details wgscan.xpehh <- ies2xpehh(wgscan.cgu, wgscan.eut, "CGU", "EUT")
library(rehh.data) data(wgscan.cgu) ; data(wgscan.eut) ## results from a genome scan (44,057 SNPs) ##see ?wgscan.eut and ?wgscan.cgu for details wgscan.xpehh <- ies2xpehh(wgscan.cgu, wgscan.eut, "CGU", "EUT")
Compute iHS (standardized ratio of iHH values of two alleles).
ihh2ihs( scan, freqbin = 0.025, min_maf = 0.05, min_nhaplo = NA, standardize = TRUE, include_freq = FALSE, right = FALSE, alpha = 0.05, p.side = NA, p.adjust.method = "none", verbose = TRUE )
ihh2ihs( scan, freqbin = 0.025, min_maf = 0.05, min_nhaplo = NA, standardize = TRUE, include_freq = FALSE, right = FALSE, alpha = 0.05, p.side = NA, p.adjust.method = "none", verbose = TRUE )
scan |
a data frame with chromosome name,
marker position, frequency of ancestral (resp. major) allele, frequency of derived (resp. minor)
allele, and iHH for both alleles, as obtained from function |
freqbin |
size of the bins to standardize log(iHH_A/iHH_D). Markers are binned with
respect to the derived allele frequency at the focal marker. The bins are built from
|
min_maf |
focal markers with a MAF (Minor Allele Frequency) lower than or equal to |
min_nhaplo |
focal markers with least one of the two compared alleles carried by fewer
than |
standardize |
logical. If |
include_freq |
logical. If |
right |
logical. If |
alpha |
calculate quantiles |
p.side |
side to which refers the p-value. Default |
p.adjust.method |
method passed to function |
verbose |
logical. If |
Computes log ratio of iHH of two focal alleles as described in Voight et al. (2006). The standardization is performed within each bins separately because of the frequency-dependence of expected iHS values under neutrality. An implicit assumption of this approach is that each bin is dominated by neutral markers.
Since the standardized iHS values follow, if markers evolve predominantly neutrally, approximately
a standard Gaussian distribution, it is practical to assign to the values a p-value relative
to the null-hypothesis of neutral evolution. The parameter p.side
determines
if the p-value is assigned to both sides of the distribution or to one side of interest.
The returned value is a list containing two elements
a data frame with markers in rows and the columns for chromosome name, marker position, iHS and, if standardized, p-value in a negative log10 scale. Optionally, allele frequencies are included.
a data frame with bins in rows and columns for the number of markers, mean uniHS, standard deviation uniHS, lower quantile uniHS, upper quantile uniHS.
Gautier, M. and Naves, M. (2011). Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Molecular Ecology, 20, 3128-3143.
Voight, B.F. and Kudaravalli, S. and Wen, X. and Pritchard, J.K. (2006). A map of recent positive selection in the human genome. Plos Biology, 4, e72.
scan_hh
, distribplot
, freqbinplot
, manhattanplot
library(rehh.data) data(wgscan.cgu) #results from a genome scan (44,057 SNPs) #see ?wgscan.eut and ?wgscan.cgu for details wgscan.cgu.ihs <- ihh2ihs(wgscan.cgu)
library(rehh.data) data(wgscan.cgu) #results from a genome scan (44,057 SNPs) #see ?wgscan.eut and ?wgscan.cgu for details wgscan.cgu.ihs <- ihh2ihs(wgscan.cgu)
Compute Rsb (standardized ratio of inES of two populations).
ines2rsb( scan_pop1, scan_pop2, popname1 = NA, popname2 = NA, min_nhaplo = NA, standardize = TRUE, include_freq = FALSE, p.side = NA, p.adjust.method = "none", verbose = TRUE )
ines2rsb( scan_pop1, scan_pop2, popname1 = NA, popname2 = NA, min_nhaplo = NA, standardize = TRUE, include_freq = FALSE, p.side = NA, p.adjust.method = "none", verbose = TRUE )
scan_pop1 |
a data frame with markers in rows and columns with chromosome name, position of the
marker, frequency of the ancestral allele and inES as obtained by |
scan_pop2 |
a data frame with markers in rows and columns with chromosome name, position of the
marker, frequency of the ancestral allele and inES as obtained by |
popname1 |
short ID/name of the first population; to be added to an output column name. |
popname2 |
short ID/name of the second population; to be added to an output column name. |
min_nhaplo |
discard positions where in at least one of the populations fewer than |
standardize |
logical. If |
include_freq |
logical. If |
p.side |
side to which refers the p-value. Default |
p.adjust.method |
method passed to function |
verbose |
logical. If |
Log ratio of inES (population 1 over population 2) computed as described in Tang et al. (2007). Note that the two data frames are merged on the basis of chromosome and position. Marker names are kept, if they are identical and unique in both data frames.
Since the standardized Rsb values follow, if markers evolve predominantly neutrally, approximately
a standard Gaussian distribution, it is practical to assign to the values a p-value relative
to the null-hypothesis of neutral evolution. The parameter p.side
determines
if the p-value is assigned to both sides of the distribution or to one side of interest.
The returned value is a data frame with markers in rows and columns for chromosome name, marker position, Rsb and, if standardized, p-value in a negative log10 scale. Optionally, allele frequencies are included.
Gautier, M. and Naves, M. (2011). Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Molecular Ecology, 20, 3128-3143.
Tang, K. and Thornton, K.R. and Stoneking, M. (2007). A New Approach for Using Genome Scans to Detect Recent Positive Selection in the Human Genome. Plos Biology, 7, e171.
scan_hh
, distribplot
, manhattanplot
library(rehh.data) data(wgscan.cgu) ; data(wgscan.eut) ## results from a genome scan (44,057 SNPs) ##see ?wgscan.eut and ?wgscan.cgu for details wgscan.rsb <- ines2rsb(wgscan.cgu, wgscan.eut, "CGU", "EUT")
library(rehh.data) data(wgscan.cgu) ; data(wgscan.eut) ## results from a genome scan (44,057 SNPs) ##see ?wgscan.eut and ?wgscan.cgu for details wgscan.rsb <- ines2rsb(wgscan.cgu, wgscan.eut, "CGU", "EUT")
This function copies the following example files to the current working directory:
example1.hap
"example 1" haplotype file in "standard format"
example1.map
"example 1" marker information file
example1.vcf
"example 1" as vcf file
example2.hap
"example 2" haplotype file in "standard format"
example2.map
"example 2" marker information file
example2.vcf
"example 2" as vcf file
example_neutral.vcf
"example neutral evolution" as vcf file
example_sweep.vcf
"example for a selective sweep (without recombination)"
example_sweep_with_recombination.vcf
"example for a selective sweep with recombination
ms.out output
from a small simulation by the program 'ms'
bta12_cgu.hap
an haplotype file in "standard format"
bta12_cgu.thap
an haplotype file in "transposed format"
bta12_hapguess_switch.out
an haplotype file in fastphase output format
map.inp
a marker information file for all bta_cgu markers
Example 1 was used in (Gautier 2017) to explain the various EHH derived statistics calculated by this package. Example 2 is an extension containing multi-allelic markers and missing values.
Examples for neutral data and sweeps are discussed in a supplement of Klassmann (2020).
The bta12 files contain data for 280 haplotypes, originating from 140 individuals belonging to the Creole cattle breed from Guadeloupe, at 1.424 markers mapping to bovine chromosome 12 (BTA12) (Gautier 2011).
make.example.files()
make.example.files()
Gautier, M. and Naves, M. (2011). Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Molecular Ecology, 20, 3128-3143.
Gautier, M., Klassmann, A. and Vitalis, R. (2017). rehh 2.0: a reimplementation of the R package rehh to detect positive selection from haplotype structure. Molecular Ecology Resources, 17, 78-90.
Klassmann, A. and Gautier, M. (2020). Detecting selection using Extended Haplotype Homozygosity-based statistics on unphased or unpolarized data (preprint). https://doi.org/10.22541/au.160405572.29972398/v1
data2haplohh
, remove.example.files
Manhattanplot of iHS, XP-EHH or Rsb over a genome.
manhattanplot( data, pval = FALSE, threshold = c(-2, 2), chr.name = NA, cr = NULL, cr.col = "gray", cr.opacity = 0.5, cr.lab.cex = 0.6, cr.lab.offset = 0, cr.lab.pos = "top", mrk = NULL, mrk.cex = 1, mrk.col = "gray", mrk.pch = 1, mrk.lab.cex = 0.4, mrk.lab.pos = 4, ignore_sign = FALSE, cex = 0.5, las = 1, pch = 20, inset = 5e+06, resolution = NULL, ... )
manhattanplot( data, pval = FALSE, threshold = c(-2, 2), chr.name = NA, cr = NULL, cr.col = "gray", cr.opacity = 0.5, cr.lab.cex = 0.6, cr.lab.offset = 0, cr.lab.pos = "top", mrk = NULL, mrk.cex = 1, mrk.col = "gray", mrk.pch = 1, mrk.lab.cex = 0.4, mrk.lab.pos = 4, ignore_sign = FALSE, cex = 0.5, las = 1, pch = 20, inset = 5e+06, resolution = NULL, ... )
data |
|
pval |
logical. If |
threshold |
a horizontal line is added at the corresponding value(s), for instance to represent a significance threshold. A single value (upper or lower threshold) or two values (upper and lower) can be specified. |
chr.name |
if |
cr |
highlight "candidate regions" specified by a data.frame with columns |
cr.col |
the color for highlighting |
cr.opacity |
a value between 0 (invisible) and 1 (opaque). |
cr.lab.cex |
text size of candidate region labels. |
cr.lab.offset |
offset of candidate region labels. |
cr.lab.pos |
if |
mrk |
highlight marker specified by a data.frame containing the
colums |
mrk.cex |
size of marker label. |
mrk.col |
color of the highlighted points. |
mrk.pch |
type of the highlighted points. |
mrk.lab.cex |
text size of marker label. If zero, no labels are printed. |
mrk.lab.pos |
a position specifier for the text. Values of 1, 2, 3 and 4, respectively indicate positions below, to the left of, above and to the right of the highlighted marker. |
ignore_sign |
logical. If |
cex |
size of the points representing markers in the plot(s) (see |
las |
orientation of axis labels (see |
pch |
type of the points representing markers in the plot(s) (see |
inset |
inset (in bases) between chromosomes to avoid overlap of data points. Default: 5,000,000 bases. |
resolution |
Rasterize data points to the specified resolution and remove
duplicate points. Defaults to NULL, i.e. no rasterization. A typical value might be |
... |
further arguments to be passed to |
The color of chromosomes is taken from the "Graphics Palette", see palette
.
If a single chromosome is plotted, a genomic region can be specified by
argument xlim
.
Other statistics can be plotted as well, although a warning is issued. They must be given by a data.frame with columns CHR and POSITION and the statistic in the third column.
The function returns a plot.
ihh2ihs
, ies2xpehh
, ines2rsb
, calc_candidate_regions
.
library(rehh.data) data(wgscan.cgu) ## results from a genome scan (44,057 SNPs) ## see ?wgscan.eut and ?wgscan.cgu for details wgscan.ihs <- ihh2ihs(wgscan.cgu) manhattanplot(wgscan.ihs)
library(rehh.data) data(wgscan.cgu) ## results from a genome scan (44,057 SNPs) ## see ?wgscan.eut and ?wgscan.cgu for details wgscan.ihs <- ihh2ihs(wgscan.cgu) manhattanplot(wgscan.ihs)
Plot curve of EHH values around a focal marker.
## S3 method for class 'ehh' plot( x, ylim = c(0, 1), type = "l", main = paste0("EHH around '", x$mrk.name, "'"), xlab = "Position", ylab = "Extended Haplotype Homozygosity", col = c("blue", "red", "violet", "orange"), mrk.col = "gray", bty = "n", legend = NA, legend.xy.coords = "automatic", ... )
## S3 method for class 'ehh' plot( x, ylim = c(0, 1), type = "l", main = paste0("EHH around '", x$mrk.name, "'"), xlab = "Position", ylab = "Extended Haplotype Homozygosity", col = c("blue", "red", "violet", "orange"), mrk.col = "gray", bty = "n", legend = NA, legend.xy.coords = "automatic", ... )
x |
data (output of |
ylim |
the y limits of the plot |
type |
plot type (see codeplot.default and |
main |
title for the plot (default |
xlab |
title for the x-axis. |
ylab |
title for the y-axis. |
col |
color for the ancestral and derived alleles (respectively) curves. |
mrk.col |
color of the vertical line at the focal marker position. |
bty |
box type around plot (see |
legend |
legend text. |
legend.xy.coords |
if |
... |
further arguments to be passed to function |
data2haplohh
, calc_ehh
, plot.ehhs
, scan_hh
.
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #computing EHH statistics for the marker "F1205400" #which displays a strong signal of selection ehh <- calc_ehh(haplohh_cgu_bta12, mrk = "F1205400") plot(ehh)
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #computing EHH statistics for the marker "F1205400" #which displays a strong signal of selection ehh <- calc_ehh(haplohh_cgu_bta12, mrk = "F1205400") plot(ehh)
Plot curve of EHHS values around a focal marker.
## S3 method for class 'ehhs' plot( x, nehhs = FALSE, ylim = c(0, 1), type = "l", main = paste0("EHHS around '", x$mrk.name, "'"), xlab = "Position", ylab = "Extended Haplotype Homozygosity per Site", bty = "n", mrk.col = "gray", ... )
## S3 method for class 'ehhs' plot( x, nehhs = FALSE, ylim = c(0, 1), type = "l", main = paste0("EHHS around '", x$mrk.name, "'"), xlab = "Position", ylab = "Extended Haplotype Homozygosity per Site", bty = "n", mrk.col = "gray", ... )
x |
data (output of |
nehhs |
logical. If |
ylim |
the y limits of the plot |
type |
plot type (see codeplot.default.
Type |
main |
title for the plot (default |
xlab |
title for the x-axis. |
ylab |
title for the y-axis. |
bty |
box type around plot (see |
mrk.col |
color of the vertical line at the focal marker position. |
... |
further arguments to be passed to function |
data2haplohh
, plot.ehh
, calc_ehhs
, scan_hh
.
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #computing EHHS statisitics for the marker "F1205400" #which displays a strong signal of selection ehhs <- calc_ehhs(haplohh_cgu_bta12, mrk = "F1205400") plot(ehhs)
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #computing EHHS statisitics for the marker "F1205400" #which displays a strong signal of selection ehhs <- calc_ehhs(haplohh_cgu_bta12, mrk = "F1205400") plot(ehhs)
Plots furcation trees around a focal marker
## S3 method for class 'furcation' plot( x, allele = NA, col = c("blue", "red", "violet", "orange"), mrk.col = "gray", lwd = 0.1, hap.names = NULL, cex.lab = 1, family.lab = "", offset.lab = 0.5, legend = NA, legend.xy.coords = "automatic", ... )
## S3 method for class 'furcation' plot( x, allele = NA, col = c("blue", "red", "violet", "orange"), mrk.col = "gray", lwd = 0.1, hap.names = NULL, cex.lab = 1, family.lab = "", offset.lab = 0.5, legend = NA, legend.xy.coords = "automatic", ... )
x |
an object of class furcation (see |
allele |
If |
col |
color for each allele (as coded internally). |
mrk.col |
color of the vertical line at the focal marker position. |
lwd |
controls the relative width of the diagram lines on the plot (default 0.1). |
hap.names |
a vector containing names of chromosomes. |
cex.lab |
relative size of labels. See |
family.lab |
font family for labels. See |
offset.lab |
offset of labels. See |
legend |
legend text. |
legend.xy.coords |
if |
... |
other arguments to be passed to |
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #plotting furcation diagram for both ancestral and derived allele #from the marker "F1205400" #which display a strong signal of selection f <- calc_furcation(haplohh_cgu_bta12, mrk = "F1205400") plot(f) plot(f, xlim = c(2e+07,3.5e+07)) plot(f, xlim = c(2.7e+07,3.1e+07)) plot(f, xlim = c(2.7e+07,3.1e+07), hap.names = hap.names(haplohh_cgu_bta12), cex.lab=0.3)
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #plotting furcation diagram for both ancestral and derived allele #from the marker "F1205400" #which display a strong signal of selection f <- calc_furcation(haplohh_cgu_bta12, mrk = "F1205400") plot(f) plot(f, xlim = c(2e+07,3.5e+07)) plot(f, xlim = c(2.7e+07,3.1e+07)) plot(f, xlim = c(2.7e+07,3.1e+07), hap.names = hap.names(haplohh_cgu_bta12), cex.lab=0.3)
Plot the length of extended haplotype around a focal marker.
## S3 method for class 'haplen' plot( x, allele = NA, group_by_allele = TRUE, order_by_length = FALSE, col = c("blue", "red", "violet", "orange"), mrk.col = "gray", lwd = 1, hap.names = NULL, cex.lab = 1, family.lab = "", offset.lab = 0.5, pos.lab = "left", legend = NA, legend.xy.coords = "automatic", ... )
## S3 method for class 'haplen' plot( x, allele = NA, group_by_allele = TRUE, order_by_length = FALSE, col = c("blue", "red", "violet", "orange"), mrk.col = "gray", lwd = 1, hap.names = NULL, cex.lab = 1, family.lab = "", offset.lab = 0.5, pos.lab = "left", legend = NA, legend.xy.coords = "automatic", ... )
x |
an object of class |
allele |
if |
group_by_allele |
logical. If |
order_by_length |
if |
col |
color for each allele (as coded internally). |
mrk.col |
color of the vertical line at the focal marker position. |
lwd |
line width. |
hap.names |
a vector containing the names of chromosomes. |
cex.lab |
relative letter size of labels. See |
family.lab |
font family for labels. See |
offset.lab |
offset of labels. See |
pos.lab |
position of haplotype labels. Either |
legend |
legend text. |
legend.xy.coords |
if |
... |
other parameters to be passed to |
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #plotting length of extended haplotypes for both ancestral and derived allele #of the marker "F1205400" #which displays a strong signal of selection f <- calc_furcation(haplohh_cgu_bta12, mrk = "F1205400") h <- calc_haplen(f) plot(h) plot(h, hap.names = hap.names(haplohh_cgu_bta12), cex.lab = 0.3)
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #plotting length of extended haplotypes for both ancestral and derived allele #of the marker "F1205400" #which displays a strong signal of selection f <- calc_furcation(haplohh_cgu_bta12, mrk = "F1205400") h <- calc_haplen(f) plot(h) plot(h, hap.names = hap.names(haplohh_cgu_bta12), cex.lab = 0.3)
Plot the variants of a haplohh object. This method is intended for visualization of very small data sets such as the examples provided by the package.
## S3 method for class 'haplohh' plot( x, mrk = NA, allele = NA, group_by_allele = FALSE, ignore.distances = FALSE, col = c("blue", "red", "violet", "orange"), linecol = "gray", mrk.col = "gray", pch = 19, cex = 1, lwd = 1, hap.names = NULL, mrk.names = NULL, cex.lab.hap = 0.8, cex.lab.mrk = 0.8, family.lab = "", offset.lab.hap = 0.5, offset.lab.mrk = 0.25, pos.lab.hap = "left", pos.lab.mrk = "top", srt.hap = 0, srt.mrk = 0, highlight.mrk = NULL, highlight.mrk.col = c("lightgray", "black", "darkgray"), ... )
## S3 method for class 'haplohh' plot( x, mrk = NA, allele = NA, group_by_allele = FALSE, ignore.distances = FALSE, col = c("blue", "red", "violet", "orange"), linecol = "gray", mrk.col = "gray", pch = 19, cex = 1, lwd = 1, hap.names = NULL, mrk.names = NULL, cex.lab.hap = 0.8, cex.lab.mrk = 0.8, family.lab = "", offset.lab.hap = 0.5, offset.lab.mrk = 0.25, pos.lab.hap = "left", pos.lab.mrk = "top", srt.hap = 0, srt.mrk = 0, highlight.mrk = NULL, highlight.mrk.col = c("lightgray", "black", "darkgray"), ... )
x |
an object of class |
mrk |
the focal marker. Used only, if alleles are grouped or (de-)selected. |
allele |
if |
group_by_allele |
logical. If |
ignore.distances |
logical. If |
col |
color for each allele (as coded internally). |
linecol |
the color of the background lines. If more than one color is specified and sequences sorted by the marker allele, the specified colors are used to distinguish the alleles; otherwise consecutive sequences are set into the specified colors. |
mrk.col |
color of the vertical line at the focal marker position. |
pch |
symbol used for markers. See |
cex |
relative size of marker symbol. See |
lwd |
line width. |
hap.names |
a vector containing the names of chromosomes. |
mrk.names |
a vector containing the names of markers. |
cex.lab.hap |
relative letter size of haplotype labels. See |
cex.lab.mrk |
relative letter size of marker labels. See |
family.lab |
font family for labels. See |
offset.lab.hap |
offset of haplotype labels. See |
offset.lab.mrk |
offset of marker labels. See |
pos.lab.hap |
position of haplotype labels. Either |
pos.lab.mrk |
position of marker labels. Either |
srt.hap |
rotation of haplotype labels (see |
srt.mrk |
rotation of marker labels (see |
highlight.mrk |
vector of markers to be highlighted |
highlight.mrk.col |
color for each allele (as coded internally) at highlighted markers. |
... |
other parameters to be passed to |
Specifying a haplohh-object with more than 4096 haplotypes or markers produces an error.
#example haplohh object make.example.files() hh <- data2haplohh(hap_file = "example1.hap", map_file = "example1.map", allele_coding = "01") plot(hh) hh <- data2haplohh(hap_file = "example2.hap", map_file = "example2.map", allele_coding = "01", min_perc_geno.mrk = 50) plot(hh) remove.example.files()
#example haplohh object make.example.files() hh <- data2haplohh(hap_file = "example1.hap", map_file = "example1.map", allele_coding = "01") plot(hh) hh <- data2haplohh(hap_file = "example2.hap", map_file = "example2.map", allele_coding = "01", min_perc_geno.mrk = 50) plot(hh) remove.example.files()
Remove example files from current working directory.
remove.example.files()
remove.example.files()
Removes the files created by make.example.files()
.
No error is thrown, if files do not exist.
Compute integrated EHH (iHH), integrated EHHS (iES) and integrated normalized EHHS (inES) for all markers of a chromosome (or linkage group).
scan_hh( haplohh, limhaplo = 2, limhomohaplo = 2, limehh = 0.05, limehhs = 0.05, phased = TRUE, polarized = TRUE, scalegap = NA, maxgap = NA, discard_integration_at_border = TRUE, lower_ehh_y_bound = limehh, lower_ehhs_y_bound = limehhs, interpolate = TRUE, threads = 1 )
scan_hh( haplohh, limhaplo = 2, limhomohaplo = 2, limehh = 0.05, limehhs = 0.05, phased = TRUE, polarized = TRUE, scalegap = NA, maxgap = NA, discard_integration_at_border = TRUE, lower_ehh_y_bound = limehh, lower_ehhs_y_bound = limehhs, interpolate = TRUE, threads = 1 )
haplohh |
an object of class |
limhaplo |
if there are less than |
limhomohaplo |
if there are less than |
limehh |
limit at which EHH stops to be evaluated. |
limehhs |
limit at which EHHS stops to be evaluated. |
phased |
logical. If |
polarized |
logical. |
scalegap |
scale or cap gaps larger than the specified size to the specified size (default= |
maxgap |
maximum allowed gap in bp between two markers. If exceeded, further calculation of EHH(S) is stopped at the gap
(default= |
discard_integration_at_border |
logical. If |
lower_ehh_y_bound |
lower y boundary of the area to be integrated over (default: |
lower_ehhs_y_bound |
lower y boundary of the area to be integrated (default: |
interpolate |
logical. If |
threads |
number of threads to parallelize computation |
Integrated EHH (iHH), integrated EHHS (iES) and integrated normalized EHHS (inES)
are computed for all markers of the chromosome (or linkage group). This function is several
times faster as a procedure calling in turn calc_ehh
and calc_ehhs
for all markers. To perform a whole genome-scan this function needs
to be called for each chromosome and results concatenated.
Note that setting limehh
or limehhs
to zero is likely to reduce power,
since even under neutrality a tiny fraction (<<0.05) of extremely long shared haplotypes is expected
which, if fully accounted for, would obfuscate the signal at selected sites.
The returned value is a dataframe with markers in rows and the following columns
chromosome name
position in the chromosome
sample frequency of the ancestral / major allele
sample frequency of the second-most frequent remaining allele
number of evaluated haplotypes at the focal marker for the ancestral / major allele
number of evaluated haplotypes at the focal marker for the second-most frequent remaining allele
iHH of the ancestral / major allele
iHH of the second-most frequent remaining allele
iES (used by Sabeti et al 2007)
inES (used by Tang et al 2007)
Note that in case of unphased data the evaluation is restricted to haplotypes of homozygous individuals which reduces the power to detect selection, particularly for iHS (for appropriate parameter setting see the main vignette and Klassmann et al (2020)).
Gautier, M. and Naves, M. (2011). Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Molecular Ecology, 20, 3128-3143.
Klassmann, A. and Gautier, M. (2020). Detecting selection using Extended Haplotype Homozygosity-based statistics on unphased or unpolarized data (preprint). https://doi.org/10.22541/au.160405572.29972398/v1
Sabeti, P.C. et al. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature, 419, 832-837.
Sabeti, P.C. et al. (2007). Genome-wide detection and characterization of positive selection in human populations. Nature, 449, 913-918.
Tang, K. and Thornton, K.R. and Stoneking, M. (2007). A New Approach for Using Genome Scans to Detect Recent Positive Selection in the Human Genome. Plos Biology, 7, e171.
Voight, B.F. and Kudaravalli, S. and Wen, X. and Pritchard, J.K. (2006). A map of recent positive selection in the human genome. Plos Biology, 4, e72.
data2haplohh
, calc_ehh
, calc_ehhs
ihh2ihs
,ines2rsb
, ies2xpehh
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) scan <- scan_hh(haplohh_cgu_bta12)
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) scan <- scan_hh(haplohh_cgu_bta12)
Compute integrated EHH (iHH), integrated EHHS (iES) and integrated normalized EHHS (inES) for all markers of a chromosome (or linkage group).
This function computes the statistics by a slightly different algorithm than scan_hh
: it sidesteps the calculation of EHH and EHHS values and their subsequent integration and
consequently no cut-offs relying on these values can be specified. Instead,
it computes the (full) lengths of pairwise shared haplotypes and averages them afterwords.
This function is primarily intended for the study of general properties of these statistics using simulated data.
scan_hh_full( haplohh, phased = TRUE, polarized = TRUE, maxgap = NA, max_extend = NA, discard_integration_at_border = TRUE, geometric.mean = FALSE, threads = 1 )
scan_hh_full( haplohh, phased = TRUE, polarized = TRUE, maxgap = NA, max_extend = NA, discard_integration_at_border = TRUE, geometric.mean = FALSE, threads = 1 )
haplohh |
an object of class |
phased |
logical. If |
polarized |
logical. |
maxgap |
maximum allowed gap in bp between two markers. If exceeded, further calculation of EHH(S) is stopped at the gap
(default= |
max_extend |
maximum distance in bp to extend shared haplotypes away from the focal marker.
(default |
discard_integration_at_border |
logical. If |
geometric.mean |
logical. If |
threads |
number of threads to parallelize computation |
Integrated EHH (iHH), integrated EHHS (iES) and integrated normalized EHHS (inES)
are computed for all markers of the chromosome (or linkage group). This function sidesteps
the computation of EHH and EHHS values and their stepwise integration. Instead, the length of all shared haplotypes
is computed and afterwords averaged. In the absence of missing values the
statistics are identical to those calculated by scan_hh
with settings
limehh = 0
, limehhs = 0
, lower_ehh_y_bound = 0
and interpolate = FALSE
, yet this function is faster.
Application of a cut-off is necessary for reducing the spurious signals
of selection caused by single shared haplotypes of extreme length. Hence, e.g. for human experimental data
it might be reasonable to set max_extend
to 1 or 2 Mb.
scan_hh
computes the statistics iHH_A, ihh_D and iES/inES separately,
while this function calculates them simultaneously. Hence,
if discard_integration_at_border
is set to TRUE
and the extension of shared haplotypes
reaches a border (i.e. chromosomal boundaries or a gap larger than maxgap
), this function discards
all statistics.
The handling of missing values is different, too: scan_hh
"removes" chromosomes with missing values from further calculations.
EHH and EHHS are then calculated for the remaining chromosomes which can accidentally yield an increase in EHH or EHHS.
This can not happen with scan_hh_full()
which treats each missing value of a marker
as if it were a new allele - terminating any shared haplotype, but does changing the
set of considered chromosomes. Thus, missing values
cause a faster decay of EHH(S) with function scan_hh_full()
.
The returned value is a dataframe with markers in rows and the following columns
chromosome name
position in the chromosome
sample frequency of the ancestral / major allele
sample frequency of the second-most frequent remaining allele
number of evaluated haplotypes at the focal marker for the ancestral / major allele
number of evaluated haplotypes at the focal marker for the second-most frequent remaining allele
iHH of the ancestral / major allele
iHH of the second-most frequent remaining allele
iES (used by Sabeti et al 2007)
inES (used by Tang et al 2007)
Note that in case of unphased data the evaluation is restricted to haplotypes of homozygous individuals which reduces the power to detect selection, particularly for iHS (for appropriate parameter setting see the main vignette and Klassmann et al (2020)).
Gautier, M. and Naves, M. (2011). Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Molecular Ecology, 20, 3128-3143.
Klassmann, A. and Gautier, M. (2020). Detecting selection using Extended Haplotype Homozygosity-based statistics on unphased or unpolarized data (preprint). https://doi.org/10.22541/au.160405572.29972398/v1
Sabeti, P.C. et al. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature, 419, 832-837.
Sabeti, P.C. et al. (2007). Genome-wide detection and characterization of positive selection in human populations. Nature, 449, 913-918.
Tang, K. and Thornton, K.R. and Stoneking, M. (2007). A New Approach for Using Genome Scans to Detect Recent Positive Selection in the Human Genome. Plos Biology, 7, e171.
Voight, B.F. and Kudaravalli, S. and Wen, X. and Pritchard, J.K. (2006). A map of recent positive selection in the human genome. Plos Biology, 4, e72.
data2haplohh
, scan_hh
,
ihh2ihs
, ines2rsb
, ies2xpehh
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #using function scan_hh() with no cut-offs scan <- scan_hh(haplohh_cgu_bta12, discard_integration_at_border = FALSE, limehh = 0, limehhs = 0, lower_ehh_y_bound = 0, interpolate = FALSE) #using function scan_hh_full() scan_full <- scan_hh_full(haplohh_cgu_bta12, discard_integration_at_border = FALSE) #both yield identical results within numerical precision all.equal(scan, scan_full)
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #using function scan_hh() with no cut-offs scan <- scan_hh(haplohh_cgu_bta12, discard_integration_at_border = FALSE, limehh = 0, limehhs = 0, lower_ehh_y_bound = 0, interpolate = FALSE) #using function scan_hh_full() scan_full <- scan_hh_full(haplohh_cgu_bta12, discard_integration_at_border = FALSE) #both yield identical results within numerical precision all.equal(scan, scan_full)
haplohh-class
Subsets the data of an object of class haplohh-class
,
meeting certain conditions.
## S3 method for class 'haplohh' subset( x, select.hap = NULL, select.mrk = NULL, min_perc_geno.hap = NA, min_perc_geno.mrk = 100, min_maf = NA, max_alleles = NA, verbose = TRUE, ... )
## S3 method for class 'haplohh' subset( x, select.hap = NULL, select.mrk = NULL, min_perc_geno.hap = NA, min_perc_geno.mrk = 100, min_maf = NA, max_alleles = NA, verbose = TRUE, ... )
x |
object of class |
select.hap |
expression, indicating haplotypes to select. |
select.mrk |
expression, indicating markers to select. |
min_perc_geno.hap |
threshold on percentage of missing data for haplotypes
(haplotypes with less than |
min_perc_geno.mrk |
threshold on percentage of missing data for markers (markers genotyped on less than
|
min_maf |
threshold on the Minor Allele Frequency. Markers having a MAF lower than or equal to minmaf are discarded.
In case of multi-allelic markers the second-most frequent allele is referred to as minor allele.
Setting this value to zero eliminates monomorphic sites. Default is |
max_alleles |
threshold for the maximum number of different alleles at a site. Default is |
verbose |
logical. If |
... |
further arguments are ignored. |
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #select subset of first 10 hyplotypes and first 5 markers subset(haplohh_cgu_bta12, select.hap = 1:10, select.mrk = 1:5)
#example haplohh object (280 haplotypes, 1424 SNPs) #see ?haplohh_cgu_bta12 for details data(haplohh_cgu_bta12) #select subset of first 10 hyplotypes and first 5 markers subset(haplohh_cgu_bta12, select.hap = 1:10, select.mrk = 1:5)
Update object of class haplohh-class
constructed by rehh versions up to version 2.0.4.
update_haplohh(haplohh)
update_haplohh(haplohh)
haplohh |
an object of an old version of |
This function is intended to update haplohh
objects
that have been built by rehh versions up to 2.0.4. These objects cannot
be used in functions of the current version.
The following changes have been made to the class definition:
The internal representation of the haplotype matrix followed the encoding
0 missing value
1 ancestral allele
2 derived allele
and has been replaced by a vcf-like encoding:
NA
missing value
0 ancestral allele
1 derived allele.
Furthermore the slots nsnp
, snp.name
and nhap
have been removed
and slot position
renamed to positions
.
An update of an old haplohh
object is done as follows:
new_haplohh = update_haplohh(old_haplohh)
.