Despite a bewildering nomenclature, the idea of Extended Haplotype Homozygosity is simple. Consider the following alignment of nucleotide sequences where only bi-allelic sites have been retained:
The colored areas mark the maximal extension to which at least two sequences carrying the same focal allele are identical, i.e. homozygous to each other. The average length of all sequence-pairwise shared haplotypes yields the iHH scores for the two central alleles, respectively. The (unstandardized) iHS value is the log ratio of them. The statistics XP-EHH and Rsb are constructed in the same way with the two alleles replaced by two populations and while Rsb is normalized to 1 at the focal position, XP-EHH is not. That’s all!
This vignette analyses in great detail two small example data sets delivered with the package rehh (see main vignette). They have been constructed to ease comprehension of the relevant statistics and functionality of the package. The first example has been already discussed in (Gautier et al. 2017) while the second set is an extension to include multiple markers and missing values. The modifications for unphased or unpolarized data have been described in (Klassmann and Gautier 2020).
The pattern of variation seen in the sets and in the alignment above is intended to reflect an evolutionary scenario of an “on-going selective sweep” with one allele of the central marker experiencing strong selection.
The package has to be installed and then loaded by
Both example data sets are provided in two formats: as a pair of haplotype & map files and as a single file in variant call format (vcf). They are copied (together with other files) into the current working directory by the command
Data set 1 contains the hypothetical variation at a particular genetic locus in 8 sequences of 4 diploid individuals. Eleven markers, which might be imagined as SNPs, have two alleles, coded as ‘0’ and ‘1’, for ancestral and derived allele, respectively.
The file example1.hap
conforms to what in the main
vignette is referred to as “standard” haplotype format, i.e. chromosomes
are given in rows and each marker corresponds to a column. The first
column is reserved for identifiers of the individual haplotypes.
> HG1_1 0 0 1 1 0 0 0 0 0 1 1
> HG1_2 0 0 0 0 1 0 1 0 1 0 0
> HG2_1 0 1 0 1 0 0 1 0 0 0 1
> HG2_2 0 0 1 0 0 0 0 1 1 0 0
> HG3_1 0 0 0 0 0 1 0 0 0 0 1
> HG3_2 0 0 0 0 0 1 0 0 0 0 1
> HG4_1 1 0 0 0 0 1 0 0 0 0 0
> HG4_2 0 0 0 0 0 1 0 0 0 0 0
The calculation of the “integrated” statistics such as iHH
and iES requires a measure for the distance between markers,
which is provided by the file example1.map
. It relates the
markers with their chromosomal positions. The latter can represent
physical positions (base pairs) or genetic positions derived from
estimated recombination rates.
> rs1 chr1 10000 0 1
> rs2 chr1 20000 0 1
> rs3 chr1 30000 0 1
> rs4 chr1 40000 0 1
> rs5 chr1 50000 0 1
> rs6 chr1 60000 0 1
> rs7 chr1 70000 0 1
> rs8 chr1 80000 0 1
> rs9 chr1 90000 0 1
> rs10 chr1 100000 0 1
> rs11 chr1 110000 0 1
The file in variant call format combines the information of
haplotype and map files. However, vcf codes alleles with
respect to a reference sequence, not with respect to ancestry status.
Information about ancestry can be added using a key of the
INFO
field, conventionally named AA
. For
instance, in the file example1.vcf
, the reference alleles
of markers rs6
and rs11
differ from the
ancestral alleles.
> ##fileformat=VCFv4.2
> ##reference=NCBI38
> ##contig=<ID=chr1,length=120000>
> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG1 HG2 HG3 HG4
> chr1 10000 rs1 G T 100 PASS AA=G GT 0|0 0|0 0|0 1|0
> chr1 20000 rs2 G A 100 PASS AA=G GT 0|0 1|0 0|0 0|0
> chr1 30000 rs3 T C 100 PASS AA=T GT 1|0 0|1 0|0 0|0
> chr1 40000 rs4 A C 100 PASS AA=A GT 1|0 1|0 0|0 0|0
> chr1 50000 rs5 T G 100 PASS AA=T GT 0|1 0|0 0|0 0|0
> chr1 60000 rs6 G A 100 PASS AA=A GT 1|1 1|1 0|0 0|0
> chr1 70000 rs7 C T 100 PASS AA=C GT 0|1 1|0 0|0 0|0
> chr1 80000 rs8 C T 100 PASS AA=C GT 0|0 0|1 0|0 0|0
> chr1 90000 rs9 T A 100 PASS AA=T GT 0|1 0|1 0|0 0|0
> chr1 100000 rs10 A G 100 PASS AA=A GT 1|0 0|0 0|0 0|0
> chr1 110000 rs11 C T 100 PASS AA=T GT 0|1 0|1 0|0 1|1
In order to work with the data, it has to be transformed to an
internal representation, namely an object of class haplohh
.
Let us first use the pair of files example1.hap
and
example1.map
. In this case the allele coding in the
haplotype file is conform to the 01
format (defined in the
main vignette), hence setting allele_coding = "01"
is
appropriate. This is also the format which is used internally.
> * Reading input file(s) *
> Map info: 11 markers declared for chromosome chr1 .
> Haplotype input file in standard format assumed.
> Assume that alleles in the haplotype file are coded as:
> NA or '.': missing value
> 0: ancestral allele
> 1, 2, ...: derived allele(s).
> * Filtering data *
> Discard markers genotyped on less than 100 % of haplotypes.
> No marker discarded.
> Data consists of 8 haplotypes and 11 markers.
> Number of mono-, bi-, multi-allelic markers:
> 1 2
> 0 11
This data set is constructed such that setting
allele_coding = "map"
or
allele_coding = "none"
yields an identical
haplohh
object. In the first case, the fourth column of the
map file, standing for the ancestral allele, contains always 0, hence
for every marker in the haplotype file the “allele” 0 is assigned by the
map file as ancestral and thus recoded by 0; the fifth column in the map
file delineates the derived alleles of each marker, here again always 1,
hence the “allele” 1 of the haplotype file is translated to the first
derived allele, coded by 1. In the second case, which causes a coding by
alpha-numeric order, we end up again by just replacing 0 by 0 and 1 by
1.
hh_map <- data2haplohh(hap_file = "example1.hap",
map_file = "example1.map",
allele_coding = "map",
verbose = FALSE)
identical(hh, hh_map)
> [1] TRUE
hh_none <- data2haplohh(hap_file = "example1.hap",
map_file = "example1.map",
allele_coding = "none",
verbose = FALSE)
identical(hh, hh_none)
> [1] TRUE
Finally, we use the vcf file as input. No map file has to be
specified. The option polarize_vcf
is set by default to
TRUE
, assuming ancestral alleles are given in the
INFO
field by key AA
. The function
data2haplohh
then recodes the markers where reference and
ancestral alleles differ. If the option is turned off, coding with
respect to the reference sequence is taken and the information about
ancestry lost.
Again, one yields the same internal representation of the data:
hh_vcf <- data2haplohh(hap_file = "example1.vcf",
vcf_reader = "data.table",
verbose = FALSE)
identical(hh, hh_vcf)
> [1] TRUE
The haplohh-object can be visualized by a simple plot that shows ancestral alleles in blue and derived ones in red:
Note, however, that this kind of plot is intended only for relatively small data sets.
We start with the computation of (allele-specific) EHH which is used for the detection of selection within a single, supposedly homogeneous, population.
We restrict our attention to the central marker with id
rs6
. We start by computing EHH for its alleles. We
include the number of evaluated haplotypes into the output which tells
us merely that 4 haplotypes are evaluated for each allele, in agreement
with their frequencies. Note that the integral iHH_D
is not
computed, since the EHH_D
is still above the threshold
limehh
(default value 0.05) at the borders of the
chromosome and the option discard_integration_at_border
is
by default TRUE
.
> An object of class "ehh"
> [[1]]
> [1] "rs6"
>
> [[2]]
> FREQ_A FREQ_D
> 0.5 0.5
>
> [[3]]
> POSITION EHH_A EHH_D NHAPLO_A NHAPLO_D
> rs1 10000 0.0000000 0.5000000 0 4
> rs2 20000 0.0000000 1.0000000 0 4
> rs3 30000 0.0000000 1.0000000 4 4
> rs4 40000 0.1666667 1.0000000 4 4
> rs5 50000 0.5000000 1.0000000 4 4
> rs6 60000 1.0000000 1.0000000 4 4
> rs7 70000 0.3333333 1.0000000 4 4
> rs8 80000 0.1666667 1.0000000 4 4
> rs9 90000 0.0000000 1.0000000 4 4
> rs10 100000 0.0000000 1.0000000 0 4
> rs11 110000 0.0000000 0.3333333 0 4
>
> [[4]]
> IHH_A IHH_D
> 18816.67 NA
The individual values of the table can be easily checked “by hand” using the internal data representation of the haplotypes:
> rs1 rs2 rs3 rs4 rs5 rs6 rs7 rs8 rs9 rs10 rs11
> HG1_1 0 0 1 1 0 0 0 0 0 1 1
> HG1_2 0 0 0 0 1 0 1 0 1 0 0
> HG2_1 0 1 0 1 0 0 1 0 0 0 1
> HG2_2 0 0 1 0 0 0 0 1 1 0 0
> HG3_1 0 0 0 0 0 1 0 0 0 0 1
> HG3_2 0 0 0 0 0 1 0 0 0 0 1
> HG4_1 1 0 0 0 0 1 0 0 0 0 0
> HG4_2 0 0 0 0 0 1 0 0 0 0 0
The chromosomes HG1_1
, HG1_2
,
HG_2_1
and HG_2_2
carry the ancestral allele
of marker rs6
and form the “starting set” of shared
haplotypes to be extended along the chromosome. By definition the
starting set is homozygous and EHH equal to 1 at the focal
marker. The shared haplotypes cover so far a single chromosomal
position, namely that of the focal marker. We extend them to the next
marker on the right, rs7
, where two chromosomes carry one
allele and the other two another allele (ancestry status matters only at
the focal marker!). Hence the initial set of four can be subdivided into
two sets of extended shared haplotypes, namely {HG1_1
,
HG2_2
} sharing 00
and {HG1_2
,
HG2_1
} sharing 01
. They cover now the region
from rs6
til rs7
. Plugging the numbers into
the formula for EHH (see main vignette) yields $$
EHH^a_{rs6,rs7}=\frac{1}{n_a(n_a-1)}\sum_{k=1}^{K^a_{rs6,rs7}}n_k(n_k-1)=\frac{1}{4\cdot
3}(2\cdot1+2\cdot1)=\frac{1}{3}\;.
$$ One marker further to the right, at rs8
, the
first set can be partitioned into two, with HG1_1
having
extended haplotype 000
, and HG2_2
001
. The second set {HG1_2
,HG2_1
}
remains homozygous, now sharing the extended haplotype 010
.
Thus, at marker rs8
the corresponding EHH value
yields $$
EHH^a_{rs6,rs8}=\frac{1}{n_a(n_a-1)}\sum_{k=1}^{K^a_{rs6,rs8}}n_k(n_k-1)=\frac{1}{4\cdot
3}(1\cdot0+2\cdot1+1\cdot0)=\frac{1}{6}\;.
$$ At marker rs9
, chromosome HG1_2
carries allele 1
and HG2_1
allele
0
. Hence the extended haplotypes now differ between all
four considered chromosomes and EHH becomes zero. An analogous,
but independent, calculation can be done for the markers to the left of
the focal marker.
The starting set for the derived allele consists of
{HG3_1
, HG3_2
, HG4_1
and
HG4_2
}. Extending to the right, the corresponding
haplotypes remain homozygous and consequently the set is not split until
marker rs11
. In particular we have $$EHH^d_{rs6,rs7}=EHH^d_{rs6,rs8}=\frac{1}{4\cdot3}\sum_{k=1}^14\cdot3=1$$
and essentially the same situation on the left side of the focal
marker.
The corresponding plot (Figure @ref(fig:ehh)) shows that EHH of the ancestral allele decays more rapidly than that of the derived allele.
Assume now that the haplotypes are not phased. That means, at each
marker for which a diploid individual is heterozygous, it is unknown
which allele belongs to chromosome ‘1’ and which to chromosome ‘2’. In
this case the concept of extended haplotypes is not well-defined across
individuals. However, we can still measure the decay of extended
homozygosity within individuals. This is done by setting option
phased
to FALSE
while assuming that the
haplotypes in the input files are ordered as pairs belonging to
individuals.
> An object of class "ehh"
> [[1]]
> [1] "rs6"
>
> [[2]]
> FREQ_A FREQ_D
> 0.5 0.5
>
> [[3]]
> POSITION EHH_A EHH_D NHAPLO_A NHAPLO_D
> rs1 10000 0.0 0.5 0 4
> rs2 20000 0.0 1.0 0 4
> rs3 30000 0.0 1.0 0 4
> rs4 40000 0.0 1.0 4 4
> rs5 50000 0.5 1.0 4 4
> rs6 60000 1.0 1.0 4 4
> rs7 70000 0.0 1.0 4 4
> rs8 80000 0.0 1.0 0 4
> rs9 90000 0.0 1.0 0 4
> rs10 100000 0.0 1.0 0 4
> rs11 110000 0.0 1.0 0 4
>
> [[4]]
> IHH_A IHH_D
> 13537.5 NA
Individuals which are heterozygous for the focal marker are excluded
from the calculation. In our small sample all 4 individuals are
homozygous at marker rs6
; HG1
and
HG2
for the ancestral allele and HG3
and
HG4
for the derived allele, hence no haplotype is excluded.
Note that in realistic data a substantial number of haplotypes is
expected to belong to heterozygous individuals, cf. the Hardy-Weinberg
principle.
Again we can retrace the calculations by hand keeping in mind that EHH is now estimated by the fraction of homozygous individuals at each marker, see the corresponding formula in the main vignette.
Extending the shared haplotypes to the right, we find that both
individuals HG1
and HG2
become heterozygous
already at marker rs7
and hence EHH at this
position becomes 0. Extending to the left, HG1
becomes
heterozygous at marker rs5
, while HG2
is still
homozygous in the region spanning from rs6
to
rs5
. Hence the proportion of homozygous individuals at this
marker is $\frac{1}{2}$. At marker
rs4
the second individual becomes heterozygous, too, and
EHH yields 0.
By contrast, the individuals carrying the derived focal allele are
homozygous for the entire chromosome except for marker rs1
where HG4
becomes heterozygous.
Figure @ref(fig:ehhunphased) shows again the difference between EHH of the two core alleles:
Next, we calculate EHH per Site or EHHS which forms
the basis for cross-population comparisons. The options of the
corresponding function calc_ehhs()
are essentially
identical to those of the function calc_ehh()
. The output
contains EHHS and its normalized version nEHHS,
distinguished merely by a factor such that the latter becomes 1 at the
focal marker. We toggle the option
discard_integration_at_border
to FALSE
in
order to obtain the two corresponding integrals iES and
inES even though the values EHHS and nEHHS
have not fallen below the default threshold of 0.05 at the first resp.
last marker. Note that elements 3 and 4 of the list can be addressed by
res$IES
and res$INES
, too.
> An object of class "ehhs"
> [[1]]
> [1] "rs6"
>
> [[2]]
> POSITION EHHS NEHHS NHAPLO
> rs1 10000 0.10714286 0.2500000 8
> rs2 20000 0.21428571 0.5000000 8
> rs3 30000 0.21428571 0.5000000 8
> rs4 40000 0.25000000 0.5833333 8
> rs5 50000 0.32142857 0.7500000 8
> rs6 60000 0.42857143 1.0000000 8
> rs7 70000 0.28571429 0.6666667 8
> rs8 80000 0.25000000 0.5833333 8
> rs9 90000 0.21428571 0.5000000 8
> rs10 100000 0.21428571 0.5000000 8
> rs11 110000 0.07142857 0.1666667 8
>
> [[3]]
> [1] 19821.43
>
> [[4]]
> [1] 52916.67
The calculations can be retraced easily. In fact, the only distinction to EHH is that all haplotypes are considered and not only those with a specific allele at the focal marker. Using visual inspection of the haplotype data file (see above), we can plug the numbers into the formula for EHHS (see main vignette): $$\mathrm{EHHS}_{rs6,rs6}=\frac{1}{n_s(n_s-1)}\left(\sum\limits_{k=1}^{K_{rs6,rs6}}n_k(n_k-1)\right)=\frac{1}{8\cdot7}(4\cdot3+4\cdot3)=\frac{3}{7}$$ $$\mathrm{EHHS}_{rs6,rs7}=\frac{1}{n_s(n_s-1)}\left(\sum\limits_{k=1}^{K_{rs6,rs7}}n_k(n_k-1)\right)=\frac{1}{8\cdot7}(2\cdot1+2\cdot1+4\cdot3)=\frac{2}{7}$$ $$\mathrm{EHHS}_{rs6,rs8}=\frac{1}{n_s(n_s-1)}\left(\sum\limits_{k=1}^{K_{rs6,rs8}}n_k(n_k-1)\right)=\frac{1}{8\cdot7}(1\cdot0+2\cdot1+1\cdot0+4\cdot3)=\frac{1}{4}\;.$$
By default, the following command shows the (un-normalized)
EHHS values as in Figure @ref(fig:ehhs1). In order to draw the
normalized values one can toggle the option nehhs
to
TRUE
.
In order to find out whether marker rs6
is “special”, we
compute the integrals over EHH and EHHS, yielding
resp. iHH and iES, for all markers, hence performing a
“genomic scan”. Since we are dealing with a single population, only the
iHH values are of interest. We can see that IHH_A and
IHH_D are particularly different for the central marker.
> CHR POSITION FREQ_A FREQ_D NHAPLO_A NHAPLO_D IHH_A IHH_D IES
> rs1 chr1 10000 0.875 0.125 7 1 21992.26 0.00 15295.238
> rs2 chr1 20000 0.875 0.125 7 1 36190.48 0.00 25892.857
> rs3 chr1 30000 0.750 0.250 6 2 45000.00 23512.50 22678.571
> rs4 chr1 40000 0.750 0.250 6 2 47666.67 28025.00 24285.714
> rs5 chr1 50000 0.875 0.125 7 1 33333.33 0.00 23750.000
> rs6 chr1 60000 0.500 0.500 4 4 18816.67 89166.67 19821.429
> rs7 chr1 70000 0.750 0.250 6 2 45333.33 28025.00 23035.714
> rs8 chr1 80000 0.875 0.125 7 1 38571.43 0.00 27678.571
> rs9 chr1 90000 0.750 0.250 6 2 50666.67 23512.50 25714.286
> rs10 chr1 100000 0.875 0.125 7 1 35000.00 0.00 25000.000
> rs11 chr1 110000 0.500 0.500 4 4 25075.00 25833.33 8032.143
> INES
> rs1 21992.26
> rs2 36190.48
> rs3 43437.50
> rs4 46250.00
> rs5 33333.33
> rs6 52916.67
> rs7 44062.50
> rs8 38571.43
> rs9 48750.00
> rs10 35000.00
> rs11 25416.67
In order to evaluate the differences, the log ratio between IHH_A and IHH_D is calculated, yielding the, as yet, un-normalized values unihs. In general these should be normalized bin-wise, grouping markers with the same frequency of the derived allele. With so few markers, though, it is appropriate to normalize over all markers at once by setting the bin number to 1.
> $ihs
> CHR POSITION IHS LOGPVALUE
> rs1 chr1 10000 NA NA
> rs2 chr1 20000 NA NA
> rs3 chr1 30000 0.5813076 0.25101145
> rs4 chr1 40000 0.4464354 0.18357125
> rs5 chr1 50000 NA NA
> rs6 chr1 60000 -1.9389615 1.27979084
> rs7 chr1 70000 0.3890668 0.15662597
> rs8 chr1 80000 NA NA
> rs9 chr1 90000 0.7168779 0.32472642
> rs10 chr1 100000 NA NA
> rs11 chr1 110000 -0.1947262 0.07283128
>
> $frequency.class
> N_MRK MEAN_UNIHS SD_UNIHS LOWER_QT UPPER_QT
> [0.05,0.95) 11 0.1405648 0.8748648 -1.365018 0.7529103
We can use calc_candidate_regions()
to delineate regions
with “extreme” scores hinting at selection. Because we have only a few
markers, we choose an unrealistically low threshold of 1.5 and set the
window size to 1, which means that no windowing is performed and only
the positions of extreme markers returned.
> CHR START END EXTR_MRK
> rs6 chr1 60000 60000 -1.938961
Under the assumption that most sites evolve neutrally, the standardized iHS values should follow a normal distribution with the sites under selection as outliers.
Obviously we do not have enough markers to fit a distribution, but a “genome-wide” plot of the ihs values shows clearly that the central marker is rather an outlier (as much as is possible for such a small sample), see Figure @ref(fig:manhattan11).
A furcation plot represents a more fine-grained visualization of the homozygosity decay. In particular, individual haplotypes can be discerned which may instigate further investigations. The labels plotted in Figure @ref(fig:furcation11) are set in bold face, if the branches with which they are associated encompass further haplotypes.
f <- calc_furcation(hh, mrk = "rs6")
# set equal plot margins on left and right side and save old ones
oldpar <- par(mar = (c(5, 3, 4, 3) + 0.1))
plot(f,
# increase line width
lwd = 1.5,
# set habplotype identifiers as labels
hap.names = hap.names(hh),
# find a place for the legend ...
legend.xy.coords = c(60000, 0.2)
)
# reset old margins
par(oldpar)
A furcation diagram consists of trees for each allele and both sides (“left” and “right”) of the marker. The individual trees can be exported into a string in Newick format to be rendered by external programs, e.g. the phylogenetic R-package ape, see Figure @ref(fig:newick1).
> [1] "(((HG2_2:0,(HG2_1:0,HG1_1:0):10000):10000,HG1_2:0):10000);"
library(ape)
tree <- ape::read.tree(text = newick)
plot(tree,
direction = "leftwards",
edge.color = "blue",
underscore = TRUE,
no.margin = TRUE)
The end points of shared extended haplotypes can be defined as the “last split” in a furcation, i.e. the positions until which at least two different chromosomes of the sample are homozygous. Calculation of shared haplotype length and its visualization in Figure @ref(fig:haplen11) are called by:
In case of unphased haplotypes, furcations can only occur within individuals which limits the informative value of furcation diagrams as in Figure @ref(fig:furcation12).
f <- calc_furcation(hh, mrk = "rs6", phased = FALSE)
# set equal plot margins on left and right side and save old ones
oldpar <- par(mar = (c(5, 3, 4, 3) + 0.1))
plot(f,
# increase line width
lwd = 1.5,
# set haplotype identifiers as labels
hap.names = hap.names(hh),
# no place for a legend inside the plot
legend.xy.coords = "none")
# reset old margins
par(oldpar)
Nevertheless, the length of shared haplotypes, now identical to the ranges of individual homozygosity, can be calculated as before to yield Figure @ref(fig:haplen13).
The second data is an extension of the first, containing four
additional haplotypes, some missing values (including for the central
marker rs6
), and two markers with three alleles, namely
rs6
and rs9
.
Let us first inspect the file example2.hap
. Missing
values are here marked by a point, but NA
could be used,
too.
> HG1_1 0 0 . 1 0 0 0 0 0 1 1
> HG1_2 0 0 0 0 1 0 1 0 1 0 0
> HG2_1 0 1 . 1 0 0 1 0 0 0 1
> HG2_2 0 0 1 0 0 0 0 1 . . 0
> HG3_1 0 0 0 0 0 1 0 0 0 0 1
> HG3_2 0 0 0 0 0 1 0 0 0 . 1
> HG4_1 1 0 0 0 0 . 0 0 0 0 0
> HG4_2 0 0 0 0 0 1 0 0 0 0 0
> HG5_1 0 0 0 0 0 2 0 0 2 1 0
> HG5_2 0 0 0 1 0 2 0 0 . 0 0
> HG6_1 0 . 0 0 0 2 0 0 1 0 0
> HG6_2 0 0 0 0 . 2 0 0 0 . 0
If the map information file is used for allele recoding, the fifth
column can accommodate multiple (derived) alleles, separated by a comma,
as in example2.map
:
> rs1 chr1 10000 0 1
> rs2 chr1 20000 0 1
> rs3 chr1 30000 0 1
> rs4 chr1 40000 0 1
> rs5 chr1 50000 0 1
> rs6 chr1 60000 0 1,2
> rs7 chr1 70000 0 1
> rs8 chr1 80000 0 1
> rs9 chr1 90000 0 1,2
> rs10 chr1 100000 0 1
> rs11 chr1 110000 0 1
The file example2.vcf
combines the information of
haplotype and map file. It contains an additional marker
rs12
which lacks information about the ancestral allele.
This marker gets excluded since by default
polarize_vcf = TRUE
, hence the function tries to polarize
all markers. If the parameter is set to FALSE
, the marker
is included, but ancestry information is lost for all markers.
Furthermore, the ancestral allele for the marker rs1
is
now given by a lower case latter, which typically means that it is of
“low confidence”. This marker gets included since
capitalize_AA = TRUE
by default. If this option is turned
off, the marker cannot be polarized and is discarded.
> ##fileformat=VCFv4.2
> ##reference=NCBI38
> ##contig=<ID=chr1,length=120000>
> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG1 HG2 HG3 HG4 HG5 HG6
> chr1 10000 rs1 G T 100 PASS AA=g GT 0|0 0|0 0|0 1|0 0|0 0|0
> chr1 20000 rs2 G A 100 PASS AA=G GT 0|0 1|0 0|0 0|0 0|0 .|0
> chr1 30000 rs3 T C 100 PASS AA=T GT .|0 .|1 0|0 0|0 0|0 0|0
> chr1 40000 rs4 A C 100 PASS AA=A GT 1|0 1|0 0|0 0|0 0|1 0|0
> chr1 50000 rs5 T G 100 PASS AA=T GT 0|1 0|0 0|0 0|0 0|0 0|.
> chr1 60000 rs6 G A,C 100 PASS AA=A GT 1|1 1|1 0|0 .|0 2|2 2|2
> chr1 70000 rs7 C T 100 PASS AA=C GT 0|1 1|0 0|0 0|0 0|0 0|0
> chr1 80000 rs8 C T 100 PASS AA=C GT 0|0 0|1 0|0 0|0 0|0 0|0
> chr1 90000 rs9 T A,G 100 PASS AA=T GT 0|1 0|. 0|0 0|0 2|. 1|0
> chr1 100000 rs10 A G 100 PASS AA=A GT 1|0 0|. 0|. 0|0 1|0 0|.
> chr1 110000 rs11 C T 100 PASS AA=T GT 0|1 0|1 0|0 1|1 1|1 1|1
> chr1 120000 rs12 T G 100 PASS AA=. GT 0|0 0|0 1|0 0|0 0|0 0|0
Alleles in the haplotype input file are already coded by the numbers
0,1 and 2 hence conform to the coding "01"
explained in the
main vignette.
> * Reading input file(s) *
> Map info: 11 markers declared for chromosome chr1 .
> Haplotype input file in standard format assumed.
> Assume that alleles in the haplotype file are coded as:
> NA or '.': missing value
> 0: ancestral allele
> 1, 2, ...: derived allele(s).
> * Filtering data *
> Discard markers genotyped on less than 100 % of haplotypes.
> 6 markers discarded.
> 5 markers remaining.
> Data consists of 12 haplotypes and 5 markers.
> Number of mono-, bi-, multi-allelic markers:
> 1 2
> 0 5
The output tells us that 6 markers with missing values have been
eliminated. This is due to the pre-set filter option of
min_perc_geno.mrk = 100
which excludes all markers that are
not fully genotyped. If we do not want to loose them, we can lower the
corresponding threshold, e.g. to 50%. Note that setting this condition
to 0 is not allowed since the package cannot handle markers with no data
attached.
hh <- data2haplohh(hap_file = "example2.hap",
map_file = "example2.map",
allele_coding = "01",
min_perc_geno.mrk = 50)
> * Reading input file(s) *
> Map info: 11 markers declared for chromosome chr1 .
> Haplotype input file in standard format assumed.
> Assume that alleles in the haplotype file are coded as:
> NA or '.': missing value
> 0: ancestral allele
> 1, 2, ...: derived allele(s).
> * Filtering data *
> Discard markers genotyped on less than 50 % of haplotypes.
> No marker discarded.
> Data consists of 12 haplotypes and 11 markers.
> Number of mono-, bi-, multi-allelic markers:
> 1 2 3
> 0 9 2
Like with the first example, the data set is constructed such that
the allele coding options "01"
, "map"
and
"none"
applied to the pair of haplotype and map file or
input from the vcf file lead to identical haplohh
objects:
hh_map <- data2haplohh(hap_file = "example2.hap",
map_file = "example2.map",
allele_coding = "map",
min_perc_geno.mrk = 50,
verbose = FALSE)
identical(hh, hh_map)
> [1] TRUE
hh_none <- data2haplohh(hap_file = "example2.hap",
map_file = "example2.map",
allele_coding = "none",
min_perc_geno.mrk = 50,
verbose = FALSE)
identical(hh, hh_none)
> [1] TRUE
hh_vcf <- data2haplohh(hap_file = "example2.vcf",
min_perc_geno.mrk = 50,
vcf_reader = "data.table",
verbose = FALSE)
identical(hh, hh_vcf)
> [1] TRUE
As with the first example data set, the function
calc_ehh()
reports the EHH values for each allele
of the focal marker of which there are now three. We set the option
include_zero_values
to TRUE
to include the
remaining marker rs11
in the table (and subsequent plot)
where all three EHH values reach zero.
res <- calc_ehh(hh,
mrk = "rs6",
include_nhaplo = TRUE,
include_zero_values = TRUE,
discard_integration_at_border = FALSE )
res
> An object of class "ehh"
> [[1]]
> [1] "rs6"
>
> [[2]]
> FREQ_A FREQ_D1 FREQ_D2
> 0.3636364 0.2727273 0.3636364
>
> [[3]]
> POSITION EHH_A EHH_D1 EHH_D2 NHAPLO_A NHAPLO_D1 NHAPLO_D2
> rs1 10000 0.0000000 1 0.0000000 0 3 0
> rs2 20000 0.0000000 1 0.0000000 0 3 2
> rs3 30000 0.0000000 1 0.3333333 2 3 3
> rs4 40000 0.1666667 1 0.3333333 4 3 3
> rs5 50000 0.5000000 1 1.0000000 4 3 3
> rs6 60000 1.0000000 1 1.0000000 4 3 4
> rs7 70000 0.3333333 1 1.0000000 4 3 4
> rs8 80000 0.1666667 1 1.0000000 4 3 4
> rs9 90000 0.0000000 1 0.0000000 4 3 3
> rs10 100000 0.0000000 1 0.0000000 0 2 0
> rs11 110000 0.0000000 0 0.0000000 0 2 0
>
> [[4]]
> IHH_A IHH_D1 IHH_D2
> 18816.67 90012.50 43216.67
Note that the derived alleles are ordered by their internal coding.
Figure @ref(fig:ehh21) shows clearly that the first derived allele has a strong extended homozygosity while the second derived allele is not that different from the ancestral allele.
For EHHS the output format does not depend on the number of core alleles. The values however, do, since more alleles at the focal marker entail a lower overall homozygosity.
res <- calc_ehhs(hh,
mrk = "rs6",
include_nhaplo = TRUE,
include_zero_values = TRUE,
discard_integration_at_border = FALSE)
res
> An object of class "ehhs"
> [[1]]
> [1] "rs6"
>
> [[2]]
> POSITION EHHS NEHHS NHAPLO
> rs1 10000 0.14285714 0.6000000 7
> rs2 20000 0.14285714 0.6000000 7
> rs3 30000 0.14285714 0.5714286 8
> rs4 40000 0.11111111 0.4166667 10
> rs5 50000 0.20000000 0.7500000 10
> rs6 60000 0.27272727 1.0000000 11
> rs7 70000 0.20000000 0.7333333 11
> rs8 80000 0.18181818 0.6666667 11
> rs9 90000 0.06666667 0.2500000 10
> rs10 100000 0.00000000 0.1000000 9
> rs11 110000 0.00000000 0.0000000 9
>
> [[3]]
> [1] 9582.161
>
> [[4]]
> [1] 49005.95
Note that the number of evaluated haplotypes NHAPLO
decreases with distance to the focal marker due to missing values which
lead at each calculation step to subsequent removals of the respective
chromosomes. This can sometimes yield a transient increase of
EHHS (and in general, EHH, too) as can be seen at
position 30 kb in Figure @ref(fig:ehhs21).
The function scan_hh()
does not evaluate EHH
for every allele of a focal marker, but chooses, besides the mandatory
ancestral allele, the derived allele with highest frequency.
> CHR POSITION FREQ_A FREQ_D NHAPLO_A NHAPLO_D IHH_A IHH_D
> rs1 chr1 10000 0.9166667 0.08333333 11 1 25045.24 0.00
> rs2 chr1 20000 0.9090909 0.09090909 10 1 32750.99 0.00
> rs3 chr1 30000 0.9000000 0.10000000 9 1 39881.94 0.00
> rs4 chr1 40000 0.7500000 0.25000000 9 3 38453.37 21383.33
> rs5 chr1 50000 0.9090909 0.09090909 10 1 29680.16 0.00
> rs6 chr1 60000 0.3636364 0.36363636 4 4 18816.67 43216.67
> rs7 chr1 70000 0.8333333 0.16666667 10 2 28960.52 28025.00
> rs8 chr1 80000 0.9166667 0.08333333 11 1 27370.13 0.00
> rs9 chr1 90000 0.7000000 0.20000000 7 2 40142.86 33012.50
> rs10 chr1 100000 0.7777778 0.22222222 7 2 24452.38 9025.00
> rs11 chr1 110000 0.6666667 0.33333333 8 4 14291.67 13037.50
> IES INES
> rs1 19362.121 25045.24
> rs2 25023.593 32750.99
> rs3 30057.143 39881.94
> rs4 23654.672 38145.51
> rs5 22727.201 29680.16
> rs6 9582.161 49005.95
> rs7 17869.697 28743.68
> rs8 21479.798 27370.13
> rs9 15730.159 39857.95
> rs10 11326.984 23125.00
> rs11 5890.837 14158.23
If alleles are not polarized, the corresponding option should be set
to FALSE
which replaces “ancestral” and “derived” by
“major” and “minor” (hence the two most frequent) alleles.
> CHR POSITION FREQ_MAJ FREQ_MIN NHAPLO_MAJ NHAPLO_MIN IHH_MAJ IHH_MIN
> rs1 chr1 10000 0.9166667 0.08333333 11 1 25045.24 0.00
> rs2 chr1 20000 0.9090909 0.09090909 10 1 32750.99 0.00
> rs3 chr1 30000 0.9000000 0.10000000 9 1 39881.94 0.00
> rs4 chr1 40000 0.7500000 0.25000000 9 3 38453.37 21383.33
> rs5 chr1 50000 0.9090909 0.09090909 10 1 29680.16 0.00
> rs6 chr1 60000 0.3636364 0.36363636 4 4 18816.67 43216.67
> rs7 chr1 70000 0.8333333 0.16666667 10 2 28960.52 28025.00
> rs8 chr1 80000 0.9166667 0.08333333 11 1 27370.13 0.00
> rs9 chr1 90000 0.7000000 0.20000000 7 2 40142.86 33012.50
> rs10 chr1 100000 0.7777778 0.22222222 7 2 24452.38 9025.00
> rs11 chr1 110000 0.6666667 0.33333333 8 4 14291.67 13037.50
> IES INES
> rs1 19362.121 25045.24
> rs2 25023.593 32750.99
> rs3 30057.143 39881.94
> rs4 23654.672 38145.51
> rs5 22727.201 29680.16
> rs6 9582.161 49005.95
> rs7 17869.697 28743.68
> rs8 21479.798 27370.13
> rs9 15730.159 39857.95
> rs10 11326.984 23125.00
> rs11 5890.837 14158.23
We continue with the polarized scan and calculate standardized log ratios of the iHH values without any binning.
> Discard focal markers with Minor Allele Frequency equal to or below 0.05 .
> No marker discarded.
> $ihs
> CHR POSITION IHS LOGPVALUE
> rs1 chr1 10000 NA NA
> rs2 chr1 20000 NA NA
> rs3 chr1 30000 NA NA
> rs4 chr1 40000 0.66462142 0.295598363
> rs5 chr1 50000 NA NA
> rs6 chr1 60000 -1.64513452 1.000251645
> rs7 chr1 70000 -0.23757455 0.090331086
> rs8 chr1 80000 NA NA
> rs9 chr1 90000 0.02742085 0.009606056
> rs10 chr1 100000 1.33214189 0.737991576
> rs11 chr1 110000 -0.14147509 0.051834262
>
> $frequency.class
> N_MRK MEAN_UNIHS SD_UNIHS LOWER_QT UPPER_QT
> [0.05,0.95) 11 0.1787203 0.6140553 -0.7234433 0.9454923
The “genome-wide” ihs values are depicted in Figure @ref(fig:manhattan21).
Now we know that actually the first derived allele at marker
rs6
is of interest, but it is not present in the scan
because it is not the major derived allele. How can we modify the scan
to include it?
It is tempting to combine the two derived alleles by overwriting the internal coding and yielding a bi-allelic marker. This, however, entails a completely new pattern of extended haplotypes and a distortion of the statistics. Another approach may be to exclude the chromosomes carrying the less interesting allele from the analysis as done below:
hh_subset = subset(hh,
# exclude haplotypes with allele 2 at "rs6"
select.hap = haplo(hh)[ ,"rs6"] != 2,
min_perc_geno.mrk = 50)
> * Subset haplotypes *
> 4 haplotypes discarded.
> 8 haplotypes remaining.
> * Filtering data *
> Discard markers genotyped on less than 50 % of haplotypes.
> No marker discarded.
> Data consists of 8 haplotypes and 11 markers.
> Number of mono-, bi-, multi-allelic markers:
> 1 2
> 1 10
> CHR POSITION FREQ_A FREQ_D NHAPLO_A NHAPLO_D IHH_A IHH_D
> rs1 chr1 10000 1.0000000 0.0000000 7 1 26267.86 0.0
> rs2 chr1 20000 0.8571429 0.1428571 6 1 38741.67 0.0
> rs3 chr1 30000 0.8000000 0.2000000 4 1 58370.83 0.0
> rs4 chr1 40000 0.7142857 0.2857143 5 2 39741.67 28025.0
> rs5 chr1 50000 0.8571429 0.1428571 6 1 33958.33 0.0
> rs6 chr1 60000 0.5714286 0.4285714 4 3 18816.67 90012.5
> rs7 chr1 70000 0.7142857 0.2857143 5 2 37241.67 28025.0
> rs8 chr1 80000 0.8571429 0.1428571 6 1 31500.00 0.0
> rs9 chr1 90000 0.8333333 0.1666667 5 1 43333.33 0.0
> rs10 chr1 100000 0.8000000 0.2000000 4 1 27500.00 0.0
> rs11 chr1 110000 0.4285714 0.5714286 3 4 14012.50 13037.5
> IES INES
> rs1 26267.857 26267.86
> rs2 24839.286 38741.67
> rs3 32741.667 58370.83
> rs4 25616.071 39697.89
> rs5 21449.405 33958.33
> rs6 18116.071 49710.52
> rs7 16568.452 36061.53
> rs8 20904.762 31500.00
> rs9 26833.333 43333.33
> rs10 14500.000 27500.00
> rs11 4267.857 13050.00
Note that the value of EHH_D, now representing the allele
with internal coding 1, is much higher at marker rs6
than
before.
However, with so few EHH values due to missing values, there
is not much signal left and a standardization by ihh2ihs()
averages the alleged outlier away as can be observed in Figure
@ref(fig:manhattan22).
ihs <- ihh2ihs(scan, freqbin = 1, verbose = FALSE)
manhattanplot(ihs, threshold = c(-1.5, 1.5), ylim = c(-2.5,2.5), pch = 20)
A furcation diagram can show the pattern for all three alleles of the
focal marker rs6
. (Pseudo-)furcations that arise from the
removal of chromosomes due to missing values are marked by dashed lines
as depicted in Figure @ref(fig:furcation21).
f <- calc_furcation(hh, mrk = "rs6")
# set equal plot margins on left and right side and save old ones
oldpar <- par(mar = (c(5, 3, 4, 3) + 0.1))
plot(f,
lwd = 1.5,
hap.names = hap.names(hh),
# no place for a legend inside the plot
legend.xy.coords = "none")
par(oldpar)
Again, it is possible to export each tree into Newick format. This format, however, has no option to mark different kinds of branches. We let package ape render the Newick string to yield Figure @ref(fig:newick2).
newick <- as.newick(f,
allele = 0,
side = "left",
hap.names = hap.names(hh))
tree <- ape::read.tree(text = newick)
plot(tree,
direction = "leftwards",
edge.color = "blue",
underscore = TRUE,
no.margin = TRUE)
Likewise, Figure @ref(fig:haplen21) of the shared haplotype lengths covers all alleles of the focal marker.
Finally, to clean-up the working directory, we call