Package HTRX uses Haplotype Trend Regression with eXtra flexibility (HTRX) to search a pre-defined set of single nucleotide polymorphisms (SNPs) for haplotype patterns, including single SNPs and non-contiguous haplotypes. We search over all possible templates which assign each SNP a value of ‘0’ or ‘1’, reflecting whether the reference allele of that SNP is present or absent, or an ‘X’, meaning either value is allowed. For example, the four-SNP haplotype ‘1XX0’ refers only to the interaction between the first SNP’s alternative allele and the fourth SNP’s reference allele.
It is unrealistic to infer the true gene-gene interactions because the number of haplotypes increases exponentially with the number of SNPs. The goal of package HTRX is to make good predictions by selecting the genetic features with the best predictive performance. Although it is possible to use HTRX to fit the best haplotype patterns to the entire dataset, to avoid overfitting in the estimates of variance explained by haplotypes we instead split the dataset and train, select and test the model on different folds of the data. By predicting the out-of-sample variance explained (R²), HTRX quantifies whether a tagging SNP is adequate to represent a region, or whether interactions or LD with unobserved causal SNPs are present (Yang and Lawson 2023).
Here we provide a tutorial on using the HTRX package for selecting haplotype models. We mainly focus on the following functions:
1. make_htrx: generating the feature data for haplotypes;
2. do_cv: selecting haplotypes for small regions;
3. do_cumulative_htrx: selecting haplotypes for large regions.
To start using the HTRX package, you need to install it from CRAN:
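## install the released version of HTRX from CRAN
install.packages("HTRX")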
Then you can load the package HTRX into R by running:
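## load the package
library(HTRX)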
First, we load three example datasets contained in the package HTRX. example_data_nosnp is the covariate data, including the outcome, sex, age and the first 18 ancestral Principal Components (PCs) for 5000 individuals. HTRX can handle datasets with either continuous or binary response variables. As the outcome is binary for this example dataset, it takes values of 0 or 1, denoting having the outcome or not, respectively.
## display the covariate data for individuals 41 to 43
example_data_nosnp[41:43,1:6]
#> outcome sex age PC1 PC2 PC3
#> 41 0 1 21 -0.3889841 -0.5743902 -0.72294996
#> 42 1 1 51 -0.2182516 -0.9395326 -0.64852708
#> 43 0 0 51 -0.7374281 -0.2923432 -0.04199157
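As a quick check (not part of the original output), the outcome column can be tabulated to confirm that it only takes the values 0 and 1:
## tabulate the binary outcome across the 5000 individuals
table(example_data_nosnp$outcome)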
example_hap1 and example_hap2 are the genotype data (including 8 SNPs) for the first and second copies of the genomes of the 5000 diploid individuals, respectively. Each SNP must be biallelic; it takes the value ‘0’ for the reference allele and ‘1’ for the alternative allele.
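To see this coding, we can display a few rows of the genotype data (output omitted here):
## display the first 3 individuals' first genome copy at the first 4 SNPs
example_hap1[1:3,1:4]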
As defined by our template, there are 3^m − 1 haplotypes (including SNPs, contiguous haplotypes and non-contiguous haplotypes) if the region contains m SNPs. This results in a large memory footprint and high computational cost when many SNPs are included in the region. In this section, we focus on regions with at most 6 SNPs, while we will deal with longer regions in Section 5.
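As a quick illustration of this growth (a simple calculation, not part of the original vignette), the number of templates for regions of 1 to 8 SNPs is:
## number of haplotype templates (3^m - 1) for regions of m = 1 to 8 SNPs
3^(1:8) - 1
#> [1]    2    8   26   80  242  728 2186 6560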
Before performing haplotype selection within small regions, we need to represent the data in terms of haplotype templates for each individual using the function make_htrx. In this example, we focus on the region of the first 4 SNPs for the first 2000 individuals.
## create haplotype data removing haplotypes rarer than 1%
HTRX_matrix_rmrare = make_htrx(hap1=example_hap1[1:2000,1:4],
hap2=example_hap2[1:2000,1:4],
rareremove=TRUE,rare_threshold=0.01)
## display the created haplotype data
HTRX_matrix_rmrare[1:3,47:50]
#> X0X1 01X1 11X1 X1X1
#> 1 1.0 0 0 0
#> 2 0.5 0 0 0
#> 3 1.0 0 0 0
## create haplotype data without removing any haplotypes
HTRX_matrix_allhaps = make_htrx(hap1=example_hap1[1:2000,1:4],
hap2=example_hap2[1:2000,1:4])
## create haplotype data while at maximum 3 SNPs can interact
HTRX_matrix_3snphaps = make_htrx(hap1=example_hap1[1:2000,1:4],
hap2=example_hap2[1:2000,1:4],max_int=3)
## compare the numbers of haplotypes created by different 'rareremove' and 'max_int' settings
cat(ncol(HTRX_matrix_rmrare),
ncol(HTRX_matrix_allhaps),
ncol(HTRX_matrix_3snphaps))
#> 79 80 64
The output of make_htrx is a data frame with rows denoting individuals and columns denoting haplotypes defined by the HTRX template. For diploid samples, such as our example, each haplotype takes a value of 0, 0.5 or 1 if the individual carries the haplotype on 0, 1 or 2 copies of the genome, respectively. For haploid samples, each haplotype takes a value of either 0 or 1.
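A minimal sanity check of this coding on the example data (not in the original vignette), which should return TRUE for our diploid samples:
## should return TRUE: all haplotype dosages are 0, 0.5 or 1 for diploid samples
all(unlist(HTRX_matrix_allhaps) %in% c(0, 0.5, 1))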
There are many arguments for make_htrx which enable creating customized haplotype data. For diploid samples, hap1 and hap2 should be the genotype data for the two genome copies. If the sample is haploid, we needn’t specify the value of hap2, because by default hap2=hap1, which represents haploid samples. The logical argument rareremove determines whether haplotypes with frequency lower than rare_threshold are removed; by default rareremove=FALSE, meaning no rare haplotypes will be removed. Finally, max_int can be set to specify the maximum number of SNPs that can interact, but its value needn’t be given if we want to select the best model from all the haplotypes. For instance, when setting max_int=3 in the example dataset above, the 16 haplotypes representing interactions between all 4 SNPs are removed.
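For example, relying on the default hap2=hap1 described above, a haploid dataset would only need hap1 to be supplied (this call is purely illustrative and treats example_hap1 as haploid data):
## haploid samples: hap2 defaults to hap1, so it can be omitted
HTRX_matrix_haploid = make_htrx(hap1=example_hap1[1:2000,1:4])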
We use the function do_cv, a two-stage procedure, to select haplotypes for regions containing at most 6 SNPs.
Stage 1: create haplotype data from SNP data, and select candidate models using AIC, BIC or lasso.
Stage 2: select the best model using k-fold cross-validation.
The algorithm is described in detail in Yang and Lawson (2023).
We first explain the important arguments of do_cv. data_nosnp is a data frame whose first column must be the outcome. If fixed covariates are included in the model, they should appear in the following columns. featuredata contains the genetic features we will select from, which include both non-contiguous haplotypes and SNPs. sim_times is the number of simulations in the step that generates candidate models. A larger sim_times linearly increases the computational time but has been shown to improve model performance only to a very limited extent, so we suggest keeping it small. featurecap limits the maximum number of features from featuredata that can be selected. usebinary is determined by the type of outcome: it takes the value 1 for a binary outcome and 0 for a continuous outcome. We use stratified sampling to split the data if method="stratified" is specified. We provide three different criteria for model selection through the argument criteria: “AIC”, “BIC” and “lasso”. For larger datasets, we may want to use parallel computing by setting runparallel=TRUE and specifying the number of cores through the argument mc.cores. Note that we use mclapply for parallel running, so this is currently unavailable for Windows users. The other arguments, their default values and descriptions can be found by typing help(do_cv).
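As a sketch of parallel usage (on a non-Windows machine; the number of cores here is arbitrary), do_cv can be called with runparallel=TRUE:
## run the cross-validation in parallel over 4 cores (not available on Windows)
CV_results_parallel <- do_cv(data_nosnp=example_data_nosnp[1:2000,1:3],
                             featuredata=HTRX_matrix_rmrare,
                             sim_times=2,featurecap=4,usebinary=1,
                             method="stratified",criteria="BIC",
                             runparallel=TRUE,mc.cores=4)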
To start the analysis, we select the best model entirely from the haplotype data HTRX_matrix_rmrare we created above, while ignoring the sex, age and PCs of the individuals. The output of do_cv is a list with two elements. The first element is the out-of-sample variance explained (we use the standard R² for a continuous outcome and McFadden’s R² for a binary outcome) in each fold of the test data, and the second element is the selected features arranged in order of their association with the outcome.
## selecting the best haplotype model using "AIC" from all the haplotypes
CV_results_nocovar <- do_cv(data_nosnp=example_data_nosnp[1:2000,1,drop=FALSE],
featuredata=HTRX_matrix_rmrare,
sim_times=2,featurecap=4,usebinary=1,
method="simple",criteria="BIC",gain=FALSE)
cat('The selected features', as.character(CV_results_nocovar[[2]]),
'explains \n',mean(CV_results_nocovar[[1]])*100,
'% average out-of-sample variance')
#> The selected features 0XXX 01XX X000 11X1 explains
#> 54.58883 % average out-of-sample variance
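Since the first element of the output stores the variance explained in each test fold, it can also be inspected directly rather than averaged:
## out-of-sample R^2 for each individual test fold
CV_results_nocovar[[1]]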
In actual genetic analyses, we normally include some fixed covariates in the model, e.g. sex, age and the first 18 PCs. In the example below, we add the second and third columns of example_data_nosnp to data_nosnp to include the sex and age information in the model. We also set gain=TRUE, which means we’d like to see the extra variance explained by haplotypes (details can be found in Barrie et al. (2024)).
## selecting the best haplotype model using "BIC" from all the haplotypes
## here we include the sex and age as fixed covariates
CV_results_withcovar <- do_cv(data_nosnp=example_data_nosnp[1:2000,1:3],
featuredata=HTRX_matrix_rmrare,
sim_times=2,featurecap=8,usebinary=1,
method="stratified",criteria="AIC",gain=TRUE)
cat('The selected features', as.character(CV_results_withcovar[[2]]),
'explains \n', mean(CV_results_withcovar[[1]])*100,
'% extra average out-of-sample variance')
#> The selected features 1XXX 00XX 11X1 1011 XX00 1X0X 1X1X explains
#> 57.18279 % extra average out-of-sample variance
Longer haplotypes are important for discovering interactions. However, there are 3^k − 1 haplotypes in HTRX if the region contains k SNPs, making it unrealistic for regions with large numbers of SNPs. To address this issue, we proposed “cumulative HTRX” (function do_cumulative_htrx), which enables HTRX to run on longer haplotypes, i.e. haplotypes which include at least 7 SNPs. This is a three-stage process explained in detail in Yang and Lawson (2023):
Stage 1: extend haplotypes until all the SNPs are included.
Stage 2: generate a set of candidate models.
Stage 3: select the best model using k-fold cross-validation.
Similarly, we focus on the data for the first 2000 individuals. Unlike do_cv, the input of do_cumulative_htrx doesn’t require the data to be represented in terms of haplotype templates. Instead, the genotype data for the SNPs (hap1 and hap2) are required, the same as for the function make_htrx. This is because we don’t generate all the haplotypes at once but gradually extend them. Also, we limit the maximum number of SNPs that can interact by setting max_int=4.
## selecting the best haplotype model using "BIC"
## we include all the 8 SNPs, but specify at most 4 SNPs can interact
## we also include the sex and age as fixed covariates
cumu_CV_results <- do_cumulative_htrx(data_nosnp=example_data_nosnp[1:2000,1:3],
hap1=example_hap1[1:2000,],
hap2=example_hap2[1:2000,],
sim_times=1,featurecap=8,usebinary=1,
method="stratified",criteria="AIC",
gain=TRUE,max_int=4)
cat('The selected features', as.character(cumu_CV_results[[2]]),
'explains \n',mean(cumu_CV_results[[1]])*100,
'% average out-of-sample variance')
#> The selected features 0XXXXXXX 00XXXXXX 1XX1XXXX 100XXXX1 1X01XXX1 0X00XXX0 1X1XX1X1 explains
#> 57.91961 % average out-of-sample variance
The output of do_cumulative_htrx has the same format as that of do_cv. When comparing cumu_CV_results with CV_results_withcovar, we find that adding the additional 4 SNPs hardly increases the average out-of-sample variance explained, indicating that these 4 SNPs have only a weak association with the outcome. This is reflected by the first 2 selected features being just the first SNP and the interaction between the first 2 SNPs.
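One way to quantify this comparison (an extra step, not in the original vignette) is to take the difference of the average out-of-sample variance explained by the two models:
## difference in average out-of-sample variance explained (roughly 0.007 here)
mean(cumu_CV_results[[1]]) - mean(CV_results_withcovar[[1]])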
We have avoided the exponentially increasing computational cost of do_cv by implementing do_cumulative_htrx, at the cost of losing some haplotypes, but it still requires a significant amount of time to run if we have large sample sets. Thus, running in parallel by setting runparallel=TRUE is suggested. Furthermore, we can increase the values of sim_times, featurecap, max_int and L to slightly improve the performance, but the computational cost will increase accordingly.
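For instance, a more thorough (and slower) run might increase these values along the lines of the sketch below; the specific settings and the use of 4 cores are illustrative only:
## a slower but potentially slightly better run with larger tuning values
cumu_CV_results_thorough <- do_cumulative_htrx(data_nosnp=example_data_nosnp[1:2000,1:3],
                                               hap1=example_hap1[1:2000,],
                                               hap2=example_hap2[1:2000,],
                                               sim_times=5,featurecap=10,usebinary=1,
                                               method="stratified",criteria="AIC",
                                               gain=TRUE,max_int=5,
                                               runparallel=TRUE,mc.cores=4)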
Yang, Y. and Lawson, D.J. HTRX: an R package for learning non-contiguous haplotypes associated with a phenotype. Bioinformatics Advances 3(1) (2023): vbad038.
Barrie, W., Yang, Y., Irving-Pease, E.K. et al. Elevated genetic risk for multiple sclerosis emerged in steppe pastoralist populations. Nature 625, 321–328 (2024).