Title: Haplotype Trend Regression with eXtra Flexibility (HTRX)
Description: Detection of haplotype patterns that include single nucleotide polymorphisms (SNPs) and non-contiguous haplotypes that are associated with a phenotype. Methods for implementing HTRX are described in Yang Y, Lawson DJ (2023) <doi:10.1093/bioadv/vbad038> and Barrie W, Yang Y, Irving-Pease E.K., et al (2024) <doi:10.1038/s41586-023-06618-z>.
Authors: Yaoling Yang [aut, cre], Daniel Lawson [aut]
Maintainer: Yaoling Yang <[email protected]>
License: GPL-3
Version: 1.2.4
Built: 2024-12-06 06:46:55 UTC
Source: CRAN
This is the software for "HTRX - Haplotype Trend Regression with eXtra flexibility (HTRX)", based on the paper "Genetic risk for Multiple Sclerosis originated in Pastoralist Steppe populations", Barrie W, Yang Y, Attfield K.E., et al (2022).
HTRX searches for haplotype patterns that include single nucleotide polymorphisms (SNPs) and non-contiguous haplotypes.
An HTRX feature is a template that gives a value for each SNP: '0' or '1', reflecting whether the reference allele of that SNP is present or absent, or 'X', meaning either value is allowed.
We use a two-step procedure to select the best HTRX model: do_cv.
Step 1: select candidate models using AIC, BIC or lasso;
Step 2: select the best model using 10-fold cross-validation.
There is also an option to directly perform 10-fold cross-validation: do_cv_direct. This method loses some accuracy and doesn't return the fixed features selected, but saves computational time.
Longer haplotypes are important for discovering interactions. However, too many haplotypes make the original HTRX unrealistic for regions with large numbers of SNPs. We proposed "cumulative HTRX", which enables HTRX to run on longer haplotypes: do_cumulative_htrx. A minimal workflow is sketched below.
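As a quick orientation, here is a minimal sketch of the basic workflow using the bundled example data (the subset sizes and argument values are illustrative only; see the full examples later in this document):

## minimal sketch: build HTRX features for 4 SNPs, then run the two-step cross-validation
library(HTRX)

## haplotype features from both genomes (use hap2=hap1 for haploid data)
HTRX_matrix <- make_htrx(HTRX::example_hap1[1:300, 1:4],
                         HTRX::example_hap2[1:300, 1:4])

## outcome plus fixed covariates only (sex and age here)
covariates <- HTRX::example_data_nosnp[1:300, 1:3]

## two-step procedure: Step 1 selects candidate models, Step 2 picks the best by 10-fold CV
CV_results <- do_cv(covariates, HTRX_matrix,
                    train_proportion = 0.5, sim_times = 1,
                    featurecap = 4, usebinary = 1,
                    method = "simple", criteria = "BIC",
                    gain = TRUE, runparallel = FALSE)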
The code for HTRX is hosted at https://github.com/YaolingYang/HTRX.
Maintainer: Yaoling Yang [email protected]
Authors: Yaoling Yang [email protected], Daniel Lawson [email protected]
Yang Y, Lawson DJ. HTRX: an R package for learning non-contiguous haplotypes associated with a phenotype. Bioinformatics Advances 3(1) (2023): vbad038.
Barrie, W., Yang, Y., Irving-Pease, E.K. et al. Elevated genetic risk for multiple sclerosis emerged in steppe pastoralist populations. Nature 625, 321–328 (2024).
Efron, B. "Bootstrap methods: another look at the jackknife." The Annals of Statistics 7 (1979): 1-26.
Schwarz, Gideon. "Estimating the dimension of a model." The Annals of Statistics (1978): 461-464.
McFadden, Daniel. "Conditional logit analysis of qualitative choice behavior." (1973).
Akaike, Hirotugu. "A new look at the statistical model identification." IEEE Transactions on Automatic Control 19.6 (1974): 716-723.
Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996): 267-288.
Compute the variance explained by a linear or generalized linear model.
mypredict(model, newdata)

computeR2(pred, outcome, usebinary = 1)
model: a fitted model, which is the output of themodel.
newdata: a data frame which contains all the variables included in the model. Predictions are made on this data frame.
pred: a vector of the predicted outcome.
outcome: a vector of the actual outcome.
usebinary: a non-negative number representing different models. Use a linear model if usebinary=0; otherwise, use a logistic regression model. By default, usebinary=1.
The variance explained by a linear model is based on the conventional R2. As for logistic regression, we use McFadden's R2.
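For reference, a minimal sketch of both quantities (this is an illustration, not the package's internal code; the glm call and variable names are illustrative):

## conventional R2 for a linear model: 1 - residual SS / total SS
r2_linear <- function(pred, outcome) {
  1 - sum((outcome - pred)^2) / sum((outcome - mean(outcome))^2)
}

## McFadden's R2 for logistic regression: 1 - logLik(full model) / logLik(null model)
r2_mcfadden <- function(full_model, null_model) {
  1 - as.numeric(logLik(full_model)) / as.numeric(logLik(null_model))
}

## illustrative usage with base R glm
x <- runif(100)
y <- rbinom(100, 1, plogis(2 * x - 1))
full <- glm(y ~ x, family = binomial)
null <- glm(y ~ 1, family = binomial)
r2_mcfadden(full, null)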
mypredict returns a vector of the predicted outcome.
computeR2 returns the variance explained by the linear model (conventional R2) or by the generalized linear model (McFadden's R2).
McFadden, Daniel. "Conditional logit analysis of qualitative choice behavior." (1973).
## create datasets
x=matrix(runif(100,-2,2),ncol=5)
outcome=(0.5*x[,2] - 0.8*x[,4] + 0.3*x[,5])>runif(100,-2,2)

## create binary outcome
outcome[outcome]=1
data=data.frame(outcome,x)

## compute the variance explained by features
model=themodel(outcome~.,data[1:80,],usebinary=1)
outcome_predict=mypredict(model,data[81:100,])
computeR2(outcome_predict,data[81:100,'outcome'],usebinary=1)
kfold_split splits data into k folds of equal size, which is used for cross-validation.
twofold_split splits data into two folds, which samples the training set.
Both stratified sampling and simple sampling are allowed.
Details can be found in functions do_cv and do_cumulative_htrx.
kfold_split(outcome, fold, method = "simple")

twofold_split(outcome, train_proportion = 0.5, method = "simple")
outcome: a vector of the variable (usually the outcome) based on which the data is going to be stratified. This only works when method="stratified".
fold: a positive integer specifying how many folds the data should be split into.
method: the method to be used for data splitting, either "simple" (default) or "stratified".
train_proportion: a positive number between 0 and 1 giving the proportion of the training dataset when splitting data into two folds. By default, train_proportion=0.5.
Stratified sampling works only when the outcome variable is binary (either 0 or 1), and it ensures that each fold has almost the same number of samples with outcome=0 and outcome=1.
Simple sampling randomly splits the data into k folds.
The two-fold data split is used to select candidate models in Step 1 of HTRX or cumulative HTRX, while the k-fold data split is used for 10-fold cross-validation in Step 2, which aims at selecting the best model.
Both functions return a list containing the indexes of different folds.
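For instance, a quick check that stratified folds preserve the case proportion (illustrative only, assuming each element of the returned list is a vector of row indexes as described above):

## create a binary outcome with 20% prevalence
outcome <- rbinom(200, 1, 0.2)

## stratified 10-fold split
folds <- kfold_split(outcome, 10, "stratified")

## the number of cases per fold should be nearly constant
sapply(folds, function(idx) sum(outcome[idx]))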
## create the binary outcome (20% prevalence)
outcome=rbinom(200,1,0.2)

## simple sampling (10 folds)
kfold_split(outcome,10)

## stratified sampling (10 folds)
kfold_split(outcome,10,"stratified")

## stratified sampling (2 folds, with 50% training data)
twofold_split(outcome,0.5,"stratified")
Two-step cross-validation used to select the best HTRX model for longer haplotypes, i.e. haplotypes that include at least 7 single nucleotide polymorphisms (SNPs).
do_cumulative_htrx(
  data_nosnp, hap1, hap2 = hap1, train_proportion = 0.5, sim_times = 5,
  featurecap = 40, usebinary = 1, randomorder = TRUE, fixorder = NULL,
  method = "simple", criteria = "BIC", gain = TRUE, nmodel = 3,
  runparallel = FALSE, mc.cores = 6, rareremove = FALSE, rare_threshold = 0.001,
  L = 6, dataseed = 1:sim_times, fold = 10, kfoldseed = 123, htronly = FALSE,
  max_int = NULL, returnwork = FALSE, verbose = FALSE
)

do_cumulative_htrx_step1(
  data_nosnp, hap1, hap2 = hap1, train_proportion = 0.5, featurecap = 40,
  usebinary = 1, randomorder = TRUE, fixorder = NULL, method = "simple",
  criteria = "BIC", nmodel = 3, splitseed = 123, gain = TRUE,
  runparallel = FALSE, mc.cores = 6, rareremove = FALSE, rare_threshold = 0.001,
  L = 6, htronly = FALSE, max_int = NULL, verbose = FALSE
)

extend_haps(
  data_nosnp, featuredata, train, featurecap = dim(featuredata)[2],
  usebinary = 1, gain = TRUE, runparallel = FALSE, mc.cores = 6, verbose = FALSE
)

make_cumulative_htrx(
  hap1, hap2 = hap1, featurename, rareremove = FALSE, rare_threshold = 0.001,
  htronly = FALSE, max_int = NULL
)
data_nosnp: a data frame with the outcome (the outcome must be the first column with colnames(data_nosnp)[1]="outcome") and fixed covariates (for example, sex, age and the first 18 PCs), if there are any, and without SNPs or haplotypes.
hap1: a data frame of the SNPs' genotype of the first genome. The genotype of a SNP for each individual is either 0 (reference allele) or 1 (alternative allele).
hap2: a data frame of the SNPs' genotype of the second genome. The genotype of a SNP for each individual is either 0 (reference allele) or 1 (alternative allele). By default, hap2=hap1, which should be used when the data is haploid.
train_proportion: a positive number between 0 and 1 giving the proportion of the training dataset when splitting data into two folds. By default, train_proportion=0.5.
sim_times: an integer giving the number of simulations in Step 1 (see details). By default, sim_times=5.
featurecap: a positive integer which manually sets the maximum number of independent features. By default, featurecap=40.
usebinary: a non-negative number representing different models. Use a linear model if usebinary=0; otherwise, use a logistic regression model. By default, usebinary=1.
randomorder: logical. If randomorder=TRUE (default), SNPs are added in a random order in cumulative HTRX.
fixorder: a vector of the fixed order in which SNPs are added in cumulative HTRX. This only works when randomorder=FALSE. By default, fixorder=NULL.
method: the method used for data splitting, either "simple" (default) or "stratified".
criteria: the criterion for model selection, either "AIC", "BIC" (default) or "lasso".
gain: logical. If gain=TRUE (default), report the variance explained in addition to the fixed covariates; otherwise, report the total variance explained by all variables.
nmodel: a positive integer specifying the number of candidate models that the criterion selects. By default, nmodel=3.
runparallel: logical. Whether to use parallel programming (based on mclapply) or not. By default, runparallel=FALSE.
mc.cores: an integer giving the number of cores used for parallel programming. By default, mc.cores=6. This only works when runparallel=TRUE.
rareremove: logical. Remove rare SNPs and haplotypes or not. By default, rareremove=FALSE.
rare_threshold: a numeric number; haplotypes or SNPs with frequency below this threshold are removed. This only works when rareremove=TRUE. By default, rare_threshold=0.001.
L: a positive integer. Cumulative HTRX starts with haplotype templates containing L SNPs. By default, L=6.
dataseed: a vector of the seed that each simulation in Step 1 (see details) uses. The length of dataseed must equal sim_times. By default, dataseed=1:sim_times.
fold: a positive integer specifying how many folds the data should be split into for cross-validation. By default, fold=10.
kfoldseed: a positive integer specifying the seed used to split data for k-fold cross-validation. By default, kfoldseed=123.
htronly: logical. If htronly=TRUE, only haplotypes with interactions across all the SNPs (i.e. HTR features) are considered. By default, htronly=FALSE.
max_int: a positive integer which specifies the maximum number of SNPs that can interact. If no value is given, interactions between all the SNPs will be considered.
returnwork: logical. Whether to return additional working output from each simulation. By default, returnwork=FALSE.
verbose: logical. If verbose=TRUE, print the progress of the analysis. By default, verbose=FALSE.
splitseed: a positive integer giving the seed that a single simulation in Step 1 (see details) uses. By default, splitseed=123.
featuredata: a data frame of the feature data, e.g. haplotype data created by HTRX or SNPs. These features exclude all the data in data_nosnp.
train: a vector of the indexes of the training data.
featurename: a character giving the names of features (haplotypes).
Longer haplotypes are important for discovering interactions. However, HTRX produces 3^k - 1 haplotypes if the region contains k SNPs, making HTRX (do_cv) unrealistic to apply to regions with large numbers of SNPs. To address this issue, we proposed "cumulative HTRX" (do_cumulative_htrx), which enables HTRX to run on longer haplotypes, i.e. haplotypes which include at least 7 SNPs (recommended).
There are 2 steps to implement cumulative HTRX.
Step 1: extend haplotypes and select candidate models.
(1) Randomly sample a subset (50% by default) of the whole data, using stratified sampling when the outcome is binary. This subset is used for all the analysis in (2) and (3);
(2) Start with L randomly chosen SNPs from the entire k SNPs, and keep the top M haplotypes that are chosen from the forward regression. Then add another SNP to the M haplotypes to create 3M+2 haplotypes. There are 3M haplotypes obtained by adding "0", "1" or "X" to the previous M haplotypes, as well as 2 bases of the added SNP, i.e. "XX...X0" and "XX...X1" (as "X" was implicitly used in the previous step). The top M haplotypes from them are then selected using forward regression. Repeat this process until obtaining M haplotypes which include k-1 SNPs;
(3) Add the last SNP to create 3M+2 haplotypes.
Afterwards, if criteria="AIC" or criteria="BIC", start from a model with fixed covariates (e.g. 18 PCs, sex and age) and perform forward regression on the subset, i.e. iteratively add the feature (in addition to the fixed covariates) whose inclusion enables the model to explain the largest variance, and select the s models with the lowest Akaike information criterion (AIC) or Bayesian information criterion (BIC) to enter the candidate model pool;
If criteria="lasso", use the least absolute shrinkage and selection operator (lasso) to directly select the best s models to enter the candidate model pool;
(4) repeat (1)-(3) B times, and select all the different models in the candidate model pool as the candidate models.
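As a quick arithmetic check of the haplotype counts in (2)-(3) above (the value of M here is purely illustrative):

M <- 40                  # number of haplotypes kept after the previous extension step
candidates <- 3 * M + 2  # "0", "1" or "X" appended to each of the M haplotypes,
                         # plus the two single-SNP templates "XX...X0" and "XX...X1"
candidates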
Step 2: select the best model using k-fold cross-validation.
(1) Randomly split the whole data into k groups with approximately equal sizes, using stratified sampling when the outcome is binary;
(2) In each of the k folds, use a fold as the validation dataset, a fold as the test dataset, and the remaining folds as the training dataset. Then, fit all the candidate models on the training dataset, and use these fitted models to compute the additional variance explained by features (out-of-sample variance explained) in the validation and test dataset. Finally, select the candidate model with the biggest average out-of-sample variance explained in the validation set as the best model, and report the out-of-sample variance explained in the test set.
Function do_cumulative_htrx_step1 performs Step 1 (1)-(3) described above.
Function extend_haps is used to select haplotypes in Step 1 (2) described above.
Function make_cumulative_htrx is used to generate the haplotype data (by adding a new SNP into the haplotypes) from M haplotypes to 3M+2 haplotypes, as described in Step 1 (2)-(3).
When investigating haplotypes with interactions between at most 2 SNPs, L is suggested to be no bigger than 10. When investigating haplotypes with interactions between at most 3 SNPs, L should not be bigger than 9. If haplotypes with interactions between more than 4 SNPs are investigated, L is suggested to be 6 (which is the default value).
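To see how quickly the number of features grows with L, a small illustration (3^L - 1 is the total number of HTRX features when all interaction orders are allowed; htrx_nfeatures is documented below):

## total HTRX features for a starting template of L SNPs, all interaction orders allowed
sapply(6:10, function(L) 3^L - 1)

## far fewer features when interactions are restricted, e.g. to at most 2 SNPs
sapply(6:10, function(L) HTRX::htrx_nfeatures(nsnp = L, max_int = 2))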
do_cumulative_htrx returns a list containing the best model selected and the out-of-sample variance explained in each test set.
do_cumulative_htrx_step1 returns a list of three candidate models selected by a single simulation.
extend_haps returns a character of the names of the selected features.
make_cumulative_htrx returns a data frame of the haplotype matrix.
Yang Y, Lawson DJ. HTRX: an R package for learning non-contiguous haplotypes associated with a phenotype. Bioinformatics Advances 3(1) (2023): vbad038.
Barrie, W., Yang, Y., Irving-Pease, E.K. et al. Elevated genetic risk for multiple sclerosis emerged in steppe pastoralist populations. Nature 625, 321–328 (2024).
Efron, B. "Bootstrap methods: another look at the jackknife." The Annals of Statistics 7 (1979): 1-26.
Schwarz, Gideon. "Estimating the dimension of a model." The Annals of Statistics (1978): 461-464.
McFadden, Daniel. "Conditional logit analysis of qualitative choice behavior." (1973).
Akaike, Hirotugu. "A new look at the statistical model identification." IEEE Transactions on Automatic Control 19.6 (1974): 716-723.
Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996): 267-288.
## use dataset "example_hap1", "example_hap2" and "example_data_nosnp"
## "example_hap1" and "example_hap2" are
## both genomes of 8 SNPs for 5,000 individuals (diploid data)
## "example_data_nosnp" is a simulated dataset
## which contains the outcome (binary), sex, age and 18 PCs

## visualise the covariates data
## we will use only the first two covariates: sex and age in the example
head(HTRX::example_data_nosnp)

## visualise the genotype data for the first genome
head(HTRX::example_hap1)

## we perform cumulative HTRX on all the 8 SNPs using 2-step cross-validation
## to compute additional variance explained by haplotypes
## If the data is haploid, please set hap2=HTRX::example_hap1
## If you want to compute total variance explained, please set gain=FALSE
## For Linux/MAC users, we recommend setting runparallel=TRUE
cumu_CV_results <- do_cumulative_htrx(HTRX::example_data_nosnp[1:500,1:3],
                                      HTRX::example_hap1[1:500,],
                                      HTRX::example_hap2[1:500,],
                                      train_proportion=0.5,sim_times=1,
                                      featurecap=10,usebinary=1,
                                      randomorder=TRUE,method="stratified",
                                      criteria="BIC",gain=TRUE,
                                      runparallel=FALSE,verbose=TRUE)

#This result would be more precise when setting larger sim_times and featurecap
Two-step cross-validation used to select the best HTRX model. It can be applied to select haplotypes based on HTR, or select single nucleotide polymorphisms (SNPs).
do_cv(
  data_nosnp, featuredata, train_proportion = 0.5, sim_times = 5,
  featurecap = dim(featuredata)[2], usebinary = 1, method = "simple",
  criteria = "BIC", gain = TRUE, nmodel = 3, dataseed = 1:sim_times,
  runparallel = FALSE, mc.cores = 6, fold = 10, kfoldseed = 123,
  returnwork = FALSE, verbose = FALSE
)

do_cv_step1(
  data_nosnp, featuredata, train_proportion = 0.5,
  featurecap = dim(featuredata)[2], usebinary = 1, method = "simple",
  criteria = "BIC", nmodel = 3, splitseed = 123, runparallel = FALSE,
  mc.cores = 6, verbose = FALSE
)

infer_step1(
  data_nosnp, featuredata, train, criteria = "BIC",
  featurecap = dim(featuredata)[2], usebinary = 1, nmodel = nmodel,
  runparallel = FALSE, mc.cores = 6, verbose = FALSE
)

infer_fixedfeatures(
  data_nosnp, featuredata, train = (1:nrow(data_nosnp))[-test], test, features,
  coefficients = NULL, gain = TRUE, usebinary = 1, R2only = FALSE,
  verbose = FALSE
)
data_nosnp: a data frame with the outcome (the outcome must be the first column with colnames(data_nosnp)[1]="outcome") and fixed covariates (for example, sex, age and the first 18 PCs), if there are any, and without SNPs or haplotypes.
featuredata: a data frame of the feature data, e.g. haplotype data created by HTRX or SNPs. These features exclude all the data in data_nosnp.
train_proportion: a positive number between 0 and 1 giving the proportion of the training dataset when splitting data into two folds. By default, train_proportion=0.5.
sim_times: an integer giving the number of simulations in Step 1 (see details). By default, sim_times=5.
featurecap: a positive integer which manually sets the maximum number of independent features. By default, featurecap=dim(featuredata)[2].
usebinary: a non-negative number representing different models. Use a linear model if usebinary=0; otherwise, use a logistic regression model. By default, usebinary=1.
method: the method used for data splitting, either "simple" (default) or "stratified".
criteria: the criterion for model selection, either "AIC", "BIC" (default) or "lasso".
gain: logical. If gain=TRUE (default), report the variance explained in addition to the fixed covariates; otherwise, report the total variance explained by all variables.
nmodel: a positive integer specifying the number of candidate models that the criterion selects. By default, nmodel=3.
dataseed: a vector of the seed that each simulation in Step 1 (see details) uses. The length of dataseed must equal sim_times. By default, dataseed=1:sim_times.
runparallel: logical. Whether to use parallel programming (based on mclapply) or not. By default, runparallel=FALSE.
mc.cores: an integer giving the number of cores used for parallel programming. By default, mc.cores=6. This only works when runparallel=TRUE.
fold: a positive integer specifying how many folds the data should be split into for cross-validation. By default, fold=10.
kfoldseed: a positive integer specifying the seed used to split data for k-fold cross-validation. By default, kfoldseed=123.
returnwork: logical. Whether to return additional working output from each simulation. By default, returnwork=FALSE.
verbose: logical. If verbose=TRUE, print the progress of the analysis. By default, verbose=FALSE.
splitseed: a positive integer giving the seed of the data split. By default, splitseed=123.
train: a vector of the indexes of the training data.
test: a vector of the indexes of the test data.
features: a character of the names of the fixed features, excluding the intercept.
coefficients: a vector giving the coefficients of the fixed features, including the intercept. If the fixed features don't have fixed coefficients, set coefficients=NULL (default).
R2only: logical. If R2only=TRUE, return only the variance explained in the test set (see Value). By default, R2only=FALSE.
Function do_cv is the main function used for selecting haplotypes from HTRX or SNPs. It is a two-step algorithm and is used to alleviate overfitting.
Step 1: select candidate models. This is to address the model search problem, and is chosen to obtain a set of models more diverse than traditional bootstrap resampling.
(1) Randomly sample a subset (50% by default) of the whole data. Specifically, when the outcome is binary, stratified sampling is used to ensure the subset has approximately the same proportion of cases and controls as the whole data;
(2) If criteria="AIC" or criteria="BIC", start from a model with fixed covariates (e.g. 18 PCs, sex and age) and perform forward regression on the subset, i.e. iteratively add the feature (in addition to the fixed covariates) whose inclusion enables the model to explain the largest variance, and select the s models with the lowest Akaike information criterion (AIC) or Bayesian information criterion (BIC) to enter the candidate model pool;
If criteria="lasso", use the least absolute shrinkage and selection operator (lasso) to directly select the best s models to enter the candidate model pool;
(3) repeat (1)-(2) B times, and select all the different models in the candidate model pool as the candidate models.
Step 2: select the best model using k-fold cross-validation.
(1) Randomly split the whole data into k groups with approximately equal sizes, using stratified sampling when the outcome is binary;
(2) In each of the k folds, use a fold as the validation dataset, a fold as the test dataset, and the remaining folds as the training dataset. Then, fit all the candidate models on the training dataset, and use these fitted models to compute the additional variance explained by features (out-of-sample variance explained) in the validation and test dataset. Finally, select the candidate model with the biggest average out-of-sample variance explained in the validation set as the best model, and report the out-of-sample variance explained in the test set.
Function do_cv_step1 performs Step 1 (1)-(2) described above.
Function infer_step1 performs Step 1 (2) described above.
Function infer_fixedfeatures is used to fit all the candidate models on the training dataset and compute the additional variance explained by features (out-of-sample R2) in the test dataset, as described in Step 2 (2) above.
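Conceptually, the out-of-sample gain computed here can be sketched with the package's documented helpers (this is an illustration of the idea, not the internal code of infer_fixedfeatures; the column layout follows the argument descriptions above):

## sketch: additional out-of-sample variance explained by selected features
## data_nosnp: outcome (first column, named "outcome") plus fixed covariates
## featuredata: the selected haplotype features for the same individuals
oos_gain <- function(data_nosnp, featuredata, train, test, usebinary = 1) {
  full_data <- cbind(data_nosnp, featuredata)
  ## null model: outcome ~ fixed covariates only, fitted on the training data
  null_model <- themodel(outcome ~ ., data_nosnp[train, , drop = FALSE], usebinary = usebinary)
  ## full model: outcome ~ fixed covariates + selected features
  full_model <- themodel(outcome ~ ., full_data[train, , drop = FALSE], usebinary = usebinary)
  ## out-of-sample variance explained by each model on the test data
  r2_null <- computeR2(mypredict(null_model, data_nosnp[test, , drop = FALSE]),
                       data_nosnp[test, "outcome"], usebinary = usebinary)
  r2_full <- computeR2(mypredict(full_model, full_data[test, , drop = FALSE]),
                       full_data[test, "outcome"], usebinary = usebinary)
  r2_full - r2_null
}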
do_cv returns a list containing the best model selected and the out-of-sample variance explained in each test set.
do_cv_step1 and infer_step1 return a list of three candidate models selected by a single simulation.
infer_fixedfeatures returns a list of the variance explained in the test set if R2only=TRUE; otherwise, it returns a list of the variance explained in the test set, the model including all the variables, and the null model, i.e. the model with outcome and fixed covariates only.
Yang Y, Lawson DJ. HTRX: an R package for learning non-contiguous haplotypes associated with a phenotype. Bioinformatics Advances 3(1) (2023): vbad038.
Barrie, W., Yang, Y., Irving-Pease, E.K. et al. Elevated genetic risk for multiple sclerosis emerged in steppe pastoralist populations. Nature 625, 321–328 (2024).
Efron, B. "Bootstrap methods: another look at the jackknife." The Annals of Statistics 7 (1979): 1-26.
Schwarz, Gideon. "Estimating the dimension of a model." The Annals of Statistics (1978): 461-464.
McFadden, Daniel. "Conditional logit analysis of qualitative choice behavior." (1973).
Akaike, Hirotugu. "A new look at the statistical model identification." IEEE Transactions on Automatic Control 19.6 (1974): 716-723.
Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996): 267-288.
## use dataset "example_hap1", "example_hap2" and "example_data_nosnp"
## "example_hap1" and "example_hap2" are
## both genomes of 8 SNPs for 5,000 individuals (diploid data)
## "example_data_nosnp" is an example dataset
## which contains the outcome (binary), sex, age and 18 PCs

## visualise the covariates data
## we will use only the first two covariates: sex and age in the example
head(HTRX::example_data_nosnp)

## visualise the genotype data for the first genome
head(HTRX::example_hap1)

## we perform HTRX on the first 4 SNPs
## we first generate all the haplotype data, as defined by HTRX
HTRX_matrix=make_htrx(HTRX::example_hap1[1:300,1:4],
                      HTRX::example_hap2[1:300,1:4])

## If the data is haploid, please set
## HTRX_matrix=make_htrx(HTRX::example_hap1[1:300,1:4],
##                       HTRX::example_hap1[1:300,1:4])

## then perform HTRX using 2-step cross-validation in a single small example
## to compute additional variance explained by haplotypes
## If you want to compute total variance explained, please set gain=FALSE
CV_results <- do_cv(HTRX::example_data_nosnp[1:300,1:2],
                    HTRX_matrix,train_proportion=0.5,
                    sim_times=1,featurecap=4,usebinary=1,
                    method="simple",criteria="BIC",
                    gain=TRUE,runparallel=FALSE,verbose=TRUE)

#This result would be more precise when setting larger sim_times and featurecap
Direct k-fold cross-validation used to compute the out-of-sample variance explained by selected features from HTRX. It can be applied to select haplotypes based on HTR, or select single nucleotide polymorphisms (SNPs).
do_cv_direct(
  data_nosnp, featuredata, featurecap = dim(featuredata)[2], usebinary = 1,
  method = "simple", criteria = "BIC", gain = TRUE, runparallel = FALSE,
  mc.cores = 6, fold = 10, kfoldseed = 123, verbose = FALSE
)
data_nosnp: a data frame with the outcome (the outcome must be the first column) and fixed covariates (for example, sex, age and the first 18 PCs), if there are any, and without SNPs or haplotypes.
featuredata: a data frame of the feature data, e.g. haplotype data created by HTRX or SNPs. These features exclude all the data in data_nosnp.
featurecap: a positive integer which manually sets the maximum number of independent features. By default, featurecap=dim(featuredata)[2].
usebinary: a non-negative number representing different models. Use a linear model if usebinary=0; otherwise, use a logistic regression model. By default, usebinary=1.
method: the method used for data splitting, either "simple" (default) or "stratified".
criteria: the criterion for model selection, either "AIC", "BIC" (default) or "lasso".
gain: logical. If gain=TRUE (default), report the variance explained in addition to the fixed covariates; otherwise, report the total variance explained by all variables.
runparallel: logical. Whether to use parallel programming (based on mclapply) or not. By default, runparallel=FALSE.
mc.cores: an integer giving the number of cores used for parallel programming. By default, mc.cores=6. This only works when runparallel=TRUE.
fold: a positive integer specifying how many folds the data should be split into for cross-validation. By default, fold=10.
kfoldseed: a positive integer specifying the seed used to split data for k-fold cross-validation. By default, kfoldseed=123.
verbose: logical. If verbose=TRUE, print the progress of the analysis. By default, verbose=FALSE.
Function do_cv_direct directly performs k-fold cross-validation: features are selected from the training set using the specified criterion, and the out-of-sample variance explained by the selected features is computed on the test set. This function runs faster than do_cv with large sim_times, but may lose some accuracy, and it doesn't return a fixed set of features.
do_cv_direct returns a list of the out-of-sample variance explained in each of the test sets, and the features selected in each of the k training sets.
Yang Y, Lawson DJ. HTRX: an R package for learning non-contiguous haplotypes associated with a phenotype. Bioinformatics Advances 3(1) (2023): vbad038.
Barrie, W., Yang, Y., Irving-Pease, E.K. et al. Elevated genetic risk for multiple sclerosis emerged in steppe pastoralist populations. Nature 625, 321–328 (2024).
Efron, B. "Bootstrap methods: another look at the jackknife." The Annals of Statistics 7 (1979): 1-26.
Schwarz, Gideon. "Estimating the dimension of a model." The Annals of Statistics (1978): 461-464.
McFadden, Daniel. "Conditional logit analysis of qualitative choice behavior." (1973).
Akaike, Hirotugu. "A new look at the statistical model identification." IEEE Transactions on Automatic Control 19.6 (1974): 716-723.
Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996): 267-288.
## use dataset "example_hap1", "example_hap2" and "example_data_nosnp"
## "example_hap1" and "example_hap2" are
## both genomes of 8 SNPs for 5,000 individuals (diploid data)
## "example_data_nosnp" is an example dataset
## which contains the outcome (binary), sex, age and 18 PCs

## visualise the covariates data
## we will use only the first two covariates: sex and age in the example
head(HTRX::example_data_nosnp)

## visualise the genotype data for the first genome
head(HTRX::example_hap1)

## we perform HTRX on the first 4 SNPs
## we first generate all the haplotype data, as defined by HTRX
HTRX_matrix=make_htrx(HTRX::example_hap1[,1:4],
                      HTRX::example_hap2[,1:4])

## If the data is haploid, please set
## HTRX_matrix=make_htrx(HTRX::example_hap1[,1:4],
##                       HTRX::example_hap1[,1:4])

## next compute the maximum number of independent features
featurecap=htrx_max(nsnp=4,cap=10)

## then perform HTRX using direct cross-validation
## If we want to compute the total variance explained
## we can set gain=FALSE in the above example
htrx_results <- do_cv_direct(HTRX::example_data_nosnp[,1:3],
                             HTRX_matrix,featurecap=featurecap,
                             usebinary=1,method="stratified",
                             criteria="lasso",gain=TRUE,
                             runparallel=FALSE,verbose=TRUE)
Example covariate data including outcome (binary), sex, age and 18 PCs for 5,000 individuals.
data("example_data_nosnp")
A data frame with 5,000 observations on a binary outcome named outcome and 20 numeric covariates named sex, age and PC1-PC18.
data(example_data_nosnp)
Example genotype data for the first genome of 8 SNPs for 5,000 individuals.
data("example_hap1")
A data frame with 5,000 observations on 8 binary variables named SNP1-SNP8 ("0" denotes the reference allele while "1" denotes the alternative allele).
data(example_hap1)
Example genotype data for the second genome of 8 SNPs for 5,000 individuals.
data("example_hap2")
A data frame with 5,000 observations on 8 binary variables named SNP1-SNP8 ("0" denotes the reference allele while "1" denotes the alternative allele).
data(example_hap2)
The maximum number of independent features in principle from haplotypes (i.e. interactions between SNPs) generated by Haplotype Trend Regression with eXtra flexibility (HTRX).
htrx_max(nsnp, n_haps = NULL, cap = 40, max_int = NULL, htr = FALSE)
nsnp: a positive integer giving the number of single nucleotide polymorphisms (SNPs) included in the haplotypes.
n_haps: a positive integer giving the number of haplotypes, which is also the number of columns of the HTRX or HTR matrix.
cap: a positive integer which manually sets the maximum number of independent features. By default, cap=40.
max_int: a positive integer which specifies the maximum number of SNPs that can interact. If no value is given (by default), interactions between all the SNPs will be considered.
htr: logical. If htr=TRUE, the features are haplotypes created by HTR instead of HTRX. By default, htr=FALSE.
The maximum number of independent features in principle is 2^nsnp - 1 for haplotypes containing interactions between all different numbers of SNPs. However, if max_int < nsnp, i.e. only the interactions between at most max_int SNPs are investigated, the maximum number of independent features is smaller. You can also manually set the upper limit of independent features that can be included in the final HTRX or HTR model (by setting cap).
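For illustration, two calls following the argument descriptions above (the expected values follow the 2^nsnp - 1 rule and the cap; this is a usage sketch, not additional documented behaviour):

## in principle 2^4 - 1 = 15 independent features for 4 SNPs with all interaction orders
htrx_max(nsnp = 4, n_haps = (3^4 - 1))

## manually cap the number of independent features at 10
htrx_max(nsnp = 4, n_haps = (3^4 - 1), cap = 10)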
htrx_max returns a positive integer giving the maximum number of independent features to be included in the analysis.
## the maximum number of independent haplotypes consisting of 4 SNPs from HTRX
htrx_max(nsnp=4,n_haps=(3^4-1))
The total number of features in principle from haplotypes (i.e. interactions between SNPs) generated by Haplotype Trend Regression with eXtra flexibility (HTRX).
htrx_nfeatures(nsnp, max_int = NULL, htr = FALSE)
nsnp: a positive integer giving the number of single nucleotide polymorphisms (SNPs) included in the haplotypes.
max_int: a positive integer which specifies the maximum number of SNPs that can interact. If no value is given (by default), interactions between all the SNPs will be considered.
htr: logical. If htr=TRUE, the features are haplotypes created by HTR instead of HTRX. By default, htr=FALSE.
The total number of features in principle is 3^nsnp - 1 for haplotypes containing interactions between all different numbers of SNPs. However, if max_int < nsnp, i.e. only the interactions between at most max_int SNPs are investigated, the total number of features is smaller.
htrx_nfeatures returns a positive integer giving the total number of features that each analysis includes.
## the total number of haplotypes consisting of 6 SNPs
## for at most 3-SNP interactions
htrx_nfeatures(nsnp=6,max_int=3)
Generate the feature data: either the genotype data for single nucleotide polymorphisms (SNPs) (make_snp), the feature data for Haplotype Trend Regression (HTR) (make_htr), or the feature data for Haplotype Trend Regression with eXtra flexibility (HTRX) (make_htrx).
make_htrx(
  hap1, hap2 = hap1, rareremove = FALSE, rare_threshold = 0.001,
  fixedfeature = NULL, max_int = NULL
)

make_htr(hap1, hap2 = hap1, rareremove = FALSE, rare_threshold = 0.001)

make_snp(hap1, hap2 = hap1, rareremove = FALSE, rare_threshold = 0.001)
hap1: a data frame of the SNPs' genotype of the first genome. The genotype of a SNP for each individual is either 0 (reference allele) or 1 (alternative allele).
hap2: a data frame of the SNPs' genotype of the second genome. The genotype of a SNP for each individual is either 0 (reference allele) or 1 (alternative allele). By default, hap2=hap1, which should be used when the data is haploid.
rareremove: logical. Remove rare SNPs and haplotypes or not. By default, rareremove=FALSE.
rare_threshold: a numeric number; haplotypes or SNPs with frequency below this threshold are removed. This only works when rareremove=TRUE. By default, rare_threshold=0.001.
fixedfeature: a character vector of the names of haplotypes. If fixedfeature=NULL (default), all the HTRX haplotypes are generated; otherwise, only the specified haplotypes are created.
max_int: a positive integer which specifies the maximum number of SNPs that can interact. If no value is given, interactions between all the SNPs will be considered.
If there are n SNPs, there are 2^n different haplotypes created by HTR, and 3^n - 1 different haplotypes created by HTRX.
When the data is haploid, please use the default setting hap2=hap1.
a data frame of the feature data (either for SNPs, HTR or HTRX).
## create SNP data for both genomes (diploid data)
hap1=as.data.frame(matrix(0,nrow=100,ncol=4))
hap2=as.data.frame(matrix(0,nrow=100,ncol=4))
colnames(hap1)=colnames(hap2)=c('a','b','c','d')
p=runif(4,0.01,0.99)
for(j in 1:4){
  hap1[,j]=rbinom(100,1,p[j])
  hap2[,j]=rbinom(100,1,p[j])
}

## create the SNP data without removing rare SNPs
make_snp(hap1,hap2)

## create feature data for "HTR" removing haplotypes rarer than 0.5%
make_htr(hap1,hap2,rareremove=TRUE,0.005)

## create feature data for "HTRX"
## retaining haplotypes with interaction across at most 3 SNPs
make_htrx(hap1,hap2,max_int=3)

## create feature data for feature "01XX" and "X101"
## without removing haplotypes
make_htrx(hap1,hap2,fixedfeature=c("01XX","X101"))

## If the data is haploid instead of diploid
## create feature data for "HTRX" without removing haplotypes
make_htrx(hap1,hap1)
Model-agnostic functions for model fitting (both linear and generalized linear models).
themodel(formula, data, usebinary = 1, clean = TRUE)
formula: a formula for model fitting, starting with outcome~.
data: a data frame containing all the variables included in the formula. The outcome must be the first column with colnames(data)[1]="outcome".
usebinary: a non-negative number representing different models. Use a linear model if usebinary=0; otherwise, use a logistic regression model. By default, usebinary=1.
clean: logical. If clean=TRUE (default), unnecessary components are removed from the fitted model to reduce memory usage.
This function returns a fitted model (either a linear model or a logistic regression model).
For logistic regression, we use the function fastglm from the fastglm package, which is much faster than glm.
a fitted model.
## create the dataset for variables and outcome
x=matrix(runif(100,-2,2),ncol=5)
outcome=0.5*x[,2] - 0.8*x[,4] + 0.3*x[,5]
data1=data.frame(outcome,x)

## fit a linear model
themodel(outcome~.,data1,usebinary=0)

## create binary outcome
outcome=outcome>runif(100,-2,2)
outcome[outcome]=1
data2=data.frame(outcome,x)

## fit a logistic regression model
themodel(outcome~.,data2,usebinary=1)