Title: | Bayesian Whole-Genome Regression |
---|---|
Description: | Whole-genome regression methods on Bayesian framework fitted via EM or Gibbs sampling, single step (<doi:10.1534/g3.119.400728>), univariate and multivariate (<doi:10.1186/s12711-022-00730-w>, <doi:10.1093/genetics/iyae179>), with optional kernel term and sampling techniques (<doi:10.1186/s12859-017-1582-3>). |
Authors: | Alencar Xavier [aut, cre] , William Muir [aut], David Habier [aut], Kyle Kocak [aut], Shizhong Xu [aut], Katy Rainey [aut] |
Maintainer: | Alencar Xavier <[email protected]> |
License: | GPL-3 |
Version: | 2.2.13 |
Built: | 2024-12-05 19:33:21 UTC |
Source: | CRAN |
Whole-genome regression methods on Bayesian framework fitted via EM or Gibbs sampling, single step (<doi:10.1534/g3.119.400728>), univariate and multivariate (<doi:10.1186/s12711-022-00730-w>, <doi:10.1093/genetics/iyae179>), with optional kernel term and sampling techniques (<doi:10.1186/s12859-017-1582-3>).
Package: | bWGR |
Type: | Package |
Version: | 2.2.13 |
Date: | 2024-12-05 |
License: | GPL-3 |
Alencar Xavier, William Muir, David Habier, Kyle Kocak, Shizhong Xu, Katy Rainey. Maintainer: Alencar Xavier <[email protected]>
data(tpod) Fit = wgr(y,gen) cor(y,Fit$hat)
data(tpod) Fit = wgr(y,gen) cor(y,Fit$hat)
Two biparental crosses phenotyped for the percentage of pods containing four seeds
data(tpod)
data(tpod)
Soybean nested association panel with 2 families () containing 196 indiviuals. Genotypic matrix (
) have 376 SNP across 20 chromosome (
). Phenotypic information (
) regards the proportion of tetra-seed pods. Data provided by Rainey Lab for Soybean Breeding and Genetics, Purdue University.
Alencar Xavier and Katy Rainey
Univariate model to find breeding values through regression with optional resampling techniques (Xavier et al. 2017) and polygenic term (Kernel). See "Details" for additional standalone functions written in C++.
wgr(y,X,it=1500,bi=500,th=1,bag=1,rp=FALSE,iv=FALSE,de=FALSE, pi=0,df=5,R2=0.5,eigK=NULL,VarK=0.95,verb=FALSE)
wgr(y,X,it=1500,bi=500,th=1,bag=1,rp=FALSE,iv=FALSE,de=FALSE, pi=0,df=5,R2=0.5,eigK=NULL,VarK=0.95,verb=FALSE)
y |
Numeric vector of observations ( |
X |
Numeric matrix containing the genotypic data. A matrix with |
it |
Integer. Number of iterations or samples to be generated. |
bi |
Integer. Burn-in, the number of iterations or samples to be discarted. |
th |
Integer. Thinning parameter, used to save memory by storing only one every 'th' samples. |
bag |
If different than one, it indicates the proportion of data to be subsampled in each Markov chain. For datasets with moderate number of observations, values of bag from 0.30 to 0.60 may speed up computation without losses in predicion properties. This argument enable users to enhance MCMC through subsampling (Xavier et al. 2017). |
rp |
Logical. Use replacement for bootstrap samples when bag is different than one. |
iv |
Logical. Assign markers independent variance, a T prior from a mixture of normals. If true, turns the default model BLUP into BayesA. |
de |
Logical. Assign markers independent variance through double-exponential prior. If true, turns the default model BLUP into Bayesian LASSO. This argument overides iv. |
pi |
Value between 0 and 1. If greater than zero it activates variable selection, where markers have expected probability pi of having null effect. |
df |
Prior degrees of freedom of variance components. |
R2 |
Expected R2, used to calculate the prior shape. |
eigK |
Output of function |
VarK |
Numeric between 0 and 1. For reduction of dimensionality. Indicates the proportion of variance explained by Eigenpairs used to fit second random effect. |
verb |
Logical. If verbose is TRUE, function displays MCMC progress bar. |
The model for the whole-genome regression is as follows:
where is the response variable,
is the intercept,
is the genotypic matrix,
is the regression coefficient or effect of an allele substitution, with
probability of being included into the model,
is the polygenic term if a kernel is used, and
is the residual term.
Users can obtain four WGR methods out of this function: BRR (pi=0,iv=F), BayesA (pi=0,iv=T), BayesB (pi=0.95,iv=T), BayesC (pi=0.95,iv=F) and Bayesian LASSO or BayesL (pi=0,de=T). Theoretical basis of each model is described by de los Campos et al. (2013).
Gibbs sampler that updates regression coefficients is adapted from GSRU algorithm (Legarra and Misztal 2008). The variable selection of functions ,
and
works through the unconditional prior algorithm proposed by Kuo and Mallick (1998), whereas
and
are based on Metropolis-Hastings. Prior shape estimates are computed as Sb = R2*df*var(y)/MSx and Se = (1-R2)*df*var(y), with an exception for
and
where the prior shape is Sb = R2*df*var(y)/MSx/(1-pi). The polygenic term is solved by Bayesian algorithm of reproducing kernel Hilbert Spaces proposed by de los Campos et al. (2010).
In addition to wgr
, standalone C++ functions available include:
01) BayesA(y,X,it=1500,bi=500,df=5,R2=0.5)
02) BayesB(y,X,it=1500,bi=500,pi=0.95,df=5,R2=0.5)
03) BayesC(y,X,it=1500,bi=500,pi=0.95,df=5,R2=0.5)
04) BayesCpi(y,X,it=1500,bi=500,df=5,R2=0.5)
05) BayesDpi(y,X,it=1500,bi=500,df=5,R2=0.5)
06) BayesL(y,X,it=1500,bi=500,df=5,R2=0.5)
07) BayesRR(y,X,it=1500,bi=500,df=5,R2=0.5)
The implementations that support two random effects include:
08) BayesA2(y,X1,X2,it=1500,bi=500,df=5,R2=0.5)
09) BayesB2(y,X1,X2,it=1500,bi=500,pi=0.95,df=5,R2=0.5)
10) BayesRR2(y,X1,X2,it=1500,bi=500,df=5,R2=0.5)
And the cross-validation for the C++ implementations, with arguments analogous to emCV
.
mcmcCV(y,gen,k=5,n=5,it=1500,bi=500,pi=0.95,df=5,R2=0.5,avg=T,llo=NULL,tbv=NULL,ReturnGebv=FALSE)
The function wgr returns a list with expected value from the marker effect (), probability of marker being in the model (
), regression coefficient (
), variance of each marker (
), the intercept (
), the polygene (
) and polygenic variance (
), residual variance (
) and the fitted value (
).
Alencar Xavier
de los Campos, G., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D., and Calus, M. P. (2013). Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics, 193(2), 327-345.
de los Campos, G., Gianola, D., Rosa, G. J., Weigel, K. A., & Crossa, J. (2010). Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genetics Research, 92(04), 295-308.
Kuo, L., & Mallick, B. (1998). Variable selection for regression models. Sankhya: The Indian Journal of Statistics, Series B, 65-81.
Legarra, A., & Misztal, I. (2008). Technical note: Computing strategies in genome-wide selection. Journal of dairy science, 91(1), 360-366.
Xavier, A., Xu, S., Muir, W., & Rainey, K. M. (2017). Genomic prediction using subsampling. BMC bioinformatics, 18(1), 191.
## Not run: # Load data data(tpod) # BLUP fit_BRR = wgr(y,gen,iv=FALSE,pi=0) cor(y,fit_BRR$hat) # BayesA fit_BayesA = wgr(y,gen,iv=TRUE,pi=0) cor(y,fit_BayesA$hat) # BayesB fit_BayesB = wgr(y,gen,iv=TRUE,pi=.95) cor(y,fit_BayesB$hat) # BayesC fit_BayesC = wgr(y,gen,iv=FALSE,pi=.95) cor(y,fit_BayesC$hat) # BayesCpi fit_BayesCpi = BayesCpi(y,gen) cor(y,fit_BayesCpi$hat) # BayesDpi fit_BayesDpi = BayesDpi(y,gen) cor(y,fit_BayesDpi$hat) # BayesL fit_BayesL = wgr(y,gen,de=TRUE) cor(y,fit_BayesL$hat) # Bagging BLUP fit_Bag = wgr(y,gen,bag=0.5) cor(y,fit_Bag$hat) ## End(Not run)
## Not run: # Load data data(tpod) # BLUP fit_BRR = wgr(y,gen,iv=FALSE,pi=0) cor(y,fit_BRR$hat) # BayesA fit_BayesA = wgr(y,gen,iv=TRUE,pi=0) cor(y,fit_BayesA$hat) # BayesB fit_BayesB = wgr(y,gen,iv=TRUE,pi=.95) cor(y,fit_BayesB$hat) # BayesC fit_BayesC = wgr(y,gen,iv=FALSE,pi=.95) cor(y,fit_BayesC$hat) # BayesCpi fit_BayesCpi = BayesCpi(y,gen) cor(y,fit_BayesCpi$hat) # BayesDpi fit_BayesDpi = BayesDpi(y,gen) cor(y,fit_BayesDpi$hat) # BayesL fit_BayesL = wgr(y,gen,de=TRUE) cor(y,fit_BayesL$hat) # Bagging BLUP fit_Bag = wgr(y,gen,bag=0.5) cor(y,fit_Bag$hat) ## End(Not run)
Univariate models to find breeding values through regression fitted via expectation-maximization implemented in C++.
emRR(y, gen, df = 10, R2 = 0.5) emBA(y, gen, df = 10, R2 = 0.5) emBB(y, gen, df = 10, R2 = 0.5, Pi = 0.75) emBC(y, gen, df = 10, R2 = 0.5, Pi = 0.75) emBCpi(y, gen, df = 10, R2 = 0.5, Pi = 0.75) emBL(y, gen, R2 = 0.5, alpha = 0.02) emEN(y, gen, R2 = 0.5, alpha = 0.02) emDE(y, gen, R2 = 0.5) emML(y, gen, D = NULL) lasso(y, gen) emCV(y, gen, k = 5, n = 5, Pi = 0.75, alpha = 0.02, df = 10, R2 = 0.5, avg=TRUE, llo=NULL, tbv=NULL, ReturnGebv = FALSE)
emRR(y, gen, df = 10, R2 = 0.5) emBA(y, gen, df = 10, R2 = 0.5) emBB(y, gen, df = 10, R2 = 0.5, Pi = 0.75) emBC(y, gen, df = 10, R2 = 0.5, Pi = 0.75) emBCpi(y, gen, df = 10, R2 = 0.5, Pi = 0.75) emBL(y, gen, R2 = 0.5, alpha = 0.02) emEN(y, gen, R2 = 0.5, alpha = 0.02) emDE(y, gen, R2 = 0.5) emML(y, gen, D = NULL) lasso(y, gen) emCV(y, gen, k = 5, n = 5, Pi = 0.75, alpha = 0.02, df = 10, R2 = 0.5, avg=TRUE, llo=NULL, tbv=NULL, ReturnGebv = FALSE)
y |
Numeric vector of response variable ( |
gen |
Numeric matrix containing the genotypic data. A matrix with |
df |
Hyperprior degrees of freedom of variance components. |
R2 |
Expected R2, used to calculate the prior shape (de los Campos et al. 2013). |
Pi |
Value between 0 and 1. Expected probability pi of having null effect (or 1-Pi if Pi>0.5). |
alpha |
Value between 0 and 1. Intensity of L1 variable selection. |
D |
NULL or numeric vector with length p. Vector of weights for markers. |
k |
Integer. Folding of a k-fold cross-validation. |
n |
Integer. Number of cross-validation to perform. |
avg |
Logical. Return average across CV, or correlations within CV. |
llo |
NULL or a vector (numeric or factor) with the same length as y. If provided, the cross-validations are performed as Leave a Level Out (LLO). This argument allows the user to predefine the splits. This argument overrides |
tbv |
NULL or numeric vector of 'true breeding values' ( |
ReturnGebv |
Logical. If TRUE, it returns a list with the average marker values and fitted values across all cross-validations, in addition to the regular output. |
The model for the whole-genome regression is as follows:
where is the response variable,
is the intercept,
is the genotypic matrix,
is the effect of an allele substitution (or regression coefficient) and
is the residual term. A k-fold cross-validation for model evaluation is provided by
.
The EM functions returns a list with the intercept (), the regression coefficient (
), the fitted value (
), and the estimated intraclass-correlation (
).
The function emCV returns the predictive ability of each model, that is, the correlation between the predicted and observed values from -fold cross-validations repeated
times.
Alencar Xavier
## Not run: data(tpod) emCV(y,gen,3,3) ## End(Not run)
## Not run: data(tpod) emCV(y,gen,3,3) ## End(Not run)
Multivariate model to find breeding values.
mkr(Y,K,...) mrr(Y,X,...) mrr_float(Y,X,...)
mkr(Y,K,...) mrr(Y,X,...) mrr_float(Y,X,...)
Y |
Numeric matrix of observations x trait. |
K |
Numeric matrix containing the relationship matrix. |
X |
Numeric matrix containing the genotyping matrix. |
... |
Arguments to pass to MRR3/MRR3F. See |
Algorithm is described in Xavier and Habier (2022). The model for the ridge regression (mrr) is as follows:
where is a matrix of response variables,
represents the intercepts,
is the matrix of genotypic information,
is the matrix of marker effects, and
is the residual matrix.
The model for the kernel regression (mkr) is as follows:
where is a matrix of response variables,
represents the intercepts,
is the matrix of Eigenvector of K,
is a vector of regression coefficients and
is the residual matrix.
Algorithm: Residuals are assumed to be independent among traits. Regression coefficients are solved via a multivaraite adaptation of Gauss-Seidel Residual Update. Since version 2.0, the solver of mrr
is based on the Randomized Gauss-Seidel algorithm. Variance and covariance components are solved with an EM-REML like approach proposed by Schaeffer called Pseudo-Expectation.
Other related implementations:
01) mkr2X(Y,K1,K2):
Solves multi-trait kernel regressions with two random effects.
02) mrr2X(Y,X1,X2):
Solves multi-trait ridge regressions with two random effects.
03) MRR3(Y,X,...):
Extension of mrr with additional parameters.
04) MRR3F(Y,X,...):
MRR3 running on float.
05) mrr_svd(Y,W):
Solves mrr through the principal components of parameters.
06) MLM(Y,X,Z,maxit=500,logtol=-8,cores=1):
Multivariate model with fixed effects.
07) SEM(Y,Z,...):
Fits a MegaSEM with both shared- and trait-specific terms.
08) MEGA(Y,X,npc=-1):
Toy implementation of MegaLMM, imputing missing with GEBVs.
09) GSEM(Y,X,npc=-1):
Toy C++ implementaiton of MegaSEM, jointly fits FA and XB.
10) ZSEMF(Y,X,npc=0):
Full-rank MegaSEM, float precision.
11) YSEMF(Y,X,npc=-1):
Reduced-rank MegaSEM, float, two-steps approach.
12) XSEMF(Y,X,npc=0):
Full-rank MegaSEM, h2 fixed at 0.5, float precision.
In GSEM, XSEMF and MEGA, 'npc' means number of latent spaces if input is above zero, otherwise, 0 means all and -1 means 2*sqrt(ncol(Y))
.
Returns a list with the random effect covariances (Vb
), residual variances (Ve
), genetic correlations (GC
), matrix with marker effects (b
) or eigenvector effects (if mkr
), intercepts (mu
), heritabilities (h2
), and a matrix with fitted values (hat
).
NOTE: Numeric stability is a serious concern with multivariate models with large number of response variables, as the covariance matrix is often not invesible. If output is filled with NAs, try using MRR3
and play with some parameters. For example, one may try adding priors to stabilize variances, e.g., fit=MRR3(Y,X,df0=20)
.
Alencar Xavier, David Habier
Xavier, A and Habier, D. (2022). A new approach fits multivariate genomic prediction models efficiently. GSE, DOI: 10.1186/s12711-022-00730-w
Xavier, A et al. (2024). Megavariate Methods Capture Complex Genotype-by-Environment Interactions. Genetics, DOI: 10.1093/genetics/iyae179
# Load genomic data data(tpod) X = CNT(gen) # Simulate phenotyp sim = SimY(X) Y = sim$Y TBV = sim$tbv # Fit regression model test = mrr(Y,X) # Genetic correlation test$GC # Heritabilies test$h2 # Accuracy diag(cor(TBV,test$hat)) # try: demo(multivariates)
# Load genomic data data(tpod) X = CNT(gen) # Simulate phenotyp sim = SimY(X) Y = sim$Y TBV = sim$tbv # Fit regression model test = mrr(Y,X) # Genetic correlation test$GC # Heritabilies test$h2 # Accuracy diag(cor(TBV,test$hat)) # try: demo(multivariates)
Function to solve univariate mixed models with or without the usage of omic information. This function allows single-step modeling of replicated observations with marker information available through the usage of a linkage function to connect to a whole-genome regression method. Genomic estimated values can be optionally deregressed (no shrinkage) while fitting the model.
mixed(y,random=NULL,fixed=NULL,data=NULL,X=list(), alg=emML,maxit=10,Deregress=FALSE,...)
mixed(y,random=NULL,fixed=NULL,data=NULL,X=list(), alg=emML,maxit=10,Deregress=FALSE,...)
y |
Response variable from the data frame containg the dataset. |
random |
Formula. Right-hand side formula of random effects. |
fixed |
Formula. Right-hand side formula of fixed effects. |
data |
Data frame containing the response variable, random and fixed terms. |
X |
List of omic incidence matrix. Row names of these matrices connect the omic information to the levels of the indicated random terms (eg. |
alg |
Function. Whole-genome regression algorithm utilized to solve link functions. These include MCMC ( |
maxit |
Integer. Maximum number of iterations. |
Deregress |
Logical. Deregress (unshrink) coefficients while fitting the model? |
... |
Additional arguments to be passed to the whole-genome regression algorithms especified on |
The model for the whole-genome regression is as follows:
where is the response variable,
corresponds to the fixed effect term,
corresponds to one or more random effect terms,
is the incidence matrix of terms with omic information and
is omic values by
, where
is the genotypic matrix and
are marker effects. Here,
is the residual term. An example is provided using the data from the NAM package with:
demo(mixedmodel)
.
Alterinative (and updated) implementations have similar syntax:
01) mm(y,random=NULL,fixed=NULL,data=NULL,
M=NULL,bin=FALSE,AM=NULL,it=10,verb=TRUE,
FLM=TRUE,wgtM=TRUE,cntM=TRUE,nPc=3)
02) mtmixed = function(resp, random=NULL, fixed=NULL,
data, X=list(), maxit=10, init=10, regVC=FALSE)
The function wgr returns a list with Fitness values (Fitness
) containing observation obs
, fitted values hat
, residuals res
, and fitted values by model term fits
; Estimated variance components (VarComp
) containing the variance components per se (VarComponents
) and variance explained by each model term (VarExplained
), regression coefficients by model term (Coefficients
), and the effects of structured terms (Structure
) containing the marker effects of each model term where markers were provided.
Alencar Xavier
Xavier, A. (2019). Efficient Estimation of Marker Effects in Plant Breeding. G3: Genes, Genomes, Genetics, DOI: 10.1534/g3.119.400728
## Not run: demo(mixedmodel) ## End(Not run)
## Not run: demo(mixedmodel) ## End(Not run)
Complementary functions that may help with handling parameters and routine operations.
emGWA(y,gen)
# Simple MLM for association analysis
markov(gen,chr=NULL)
# Markovian imputation of genotypes coded as 012
IMP(X)
# Imputes genotypes with SNP expectation (column average)
CNT(X)
# Recodes SNPs by centralizing columns in a matrix
GAU(X)
# Creates Gaussian kernel as exp(-Dist2/mean(Dist2))
GRM(X,Code012=FALSE)
# Creates additive kinship matrix VanRaden 2008
SPC(y,blk,row,col,rN=3,cN=1)
# Spatial covariate
SPM(blk,row,col,rN=3,cN=1)
# Spatial design matrix
SibZ(id,p1,p2)
# Pedigree design matrix compatible to regression methods
Hmat(ped,gen=NULL)
# Kinship combining pedigree and genomics
EigenGRM(X, centralizeZ = TRUE, cores = 1)
# GRM using Eigen library
EigenARC(X, centralizeZ = TRUE, cores = 1)
# ArcCosine kernel
EigenGAU(X, phi = 1.0, cores = 1)
# Gaussian kernel using Eigen library
EigenCNT(X, cores = 1)
# Center SNPs without missing Eigen library
EigenEVD(A, cores = 1)
# Eigendecomposition from Eigen library
EigenBDCSVD(X, cores = 1)
# BDC single value composition from Eigen
EigenJacobiSVD(X, cores = 1)
# Jacobi single value composition from Eigen
EigenAcc(X1, X2, h2 = 0.5, cores = 1)
# Deterministic accuracy X1 -> X2 via V
AccByC(X1, X2, h2 = 0.5, cores = 1)
# Deterministic accuracy X1 -> X2 via C
EigenArcZ(Zfndr, Zsamp, cores = 1)
# Reduced rank ArcCos kernel PCs with founder rotation
EigenGauZ(Zfndr, Zsamp, phi=1, cores = 1)
# Reduced rank Gaussian kernel PCs with founder rotation
K2X(K, MinEV = 1e-8, cores = 1)
# Reparametrize kernel to PCs to run regression models
SimY(Z,k=5,h2=0.5,GC=0.5,seed=123,unbalanced=FALSE,PercMiss=0,BlkMiss=FALSE)
# Simulate phenotypes
SimZ(ind=500,snp=500,chr=2,F2=TRUE,rec=0.01)
# Simulate genome
SimGC(k=50,...)
# Simulate genetic correlation matrix
MvSimY(Ufndr,Zfndr,Zsamp,GxY,GxL,H2plot,nLoc=20,Seed=123)
# Simulate phenotypes given founders
Alencar Xavier