%\VignetteIndexEntry{BGLR-extdoc} \documentclass[article,shortnames,nojss]{jss} %Extra packages \usepackage{amsmath} \usepackage{tikz} \usepackage{subfig} \usepackage{appendix} \usepackage{rotating} \newlength{\RoundedBoxWidth} \newsavebox{\GrayRoundedBox} \newenvironment{GrayBox}[1][\dimexpr\textwidth-4.5ex]% {\setlength{\RoundedBoxWidth}{\dimexpr#1} \begin{lrbox}{\GrayRoundedBox} \begin{minipage}{\RoundedBoxWidth}}% { \end{minipage} \end{lrbox} \begin{center} \begin{tikzpicture}% \draw node[draw=black,fill=black!10,rounded corners,% inner sep=2ex,text width=\RoundedBoxWidth]% {\usebox{\GrayRoundedBox}}; \end{tikzpicture} \end{center}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% declarations for jss.cls %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% almost as usual \author{Paulino P\'erez \\Colegio de Postgraduados, M\'exico \And Gustavo de los Campos \\University of Alabama at Birmingham} \title{BGLR: A Statistical Package for Whole Genome Regression and Prediction} %% for pretty printing and a nice hypersummary also set: \Plainauthor{Paulino P\'erez, Gustavo de los Campos} %% comma-separated \Plaintitle{BGLR: A Statistical Package for Whole-Genome Regression} %% without formatting \Shorttitle{\pkg{BGLR}: An R-package for Whole-Genome Regression} %% a short title (if necessary) %% an abstract and keywords \Abstract{ Many modern genomic data analysis problems require implementing regressions where the number of unknowns ($p$, e.g., the number of marker effects) vastly exceeds sample size ($n$). Implementing these \textit{large-$p$-with-small-$n$} regressions poses several statistical and computational challenges. Some of these challenges can be confronted using Bayesian methods, and the Bayesian approach allows integrating various parametric and non-parametric shrinkage and variable selection procedures in a unified and consistent manner. The \pkg{BGLR} R-package implements a large collection Bayesian regression models, including various parametric regressions where regression coefficients are allowed to have different types of prior densities (flat, normal, scaled-t, double-exponential and various finite mixtures of the spike-slab family) and semi-parametric methods (Bayesian reproducing kernel Hilbert spaces nregressions, RKHS). The software was originally developed as an extension of the \pkg{BLR} package and with a focus on genomic applications; however, the methods implemented are useful for many non-genomic applications as well. The response can be continuous (censored or not) or categorical (either binary, or ordinal). The algorithm is based on a Gibbs Sampler with scalar updates and the implementation takes advantage of efficient compiled C and Fortran routines. In this article we describe the methods implemented in \pkg{BGLR}, present examples of the use of the package and discuss practical issues emerging in real-data analysis. } \Keywords{Bayesian Methods, Regression, Whole Genome Regression, Whole Genome Prediction, Genome Wide Regression, Variable Selection, Shrinkage, semi-parametric regression, RKHS, \proglang{R}} \Plainkeywords{Bayesian Regression, Whole Genome Regression, Whole Genome Prediction, Genome Wide Regression, Variable Selection, Shrinkage, semi-parametric regression, R} %% without formatting %% publication information %% NOTE: Typically, this can be left commented and will be filled out by the technical editor %% \Volume{50} %% \Issue{9} %% \Month{June} %% \Year{2012} %% \Submitdate{2012-06-04} %% \Acceptdate{2012-06-04} %% The address of (at least) one author should be given %% in the following format: \Address{ Paulino P\'erez\\ Socio Econom\'ia Estad\'istica e Inform\'atica \\ Colegio de Postgraduados, M\'exico\\ E-mail: \email{perpdgo@colpos.mx}\\ Gustavo de los Campos\\ Department of Biostatistics \\ Section on Statistical Genetics \\ University of Alabama at Birmingam \\ Telephone: +1/205/975-9248 \\ Fax: +1 /205/975-2540 \\ E-mail: \email{gcampos@uab.edu}\\ \url{http://www.soph.uab.edu/ssg/people/campos} } %% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document} %% include your article here, just as usual %% Note that you should use the \pkg{}, \proglang{} and \code{} commands. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} \label{sec:Intro} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Many modern statistical learning problems involve the analysis of highly dimensional data; this is particularly common in genetic studies where, for instance, phenotypes are regressed on large numbers of predictor variables (e.g., SNPs) concurrently. Implementing these \textit{large-p-with-small-n} regressions posses several statistical and computational challenges; including how to confront the so-called `curse of dimensionality' \citep{Bellman:1961} as well as the complexity of a genetic mechanism that can involve various types of interactions between alleles and with environmental factors. Recent developments in the areas of shrinkage estimation, both in the penalized and Bayesian regression frameworks, as well as in computational methods have made the implementation of these \textit{large-p-with-small-n} regressions feasible. Consequently, whole-genome-regression approaches \citep{Meuwissen:2001} are becoming increasingly popular for the analysis and prediction of complex traits in plants \citep[e.g.][]{Crossa:2010}, animals \citep[e.g.][]{VanRaden:2009, Hayes:2009} and humans \citep[e.g.][]{Yang:2010,Makowsky:2011,Vazquez:2012, delosCampos:2013c}. In the last decade a large collection of parametric and non-parametric methods have been proposed and empirical evidence has demonstrated that there is no single approach that performs best across data sets and traits. Indeed, the choice of the model depends on multiple factors such as the genetic architecture of the trait, marker density, sample size, the span of linkage disequilibrium \citep[e.g.,][]{delosCampos:2013b}. Although various software (\textbf{BLR}, \citealt{Perez:2010}; \textbf{rrBLUP}, \citealt{Endelman:2011}; \textbf{synbreed}, \citealt{Wimmer:2012}; \textbf{GEMMA}, \citealt{Zhou:2012}) exists, most statistical packages implement a few types of methods and there is need of integrating these methods in a unified statistical and computational framework. Motivated by this need we have developed the R \citep{Rsoft} package \pkg{BGLR} \citep{delosCampos:2013a}. The package is available at CRAN and at the R-forge website \url{https://r-forge.r-project.org/projects/bglr/}. \textbf{Models.} \pkg{BGLR} can be used with \textbf{continuous} (censored or not) and \textbf{categorical} traits (binary and ordinal). The user has control in choosing the prior assigned to effects and this can be used to control the extent and type of shrinkage of estimates. For \textbf{parametric linear regressions on covariates} (e.g., genetic markers, non-genetic co-variates) the user can choose a variety of prior densities, from flat priors (the so-called \textbf{\emph{`fixed effects'}}, a method that does not induce shrinkage of estimates) to priors that induce different types of shrinkage, including: Gaussian (\textbf{Bayesian Ridge Regression}, \textbf{BRR}), scaled-t \citep[\textbf{BayesA}][]{Meuwissen:2001}, Double-Exponential \citep[\textbf{Bayesian LASSO}, \textbf{BL}][]{Park:2008}, and \textbf{\emph{two component mixtures}} with a point of mass at zero and a with a slab that can be either Gaussian \citep[\textbf{BayesC},][]{Habier:2011} or scaled-t \citep[\textbf{BayesB},][]{Habier:2011}. The \pkg{BGLR} package also implements \textbf{\emph{Bayesian Reproducing Kernel Hilbert Spaces Regressions}} \citep[\textbf{RKHS,}][]{Wahba:1990} using Gaussian processes with arbitrarily user-defined co-variance structures. This class of models allows implementing semi-parametric regressions for various types of problems, including, scatter-plot smoothing \citep[e.g., \textbf{\emph{smoothing splines}}][]{Wahba:1990}, \textbf{spatial smoothing} \citep{Cressie:1988}, \textbf{\emph{Genomic-BLUP}} \citep{VanRaden:2008}, non-parametric \textbf{\emph{RKHS}} genomic regressions \citep{Gianola:2006, Gianola:2008,delosCampos:2010b} and \textbf{\emph{pedigree-BLUP}} \citep{Henderson:1975}. All the above-mentioned prior densities (e.g., Gaussian, Double Exponential, Scaled-t, finite mixtures) are index by \textbf{regularization parameters} that control the extent of shrinkage of estimates; rather than fixing them to some user-specified values we treat them as random. Consequently, in a deeper level of the hierarchal model these regularization parameters are assigned prior densities. \textbf{Algorithms}. In \pkg{BGLR} samples from the posterior density are drawn using a Gibbs sampler \citep{Geman:1984, Casella:1992}; with scalar updating. This approach is very flexible but computationally demanding. To confront the computational challenges emerging in Markov Chain Monte Carlo (MCMC) implementations we have adopted a strategy that combines: (a) the use of built-in R functions for operations that can be vectorized with (b) customized compiled code (C and Fortran) developed to perform operations that cannot be vectorized. Thus, the kernel of our software is written in R, but the computationally demanding steps are carried out using customized routines written in C and Fortran code. The implementation makes use of BLAS routines \texttt{daxpy} and \texttt{ddot}. The computational performance of the algorithm can be greatly improved if R is linked against a tuned BLAS implementation with multithread support, for example OpenBLAS, ATLAS, Intel mkl, etc. \textbf{Ancillary functions and data sets}. In addition to the main function (BGLR) the package comes with: (a) functions to read and write from the R-console \texttt{*.ped} and \texttt{*.bed} files \citep{Purcell:2007}, (b) two publicly available data sets (see Section \ref{sec:datasets} for further details) and (c) various examples (type \texttt{demo(package=`BGLR')} in the R-console). In what remains of the article we discuss the \textbf{methods implemented} (\textbf{Section \ref{sec:methods}}), the \textbf{user interface} (\textbf{Section \ref{sec:interface}}), and the \textbf{data sets} included (\textbf{Section \ref{sec:datasets}}) in the \pkg{BGLR} package in detail. \textbf{Application examples} are given in \textbf{Section \ref{sec:examples}}. A small benchmark is given in \textbf{Section \ref{sec:benchmark}}. Finally, the article is closed in \textbf{Section \ref{sec:concluding}} with a few \textbf{concluding remarks}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Statistical Models and Algorithms} \label{sec:methods} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The \pkg{BGLR} supports models for continuous (censored or not) and categorical (binary or ordinal multinomial) traits. We begin by considering the case of a continuous response without censoring; categorical and censored data are considered later on. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Conditional distribution of the data} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% For a continuous response ($y_i;\,\, i=1,....,n$) the data equation is represented as $y_i=\eta_i+\varepsilon_i$, where $\eta_i$ is a linear predictor (the expected value of $y_i$ given predictors) and $\varepsilon_i$ are independent normal model residuals with mean zero and variance $w_i^2 \sigma_\varepsilon^2$. Here, the $w_i's$ are user-defined weights (by default \pkg{BGLR} sets $w_i=1$ for all data-points) and $\sigma_\varepsilon^2$ is a residual variance parameter. In matrix notation we have \begin{equation*} \boldsymbol{y}=\boldsymbol{\eta} + \boldsymbol{\varepsilon}, \end{equation*} where $\boldsymbol y=\{y_1,...,y_n\}$, $\boldsymbol \eta=\{\eta_1,...,\eta_n\}$ and $\boldsymbol \varepsilon=\{\varepsilon_1,...,\varepsilon_n\}$. The linear predictor represents the conditional expectation function, and it is structured as follows: \begin{equation} \label{eq:eta} \boldsymbol{\eta}=\boldsymbol 1 \mu + % \sum_{j}^J \boldsymbol{X}_j \boldsymbol{\beta}_j + % \sum_{l}^{L} \boldsymbol u_l, \end{equation} where $\mu$ is an intercept, $\boldsymbol X_j$ are design matrices for predictors, $\boldsymbol X_j=\{x_{ijk}\}$, $\boldsymbol \beta_{jk}$ are vectors of effects associated to the columns of $\boldsymbol X_j$ and $\boldsymbol u_l=\{u_{l1},...,u_{ln}\}$ are vectors of random effects. The only element of the linear predictor included by default is the intercept. The other elements are user-specified. Collecting the above assumptions, we have the following likelihood: \begin{equation*} p(\boldsymbol y | \boldsymbol \theta)= % \prod_{i=1}^n N(y_i | \mu + \sum_j^J \sum_{k=1}^{K_j} x_{ijk} \beta_{jk} + \sum_l^L u_{li}, % \sigma_\varepsilon^2 w_i^2), \end{equation*} where $\boldsymbol \theta$ represents the collection of unknowns, including the intercept, regression coefficients, random effects and the residual variance. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Prior densities} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The \textbf{residual variance} is assigned a Scaled-inverse Chi-square density $p(\sigma_\varepsilon^2)=\chi^{-2}(\sigma_\varepsilon^2 | S_\varepsilon, df_\varepsilon)$ with degree of freedom $df_\varepsilon$ (>0) and scale parameters $S_\varepsilon$ (>0) and the intercept $(\mu)$ is assigned a flat prior. In the parameterization used in \pkg{BGLR}, the prior expectation of the Scaled-inverse Chi-square density $\chi^{-2}(\cdot| S_{\cdot}, df_{\cdot})$ is given by $\frac{S_\cdot}{df_\cdot-2}$. \textbf{Regression coefficients} $\{\beta_{jk}\}$ can be assigned either un-informative (i.e., flat) or informative priors. Those coefficients assigned flat priors, the so-called `fixed' effects, are estimated based on information contained in the likelihood solely. For the coefficient assigned informative priors, the choice of the prior will play an important role in determining the type of shrinkage of estimates of effects induced. Figure \ref{fig:prior_densities} provides a graphical representation of the prior densities available in \pkg{BGLR}. The \textbf{\emph{Gaussian prior}} induce shrinkage of estimate similar to that of Ridge Regression \citep[\textbf{RR},][]{Hoerl:1970} where all effects are shrunk to a similar extent; we refer to this model as the Bayesian Ridge Regression (BRR). The \textbf{scaled-t} and \textbf{double exponential} (DE) densities have higher mass at zero and thicker tails than the normal density, and they induce a type of shrinkage of estimates that is size-of-effect dependent \citep{Gianola:2013}. The scaled-t density is the prior used in model BayesA \citep{Meuwissen:2001}, and the DE or Laplace prior is the one used in the BL \citep{Park:2008}. Finally, \pkg{BGLR} implements two \textbf{\emph{finite mixture priors}}: a mixture of a point of mass at zero and a Gaussian slab, a model usually refereed in the literature on GS as to \textbf{BayesC} \citep{Habier:2011} and a mixture of a point of mass at zero and a scaled-t slab, a model known as \textbf{BayesB} \citep{Meuwissen:2001}. By assigning a non-null prior probability for the marker effect to be equal to zero, the priors used in BayesB and BayesC have potential for inducing variable selection. \begin{figure}[!htb] \centering \includegraphics[width=0.75\textwidth]{fig_1} \caption{Prior Densities of Regression Coefficients Implemented in \pkg{BGLR}. All the densities displayed correspond to random variables with null mean and unit variance.} \label{fig:prior_densities} \end{figure} \textbf{Hyper-parameters}. Each of the prior distributions above-described are indexed by one or more parameters that control the type and extent of shrinkage induced. We treat these regularization parameters as unknown; consequently a prior is assigned to these unknowns. Table \ref{tab:priors_marker_effects} lists, for each of the prior densities implemented the set of hyper-parameters. Further details about how regularization parameters are inferred from the data are given in the Appendix. \begin{table}[!htb] \caption{Prior densities available for regression coefficients in the \pkg{BGLR} package.} \begin{center} \begin{tabular}{lll} \hline \textbf{Model} & \textbf{Hyper-parameters} & \textbf{Treatmend in BGLR}$^1$ \\ \textbf{(prior density)} & & \\ \hline Flat & Mean ($\mu_\beta$) & $\mu_\beta=0$ \\ (FIXED) & Variance ($\sigma_\beta^2$)& $\sigma_\beta^2=1 \times 1^{10}$\\ \hline Gaussian & Mean ($\mu_\beta$) & $\mu_\beta=0$ \\ (BRR) & Variance ($\sigma_\beta^2$)& $\sigma_\beta^2 \sim \chi^{-2}$\\ \hline Scaled-t & Degrees of freedom ($df_\beta$) & User-specified (default value, 5) \\ (BayesA) & Scale ($S_\beta$) & $S_\beta \sim Gamma$\\ \hline Double-Exponential & & $\lambda$ Fixed, user specified, or \\ (BL) & $\lambda^2$ & $\lambda^2 \sim Gamma$, or \\ & & $\frac{\lambda}{\max} \sim Beta ^2$ \\ \hline Gaussian Mixture &$\pi$ (prop. of non-null effects) & $\pi \sim Beta$ \\ (BayesC) &$df_\beta$ & User-specified (default value, 5) \\ &$S_\beta$ & $S_\beta \sim Gamma$ \\ \hline Scaled-t Mixture &$\pi$ (prop. of non-null effects) & $\pi \sim Beta$ \\ (BayesB) &$df_\beta$ & User-specified (default value, 5) \\ &$S_\beta$ & $S_\beta \sim Gamma$ \\ \hline \end{tabular} \end{center} 1: Further details are given in the Appendix. 2: This approach is further discussed in \cite{delosCampos:2009a}. \label{tab:priors_marker_effects} \end{table} \textbf{Combining priors}. Different priors can be specified for each of the elements of the linear predictor, $\{\boldsymbol{X}_1,\boldsymbol{X}_2,...,\boldsymbol{X}_J, \boldsymbol{u}_1,\boldsymbol{u}_2,...,\boldsymbol{u}_L\}$, giving the user great flexibility in building models for data analysis; an example illustrating how to combine different priors in a model is given in Box 3a of Section \ref{sec:examples}. \textbf{Gaussian Processes}. The vectors of random effects $\boldsymbol u_l$ are assigned multivariate-normal priors with a mean equal to zero and co-variance matrix $Cov(\boldsymbol u_l,\boldsymbol u_l')=\boldsymbol K_l \sigma_{ul}^2$ where $\boldsymbol K_l$ is an $n \times n$ symmetric positive semi-definite matrix and $\sigma_{ul}^2$ is a variance parameter with prior density $\sigma_{ul}^2 \sim \chi^{-2} (df_l,S_l)$. Special classes of models that can be implemented using these random effects include standard pedigree-regression models \citep{Henderson:1975} in which case $\boldsymbol K_l$ is a pedigree-derived co-variance matrix, Genomic BLUP \citep{VanRaden:2008}, which case $\boldsymbol K_l$ may be a marker-derived relationship matrix, or models for spatial regressions \citep{Cressie:1988} in which case $\boldsymbol K_l$ may be a co-variance matrix derived from spatial information. Illustration about the inclusion of these Gaussian processes into models for data analysis are given in examples of Section \ref{sec:examples}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Algorithms} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The R-package \pkg{BGLR} draws samples from the posterior density using a Gibbs sampler \citep{Geman:1984,Casella:1992} with scalar updating. For computational convenience the scaled-t and DE densities are represented as infinite mixtures of scaled normal densities \citep{Andrews:1974}, and the finite-mixture priors are implemented using latent random Bernoulli variables linking effects to components of the mixtures. \textbf{\emph{Categorical traits}}. The argument \texttt{response\_type} is used to indicate \pkg{BGLR} whether the response should be regarded as \texttt{`continuous'}, the default value, or \texttt{`ordinal'}. For continuos traits the response vector shoud be coercible to numeric; for ordinal traits the response can take onto $K$ possible (ordered) values $y_i \in \{1,...,K\}$ (the case where $K=2$ corresponds to the binary outcome), and the response vector should be coercible to a factor. For categorical traits we use the probit link; here, the probability of each of the categories is linked to the linear predictor according to the following link function: \begin{equation*} P(y_i=k)=\Phi(\eta_i-\gamma_k)-\Phi(\eta_i-\gamma_{k-1}) \end{equation*} where $\Phi(\cdot)$ is the standard normal cumulative distribution function, $\eta_i$ is the linear predictor, specified as above-described, and $\gamma_k$ are threshold parameters, with $\gamma_0=-\infty$, $\gamma_k \geq \gamma_{k-1}$, $\gamma_K=\infty$. The probit link is implemented using data augmentation \citep{Tanner:1987}, this is done by introducing a latent variable (so-called liability) $l_i=\eta_i+\varepsilon_i$ and a measurement model $y_i=k$ if $\gamma_{k-1} \leq l_i \leq \gamma_k$. For identification purpouses, the residual variance is set equal to one. At each iteration of the Gibbs sampler the un-observed liability scores are sampled from truncate normal densities; once the un-observed liability has been sampled the Gibbs sampler proceed as if $l_i$ were observed \citep[see][for further details]{Albert:1993}. \textbf{\emph{Missing data}}. The response vector can contain missing values. Internally, at each iteration of the Gibbs sampler missing values are sampled from the corresponding fully-conditional density. Missing values in predictors are not allowed. \textbf{\emph{Censored data}}. Censored data in \pkg{BGLR} is described a triplet $\{a_i, y_i , b_i\}$; the elements of this triplet must satisfy: $a_i 1$. Box 6 illustrates how to fit a multi-kernel model using $h=0.5 \times \{1/5,1,5\}$. With this choice of values for the bandwidth parameter the model includes kernels that give correlations much smaller ($h=2.5$) similar ($h=0.5$), and much higher ($h=0.1$) than the ones given by the $\boldsymbol G$ matrix. This is illustrated in Figure \ref{fig:entries_kernel} that displays the entries of the 1st row of the kernel matrix evaluated at each of the values of the bandwidth parameter in the grid. \begin{GrayBox} \small \textbf{Box 6: Fitting a RKHS Using a Multi-Kernel Methods (Kernel Averaging)} \begin{verbatim} #1# Loading and preparing the input data library(BGLR); data(wheat); Y<-wheat.Y; X<-wheat.X; n<-nrow(X); p<-ncol(X) y<-Y[,1] #2# Computing D and then K X<-scale(X,center=TRUE,scale=TRUE) D<-(as.matrix(dist(X,method='euclidean'))^2)/p h<-0.5*c(1/5,1,5) #3# Kernel Averaging using BGLR ETA<-list(list(K=exp(-h[1]*D),model='RKHS'), list(K=exp(-h[2]*D),model='RKHS'), list(K=exp(-h[3]*D),model='RKHS')) fm<-BGLR(y=y,ETA=ETA,nIter=5000, burnIn=1000,saveAt='RKHS_KA_') #1# Variance Components fm$ETA[[1]]$varU ; fm$ETA[[2]]$varU; fm$ETA[[3]]$varU \end{verbatim} \end{GrayBox} \begin{figure}[!htb] \centering \includegraphics[width=0.60\textwidth]{fig_3} \caption{Entries of the 1st row of the (Gaussian) kernel matrix evaluated at % three different values of the bandwidth parameter, $h=0.5 \times \{1/5,1,5\}$.} \label{fig:entries_kernel} \end{figure} \textbf{Assessment of Prediction Accuracy} The simple way of assessing prediction accuracy consists of partitioning the data set into two disjoint sets: one used for model training (TRN) and one used for testing (TST). Box 7 shows code that fits a G-BLUP model in a TRN-TST setting using the wheat data set. The code randomly assigns 100 individuals to the TST set. The variable \texttt{tst} is a vector that indicates which data-points belong to the TST data set; for these entries we put missing values in the phenotypic vector (see Box 7). Once the model is fitted predictions for individuals in TST set can be obtained typing \texttt{fit\$yHat[tst]} in the R command line. Figure \ref{fig:obs_predicted} plots observed vs predicted phenotypes for individuals in training and TST sets. \begin{GrayBox} \small \textbf{Box 7: Assessment of Prediction Accuracy: Continuous Response} \begin{verbatim} #1# Loading and preparing the input data library(BGLR); data(wheat); Y<-wheat.Y; X<-wheat.X; n<-nrow(X); p<-ncol(X) y<-Y[,1] #2# Creating a Testing set yNA<-y set.seed(123) tst<-sample(1:n,size=100,replace=FALSE) yNA[tst]<-NA #3# Computing G X<-scale(X,center=TRUE,scale=TRUE) G<-tcrossprod(X)/p #4# Fits the G-BLUP model ETA<-list(list(K=G,model='RKHS')) fm<-BGLR(y=yNA,ETA=ETA,nIter=5000, burnIn=1000,saveAt='RKHS_') plot(fm$yHat,y,xlab="Phenotype", ylab="Pred. Gen. Value" ,cex=.8,bty="L") points(x=y[tst],y=fm$yHat[tst],col=2,cex=.8,pch=19) legend("topleft", legend=c("training","testing"),bty="n", pch=c(1,19), col=c("black","red")) #5# Assesment of correlation in TRN and TST data sets cor(fm$yHat[tst],y[tst]) #TST cor(fm$yHat[-tst],y[-tst]) #TRN \end{verbatim} \end{GrayBox} \begin{figure}[!htb] \centering \includegraphics[width=0.70\textwidth]{fig_4} \caption{Estimated genetic values for training and % testing sets. Predictions were derived % using G-BLUP model (see Box7).} \label{fig:obs_predicted} \end{figure} A cross-validation is simply a generalization of the TRN-TST evaluation presented in Box 7. For a K-fold cross-validation there are K TRN-TST partitions; in each fold, the individuals assigned to that particular fold are used for TST and the remaining individuals are used for TRN. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Regression with Ordinal and Binary Traits} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% For categorical traits \pkg{BGLR} uses the probit link and the phenotype vector should be coercible to a factor. The type of response is defined by setting the argument \texttt{`response\_type'}. By default this argument is set equal to \texttt{`Gaussian'}. For binary and ordinal outcomes we should set \texttt{response\_type=`ordinal'}. Box 8 provides a simple example that uses the wheat data set with a discretized phenotype. The second block of code, \texttt{\#2\#}, presents the analysis of a binary outcome, and the third one, \texttt{\#3\#}, that of an ordinal trait. Figure \ref{fig:boxplot} shows, for the binary outcome, a plot of predicted probability versus realized value in the TRN and TST datasets. The estimated posterior means and posterior standard deviations of marker effects and posterior means of the linear predictor are retrieved as described before (e.g., \texttt{fm\$ETA[[1]]\$b}, \texttt{fm\$yHat}). For continuous outcomes the posterior mean of the linear predictor is also the conditional expectation function. For binary outcomes, the conditional expectation is simply the success probability; therefore, in this case \pkg{BGLR} also returns the estimated probabilities of each of the categories \texttt{fm\$probs}). \begin{GrayBox} \small \textbf{Box 8: Fitting models with binary and ordinal responses} \begin{verbatim} #1# Loading and preparing the input data library(BGLR); data(wheat); Y<-wheat.Y; X<-wheat.X; A<-wheat.A; y<-Y[,1] tst<-sample(1:nrow(X),size=150) \end{verbatim} \end{GrayBox} \begin{GrayBox} \small \begin{verbatim} #2# Binary outcome yBin<-ifelse(y>0,1,0) yBinNA<-yBin ; yBinNA[tst]<-NA ETA<-list(list(X=X,model='BL')) fmBin<-BGLR(y=yBinNA,response_type='ordinal', ETA=ETA, nIter=1200,burnIn=200) head(fmBin$probs) par(mfrow=c(1,2)) boxplot(fmBin$probs[-tst,2]~yBin[-tst],main='Training',ylab='Estimated prob.') boxplot(fmBin$probs[tst,2]~yBin[tst],main='Testing', ylab='Estimated prob.') #2# Ordinal outcome yOrd<-ifelse(y1$) and $CV(S_0)=1/\sqrt{s}$. If the user does not provide shape and rate parameters \pkg{BGLR} sets $s=1.1$, this gives a relatively un-informative prior with a CV of approximately $95\%$, and then solves for the rate so that the total contribution of the linear predictor matches the R-squared of the model (see default rules to set hyper-parameters, below). If one wants to run the analysis with fixed scale one can choose a very large value for the shape parameter (e.g., $1 \times10^{10}$) and then solve for the rate so that the prior mode matches the desired value of the scale parameter using $r=(s-1)/S_\beta$. \textbf{Bayesian LASSO (BL)}. In this model the marginal distribution of marker effects is double-exponential. Following \citet{Park:2008} we implement the double-exponential density as a mixture of scaled normal densities. In the first level of the hierarchy, marker effects are assigned independent normal densities with null mean and maker-specific variance parameter $\tau_{jk}^2 \times \sigma_\varepsilon^2$. The residual variance is assigned a scaled-inverse Chi-square density, and the marker-specific scale parameters, $\tau_{jk}^2$, are assigned IID exponential densities with rate parameter $\lambda^2/2$. Finally, in the last level of the hierarchy $\lambda^2$ is either regarded as fixed (this is obtained by setting in the linear predictor the option \texttt{type=`FIXED'}), or assigned either a Gamma $(\lambda^2 \sim Gamma(r,s)$ if \texttt{type=`gamma'}) or a $\lambda/\max$ is assigned a Beta prior, if \texttt{type=`beta'}, here $\max$ is a user-defined parameter representing the maximum value that $\lambda$ can take). If nothing is specified, \pkg{BGLR} sets \texttt{type=`gamma'} and $s=1.1$, and solves for the scale parameter to match the expected R-squared of the model (see section 2 of this appendix). \textbf{BayesB-C}. In these models marker effects are assigned IID priors that are mixtures of a point of mass at zero and a slab that is either normal (BayesC) or a scaled-t density (BayesB). The slab is structured as either in the BRR (this is the case of BayesC) or as in BayesA (this is the case of BayesB). Therefore, BayesB and BayesC extend BayesA and BRR, respectively, by introducing an additional parameter $\pi$ which in the case of BGLR represents the prior proportion of non-zero effects. This parameter is treated as unknown and it is assigned a Beta prior $\pi\sim Beta(p_0, \pi_0)$, with $p_0>0$ and $\pi_0 \in [0,1]$. The beta prior is parameterized in a way that the expected value by $E(\pi)=\pi_0$; on the other hand $p_0$ can be interpreted as the number of prior counts (priors ``successes'' plus prior ``failures''); the variance of the Beta distribution is then given by $Var(\pi)=\frac{\pi_0(1-\pi_0)}{(p_0+1)}$, which is inversely proportional to $p_0$. Choosing $p_0=2$ and $\pi_0=0.5$ gives a uniform prior in the interval $[0,1]$. Choosing a very large value for $p_0$ gives a prior that collapses to a point of mass at $\pi_0$. \begin{sidewaystable} Table A1. Prior densities implemented in BGLR. \vspace{0.5cm} \begin{tabular}{llp{8cm}} \hline \texttt{model=} & \textbf{Join distribution of effects and hyper-parameters} % & \textbf{Specification of elements in the} \\ & & \textbf{linear predictor} \\ \hline \texttt{FIXED} & $p(\boldsymbol \beta_j) \propto 1$ & \texttt{list(X=, model="FIXED")} \\ \hline \texttt{BRR} & $p(\boldsymbol \beta_j, \sigma_\beta^2)= % \left\{ \prod_{k} N(\beta_{jk}|0,\sigma_\beta^2) \right\} \chi^{-2} (\sigma_\beta^2|df_\beta,S_\beta)$ % & \texttt{list(X=, model="BRR",df0=,S0=,R2=)} \\ \hline \texttt{BayesA} & $p(\boldsymbol \beta_j, \sigma_{\beta_j}^2, S_\beta)= % \left\{ \prod_{k} N(\beta_{jk}|0,\sigma_{\beta_{jk}}^2) % \chi^{-2} (\sigma_{\beta_{jk}}^2|df_\beta,S_\beta) \right\} % G(S_\beta|r,s)$ & \texttt{list(X=, model="BayesA",df0=,rate0=,} \\ & & \texttt{shape0=,R2=)}\\ \hline & $p(\boldsymbol \beta_j, \boldsymbol \tau_j^2, \lambda^2|\sigma_\varepsilon^2)= % \left\{ \prod_k N(\beta_{jk} | 0, \tau_{jk}^2 \times \sigma_\varepsilon^2) Exp\left\{ \tau_{jk}^2 | \frac{\lambda^2}{2} \right\}% \right\}\times G(\lambda^2 | r, s)$ , or & \texttt{list(X=,model="BL",lambda=,type="gamma",}\\ & % & \texttt{rate=,shape=,R2=)}$^1$ \\ \texttt{BL} & $p(\boldsymbol \beta_j, \boldsymbol \tau_j^2, \lambda|\sigma_\varepsilon^2,\texttt{max})= % \left\{ \prod_k N(\beta_{jk} | 0, \tau_{jk}^2 \times \sigma_\varepsilon^2) Exp\left\{ \tau_{jk}^2 | \frac{\lambda^2}{2} \right\}% \right\}\times B(\lambda/\max |p_0, \pi_0)$, or & \texttt{list(X=,model="BL",lambda=,type="beta",}\\ & % & \texttt{probIn=,counts=,max=,R2=)}$^1$ \\ & $p(\boldsymbol \beta_j, \boldsymbol \tau_j^2|\sigma_\varepsilon^2,\lambda)= % \left\{ \prod_k N(\beta_{jk} | 0, \tau_{jk}^2 \times \sigma_\varepsilon^2) Exp\left\{ \tau_{jk}^2 | \frac{\lambda^2}{2} \right\}% \right\}$ & \texttt{list(X=,model="BL",lambda=,type="FIXED")}$^1$ \\ & % & \\ \hline \texttt{BayesC} & $\begin{array}{lll}% p(\boldsymbol \beta_j , \sigma_\beta^2,\pi)&=& % \left\{ \prod_{k} \left[ \pi N(\beta_{jk}|0,\sigma_\beta^2) % + (1-\pi) 1(\beta_{jk}=0) \right] \right\} \\ & & \times \chi^{-2} (\sigma_\beta^2 | df_\beta, S_\beta) B(\pi | p_0, \pi_0) \end{array}$ & \texttt{list(X=,model="BayesC",df0,S0, probIn=,counts=,R2=)}$^2$\\ \hline \texttt{BayesB} & $\begin{array}{lll}% p(\boldsymbol \beta_j , \sigma_\beta^2,\pi)&=& % \left\{ \prod_{k} \left[ \pi N(\beta_{jk}|0,\sigma_\beta^2) % + (1-\pi) 1(\beta_{jk}=0) \right] \chi^{-2} (\sigma_{\beta_{jk}}^2 | df_\beta, S_\beta)\right\}\\ & & B(\pi | p_0, \pi_0)\times G(S_\beta | r, s) \end{array}$ & \texttt{list(X=,model="BayesB",df0,rate0,shape0, probIn=,counts=,R2=)}$^2$\\ \hline \texttt{RKHS} & $p(\boldsymbol u_l , \sigma_{u_l}^2) = % N(\boldsymbol u_l | \boldsymbol 0, \boldsymbol K_l \times \sigma_{u_l}^2)% \chi^{-2} (\sigma_{u_l}^2 | df_l, S_l)$ & Either \texttt{list(K=,model="RKHS",df0,S0,R2=)} \\ & & or \texttt{list(V=,d=,model="RKHS",df0,S0,R2=)}$^3$ \\ \hline \end{tabular} $N(\cdot|\cdot,\cdot)$, $\chi^{-2}(\cdot|\cdot,\cdot)$, $G(\cdot|\cdot,\cdot)$, $Exp(\cdot| \cdot)$, $B(\cdot| \cdot ,\cdot)$ denote normal, scaled inverse Chi-squared, gamma, exponential and beta densities, respectively. (1) \texttt{type} can take values \texttt{"FIXED"}, \texttt{"gamma"}, or \texttt{"beta"}; (2) \texttt{probIn} represents the prior probability of a marker having a non-null effect ($\pi_0$), counts (the number of `prior counts') can be used to control how informative the prior is; (3) $\texttt{V}$ and $\texttt{d}$ represent the eigen-vectors and eigen-values of $\boldsymbol K$, respectively. \end{sidewaystable} \vspace{0.5cm} \textbf{2. Default rules for choosing hyper-parameters} \vspace{0.5cm} \pkg{BGLR} has built-in rules to set values of hyper-parameters. The default rules assign proper, but weakly informative, priors with prior modes chosen in a way that, a priori, they obey a variance partition of the phenotype into components attributable to the error terms and to each of the elements of the linear predictor. The user can control this variance partition by setting the argument R2 (representing the \textbf{model R-squared}) of the BGLR function to the desired value. By default the model R2 is set equal to 0.5, in which case hyper-parameters are chosen to match a variance partition where 50\% of the variance of the response is attributable to the linear predictor and 50\% to model residuals. Each of the \textbf{elements of the linear predictor} has its own R2 parameter (see last column of Table A1). If these are not provided, the \textbf{R2} attributable to each element of the linear predictor equals the R-squared of the model divided the number of elements in the linear predictor. Once the R2 parameters are set, BGLR checks whether each of the hyper-parameters have been specified and if not, the built in-rules are used to set values for these hyper-parameters. Next we briefly describe the built-in rules implemented in \pkg{BGLR}; these are based on formulas similar to those described by \citet{delosCampos:2013b} implemented using the prior mode instead of the prior mean. \textbf{Variance parameters.} The residual variance ($\sigma_{\varepsilon}^2$, $\sigma_{u_{l}}^2$), of the RKHS model, and $\sigma_{\beta}^2$, of the BRR, are assigned scaled-inverse Chi-square densities, which are indexed by a \textbf{scale} and a \textbf{degree of freedom} parameter. By default, if degree of freedom parameter is not specified, these are set equal to 5 (this gives a relatively un-informative scaled-inverse Chi-square and guarantees a finite prior variance) and the scale parameter is solved for to match the desired variance partition. For instance, in case of the residual variance the scale is calculated using $S_\varepsilon=var(y)\times (1-R2) \times (df_\varepsilon+2)$, this gives a prior mode for the residual variance equal to $var(y) \times (1-R2)$. Similar rules are used in case of other variance parameters. For instance, if one element of the linear predictor involves a linear regression of the form $\boldsymbol X \boldsymbol \beta$ with \texttt{model=`BRR'} then $S_\beta=var(y)\times R2 \times (df_\beta+2)/MSx$ where $MSx$ is the sum of the sample variances of the columns of $\boldsymbol X$ and R2 is the proportion of phenotypic variance a-priori assigned to that particular element of the linear predictor. The selection of the scale parameter when the model is the RKHS regression is modified relative to the above rule to account for the fact that the average diagonal value of $\boldsymbol K$ may be different than 1, specifically we choose the scale parameter according to the following formula $S_l=var(y) \times R2 \times (df_l+2)/mean(diag(\boldsymbol K))$. In models \textbf{BayesA} and \textbf{BayesB} the scale-parameter indexing the t-prior assigned to marker effects is assigned a Gamma density with rate and shape parameters $r$ and $s$, respectively. By default \pkg{BGLR} sets $s= 1.1$ and solves for the rate parameter using $r=(s-1)/S_\beta$ with $S_\beta=var(y)\times R2 \times (df_\beta+2)/MSx$, here, as before, $MSx$ represents the sum of the variances of the columns of $X$. For the \textbf{BL}, the default is to set: \texttt{type=`gamma'}, fix the shape parameter of the gamma density to 1.1 and solve for the rate parameter to match the expected proportion of variance accounted for by the corresponding element of the linear predictor, as specified by the argument R2. Specifically, we set the rate to be $(s-1)/(2 \times (1-R2)/R2 \times MSx)$. For models \textbf{BayesB} and \textbf{BayesC}, the default rule is to set $\pi_0=0.5$ and $p_0=10$. This gives a weakly informative beta prior for $\pi$ with a prior mode at 0.5. The scale and degree-of freedom parameters entering in the priors of these two models are treated as in the case of models BayesA (in the case of BayesB) and BRR (in the case of BayesC), but the rules are modified by considering that only a fraction of the markers ($\pi$) nave non-null effects; therefore, in BayesC we use $S_\beta=var(y)\times R2 \times(df_\beta+2)/MSx/\pi$ and in BayesB we set $r=(s-1)/S_\beta$ with $S_\beta=var(y)\times R2 \times(df_\beta+2)/MSx/\pi$. \end{document}