Title: | Sparse Family and Selection Index |
---|---|
Description: | Here we provide tools for the estimation of coefficients in penalized regressions when the (co)variance matrix of predictors and the covariance vector between predictors and response, are provided. These methods are extended to the context of a Selection Index (commonly used for breeding value prediction). The approaches offer opportunities such as the integration of high-throughput traits in genetic evaluations ('Lopez-Cruz et al., 2020') <doi:10.1038/s41598-020-65011-2> and solutions for training set optimization in Genomic Prediction ('Lopez-Cruz & de los Campos, 2021') <doi:10.1093/genetics/iyab030>. |
Authors: | Marco Lopez-Cruz [aut, cre], Gustavo de los Campos [aut], Paulino Perez-Rodriguez [ctb] |
Maintainer: | Marco Lopez-Cruz <[email protected]> |
License: | GPL-3 |
Version: | 1.4.1 |
Built: | 2024-11-22 06:50:20 UTC |
Source: | CRAN |
Save/read a numeric data as a fortran-formatted binary file at a defined precision (single or double).
saveBinary(X, file = paste0(tempdir(), "/file.bin"), precision.format = c("double","single"), verbose = TRUE) readBinary(file = paste0(tempdir(), "/file.bin"), rows = NULL, cols = NULL, drop = TRUE, verbose = TRUE)
saveBinary(X, file = paste0(tempdir(), "/file.bin"), precision.format = c("double","single"), verbose = TRUE) readBinary(file = paste0(tempdir(), "/file.bin"), rows = NULL, cols = NULL, drop = TRUE, verbose = TRUE)
X |
(numeric matrix) Data to save |
file |
(character) Name of the binary file to save/read |
precision.format |
(character) Either 'single' or 'double' for numeric precision and memory occupancy (4 bytes/32-bit or 8 bytes/64-bit, respectively) of the matrix to save |
rows |
(integer vector) Which rows are to be read from the file. Default |
cols |
(integer vector) Which columns are to be read from the file. Default |
drop |
Either |
verbose |
|
Function 'saveBinary' does not return any value but print a description of the file saved.
Function 'readBinary' returns the data that was read.
require(SFSI) # A numeric matrix X = matrix(rnorm(500*100), ncol=100) # Save matrix as double-precision filename1 = paste0(tempdir(),"/Matrix1.bin") saveBinary(X, filename1, precision.format="double") # Save matrix as single-precision filename2 = paste0(tempdir(),"/Matrix2.bin") saveBinary(X, filename2, precision.format="single") # Read the double-precision matrix X2 = readBinary(filename1) max(abs(X-X2)) # No loss of precision file.info(filename1)$size # Size of the file # Read the single-precision matrix X3 = readBinary(filename2) max(abs(X-X3)) # Loss of precision file.info(filename2)$size # But smaller-size file # Read specific rows and columns rows = c(2,4,5,8,10) cols = c(1,2,5) (X2 = readBinary(filename1, rows=rows, cols=cols)) # Equal to: X[rows,cols]
require(SFSI) # A numeric matrix X = matrix(rnorm(500*100), ncol=100) # Save matrix as double-precision filename1 = paste0(tempdir(),"/Matrix1.bin") saveBinary(X, filename1, precision.format="double") # Save matrix as single-precision filename2 = paste0(tempdir(),"/Matrix2.bin") saveBinary(X, filename2, precision.format="single") # Read the double-precision matrix X2 = readBinary(filename1) max(abs(X-X2)) # No loss of precision file.info(filename1)$size # Size of the file # Read the single-precision matrix X3 = readBinary(filename2) max(abs(X-X3)) # Loss of precision file.info(filename2)$size # But smaller-size file # Read specific rows and columns rows = c(2,4,5,8,10) cols = c(1,2,5) (X2 = readBinary(filename1, rows=rows, cols=cols)) # Equal to: X[rows,cols]
Transformation into correlation matrix or distance matrix from a covariance matrix
cov2dist(A, a = 1, inplace = FALSE) cov2cor2(A, a = 1, inplace = FALSE)
cov2dist(A, a = 1, inplace = FALSE) cov2cor2(A, a = 1, inplace = FALSE)
A |
(numeric matrix) Variance-covariance matrix |
inplace |
|
a |
(numeric) A number to multiply the whole resulting matrix by. Default |
For any variables Xi and Xj with mean zero and with sample vectors xi = (xi1,...,xin)' and xj = (xj1,...,xjn)' , their (sample) variances are equal (up-to a constant) to their cross-products, this is, var(Xi) = x'ixi and var(Xj) = x'jxj. Likewise, the covariance is cov(Xi,Xj) = x'ixj.
Distance.The Euclidean distance d(Xi,Xj) between the variables expressed in terms of cross-products is
d(Xi,Xj) = (x'ixi + x'jxj - 2x'ixj)1/2
Therefore, the output distance matrix will contain as off-diagonal entries
d(Xi,Xj) = (var(Xi) + var(Xj) - 2cov(Xi,Xj))1/2
while in the diagonal, the distance between one variable with itself is d(Xi,Xi) = 0
Correlation.The correlation between the variables is obtained from variances and covariances as
cor(Xi,Xj) = cov(Xi,Xj)/(sd(Xi)sd(Xj))
where sd(Xi)=sqrt(var(Xi)); while in the diagonal, the correlation between one variable with itself is cor(Xi,Xi) = 1
Variances are obtained from the diagonal values while covariances are obtained from the out-diagonal.
Function 'cov2dist' returns a matrix containing the Euclidean distances. Function 'cov2cor2' returns a correlation matrix
require(SFSI) data(wheatHTP) X = scale(M)[1:100,]/sqrt(ncol(M)) A = tcrossprod(X) # A 100x100 covariance matrix # Covariance matrix to distance matrix D = cov2dist(A) # (it must equal (but faster) to:) D0 = as.matrix(dist(X)) max(D-D0) # Covariance to a correlation matrix R = cov2cor2(A) # (it must equal (but faster) to:) R0 = cov2cor(A) max(R-R0) # Inplace calculation A[1:5,1:5] cov2dist(A, inplace=TRUE) A[1:5,1:5] # notice that A was modified
require(SFSI) data(wheatHTP) X = scale(M)[1:100,]/sqrt(ncol(M)) A = tcrossprod(X) # A 100x100 covariance matrix # Covariance matrix to distance matrix D = cov2dist(A) # (it must equal (but faster) to:) D0 = as.matrix(dist(X)) max(D-D0) # Covariance to a correlation matrix R = cov2cor2(A) # (it must equal (but faster) to:) R0 = cov2cor(A) max(R-R0) # Inplace calculation A[1:5,1:5] cov2dist(A, inplace=TRUE) A[1:5,1:5] # notice that A was modified
Computes the entire Elastic-Net solution for the regression coefficients for all values of the penalization parameter, via the Coordinate Descent (CD) algorithm (Friedman, 2007). It uses as inputs a variance matrix among predictors and a covariance vector between response and predictors
solveEN(Sigma, Gamma, alpha = 1, lambda = NULL, nlambda = 100, lambda.min = .Machine$double.eps^0.5, lambda.max = NULL, common.lambda = TRUE, beta0 = NULL, nsup.max = NULL, scale = TRUE, sdx = NULL, tol = 1E-5, maxiter = 1000, mc.cores = 1L, save.at = NULL, fileID = NULL, precision.format = c("double","single"), sparse = FALSE, eps = .Machine$double.eps*100, verbose = FALSE)
solveEN(Sigma, Gamma, alpha = 1, lambda = NULL, nlambda = 100, lambda.min = .Machine$double.eps^0.5, lambda.max = NULL, common.lambda = TRUE, beta0 = NULL, nsup.max = NULL, scale = TRUE, sdx = NULL, tol = 1E-5, maxiter = 1000, mc.cores = 1L, save.at = NULL, fileID = NULL, precision.format = c("double","single"), sparse = FALSE, eps = .Machine$double.eps*100, verbose = FALSE)
Sigma |
(numeric matrix) Variance-covariance matrix of predictors |
Gamma |
(numeric matrix) Covariance between response variable and predictors. If it contains more than one column, the algorithm is applied to each column separately as different response variables |
lambda |
(numeric vector) Penalization parameter sequence. Default is max(abs(Gamma)/alpha) to a minimum equal to zero. If |
nlambda |
(integer) Number of lambdas generated when |
lambda.min , lambda.max
|
(numeric) Minimum and maximum value of lambda that are generated when |
common.lambda |
|
beta0 |
(numeric vector) Initial value for the regression coefficients that will be updated. If |
alpha |
(numeric) Value between 0 and 1 for the weights given to the L1 and L2-penalties |
scale |
|
sdx |
(numeric vector) Scaling factor that will be used to scale the regression coefficients. When |
tol |
(numeric) Maximum error between two consecutive solutions of the CD algorithm to declare convergence |
maxiter |
(integer) Maximum number of iterations to run the CD algorithm at each lambda step before convergence is reached |
nsup.max |
(integer) Maximum number of non-zero coefficients in the last solution.
Default |
mc.cores |
(integer) Number of cores used. When |
save.at |
(character) Path where regression coefficients are to be saved (this may include a prefix added to the files). Default |
fileID |
(character) Suffix added to the file name where regression coefficients are to be saved. Default |
precision.format |
(character) Either 'single' or 'double' for numeric precision and memory occupancy (4 or 8 bytes, respectively) of the regression coefficients. This is only used when |
sparse |
|
eps |
(numeric) A numerical zero to determine if entries are near-zero. Default is the machine precision |
verbose |
|
Finds solutions for the regression coefficients in a linear model
yi = x'iβ + ei
where
yi is the response for the ith observation,
xi = (xi1,...,xip)'
is a vector of predictors assumed to have unit variance,
β = (β1,...,βp)'
is a vector of regression coefficients, and
ei
is a residual.
The regression coefficients β are estimated as function of the variance matrix among predictors (Σ) and the covariance vector between response and predictors (Γ) by minimizing the penalized mean squared error function
-Γ' β + 1/2 β' Σ β + λ J(β)
where λ is the penalization parameter and J(β) is a penalty function given by
1/2(1-α)||β||22 + α||β||1
where 0 ≤ α ≤ 1, and ||β||1 = ∑j=1|βj| and ||β||22 = ∑j=1βj2 are the L1 and (squared) L2-norms, respectively.
The "partial residual" excluding the contribution of the predictor xij is
ei(j) = yi - x'iβ + xijβj
then the ordinary least-squares (OLS) coefficient of xij on this residual is (up-to a constant)
βj(ols) = Γj - Σ'jβ + βj
where Γj is the jth element of Γ and Σj is the jth column of the matrix Σ.
Coefficients are updated for each from their current value
βj
to a new value
βj(α,λ),
given α and
λ,
by "soft-thresholding" their OLS estimate until convergence as fully described in Friedman (2007).
Returns a list object containing the elements:
lambda
: (vector) all the sequence of values of the penalty.
beta
: (matrix) regression coefficients for each predictor (in rows) associated to each value of the penalization parameter lambda (in columns).
nsup
: (vector) number of non-zero predictors associated to each value of lambda.
The returned object is of the class 'LASSO' for which methods coef
and fitted
exist. Function 'path.plot' can be also used
Friedman J, Hastie T, Höfling H, Tibshirani R (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.
require(SFSI) data(wheatHTP) y = as.vector(Y[,"E1"]) # Response variable X = scale(X_E1) # Predictors # Training and testing sets tst = which(Y$trial %in% 1:10) trn = seq_along(y)[-tst] # Calculate covariances in training set XtX = var(X[trn,]) Xty = cov(X[trn,],y[trn]) # Run the penalized regression fm = solveEN(XtX,Xty,alpha=0.5,nlambda=100) # Regression coefficients dim(coef(fm)) dim(coef(fm, ilambda=50)) # Coefficients associated to the 50th lambda dim(coef(fm, nsup=25)) # Coefficients with around nsup=25 are non-zero # Predicted values yHat1 = predict(fm, X=X[trn,]) # training data yHat2 = predict(fm, X=X[tst,]) # testing data # Penalization vs correlation plot(-log(fm$lambda[-1]),cor(y[trn],yHat1[,-1]), main="training", type="l") plot(-log(fm$lambda[-1]),cor(y[tst],yHat2[,-1]), main="testing", type="l")
require(SFSI) data(wheatHTP) y = as.vector(Y[,"E1"]) # Response variable X = scale(X_E1) # Predictors # Training and testing sets tst = which(Y$trial %in% 1:10) trn = seq_along(y)[-tst] # Calculate covariances in training set XtX = var(X[trn,]) Xty = cov(X[trn,],y[trn]) # Run the penalized regression fm = solveEN(XtX,Xty,alpha=0.5,nlambda=100) # Regression coefficients dim(coef(fm)) dim(coef(fm, ilambda=50)) # Coefficients associated to the 50th lambda dim(coef(fm, nsup=25)) # Coefficients with around nsup=25 are non-zero # Predicted values yHat1 = predict(fm, X=X[trn,]) # training data yHat2 = predict(fm, X=X[tst,]) # testing data # Penalization vs correlation plot(-log(fm$lambda[-1]),cor(y[trn],yHat1[,-1]), main="training", type="l") plot(-log(fm$lambda[-1]),cor(y[tst],yHat2[,-1]), main="testing", type="l")
Computes the entire LASSO solution for the regression coefficients, starting from zero, to the least-squares estimates, via the Least Angle Regression (LARS) algorithm (Efron, 2004). It uses as inputs a variance matrix among predictors and a covariance vector between response and predictors.
LARS(Sigma, Gamma, method = c("LAR","LASSO"), nsup.max = NULL, steps.max = NULL, eps = .Machine$double.eps*100, scale = TRUE, sdx = NULL, mc.cores = 1L, save.at = NULL, precision.format = c("double","single"), fileID = NULL, verbose = 1)
LARS(Sigma, Gamma, method = c("LAR","LASSO"), nsup.max = NULL, steps.max = NULL, eps = .Machine$double.eps*100, scale = TRUE, sdx = NULL, mc.cores = 1L, save.at = NULL, precision.format = c("double","single"), fileID = NULL, verbose = 1)
Sigma |
(numeric matrix) Variance-covariance matrix of predictors |
Gamma |
(numeric matrix) Covariance between response variable and predictors. If it contains more than one column, the algorithm is applied to each column separately as different response variables |
method |
(character) Either:
Default is |
nsup.max |
(integer) Maximum number of non-zero coefficients in the last LARS solution.
Default |
steps.max |
(integer) Maximum number of steps (i.e., solutions) to be computed. Default |
eps |
(numeric) A numerical zero. Default is the machine precision |
scale |
|
sdx |
(numeric vector) Scaling factor that will be used to scale the regression coefficients. When |
mc.cores |
(integer) Number of cores used. When |
save.at |
(character) Path where regression coefficients are to be saved (this may include a prefix added to the files). Default |
fileID |
(character) Suffix added to the file name where regression coefficients are to be saved. Default |
precision.format |
(character) Either 'single' or 'double' for numeric precision and memory occupancy (4 or 8 bytes, respectively) of the regression coefficients. This is only used when |
verbose |
If numeric greater than zero details on each LARS step will be printed |
Finds solutions for the regression coefficients in a linear model
yi = x'iβ + ei
where
yi is the response for the ith observation,
xi = (xi1,...,xip)'
is a vector of predictors assumed to have unit variance,
β = (β1,...,βp)'
is a vector of regression coefficients, and
ei
is a residual.
The regression coefficients β are estimated as function of the variance matrix among predictors (Σ) and the covariance vector between response and predictors (Γ) by minimizing the penalized mean squared error function
-Γ' β + 1/2 β'Σβ + 1/2 λ ||β||1
where λ is the penalization parameter and ||β||1 = ∑j=1|βj| is the L1-norm.
The algorithm to find solutions for each βj is fully described in Efron (2004) in which the "current correlation" between the predictor xij and the residual ei = yi - x'iβ is expressed (up-to a constant) as
rj = Γj - Σ'jβ
where Γj is the jth element of Γ and Σj is the jth column of the matrix Σ
Returns a list object with the following elements:
lambda
: (vector) all the sequence of values of the LASSO penalty.
beta
: (matrix) regression coefficients for each predictor (in rows) associated to each value of the penalization parameter lambda (in columns).
nsup
: (vector) number of non-zero predictors associated to each value of lambda.
The returned object is of the class 'LASSO' for which methods coef
and predict
exist. Function 'path.plot' can be also used
Adapted from the 'lars' function in package 'lars' (Hastie & Efron, 2013)
Efron B, Hastie T, Johnstone I, Tibshirani R (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.
Hastie T, Efron B (2013). lars: least angle regression, Lasso and forward stagewise. https://cran.r-project.org/package=lars.
require(SFSI) data(wheatHTP) y = as.vector(Y[,"E1"]) # Response variable X = scale(X_E1) # Predictors # Training and testing sets tst = which(Y$trial %in% 1:10) trn = seq_along(y)[-tst] # Calculate covariances in training set XtX = var(X[trn,]) Xty = cov(X[trn,],y[trn]) # Run the penalized regression fm = LARS(XtX, Xty, method="LASSO") # Regression coefficients dim(coef(fm)) dim(coef(fm, ilambda=50)) # Coefficients associated to the 50th lambda dim(coef(fm, nsup=25)) # Coefficients with around nsup=25 are non-zero # Predicted values yHat1 = predict(fm, X=X[trn,]) # training data yHat2 = predict(fm, X=X[tst,]) # testing data # Penalization vs correlation plot(-log(fm$lambda[-1]),cor(y[trn],yHat1[,-1]), main="Training", type="l") plot(-log(fm$lambda[-1]),cor(y[tst],yHat2[,-1]), main="Testing", type="l")
require(SFSI) data(wheatHTP) y = as.vector(Y[,"E1"]) # Response variable X = scale(X_E1) # Predictors # Training and testing sets tst = which(Y$trial %in% 1:10) trn = seq_along(y)[-tst] # Calculate covariances in training set XtX = var(X[trn,]) Xty = cov(X[trn,],y[trn]) # Run the penalized regression fm = LARS(XtX, Xty, method="LASSO") # Regression coefficients dim(coef(fm)) dim(coef(fm, ilambda=50)) # Coefficients associated to the 50th lambda dim(coef(fm, nsup=25)) # Coefficients with around nsup=25 are non-zero # Predicted values yHat1 = predict(fm, X=X[trn,]) # training data yHat2 = predict(fm, X=X[tst,]) # testing data # Penalization vs correlation plot(-log(fm$lambda[-1]),cor(y[trn],yHat1[,-1]), main="Training", type="l") plot(-log(fm$lambda[-1]),cor(y[tst],yHat2[,-1]), main="Testing", type="l")
Computes the entire Elastic-Net solution for the regression coefficients of a Selection Index for a grid of values of the penalization parameter.
An optimal penalization can be chosen using cross-validation (CV) within a specific training set.
SGP(y = NULL, X = NULL, b = NULL, Z = NULL, K = NULL, trn = NULL, tst = NULL, varU = NULL, varE = NULL, ID_geno = NULL, ID_trait = NULL, intercept = TRUE, alpha = 1, lambda = NULL, nlambda = 100, lambda.min = .Machine$double.eps^0.5, common.lambda = TRUE, subset = NULL, tol = 1E-4, maxiter = 500, method = c("REML","ML"), tag = NULL, save.at = NULL, precision.format = c("single","double"), mc.cores = 1L, verbose = 2) SGP.CV(y, X = NULL, b = NULL, Z = NULL, K, trn = NULL, varU = NULL, varE = NULL, ID_geno = NULL, ID_trait = NULL, intercept = TRUE, alpha = 1, lambda = NULL, nlambda = 100, lambda.min = .Machine$double.eps^0.5, common.lambda = TRUE, nfolds = 5, nCV = 1L, folds = NULL, seed = NULL, subset = NULL, tol = 1E-4, maxiter = 500, method = c("REML","ML"), tag = NULL, save.at = NULL, mc.cores = 1L, verbose = TRUE)
SGP(y = NULL, X = NULL, b = NULL, Z = NULL, K = NULL, trn = NULL, tst = NULL, varU = NULL, varE = NULL, ID_geno = NULL, ID_trait = NULL, intercept = TRUE, alpha = 1, lambda = NULL, nlambda = 100, lambda.min = .Machine$double.eps^0.5, common.lambda = TRUE, subset = NULL, tol = 1E-4, maxiter = 500, method = c("REML","ML"), tag = NULL, save.at = NULL, precision.format = c("single","double"), mc.cores = 1L, verbose = 2) SGP.CV(y, X = NULL, b = NULL, Z = NULL, K, trn = NULL, varU = NULL, varE = NULL, ID_geno = NULL, ID_trait = NULL, intercept = TRUE, alpha = 1, lambda = NULL, nlambda = 100, lambda.min = .Machine$double.eps^0.5, common.lambda = TRUE, nfolds = 5, nCV = 1L, folds = NULL, seed = NULL, subset = NULL, tol = 1E-4, maxiter = 500, method = c("REML","ML"), tag = NULL, save.at = NULL, mc.cores = 1L, verbose = TRUE)
y |
(numeric vector) Response variable. It can be a matrix with each column representing a different response variable. If it is passed to the 'SGP' function, predicted values for testing data are computed using phenotypes from training data |
X |
(numeric matrix) Design matrix for fixed effects. When |
b |
(numeric vector) Fixed effects. When |
K |
(numeric matrix) Kinship relationship matrix |
Z |
(numeric matrix) Design matrix for random effects. When |
varU , varE
|
(numeric) Genetic and residual variances. When either |
ID_geno |
(character/integer) For multi-trait analysis only, vector with either names or indices mapping entries of the vector |
ID_trait |
(character/integer) For multi-trait analysis only, vector with either names or indices mapping entries of the vector |
intercept |
|
trn , tst
|
(integer vector) Which elements from vector |
subset |
(integer vector) |
alpha |
(numeric) Value between 0 and 1 for the weights given to the L1 and L2-penalties |
lambda |
(numeric vector) Penalization parameter sequence. Default is max(abs(G[trn,tst])/alpha) to a minimum equal to zero. If |
nlambda |
(integer) Number of lambdas generated when |
lambda.min |
(numeric) Minimum value of lambda in the generated grid when |
nfolds |
(integer/character) Either 2,3,5,10 or 'n' indicating the number of non-overlapping folds in which the data is split into to do cross-validation. When |
seed |
(numeric vector) Seed to fix randomization when creating folds for cross-validation. If it is a vector, a number equal to its length of CV repetitions are performed |
nCV |
(integer) Number of CV repetitions to be performed. Default is |
folds |
(integer matrix) A matrix with |
common.lambda |
|
mc.cores |
(integer) Number of cores used. When |
tol |
(numeric) Maximum error between two consecutive solutions of the CD algorithm to declare convergence |
maxiter |
(integer) Maximum number of iterations to run the CD algorithm at each lambda step before convergence is reached |
save.at |
(character) Path where files (regression coefficients and output object) are to be saved (this may include a prefix added to the files). Default |
precision.format |
(character) Either 'single' or 'double' for numeric precision and memory occupancy (4 or 8 bytes, respectively) of the regression coefficients. This is only used when |
method |
(character) Either 'REML' (Restricted Maximum Likelihood) or 'ML' (Maximum Likelihood) to calculate variance components as per the function 'fitBLUP' |
tag |
(character) Name given to the output for tagging purposes. Default |
verbose |
(integer) If greater than zero analysis details will be printed |
The basic linear mixed model that relates phenotypes with genetic values is of the form
y = X b + Z g + e
where y is a vector with the response, b is the vector of fixed effects, g is the vector of the genetic values of the genotypes, e is the vector of environmental residuals, and X and Z are design matrices for the fixed and genetic effects, respectively. Genetic effects are assumed to follow a Normal distribution as g ~ N(0,σ2uK), and environmental terms are assumed e ~ N(0,σ2eI).
The resulting vector of genetic values u = Z g will therefore follow u ~ N(0,σ2uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence u = g and G = K.
The values utst = {ui}, i = 1,2,...,ntst, for a testing set are estimated individual-wise using (as predictors) all available observations in a training set as
ui = β'i (ytrn - Xtrnb)
where βi is a vector of weights that are found separately for each individual in the testing set, by minimizing the penalized mean squared error function
-σ2uG'trn,tst(i)βi + 1/2 β'i(σ2uGtrn + σ2eI)βi + λ J(βi)
where Gtrn,tst(i) is the ith column of the sub-matrix of G whose rows correspond to the training set and columns to the testing set; Gtrn is the sub-matrix corresponding to the training set; λ is the penalization parameter; and J(βi) is a penalty function given by
1/2(1-α)||βi||22 + α||βi||1
where 0 ≤ α ≤ 1, and ||βi||1 = ∑j=1|βij| and ||βi||22 = ∑j=1βij2 are the L1 and (squared) L2-norms, respectively.
Function 'SGP' calculates each individual solution using the function 'solveEN' (via the Coordinate Descent algorithm, see help(solveEN)
) by setting the argument Sigma
equal to
σ2uGtrn + σ2eI
and Gamma
equal to
σ2uGtrn,tst(i).
Function 'SGP.CV' performs cross-validation within the training data specified in argument trn
. Training data is divided into folds and the SGP is sequentially derived for (all individuals in) one fold (as testing set) using information from the remaining folds (as training set).
Function 'SGP' returns a list object of the class 'SGP' for which methods coef
, predict
, plot
, and summary
exist. Functions 'net.plot' and 'path.plot' can be also used. It contains the elements:
b
: (vector) fixed effects solutions (including the intercept).
Xb
: (vector) total fixed values (X b).
u
: (matrix) total genetic values (u = Z g)
for testing individuals (in rows) associated to each value of lambda (in columns).
varU
, varE
, h2
: variance components solutions.
alpha
: value for the elastic-net weights used.
lambda
: (matrix) sequence of values of lambda used (in columns) for each testing individual (in rows).
nsup
: (matrix) number of non-zero predictors at each solution given by lambda for each testing individual (in rows).
file_beta
: path where regression coefficients are saved.
Function 'SGP.CV' returns a list object of length nCV
of the class 'SGP'. Optimal cross-validated penalization values can be obtained using thesummary
method. Method plot
is also available.
require(SFSI) data(wheatHTP) index = which(Y$trial %in% 1:12) # Use only a subset of data Y = Y[index,] M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers G = tcrossprod(M) # Genomic relationship matrix Y0 = scale(as.matrix(Y[,4:6])) # Response variable #--------------------------------------------------- # Single-trait model #--------------------------------------------------- y = Y0[,1] # Training and testing sets tst = which(Y$trial %in% 2:3) trn = seq_along(y)[-tst] # Sparse genomic prediction fm1 = SGP(y, K=G, trn=trn, tst=tst) uHat = predict(fm1) # Predicted values for each testing element out = summary(fm1) # Useful function to get results corTST = out$accuracy # Testing set accuracy (correlation cor(y,yHat)) out$optCOR # SGP with maximum accuracy B = coef(fm1) # Regression coefficients for all tst B = coef(fm1, iy=1) # Coefficients for first tst (tst[1]) B = coef(fm1, ilambda=10) # Coefficients associated to the 10th lambda B = coef(fm1, nsup=10) # Coefficients for which nsup=10 plot(fm1) # Penalization vs accuracy plot plot(fm1, y.stat="MSE", ylab='Mean Square Error', xlab='Sparsity') varU = fm1$varU varE = fm1$varE b = fm1$b #--------------------------------------------------- # Predicting a testing set using a value of lambda # obtained from cross-validation in a traning set #--------------------------------------------------- # Run a cross validation in training set fm2 = SGP.CV(y, K=G, varU=varU, varE=varE, b=b, trn=trn, nfolds=5, tag="1 5CV") lambda = summary(fm2)$optCOR["lambda"] # Fit the index with the obtained lambda fm3 = SGP(y, K=G, varU=varU, varE=varE, b=b, trn=trn, tst=tst, lambda=lambda) summary(fm3)$accuracy # Testing set accuracy # Compare the accuracy with that of the non-sparse index (G-BLUP) summary(fm1)$accuracy[fm1$nlambda,1] # we take the last one #--------------------------------------------------- # Multi-trait SGP #--------------------------------------------------- ID_geno = as.vector(row(Y0)) ID_trait = as.vector(col(Y0)) y = as.vector(Y0) # Training and testing sets tst = c(which(ID_trait==1)[Y$trial %in% 2:3], which(ID_trait==2)[Y$trial %in% 2], which(ID_trait==3)[Y$trial %in% 3]) trn = seq_along(y)[-tst] fmMT = SGP(y=y, K=G, ID_geno=ID_geno, ID_trait=ID_trait, trn=trn, tst=tst) multitrait.plot(fmMT)
require(SFSI) data(wheatHTP) index = which(Y$trial %in% 1:12) # Use only a subset of data Y = Y[index,] M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers G = tcrossprod(M) # Genomic relationship matrix Y0 = scale(as.matrix(Y[,4:6])) # Response variable #--------------------------------------------------- # Single-trait model #--------------------------------------------------- y = Y0[,1] # Training and testing sets tst = which(Y$trial %in% 2:3) trn = seq_along(y)[-tst] # Sparse genomic prediction fm1 = SGP(y, K=G, trn=trn, tst=tst) uHat = predict(fm1) # Predicted values for each testing element out = summary(fm1) # Useful function to get results corTST = out$accuracy # Testing set accuracy (correlation cor(y,yHat)) out$optCOR # SGP with maximum accuracy B = coef(fm1) # Regression coefficients for all tst B = coef(fm1, iy=1) # Coefficients for first tst (tst[1]) B = coef(fm1, ilambda=10) # Coefficients associated to the 10th lambda B = coef(fm1, nsup=10) # Coefficients for which nsup=10 plot(fm1) # Penalization vs accuracy plot plot(fm1, y.stat="MSE", ylab='Mean Square Error', xlab='Sparsity') varU = fm1$varU varE = fm1$varE b = fm1$b #--------------------------------------------------- # Predicting a testing set using a value of lambda # obtained from cross-validation in a traning set #--------------------------------------------------- # Run a cross validation in training set fm2 = SGP.CV(y, K=G, varU=varU, varE=varE, b=b, trn=trn, nfolds=5, tag="1 5CV") lambda = summary(fm2)$optCOR["lambda"] # Fit the index with the obtained lambda fm3 = SGP(y, K=G, varU=varU, varE=varE, b=b, trn=trn, tst=tst, lambda=lambda) summary(fm3)$accuracy # Testing set accuracy # Compare the accuracy with that of the non-sparse index (G-BLUP) summary(fm1)$accuracy[fm1$nlambda,1] # we take the last one #--------------------------------------------------- # Multi-trait SGP #--------------------------------------------------- ID_geno = as.vector(row(Y0)) ID_trait = as.vector(col(Y0)) y = as.vector(Y0) # Training and testing sets tst = c(which(ID_trait==1)[Y$trial %in% 2:3], which(ID_trait==2)[Y$trial %in% 2], which(ID_trait==3)[Y$trial %in% 3]) trn = seq_along(y)[-tst] fmMT = SGP(y=y, K=G, ID_geno=ID_geno, ID_trait=ID_trait, trn=trn, tst=tst) multitrait.plot(fmMT)
Solves the Linear Mixed Model and calculates the Best Linear Unbiased Predictor (BLUP)
fitBLUP(y, X = NULL, Z = NULL, K = NULL, trn = NULL, EVD = NULL, varU = NULL, varE = NULL, ID_geno = NULL, ID_trait = NULL, intercept = TRUE, BLUP = TRUE, method = c("REML","ML"), interval = c(1E-9,1E9), tol = 1E-8, maxiter = 1000, n.regions = 10, verbose = TRUE)
fitBLUP(y, X = NULL, Z = NULL, K = NULL, trn = NULL, EVD = NULL, varU = NULL, varE = NULL, ID_geno = NULL, ID_trait = NULL, intercept = TRUE, BLUP = TRUE, method = c("REML","ML"), interval = c(1E-9,1E9), tol = 1E-8, maxiter = 1000, n.regions = 10, verbose = TRUE)
y |
(numeric vector) Response variable. It can be a matrix with each column representing a different response variable |
X |
(numeric matrix) Design matrix for fixed effects. When |
Z |
(numeric matrix) Design matrix for random effects. When |
K |
(numeric matrix) Kinship relationship matrix |
trn |
(integer vector) Which elements from vector |
EVD |
(list) Eigenvectors and eigenvalues from eigenvalue decomposition (EVD) of G corresponding to training data |
ID_geno |
(character/integer) For within-trait analysis only, vector with either names or indices mapping entries of the vector |
ID_trait |
(character/integer) For within-trait analysis only, vector with either names or indices mapping entries of the vector |
varU , varE
|
(numeric) Genetic and residual variances. When both |
intercept |
|
BLUP |
|
method |
(character) Either 'REML' (Restricted Maximum Likelihood) or 'ML' (Maximum Likelihood) |
tol |
(numeric) Maximum error between two consecutive solutions (convergence tolerance) when finding the root of the log-likelihood's first derivative |
maxiter |
(integer) Maximum number of iterations to run before convergence is reached |
interval |
(numeric vector) Range of values in which the root is searched |
n.regions |
(numeric) Number of regions in which the searched 'interval' is divided for local optimization |
verbose |
|
The basic linear mixed model that relates phenotypes with genetic values is of the form
y = X b + Z g + e
where y is a vector with the response, b is the vector of fixed effects, u is the vector of the (random) genetic effects of the genotypes, e is the vector of environmental residuals (random error), and X and Z are design matrices for the fixed and genetic effects, respectively. Genetic effects are assumed to follow a Normal distribution as g ~ N(0,σ2uK), and the error terms are assumed e ~ N(0,σ2eI).
The vector of total genetic values u = Z g will therefore follow u ~ N(0,σ2uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence u = g and G = K.
The predicted values utrn = {ui}, i = 1,2,...,ntrn, corresponding to observed data (training set) are derived as
utrn = B (ytrn - Xtrnb)
where B is a matrix of weights given by
B = σ2uGtrn (σ2uGtrn + σ2eI)-1
where Gtrn is the sub-matrix corresponding to the training set. This matrix can be rewritten as
B = Gtrn (Gtrn + θI)-1
where θ = σ2e/σ2u is the residual/genetic variances ratio representing a ridge-like shrinkage parameter.
The matrix H = Gtrn + θI in the above equation can be used to obtain predictions corresponding to un-observed data (testing set), utst = {ui}, i = 1,2,...,ntst, by
B = Gtst,trn (Gtrn + θI)-1
where Gtst,trn is the sub-matrix of G corresponding to the testing set (in rows) and training set (in columns).
Solutions are found using the GEMMA (Genome-wide Efficient Mixed Model Analysis) approach (Zhou & Stephens, 2012). First, the Brent's method is implemented to solve for the genetic/residual variances ratio (i.e., 1/θ) from the first derivative of the log-likelihood (either REML or ML). Then, variances σ2u and σ2e are calculated. Finally, b is obtained using Generalized Least Squares.
Returns a list object that contains the elements:
b
: (vector) fixed effects solutions (including the intercept).
u
: (vector) total genetic values (u = Z g).
g
: (vector) genetic effects solutions.
varU
: random effect variance.
varE
: residual variance.
h2
: heritability.
convergence
: (logical) whether Brent's method converged.
method
: either 'REML' or 'ML' method used.
Zhou X, Stephens M (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genetics, 44(7), 821-824
require(SFSI) data(wheatHTP) index = which(Y$trial %in% 1:20) # Use only a subset of data Y = Y[index,] M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers G = tcrossprod(M) # Genomic relationship matrix Y0 = scale(as.matrix(Y[,4:6])) # Response variable #--------------------------------------------------- # Single-trait model #--------------------------------------------------- y = Y0[,1] # Training and testing sets tst = which(Y$trial %in% 1:3) trn = seq_along(y)[-tst] # Kinship-based model fm1 = fitBLUP(y, K=G, trn=trn) head(fm1$g) # Genetic effects plot(y[tst],fm1$yHat[tst]) # Predicted vs observed values in testing set cor(y[tst],fm1$yHat[tst]) # Prediction accuracy in testing set cor(y[trn],fm1$yHat[trn]) # Prediction accuracy in training set fm1$varU # Genetic variance fm1$varE # Residual variance fm1$h2 # Heritability fm1$b # Fixed effects # Markers-based model fm2 = fitBLUP(y, Z=M, trn=trn) head(fm2$g) # Marker effects all.equal(fm1$yHat, fm2$yHat) fm2$varU # Genetic variance fm2$varU*sum(apply(M,2,var)) #--------------------------------------------------- # Multiple response variables #--------------------------------------------------- ID_geno = as.vector(row(Y0)) ID_trait = as.vector(col(Y0)) y = as.vector(Y0) # Training and testing sets tst = c(which(ID_trait==1)[Y$trial %in% 1:3], which(ID_trait==2)[Y$trial %in% 1:3], which(ID_trait==3)[Y$trial %in% 1:3]) trn = seq_along(y)[-tst] # Across traits model fm3 = fitBLUP(y, K=G, ID_geno=ID_geno, trn=trn) plot(fm1$yHat,fm3$yHat[ID_trait==1]) # different from the single-trait model # Within traits model: pass an index for traits fm4 = fitBLUP(y, K=G, ID_geno=ID_geno, ID_trait=ID_trait, trn=trn) plot(fm1$yHat,fm4$yHat[ID_trait==1]) # equal to the single-trait model
require(SFSI) data(wheatHTP) index = which(Y$trial %in% 1:20) # Use only a subset of data Y = Y[index,] M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers G = tcrossprod(M) # Genomic relationship matrix Y0 = scale(as.matrix(Y[,4:6])) # Response variable #--------------------------------------------------- # Single-trait model #--------------------------------------------------- y = Y0[,1] # Training and testing sets tst = which(Y$trial %in% 1:3) trn = seq_along(y)[-tst] # Kinship-based model fm1 = fitBLUP(y, K=G, trn=trn) head(fm1$g) # Genetic effects plot(y[tst],fm1$yHat[tst]) # Predicted vs observed values in testing set cor(y[tst],fm1$yHat[tst]) # Prediction accuracy in testing set cor(y[trn],fm1$yHat[trn]) # Prediction accuracy in training set fm1$varU # Genetic variance fm1$varE # Residual variance fm1$h2 # Heritability fm1$b # Fixed effects # Markers-based model fm2 = fitBLUP(y, Z=M, trn=trn) head(fm2$g) # Marker effects all.equal(fm1$yHat, fm2$yHat) fm2$varU # Genetic variance fm2$varU*sum(apply(M,2,var)) #--------------------------------------------------- # Multiple response variables #--------------------------------------------------- ID_geno = as.vector(row(Y0)) ID_trait = as.vector(col(Y0)) y = as.vector(Y0) # Training and testing sets tst = c(which(ID_trait==1)[Y$trial %in% 1:3], which(ID_trait==2)[Y$trial %in% 1:3], which(ID_trait==3)[Y$trial %in% 1:3]) trn = seq_along(y)[-tst] # Across traits model fm3 = fitBLUP(y, K=G, ID_geno=ID_geno, trn=trn) plot(fm1$yHat,fm3$yHat[ID_trait==1]) # different from the single-trait model # Within traits model: pass an index for traits fm4 = fitBLUP(y, K=G, ID_geno=ID_geno, ID_trait=ID_trait, trn=trn) plot(fm1$yHat,fm4$yHat[ID_trait==1]) # equal to the single-trait model
Calculation of genetic and environmental (unstructured) covariances matrices from pairwise models of variables with the same experimental design
getGenCov(y, X = NULL, Z = NULL, K = NULL, trn = NULL, EVD = NULL, ID_geno, ID_trait, scale = TRUE, pairwise = TRUE, verbose = TRUE, ...)
getGenCov(y, X = NULL, Z = NULL, K = NULL, trn = NULL, EVD = NULL, ID_geno, ID_trait, scale = TRUE, pairwise = TRUE, verbose = TRUE, ...)
y |
(numeric matrix) Response variable vector. It can be a matrix with each column representing a different response variable |
X |
(numeric matrix) Design matrix for fixed effects. When |
Z |
(numeric matrix) Design matrix for random effects. When |
K |
(numeric matrix) Kinship relationship matrix |
trn |
(integer vector) Which elements from vector |
EVD |
(list) Eigenvectors and eigenvalues from eigenvalue decomposition (EVD) of G corresponding to training data |
ID_geno |
(character/integer) Vector with either names or indices mapping entries of the vector |
ID_trait |
(character/integer) Vector with either names or indices mapping entries of the vector |
scale |
|
pairwise |
|
verbose |
|
... |
Other arguments passed to the function 'fitBLUP' |
Assumes that both y1 and y2 follow the basic linear mixed model that relates phenotypes with genetic values of the form
where b1 and b2 are the specific fixed effects, g1 and g2 are the specific genetic effects of the genotypes, e1 and e2 are the vectors of specific environmental residuals, and X and Z are common design matrices for the fixed and genetic effects, respectively. Genetic effects are assumed to follow a Normal distribution as g1 ~ N(0,σ2u1K) and g2 ~ N(0,σ2u2K), and environmental terms are assumed e1 ~ N(0,σ2e1I) and e2 ~ N(0,σ2e2I).
The genetic covariance σ2u1,u2 is estimated from the formula for the variance for the sum of two variables as
where σ2u3 is the genetic variance of the variable y3 = y1 + y2 that also follows the same model as for y1 and y2.
Likewise, the environmental covariance σ2e1,e2 is estimated as
where σ2e3 is the error variance of the variable y3.
Solutions are found using the function 'fitBLUP' (see help(fitBLUP)
) to sequentialy fit mixed models for all the variables y1, y2 and
y3.
Returns a list object that contains the elements:
varU
: (vector) genetic variances.
varE
: (vector) error variances.
covU
: (vector) genetic covariances between response variable 1 and the rest.
covE
: (vector) environmental covariances between response variable 1 and the rest.
When pairwise = TRUE
, varU
and varE
are matrices containing all variances (diagonal) and pairwise covariances (off diagonal)
require(SFSI) data(wheatHTP) index = which(Y$trial %in% 1:10) # Use only a subset of data Y = Y[index,] M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers G = tcrossprod(M) # Genomic relationship matrix Y0 = scale(as.matrix(Y[,4:7])) # Response variable ID_geno = as.vector(row(Y0)) ID_trait = as.vector(col(Y0)) y = as.vector(Y0) # Pairwise covariance for all columns in y fm = getGenCov(y=y, K=G, ID_geno=ID_geno, ID_trait=ID_trait) fm$varU # Genetic variances fm$varE # Residual variances fm$varU+fm$varE # Phenotypic covariance var(Y0) # Sample phenotypic covariance # Covariances between the first and the rest: y[,1] and y[,2:4] fm = getGenCov(y=y, K=G, ID_geno=ID_geno, ID_trait=ID_trait, pairwise=FALSE) fm$covU # Genetic covariance between y[,1] and y[,2:4] # Containing some missing values: # The model is estimated from common training data YNA = Y0 YNA[Y$trial %in% 1:2, 1] = NA YNA[Y$trial %in% 2:3, 3] = NA y = as.vector(YNA) fm = getGenCov(y=y, K=G, ID_geno=ID_geno, ID_trait=ID_trait) fm$varU fm$varE
require(SFSI) data(wheatHTP) index = which(Y$trial %in% 1:10) # Use only a subset of data Y = Y[index,] M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers G = tcrossprod(M) # Genomic relationship matrix Y0 = scale(as.matrix(Y[,4:7])) # Response variable ID_geno = as.vector(row(Y0)) ID_trait = as.vector(col(Y0)) y = as.vector(Y0) # Pairwise covariance for all columns in y fm = getGenCov(y=y, K=G, ID_geno=ID_geno, ID_trait=ID_trait) fm$varU # Genetic variances fm$varE # Residual variances fm$varU+fm$varE # Phenotypic covariance var(Y0) # Sample phenotypic covariance # Covariances between the first and the rest: y[,1] and y[,2:4] fm = getGenCov(y=y, K=G, ID_geno=ID_geno, ID_trait=ID_trait, pairwise=FALSE) fm$covU # Genetic covariance between y[,1] and y[,2:4] # Containing some missing values: # The model is estimated from common training data YNA = Y0 YNA[Y$trial %in% 1:2, 1] = NA YNA[Y$trial %in% 2:3, 3] = NA y = as.vector(YNA) fm = getGenCov(y=y, K=G, ID_geno=ID_geno, ID_trait=ID_trait) fm$varU fm$varE
Accuracy as a function of the penalization plot for an object of the class 'SGP'
## S3 method for class 'SGP' plot(..., x.stat = c("nsup","lambda"), y.stat = c("accuracy","MSE"), label = x.stat, nbreaks.x = 6)
## S3 method for class 'SGP' plot(..., x.stat = c("nsup","lambda"), y.stat = c("accuracy","MSE"), label = x.stat, nbreaks.x = 6)
... |
Other arguments to be passed:
|
x.stat |
(character) Either 'nsup' (number of non-zero regression coefficients entering in the prediction of a given testing individual) or 'lambda' (penalization parameter in log scale) to plot in the x-axis |
y.stat |
(character) Either 'accuracy' (correlation between observed and predicted values) or 'MSE' (mean squared error) to plot in the y-axis |
label |
(character) Similar to |
nbreaks.x |
(integer) Number of breaks in the x-axis |
Creates a plot of either accuracy or MSE versus either the support set size (average number of predictors with non-zero regression coefficient) or versus lambda.
# See examples in # help(SGP, package="SFSI")
# See examples in # help(SGP, package="SFSI")
Create a random data partition of size n
into k
non-overlapping folds of approximately the same size
get_folds(n, k = 5L, nCV = 1L, seed = NULL)
get_folds(n, k = 5L, nCV = 1L, seed = NULL)
n |
(integer) Sample size |
k |
(integer) Number of folds |
nCV |
(integer) Number of different partitions to be created |
seed |
(integer vector) Optional seed for randomization (see |
Returns a matrix with n
rows and nCV
columns. Each column contains a partition with k
folds.
require(SFSI) # Create 5 different partitions into 10 folds # for a sample size equal to 115 out = get_folds(n=115, k=10, nCV=5) # Size of folds at first partition table(out[,1])
require(SFSI) # Create 5 different partitions into 10 folds # for a sample size equal to 115 out = get_folds(n=115, k=10, nCV=5) # Size of folds at first partition table(out[,1])
Obtain a Graphical Network representation from a matrix where nodes are subjects in the rows and columns, and edges are obtained from the matrix entries
net(object, K = NULL, nsup = NULL, p.radius = 1.7, delta = .Machine$double.eps)
net(object, K = NULL, nsup = NULL, p.radius = 1.7, delta = .Machine$double.eps)
object |
Either a numeric matrix |
K |
(numeric matrix) Kinship relationship matrix among nodes |
nsup |
(numeric) For a SGP, average number of training individuals contributing to the prediction (with non-zero regression coefficient) of testing individuals. Default |
p.radius |
(numeric) For a multi-trait SGP, a factor (x-folds radius) to separate each trait from the origin |
delta |
(numeric) Minumum value to determine nodes to be connected. Default is the machine precision (numerical zero) |
For a numeric matrix X={xij} with m
rows and n
columns, a graphical network with m
+ n
nodes is obtained by defining edges connecting subjects in rows with those in columns. An edge between subject in row i
and subject in column j
is
determined if the corresponding (absolute) entry matrix is larger than certain value, i.e., |xij|>δ.
For a symmetric matrix, only m
=n
nodes are considered with edges determined by the above diagonal entries of the matrix.
Nodes and edges are plotted in the cartesian plane according to the Fruchterman-Reingold algorithm. When a matrix K is provided, nodes are plotted according to the top 2 eigenvectors from the spectral value decomposition of K = U D U'.
When the object is a 'SGP' class object the edges are taken from the regression coefficients (associated to a specific nsup
value) are used as matrix X with testing subjects in rows and training subjects in columns.
Returns a plottable object of the class 'net' that can be visualized using 'plot' method
require(SFSI) data(wheatHTP) #-------------------------------------- # Net for an SGP object #-------------------------------------- index = which(Y$trial %in% 1:6) # Use only a subset of data Y = Y[index,] M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers G = tcrossprod(M) # Genomic relationship matrix y = as.vector(scale(Y[,"E1"])) # Scale response variable # Training and testing sets tst = which(Y$trial %in% 2) trn = seq_along(y)[-tst] fm = SGP(y, K=G, trn=trn, tst=tst) nt = net(fm) # Get the net plot(nt) # Plot the net plot(nt, i=c(1,5)) # Show the first and fifth tst elements plot(net(fm, nsup=10), show.names=c(TRUE,TRUE,FALSE)) #-------------------------------------- # Net for a numeric matrix #-------------------------------------- B = as.matrix(coef(fm, nsup=10)) plot(net(B), curve=TRUE, set.size=c(3.5,1.5,1)) #-------------------------------------- # Net for a symmetric numeric matrix #-------------------------------------- X = X_E1[,seq(1,ncol(X_E1),by=5)] R2 = cor(X)^2 # An R2 matrix plot(net(R2, delta=0.9))
require(SFSI) data(wheatHTP) #-------------------------------------- # Net for an SGP object #-------------------------------------- index = which(Y$trial %in% 1:6) # Use only a subset of data Y = Y[index,] M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers G = tcrossprod(M) # Genomic relationship matrix y = as.vector(scale(Y[,"E1"])) # Scale response variable # Training and testing sets tst = which(Y$trial %in% 2) trn = seq_along(y)[-tst] fm = SGP(y, K=G, trn=trn, tst=tst) nt = net(fm) # Get the net plot(nt) # Plot the net plot(nt, i=c(1,5)) # Show the first and fifth tst elements plot(net(fm, nsup=10), show.names=c(TRUE,TRUE,FALSE)) #-------------------------------------- # Net for a numeric matrix #-------------------------------------- B = as.matrix(coef(fm, nsup=10)) plot(net(B), curve=TRUE, set.size=c(3.5,1.5,1)) #-------------------------------------- # Net for a symmetric numeric matrix #-------------------------------------- X = X_E1[,seq(1,ncol(X_E1),by=5)] R2 = cor(X)^2 # An R2 matrix plot(net(R2, delta=0.9))
Plot a Graphical Network obtained from a numeric matrix
## S3 method for class 'net' plot(x, i = NULL, show.names = FALSE, group = NULL, group.shape = NULL, set.color = NULL, set.size = NULL, axis.labels = TRUE, curve = FALSE, bg.color = "white", unified = TRUE, ni = 36, line.color = "gray70", line.width = 0.3, legend.pos = "right", point.color = "gray20", sets = c("Testing","Supporting","Non-active"), circle = FALSE, ...)
## S3 method for class 'net' plot(x, i = NULL, show.names = FALSE, group = NULL, group.shape = NULL, set.color = NULL, set.size = NULL, axis.labels = TRUE, curve = FALSE, bg.color = "white", unified = TRUE, ni = 36, line.color = "gray70", line.width = 0.3, legend.pos = "right", point.color = "gray20", sets = c("Testing","Supporting","Non-active"), circle = FALSE, ...)
x |
An object of the 'net' class as per the |
i |
(integer vector) Index subjects in rows to be shown in plot. Default |
show.names |
|
group |
(data.frame) Column grouping for the subjects |
group.shape |
(integer vector) Shape of each level of the grouping column provided as |
bg.color |
(character) Plot background color |
line.color |
(character) Color of lines connecting nodes in rows with those in columns |
line.width |
(numeric) Width of lines connecting nodes in rows with those in columns |
curve |
|
set.color |
(character vector) Color point of each type of node: row, 'active' column, and 'non-active' column, respectively |
set.size |
(numeric vector) Size of each type of node: row, 'active' column, and 'non-active' column, respectively |
axis.labels |
|
unified |
|
point.color |
(character) Color of the points in the plot |
ni |
(integer) Maximum number of row nodes that are plotted separated as indicated by |
legend.pos |
(character) Either "right", topright","bottomleft","bottomright","topleft", or "none" indicating where the legend is positioned in the plot |
sets |
(character vector) Names of the types of node: row, 'active' column, and 'non-active' column, respectively |
circle |
|
... |
Other arguments for method |
Plot a Graphical Network from a matrix where nodes are subjects in the rows and columns, and edges are obtained from the matrix entries. This Network is obtained using net
function
# See examples in # help(net, package="SFSI")
# See examples in # help(net, package="SFSI")
Visualizing results from an object of the class 'SGP'
multitrait.plot(object, trait_names = NULL, x.stat = c("nsup","lambda"), y.stat = c("accuracy","MSE"), label = x.stat, line.color = "orange", point.color = line.color, point.size = 1.2, nbreaks.x = 6, ...)
multitrait.plot(object, trait_names = NULL, x.stat = c("nsup","lambda"), y.stat = c("accuracy","MSE"), label = x.stat, line.color = "orange", point.color = line.color, point.size = 1.2, nbreaks.x = 6, ...)
object |
An object of the class 'SGP' for a multi-trait case |
x.stat |
(character) Either 'nsup' (number of non-zero regression coefficients entering in the prediction of a given testing individual) or 'lambda' (penalization parameter in log scale) to plot in the x-axis |
y.stat |
(character) Either 'accuracy' (correlation between observed and predicted values) or 'MSE' (mean squared error) to plot in the y-axis |
label |
(character) Similar to |
point.color , line.color
|
(character) Color of the points and lines |
point.size |
(numeric) Size of the points showing the maximum accuracy |
nbreaks.x |
(integer) Number of breaks in the x-axis |
trait_names |
(character) Names of traits to be shown in the plot |
... |
Other arguments for method |
Creates a plot of either accuracy or MSE versus either the support set size (average number of predictors with non-zero regression coefficient) or versus lambda. This is done separately for each trait
# See examples in # help(SGP, package="SFSI")
# See examples in # help(SGP, package="SFSI")
Pruning features using an R-squared threshold and maximum distance
Prune(X, alpha = 0.95, pos = NULL, d.max = NULL, centered = FALSE, scaled = FALSE, verbose = FALSE)
Prune(X, alpha = 0.95, pos = NULL, d.max = NULL, centered = FALSE, scaled = FALSE, verbose = FALSE)
X |
(numeric matrix) A matrix with observations in rows and features (e.g., SNPs) in columns |
alpha |
(numeric) R-squared threshold used to determine connected sets |
pos |
(numeric vector) Optional vector with positions (e.g., bp) of features |
d.max |
(numeric) Maximum distance that connected sets are apart |
centered |
|
scaled |
|
verbose |
|
The algorithm identifies sets of connected features as those that share an R2 > α and retains only one feature (first appearance) for each set.
The sets can be limited to lie within a distance less or equal to a d.max
value.
Returns a list object that contains the elements:
prune.in
: (vector) indices of selected (unconnected) features.
prune.out
: (vector) indices of dropped out features.
require(SFSI) data(wheatHTP) index = c(154:156,201:205,306:312,381:387,540:544) X = M[,index] # Subset markers colnames(X) = 1:ncol(X) # See connected sets using R^2=0.8 R2thr = 0.8 R2 = cor(X)^2 nw1 = net(R2, delta=R2thr) plot(nw1, show.names=TRUE) # Get pruned features res = Prune(X, alpha=R2thr) # See selected (unconnected) features nw2 = net(R2[res$prune.in,res$prune.in], delta=R2thr) nw2$xy = nw1$xy[res$prune.in,] plot(nw2, show.names=TRUE)
require(SFSI) data(wheatHTP) index = c(154:156,201:205,306:312,381:387,540:544) X = M[,index] # Subset markers colnames(X) = 1:ncol(X) # See connected sets using R^2=0.8 R2thr = 0.8 R2 = cor(X)^2 nw1 = net(R2, delta=R2thr) plot(nw1, show.names=TRUE) # Get pruned features res = Prune(X, alpha=R2thr) # See selected (unconnected) features nw2 = net(R2[res$prune.in,res$prune.in], delta=R2thr) nw2$xy = nw1$xy[res$prune.in,] plot(nw2, show.names=TRUE)
Read the output generated by the 'SGP' or 'SGP.CV' functions saved at the provided save.at
parameter. It merges all partial files when data was splited according to argument subset
. Function 'read_summary' reads the output generated after calling the method summary
; multiple outputs are read if argument path
is a vector and in this case, they are combined by averaging across all outputs.
read_SGP(path = ".", type = NULL, file.ext = ".RData", verbose = TRUE) read_summary(path = ".", type = NULL, file.ext = ".RData", verbose = TRUE)
read_SGP(path = ".", type = NULL, file.ext = ".RData", verbose = TRUE) read_summary(path = ".", type = NULL, file.ext = ".RData", verbose = TRUE)
path |
(character) Path where output files were saved. This may include a prefix |
type |
(character) Either 'SGP' or 'SGP.CV' to collect results generated by either function of the same name |
file.ext |
(character) This is the file extension at which file was saved as per the argument |
verbose |
|
An object of the class 'SGP' for which methods predict
, plot
and summary
exist
require(SFSI) data(wheatHTP) index = which(Y$trial %in% 1:10) # Use only a subset of data Y = Y[index,] M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers G = tcrossprod(M) # Genomic relationship matrix y = as.vector(scale(Y[,"E1"])) # Scale response variable # Training and testing sets tst = which(Y$trial %in% 1:3) trn = seq_along(y)[-tst] path = paste0(tempdir(),"/testSGP_") # Run the analysis into 4 subsets and save them at a given path SGP(y, K=G, trn=trn, tst=tst, subset=c(1,4), save.at=path) SGP(y, K=G, trn=trn, tst=tst, subset=c(2,4), save.at=path) SGP(y, K=G, trn=trn, tst=tst, subset=c(3,4), save.at=path) SGP(y, K=G, trn=trn, tst=tst, subset=c(4,4), save.at=path) # Collect all results after completion fm = read_SGP(path) # Generate and save summary summary(fm, save.at=path) fm2 = read_summary(path)
require(SFSI) data(wheatHTP) index = which(Y$trial %in% 1:10) # Use only a subset of data Y = Y[index,] M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers G = tcrossprod(M) # Genomic relationship matrix y = as.vector(scale(Y[,"E1"])) # Scale response variable # Training and testing sets tst = which(Y$trial %in% 1:3) trn = seq_along(y)[-tst] path = paste0(tempdir(),"/testSGP_") # Run the analysis into 4 subsets and save them at a given path SGP(y, K=G, trn=trn, tst=tst, subset=c(1,4), save.at=path) SGP(y, K=G, trn=trn, tst=tst, subset=c(2,4), save.at=path) SGP(y, K=G, trn=trn, tst=tst, subset=c(3,4), save.at=path) SGP(y, K=G, trn=trn, tst=tst, subset=c(4,4), save.at=path) # Collect all results after completion fm = read_SGP(path) # Generate and save summary summary(fm, save.at=path) fm2 = read_summary(path)
Retrieving regression coefficients and predicted values from the 'solveEN' and 'LARS' functions' outputs
## S3 method for class 'LASSO' coef(object, ...) ## S3 method for class 'LASSO' predict(object, ...)
## S3 method for class 'LASSO' coef(object, ...) ## S3 method for class 'LASSO' predict(object, ...)
object |
An object of the class 'LASSO' returned either by the function 'LARS' or 'solveEN' |
... |
Other arguments:
|
Method coef
returns a matrix that contains the regression coefficients (in rows) associated to each value of lambda (in columns). When the regression was applied to an object Gamma
with more than one column, method coef
returns a list
Method predict
returns a matrix with predicted values
Xβ (in rows)
for each value of lambda (in columns).
# See examples in # help(solveEN, package="SFSI") # help(LARS, package="SFSI")
# See examples in # help(solveEN, package="SFSI") # help(LARS, package="SFSI")
Coefficients evolution path plot from an object of the class 'LASSO' or 'SGP'
path.plot(object, K = NULL, i = NULL, prune = FALSE, cor.max = 0.97, lambda.min = .Machine$double.eps^0.5, nbreaks.x = 6, npaths.max = 5000, ...)
path.plot(object, K = NULL, i = NULL, prune = FALSE, cor.max = 0.97, lambda.min = .Machine$double.eps^0.5, nbreaks.x = 6, npaths.max = 5000, ...)
object |
An object of the 'LASSO' or 'SGP' class |
K |
(numeric matrix) Kinship relationships. Only needed for an object of the class 'SGP' |
i |
(integer vector) Index a response variable (columns of matrix |
prune |
|
cor.max |
(numeric) Correlation threshold to prune within groups of correlated coefficients |
lambda.min |
(numeric) Minimum value of lambda to show in the plot as |
nbreaks.x |
(integer) Number of breaks in the x-axis |
npaths.max |
(integer) Maximum number of paths defined by the number of predictors times the number of columns of matrix |
... |
Other arguments for method |
Returns the plot of the coefficients' evolution path along the regularization parameter
require(SFSI) data(wheatHTP) index = which(Y$trial %in% 1:6) # Use only a subset of data Y = Y[index,] X = scale(X_E1[index,]) # Reflectance data M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers G = tcrossprod(M) # Genomic relationship matrix y = as.vector(scale(Y[,'E1'])) # Subset response variable # Sparse phenotypic regression fm = LARS(var(X),cov(X,y)) path.plot(fm) # Sparse Genomic Prediction fm = SGP(y, K=G, trn=12:length(y), tst=1:11) path.plot(fm, prune=TRUE) path.plot(fm, K=G, prune=TRUE, cor.max=0.9) # Path plot for the first individual in testing set for the SGP path.plot(fm, K=G, i=1)
require(SFSI) data(wheatHTP) index = which(Y$trial %in% 1:6) # Use only a subset of data Y = Y[index,] X = scale(X_E1[index,]) # Reflectance data M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers G = tcrossprod(M) # Genomic relationship matrix y = as.vector(scale(Y[,'E1'])) # Subset response variable # Sparse phenotypic regression fm = LARS(var(X),cov(X,y)) path.plot(fm) # Sparse Genomic Prediction fm = SGP(y, K=G, trn=12:length(y), tst=1:11) path.plot(fm, prune=TRUE) path.plot(fm, K=G, prune=TRUE, cor.max=0.9) # Path plot for the first individual in testing set for the SGP path.plot(fm, K=G, i=1)
Useful methods for retrieving and summarizing important results from the 'SGP' function's output
## S3 method for class 'SGP' coef(object, ...) ## S3 method for class 'SGP' predict(object, ...) ## S3 method for class 'SGP' summary(object, ...)
## S3 method for class 'SGP' coef(object, ...) ## S3 method for class 'SGP' predict(object, ...) ## S3 method for class 'SGP' summary(object, ...)
object |
An object of the class 'SGP' |
... |
Other arguments to be passed to
For
|
Method predict
returns a matrix with the predicted values for each individual in the testing set (in rows) for each value of lambda (in columns).
Method coef
(list of matrices) returns the regression coefficients for each testing set individual (elements of the list). Each matrix contains the coefficients for each value of lambda (in rows) associated to each training set individual (in columns).
Method summary
returns a list object containing:
lambda
: (vector) sequence of values of lambda used in the coefficients' estimation.
nsup
: (vector) Number of non-zero coefficients (across testing individuals) at each solution associated to each value of lambda.
accuracy
: (vector) correlation between observed and predicted values associated to each value of lambda.
MSE
: (vector) mean squared error associated to each value of lambda.
optCOR
: (vector) summary of the optimal SGP with maximum accuracy.
optMSE
: (vector) summary of the optimal SGP with minimum MSE.
# See examples in # help(SGP, package="SFSI")
# See examples in # help(SGP, package="SFSI")
The dataset consists of 1,092 inbred wheat lines grouped into 39 trials and grown during the 2013-2014 season at the Norman Borlaug experimental research station in Ciudad Obregon, Sonora, Mexico. Each trial consisted of 28 breeding lines that were arranged in an alpha-lattice design with three replicates and six sub-blocks. The trials were grown in four different environments:
E1: Flat-Drought (sowing in flat with irrigation of 180 mm through drip system)
E2: Bed-2IR (sowing in bed with 2 irrigations approximately 250 mm)
E3: Bed-5IR (bed sowing with 5 normal irrigations)
E4: Bed-EHeat (bed sowing 30 days before optimal planting date with 5 normal irrigations approximately 500 mm)
Measurements of grain yield (YLD) were reported as the total plot yield after maturity. Records for YLD are reported as adjusted means from which trial, replicate and sub-block effects were removed. Measurements for days to heading (DTH), days to maturity (DTM), and plant height (PH) were recorded only in the first replicate at each trial and thus no phenotype adjustment was made.
2. Reflectance data.Reflectance data was collected from the fields using both infrared and hyper-spectral cameras mounted on an aircraft on 9 different dates (time-points) between January 10 and March 27th, 2014. During each flight, data from 250 wavelengths ranging from 392 to 850 nm were collected for each pixel in the pictures. The average reflectance of all the pixels for each wavelength was calculated from each of the geo-referenced trial plots and reported as each line reflectance. Data for reflectance and Green NDVI and Red NDVI are reported as adjusted phenotypes from which trial, replicate and sub-block effects were removed. Each data-point matches to each data-point in phenotypic data.
3. Marker data.Lines were sequenced for GBS at 192-plexing on Illumina HiSeq2000 or HiSeq2500 with 1 x 100 bp reads. SNPs were called across all lines anchored to the genome assembly of Chinese Spring (International Wheat Genome Sequencing Consortium 2014). Next, SNP were extracted and filtered so that lines >50% missing data were removed. Markers were recoded as –1, 0, and 1, corresponding to homozygous for the minor allele, heterozygous, and homozygous for the major allele, respectively. Next, markers with a minor allele frequency <0.05 and >15% of missing data were removed. Remaining SNPs with missing values were imputed using the mean of the observed marker genotypes at a given locus.
Adjusted un-replicated data.The SFSI
R-package includes the wheatHTP
dataset containing (un-replicated) only YLD from all environments E1,...,E4, and reflectance (latest time-point only) data from the environment E1 only. Marker data is also included in the dataset. The phenotypic and reflectance data are averages (line effects from mixed models) for 776 lines evaluated in 28 trials (with at least 26 lines each) for which marker information on 3,438 SNPs is available.
The full (replicated) data for all four environments, all traits, and all time-points can be found in the repository https://github.com/MarcooLopez/Data_for_Lopez-Cruz_et_al_2020.
Cross-validation partitions.One random partition of 4-folds was created for the 776 individuals (distributed into 28 trials). Data from 7 entire trials (25% of 28 the trials) were arbitrarily assigned to each of the 4 folds. The partition consist of an array of length 776 with indices 1, 2, 3, and 4 denoting the fold.
Genetic covariances.Multi-variate Gaussian mixed models were fitted to phenotypes. Bi-variate models were fitted to YLD with each of the 250 wavelengths from environment E1. Tetra-variate models were fitted for YLD from all environments. All models were fitted within each fold (provided partition) using scaled (null mean and unit variance) phenotypes from the remaining 3 folds as training data. Bayesian models were implemented using the 'Multitrait' function from the BGLR
R-package with 40,000 iterations discarding 5,000 runs for burning. A marker-derived relationships matrix as in VanRaden (2008) was used to model between-individuals genetic effects. Between-traits genetic covariances were assumed unstructured, while residual covariances were assumed diagonal.
Genetic covariances between YLD and each wavelength (environment E1) are storaged in a matrix of 250 rows and 4 columns (folds). Genetic and residual covariances matrices among YLD within each environment are storaged in a list with 4 elements (folds).
data(wheatHTP)
data(wheatHTP)
Y
: (matrix) phenotypic data for YLD in environments E1, E2, E3, and E4; and columns 'trial' and 'CV' (indicating the 4-folds partition).
M
: (matrix) marker data with SNPs in columns.
X_E1
: (matrix) reflectance data for time-point 9 in environment E1.
VI_E1
: (matrix) green and red NDVI for time-point 9 in environment E1.
genCOV_xy
: (matrix) genetic covariances between YLD and each reflectance trait, for each fold (in columns).
genCOV_yy
: (4-dimensional list) genetic covariances matrices for YLD among environments, for each fold.
resCOV_yy
: (4-dimensional list) residual covariances matrices for YLD among environments, for each fold.
International Maize and Wheat Improvement Center (CIMMYT), Mexico.
Perez-Rodriguez P, de los Campos G (2014). Genome-wide regression and prediction with the BGLR statistical package. Genetics, 198, 483–495.
VanRaden PM (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414–4423.