This document explains how to prepare the data and the basic usage of
the mlmm.gwas package. Use ?functionName
in R console to
get the complete documentation of a given function.
The pipeline can be divised in 3 main steps:
Only the additive model is presented in this document. The pipeline also supports additive+dominace, female+male and female+male+interaction models. For more information on how to make the matrices for these models, read Bonnafous et al. (2017).
This package includes a dataset with genotypes, phenotypes and kinship for different hybrids of sunflower.
library(mlmm.gwas)
data("mlmm.gwas.AD")
ls()
#> [1] "K.add" "K.dom" "Xa" "Xd"
#> [5] "floweringDateAD"
Phenotypes must be put in a named vector. Names are the individuals’ names and values their phenotypes.
The trait included in the dataset is the flowering date in ??C.day.
It’s a n by m matrix, where n is the number of individuals, and m the number of SNPs.
Be sure to:
Xa[1:5,1:5]
#> SNP1 SNP2 SNP3 SNP4 SNP5
#> SF173_SF326 1.3055556 0.4722222 -0.1944444 -0.1388889 -0.8611111
#> SF160_PNS1 -0.6944444 0.4722222 -0.1944444 -0.1388889 -0.8611111
#> SF031_SF302 -0.6944444 0.4722222 -0.1944444 -0.1388889 0.1388889
#> SF028_SF332 -0.6944444 0.4722222 -0.1944444 -0.1388889 0.1388889
#> SF029_SF281 -0.6944444 -0.5277778 -0.1944444 -0.1388889 1.1388889
As K is normalized in the pipeline, you don’t need to do it yourself.
So you can get it from centered X easily:
##You can get K from X with the following command line
##We are not doing it here because the X matrix included in data("mlmm.gwas.AD")
##is a subset of a larger matrix.
#K.add = Xa %*% t(Xa)
K.add[1:5,1:5]
#> SF173_SF326 SF160_PNS1 SF031_SF302 SF028_SF332 SF029_SF281
#> SF173_SF326 213719.258 -2470.409 3375.313 -3872.576 -30046.548
#> SF160_PNS1 -2470.409 266025.924 51606.647 22635.758 -29448.215
#> SF031_SF302 3375.313 51606.647 212842.369 16744.480 10089.508
#> SF028_SF332 -3872.576 22635.758 16744.480 225864.591 -9349.381
#> SF029_SF281 -30046.548 -29448.215 10089.508 -9349.381 191078.647
Individuals and markers must be the sames and in the same order in Y, X and K.
A genetic or physical map can be used for graphical representation of association (Manhattan plot).
It’s a dataframe with 3 columns:
Missing data are allowed.
There is no map in the example dataset. We will still be alble to use
the manhattan.plot
function but the markers will be ordered
by their line index in the X matrix, not by genetic or physical
position.
The function mlmm_allmodels
performs GWAS correcting for
population structure while including cofactors through a forward
regression approach.
Several markers may be associated to a single QTL. The
mlmm_allmodels
function aims to solve this problem by
selecting the closest SNP to each QTL using a forward regression
approach. The association p-values of the SNPs are estimated. Then, the
SNP with the lowest p-value is used as fixed effect (cofactor). This is
repeated, adding one SNP to the fixed effects at each step, until most
of the variance is explained or until the maximum number of iterations
is reached.
res_mlmm = mlmm_allmodels(floweringDateAD, list(Xa), list(K.add))
#> iteration LogLik wall cpu(sec) restrained
#> 1 -47.4676 6:50:13 0 0
#> 2 -43.866 6:50:13 0 0
#> 3 -43.0839 6:50:13 0 0
#> 4 -43.0056 6:50:13 0 0
#> 5 -43.0037 6:50:13 0 0
#> 6 -43.0036 6:50:13 0 0
#> null model done! pseudo-h= 0.52
#> iteration LogLik wall cpu(sec) restrained
#> 1 -35.9468 6:50:13 0 0
#> 2 -34.9284 6:50:13 0 0
#> 3 -34.6096 6:50:13 0 0
#> 4 -34.5712 6:50:13 0 0
#> 5 -34.5702 6:50:13 0 0
#> step 1 done! pseudo-h= 0.379 model: Y ~ mu + SNP303
#> iteration LogLik wall cpu(sec) restrained
#> 1 -30.707 6:50:13 0 0
#> 2 -30.1743 6:50:13 0 0
#> 3 -29.9904 6:50:13 0 0
#> 4 -29.9655 6:50:13 0 0
#> 5 -29.9648 6:50:13 0 0
#> step 2 done! pseudo-h= 0.328 model: Y ~ mu + SNP303 + SNP317
#> iteration LogLik wall cpu(sec) restrained
#> 1 -25.8333 6:50:13 0 0
#> 2 -25.8194 6:50:13 0 0
#> 3 -25.8115 6:50:13 0 0
#> 4 -25.8089 6:50:13 0 0
#> 5 -25.8085 6:50:13 0 0
#> step 3 done! pseudo-h= 0.201 model: Y ~ mu + SNP303 + SNP317 + SNP407
#> iteration LogLik wall cpu(sec) restrained
#> 1 -21.9588 6:50:13 0 0
#> 2 -21.8936 6:50:13 0 0
#> 3 -21.8537 6:50:13 0 0
#> 4 -21.8384 6:50:13 0 0
#> 5 -21.8352 6:50:13 0 0
#> 6 -21.8345 6:50:13 0 0
#> step 4 done! pseudo-h= 0.086 model: Y ~ mu + SNP303 + SNP317 + SNP407 + SNP10
#> iteration LogLik wall cpu(sec) restrained
#> 1 -18.8393 6:50:13 0 0
#> 2 -18.6398 6:50:13 0 0
#> 3 -18.5399 6:50:13 0 0
#> 4 -18.5154 6:50:13 0 0
#> 5 -18.5133 6:50:13 0 0
#> 6 -18.5131 6:50:13 0 0
#> step 5 done! pseudo-h= 0.048 model: Y ~ mu + SNP303 + SNP317 + SNP407 + SNP10 + SNP153
#> iteration LogLik wall cpu(sec) restrained
#> 1 -14.9796 6:50:13 0 0
#> 2 -14.1825 6:50:13 0 1
#> 3 -13.7957 6:50:13 0 1
#> 4 -13.7957 6:50:13 0 1
#> step 6 done! pseudo-h= 0 model: Y ~ mu + SNP303 + SNP317 + SNP407 + SNP10 + SNP153 + SNP375
mlmm_allmodels
returns a list with one element per step.
Each element is a vector containing the association p-values of each
marker.
These p-values can be visualized as a Manhattan plot:
It is necessary to select a model using a criterion to avoid overfitting. The criterion used here is the eBIC. The model with the lowest eBIC is selected.
sel_XX = frommlmm_toebic(list(Xa), res_mlmm)
res_eBIC = eBIC_allmodels(floweringDateAD, sel_XX, list(K.add), ncol(Xa))
res_eBIC
#> BIC ajout eBIC_0.5 LogL
#> mu 301.9261 0.000000 301.9261 -143.9123
#> SNP303 289.7766 6.214608 295.9912 -135.4873
#> SNP317 284.9296 11.734067 296.6637 -130.7136
#> SNP407 280.5870 16.846055 297.4330 -126.1920
#> SNP10 276.0085 21.668350 297.6768 -121.5525
#> SNP153 272.9082 26.265488 299.1737 -117.6522
#> SNP375 266.7063 30.678287 297.3846 -112.2010
In this example, the model with 1 fixed effect is selected.
The selected SNPs effects can be estimated with the function
Estimation_allmodels
sel_XXclass = fromeBICtoEstimation(sel_XX, res_eBIC, res_threshold)
effects = Estimation_allmodels(floweringDateAD, sel_XXclass, list(K.add))
effects
#> BLUE Tukey.Class Frequency
#> mu 1148.92349 <NA> NA
#> SNP303_00 66.52243 a 0.1090909
#> SNP303_01|10 15.96944 b 0.4818182
#> SNP303_11 0.00000 c 0.4090909
You can visualize the distributions of the phenotypes of the
individuals according to their allelic class for a given SNP using the
genotypes.boxplot
function:
The colored symbols represent the Tukey’s classes. Different symbols mean that the means are significantly different.
For convenience, all of the example command lines are compiled below:
library(mlmm.gwas)
data("mlmm.gwas.AD")
res_mlmm = mlmm_allmodels(floweringDateAD, list(Xa), list(K.add))
manhattan.plot( res_mlmm )
sel_XX = frommlmm_toebic(list(Xa), res_mlmm)
res_eBIC = eBIC_allmodels(floweringDateAD, sel_XX, list(K.add), ncol(Xa))
res_threshold <- threshold_allmodels(threshold=NULL, res_mlmm)
sel_XXclass = fromeBICtoEstimation(sel_XX, res_eBIC, res_threshold)
effects = Estimation_allmodels(floweringDateAD, sel_XXclass, list(K.add))
genotypes.boxplot(Xa, floweringDateAD, "SNP303", effects)
You can also run the entire pipeline (without the figures plots) with
the function run_entire_gwas_pipeline
. This function runs
internally the functions mlmm_allmodels
,
frommlmm_toebic
, eBIC_allmodels
,
threshold_allmodels
, fromeBICtoEstimation
and
Estimation_allmodels
and returns a list with the results of
the mlmm, eBIC and effects estimation steps.
results = run_entire_gwas_pipeline(floweringDateAD, list(Xa), list(K.add), threshold=NULL)
#> iteration LogLik wall cpu(sec) restrained
#> 1 -47.4676 6:50:16 0 0
#> 2 -43.866 6:50:16 0 0
#> 3 -43.0839 6:50:16 0 0
#> 4 -43.0056 6:50:16 0 0
#> 5 -43.0037 6:50:16 0 0
#> 6 -43.0036 6:50:16 0 0
#> null model done! pseudo-h= 0.52
#> iteration LogLik wall cpu(sec) restrained
#> 1 -35.9468 6:50:16 0 0
#> 2 -34.9284 6:50:16 0 0
#> 3 -34.6096 6:50:16 0 0
#> 4 -34.5712 6:50:16 0 0
#> 5 -34.5702 6:50:16 0 0
#> step 1 done! pseudo-h= 0.379 model: Y ~ mu + SNP303
#> iteration LogLik wall cpu(sec) restrained
#> 1 -30.707 6:50:16 0 0
#> 2 -30.1743 6:50:16 0 0
#> 3 -29.9904 6:50:16 0 0
#> 4 -29.9655 6:50:16 0 0
#> 5 -29.9648 6:50:16 0 0
#> step 2 done! pseudo-h= 0.328 model: Y ~ mu + SNP303 + SNP317
#> iteration LogLik wall cpu(sec) restrained
#> 1 -25.8333 6:50:16 0 0
#> 2 -25.8194 6:50:16 0 0
#> 3 -25.8115 6:50:16 0 0
#> 4 -25.8089 6:50:16 0 0
#> 5 -25.8085 6:50:16 0 0
#> step 3 done! pseudo-h= 0.201 model: Y ~ mu + SNP303 + SNP317 + SNP407
#> iteration LogLik wall cpu(sec) restrained
#> 1 -21.9588 6:50:16 0 0
#> 2 -21.8936 6:50:16 0 0
#> 3 -21.8537 6:50:16 0 0
#> 4 -21.8384 6:50:16 0 0
#> 5 -21.8352 6:50:16 0 0
#> 6 -21.8345 6:50:16 0 0
#> step 4 done! pseudo-h= 0.086 model: Y ~ mu + SNP303 + SNP317 + SNP407 + SNP10
#> iteration LogLik wall cpu(sec) restrained
#> 1 -18.8393 6:50:16 0 0
#> 2 -18.6398 6:50:16 0 0
#> 3 -18.5399 6:50:16 0 0
#> 4 -18.5154 6:50:16 0 0
#> 5 -18.5133 6:50:16 0 0
#> 6 -18.5131 6:50:16 0 0
#> step 5 done! pseudo-h= 0.048 model: Y ~ mu + SNP303 + SNP317 + SNP407 + SNP10 + SNP153
#> iteration LogLik wall cpu(sec) restrained
#> 1 -14.9796 6:50:16 0 0
#> 2 -14.1825 6:50:16 0 1
#> 3 -13.7957 6:50:16 0 1
#> 4 -13.7957 6:50:16 0 1
#> step 6 done! pseudo-h= 0 model: Y ~ mu + SNP303 + SNP317 + SNP407 + SNP10 + SNP153 + SNP375
#> [1] "No p-value below the threshold for this trait"
names(results)
#> [1] "pval" "eBic" "threshold" "effects"