GWAS pipeline manual

This document explains how to prepare the data and the basic usage of the mlmm.gwas package. Use ?functionName in R console to get the complete documentation of a given function.

The pipeline can be divised in 3 main steps:

  • The MLMM where GWAS is carried correcting for population structure while including cofactors through a forward regression approach.
  • Model selection where the model with the lowest eBIC is selected among the steps of the forward regression
  • Effects estimation in the selected model

Only the additive model is presented in this document. The pipeline also supports additive+dominace, female+male and female+male+interaction models. For more information on how to make the matrices for these models, read Bonnafous et al. (2017).

Data

This package includes a dataset with genotypes, phenotypes and kinship for different hybrids of sunflower.

    library(mlmm.gwas)
    data("mlmm.gwas.AD")
    ls()
#> [1] "K.add"           "K.dom"           "Xa"              "Xd"             
#> [5] "floweringDateAD"

Phenotypes (Y)

Phenotypes must be put in a named vector. Names are the individuals’ names and values their phenotypes.

The trait included in the dataset is the flowering date in ??C.day.

    head(floweringDateAD)
#> SF173_SF326  SF160_PNS1 SF031_SF302 SF028_SF332 SF029_SF281  RA22_SF323 
#>    1187.500    1200.520    1152.476    1130.149    1254.520    1215.293

Genotypes matrix (X)

It’s a n by m matrix, where n is the number of individuals, and m the number of SNPs.

Be sure to:

  • impute missing genotypes (missing genotypes are not allowed).
  • remove markers with low Minimum Allelic Frequency (MAF).
  • remove redondant markers.
  • center genotypes by markers.
    Xa[1:5,1:5]
#>                   SNP1       SNP2       SNP3       SNP4       SNP5
#> SF173_SF326  1.3055556  0.4722222 -0.1944444 -0.1388889 -0.8611111
#> SF160_PNS1  -0.6944444  0.4722222 -0.1944444 -0.1388889 -0.8611111
#> SF031_SF302 -0.6944444  0.4722222 -0.1944444 -0.1388889  0.1388889
#> SF028_SF332 -0.6944444  0.4722222 -0.1944444 -0.1388889  0.1388889
#> SF029_SF281 -0.6944444 -0.5277778 -0.1944444 -0.1388889  1.1388889

“Kinship” matrix (K)

As K is normalized in the pipeline, you don’t need to do it yourself.

So you can get it from centered X easily:

    ##You can get K from X with the following command line
    ##We are not doing it here because the X matrix included in data("mlmm.gwas.AD")
    ##is a subset of a larger matrix.
    #K.add = Xa %*% t(Xa)

    K.add[1:5,1:5]
#>             SF173_SF326 SF160_PNS1 SF031_SF302 SF028_SF332 SF029_SF281
#> SF173_SF326  213719.258  -2470.409    3375.313   -3872.576  -30046.548
#> SF160_PNS1    -2470.409 266025.924   51606.647   22635.758  -29448.215
#> SF031_SF302    3375.313  51606.647  212842.369   16744.480   10089.508
#> SF028_SF332   -3872.576  22635.758   16744.480  225864.591   -9349.381
#> SF029_SF281  -30046.548 -29448.215   10089.508   -9349.381  191078.647

Y, X and K harmonization

Individuals and markers must be the sames and in the same order in Y, X and K.

Genetic or physical map

A genetic or physical map can be used for graphical representation of association (Manhattan plot).

It’s a dataframe with 3 columns:

  • Markers names
  • Chromosomes or scaffolds names
  • Positions (genetic or physical, any unit is allowed)

Missing data are allowed.

There is no map in the example dataset. We will still be alble to use the manhattan.plot function but the markers will be ordered by their line index in the X matrix, not by genetic or physical position.

Association detection with the mlmm_allmodels function

The function mlmm_allmodels performs GWAS correcting for population structure while including cofactors through a forward regression approach.

Several markers may be associated to a single QTL. The mlmm_allmodels function aims to solve this problem by selecting the closest SNP to each QTL using a forward regression approach. The association p-values of the SNPs are estimated. Then, the SNP with the lowest p-value is used as fixed effect (cofactor). This is repeated, adding one SNP to the fixed effects at each step, until most of the variance is explained or until the maximum number of iterations is reached.

    res_mlmm = mlmm_allmodels(floweringDateAD, list(Xa), list(K.add))
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -47.4676   6:50:13      0           0
#>     2      -43.866   6:50:13      0           0
#>     3      -43.0839   6:50:13      0           0
#>     4      -43.0056   6:50:13      0           0
#>     5      -43.0037   6:50:13      0           0
#>     6      -43.0036   6:50:13      0           0
#> null model done! pseudo-h= 0.52 
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -35.9468   6:50:13      0           0
#>     2      -34.9284   6:50:13      0           0
#>     3      -34.6096   6:50:13      0           0
#>     4      -34.5712   6:50:13      0           0
#>     5      -34.5702   6:50:13      0           0
#> step  1  done! pseudo-h= 0.379 model: Y ~ mu + SNP303 
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -30.707   6:50:13      0           0
#>     2      -30.1743   6:50:13      0           0
#>     3      -29.9904   6:50:13      0           0
#>     4      -29.9655   6:50:13      0           0
#>     5      -29.9648   6:50:13      0           0
#> step  2  done! pseudo-h= 0.328 model: Y ~ mu + SNP303 + SNP317 
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -25.8333   6:50:13      0           0
#>     2      -25.8194   6:50:13      0           0
#>     3      -25.8115   6:50:13      0           0
#>     4      -25.8089   6:50:13      0           0
#>     5      -25.8085   6:50:13      0           0
#> step  3  done! pseudo-h= 0.201 model: Y ~ mu + SNP303 + SNP317 + SNP407 
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -21.9588   6:50:13      0           0
#>     2      -21.8936   6:50:13      0           0
#>     3      -21.8537   6:50:13      0           0
#>     4      -21.8384   6:50:13      0           0
#>     5      -21.8352   6:50:13      0           0
#>     6      -21.8345   6:50:13      0           0
#> step  4  done! pseudo-h= 0.086 model: Y ~ mu + SNP303 + SNP317 + SNP407 + SNP10 
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -18.8393   6:50:13      0           0
#>     2      -18.6398   6:50:13      0           0
#>     3      -18.5399   6:50:13      0           0
#>     4      -18.5154   6:50:13      0           0
#>     5      -18.5133   6:50:13      0           0
#>     6      -18.5131   6:50:13      0           0
#> step  5  done! pseudo-h= 0.048 model: Y ~ mu + SNP303 + SNP317 + SNP407 + SNP10 + SNP153 
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -14.9796   6:50:13      0           0
#>     2      -14.1825   6:50:13      0           1
#>     3      -13.7957   6:50:13      0           1
#>     4      -13.7957   6:50:13      0           1
#> step  6  done! pseudo-h= 0 model: Y ~ mu + SNP303 + SNP317 + SNP407 + SNP10 + SNP153 + SNP375

mlmm_allmodels returns a list with one element per step. Each element is a vector containing the association p-values of each marker.

These p-values can be visualized as a Manhattan plot:

    manhattan.plot( res_mlmm )

Model selection

It is necessary to select a model using a criterion to avoid overfitting. The criterion used here is the eBIC. The model with the lowest eBIC is selected.

    sel_XX = frommlmm_toebic(list(Xa), res_mlmm)
    res_eBIC = eBIC_allmodels(floweringDateAD, sel_XX, list(K.add), ncol(Xa))
    
    res_eBIC
#>             BIC     ajout eBIC_0.5      LogL
#> mu     301.9261  0.000000 301.9261 -143.9123
#> SNP303 289.7766  6.214608 295.9912 -135.4873
#> SNP317 284.9296 11.734067 296.6637 -130.7136
#> SNP407 280.5870 16.846055 297.4330 -126.1920
#> SNP10  276.0085 21.668350 297.6768 -121.5525
#> SNP153 272.9082 26.265488 299.1737 -117.6522
#> SNP375 266.7063 30.678287 297.3846 -112.2010

In this example, the model with 1 fixed effect is selected.

    res_threshold <- threshold_allmodels(threshold=NULL, res_mlmm)
#> [1] "No p-value below the threshold for this trait"

Estimation of the selected SNPs effects

The selected SNPs effects can be estimated with the function Estimation_allmodels

    sel_XXclass = fromeBICtoEstimation(sel_XX, res_eBIC, res_threshold)
    effects = Estimation_allmodels(floweringDateAD, sel_XXclass, list(K.add))
    
    effects
#>                    BLUE Tukey.Class Frequency
#> mu           1148.92349        <NA>        NA
#> SNP303_00      66.52243           a 0.1090909
#> SNP303_01|10   15.96944           b 0.4818182
#> SNP303_11       0.00000           c 0.4090909

You can visualize the distributions of the phenotypes of the individuals according to their allelic class for a given SNP using the genotypes.boxplot function:

    genotypes.boxplot(Xa, floweringDateAD, "SNP303", effects)

The colored symbols represent the Tukey’s classes. Different symbols mean that the means are significantly different.

Complete example script

For convenience, all of the example command lines are compiled below:

    library(mlmm.gwas)
    data("mlmm.gwas.AD")
    
    res_mlmm = mlmm_allmodels(floweringDateAD, list(Xa), list(K.add))
    
    manhattan.plot( res_mlmm )
    
    sel_XX = frommlmm_toebic(list(Xa), res_mlmm)
    res_eBIC = eBIC_allmodels(floweringDateAD, sel_XX, list(K.add), ncol(Xa))
    
    res_threshold <- threshold_allmodels(threshold=NULL, res_mlmm)
    
    sel_XXclass = fromeBICtoEstimation(sel_XX, res_eBIC, res_threshold)
    effects = Estimation_allmodels(floweringDateAD, sel_XXclass, list(K.add))
    
    genotypes.boxplot(Xa, floweringDateAD, "SNP303", effects)

You can also run the entire pipeline (without the figures plots) with the function run_entire_gwas_pipeline. This function runs internally the functions mlmm_allmodels, frommlmm_toebic, eBIC_allmodels, threshold_allmodels, fromeBICtoEstimation and Estimation_allmodels and returns a list with the results of the mlmm, eBIC and effects estimation steps.

    results = run_entire_gwas_pipeline(floweringDateAD, list(Xa), list(K.add), threshold=NULL)
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -47.4676   6:50:16      0           0
#>     2      -43.866   6:50:16      0           0
#>     3      -43.0839   6:50:16      0           0
#>     4      -43.0056   6:50:16      0           0
#>     5      -43.0037   6:50:16      0           0
#>     6      -43.0036   6:50:16      0           0
#> null model done! pseudo-h= 0.52 
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -35.9468   6:50:16      0           0
#>     2      -34.9284   6:50:16      0           0
#>     3      -34.6096   6:50:16      0           0
#>     4      -34.5712   6:50:16      0           0
#>     5      -34.5702   6:50:16      0           0
#> step  1  done! pseudo-h= 0.379 model: Y ~ mu + SNP303 
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -30.707   6:50:16      0           0
#>     2      -30.1743   6:50:16      0           0
#>     3      -29.9904   6:50:16      0           0
#>     4      -29.9655   6:50:16      0           0
#>     5      -29.9648   6:50:16      0           0
#> step  2  done! pseudo-h= 0.328 model: Y ~ mu + SNP303 + SNP317 
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -25.8333   6:50:16      0           0
#>     2      -25.8194   6:50:16      0           0
#>     3      -25.8115   6:50:16      0           0
#>     4      -25.8089   6:50:16      0           0
#>     5      -25.8085   6:50:16      0           0
#> step  3  done! pseudo-h= 0.201 model: Y ~ mu + SNP303 + SNP317 + SNP407 
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -21.9588   6:50:16      0           0
#>     2      -21.8936   6:50:16      0           0
#>     3      -21.8537   6:50:16      0           0
#>     4      -21.8384   6:50:16      0           0
#>     5      -21.8352   6:50:16      0           0
#>     6      -21.8345   6:50:16      0           0
#> step  4  done! pseudo-h= 0.086 model: Y ~ mu + SNP303 + SNP317 + SNP407 + SNP10 
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -18.8393   6:50:16      0           0
#>     2      -18.6398   6:50:16      0           0
#>     3      -18.5399   6:50:16      0           0
#>     4      -18.5154   6:50:16      0           0
#>     5      -18.5133   6:50:16      0           0
#>     6      -18.5131   6:50:16      0           0
#> step  5  done! pseudo-h= 0.048 model: Y ~ mu + SNP303 + SNP317 + SNP407 + SNP10 + SNP153 
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -14.9796   6:50:16      0           0
#>     2      -14.1825   6:50:16      0           1
#>     3      -13.7957   6:50:16      0           1
#>     4      -13.7957   6:50:16      0           1
#> step  6  done! pseudo-h= 0 model: Y ~ mu + SNP303 + SNP317 + SNP407 + SNP10 + SNP153 + SNP375 
#> [1] "No p-value below the threshold for this trait"
    names(results)
#> [1] "pval"      "eBic"      "threshold" "effects"