--- title: "Inferring Linkage Disequilibrium blocks from genotypes" author: "Shubham Chaturvedi, Pierre Neuvial, Nathalie Vialaneix" date: "`r Sys.Date()`" output: html_document: toc: yes vignette: > %\VignetteIndexEntry{Inferring Linkage Disequilibrium blocks from genotypes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r skipNoSNPSTATS} # IMPORTANT: this vignette is not created if snpStats is not installed if (!require("snpStats")) { knitr::opts_chunk$set(eval = FALSE) } ``` # Introduction In this vignette we demonstrate the use of `snpClust` function in the `adjclust` package. `snpClust` performs adjacency-constrained hierarchical clustering of single nucleotide polymorphisms (SNPs), where the similarity between SNPs is defined by linkage disequilibrium (LD). This function implements the algorithm described in [1]. It is an extension of the algorithm described in [3,4]. Denoting by $p$ the number of SNPs to cluster and assuming that the similarity between SNPs whose indices are more distant than $h$, its time complexity is $O(p (\log(p) + h))$, and its space complexity is $O(hp)$. ```{r loadLib, message=FALSE} library("adjclust") ``` # Loading and displaying genotype data The beginning of this vignette closely follows the "LD vignette" of the SnpStats package [2]. First, we load genotype data. ```{r loadData, results="hide", message=FALSE} data("ld.example", package = "snpStats") ``` We focus on the `ceph.1mb` data. ```{r preData} geno <- ceph.1mb[, -316] ## drop one SNP leading to one missing LD value p <- ncol(geno) nSamples <- nrow(geno) geno ``` These data are drawn from the International HapMap Project and concern 602 SNPs[^1] over a 1Mb region of chromosome 22 in sample of 90 Europeans. We can compute and display the LD between these SNPs. [^1]: We have dropped SNP rs2401075 because it produced a missing value due to the lack of genetic diversity in the considered sample. ```{r LD} ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = p-1) image(ld.ceph, lwd = 0) ``` # Adjacency-constrained Hierarchical Agglomerative Clustering The `snpClust` function can handle genotype data as an input: ```{r snpClust} fit <- snpClust(geno, stats = "R.squared") ``` Note that due to numerical errors in the LD estimation, some of the estimated LD values may be slightly larger than 1. These values are rounded to 1 internally. The above figure suggests that the LD signal is concentrated close to the diagonal. We can focus on a diagonal band with the bandwidth parameter `h`: ```{r snpClust-sparse} fitH <- snpClust(geno, h = 100, stats = "R.squared") fitH ``` # Output The output of the `snpClust` is of class `chac`. In particular, it can be plotted as a dendrogram silently relying on the function `plot.dendrogram`: ```{r dendro} plot(fitH, type = "rectangle", leaflab = "perpendicular") ``` Moreover, the output contains an element named `merge` which describes the successive merges of the clustering, and an element `gains` which gives the improvement in the criterion optimized by the clustering at each successive merge. ```{r objectDesc} head(cbind(fitH$merge, fitH$gains)) ``` # Other types of input In this section we show how the `snpClust` function can also be applied directly to LD values. ```{r snpClust-LD} h <- 100 ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = h, symmetric = TRUE) image(ld.ceph, lwd = 0) ``` Note that we have forced the `snpStats::ld` function to return a symmetric matrix. We can apply `snpClust` directly to this LD matrix (of class `Matrix::dsCMatrix`): ```{r snpClust-sMatrix} fitL <- snpClust(ld.ceph, h) ``` `snpClust` also handles inputs of class `base::matrix`: ```{r snpClust-matrix, warning=FALSE} gmat <- as(geno, "matrix") fitM <- snpClust(geno, h, stats = "R.squared") ``` # References [1] Ambroise C., Dehman A., Neuvial P., Rigaill G., and Vialaneix N. (2019). Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics. *Algorithms for Molecular Biology*, **14**, 22. [2] Clayton D. (2015). snpStats: SnpMatrix and XSnpMatrix classes and methods. R package version 1.20.0 [3] Dehman A., Ambroise C., Neuvial P. (2015). Performance of a blockwise approach in variable selection using linkage disequilibrium information. *BMC Bioinformatics*, **16**, 148. [4] Randriamihamison N., Vialaneix N., and Neuvial P. (2021). Applicability and interpretability of Ward's hierarchical agglomerative clustering with or without contiguity constraints. *Journal of Classification*, **38**, 363–389. # Session information ```{r session} sessionInfo() ```