---
title: "Inferring Linkage Disequilibrium blocks from genotypes"
author: "Shubham Chaturvedi, Pierre Neuvial, Nathalie Vialaneix"
date: "`r Sys.Date()`"
output:
html_document:
toc: yes
vignette: >
%\VignetteIndexEntry{Inferring Linkage Disequilibrium blocks from genotypes}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r skipNoSNPSTATS}
# IMPORTANT: this vignette is not created if snpStats is not installed
if (!require("snpStats")) {
knitr::opts_chunk$set(eval = FALSE)
}
```
# Introduction
In this vignette we demonstrate the use of `snpClust` function in the `adjclust`
package. `snpClust` performs adjacency-constrained hierarchical clustering of
single nucleotide polymorphisms (SNPs), where the similarity between SNPs is
defined by linkage disequilibrium (LD).
This function implements the algorithm described in [1]. It is an extension of
the algorithm described in [3,4]. Denoting by $p$ the number of SNPs to cluster
and assuming that the similarity between SNPs whose indices are more distant
than $h$, its time complexity is $O(p (\log(p) + h))$, and its space complexity
is $O(hp)$.
```{r loadLib, message=FALSE}
library("adjclust")
```
# Loading and displaying genotype data
The beginning of this vignette closely follows the "LD vignette" of the SnpStats
package [2]. First, we load genotype data.
```{r loadData, results="hide", message=FALSE}
data("ld.example", package = "snpStats")
```
We focus on the `ceph.1mb` data.
```{r preData}
geno <- ceph.1mb[, -316] ## drop one SNP leading to one missing LD value
p <- ncol(geno)
nSamples <- nrow(geno)
geno
```
These data are drawn from the International HapMap Project and concern 602
SNPs[^1] over a 1Mb region of chromosome 22 in sample of 90 Europeans.
We can compute and display the LD between these SNPs.
[^1]: We have dropped SNP rs2401075 because it produced a missing value due to
the lack of genetic diversity in the considered sample.
```{r LD}
ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = p-1)
image(ld.ceph, lwd = 0)
```
# Adjacency-constrained Hierarchical Agglomerative Clustering
The `snpClust` function can handle genotype data as an input:
```{r snpClust}
fit <- snpClust(geno, stats = "R.squared")
```
Note that due to numerical errors in the LD estimation, some of the estimated LD values may be slightly larger than 1. These values are rounded to 1 internally.
The above figure suggests that the LD signal is concentrated close to the
diagonal. We can focus on a diagonal band with the bandwidth parameter `h`:
```{r snpClust-sparse}
fitH <- snpClust(geno, h = 100, stats = "R.squared")
fitH
```
# Output
The output of the `snpClust` is of class `chac`. In particular, it can be
plotted as a dendrogram silently relying on the function `plot.dendrogram`:
```{r dendro}
plot(fitH, type = "rectangle", leaflab = "perpendicular")
```
Moreover, the output contains an element named `merge` which describes the
successive merges of the clustering, and an element `gains` which gives the
improvement in the criterion optimized by the clustering at each successive
merge.
```{r objectDesc}
head(cbind(fitH$merge, fitH$gains))
```
# Other types of input
In this section we show how the `snpClust` function can also be applied directly
to LD values.
```{r snpClust-LD}
h <- 100
ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = h, symmetric = TRUE)
image(ld.ceph, lwd = 0)
```
Note that we have forced the `snpStats::ld` function to return a symmetric matrix.
We can apply `snpClust` directly to this LD matrix (of class `Matrix::dsCMatrix`):
```{r snpClust-sMatrix}
fitL <- snpClust(ld.ceph, h)
```
`snpClust` also handles inputs of class `base::matrix`:
```{r snpClust-matrix, warning=FALSE}
gmat <- as(geno, "matrix")
fitM <- snpClust(geno, h, stats = "R.squared")
```
# References
[1] Ambroise C., Dehman A., Neuvial P., Rigaill G., and Vialaneix N. (2019).
Adjacency-constrained hierarchical clustering of a band similarity matrix with
application to genomics. *Algorithms for Molecular Biology*, **14**, 22.
[2] Clayton D. (2015). snpStats: SnpMatrix and XSnpMatrix classes and methods.
R package version 1.20.0
[3] Dehman A., Ambroise C., Neuvial P. (2015). Performance of a blockwise
approach in variable selection using linkage disequilibrium information. *BMC
Bioinformatics*, **16**, 148.
[4] Randriamihamison N., Vialaneix N., and Neuvial P. (2021). Applicability and
interpretability of Ward's hierarchical agglomerative clustering with or without
contiguity constraints. *Journal of Classification*, **38**, 363–389.
# Session information
```{r session}
sessionInfo()
```