Title: | Compute a distance metric between two partitions of a set |
---|---|
Description: | partitionMetric computes a distance between two partitions of a set. |
Authors: | David Weisman, Dan Simovici |
Maintainer: | David Weisman <[email protected]> |
License: | BSD_2_clause + file LICENSE |
Version: | 1.1 |
Built: | 2024-12-08 06:58:08 UTC |
Source: | CRAN |
This small dataset contains aligned protein sequences for seven alleles of the aryl hydrocarbon receptor (AhR).
data(AhRs)
The format is a character matrix in which column i represents
the i'th position in the alignment and contains an amino
acid code or "-" indicating an indel. Row names contain the
animal species.
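The layout described above can be pictured with a small hand-built matrix. This is only a sketch: the object `toy`, the row names, and the residues are invented for illustration and are not taken from the AhRs dataset.

```r
# Hypothetical toy alignment: 3 sequences, 6 aligned positions.
# Row names and residues are made up; they are NOT part of AhRs.
toy <- rbind(
  speciesA = c("M", "S", "-", "G", "A", "N"),
  speciesB = c("M", "S", "S", "G", "A", "N"),
  speciesC = c("M", "T", "-", "G", "-", "N")
)

dim(toy)        # rows = sequences, columns = alignment positions
rownames(toy)   # one label per sequence
toy[, 2]        # residues found at alignment position 2

# Count indel positions ("-") in each sequence
gaps <- rowSums(toy == "-")
gaps
```

The same indexing idioms (`dim`, `rownames`, column slices) should apply to any character matrix with this layout.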
A DNA or protein sequence of length n has an associated index set
{1, ..., n} that labels the
positions of the nucleotides or amino acids (AA).
This index set can be partitioned such that all members referring to
the same AA share a homogeneous partition.
For example, given the sequence
ATGTA
and its index
set {1, 2, 3, 4, 5}, the "A" partition
contains the subset
{1, 5}, the "T" partition contains
{2, 4}, and so on.
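The ATGTA example can be reproduced in base R. This is a minimal sketch using `strsplit` and `split`; the variable names are illustrative and nothing here comes from the package itself.

```r
# Split the sequence into single characters
s <- strsplit("ATGTA", "")[[1]]

# The index set labels the positions 1..n
index_set <- seq_along(s)

# Group positions by the symbol found there: one subset per symbol
blocks <- split(index_set, s)
blocks
```

Here `blocks$A` holds positions 1 and 5, `blocks$T` holds 2 and 4, and `blocks$G` holds 3.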
Given two aligned sequences and their respective partitions of the
index set, a metric distance between these partitions can be computed.
See partitionMetric for such a metric, along with an example
of clustering this AhR dataset.
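To see how two partitions of the same index set relate, it can help to cross-tabulate two aligned sequences. The sketch below uses base R's `table`; the second sequence is a hypothetical variant invented for illustration. It shows only the raw overlap between subsets and is not the computation performed by partitionMetric.

```r
seq1 <- strsplit("ATGTA", "")[[1]]
seq2 <- strsplit("ATGAA", "")[[1]]  # hypothetical variant: position 4 is A instead of T

# Each cell counts the positions shared by one subset of seq1's
# partition (rows) and one subset of seq2's partition (columns).
overlap <- table(seq1, seq2)
overlap

# Positions at which the two aligned sequences disagree
differing <- which(seq1 != seq2)
differing
```

Off-diagonal counts, such as the single position in the cell for row "T" and column "A", mark where the two partitions group positions differently.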
This dataset was derived from NCBI HomoloGene:1224.
Mark Hahn, Aryl hydrocarbon receptors: diversity and evolution. Chem Biol Interact, 2002, 141, 131-160
Given a set partitioned in two ways, compute a distance metric between the partitions.
partitionMetric(B, C, beta = 2)
B |
B and C are vectors that represent partitions of a single set, with
each element representing a member of the set. See examples below for more information. |
C |
See B above. |
beta |
|
The return value is a nonnegative real number representing the distance between the two partitions of the set. Full details are in the paper referenced below.
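Because B and C describe a partition only through which elements share a label, two vectors with different labels can encode the same partition. The helper `canonical_labels` below is hypothetical (it is not part of the package) and is only a sketch of how one might check this in base R, by relabeling each element with the order in which its label first appears.

```r
# Hypothetical helper, not provided by the package:
# relabel each element by the rank of first appearance of its label.
canonical_labels <- function(x) match(x, unique(x))

canonical_labels(c(1, 8, 8))   # 1 2 2
canonical_labels(c(7, 3, 3))   # 1 2 2

# Same grouping of members, so the two vectors encode the same partition
same <- identical(canonical_labels(c(1, 8, 8)),
                  canonical_labels(c(7, 3, 3)))

# Different groupings give different canonical labels
gender <- c('boy', 'girl', 'girl', 'boy')
height <- c('short', 'tall', 'medium', 'tall')
differ <- !identical(canonical_labels(gender), canonical_labels(height))
```

Both `same` and `differ` evaluate to `TRUE`: the first pair groups the members identically, while `gender` and `height` do not.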
David Weisman, Dan Simovici
David Weisman and Dan Simovici, Several Remarks on the Metric Space of Genetic Codes. International Journal of Data Mining and Bioinformatics, 2012(6).
## Define several partitions of a 4-element set
gender <- c('boy', 'girl', 'girl', 'boy')
height <- c('short', 'tall', 'medium', 'tall')
age <- c(7, 6, 5, 4)

## Compute some distances
(dGG <- partitionMetric (gender, gender))
(dGH <- partitionMetric (gender, height))
(dHG <- partitionMetric (height, gender))
(dGA <- partitionMetric (gender, age))
(dHA <- partitionMetric (height, age))

## These properties must hold for any metric
dGG == 0
dGH == dHG
dGA <= dGH + dHA

## Note that the partition names are irrelevant, and only need to be
## self-consistent within each B and C. It follows that these two set
## partitions are identical and have distance 0.
partitionMetric (c(1,8,8), c(7,3,3)) == 0

## Use the set partition to measure amino acid sequence differences
## between several alleles of the aryl hydrocarbon receptor.
data(AhRs)
dim(AhRs)
AhRs[,1:10]

## Fill a symmetric matrix of pairwise distances between alleles
distanceMatrix <- matrix(nrow=nrow(AhRs), ncol=nrow(AhRs), 0,
                         dimnames=list(rownames(AhRs), rownames(AhRs)))
for (pair in combn(rownames(AhRs), 2, simplify=FALSE)) {
  d <- partitionMetric (AhRs[pair[1],], AhRs[pair[2],], beta=1.01)
  distanceMatrix[pair[1],pair[2]] <- distanceMatrix[pair[2],pair[1]] <- d
}

## Cluster the alleles and report how well the tree preserves the distances
hc <- hclust(as.dist(distanceMatrix))
plot(hc, sub=sprintf('Cophenetic correlation between distances and tree is %0.2f',
                     cor(as.dist(distanceMatrix), cophenetic(hc))))