Title: | A Data-Driven Similarity Kernel on Probability Spaces |
---|---|
Description: | We present a rank-based Mercer kernel to compute a pair-wise similarity metric corresponding to an informative representation of the data. We tailor the development of the kernel to encode our prior knowledge about the data distribution over a probability space. The philosophical concept behind our construction is that objects whose feature values fall at the extremes of that feature’s probability mass distribution are more similar to each other than objects whose feature values lie closer to the mean. Semblance emphasizes features whose values lie far away from the mean of their probability distribution. The kernel relies on properties empirically determined from the data and does not assume an underlying distribution. The use of feature ranks on a probability space ensures that Semblance is computationally efficient, robust to outliers, and statistically stable, thus making it a widely applicable algorithm for pattern analysis. The output from the kernel is a square, symmetric matrix that gives proximity values between pairs of observations. |
Authors: | Divyansh Agarwal <[email protected]> Nancy R. Zhang <[email protected]> |
Maintainer: | Divyansh Agarwal <[email protected]> |
License: | GPL-2 |
Version: | 1.1.0 |
Built: | 2024-11-17 06:49:26 UTC |
Source: | CRAN |
Compute semblance when there is only one feature, given as a vector x.
computeSemblanceOneFeature(x)
x | a vector of observations for which a given feature has been measured or estimated |
a Semblance metric for only one feature measured for several observations
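The package description attributes Semblance's robustness to outliers to its use of feature ranks. The sketch below illustrates only that rank-transformation idea in base R; it is not the Semblance computation itself, and `empirical_cdf_rank` is a hypothetical helper that is not part of the package.

```r
# Hypothetical helper (not part of the package): map a feature vector to
# empirical-CDF values in (0, 1], using average ranks for ties.
empirical_cdf_rank <- function(x) {
  rank(x, ties.method = "average") / length(x)
}

x <- c(2.1, 3.4, 1.0, 5.6, 4.2)
x_outlier <- replace(x, 4, 5600)  # make the largest value far more extreme

r <- empirical_cdf_rank(x)
r_outlier <- empirical_cdf_rank(x_outlier)

# The ordering of the observations is unchanged, so the ranks are identical
identical(r, r_outlier)
```

Because only the ordering of the values matters, any quantity built from these ranks is unaffected by how extreme the outlier is.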
Compute semblance when there is only one feature, given as a vector x, but weight the feature by its Gini coefficient. Use for data with strictly positive values.
computeSemblanceOneFeature_Gini(x)
x | a vector of observations for which a given feature has been measured or estimated |
a Semblance metric for only one feature measured for several observations
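For readers unfamiliar with the Gini coefficient used as the feature weight, the following is a minimal base R sketch of the standard sample Gini coefficient for a strictly positive vector. The exact estimator used inside the package is not stated here, so treat this as an assumption; `gini_sketch` is a hypothetical name.

```r
# Hypothetical helper (not part of the package): standard sample Gini
# coefficient, G = sum_i (2i - n - 1) * x_(i) / (n * sum(x)),
# where x_(i) are the values sorted in ascending order.
gini_sketch <- function(x) {
  stopifnot(is.numeric(x), length(x) > 0, all(x > 0))
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

g_equal <- gini_sketch(c(1, 1, 1, 1))   # all values equal: no inequality
g_spread <- gini_sketch(c(1, 2, 3, 4))  # moderately unequal values
```

A feature whose values are all equal gets a coefficient of 0, so under a Gini weighting it would contribute nothing, while features with more unequal values receive larger weights.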
Make the upper triangular part the same as the lower triangular part.
makeUpperLower(m)
m | a matrix whose upper triangular part needs to be created from the lower triangular part |
a matrix in which the upper triangular part is the same as the lower triangular part
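The operation described above can be sketched in base R as follows. This is one common way to mirror a lower triangle, not necessarily how `makeUpperLower` is implemented; `mirror_lower_to_upper` is a hypothetical name.

```r
# Hypothetical helper (not part of the package): copy the lower triangle
# of a square matrix into its upper triangle, leaving the diagonal as is.
mirror_lower_to_upper <- function(m) {
  stopifnot(is.matrix(m), nrow(m) == ncol(m))
  m[upper.tri(m)] <- t(m)[upper.tri(m)]
  m
}

# Fill only the lower triangle (including the diagonal) of a 3 x 3 matrix
m <- matrix(0, 3, 3)
m[lower.tri(m, diag = TRUE)] <- c(1, 2, 3, 4, 5, 6)

s <- mirror_lower_to_upper(m)
isSymmetric(s)
```

This pattern is useful when a pairwise quantity is computed only for pairs with i > j, which halves the work for a symmetric kernel.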
Kernel methods can operate in a high-dimensional, implicit feature space with low computational cost. Here, we present a rank-based Mercer kernel to compute a pair-wise similarity metric corresponding to an informative representation of the data. We tailor the development of the kernel to encode our prior knowledge about the data distribution over a probability space. The philosophical concept behind our construction is that objects whose feature values fall at the extremes of that feature’s probability mass distribution are more similar to each other than objects whose feature values lie closer to the mean. This idea represents a fundamentally novel way of assessing similarity between two observations. Our kernel (henceforth called 'Semblance') naturally lends itself to the construction of a distance metric that emphasizes features whose values lie far away from the mean of their probability distribution. Semblance relies on properties empirically determined from the data and does not assume an underlying distribution. The use of feature ranks on a probability space ensures that Semblance is computationally efficient, robust to outliers, and statistically stable, thus making it a widely applicable algorithm for pattern analysis. This R package accompanies the research article "Semblance: A Data-driven Kernel Redefines the Notion of Similarity", to appear in Science Advances.
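Since the description mentions constructing a distance metric from the kernel, here is a minimal sketch of the standard kernel-induced distance, d(i, j) = sqrt(K[i, i] + K[j, j] - 2 K[i, j]), which applies to any Mercer kernel Gram matrix. Whether the authors use this particular construction is an assumption; `kernel_to_distance` is a hypothetical helper, and the toy Gram matrix below is a plain linear kernel rather than Semblance output.

```r
# Hypothetical helper (not part of the package): standard kernel-induced
# distance computed from a positive semi-definite Gram matrix K.
kernel_to_distance <- function(K) {
  stopifnot(is.matrix(K), nrow(K) == ncol(K))
  d2 <- outer(diag(K), diag(K), "+") - 2 * K
  sqrt(pmax(d2, 0))  # clamp tiny negative values caused by rounding
}

# Toy Gram matrix from a linear kernel on three 2-d points
Z <- matrix(c(1, 0,
              0, 1,
              1, 1), nrow = 3, byrow = TRUE)
K <- tcrossprod(Z)

D <- kernel_to_distance(K)
```

For the linear kernel used in this toy example, the induced distance coincides with the ordinary Euclidean distance between the rows of `Z`; with a Gram matrix returned by `ranksem`, the same helper would give distances in the kernel's implicit feature space.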
ranksem(X)
X | a matrix X with n observations and m features, whose Semblance Gram Matrix is to be computed |
The resultant Gram Matrix after applying Semblance kernel to the input
# Simulation example: the user inputs a matrix with single-cell gene expression data
ngenes = 10
ncells = 10
nclust = 2
mu = c(100, 0)      # mean in cluster 1, cluster 2 for informative genes
sigma = c(0.01, 1)  # stdev in cluster 1, cluster 2 for informative genes
size.rare.clust = 0.1
prop.info.genes = 0.2
n.info.genes = round(prop.info.genes * ngenes)
n.clust1.cells = round(ncells * size.rare.clust)
mu1 = c(rep(mu[1] * sigma[2], n.info.genes), rep(0, ngenes - n.info.genes))
mu2 = c(rep(mu[2] * sigma[2], n.info.genes), rep(0, ngenes - n.info.genes))
sig1 = c(rep(sigma[1], n.info.genes), rep(1, ngenes - n.info.genes))
sig2 = c(rep(sigma[2], n.info.genes), rep(1, ngenes - n.info.genes))
X = matrix(ncol = ngenes, nrow = ncells, data = 0)
for (i in 1:n.clust1.cells) {
  X[i, ] = rnorm(ngenes, mean = mu1, sd = sig1)
}
for (i in (n.clust1.cells + 1):ncells) {
  X[i, ] = rnorm(ngenes, mean = mu2, sd = sig2)
}
# Compute kernels/distances
rks = ranksem(X)
Compute Gini-weighted Semblance
ranksem_Gini(X)
X | a matrix X with n observations and m features, whose Semblance Gram Matrix is to be computed. While computing this Gram Matrix, each feature is weighted by its Gini index for efficient feature selection. |
The resultant Gini-weighted Gram Matrix after applying Semblance kernel to the input
# Simulation example: the user inputs a matrix with single-cell gene expression data
ngenes = 10
ncells = 10
nclust = 2
mu = c(5, 1)     # mean in cluster 1, cluster 2 for informative genes
sigma = c(2, 1)  # stdev in cluster 1, cluster 2 for informative genes
size.rare.clust = 0.2
prop.info.genes = 0.2
n.info.genes = round(prop.info.genes * ngenes)
n.clust1.cells = round(ncells * size.rare.clust)
mu1 = c(rep(mu[1] * sigma[2], n.info.genes), rep(0, ngenes - n.info.genes))
mu2 = c(rep(mu[2] * sigma[2], n.info.genes), rep(0, ngenes - n.info.genes))
sig1 = c(rep(sigma[1], n.info.genes), rep(1, ngenes - n.info.genes))
sig2 = c(rep(sigma[2], n.info.genes), rep(1, ngenes - n.info.genes))
X = matrix(ncol = ngenes, nrow = ncells, data = 0)
for (i in 1:n.clust1.cells) {
  X[i, ] = rnorm(ngenes, mean = mu1, sd = sig1)
}
for (i in (n.clust1.cells + 1):ncells) {
  X[i, ] = rnorm(ngenes, mean = mu2, sd = sig2)
}
Noise <- matrix(rnorm(prod(dim(X)), mean = 2, sd = 0.4), nrow = 10)
X = X + Noise
# Compute kernels/distances
rks = ranksem_Gini(X)
Make a matrix by repeating vector v into n columns
repCol(v, n)
v | a vector to be operated on |
n | number of columns the vector will be repeated over |
a matrix with repeated columns
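A base R sketch of the behaviour described above, assuming the result has `length(v)` rows and `n` identical columns. This is an illustration, not the package's source; `rep_col_sketch` is a hypothetical name.

```r
# Hypothetical helper (not part of the package): place v in each of n columns.
rep_col_sketch <- function(v, n) {
  # matrix() fills column by column, so repeating v n times gives n copies of v
  matrix(rep(v, times = n), nrow = length(v), ncol = n)
}

A <- rep_col_sketch(c(1, 2, 3), 2)
dim(A)
```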
Make a matrix by repeating vector v into n rows
repRow(v, n)
v | a vector to be operated on |
n | number of rows the vector will be repeated over |
a matrix with repeated rows
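A matching base R sketch for the row case, assuming the result has `n` identical rows and `length(v)` columns. As before, this is an illustration rather than the package's source; `rep_row_sketch` is a hypothetical name.

```r
# Hypothetical helper (not part of the package): place v in each of n rows.
rep_row_sketch <- function(v, n) {
  # Repeating each element n times and filling column by column
  # puts element j of v in every row of column j
  matrix(rep(v, each = n), nrow = n, ncol = length(v))
}

B <- rep_row_sketch(c(1, 2, 3), 2)
dim(B)
```

Helpers of this kind are typically used to build two aligned matrices so that all pairwise comparisons between observations can be done with vectorised arithmetic.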