KODAMA (Knowledge Discovery by Accuracy Maximization) is a novel learning algorithm for unsupervised feature extraction, specifically designed for analysing noisy and high-dimensional data sets (Cacciatore et al., 2014). The core idea of the original algorithm is to use an iterative procedure to produce a hypothetical classification through maximization of cross-validated predictive accuracy. Using only the data set as input (no prior knowledge is needed), the iterative procedure yields a classification with high cross-validated accuracy. Two different classifiers are implemented in this package for the computation of cross-validated predictive accuracy: k-nearest neighbours (kNN) and Partial Least Squares Discriminant Analysis (PLS-DA). The procedure is repeated several times to average out the effects of the randomness of the iterative procedure. Each run of the procedure yields a classification vector with high cross-validated accuracy. KODAMA subsequently collects and processes these results by constructing a dissimilarity matrix to provide a holistic view of the data. This documentation introduces the usage of KODAMA.
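To make the core idea concrete, the following toy sketch mimics the accuracy-maximization cycle with a leave-one-out 1-NN classifier (knn.cv from the class package). This is a simplified illustration under our own assumptions, not the implementation used by KODAMA.matrix, and kodama_cycle is a hypothetical helper name.
library(class)  # provides knn.cv (leave-one-out k-nearest neighbours)
# Toy accuracy-maximization cycle: starting from the labels w, repeatedly
# replace each label with its cross-validated 1-NN prediction until the
# labels are self-consistent (or a maximum number of iterations is reached).
kodama_cycle <- function(x, w, max_iter = 20) {
  acc <- 0
  for (i in seq_len(max_iter)) {
    pred <- knn.cv(x, w, k = 1)  # leave-one-out prediction of current labels
    acc <- mean(pred == w)       # cross-validated accuracy
    if (acc == 1) break          # labels fully self-consistent: stop
    w <- pred                    # otherwise adopt the predicted labels
  }
  list(labels = w, accuracy = acc)
}
# Example: one cycle on the iris measurements, starting from random labels.
res <- kodama_cycle(as.matrix(iris[, 1:4]), factor(sample(1:2, 150, replace = TRUE)))
res$accuracy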
The R package KODAMA (current version 2.3) is part of the Comprehensive R Archive Network (CRAN). The simplest way to install the package is to enter the following command into your R session:
install.packages("KODAMA")
We also suggest installing the R package rgl for the visualization of data in 3D interactive plots.
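If rgl is not already available, it can be installed from CRAN in the same way:
install.packages("rgl")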
To compile the C/C++ code included in the package, for customization or for installation on alternative operating systems, the package can be installed manually from source. To this end, download the source archive (KODAMA.tar.gz) from the package's CRAN page and run:
R CMD INSTALL KODAMA.tar.gz
to install the package. Note that this may require additional software on some platforms. Windows requires Rtools to be installed and available in the default search path (environment variable PATH). Mac OS X requires installation of Xcode and its command line tools. All versions downloadable from CRAN have been built using R version 3.2.3. The package should work without major issues on R versions > 3.0.0.
To load the package, enter the following instruction in your R session:
library("KODAMA")
If this command terminates without any error messages, you can be sure that the package has been installed successfully. The KODAMA package is now ready for use.
The package includes both a user manual (this document) and a reference manual (help pages for each function). To view the user manual, enter vignette("KODAMA"). Help pages can be viewed with help(package="KODAMA").
The function swissroll
computes the Swiss Roll dataset
of a given sample size. The following example computes a Swiss Roll
dataset containing 1,000 samples.
require("rgl")
x=swissroll()
open3d()
plot3d(x, col=rainbow(1000),box=FALSE,type="s",size=1)
The function dinisurface computes the Ulisse Dini's surface dataset of a given sample size. The following example computes a Dini's surface dataset containing 1,000 samples.
require("rgl")
x=dinisurface()
open3d()
plot3d(x, col=rainbow(1000),box=FALSE,type="s",size=1)
The function helicoid
computes the Helicoid dataset of a
given sample size. The following example computes a Helicoid dataset
containing 1,000 samples.
require("rgl")
x=helicoid()
open3d()
plot3d(x, col=rainbow(1000),box=FALSE,type="s",size=1)
The function spirals
computes the Spirals dataset of a
given sample size. The following example computes a Spirals dataset
containing 1,000 samples.
par(mfrow=c(2,2),mai=c(0,0,0,0))
v1=spirals(c(100,100,100),c(0.1,0.1,0.1))
plot(v1,col=rep(2:4,each=100))
v2=spirals(c(100,100,100),c(0.1,0.2,0.3))
plot(v2,col=rep(2:4,each=100))
v3=spirals(c(100,100,100,100,100),c(0,0,0.2,0,0))
plot(v3,col=rep(2:6,each=100))
v4=spirals(c(20,40,60,80,100),c(0.1,0.1,0.1,0.1,0.1))
plot(v4,col=rep(2:6,c(20,40,60,80,100)))
The lymphoma
dataset consists of gene expression
profiles of the three most prevalent adult lymphoid malignancies:
diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL), and
B-cell chronic lymphocytic leukemia (B-CLL). The dataset consists of
4,682 mRNA genes for 62 samples (42 samples of DLBCL, 9 samples of FL
and 11 samples of B-CLL). Missing values are imputed and data are
standardized as described in Dudoit et al. (2002).
The MetRef data belong to a cohort of 22 healthy donors (11 male and 11 female), each of whom provided about 40 urine samples over a time course of approximately 2 months, for a total of 873 samples. Each sample was analyzed by Nuclear Magnetic Resonance Spectroscopy, and each spectrum was divided into 450 spectral bins.
The USA dataset consists of the spoken, not written, presidential addresses from 1900 until the sixth address by Barack Obama in 2014. Punctuation characters, numbers, words shorter than three characters, and stop-words (e.g., “that”, “and”, and “which”) were removed from the dataset. This resulted in a dataset of 86 speeches containing
834 different meaningful words each. Term frequency-inverse document
frequency (TF-IDF) was used to obtain feature vectors. It is often used
as a weighting factor in information retrieval and text mining. The
TF-IDF value increases proportionally to the number of times a word
appears in the document, but is offset by the frequency of the word in
the corpus, which helps to control for the fact that some words are
generally more common than others.
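As a brief illustration of this weighting (using a toy term-count matrix of our own, not the code used to build the USA dataset), TF-IDF can be computed in a few lines:
# Toy term-count matrix: rows are documents, columns are terms.
tc <- matrix(c(2, 0, 1,
               0, 3, 1),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("doc1", "doc2"), c("war", "economy", "nation")))
tf <- tc / rowSums(tc)                  # term frequency within each document
idf <- log(nrow(tc) / colSums(tc > 0))  # inverse document frequency
tfidf <- sweep(tf, 2, idf, "*")         # TF-IDF feature vectors
tfidf                                   # "nation" occurs in every document, so its weight is 0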
This famous Fisher's (aka Anderson's) iris data set gives the measurements (in centimetres) of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, Iris versicolor, and Iris virginica. iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
We suggest testing KODAMA first on the iris dataset using the default parameters (i.e., the kNN classifier with k = 2). The KODAMA.visualization function then performs the fast T-distributed Stochastic Neighbor Embedding (t-SNE; van der Maaten, 2014) on the KODAMA dissimilarity matrix. The results can be visualized with the plot function.
data(iris)
data=iris[,-5]
labels=iris[,5]
kodama_knn_2=KODAMA.matrix(data,FUN="KNN",f.par=2)
cc=KODAMA.visualization(kodama_knn_2,"t-SNE")
plot(cc,pch=21,bg=as.numeric(labels)+1)
The mcplot
function can be used to extract the values of
cross-validated accuracy obtained from each iterative step of the Monte
Carlo procedures of maximization of the cross-validated accuracy.
mcplot(kodama_knn_2)
KODAMA can be run with different input settings. Here, the two different classifiers, kNN and PLS-DA, are tested. The classifier and its parameter can be changed by modifying FUN and f.par, respectively. The kNN classifier is tested with k = 2, 5, and 10, and the PLS-DA classifier is tested with 2, 3, and 4 components.
kodama_knn_2 =KODAMA.matrix(data,FUN="KNN",f.par=2)
cc_knn_2=KODAMA.visualization(kodama_knn_2,"t-SNE")
kodama_knn_5 =KODAMA.matrix(data,FUN="KNN",f.par=5)
cc_knn_5=KODAMA.visualization(kodama_knn_5,"t-SNE")
kodama_knn_10=KODAMA.matrix(data,FUN="KNN",f.par=10)
cc_knn_10=KODAMA.visualization(kodama_knn_10,"t-SNE")
kodama_pls_2 =KODAMA.matrix(data,FUN="PLS-DA",f.par=2)
cc_pls_2=KODAMA.visualization(kodama_pls_2,"t-SNE")
kodama_pls_3 =KODAMA.matrix(data,FUN="PLS-DA",f.par=3)
cc_pls_3=KODAMA.visualization(kodama_pls_3,"t-SNE")
kodama_pls_4 =KODAMA.matrix(data,FUN="PLS-DA",f.par=4)
cc_pls_4=KODAMA.visualization(kodama_pls_4,"t-SNE")
After the KODAMA analyses, the different solutions can be visualized for comparative purposes.
par(mfrow=c(2,3))
plot(cc_knn_2 ,pch=21,bg=as.numeric(labels)+1,main="KODAMA with kNN (k=2)")
plot(cc_knn_5 ,pch=21,bg=as.numeric(labels)+1,main="KODAMA with kNN (k=5)")
plot(cc_knn_10,pch=21,bg=as.numeric(labels)+1,main="KODAMA with kNN (k=10)")
plot(cc_pls_2 ,pch=21,bg=as.numeric(labels)+1,main="KODAMA with PLS (ncomp=2)")
plot(cc_pls_3 ,pch=21,bg=as.numeric(labels)+1,main="KODAMA with PLS (ncomp=3)")
plot(cc_pls_4 ,pch=21,bg=as.numeric(labels)+1,main="KODAMA with PLS (ncomp=4)")
The Shannon entropy H (Shannon, 1948), a measure of the unpredictability of information content, can be used to choose the best classifier and to optimize its parameters. H is given by:
$$H = -\sum_{i}\sum_{j} v_{ij} \log v_{ij}$$
where $v_{ij}$ is the proximity value between sample $i$ and sample $j$, divided by the sum of all proximities.
Different classifiers can lead to solutions that represent different interpretations of the data. A lower H value can be indicative of the solution that best characterizes the data.
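The entropy itself is straightforward to compute from a matrix of proximities. The helper below is a minimal sketch under our own assumptions; the name of the proximity element in the KODAMA.matrix output may differ between versions.
# Shannon entropy of a proximity matrix P: normalize the proximities so that
# they sum to one, drop zero entries (0 * log 0 is taken as 0), and
# return -sum(v * log(v)).
shannon_entropy <- function(P) {
  v <- P / sum(P)
  v <- v[v > 0]
  -sum(v * log(v))
}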
Classifier | Parameter | Entropy
---|---|---
kNN | 2 | 9.371
kNN | 5 | 9.362
kNN | 10 | 9.381
PLS-DA | 2 | 9.976
PLS-DA | 3 | 9.933
PLS-DA | 4 | 9.977
The k.test function performs a statistical test to assess the association between the KODAMA output and any additional related parameters, such as clinical metadata. The coefficient of determination (R²) is used to assess the proportion of the variance in the dependent variable (the KODAMA output) that is predictable from the independent variable, and can thus be used as a measure of the goodness of fit (Cameron et al., 1997). A permutation test is performed by randomly sampling the values of the labels to estimate the significance of the observed association.
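For example, the association between the iris KODAMA output obtained above and the species labels could be tested as follows. This is a usage sketch; the exact argument names may differ between package versions.
# Permutation test of the association between the KODAMA/t-SNE output
# and the species labels, with 100 random permutations.
k.test(cc_knn_2, labels, n = 100)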
The loads function can be used to extract the variable ranking. After each maximization of the cross-validated accuracy, the final label set is used to calculate the loadings of PLS-DA or the p-values from the Kruskal-Wallis rank sum test. The output of the loads function is the average of these values for each variable.
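A brief usage sketch with the iris PLS-DA run from above (assuming loads accepts the output of KODAMA.matrix directly; the interface may vary between versions):
# Average variable ranking for the four iris measurements.
vr <- loads(kodama_pls_2)
vr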
In the next example, the MetRef dataset is used to show the possibilities offered by KODAMA when the analysis is performed in an unsupervised or semi-supervised fashion. First, the MetRef dataset is pre-processed. This involves removing zero values from the MetRef data matrix and correcting for variations in urine concentration using Probabilistic Quotient Normalization (Dieterle et al., 2006).
data(MetRef)
# Zero values are removed.
MetRef$data=MetRef$data[,-which(colSums(MetRef$data)==0)]
# Normalization of the data set with the Probabilistic Quotient Normalization method.
MetRef$data=normalization(MetRef$data)$newXtrain
# Centers the mean to zero and scales the data by dividing each variable by its standard deviation.
MetRef$data=scaling(MetRef$data)$newXtrain
donor=as.numeric(as.factor(MetRef$donor))
gender=as.numeric(as.factor(MetRef$gender))
KODAMA is performed using different input settings.
kk1=KODAMA.matrix(MetRef$data,FUN = "PLS-DA",f.par = 2)
kk2=KODAMA.matrix(MetRef$data,FUN = "PLS-DA",f.par = 5)
kk3=KODAMA.matrix(MetRef$data,FUN = "PLS-DA",f.par = 10)
kk4=KODAMA.matrix(MetRef$data,FUN = "PLS-DA",f.par = 20)
kk5=KODAMA.matrix(MetRef$data,FUN = "PLS-DA",f.par = 50)
kk6=KODAMA.matrix(MetRef$data,FUN = "PLS-DA",f.par = 100)
kk7=KODAMA.matrix(MetRef$data,FUN = "KNN",f.par = 2)
kk8=KODAMA.matrix(MetRef$data,FUN = "KNN",f.par = 3)
kk9=KODAMA.matrix(MetRef$data,FUN = "KNN",f.par = 5)
kk10=KODAMA.matrix(MetRef$data,FUN = "KNN",f.par = 10)
kk11=KODAMA.matrix(MetRef$data,FUN = "KNN",f.par = 15)
kk12=KODAMA.matrix(MetRef$data,FUN = "KNN",f.par = 20)
The most informative solution is obtained with KODAMA using PLS-DA with 100 components as the internal classifier, as can be seen in the table below, which reports the entropy value of each solution.
Classifier | Parameter | Entropy
---|---|---
kNN | 2 | 12.847
kNN | 3 | 12.228
kNN | 5 | 12.129
kNN | 10 | 12.432
kNN | 15 | 12.783
kNN | 20 | 13.137
PLS-DA | 2 | 13.37
PLS-DA | 5 | 13.416
PLS-DA | 10 | 13.322
PLS-DA | 20 | 12.637
PLS-DA | 50 | 11.327
PLS-DA | 100 | 11.307
Principal Component Analysis, performed on this data set, is shown in the figure. Colors are indicative of donor specificity (left plot) and gender of donor (right plot).
par(mfrow=c(1,2))
plot(pca(MetRef$data)$x,
bg=rainbow(22)[donor],pch=21,cex=1.5)
plot(pca(MetRef$data)$x,
bg=c("#2c7ac8","#e3b80f")[gender],pch=21,cex=1.5)
The result of KODAMA with PLS-DA (100 components) as the classifier is shown next. The KODAMA output saved in the variable kk6 is transformed by t-SNE into a low-dimensional space.
cc6=KODAMA.visualization(kk6,"t-SNE")
par(mfrow=c(1,2))
plot(cc6,bg=rainbow(22)[donor],pch=21,cex=1.5)
plot(cc6,bg=c("#2c7ac8","#e3b80f")[gender],pch=21,cex=1.5)
KODAMA can also use external information to work in a semi-supervised way. Supervised constraints can be imposed by linking samples in such a way that if the label of one of them changes, the linked samples must change in the same way (i.e., they are forced to belong to the same class). This produces solutions in which linked samples have the lowest values in the KODAMA dissimilarity matrix.
In the next example, urine samples from the same donor are linked by providing class (donor) information to the algorithm before performing the iterative procedure thereby undertaking a “semi-supervised approach” to highlight otherwise hidden features.
kkA=KODAMA.matrix(MetRef$data,FUN="PLS-DA",f.par = 100,constrain=donor )
ccA=KODAMA.visualization(kkA,"t-SNE")
par(mfrow=c(1,2))
plot(ccA,bg=rainbow(22)[donor],pch=21,cex=1.5)
plot(ccA,bg=c("#2c7ac8","#e3b80f")[gender],pch=21,cex=1.5)
Additional external information can be provided to the KODAMA algorithm by fixing the class labels given in the vector W. The values of the vector fix must be TRUE or FALSE; by default, all elements are FALSE. Samples with a TRUE fix value will not change the class label defined in W during the maximization of the cross-validated accuracy procedure.
Here, gender information is provided for a randomly selected subset of samples. Color coding indicates gender in the figures. Square data points indicate samples with supervised information; circle data points indicate samples without any supervised information.
FIX=sample(c(TRUE,FALSE),873,replace=TRUE)
inform=gender
inform[!FIX]=NA
kkB=KODAMA.matrix(MetRef$data,FUN="PLS-DA",f.par = 100,W=inform,fix=FIX)
ccB=KODAMA.visualization(kkB,"t-SNE")
par(mfrow=c(1,2))
plot(ccB,bg=rainbow(22)[donor],pch=21,cex=1.5)
plot(ccB,bg=c("#2c7ac8","#e3b80f")[gender],pch=21,cex=1.5)
Here, the information from the two previous examples is provided together.
FIX=donor>10
inform=gender
inform[!FIX]=NA
kkC=KODAMA.matrix(MetRef$data,FUN="PLS-DA",f.par = 100,W= inform,constrain=donor,fix=FIX)
ccC=KODAMA.visualization(kkC,"t-SNE")
par(mfrow=c(1,2))
plot(ccC,bg=rainbow(22)[donor],pch=21,cex=1.5)
plot(ccC,bg=c("#2c7ac8","#e3b80f")[gender],pch=21,cex=1.5)
Version 2.3 separates the previous KODAMA function into two distinct functions: KODAMA.matrix and KODAMA.visualization. The first performs the KODAMA analysis; the second transforms the KODAMA dissimilarity matrix into a low-dimensional space using t-SNE, MDS, or UMAP.
Version 2.0 introduced a few novelties to the KODAMA algorithm and some small changes to the output structure of the functions. We tried to ensure backward compatibility with the previous version where possible. Sammon's non-linear mapping (Sammon, 1969), used in the previous version to transform the KODAMA dissimilarity matrix into a low-dimensional space, has been replaced by t-SNE (van der Maaten, 2008) with a tree-based implementation (van der Maaten, 2014) via the Rtsne package.
Version 1.1 brought several fundamental changes to the architecture of the package. We tried to ensure backward compatibility with the previous version where possible. However, there are still some caveats that users should take into account.
Users who upgrade to version 1.1 from the older version (0.0.1) should be aware that the package requires up-to-date versions of the Rcpp and RcppArmadillo packages. This can simply be solved by installing both from CRAN with install.packages("Rcpp") and install.packages("RcppArmadillo").
When compiling the R package from source on Mac OS X, the following error has been reported:
ld: warning: directory not found for option '-L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2'
ld: library not found for -lquadmath
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [KODAMA.so] Error 1
ERROR: compilation failed for package ‘KODAMA’
* removing ‘/Users/dmacinty/Library/R/3.3/library/KODAMA’
This occurs because R for OS X Mavericks was compiled using gfortran-4.8. The solution is to go to the optional libraries, frameworks, and applications for Mac OS X on r.research.att.com and download gfortran-4.8.2-darwin13.tar.bz2. Extract the package in ~/, which is root. The files should be unpacked into /usr/local/…
Alternatively, open terminal and type:
curl -O http://r.research.att.com/libs/gfortran-4.8.2-darwin13.tar.bz2
sudo tar fvxz gfortran-4.8.2-darwin13.tar.bz2 -C /
Cacciatore S, Tenori L, Luchinat C, Bennett P, and MacIntyre DA. KODAMA: an updated R package for knowledge discovery and data mining. Bioinformatics. Submitted.
Moreover, the original paper in which KODAMA was introduced should also be cited as follows:
Cacciatore S, Luchinat C, Tenori L. Knowledge discovery by accuracy maximization. Proc Natl Acad Sci U S A 2014;111(14):5117-22.
To obtain BibTeX entries for both references, enter the following into your R session:
citation("KODAMA")
Cacciatore S, Luchinat C, Tenori L (2014) Knowledge discovery by accuracy maximization. Proc Natl Acad Sci U S A, 111(14), 5117-22.
Alizadeh AA, et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769), 503-11.
Cameron CA, et al. (1997) An R-squared measure of goodness of fit of some common nonlinear regression models. Journal of Econometrics, 77(2), 1790-2.
Dieterle F, et al. (2006) Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabolomics. Anal Chem, 78, 4281-90.
Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc, 97(457), 77-87.
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179-88.
Sammon JW (1969) A non-linear mapping for data structure analysis. IEEE Transactions on Computers, C-18, 401-409.
Shannon CE (1948) A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
van der Maaten LJP, Hinton GE (2008) Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.
van der Maaten LJP (2014) Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15, 3221-3245.