Title: | Knowledge Discovery by Accuracy Maximization |
---|---|
Description: | An unsupervised and semi-supervised learning algorithm that performs feature extraction from noisy and high-dimensional data. It facilitates identification of patterns representing underlying groups on all samples in a data set. Based on Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA. (2017) Bioinformatics <doi:10.1093/bioinformatics/btw705> and Cacciatore S, Luchinat C, Tenori L. (2014) Proc Natl Acad Sci USA <doi:10.1073/pnas.1220873111>. |
Authors: | Stefano Cacciatore [aut, trl, cre], Leonardo Tenori [aut] |
Maintainer: | Stefano Cacciatore <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.4.1 |
Built: | 2024-11-06 09:23:11 UTC |
Source: | CRAN |
Summarization of the categorical information.
categorical.test(name, x, y, total.column=FALSE, ...)
name |
the name of the feature. |
x |
the information to summarize. |
y |
the classification of the cohort. |
total.column |
option to visualize the total column (by default = FALSE). |
... |
further arguments to be passed to the function. |
The function returns a table with the summarized information and the p-value computed using Fisher's exact test.
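The computation behind that p-value can be reproduced with base R's fisher.test on the cross-tabulation of feature and class (a minimal sketch on toy data, not the package's code):

```r
# Toy data: a categorical feature and a two-group cohort classification.
x <- c("A", "A", "B", "B", "B", "A")
y <- c("g1", "g1", "g1", "g2", "g2", "g2")

# Cross-tabulate and apply Fisher's exact test,
# the test behind the reported p-value.
tab <- table(x, y)
p <- fisher.test(tab)$p.value
```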
Stefano Cacciatore
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
correlation.test
,continuous.test
, txtsummary
data(clinical)
hosp=clinical[,"Hospital"]
gender=clinical[,"Gender"]
GS=clinical[,"Gleason score"]
BMI=clinical[,"BMI"]
age=clinical[,"Age"]
A=categorical.test("Gender",gender,hosp)
B=categorical.test("Gleason score",GS,hosp)
C=continuous.test("BMI",BMI,hosp,digits=2)
D=continuous.test("Age",age,hosp,digits=1)
rbind(A,B,C,D)
The data belong to a cohort of 35 patients with prostate cancer from two different hospitals.
data(clinical)
The data.frame "clinical" with the following elements: "Hospital", "Gender", "Gleason score", "BMI", and "Age".
data(clinical)
head(clinical)
Summarization of the continuous information.
continuous.test(name, x, y, digits=3, scientific=FALSE, range=c("IQR","95%CI"), logchange=FALSE, pos=2, method=c("non-parametric","parametric"), total.column=FALSE, ...)
name |
the name of the feature. |
x |
the information to summarize. |
y |
the classification of the cohort. |
digits |
how many significant digits are to be used. |
scientific |
a logical specifying whether the result should be encoded in scientific format. |
range |
the range to be visualized. |
logchange |
a logical specifying whether the log2 fold change should be visualized. |
pos |
a value indicating the position of range to be visualized. 1 for column, 2 for row. |
method |
a character string indicating which test method is to be computed. "non-parametric" (default), or "parametric". |
total.column |
option to visualize the total column (by default = FALSE). |
... |
further arguments to be passed to or from methods. |
The function returns a table with the summarized information and the relative p-value. With the non-parametric method, the p-value is computed using the Wilcoxon rank-sum test if the number of groups equals two, and the Kruskal-Wallis test otherwise. With the parametric method, the p-value is computed using Student's t-test if the number of groups equals two, and one-way ANOVA otherwise.
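The test-selection rule can be sketched with base R tests (an illustration of the rule, not the package's internal code; the helper name pvalue_for is hypothetical):

```r
# Choose the significance test as continuous.test describes:
# two groups -> Wilcoxon / t-test; more groups -> Kruskal-Wallis / ANOVA.
pvalue_for <- function(x, y, method = "non-parametric") {
  y <- as.factor(y)
  if (nlevels(y) == 2) {
    if (method == "non-parametric") wilcox.test(x ~ y)$p.value
    else t.test(x ~ y)$p.value
  } else {
    if (method == "non-parametric") kruskal.test(x ~ y)$p.value
    else summary(aov(x ~ y))[[1]][["Pr(>F)"]][1]
  }
}

set.seed(1)
x <- rnorm(30)
y2 <- rep(c("a", "b"), 15)
y3 <- rep(c("a", "b", "c"), 10)
p2 <- pvalue_for(x, y2)                 # Wilcoxon rank-sum test
p3 <- pvalue_for(x, y3, "parametric")   # one-way ANOVA
```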
Stefano Cacciatore
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
correlation.test
, categorical.test
, txtsummary
data(clinical)
hosp=clinical[,"Hospital"]
gender=clinical[,"Gender"]
GS=clinical[,"Gleason score"]
BMI=clinical[,"BMI"]
age=clinical[,"Age"]
A=categorical.test("Gender",gender,hosp)
B=categorical.test("Gleason score",GS,hosp)
C=continuous.test("BMI",BMI,hosp,digits=2)
D=continuous.test("Age",age,hosp,digits=1)
rbind(A,B,C,D)
This function performs the maximization of cross-validated accuracy by an iterative process.
core_cpp(x, xTdata=NULL, clbest, Tcycle=20, FUN=c("PLS-DA","KNN"), fpar=2, constrain=NULL, fix=NULL, shake=FALSE)
x |
a matrix. |
xTdata |
a matrix for projections. This matrix contains samples that are not used for the maximization of the cross-validated accuracy. Their classification is obtained by predicting samples on the basis of the final classification vector. |
clbest |
a vector to optimize. |
Tcycle |
number of iterative cycles that leads to the maximization of cross-validated accuracy. |
FUN |
the classifier to be considered. Choices are "PLS-DA" and "KNN". |
fpar |
parameters of the classifier. If the classifier is "KNN", fpar represents the number of neighbours; if "PLS-DA", the number of components. |
constrain |
a vector of nrow(x) elements. Samples with the same value of the constrain vector are forced to stay in the same cluster. |
fix |
a vector of nrow(x) logical elements. Samples marked TRUE keep their classification fixed during the iterative process. |
shake |
if TRUE, a random perturbation of the classification vector clbest is applied before starting the maximization (by default = FALSE). |
The function returns a list with the following items:
clbest |
a classification vector with a maximized cross-validated accuracy. |
accbest |
the maximum cross-validated accuracy achieved. |
vect_acc |
a vector of all cross-validated accuracies obtained. |
vect_proj |
a prediction of the samples in xTdata obtained from the final classification vector. |
Stefano Cacciatore and Leonardo Tenori
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
KODAMA.matrix
,KODAMA.visualization
# Here, the famous (Fisher's or Anderson's) iris data set is loaded
data(iris)
u=as.matrix(iris[,-5])
s=sample(1:150,150,TRUE)
# The maximization of the accuracy of the vector s is performed
results=core_cpp(u, clbest=s, fpar=5)
print(as.numeric(results$clbest))
Summarization of the continuous information.
correlation.test(x, y, method=c("pearson","spearman","MINE"), name=NA, perm=100, ...)
x |
a numeric vector. |
y |
a numeric vector. |
method |
a character string indicating which correlation method is to be computed. "pearson" (default), "spearman", or "MINE". |
name |
the name of the feature. |
perm |
number of permutations used to estimate the p-value for the MINE correlation. |
... |
further arguments to be passed to or from methods. |
The function returns a table with the summarized information.
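For the "pearson" and "spearman" methods, the reported estimate and p-value correspond to what base R's cor.test computes (the "MINE" method instead estimates its p-value by permutation); a minimal sketch on toy data:

```r
# Correlation estimate and p-value for two numeric vectors,
# as summarized by a correlation test.
set.seed(42)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.5)   # y correlated with x plus noise
ct <- cor.test(x, y, method = "pearson")
r <- unname(ct$estimate)       # correlation coefficient
p <- ct$p.value
```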
Stefano Cacciatore
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
categorical.test
,continuous.test
, txtsummary
data(clinical)
correlation.test(clinical[,"Age"],clinical[,"BMI"],name="correlation between Age and BMI")
This function creates a data set based upon data points distributed on Ulisse Dini's surface.
dinisurface(N=1000)
N |
Number of data points. |
The function returns a three-dimensional data set.
Stefano Cacciatore and Leonardo Tenori
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
require("rgl")
x=dinisurface()
open3d()
plot3d(x, col=rainbow(1000), box=FALSE, size=3)
The floyd function finds all shortest paths in a graph using Floyd's algorithm.
floyd(data)
data |
a matrix or distance object. |
floyd returns a matrix with the total lengths of the shortest paths between each pair of points.
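For reference, Floyd's algorithm itself can be sketched in a few lines of plain R, treating NA entries as missing edges (a didactic sketch; the package's floyd is implemented in compiled code, and the name floyd_r is hypothetical):

```r
# Plain-R Floyd-Warshall: all-pairs shortest paths.
# NA entries mean "no direct edge" and are treated as infinite distance.
floyd_r <- function(w) {
  w[is.na(w)] <- Inf
  n <- nrow(w)
  for (k in 1:n)
    for (i in 1:n)
      for (j in 1:n)
        if (w[i, k] + w[k, j] < w[i, j]) w[i, j] <- w[i, k] + w[k, j]
  w
}

# Small 3-node graph: edges 1->3 (3), 2->1 (1), 3->2 (2)
w <- matrix(c(0, NA, 3,
              1, 0, NA,
              NA, 2, 0), nrow = 3, byrow = TRUE)
d <- floyd_r(w)
```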
Floyd, Robert W
Algorithm 97: Shortest Path.
Communications of the ACM 1962; 5 (6): 345. doi:10.1145/367766.368168.
# build a graph with 5 nodes
x=matrix(c(0,NA,NA,NA,NA,30,0,NA,NA,NA,10,NA,0,NA,NA,NA,70,50,0,10,NA,40,20,60,0),ncol=5)
print(x)
# compute all path lengths
z=floyd(x)
print(z)
A method to select unbalanced groups in a cohort.
frequency_matching(data, label, times=5, seed=1234)
data |
a data.frame of data. |
label |
a classification of the groups. |
times |
The ratio between the two groups. |
seed |
a single number for random number generation. |
The function returns a list with the following items:
data |
the data after the frequency matching. |
label |
the label after the frequency matching. |
selection |
the rows selected for the frequency matching. |
Stefano Cacciatore
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
data(clinical)
hosp=clinical[,"Hospital"]
gender=clinical[,"Gender"]
GS=clinical[,"Gleason score"]
BMI=clinical[,"BMI"]
age=clinical[,"Age"]

A=categorical.test("Gender",gender,hosp)
B=categorical.test("Gleason score",GS,hosp)
C=continuous.test("BMI",BMI,hosp,digits=2)
D=continuous.test("Age",age,hosp,digits=1)

# Analysis without matching
rbind(A,B,C,D)

# The order is important. Right is more important than left in the vector.
# So, Gleason score will be more important than Age.
var=c("Age","BMI","Gleason score")
t=frequency_matching(clinical[,var],clinical[,"Hospital"],times=1)
newdata=clinical[t$selection,]
hosp.new=newdata[,"Hospital"]
gender.new=newdata[,"Gender"]
GS.new=newdata[,"Gleason score"]
BMI.new=newdata[,"BMI"]
age.new=newdata[,"Age"]
A=categorical.test("Gender",gender.new,hosp.new)
B=categorical.test("Gleason score",GS.new,hosp.new)
C=continuous.test("BMI",BMI.new,hosp.new,digits=2)
D=continuous.test("Age",age.new,hosp.new,digits=1)

# Analysis with matching
rbind(A,B,C,D)
This function creates a data set based upon data points distributed on a helicoid surface.
helicoid(N=1000)
N |
Number of data points. |
The function returns a three-dimensional data set.
Stefano Cacciatore and Leonardo Tenori
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
require("rgl")
x=helicoid()
open3d()
plot3d(x, col=rainbow(1000), box=FALSE, size=3)
This function performs a permutation test using PLS to assess association between the KODAMA output and any additional related parameters such as clinical metadata.
k.test(data, labels, n = 100)
data |
a matrix. |
labels |
a classification vector. |
n |
number of iterations of the permutation test. |
The p-value of the test.
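The permutation-test logic can be sketched generically: compare the observed statistic against its distribution under random relabelings of the samples (here with a simple difference-of-means statistic as a stand-in for the PLS model that k.test actually fits; the names perm_test and stat are hypothetical):

```r
# Generic permutation-test skeleton: the empirical p-value is the
# fraction of n random relabelings whose statistic reaches the observed one.
perm_test <- function(data, labels, stat, n = 100) {
  obs <- stat(data, labels)
  null <- replicate(n, stat(data, sample(labels)))
  mean(null >= obs)   # one-sided empirical p-value
}

set.seed(7)
data <- c(rnorm(20, 0), rnorm(20, 2))      # two well-separated groups
labels <- rep(c("a", "b"), each = 20)
stat <- function(x, y) abs(diff(tapply(x, y, mean)))  # toy statistic
p <- perm_test(data, labels, stat, n = 200)
```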
Stefano Cacciatore
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
KODAMA.matrix
,KODAMA.visualization
data(iris)
data=iris[,-5]
labels=iris[,5]
kk=KODAMA.matrix(data,FUN="KNN",f.par=2)
kkplot=KODAMA.visualization(kk,"t-SNE")
k1=k.test(kkplot,labels)
print(k1)
k2=k.test(kkplot,sample(labels))
print(k2)
This function performs a 10-fold cross-validation on a given data set using a k-nearest neighbours (kNN) model. To assess the prediction ability of the model, the 10-fold cross-validation generates splits of the data set with a 1:9 ratio, that is, 10% of the samples are removed prior to any step of the statistical analysis, including parameter selection and scaling. The best value of k is selected by means of a 10-fold cross-validation on the remaining 90%, choosing the best Q2y value. Permutation testing can be undertaken to estimate the classification/regression performance of the predictors.
knn.double.cv(Xdata, Ydata, constrain=1:nrow(Xdata), compmax=min(5,c(ncol(Xdata),nrow(Xdata))), perm.test=FALSE, optim=TRUE, scaling = c("centering","autoscaling"), times=100, runn=10)
Xdata |
a matrix. |
Ydata |
the responses. If Ydata is a numeric vector, a regression analysis will be performed. If Ydata is factor, a classification analysis will be performed. |
constrain |
a vector of |
compmax |
the maximum number of nearest neighbours (k) to be evaluated. |
perm.test |
a logical indicating whether a permutation test should be performed. |
optim |
a logical indicating whether the optimization of k should be performed. |
scaling |
the scaling method to be used. Choices are "centering" and "autoscaling". |
times |
number of cross-validations with permuted samples. |
runn |
number of cross-validation loops. |
A list with the following components:
Ypred |
the vector containing the predicted values of the response variables obtained by cross-validation. |
Yfit |
the vector containing the fitted values of the response variables. |
Q2Y |
Q2y value. |
R2Y |
R2y value. |
conf |
The confusion matrix (only in classification mode). |
acc |
The cross-validated accuracy (only in classification mode). |
txtQ2Y |
a summary of the Q2y values. |
txtR2Y |
a summary of the R2y values. |
Stefano Cacciatore
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
data(iris)
data=iris[,-5]
labels=iris[,5]
pp=knn.double.cv(data,labels)
print(pp$Q2Y)
table(pp$Ypred,labels)

data(MetRef)
u=MetRef$data
u=u[,-which(colSums(u)==0)]
u=normalization(u)$newXtrain
u=scaling(u)$newXtrain
pp=knn.double.cv(u,as.factor(MetRef$donor))
print(pp$Q2Y)
table(pp$Ypred,MetRef$donor)
k-nearest neighbour classification for a test set from a training set.
knn.kodama(Xtrain, Ytrain, Xtest, Ytest=NULL, k, scaling = c("centering","autoscaling"), perm.test=FALSE, times=1000)
Xtrain |
a matrix of training set cases. |
Ytrain |
a classification vector. |
Xtest |
a matrix of test set cases. |
Ytest |
a classification vector. |
k |
the number of nearest neighbors to consider. |
scaling |
the scaling method to be used. Choices are "centering" and "autoscaling". |
perm.test |
a logical indicating whether a permutation test should be performed. |
times |
number of permutations for the permutation test. |
The function utilizes the Approximate Nearest Neighbor (ANN) C++ library, which can give the exact nearest neighbours or (as the name suggests) approximate nearest neighbours to within a specified error bound. For more information on the ANN library please visit http://www.cs.umd.edu/~mount/ANN/.
The function returns a vector of predicted labels.
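The prediction rule can be illustrated with a brute-force pure-R kNN (no ANN tree structures; the function name knn_simple is hypothetical, and this is a sketch rather than the package's implementation):

```r
# Brute-force k-nearest-neighbour prediction: for each test row,
# take the majority label among the k closest training rows.
knn_simple <- function(Xtrain, Ytrain, Xtest, k = 5) {
  apply(Xtest, 1, function(row) {
    d <- sqrt(colSums((t(Xtrain) - row)^2))   # Euclidean distances
    nn <- order(d)[1:k]                       # indices of k nearest
    names(which.max(table(Ytrain[nn])))       # majority vote
  })
}

data(iris)
set.seed(3)
ss <- sample(150, 15)                         # hold out 15 samples
pred <- knn_simple(as.matrix(iris[-ss, -5]), iris$Species[-ss],
                   as.matrix(iris[ss, -5]), k = 5)
acc <- mean(pred == iris$Species[ss])
```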
Stefano Cacciatore and Leonardo Tenori
Bentley JL (1975)
Multidimensional binary search trees used for associative searching.
Communications of the ACM 1975;18(9):509-517.
Arya S, Mount DM
Approximate nearest neighbor searching
Proc. 4th Ann. ACM-SIAM Symposium on Discrete Algorithms (SODA'93);271-280.
Arya S, Mount DM, Netanyahu NS, Silverman R, Wu AY
An optimal algorithm for approximate nearest neighbor searching
Journal of the ACM 1998;45:891-923.
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
KODAMA.matrix
,KODAMA.visualization
data(iris)
data=iris[,-5]
labels=iris[,5]
ss=sample(150,15)
z=knn.kodama(data[-ss,], labels[-ss], data[ss,], k=5)
table(z$Ypred[,5],labels[ss])
KODAMA (KnOwledge Discovery by Accuracy MAximization) is an unsupervised and semi-supervised learning algorithm that performs feature extraction from noisy and high-dimensional data.
KODAMA.matrix(data, M=100, Tcycle=20,
  FUN_VAR=function(x) { ceiling(ncol(x)) },
  FUN_SAM=function(x) { ceiling(nrow(x) * 0.75) },
  bagging=FALSE, FUN=c("PLS-DA","KNN"), f.par=5,
  W=NULL, constrain=NULL, fix=NULL, epsilon=0.05,
  dims=2, landmarks=1000,
  neighbors=min(c(landmarks,nrow(data)))-1)
data |
a matrix. |
M |
number of iterative processes (step I-III). |
Tcycle |
number of iterative cycles that leads to the maximization of cross-validated accuracy. |
FUN_VAR |
function to choose the number of variables to select randomly. By default, all variables are taken. |
FUN_SAM |
function to choose the number of samples to select randomly. By default, 75 per cent of all samples are taken. |
bagging |
whether sampling should be with replacement (by default = FALSE). |
FUN |
the classifier to be considered. Choices are "PLS-DA" and "KNN". |
f.par |
parameters of the classifier. |
W |
a vector of nrow(data) elements used to initialize the classification vector. Without a priori information, each element can be set to a different value (by default = NULL). |
constrain |
a vector of nrow(data) elements. Samples with the same value of the constrain vector are forced to stay in the same cluster. |
fix |
a vector of nrow(data) logical elements. Samples marked TRUE keep their classification fixed during the iterative process. |
epsilon |
cut-off value for low proximity. High proximities are typical of intracluster relationships, whereas low proximities are expected for intercluster relationships. Very low proximities between samples are ignored by setting epsilon = 0.05 (default). |
dims |
dimensions of the configurations of t-SNE based on the KODAMA dissimilarity matrix. |
landmarks |
number of landmarks to use. |
neighbors |
number of neighbors to include in the dissimilarity matrix to pass to the KODAMA.visualization function. |
KODAMA consists of five steps. These can in turn be divided into two parts: (i) the maximization of cross-validated accuracy by an iterative process (steps I and II), resulting in the construction of a proximity matrix (step III), and (ii) the definition of a dissimilarity matrix (steps IV and V). The first part entails the core idea of KODAMA, that is, the partitioning of data guided by the maximization of the cross-validated accuracy. At the beginning of this part, a fraction of the total samples (defined by FUN_SAM) is randomly selected from the original data. The whole iterative process (steps I-III) is repeated M times to average out the effects owing to the randomness of the iterative procedure. Each time this part is repeated, a different fraction of samples is selected. The second part aims at collecting and processing these results by constructing a dissimilarity matrix that provides a holistic view of the data while maintaining their intrinsic structure (steps IV and V). The KODAMA.visualization function is then used to visualize the results of the KODAMA dissimilarity matrix.
The function returns a list with the following items:
dissimilarity |
a dissimilarity matrix. |
acc |
a vector with the cross-validated accuracy obtained in each of the M iterations. |
proximity |
a proximity matrix. |
v |
a matrix containing all the classifications obtained by maximizing the cross-validation accuracy. |
res |
a matrix containing all classification vectors obtained through maximizing the cross-validation accuracy. |
f.par |
parameters of the classifier. |
entropy |
Shannon's entropy of the KODAMA proximity matrix. |
landpoints |
indexes of the landmarks used. |
data |
original data. |
knn_Armadillo |
dissimilarity matrix used as input for the KODAMA.visualization function. |
Stefano Cacciatore and Leonardo Tenori
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
L.J.P. van der Maaten and G.E. Hinton.
Visualizing High-Dimensional Data Using t-SNE.
Journal of Machine Learning Research 9 (Nov) : 2579-2605, 2008.
L.J.P. van der Maaten.
Learning a Parametric Embedding by Preserving Local Structure.
In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 5:384-391, 2009.
McInnes L, Healy J, Melville J.
Umap: Uniform manifold approximation and projection for dimension reduction.
arXiv preprint:1802.03426. 2018 Feb 9.
data(iris)
data=iris[,-5]
labels=iris[,5]
kk=KODAMA.matrix(data,FUN="KNN",f.par=2)
cc=KODAMA.visualization(kk,"t-SNE")
plot(cc,col=as.numeric(labels),cex=2)
Provides a simple function to transform the KODAMA dissimilarity matrix into a low-dimensional space.
KODAMA.visualization(kk, method=c("t-SNE","MDS","UMAP"), perplexity=min(30,floor((kk$knn_Armadillo$neighbors+1)/3)-1), ...)
kk |
output of the KODAMA.matrix function. |
method |
method to be considered for transforming the dissimilarity matrix into a low-dimensional space. Choices are "t-SNE", "MDS", and "UMAP". |
perplexity |
the perplexity parameter (optimal number of neighbors) for the "t-SNE" method. |
... |
further parameters to be passed to the chosen method. |
The function returns a matrix containing the coordinates of the data points in a low-dimensional space.
Stefano Cacciatore and Leonardo Tenori
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
L.J.P. van der Maaten and G.E. Hinton.
Visualizing High-Dimensional Data Using t-SNE.
Journal of Machine Learning Research 9 (Nov) : 2579-2605, 2008.
L.J.P. van der Maaten.
Learning a Parametric Embedding by Preserving Local Structure.
In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 5:384-391, 2009.
McInnes L, Healy J, Melville J.
Umap: Uniform manifold approximation and projection for dimension reduction.
arXiv preprint:1802.03426. 2018 Feb 9.
data(iris)
data=iris[,-5]
labels=iris[,5]
kk=KODAMA.matrix(data,FUN="KNN",f.par=2)
cc=KODAMA.visualization(kk,"t-SNE")
plot(cc,col=as.numeric(labels),cex=2)
This function can be used to extract the variable ranking when KODAMA is performed with the PLS-DA classifier.
loads(model, method=c("loadings","kruskal.test"))
model |
output of KODAMA. |
method |
method to be used. Choices are "loadings" and "kruskal.test". |
The function returns a vector of values indicating the "importance" of each variable. If method="loadings", the average of the loadings of the first component of the PLS models based on the cross-validated accuracy maximized vectors is computed. If method="kruskal.test", the average of the minus logarithm of the p-values of the Kruskal-Wallis rank sum test is computed.
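The "kruskal.test" ranking can be illustrated directly on a data matrix: for each variable, take minus the logarithm of the Kruskal-Wallis p-value against a grouping vector (a sketch of the idea; the package additionally averages over the KODAMA classification vectors):

```r
# Rank variables by -log(p) of a Kruskal-Wallis test against a grouping:
# larger values indicate stronger association with the groups.
data(iris)
X <- as.matrix(iris[, -5])
g <- iris$Species
importance <- apply(X, 2, function(v) -log(kruskal.test(v, g)$p.value))
ranked <- names(sort(importance, decreasing = TRUE))
```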
Stefano Cacciatore
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
KODAMA.matrix
,KODAMA.visualization
data(iris)
data=iris[,-5]
labels=iris[,5]
kk=KODAMA.matrix(data,FUN="PLS-DA")
loads(kk)
This dataset consists of gene expression profiles of the three most prevalent adult lymphoid malignancies: diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL), and B-cell chronic lymphocytic leukemia (B-CLL). The dataset consists of 4,682 mRNA genes for 62 samples (42 samples of DLBCL, 9 samples of FL, and 11 samples of B-CLL). Missing values are imputed and the data are standardized as described in Dudoit et al. (2002).
data(lymphoma)
A list with the following elements:
data |
Gene expression data. A matrix with 62 rows and 4,682 columns. |
class |
Class index. A vector with 62 elements. |
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
Alizadeh AA, Eisen MB, Davis RE, et al.
Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.
Nature 2000;403(6769):503-511.
Dudoit S, Fridlyand J, Speed TP
Comparison of discrimination methods for the classification of tumors using gene expression data.
J Am Stat Assoc 2002;97(417):77-87.
data(lymphoma)
class=1+as.numeric(lymphoma$class)
cc=pca(lymphoma$data)$x
plot(cc,pch=21,bg=class)

kk=KODAMA.matrix(lymphoma$data)
cc=KODAMA.visualization(kk,"t-SNE")
plot(cc,pch=21,bg=class)
This function can be used to plot the accuracy values obtained during KODAMA procedure.
mcplot(model)
model |
output of KODAMA. |
No return value.
Stefano Cacciatore
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
KODAMA.matrix
,KODAMA.visualization
data=as.matrix(iris[,-5])
kk=KODAMA.matrix(data)
mcplot(kk)
The data belong to a cohort of 22 healthy donors (11 male and 11 female), each of whom provided about 40 urine samples over the time course of approximately 2 months, for a total of 873 samples. Each sample was analysed by nuclear magnetic resonance spectroscopy. Each spectrum was divided into 450 spectral bins.
data(MetRef)
A list with the following elements:
data |
Metabolomic data. A matrix with 873 rows and 450 columns. |
gender |
Gender index. A vector with 873 elements. |
donor |
Donor index. A vector with 873 elements. |
Assfalg M, Bertini I, Colangiuli D, et al.
Evidence of different metabolic phenotypes in humans.
Proc Natl Acad Sci U S A 2008;105(5):1420-4. doi: 10.1073/pnas.0705685105.
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705. Link
data(MetRef) u=MetRef$data; u=u[,-which(colSums(u)==0)] u=normalization(u)$newXtrain u=scaling(u)$newXtrain class=as.numeric(as.factor(MetRef$gender)) cc= pca(u)$x plot(cc,pch=21,bg=class) class=as.numeric(as.factor(MetRef$donor)) plot(cc,pch=21,bg=rainbow(22)[class]) kk=KODAMA.matrix(u) cc=KODAMA.visualization(kk,"t-SNE") plot(cc,pch=21,bg=rainbow(22)[class])
Summarization of the continuous information.
multi_analysis (data, y, FUN=c("continuous.test","correlation.test"), ...)
data |
the matrix containing the continuous values. Each row corresponds to a different sample. Each column corresponds to a different variable. |
y |
the classification of the cohort. |
FUN |
function to be considered. Choices are "continuous.test" and "correlation.test". |
... |
further arguments to be passed to or from methods. |
The function returns a table with the summarized information. If the number of groups is equal to two, the p-value is computed using the Wilcoxon rank-sum test; otherwise, the Kruskal-Wallis test is used.
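The test-selection rule described above can be reproduced with base R functions. The sketch below is an illustration of that rule, not the package's internal code; the helper name `group_pvalue` is hypothetical.

```r
# Illustrative sketch (not the package source): choose the test by the
# number of groups, as described in the Value section.
group_pvalue <- function(x, y) {
  y <- as.factor(y)
  if (nlevels(y) == 2) {
    wilcox.test(x ~ y)$p.value   # two groups: Wilcoxon rank-sum test
  } else {
    kruskal.test(x ~ y)$p.value  # three or more groups: Kruskal-Wallis test
  }
}

set.seed(1)
x  <- rnorm(60)
g2 <- rep(c("A", "B"), 30)
g3 <- rep(c("A", "B", "C"), 20)
p2 <- group_pvalue(x, g2)   # Wilcoxon branch
p3 <- group_pvalue(x, g3)   # Kruskal-Wallis branch
```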
Stefano Cacciatore
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111. Link
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705. Link
categorical.test, continuous.test, correlation.test, txtsummary
data(clinical)
multi_analysis(clinical[,c("BMI","Age")],clinical[,"Hospital"],FUN="continuous.test")
Collection of Different Normalization Methods.
normalization(Xtrain,Xtest=NULL, method = "pqn",ref=NULL)
Xtrain |
a matrix of data (training data set). |
Xtest |
a matrix of data (test data set) (by default = NULL). |
method |
the normalization method to be used. Choices are "none", "pqn", "sum", "median" and "sqrt" (by default = "pqn"). |
ref |
Reference sample for Probabilistic Quotient Normalization. (by default = NULL). |
A number of different normalization methods are provided:
"none
": no normalization method is applied.
"pqn
": the Probabilistic Quotient Normalization is computed as described in Dieterle, et al. (2006).
"sum
": samples are normalized to the sum of the absolute value of all variables for a given sample.
"median
": samples are normalized to the median value of all variables for a given sample.
"sqrt
": samples are normalized to the root of the sum of the squared value of all variables for a given sample.
The function returns a list with 2 items or 4 items (if a test data set is present):
newXtrain |
a normalized matrix (training data set). |
coeXtrain |
a vector of the normalization coefficients of the training data set. |
newXtest |
a normalized matrix (test data set). |
coeXtest |
a vector of the normalization coefficients of the test data set. |
Stefano Cacciatore and Leonardo Tenori
Dieterle F,Ross A, Schlotterbeck G, Senn H.
Probabilistic Quotient Normalization as Robust Method to Account for Dilution of Complex Biological Mixtures. Application in 1H NMR Metabolomics.
Anal Chem 2006;78:4281-90.
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111. Link
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705. Link
data(MetRef)
u=MetRef$data
u=u[,-which(colSums(u)==0)]
u=normalization(u)$newXtrain
u=scaling(u)$newXtrain
class=as.numeric(as.factor(MetRef$gender))
cc=pca(u)
plot(cc$x,pch=21,bg=class)
Performs a principal components analysis on the given data matrix and returns the results as an object of class "prcomp
".
pca(x, ...)
x |
a matrix of data. |
... |
arguments passed to prcomp. |
The function returns a list with class prcomp
containing the following components:
sdev |
the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the covariance/correlation matrix, though the calculation is actually done with the singular values of the data matrix). |
rotation |
the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). The function princomp returns this in the element loadings. |
x |
if retx is true, the value of the rotated data (the centred, and scaled if requested, data multiplied by the rotation matrix). |
center , scale
|
the centering and scaling used, or FALSE. |
txt |
the percentage of variance explained by each principal component. |
Stefano Cacciatore
Pearson, K
On Lines and Planes of Closest Fit to Systems of Points in Space.
Philosophical Magazine 1901;2 (11): 559-572. doi:10.1080/14786440109462720. Link
data(MetRef)
u=MetRef$data
u=u[,-which(colSums(u)==0)]
u=normalization(u)$newXtrain
u=scaling(u)$newXtrain
class=as.numeric(as.factor(MetRef$gender))
cc=pca(u)
plot(cc$x,pch=21,bg=class)
This function performs a double 10-fold cross-validation on a given data set using a Partial Least Squares (PLS) model. To assess the prediction ability of the model, a 10-fold cross-validation is conducted by generating splits with a ratio 1:9 of the data set, that is, by removing 10% of samples prior to any step of the statistical analysis, including PLS component selection and scaling. The best number of PLS components is selected by means of a 10-fold cross-validation on the remaining 90%, choosing the number of components that yields the best Q2y value. Permutation testing can be undertaken to estimate the classification/regression performance of predictors.
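The double cross-validation scheme described above can be sketched as an outer/inner loop skeleton. This is a simplified illustration of the splitting logic only; fold assignment is randomized and the inner model fitting is left as a comment.

```r
# Skeleton of the double cross-validation described above (illustrative;
# the package's actual fold construction may differ).
set.seed(42)
n <- nrow(iris)
outer_fold <- sample(rep(1:10, length.out = n))  # ~10% held out per outer fold

for (k in 1:10) {
  test_idx  <- which(outer_fold == k)
  train_idx <- setdiff(seq_len(n), test_idx)
  # Inner 10-fold CV on the 90% training portion would select the number of
  # PLS components maximizing Q2y; the model is then refit on the full
  # training portion and used to predict the held-out 10%.
}
```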
pls.double.cv(Xdata, Ydata, constrain=1:nrow(Xdata),
              compmax=min(5,c(ncol(Xdata),nrow(Xdata))),
              perm.test=FALSE, optim=TRUE,
              scaling = c("centering","autoscaling"),
              times=100, runn=10)
Xdata |
a matrix. |
Ydata |
the responses. If Ydata is a numeric vector, a regression analysis will be performed. If Ydata is factor, a classification analysis will be performed. |
constrain |
a vector of nrow(Xdata) elements. Samples with the same identifier are kept together when the data set is split into cross-validation folds. |
compmax |
the number of latent components to be used for classification. |
perm.test |
a logical value indicating whether a permutation test should be performed (by default = FALSE). |
optim |
a logical value indicating whether the number of components should be optimized (by default = TRUE). |
scaling |
the scaling method to be used. Choices are "centering" and "autoscaling" (by default = "centering"). |
times |
the number of cross-validations with permuted samples. |
runn |
the number of cross-validation loops. |
A list with the following components:
B |
the (p x m x length(ncomp)) array containing the regression coefficients. Each row corresponds to a predictor variable and each column to a response variable. The third dimension of the matrix B corresponds to the number of PLS components used to compute the regression coefficients. If ncomp has length 1, B is just a (p x m) matrix. |
Ypred |
the vector containing the predicted values of the response variables obtained by cross-validation. |
Yfit |
the vector containing the fitted values of the response variables. |
P |
the (p x max(ncomp)) matrix containing the X-loadings. |
Q |
the (m x max(ncomp)) matrix containing the Y-loadings. |
T |
the (ntrain x max(ncomp)) matrix containing the X-scores (latent components). |
R |
the (p x max(ncomp)) matrix containing the weights used to construct the latent components. |
Q2Y |
Q2y value. |
R2Y |
R2y value. |
R2X |
the vector containing the explained variance of X by each PLS component. |
txtQ2Y |
a summary of the Q2y values. |
txtR2Y |
a summary of the R2y values. |
Stefano Cacciatore
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111. Link
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705. Link
data(iris)
data=iris[,-5]
labels=iris[,5]
pp=pls.double.cv(data,labels)
print(pp$Q2Y)
table(pp$Ypred,labels)

data(MetRef)
u=MetRef$data
u=u[,-which(colSums(u)==0)]
u=normalization(u)$newXtrain
u=scaling(u)$newXtrain
pp=pls.double.cv(u,as.factor(MetRef$donor))
print(pp$Q2Y)
table(pp$Ypred,MetRef$donor)
Partial Least Squares (PLS) regression for test set from training set.
pls.kodama(Xtrain, Ytrain, Xtest, Ytest = NULL, ncomp, scaling = c("centering","autoscaling"), perm.test=FALSE, times=1000)
Xtrain |
a matrix of training set cases. |
Ytrain |
a classification vector. |
Xtest |
a matrix of test set cases. |
Ytest |
a classification vector (by default = NULL). |
ncomp |
the number of components to consider. |
scaling |
the scaling method to be used. Choices are "centering" and "autoscaling" (by default = "centering"). |
perm.test |
a logical value indicating whether a permutation test should be performed (by default = FALSE). |
times |
the number of permutations for the permutation test (by default = 1000). |
A list with the following components:
B |
the (p x m x length(ncomp)) array containing the regression coefficients. Each row corresponds to a predictor variable and each column to a response variable. The third dimension of B corresponds to the number of PLS components used to compute the regression coefficients. If ncomp has length 1, B is just a (p x m) matrix. |
Ypred |
the (ntest x m x length(ncomp)) array containing the predicted values of the response variables for the observations from Xtest. The third dimension of Ypred corresponds to the number of PLS components used to compute the regression coefficients. |
P |
the (p x max(ncomp)) matrix containing the X-loadings. |
Q |
the (m x max(ncomp)) matrix containing the Y-loadings. |
T |
the (ntrain x max(ncomp)) matrix containing the X-scores (latent components). |
R |
the (p x max(ncomp)) matrix containing the weights used to construct the latent components. |
Stefano Cacciatore
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111. Link
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705. Link
KODAMA.matrix, KODAMA.visualization
data(iris)
data=iris[,-5]
labels=iris[,5]
ss=sample(150,15)
ncomponent=3
z=pls.kodama(data[-ss,], labels[-ss], data[ss,], ncomp=ncomponent)
table(z$Ypred[,ncomponent],labels[ss])
Collection of Different Scaling Methods.
scaling(Xtrain,Xtest=NULL, method = "autoscaling")
Xtrain |
a matrix of data (training data set). |
Xtest |
a matrix of data (test data set) (by default = NULL). |
method |
the scaling method to be used. Choices are "none", "centering", "autoscaling", "rangescaling" and "paretoscaling" (by default = "autoscaling"). |
A number of different scaling methods are provided:
"none
": no scaling method is applied.
"centering
": centers the mean to zero.
"autoscaling
": centers the mean to zero and scales data by dividing each variable by the variance.
"rangescaling
": centers the mean to zero and scales data by dividing each variable by the difference between the minimum and the maximum value.
"paretoscaling
": centers the mean to zero and scales data by dividing each variable by the square root of the standard deviation. Unit scaling divides each variable by the standard deviation so that each variance equal to 1.
The function returns a list with 1 item or 2 items (if a test data set is present):
newXtrain |
a scaled matrix (training data set). |
newXtest |
a scaled matrix (test data set). |
Stefano Cacciatore and Leonardo Tenori
van den Berg RA, Hoefsloot HCJ, Westerhuis JA, et al.
Centering, scaling, and transformations: improving the biological information content of metabolomics data.
BMC Genomics 2006;7(1):142.
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111. Link
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705. Link
data(MetRef)
u=MetRef$data
u=u[,-which(colSums(u)==0)]
u=normalization(u)$newXtrain
u=scaling(u)$newXtrain
class=as.numeric(as.factor(MetRef$gender))
cc=pca(u)
plot(cc$x,pch=21,bg=class,xlab=cc$txt[1],ylab=cc$txt[2])
Produces a data set of spiral clusters.
spirals(n=c(100,100,100),sd=c(0,0,0))
n |
a vector of integers. The length of the vector is the number of clusters, and each number corresponds to the number of data points in each cluster. |
sd |
amount of noise for each spiral. |
The function returns a two-dimensional data set.
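A hypothetical parametrization of one noisy spiral may clarify what the generator produces; the exact constants and rotation scheme used by the package are not specified here, so this is only a sketch.

```r
# Hypothetical sketch of one noisy 2-D spiral (the package's exact
# parametrization may differ): radius grows with the angle.
one_spiral <- function(n, sd = 0, rot = 0) {
  t <- seq(0, 3 * pi, length.out = n)
  cbind(t * cos(t + rot), t * sin(t + rot)) +
    matrix(rnorm(2 * n, sd = sd), n, 2)       # Gaussian noise
}

set.seed(3)
# Two spirals offset by a rotation, analogous to spirals(c(100,100), c(0.1,0.1))
v <- rbind(one_spiral(100, sd = 0.1, rot = 0),
           one_spiral(100, sd = 0.1, rot = 2 * pi / 3))
```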
Stefano Cacciatore and Leonardo Tenori
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111. Link
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705. Link
helicoid, dinisurface, swissroll
v1=spirals(c(100,100,100),c(0.1,0.1,0.1))
plot(v1,col=rep(2:4,each=100))
v2=spirals(c(100,100,100),c(0.1,0.2,0.3))
plot(v2,col=rep(2:4,each=100))
v3=spirals(c(100,100,100,100,100),c(0,0,0.2,0,0))
plot(v3,col=rep(2:6,each=100))
v4=spirals(c(20,40,60,80,100),c(0.1,0.1,0.1,0.1,0.1))
plot(v4,col=rep(2:6,c(20,40,60,80,100)))
Computes the Swiss Roll data set of a given number of data points.
swissroll(N=1000)
N |
Number of data points. |
The function returns a matrix of three-dimensional data points (one row per point).
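The Swiss Roll construction can be sketched by rolling a 2-D sheet into 3-D. The constants below are assumptions for illustration; the package may use a different parametrization.

```r
# Hypothetical Swiss Roll sketch (constants are illustrative, not the
# package's): points on a 2-D sheet (t, h) rolled around one axis.
swissroll_sketch <- function(N = 1000) {
  t <- runif(N, 1.5 * pi, 4.5 * pi)   # angle, also controls the radius
  h <- runif(N, 0, 10)                # height along the roll axis
  cbind(t * cos(t), h, t * sin(t))    # N x 3 matrix
}

set.seed(5)
x <- swissroll_sketch(500)
```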
Stefano Cacciatore and Leonardo Tenori
Balasubramanian M, Schwartz EL
The isomap algorithm and topological stability.
Science 2002;295(5552):7.
Roweis ST, Saul LK
Nonlinear dimensionality reduction by locally linear embedding.
Science 2000;290(5500):2323-6.
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111. Link
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705. Link
require("rgl") x=swissroll() open3d() plot3d(x, col=rainbow(1000),box=FALSE,size=3)
require("rgl") x=swissroll() open3d() plot3d(x, col=rainbow(1000),box=FALSE,size=3)
This function converts a classification vector into a classification matrix.
transformy(y)
y |
a vector or factor. |
This function converts a classification vector into a classification matrix with one column per class: an entry is 1 if the sample belongs to that class and 0 otherwise.
A matrix.
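The conversion can be reproduced with base R as a sanity check. This is a sketch of the indicator-matrix idea, not the package's implementation.

```r
# Sketch (not the package source): build a 0/1 indicator matrix with one
# column per class, mirroring what transformy() is documented to return.
y <- rep(1:3, 2)
z <- outer(y, sort(unique(y)), "==") * 1   # 6 x 3 matrix of 0/1
```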
Stefano Cacciatore and Leonardo Tenori
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111. Link
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705. Link
y=rep(1:10,3)
print(y)
z=transformy(y)
print(z)
Summarization of a numeric vector.
txtsummary (x,digits=0,scientific=FALSE,range=c("IQR","95%CI"))
x |
a numeric vector. |
digits |
how many significant digits are to be used. |
scientific |
a logical specifying whether the result should be encoded in scientific format. |
range |
the range to be visualized. |
The function returns the median and the range (interquartile range or 95% confidence interval) of a numeric vector.
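The "median [IQR]" style of summary can be sketched with base R quantiles. The formatting conventions below (brackets, separator) are assumptions, not the package's exact output.

```r
# Sketch of a "median [Q1-Q3]" summary (formatting assumptions are mine,
# not the package's exact output).
txt_sketch <- function(x, digits = 0) {
  q <- round(quantile(x, c(0.5, 0.25, 0.75)), digits)
  sprintf("%s [%s-%s]", q[1], q[2], q[3])
}

s <- txt_sketch(c(1, 2, 3, 4, 100))   # robust to the outlier 100
```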
Stefano Cacciatore
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111. Link
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705. Link
categorical.test, continuous.test, correlation.test, txtsummary
data(clinical)
txtsummary(clinical[,"BMI"])
This dataset consists of the spoken, not written, State of the Union addresses from 1900 until the sixth address by Barack Obama in 2014. Punctuation characters, numbers, words shorter than three characters, and stop-words (e.g., "that", "and", and "which") were removed from the dataset. This resulted in a dataset of 86 speeches represented by 834 distinct meaningful words. Term frequency-inverse document frequency (TF-IDF) was used to obtain feature vectors. TF-IDF is often used as a weighting factor in information retrieval and text mining: the value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
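A minimal TF-IDF computation on a toy term-count matrix illustrates the weighting described above. The exact TF-IDF variant used to build this dataset is not specified, so the formulas below (relative term frequency, log inverse document frequency) are one common choice.

```r
# Toy TF-IDF (one common variant; the dataset's exact weighting may differ).
counts <- rbind(doc1 = c(state = 2, war = 1, peace = 0),
                doc2 = c(state = 2, war = 0, peace = 3))

tf    <- counts / rowSums(counts)              # term frequency per document
idf   <- log(nrow(counts) / colSums(counts > 0))  # inverse document frequency
tfidf <- sweep(tf, 2, idf, "*")

# "state" appears in every document, so its idf (and tf-idf) is zero:
# common words are down-weighted, as the description above explains.
```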
data(USA)
A list with the following elements:
data |
TF-IDF data. A matrix with 86 rows and 834 columns. |
year |
Year index. A vector with 86 elements. |
president |
President index. A vector with 86 elements. |
Stefano Cacciatore and Leonardo Tenori
Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111. Link
Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705. Link
# Here is reported the analysis on the State of the Union
# of USA president as shown in Cacciatore, et al. (2014)
data(USA)
kk=KODAMA.matrix(USA$data,FUN="KNN")
cc=KODAMA.visualization(kk,"t-SNE",perplexity = 10)
oldpar <- par(cex=0.5,mar=c(15,6,2,2))
plot(USA$year,cc[,1],axes=FALSE,pch=20,xlab="",ylab="First Component")
axis(1,at=USA$year,labels=rownames(USA$data),las=2)
axis(2,las=2)
box()
par(oldpar)