Title: | Omics Data Integration Using Kernel Methods |
---|---|
Description: | Kernel-based methods are powerful methods for integrating heterogeneous types of data. mixKernel aims at providing methods to combine kernel for unsupervised exploratory analysis. Different solutions are provided to compute a meta-kernel, in a consensus way or in a way that best preserves the original topology of the data. mixKernel also integrates kernel PCA to visualize similarities between samples in a non linear space and from the multiple source point of view <doi:10.1093/bioinformatics/btx682>. A method to select (as well as funtions to display) important variables is also provided <doi:10.1093/nargab/lqac014>. |
Authors: | Nathalie Vialaneix [aut, cre], Celine Brouard [aut], Remi Flamary [aut], Julien Henry [aut], Jerome Mariette [aut] |
Maintainer: | Nathalie Vialaneix <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.9-1 |
Built: | 2024-12-23 06:49:16 UTC |
Source: | CRAN |
Center and scale a dataset.
center.scale(X)
center.scale(X)
X |
a numeric matrix (or data frame) to center and scaled.
|
center.scale
returns a centered and scaled matrix.
Celine Brouard <[email protected]> Jerome Mariette <[email protected]> Nathalie Vialaneix <[email protected]>
compute.kernel
, combine.kernels
data("nutrimouse") ## Not run: nutrimouse.sc <- center.scale(nutrimouse$gene) ## End(Not run)
data("nutrimouse") ## Not run: nutrimouse.sc <- center.scale(nutrimouse$gene) ## End(Not run)
Compute cosine from Frobenius norm between kernels and display the corresponding correlation plot.
cim.kernel( ..., scale = TRUE, method = c("circle", "square", "number", "shade", "color", "pie") )
cim.kernel( ..., scale = TRUE, method = c("circle", "square", "number", "shade", "color", "pie") )
... |
list of kernels (called 'blocks') computed on different datasets and measured on the same samples. |
scale |
boleean. If |
method |
character. The visualization method to be used. Currently, seven methods are supported (see Details). |
The displayed similarities are the kernel generalization of the RV-coefficient described in Lavit et al., 1994.
The plot is displayed using the corrplot
package.
Seven visualization methods are implemented: "circle"
(default),
"square"
, "number"
, "pie"
, "shade"
and
"color"
. Circle and square areas are proportional to the absolute
value of corresponding similarities coefficients.
cim.kernel
returns a matrix containing the cosine from
Frobenius norm between kernels.
Jerome Mariette <[email protected]> Nathalie Vialaneix <[email protected]>
Lavit C., Escoufier Y., Sabatier R. and Traissac P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18(1), 97-119.
Mariette J. and Villa-Vialaneix N. (2018). Unsupervised multiple kernel learning for heterogeneous data integration. Bioinformatics, 34(6), 1009-1015.
data(TARAoceans) # compute one kernel per dataset phychem.kernel <- compute.kernel(TARAoceans$phychem, kernel.func = "linear") pro.phylo.kernel <- compute.kernel(TARAoceans$pro.phylo, kernel.func = "abundance") pro.NOGs.kernel <- compute.kernel(TARAoceans$pro.NOGs, kernel.func = "abundance") # display similarities between kernels cim.kernel(phychem = phychem.kernel, pro.phylo = pro.phylo.kernel, pro.NOGs = pro.NOGs.kernel, method = "square")
data(TARAoceans) # compute one kernel per dataset phychem.kernel <- compute.kernel(TARAoceans$phychem, kernel.func = "linear") pro.phylo.kernel <- compute.kernel(TARAoceans$pro.phylo, kernel.func = "abundance") pro.NOGs.kernel <- compute.kernel(TARAoceans$pro.NOGs, kernel.func = "abundance") # display similarities between kernels cim.kernel(phychem = phychem.kernel, pro.phylo = pro.phylo.kernel, pro.NOGs = pro.NOGs.kernel, method = "square")
Compute multiple kernels into a single meta-kernel
combine.kernels( ..., scale = TRUE, method = c("full-UMKL", "STATIS-UMKL", "sparse-UMKL"), knn = 5, rho = 20 )
combine.kernels( ..., scale = TRUE, method = c("full-UMKL", "STATIS-UMKL", "sparse-UMKL"), knn = 5, rho = 20 )
... |
list of kernels (called 'blocks') computed on different datasets and measured on the same samples. |
scale |
boleean. If |
method |
character. Which method should be used to compute the
meta-kernel. Default: |
knn |
integer. If |
rho |
integer. Parameters for the augmented Lagrangian method. Default:
|
The arguments method
allows to specify the Unsupervised Multiple
Kernel Learning (UMKL) method to use:
"STATIS-UMKL"
: combines input kernels into the best
consensus of all kernels;
"full-UMKL"
: computes a kernel that minimizes the distortion
between the meta-kernel and the k-NN graphs obtained from all input
kernels;
"sparse-UMKL"
: a sparse variant of the "full-UMKL"
approach.
combine.kernels
returns an object of classes "kernel"
and "metaKernel"
, a list that contains the following components:
kernel |
: the computed meta-kernel matrix; |
X |
: the dataset from which the kernel has been computed, as given by
the function |
weights |
: a vector containing the weights used to combine the kernels. |
Jerome Mariette <[email protected]> Nathalie Vialaneix <[email protected]>
Mariette J. and Villa-Vialaneix N. (2018). Unsupervised multiple kernel learning for heterogeneous data integration . Bioinformatics, 34(6), 1009-1015. DOI: doi:10.1093/bioinformatics/btx682.
data(TARAoceans) # compute one kernel per dataset phychem.kernel <- compute.kernel(TARAoceans$phychem, kernel.func = "linear") pro.phylo.kernel <- compute.kernel(TARAoceans$pro.phylo, kernel.func = "abundance") pro.NOGs.kernel <- compute.kernel(TARAoceans$pro.NOGs, kernel.func = "abundance") # compute the meta kernel meta.kernel <- combine.kernels(phychem = phychem.kernel, pro.phylo = pro.phylo.kernel, pro.NOGs = pro.NOGs.kernel, method = "full-UMKL")
data(TARAoceans) # compute one kernel per dataset phychem.kernel <- compute.kernel(TARAoceans$phychem, kernel.func = "linear") pro.phylo.kernel <- compute.kernel(TARAoceans$pro.phylo, kernel.func = "abundance") pro.NOGs.kernel <- compute.kernel(TARAoceans$pro.NOGs, kernel.func = "abundance") # compute the meta kernel meta.kernel <- combine.kernels(phychem = phychem.kernel, pro.phylo = pro.phylo.kernel, pro.NOGs = pro.NOGs.kernel, method = "full-UMKL")
Compute a kernel from a given data matrix.
compute.kernel(X, kernel.func = "linear", ..., test.pos.semidef = FALSE)
compute.kernel(X, kernel.func = "linear", ..., test.pos.semidef = FALSE)
X |
a numeric matrix (or data frame) used to compute the kernel.
|
kernel.func |
the kernel function to use. This parameter can be set to
any user defined kernel function. Widely used kernel functions are
pre-implemented, that can be used by setting |
... |
the kernel function arguments. Valid parameters for pre-implemented kernels are:
|
test.pos.semidef |
boleean. If |
compute.kernel
returns an object of classes "kernel"
, a
list that contains the following components:
kernel |
: the computed kernel matrix. |
X |
: the original dataset. If |
kernel.func |
: the kernel function used. |
kernel.args |
: the arguments used to compute the kernel. |
Jerome Mariette <[email protected]> Nathalie Vialaneix <[email protected]>
Lozupone C. and Knight R. (2005). UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology, 71(12), 8228-8235.
Lozupone C., Hamady M., Kelley S.T. and Knight R. (2007). Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology, 73(5), 1576-1585.
Witten D. (2011). Classification and clustering of sequencing data using a Poisson model. Annals of Applied Statistics, 5(4), 2493-2518.
data(TARAoceans) pro.NOGs.kernel <- compute.kernel(TARAoceans$pro.NOGs, kernel.func = "abundance")
data(TARAoceans) pro.NOGs.kernel <- compute.kernel(TARAoceans$pro.NOGs, kernel.func = "abundance")
Performs a kernel PCA.
kernel.pca(K, ncomp = nrow(K$kernel))
kernel.pca(K, ncomp = nrow(K$kernel))
K |
a kernel object obtained using either |
ncomp |
integer. Indicates the number of components to return.. |
kernel.pca
returns an object of classes "kernel.pca"
and "pca"
, which is a list containing the following entries:
ncomp |
: the number of principal components; |
X |
: the input kernel matrix; |
kernel |
: the input kernel object provided by the user; |
sdev |
: the singular values (square root of the eigenvalues); |
rotation |
: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors); |
loadings |
: same as 'rotation' to keep the mixOmics spirit; |
x |
: same as 'rotation' to keep the mixOmics spirit; |
Jerome Mariette <[email protected]> Nathalie Vialaneix <[email protected]>
Scholkopf B., Smola A. and Muller K.R. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319.
compute.kernel
, combine.kernels
data(TARAoceans) phychem.kernel <- compute.kernel(TARAoceans$phychem, kernel.func = "linear") kernel.pca.result <- kernel.pca(phychem.kernel, ncomp = 3)
data(TARAoceans) phychem.kernel <- compute.kernel(TARAoceans$phychem, kernel.func = "linear") kernel.pca.result <- kernel.pca(phychem.kernel, ncomp = 3)
Assess importance of variables on a given PC component by computing the Crone-Crosby distance between original sample positions and sample positions obtained by a random permutation of the variables.
kernel.pca.permute(kpca.result, ncomp = 1, ..., directory = NULL)
kernel.pca.permute(kpca.result, ncomp = 1, ..., directory = NULL)
kpca.result |
a kernel.pca object returned by the
|
ncomp |
integer. Number of KPCA components used to compute the
importance. Default: |
... |
list of character vectors. The parameter name must be the kernel
name to be considered for permutation of variables. Provided vectors length
has to be equal to the number of variables of the input dataset. A kernel is
performed on each unique variables values. Crone-Crosby distances are
computed on each KPCA performed on resulted kernels or meta-kernels and can
be displayed using the |
directory |
character. To limit computational burden, this argument allows to store / read temporary computed kernels. |
plotVar.kernel.pca
produces a barplot for each block. The variables
for which the importance has been computed with
kernel.pca.permute
are displayed. The representation is limited
to the ndisplay
most important variables.
kernel.pca.permute
returns a copy of the input
kpca.result
results and add values in the three entries:
cc.distances
, cc.variables
and cc.blocks
.
Jerome Mariette <[email protected]> Nathalie Vialaneix <[email protected]>
Mariette J. and Villa-Vialaneix N. (2018). Unsupervised multiple kernel learning for heterogeneous data integration. Bioinformatics, 34(6), 1009-1015. DOI: doi:10.1093/bioinformatics/btx682
Crone L. and Crosby D. (1995). Statistical applications of a metric on subspaces to satellite meteorology. Technometrics, 37(3), 324-328.
data(TARAoceans) # compute one kernel for the psychem dataset phychem.kernel <- compute.kernel(TARAoceans$phychem, kernel.func = "linear") # perform a KPCA kernel.pca.result <- kernel.pca(phychem.kernel) # compute importance for all variables in this kernel kernel.pca.result <- kernel.pca.permute(kernel.pca.result, phychem = colnames(TARAoceans$phychem))
data(TARAoceans) # compute one kernel for the psychem dataset phychem.kernel <- compute.kernel(TARAoceans$phychem, kernel.func = "linear") # perform a KPCA kernel.pca.result <- kernel.pca(phychem.kernel) # compute importance for all variables in this kernel kernel.pca.result <- kernel.pca.permute(kernel.pca.result, phychem = colnames(TARAoceans$phychem))
Find the location of the mixKernel User's Guide and optionnaly opens it
mixKernel.users.guide(html = TRUE, view = html)
mixKernel.users.guide(html = TRUE, view = html)
html |
logical. Should the document returned by the function be the
compiled PDF or the Rmd source. Default to |
view |
logical. Should the document be opened using the default HTML
viewer? Default to |
If the operating system is not Windows, then the HTML viewer used is that
given by Sys.getenv("R_BROWSER")
. The HTML viewer can be changed using
Sys.setenv(R_BROWSER = )
.
Character string giving the file location. If html = TRUE
and
view = TRUE
, the HTML document reader is started and the User's Guide
is opened in it.
Jerome Mariette <[email protected]> Nathalie Vialaneix <[email protected]>
mixKernel.users.guide(view = FALSE) mixKernel.users.guide(html = FALSE) ## Not run: mixKernel.users.guide()
mixKernel.users.guide(view = FALSE) mixKernel.users.guide(html = FALSE) ## Not run: mixKernel.users.guide()
Provides a representation of variable importance in kernel PCA.
plotVar.kernel.pca( object, blocks = unique(object$cc.blocks), ndisplay = 5, ncol = 2, ... )
plotVar.kernel.pca( object, blocks = unique(object$cc.blocks), ndisplay = 5, ncol = 2, ... )
object |
: a kernel.pca object returned by |
blocks |
a numerical vector indicating the block variables to display. |
ndisplay |
integer. The number of important variables per blocks shown in
the representation. Default: |
ncol |
integer. Each block of variables is displayed in a separate
subfigure. |
... |
external arguments. |
plotVar.kernel.pca
produces a barplot for each block. The variables for which the
importance has been computed with kernel.pca.permute
are
displayed. The representation is limited to the ndisplay
most important
variables.
Jerome Mariette <[email protected]> Nathalie Vialaneix <[email protected]>
Crone L. and Crosby D. (1995). Statistical applications of a metric on subspaces to satellite meteorology. Technometrics, 37(3), 324-328.
kernel.pca
, kernel.pca.permute
data(TARAoceans) # compute one kernel for the psychem dataset phychem.kernel <- compute.kernel(TARAoceans$phychem, kernel.func = "linear") # perform a KPCA kernel.pca.result <- kernel.pca(phychem.kernel) # compute importance for all variables in this kernel kernel.pca.result <- kernel.pca.permute(kernel.pca.result, phychem = colnames(TARAoceans$phychem)) ## Not run: plotVar.kernel.pca(kernel.pca.result, ndisplay = 10)
data(TARAoceans) # compute one kernel for the psychem dataset phychem.kernel <- compute.kernel(TARAoceans$phychem, kernel.func = "linear") # perform a KPCA kernel.pca.result <- kernel.pca(phychem.kernel) # compute importance for all variables in this kernel kernel.pca.result <- kernel.pca.permute(kernel.pca.result, phychem = colnames(TARAoceans$phychem)) ## Not run: plotVar.kernel.pca(kernel.pca.result, ndisplay = 10)
Select features using supervised or unsupervised kernel method. A
supervised feature selection method is performed if Y
is provided.
## S3 method for class 'features' select( X, Y = NULL, kx.func = c("linear", "gaussian.radial.basis", "bray"), ky.func = c("linear", "gaussian.radial.basis"), keepX = NULL, method = c("kernel", "kpca", "graph"), lambda = NULL, n_components = 2, Lg = NULL, mu = 1, max_iter = 100, nstep = 50, ... )
## S3 method for class 'features' select( X, Y = NULL, kx.func = c("linear", "gaussian.radial.basis", "bray"), ky.func = c("linear", "gaussian.radial.basis"), keepX = NULL, method = c("kernel", "kpca", "graph"), lambda = NULL, n_components = 2, Lg = NULL, mu = 1, max_iter = 100, nstep = 50, ... )
X |
a numeric matrix (or data frame) used to select variables.
|
Y |
a numeric matrix (or data frame) used to select variables.
|
kx.func |
the kernel function name to use on |
ky.func |
the kernel function name to use on |
keepX |
the number of variables to select. |
method |
the method to use. Either an unsupervised variable selection
method ( |
lambda |
the penalization parameter that controls the trade-off between the minimization of the distorsion and the sparsity of the solution parameter. |
n_components |
how many principal components should be used with method
|
Lg |
the Laplacian matrix of the graph representing relations between
the input dataset variables. Required with method |
mu |
the penalization parameter that controls the trade-off between the
the distorsion and the influence of the graph. Default: |
max_iter |
the maximum number of iterations. Default: |
nstep |
the number of values used for the regularization path. Default:
|
... |
the kernel function arguments. In particular,
|
ukfs
returns a vector of sorted selected features indexes.
Celine Brouard <[email protected]> Jerome Mariette <[email protected]> Nathalie Vialaneix <[email protected]>
Brouard C., Mariette J., Flamary R. and Vialaneix N. (2022). Feature selection for kernel methods in systems biology. NAR Genomics and Bioinformatics, 4(1), lqac014. DOI: doi:10.1093/nargab/lqac014.
## These examples require the installation of python modules ## See installation instruction at: http://mixkernel.clementine.wf data("Koren.16S") ## Not run: sf.res <- select.features(Koren.16S$data.raw, kx.func = "bray", lambda = 1, keepX = 40, nstep = 1) colnames(Koren.16S$data.raw)[sf.res] ## End(Not run) data("nutrimouse") ## Not run: grb.func <- "gaussian.radial.basis" genes <- center.scale(nutrimouse$gene) lipids <- center.scale(nutrimouse$lipid) sf.res <- select.features(genes, lipids, kx.func = grb.func, ky.func = grb.func, keepX = 40) colnames(nutrimouse$gene)[sf.res] ## End(Not run)
## These examples require the installation of python modules ## See installation instruction at: http://mixkernel.clementine.wf data("Koren.16S") ## Not run: sf.res <- select.features(Koren.16S$data.raw, kx.func = "bray", lambda = 1, keepX = 40, nstep = 1) colnames(Koren.16S$data.raw)[sf.res] ## End(Not run) data("nutrimouse") ## Not run: grb.func <- "gaussian.radial.basis" genes <- center.scale(nutrimouse$gene) lipids <- center.scale(nutrimouse$lipid) sf.res <- select.features(genes, lipids, kx.func = grb.func, ky.func = grb.func, keepX = 40) colnames(nutrimouse$gene)[sf.res] ## End(Not run)
The TARA Oceans expedition facilitated the study of plankton communities by providing oceans metagenomic data combined with environmental measures to the scientific community. This dataset focuses on 139 prokaryotic-enriched samples collected from 68 stations and spread across three depth layers: the surface (SRF), the deep chlorophyll maximum (DCM) layer and the mesopelagic (MES) zones. Samples were located in height different oceans or seas: Indian Ocean (IO), Mediterranean Sea (MS), North Atlantic Ocean (NAO), North Pacific Ocean (NPO), Red Sea (RS), South Atlantic Ocean (SAO), South Pacific Ocean (SPO) and South Ocean (SO). Here, only a subset of the original data is provided (1% of the 35,650 prokaryotic operational taxonomic units (OTUs) and of the 39,246 bacterial genes (NOGs) (selected at random).
data(TARAoceans)
data(TARAoceans)
A list containing the following components:
phychem
data matrix with 139 rows and 22 columns. Each row represents a sample and each column an environmental variable.
pro.phylo
data matrix with 139 rows (samples) and 356 columns (prokaryotic OTUs).
taxonomy
data matrix with 356 rows (prokaryotic OTUs) and 6 columns indicating the taxonomy of each OTU.
phylogenetic.tree
a phylo object (see package 'ape') representing the prokaryotic OTUs phylogenetic tree.
pro.NOGs
data matrix with 139 rows (samples) and 638 columns (NOGs).
sample
a list containing three following entries (all three
are character vectors): name
(sample name), ocean
(oceanic
region of the sample) and depth
(sample depth).
The raw data were downloaded from http://ocean-microbiome.embl.de/companion.html.
Sunagawa S., Coelho L.P., Chaffron S., Kultima J.R., Labadie K., Salazar F., Djahanschiri B., Zeller G., Mende D.R., Alberti A., Cornejo-Castillo F., Costea P.I., Cruaud C., d'Oviedo F., Engelen S., Ferrera I., Gasol J., Guidi L., Hildebrand F., Kokoszka F., Lepoivre C., Lima-Mendez G., Poulain J., Poulos B., Royo-Llonch M., Sarmento H., Vieira-Silva S., Dimier C., Picheral M., Searson S., Kandels-Lewis S., Tara Oceans coordinators, Bowler C., de Vargas C., Gorsky G., Grimsley N., Hingamp P., Iudicone D., Jaillon O., Not F., Ogata H., Pesant S., Speich S., Stemmann L., Sullivan M., Weissenbach J., Wincker P., Karsenti E., Raes J., Acinas S. and Bork P. (2015). Structure and function of the global ocean microbiome. Science, 348, 6237.