This vignette presents the R package M3JF, which implements a framework named multi-modality matrix joint factorization (M3JF) for the integrative analysis of multi-modality data. The objective is to provide an implementation of the proposed method, which is designed to handle the high-dimensional, multi-modality data common in bioinformatics. The method jointly factorizes the data matrices into a shared sub-matrix and several modality-specific sub-matrices. A group-sparse constraint on the shared sub-matrix allows each modality to exploit only a subset of the dimensions of the global latent space for the samples in the same group.
The latest stable version of the package can be installed from any CRAN repository mirror:
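A minimal sketch of the CRAN installation, using the standard `install.packages` call with the package name taken from this vignette:

```r
# Install the released version from the configured CRAN mirror, then load it
install.packages("M3JF")
library(M3JF)
```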
The latest development version is available from https://cran.r-project.org/package=M3JF and may be downloaded from there and installed manually:
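A manual installation from a downloaded source archive could look like the sketch below. The file name `M3JF_x.y.z.tar.gz` is a hypothetical placeholder and should be replaced with the path to the archive actually downloaded.

```r
# Hypothetical file name: point this at the downloaded source archive
install.packages("M3JF_x.y.z.tar.gz", repos = NULL, type = "source")
```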
Support: Users interested in this package are encouraged to email Xiaoyao Yin ([email protected]) with enquiries, bug reports, feature requests, suggestions or M3JF-related discussions.
We will give an example of how to use this package hereafter.
We generate simulated data with the R package InterSIM, which generates three inter-related data sets with realistic inter- and intra-relationships based on the DNA methylation, mRNA expression and protein expression data from the TCGA ovarian cancer study. Each data modality consists of 500 samples, which are assigned to 4 groups of 100, 150, 135 and 115 samples. The data can be generated by running:
library(InterSIM)
# Simulate 500 samples in 4 groups (20%, 30%, 27% and 23% of the samples)
sim.data <- InterSIM(n.sample = 500, cluster.sample.prop = c(0.20, 0.30, 0.27, 0.23),
                     delta.methyl = 5, delta.expr = 5, delta.protein = 5,
                     p.DMP = 0.2, p.DEG = NULL, p.DEP = NULL,
                     sigma.methyl = NULL, sigma.expr = NULL, sigma.protein = NULL,
                     cor.methyl.expr = NULL, cor.expr.protein = NULL,
                     do.plot = FALSE, sample.cluster = TRUE, feature.cluster = TRUE)
# Extract the three modalities and collect them in a list
sim.methyl <- sim.data$dat.methyl
sim.expr <- sim.data$dat.expr
sim.protein <- sim.data$dat.protein
data_list <- list(sim.methyl, sim.expr, sim.protein)
Label assignment: According to the data generation process, each sample receives the label of the group it was simulated in, which gives the ground-truth labels. These labels are used later to assess the clustering performance.
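As a minimal sketch of the label assignment, the group sizes follow from the proportions passed to InterSIM (500 samples split 0.20, 0.30, 0.27, 0.23). The vector name `truelabel` and the assumption that samples are stored group by group, in the order the proportions were given, are illustrative. It is worth checking them against any cluster assignment that InterSIM returns with the simulated data.

```r
n_sample <- 500
group_prop <- c(0.20, 0.30, 0.27, 0.23)

# Group sizes implied by the proportions: 100, 150, 135, 115
group_size <- round(n_sample * group_prop)

# Assumption: samples are ordered by group, so labels are 1,...,1,2,...,2, etc.
truelabel <- rep(seq_along(group_size), times = group_size)

table(truelabel)
```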
We can now cluster the samples with the proposed method and assess its performance with the function cal_NMI, which computes the normalized mutual information between the true labels and the predicted labels.
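To make the score concrete, the following is a minimal base-R sketch of normalized mutual information. It is not the package's `cal_NMI`. The function name `nmi_sketch`, the toy label vectors, and the choice of normalizing by the geometric mean of the two entropies are assumptions made for illustration.

```r
nmi_sketch <- function(a, b) {
  # Joint and marginal label distributions
  joint <- table(a, b) / length(a)
  pa <- rowSums(joint)
  pb <- colSums(joint)

  # Shannon entropy, skipping empty cells
  entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))

  # Mutual information: sum of p(a,b) * log(p(a,b) / (p(a) p(b)))
  expected <- outer(pa, pb)
  nz <- joint > 0
  mi <- sum(joint[nz] * log(joint[nz] / expected[nz]))

  # Assumption: normalize by sqrt(H(a) * H(b)); undefined if either
  # labelling has a single cluster (zero entropy)
  mi / sqrt(entropy(pa) * entropy(pb))
}

truth <- rep(1:4, times = c(100, 150, 135, 115))

# Same partition with permuted label names: NMI equals 1 up to rounding
relabelled <- c(3, 1, 4, 2)[truth]
nmi_sketch(truth, relabelled)

# Randomly shuffled labels: NMI close to 0
set.seed(1)
nmi_sketch(truth, sample(truth))
```

NMI does not depend on how the clusters are named, so the predicted labels need not be matched to the true ones beforehand.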
Evaluating k: Choose the most suitable number of clusters k by the mean modality modularity, using the function new_modularity.
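# Build one similarity (affinity) matrix per modality with SNFtool
library(SNFtool)
library(dplyr)
WL_dist1 <- lapply(data_list, function(x) {
  dd <- as.matrix(x)
  dd %>% dist2(dd) %>% affinityMatrix(K = 10, sigma = 0.5)
})
# Assumption: W is a single sample-by-sample similarity matrix, obtained
# here by fusing the per-modality matrices with SNFtool::SNF
W <- SNF(WL_dist1, K = 10, t = 20)
# Set the range of candidate k according to your data
k_list <- 2:10
# Compute the rotation cost for each candidate k
clu_eval <- RotationCostBestGivenGraph(W, k_list)
# The most suitable k is the one with the minimal rotation cost
best_k <- k_list[which.min(clu_eval)]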
M3JF: Jointly factorize the matrices into a shared embedding matrix and several modality private basis matrices.
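The idea of a shared embedding with modality-private bases can be illustrated with a simplified sketch. It is not the M3JF algorithm: it omits the group-sparse constraint and uses plain alternating least squares to fit each modality X_i as E %*% H_i. The toy data, the function name `joint_factorize`, and the final k-means step on E are all assumptions made for illustration.

```r
set.seed(1)
n <- 60
k <- 3
groups <- rep(1:k, each = n / k)

# Toy data (hypothetical): each modality has group-specific means plus noise
make_modality <- function(p) {
  centers <- matrix(rnorm(k * p, sd = 3), k, p)
  centers[groups, ] + matrix(rnorm(n * p), n, p)
}
X_list <- list(make_modality(20), make_modality(30))

joint_factorize <- function(X_list, k, n_iter = 50) {
  n <- nrow(X_list[[1]])
  E <- matrix(rnorm(n * k), n, k)  # shared embedding, random start
  for (iter in seq_len(n_iter)) {
    # Modality-private bases: least-squares solution of X_i ~ E H_i for fixed E
    H_list <- lapply(X_list, function(X) solve(crossprod(E), crossprod(E, X)))
    # Shared embedding: solve E * sum(H_i H_i^T) = sum(X_i H_i^T) for fixed H_i
    A <- Reduce(`+`, lapply(H_list, tcrossprod))
    B <- Reduce(`+`, Map(function(X, H) X %*% t(H), X_list, H_list))
    E <- B %*% solve(A)
  }
  list(E = E, H_list = H_list)
}

fit <- joint_factorize(X_list, k)

# Relative reconstruction error per modality
rel_err <- mapply(function(X, H) norm(X - fit$E %*% H, "F") / norm(X, "F"),
                  X_list, fit$H_list)

# Cluster the samples on the shared embedding
pred_label <- kmeans(fit$E, centers = k, nstart = 20)$cluster
table(groups, pred_label)
```

Because every modality is reconstructed from the same matrix E, the rows of E summarize each sample across all modalities, which is why the clustering is run on E rather than on any single data matrix.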
You now have the clustering result for your samples.
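Robustness test: We test the robustness of our method by calculating the normalized mutual information (NMI) and the adjusted Rand index (ARI) between the true labels and our predicted labels. These scores also allow us to compare the performance of our method with that of others. NMI lies in the interval [0,1], while ARI is at most 1 and can fall slightly below 0 when the agreement is worse than chance. The larger the scores, the better the agreement with the true labels and the more robust the method.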
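For reference, the adjusted Rand index can be computed from the contingency table of the two labellings. The base-R sketch below uses the standard pair-counting formula. The function name `ari_sketch` and the toy label vectors are assumptions made for illustration and are not part of the package.

```r
ari_sketch <- function(a, b) {
  tab <- table(a, b)

  # Number of sample pairs that fall together in cells, rows and columns
  sum_cells <- sum(choose(tab, 2))
  sum_rows <- sum(choose(rowSums(tab), 2))
  sum_cols <- sum(choose(colSums(tab), 2))

  # Agreement expected by chance under random labelling
  expected <- sum_rows * sum_cols / choose(length(a), 2)

  (sum_cells - expected) / (0.5 * (sum_rows + sum_cols) - expected)
}

truth <- rep(1:4, times = c(100, 150, 135, 115))

# Same partition with permuted label names: ARI equals 1
ari_sketch(truth, c(3, 1, 4, 2)[truth])

# Randomly shuffled labels: ARI close to 0, possibly slightly negative
set.seed(1)
ari_sketch(truth, sample(truth))
```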