Package 'M3JF'

Title: Multi-Modal Matrix Joint Factorization for Integrative Multi-Omics Data Analysis
Description: Multi-modality data matrices are jointly factorized into the product of a shared sub-matrix and multiple modality-specific sub-matrices, with a group sparse constraint applied to the shared sub-matrix. The shared and modality-specific sub-matrices capture the homogeneous and heterogeneous information, respectively. The samples are then classified by clustering the shared sub-matrix with kmeanspp(), a new version of kmeans() developed here to obtain concordant results. The package also estimates the cluster number by rotation cost. Moreover, cluster-specific features can be retrieved using hypergeometric tests.
Authors: Xiaoyao Yin [aut, cre]
Maintainer: Xiaoyao Yin <[email protected]>
License: GPL-3
Version: 0.1.0
Built: 2024-10-26 06:14:29 UTC
Source: CRAN

Calculate the cost defined by the objective function

Description

Calculate the cost defined by the objective function

Usage

cost(WL, init_list, lambda)

Arguments

WL

a list of multiple modality data matrices

init_list

a list holding the initialized modality-specific sub-matrices Hi and the shared sub-matrix E

lambda

a parameter that sets the relative weight of the L1,infinity norm defined on the shared sub-matrix E

Value

res, a real number giving the value of the cost

Examples

library(InterSIM)
sim.data <- InterSIM(n.sample=500, cluster.sample.prop = c(0.20,0.30,0.27,0.23),
delta.methyl=5, delta.expr=5, delta.protein=5,p.DMP=0.2, p.DEG=NULL,
p.DEP=NULL,sigma.methyl=NULL, sigma.expr=NULL, sigma.protein=NULL,cor.methyl.expr=NULL,
cor.expr.protein=NULL,do.plot=FALSE, sample.cluster=TRUE, feature.cluster=TRUE)
sim.methyl <- sim.data$dat.methyl
sim.expr <- sim.data$dat.expr
sim.protein <- sim.data$dat.protein
temp_data <- list(sim.methyl, sim.expr, sim.protein)
init_list <- initialize_WL(temp_data,k=4)
update_H_list <- update_H(temp_data,init_list)
lambda <- 0.01
update_E_list <- update_E(temp_data,update_H_list,lambda)
new_cost <- cost(temp_data,update_E_list,lambda)
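
As a rough illustration of what such an objective could look like, the base R sketch below sums a squared reconstruction error over the modalities and adds a lambda-weighted L1,infinity penalty on E. The exact objective evaluated by cost() is not stated in this manual, so the model form X_i ~ E %*% H_i, the row-wise grouping used in the penalty, and the names sketch_cost, X_list and H_list are all assumptions made for illustration.

```r
# Hypothetical helper, not part of M3JF: cost of the assumed model X_i ~ E %*% H_i.
sketch_cost <- function(X_list, E, H_list, lambda) {
  # Squared Frobenius reconstruction error, summed over the modalities
  recon <- sum(mapply(function(X, H) sum((X - E %*% H)^2), X_list, H_list))
  # Assumed L1,infinity penalty: sum over rows of the largest absolute entry
  penalty <- sum(apply(abs(E), 1, max))
  recon + lambda * penalty
}

set.seed(1)
n <- 20; k <- 3                                  # toy numbers of samples and clusters
E <- matrix(runif(n * k), n, k)                  # shared sub-matrix, samples x k
H_list <- list(matrix(runif(k * 8), k, 8),       # modality-specific sub-matrices
               matrix(runif(k * 5), k, 5))
X_list <- lapply(H_list, function(H) E %*% H)    # noise-free toy data
toy_cost <- sketch_cost(X_list, E, H_list, lambda = 0.01)
```

Because the toy data are noise-free, only the penalty term contributes to toy_cost here; with real data both terms would be non-zero.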

Generate the simulated dataset with three modalities with the package crimmix

Description

Generate the simulated dataset with three modalities with the package crimmix

Usage

crimmix_data_gen(nclust=4, n_byClust=c(10,20,5,25),
feature_nums=c(1000,500,5000), noises=c(0.5,0.01,0.3),props=c(0.005,0.01,0.02))

Arguments

nclust

number of clusters

n_byClust

number of samples per cluster

feature_nums

number of features in each modality

noises

percentage of noise added to each modality

props

proportion of cluster related features in each modality

Value

res, a list of length 2, where the first element is a list of simulated data, while the second element is a vector indicating the true label of each sample

Examples

crimmix_data <- crimmix_data_gen(nclust=4, n_byClust=c(10,20,5,25),
feature_nums=c(1000,500,5000), noises=c(0.5,0.01,0.3),props=c(0.005,0.01,0.02))

Screen the cluster related features via hypergeometric test p value and distribution standard deviation

Description

Screen the cluster related features via hypergeometric test p value and distribution standard deviation

Usage

feature_screen_sd(feature_list, sig_num = 20)

Arguments

feature_list

a list, the output of the feature_selection function

sig_num

the number of significant features for each cluster

Value

selected_features, a list as long as the number of clusters. Each element is a sub-list with two vectors for the current cluster: one of the over-expressed features and one of the under-expressed features.

Examples

library(InterSIM)
sim.data <- InterSIM(n.sample=500, cluster.sample.prop = c(0.20,0.30,0.27,0.23),
delta.methyl=5, delta.expr=5, delta.protein=5,p.DMP=0.2, p.DEG=NULL,
p.DEP=NULL,sigma.methyl=NULL, sigma.expr=NULL, sigma.protein=NULL,cor.methyl.expr=NULL,
cor.expr.protein=NULL,do.plot=FALSE, sample.cluster=TRUE, feature.cluster=TRUE)
sim.methyl <- sim.data$dat.methyl
sim.expr <- sim.data$dat.expr
sim.protein <- sim.data$dat.protein
temp_data <- list(sim.methyl, sim.expr, sim.protein)
M3JF_res <- M3JF(temp_data,k=4)
feature_list <- feature_selection(temp_data[[1]],M3JF_res$cluster_res,z_score=TRUE,
upper_bound=1, lower_bound=-1)
selected_features <- feature_screen_sd(feature_list,sig_num=20)

Select the cluster related features via hypergeometric test

Description

Select the cluster related features via hypergeometric test

Usage

feature_selection(
  X,
  clusters,
  z_score = FALSE,
  upper_bound,
  lower_bound,
  p.adjust.method = "BH"
)

Arguments

X

the feature matrix to be analyzed, with rows as samples and columns as features

clusters

the numeric clustering result, where each number specifies the cluster a sample belongs to

z_score

a logical value specifying whether to calculate z-scores for X first

upper_bound

values larger than this value should be treated as over-expressed

lower_bound

values smaller than this value should be treated as under-expressed

p.adjust.method

the p value adjustment method, 'BH' by default

Value

results, a list of length (cluster number + 2). Each of the first (cluster number) elements consists of two sub-lists, each comprising a feature vector and an FDR vector. The last two elements are matrices: one gives the fraction of over-expressed samples in each cluster for each feature, and the other gives the fraction of under-expressed samples.

Examples

library(InterSIM)
sim.data <- InterSIM(n.sample=500, cluster.sample.prop = c(0.20,0.30,0.27,0.23),
delta.methyl=5, delta.expr=5, delta.protein=5,p.DMP=0.2, p.DEG=NULL,
p.DEP=NULL,sigma.methyl=NULL, sigma.expr=NULL, sigma.protein=NULL,cor.methyl.expr=NULL,
cor.expr.protein=NULL,do.plot=FALSE, sample.cluster=TRUE, feature.cluster=TRUE)
sim.methyl <- sim.data$dat.methyl
sim.expr <- sim.data$dat.expr
sim.protein <- sim.data$dat.protein
temp_data <- list(sim.methyl, sim.expr, sim.protein)
M3JF_res <- M3JF(temp_data,k=4)
feature_list <- feature_selection(temp_data[[1]],M3JF_res$cluster_res,z_score=TRUE,
upper_bound=1, lower_bound=-1)
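
The idea of testing each cluster for enrichment of over-expressed samples can be sketched in base R. The block below is a minimal illustration under stated assumptions, not the package's implementation: sketch_over_expr_test is a hypothetical helper, it always z-scores X, it handles only the over-expressed side, and it adjusts p values across the features within each cluster.

```r
# Hypothetical helper, not the implementation used by feature_selection().
sketch_over_expr_test <- function(X, clusters, upper_bound = 1,
                                  p.adjust.method = "BH") {
  Z <- scale(X)                 # z-score each feature (column)
  over <- Z > upper_bound       # TRUE where a sample over-expresses a feature
  N <- nrow(X)                  # total number of samples
  K <- colSums(over)            # over-expressing samples per feature, all clusters
  ids <- sort(unique(clusters))
  res <- lapply(ids, function(cl) {
    in_cl <- clusters == cl
    n_cl <- sum(in_cl)                            # cluster size (number of draws)
    x <- colSums(over[in_cl, , drop = FALSE])     # over-expressing samples in cluster
    # Upper-tail hypergeometric test: P(at least x successes in n_cl draws)
    p <- phyper(x - 1, K, N - K, n_cl, lower.tail = FALSE)
    p.adjust(p, method = p.adjust.method)
  })
  names(res) <- paste0("cluster", ids)
  res
}

set.seed(1)
X <- matrix(rnorm(60 * 10), 60, 10)              # toy data: 60 samples x 10 features
clusters <- rep(1:3, each = 20)
X[clusters == 1, 1] <- X[clusters == 1, 1] + 4   # make feature 1 mark cluster 1
fdr <- sketch_over_expr_test(X, clusters)
```

In this toy setting, fdr$cluster1 holds one adjusted p value per feature, and the shifted feature should stand out from the rest.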

Initialize the shared sub-matrix E and modality specific sub-matrices list Hi

Description

Initialize the shared sub-matrix E and modality specific sub-matrices list Hi

Usage

initialize_WL(WL, k)

Arguments

WL

a list of multiple modality data matrices

k

the cluster number

Value

res, a list of length N+3, where N is the number of data modalities. The first N elements are the modality-specific sub-matrices Hi, the (N+1)th element is the shared sub-matrix E, and the last two elements are the losses defined on the shared sub-matrix E and the modality-specific sub-matrices Hi.

Examples

library(InterSIM)
sim.data <- InterSIM(n.sample=500, cluster.sample.prop = c(0.20,0.30,0.27,0.23),
delta.methyl=5, delta.expr=5, delta.protein=5,p.DMP=0.2, p.DEG=NULL,
p.DEP=NULL,sigma.methyl=NULL, sigma.expr=NULL, sigma.protein=NULL,cor.methyl.expr=NULL,
cor.expr.protein=NULL,do.plot=FALSE, sample.cluster=TRUE, feature.cluster=TRUE)
sim.methyl <- sim.data$dat.methyl
sim.expr <- sim.data$dat.expr
sim.protein <- sim.data$dat.protein
temp_data <- list(sim.methyl, sim.expr, sim.protein)
init_list <- initialize_WL(temp_data,k=4)

Generate the simulated dataset with three modalities, following the iNMF work

Description

Generate the simulated dataset with three modalities, following the iNMF work

Usage

iNMF_data_gen(Xs_dim_list=list(c(100,100),c(100,100),c(100,100)),
mod_dim_list=list(matrix(c(20,30,20,30,20,30,20,30),4,2),
matrix(c(20,20,30,30,20,30,20,30),4,2),
matrix(c(26,24,26,24,20,30,20,30),4,2)),e_u=0.15, e_s=0.9, e_h=0)

Arguments

Xs_dim_list

a list of data matrix dimensions for multiple modality data

mod_dim_list

a list of the dimensions of each cluster and their features

e_u

the level of uniform noise

e_s

signal to noise ratio

e_h

block adding probability

Value

res, a list of length 2, where the first element is a list of simulated data, while the second element is a vector indicating the true label of each sample.

Examples

iNMF_data <- iNMF_data_gen(Xs_dim_list=list(c(100,100),c(100,100),c(100,100)),
mod_dim_list=list(matrix(c(20,30,20,30,20,30,20,30),4,2),
matrix(c(20,20,30,30,20,30,20,30),4,2),
matrix(c(26,24,26,24,20,30,20,30),4,2)),e_u=0.15, e_s=0.9, e_h=0)

Generate the simulated dataset with three modalities with the package InterSIM

Description

Generate the simulated dataset with three modalities with the package InterSIM

Usage

intersim_data_gen(prop=c(0.20,0.30,0.27,0.23), n_sample=500)

Arguments

prop

proportion of samples for each cluster

n_sample

the number of samples

Value

res, a list of length 2, where the first element is a list of simulated data, while the second element is a vector indicating the true label of each sample.

Examples

library(InterSIM)
intersim_data <- intersim_data_gen(prop=c(0.20,0.30,0.27,0.23), n_sample=500)

A new version of kmeans that generates stable clustering results

Description

A new version of kmeans that generates stable clustering results

Usage

kmeanspp(X, k)

Arguments

X

a data matrix with each row as a sample and each column as a feature

k

the cluster number

Value

res, the cluster result generated by this function

Examples

library(InterSIM)
sim.data <- InterSIM(n.sample=500, cluster.sample.prop = c(0.20,0.30,0.27,0.23),
delta.methyl=5, delta.expr=5, delta.protein=5,p.DMP=0.2, p.DEG=NULL,
p.DEP=NULL,sigma.methyl=NULL, sigma.expr=NULL, sigma.protein=NULL,cor.methyl.expr=NULL,
cor.expr.protein=NULL,do.plot=FALSE, sample.cluster=TRUE, feature.cluster=TRUE)
sim.methyl <- sim.data$dat.methyl
sim.expr <- sim.data$dat.expr
sim.protein <- sim.data$dat.protein
temp_data <- list(sim.methyl, sim.expr, sim.protein)
init_list <- initialize_WL(temp_data,k=4)
lambda <- 0.01
update_E_list <- update_E(temp_data,init_list,lambda)
cluster_res <- kmeanspp(update_E_list[[4]],4)
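
This manual does not describe how kmeanspp() stabilizes its results. The function name suggests k-means++ style seeding, so the base R sketch below shows that general technique only; whether the package works this way is an assumption, and sketch_kmeanspp is a hypothetical name.

```r
# Hypothetical sketch of k-means++ seeding; not the package's kmeanspp().
sketch_kmeanspp <- function(X, k) {
  n <- nrow(X)
  centers <- integer(k)
  centers[1] <- sample.int(n, 1)                   # first seed: uniform at random
  # Squared distance from every sample to its nearest chosen seed
  d2 <- rowSums(sweep(X, 2, X[centers[1], ])^2)
  for (j in seq_len(k)[-1]) {
    # Next seed: sampled with probability proportional to squared distance
    centers[j] <- sample.int(n, 1, prob = d2)
    d2 <- pmin(d2, rowSums(sweep(X, 2, X[centers[j], ])^2))
  }
  # Run standard k-means starting from the chosen seeds
  stats::kmeans(X, centers = X[centers, , drop = FALSE])$cluster
}

set.seed(1)
X <- rbind(matrix(rnorm(40, 0), 20, 2),            # three toy groups of 20 samples
           matrix(rnorm(40, 10), 20, 2),
           matrix(rnorm(40, 20), 20, 2))
cl <- sketch_kmeanspp(X, 3)
```

Spreading the seeds apart in this way tends to make repeated runs agree more often than purely random starts, which is one possible route to concordant results.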

Multi-Modal Matrix Joint Factorization

Description

Multi-Modal Matrix Joint Factorization

Usage

M3JF(WL, lambda = 0.01, theta = 10^-6, k)

Arguments

WL

a list of multiple modality data matrices

lambda

the parameter to set the relative weight of the group sparse constraint

theta

threshold for the stopping criteria

k

cluster number

Value

result, a list of 3 elements. The first element is a list comprising the shared sub-matrix and the modality-specific sub-matrices. The second element is a vector of the clustering result. The third element is a vector of the cost at each step of the optimization.

Examples

library(InterSIM)
sim.data <- InterSIM(n.sample=500, cluster.sample.prop = c(0.20,0.30,0.27,0.23),
delta.methyl=5, delta.expr=5, delta.protein=5,p.DMP=0.2, p.DEG=NULL,
p.DEP=NULL,sigma.methyl=NULL, sigma.expr=NULL, sigma.protein=NULL,cor.methyl.expr=NULL,
cor.expr.protein=NULL,do.plot=FALSE, sample.cluster=TRUE, feature.cluster=TRUE)
sim.methyl <- sim.data$dat.methyl
sim.expr <- sim.data$dat.expr
sim.protein <- sim.data$dat.protein
temp_data <- list(sim.methyl, sim.expr, sim.protein)
M3JF_res <- M3JF(temp_data,k=4)

Evaluate the cluster number of multiple modality data

Description

Evaluate the cluster number of multiple modality data

Usage

RotationCostBestGivenGraph(W, NUMC = 2:5)

Arguments

W

a list of multiple modality data matrices

NUMC

a vector specifying the range of candidate cluster numbers from which to select the best one

Value

quality, a vector of rotation costs as long as NUMC, where each element is the rotation cost for the corresponding cluster number.

Examples

library(InterSIM)
library(SNFtool)
sim.data <- InterSIM(n.sample=100, cluster.sample.prop = c(0.20,0.30,0.27,0.23),
delta.methyl=5, delta.expr=5, delta.protein=5,p.DMP=0.2, p.DEG=NULL,
p.DEP=NULL,sigma.methyl=NULL, sigma.expr=NULL, sigma.protein=NULL,cor.methyl.expr=NULL,
cor.expr.protein=NULL,do.plot=FALSE, sample.cluster=TRUE, feature.cluster=TRUE)
sim.methyl <- sim.data$dat.methyl
sim.expr <- sim.data$dat.expr
sim.protein <- sim.data$dat.protein
temp_data <- list(sim.methyl, sim.expr, sim.protein)
dat <- lapply(temp_data, function(dd) {
  dd <- as.matrix(dd)
  dd1 <- dist2(dd,dd)
  W1 <- affinityMatrix(dd1, K = 10, sigma = 0.5)
})
W <- SNF(dat, 10, 10)
clu_eval <- RotationCostBestGivenGraph(W,2:10)
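
Once a vector of rotation costs is available, a candidate cluster number can be read off from it. The sketch below assumes that a lower rotation cost indicates a better cluster number, which this manual does not state explicitly. The quality values are made up for illustration and are not output of the package.

```r
# Made-up values standing in for the output of RotationCostBestGivenGraph()
NUMC <- 2:10
quality <- c(0.42, 0.31, 0.12, 0.27, 0.35, 0.40, 0.44, 0.51, 0.58)

# Assumption: the candidate with the lowest rotation cost is preferred
best_k <- NUMC[which.min(quality)]

# Rank all candidates from best to worst under the same assumption
ranked_k <- NUMC[order(quality)]
```

Inspecting ranked_k rather than best_k alone can help when two candidates have nearly equal costs.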

Generate the simulated dataset with specified parameters

Description

Generate the simulated dataset with specified parameters

Usage

simulateY(nclust = 4, n_byClust = c(10,20,5,25), J=1000, prop = 0.01,
noise = 0.1,flavor =c("normal", "beta", "binary"),
params = list(c(mean = 1,sd = 1)))

Arguments

nclust

number of clusters

n_byClust

number of samples per cluster

J

number of features in each modality

prop

proportion of cluster related features

noise

percentage of noise added to each modality

flavor

a vector indicating the data type

params

a list indicating the mean and standard deviation of the simulated data

Value

res, a list of length 2, where the first element is a list of simulated data, while the second element is a vector indicating the true label of each sample

Examples

temp_data <- simulateY(nclust = 4, n_byClust = c(10,20,5,25), J=1000,
prop = 0.01, noise = 0.1,flavor =c("normal", "beta", "binary"),
params = list(c(mean = 1,sd = 1)))

Update sub-matrix E

Description

Update sub-matrix E

Usage

update_E(WL, init_list, lambda)

Arguments

WL

a list of multiple modality data matrices

init_list

a list holding the initialized modality-specific sub-matrices Hi and the shared sub-matrix E

lambda

a parameter that sets the relative weight of the L1,infinity norm defined on the shared sub-matrix E

Value

update_E_list, the data list init_list with the shared sub-matrix E updated.

Examples

library(InterSIM)
sim.data <- InterSIM(n.sample=500, cluster.sample.prop = c(0.20,0.30,0.27,0.23),
delta.methyl=5, delta.expr=5, delta.protein=5,p.DMP=0.2, p.DEG=NULL,
p.DEP=NULL,sigma.methyl=NULL, sigma.expr=NULL, sigma.protein=NULL,cor.methyl.expr=NULL,
cor.expr.protein=NULL,do.plot=FALSE, sample.cluster=TRUE, feature.cluster=TRUE)
sim.methyl <- sim.data$dat.methyl
sim.expr <- sim.data$dat.expr
sim.protein <- sim.data$dat.protein
temp_data <- list(sim.methyl, sim.expr, sim.protein)
init_list <- initialize_WL(temp_data,k=4)
update_H_list <- update_H(temp_data,init_list)
lambda <- 0.01
update_E_list <- update_E(temp_data,update_H_list,lambda)

Update sub-matrices list Hi

Description

Update sub-matrices list Hi

Usage

update_H(WL, init_list)

Arguments

WL

a list of multiple modality data matrices

init_list

a list holding the initialized modality-specific sub-matrices Hi and the shared sub-matrix E

Value

update_H_list, the data list init_list with the modality specific sub-matrices list Hi updated.

Examples

library(InterSIM)
sim.data <- InterSIM(n.sample=500, cluster.sample.prop = c(0.20,0.30,0.27,0.23),
delta.methyl=5, delta.expr=5, delta.protein=5,p.DMP=0.2, p.DEG=NULL,
p.DEP=NULL,sigma.methyl=NULL, sigma.expr=NULL, sigma.protein=NULL,cor.methyl.expr=NULL,
cor.expr.protein=NULL,do.plot=FALSE, sample.cluster=TRUE, feature.cluster=TRUE)
sim.methyl <- sim.data$dat.methyl
sim.expr <- sim.data$dat.expr
sim.protein <- sim.data$dat.protein
temp_data <- list(sim.methyl, sim.expr, sim.protein)
init_list <- initialize_WL(temp_data,k=4)
update_H_list <- update_H(temp_data,init_list)
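
The pairing of update_H() and update_E() suggests an alternating scheme in which one block of sub-matrices is updated while the other is held fixed. The base R sketch below illustrates that general pattern with plain, unpenalized alternating least squares for an assumed model X_i ~ E %*% H_i. It is a swapped-in technique for illustration only: it omits the group sparse constraint, it does not reproduce the package's update rules, and the relative-change stopping rule with threshold theta is an assumption.

```r
# Hypothetical illustration: unpenalized alternating least squares.
set.seed(1)
n <- 30; k <- 3
X_list <- list(matrix(rnorm(n * 12), n, 12),   # toy modality 1: 30 samples x 12 features
               matrix(rnorm(n * 7), n, 7))     # toy modality 2: 30 samples x 7 features

# Squared reconstruction error summed over the modalities
sq_err <- function(X_list, E, H_list) {
  sum(mapply(function(X, H) sum((X - E %*% H)^2), X_list, H_list))
}

E <- matrix(rnorm(n * k), n, k)    # random start for the shared sub-matrix
theta <- 1e-6                      # threshold on the relative change in cost
costs <- numeric(0)
for (iter in 1:200) {
  # H step: with E fixed, each H_i is an ordinary least-squares solution
  H_list <- lapply(X_list, function(X) solve(crossprod(E), crossprod(E, X)))
  # E step: with all H_i fixed, solve one least-squares problem on the stacked data
  X_all <- do.call(cbind, X_list)
  H_all <- do.call(cbind, H_list)
  E <- t(solve(tcrossprod(H_all), H_all %*% t(X_all)))
  costs <- c(costs, sq_err(X_list, E, H_list))
  m <- length(costs)
  # Stop once the cost changes by less than theta relative to the previous step
  if (m > 1 && abs(costs[m - 1] - costs[m]) <= theta * costs[m - 1]) break
}
```

Each step minimizes the same squared error over one block, so the recorded costs do not increase from one iteration to the next, which is what makes a threshold on the change in cost a usable stopping rule.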