Title: | Supervised NMF |
---|---|
Description: | Non-negative matrix factorization (NMF) is a dimension-reduction method and a powerful tool for identifying the key features of microbial communities. When the interest lies in the differences between the structures of two groups of communities, supervised NMF (Yun Cai, Hong Gu and Toby Kenney (2017), <doi:10.1186/s40168-017-0323-1>) provides a better way to find them while retaining the advantages of NMF, such as interpretability and a basis in simple biological intuition. |
Authors: | Yun Cai [aut, cre], Hong Gu [aut], Toby Kenney [aut] |
Maintainer: | Yun Cai <[email protected]> |
License: | GPL-3 |
Version: | 0.1.1 |
Built: | 2024-12-10 06:41:13 UTC |
Source: | CRAN |
chty is used to choose the number of types for the data.
chty(data,y,k,maxr)
data | an optional n by p count data matrix. The p columns of the matrix are different variables and the n rows are samples. Each column should contain at least one nonzero entry. When n = 1, it is a row vector. |
y | a binary variable containing the class label of each sample. Usually one group is labelled "0" and the other "1". |
k | the number of folds used in cross-validation when choosing the number of types. |
maxr | the upper bound on the number of types. |
r1 | the suggested number of types for the class labeled 1. |
r2 | the suggested number of types for the class labeled 0. |
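The role of k can be pictured with a small fold-assignment sketch. This is only an illustration under the assumption that folds are stratified by class; how chty splits the data internally is not described here, and make_folds is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper (not part of the package): assign each sample to one of
# k folds so that both classes are spread evenly across the folds.
make_folds <- function(y, k) {
  folds <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # Fold labels 1..k recycled to the class size, then shuffled
    folds[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  folds
}

set.seed(1)
y <- c(rep(1, 4), rep(0, 4))
folds <- make_folds(y, k = 2)
# Cross-tabulate class labels against fold membership
fold_table <- table(y, folds)
print(fold_table)
```

With 4 samples per class and k = 2, each fold holds 2 samples from each class, so every training split still contains both classes.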
Yun Cai, Hong Gu and Toby Kenney
Learning Microbial Community Structures with Supervised and Unsupervised Non-negative Matrix Factorization
##we use the simulated data spdata here
##the spdata is simulated from a feature matrix combining 2 types of features
##from one group and 3 types from the other.
##choose number of types using our function
##2-fold cross validation is used here
##the upper bound of number of types for both classes is 2
##remove all zero variables from the data
spdata.rm=spdata[c(1:4,41:44),colSums(spdata)!=0]
y=c(rep(1,4),rep(0,4))
types=chty(spdata.rm,y,2,2)
#number of types for class labeled as 1
nmb1 = types$r1
#number of types for class labeled as 0
nmb2 = types$r2
getT is used to calculate the combined feature matrix.
getT(data,y,Tr1,Tr2)
data | an optional n by p count data matrix. The p columns of the matrix are different variables and the n rows are samples. Each column should contain at least one nonzero entry. When n = 1, it is a row vector. |
y | a binary variable containing the class label of each sample. Usually one group is labelled "0" and the other "1". |
Tr1 | the number of types for the class labeled 1. An appropriate Tr1 can also be estimated with the function chty. |
Tr2 | the number of types for the class labeled 0. An appropriate Tr2 can also be estimated with the function chty. |
The data passed to getT should contain samples from both classes. If a feature
matrix is needed for only one class, basis(nmf(data, Tr, "KL")) from the R
package NMF can be used.
T | a p by r feature matrix. It is a combined feature matrix that contains information from both classes. |
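The shape of the returned matrix can be illustrated with placeholder values. The sketch below assumes that "combined" means the per-class feature matrices are placed side by side, so that r = Tr1 + Tr2; that reading is an assumption, and T1, T2 and T_combined are hypothetical names filled with random numbers rather than fitted features.

```r
set.seed(1)
p <- 6    # number of variables
r1 <- 2   # number of types for the class labeled 1
r2 <- 3   # number of types for the class labeled 0

# Placeholder non-negative feature matrices (random values, for illustration only)
T1 <- matrix(runif(p * r1), nrow = p)
T2 <- matrix(runif(p * r2), nrow = p)

# Assumed combination step: column-bind the two per-class feature matrices
T_combined <- cbind(T1, T2)
dim(T_combined)  # p rows, r1 + r2 columns
```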
Yun Cai, Hong Gu and Toby Kenney
Learning Microbial Community Structures with Supervised and Unsupervised Non-negative Matrix Factorization
#get feature matrix with rank 2 for one group and rank 3 for the other of the simulated spdata
#use 4 samples from each class so that the rows match the labels in y
spdata.rm=spdata[c(1:4,41:44),colSums(spdata)!=0]
y=c(rep(1,4),rep(0,4))
T.eg=getT(spdata.rm,y,2,3)
The spdata is simulated from a Poisson distribution whose mean is the product of a feature matrix and a weight matrix. The feature matrix has 2804 variables and combines 2 types of features from one group with 3 types from the other. The weight matrix is generated from the uniform distribution on (0, 1).
The format is:
 int [1:80, 1:2804] 5 12 7 10 14 1 12 18 4 26 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:80] "ibd.old0" "ibd.old0" "ibd.old0" "ibd.old0" ...
  ..$ : NULL
The spdata has dimension 80 by 2804, with 40 samples labeled as class one and the remaining 40 labeled as class two.
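The simulation scheme described above can be sketched on a much smaller scale. The sizes, the exponential draw for the feature matrix, and the zeroing of weights so that each class uses only its own types are assumptions made for illustration; they are not the settings used to generate spdata.

```r
set.seed(1)
n <- 8    # samples (4 per class)
p <- 10   # variables
r <- 5    # 2 types for one class plus 3 types for the other

# Hypothetical non-negative feature matrix (p by r)
feat <- matrix(rexp(p * r, rate = 0.1), nrow = p)

# Weights drawn from the uniform distribution on (0, 1)
W <- matrix(runif(n * r), nrow = n)
# Assumption: samples of the first class use only types 1-2,
# samples of the second class use only types 3-5
W[1:4, 3:5] <- 0
W[5:8, 1:2] <- 0

# Poisson counts with mean equal to the product of weights and features
mu <- W %*% t(feat)
sim <- matrix(rpois(n * p, lambda = mu), nrow = n)
dim(sim)
```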
data(spdata)
spnmf is used to fit a supervised non-negative matrix factorization model to data when the combined feature matrix is known.
spnmf(data,Tp)
data | an optional n by p count data matrix. The p columns of the matrix are different variables and the n rows are samples. Each column should contain at least one nonzero entry. When n = 1, it is a row vector. |
Tp | a combined feature matrix in dimension p by r. p is the number of variables and r is the number of types. Tp can also be calculated with the function getT. |
The function is based on the R package NMF.
W | the supervised weight matrix in dimension n by r, where n is the number of observations and r is the number of types for the data. It contains the coefficients of the feature matrix. |
loglh | the log-likelihood of the supervised NMF model. |
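To make the two return values concrete, here is a minimal base-R sketch of estimating weights when the feature matrix is held fixed. It is not the package's implementation (which is based on the NMF package); it swaps in the standard multiplicative update for the Kullback-Leibler objective and assumes a Poisson likelihood for the counts. The helper fit_weights and its arguments n_iter and eps are hypothetical.

```r
# Hypothetical helper: estimate non-negative weights W (n by r) for fixed
# features Tp (p by r) so that data is approximated by W %*% t(Tp).
fit_weights <- function(data, Tp, n_iter = 200, eps = 1e-10) {
  n <- nrow(data)
  r <- ncol(Tp)
  W <- matrix(1, nrow = n, ncol = r)   # positive starting values
  col_tot <- colSums(Tp) + eps         # normalising constant per type
  for (i in seq_len(n_iter)) {
    mu <- W %*% t(Tp) + eps            # current fitted means
    # Multiplicative KL update of W with Tp held fixed
    W <- W * sweep((data / mu) %*% Tp, 2, col_tot, "/")
  }
  mu <- W %*% t(Tp) + eps
  # Assumed Poisson log-likelihood of the counts given the fitted means
  loglh <- sum(dpois(data, lambda = mu, log = TRUE))
  list(W = W, loglh = loglh)
}

# Small simulated example (all values are made up for illustration)
set.seed(1)
n <- 6; p <- 12; r <- 3
Tp <- matrix(runif(p * r, 1, 10), nrow = p)
W_true <- matrix(runif(n * r), nrow = n)
counts <- matrix(rpois(n * p, W_true %*% t(Tp)), nrow = n)

fit <- fit_weights(counts, Tp)
dim(fit$W)   # n by r
fit$loglh
```

Because only W is updated, the same fixed Tp can be applied to training and test samples, which is what makes the resulting weights comparable across the two sets.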
Yun Cai, Hong Gu and Toby Kenney
Learning Microbial Community Structures with Supervised and Unsupervised Non-negative Matrix Factorization
##an example of classification based on supervised nmf results
#spdata consists of two classes, the first 40 samples are from class 1 and the rest from class 2
##label each observation's class as 1 or 0
y=c(rep(1,4),rep(0,4))
##split the data: half as training data, the other half as test data
y.train=y.test=c(rep(1,2),rep(0,2))
spdata.train=spdata[c(1:2,41:42),]
spdata.test=spdata[c(21:22,61:62),]
#remove all zero columns
spdata.train.rm=spdata.train[,colSums(spdata.train)!=0]
#remove the same variables from test data
spdata.test.rm=spdata.test[,colSums(spdata.train)!=0]
#get feature matrix with rank 2 and 3 for the two groups
T.eg=getT(spdata.train.rm,y.train,2,3)
#get weight matrix
rs.train=spnmf(spdata.train.rm,T.eg)
w.train=rs.train$W
rs.test=spnmf(spdata.test.rm,T.eg)
w.test=rs.test$W
##the weight matrix can be used to do classification
md.train=glm(y.train~.,data=data.frame(w.train),family=binomial(link=logit))
##predict the test data
pred=predict(md.train,newdata=data.frame(w.test),type ="response")
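The predicted values returned with type = "response" are probabilities, so one more step is needed to obtain class labels. The sketch below shows that step with made-up probabilities standing in for pred; the 0.5 cut-off and the names pred_example and y_test_example are assumptions for illustration, not output of the example above.

```r
# Made-up predicted probabilities standing in for `pred` (illustration only)
pred_example <- c(0.91, 0.77, 0.12, 0.40)
# Matching true labels, laid out like y.test in the example
y_test_example <- c(1, 1, 0, 0)

# Assumed decision rule: predict class 1 when the probability exceeds 0.5
pred_class <- as.integer(pred_example > 0.5)

# Proportion of test samples classified correctly
accuracy <- mean(pred_class == y_test_example)

# Confusion table of predicted against actual labels
conf_table <- table(predicted = pred_class, actual = y_test_example)
print(conf_table)
```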