Title: | High-Dimensional Covariate-Augmented Overdispersed Multi-Study Poisson Factor Model |
---|---|
Description: | We introduce factor models designed to jointly analyze high-dimensional count data from multiple studies by extracting study-shared and study-specified factors. The models account for heterogeneous noise and overdispersion among counts, with augmented covariates. We propose an efficient variational estimation procedure for the model parameters, along with a novel criterion for selecting the number of factors and the rank of the regression coefficient matrix. For details, see Liu et al. (2024) <doi:10.48550/arXiv.2402.15071>. |
Authors: | Wei Liu [aut, cre], Qingzhi Zhong [aut] |
Maintainer: | Wei Liu <[email protected]> |
License: | GPL-3 |
Version: | 1.1 |
Built: | 2024-11-11 07:25:07 UTC |
Source: | CRAN |
Generate simulated data from covariate-augmented Poisson factor models
gendata_simu_multi2( seed = 1, nvec = c(100, 300), a_interval = c(0, 1), p = 50, d = 3, q = 3, qs = rep(2, length(nvec)), rank0 = 3, rho = c(rhoA = 1, rhoB = 1, rhoZ = 1), sigma2_eps = 1, seed.beta = 1 )
seed |
a positive integer, the random seed for reproducibility of the data generation process. |
nvec |
a vector of positive integers, specifying the sample size in each study/source. |
a_interval |
a numeric vector with two elements, specifying the range of the offset term values in each study. |
p |
a positive integer, specifying the dimension of the count variables. |
d |
a positive integer, specifying the dimension of the covariate matrix. |
q |
a positive integer, specifying the number of study-shared factors. |
qs |
a vector of positive integers, specifying the number of study-specified factors in each study. |
rank0 |
a positive integer, specifying the rank of the regression coefficient matrix. |
rho |
a numeric vector of length 3 with positive elements, specifying the signal strengths of the regression coefficient and loading matrices, respectively. |
sigma2_eps |
a positive real number, the variance of the overdispersion error. |
seed.beta |
a positive integer, the random seed that fixes the generation of the regression coefficient matrix and the loading matrices. |
Returns a list with the following components: (1) Xlist, the list of high-dimensional count matrices from the multiple studies; (2) aList, the known normalization term (offset) for each study; (3) Zlist, the list of covariate matrices; (4) bbeta0, the true regression coefficient matrix; (5) A0, the loading matrix of the study-shared factors; (6) Blist, the list of loading matrices of the study-specified factors; (7) lambdavec, the variance vector of the random error vector; (8) Flist, the list of study-shared factor matrices; (9) Hlist, the list of study-specified factor matrices; (10) rank0, the rank of the underlying regression coefficient matrix; (11) q, the number of study-shared factors; (12) qs, the numbers of study-specified factors.
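To make the roles of these components concrete, the following base-R sketch simulates counts from a log-linear Poisson model with an offset, covariates, shared and study-specific factors, and Gaussian overdispersion noise. It is a minimal illustration under stated assumptions, not the generator used by gendata_simu_multi2(): the model form (counts drawn as Poisson with mean a * exp(Z beta' + F A' + H B' + eps)), the p-by-d orientation of the coefficient matrix, the scaling constants, and the helper name sim_one_study are all assumptions made here.

```r
# Illustrative base-R sketch; NOT the package's internal generator.
set.seed(1)
nvec <- c(100, 300)            # sample size in each study
p <- 50; d <- 3; q <- 3        # count dimension, covariate dimension, shared factors
qs <- c(2, 2); rank0 <- 2      # study-specific factors, coefficient rank
sigma2_eps <- 1                # variance of the overdispersion error

# A rank-rank0 coefficient matrix (assumed p x d), built as a product of thin matrices
bbeta0 <- matrix(rnorm(p * rank0), p, rank0) %*% matrix(rnorm(rank0 * d), rank0, d) / 4
A0 <- matrix(rnorm(p * q), p, q) / 4   # loadings shared by all studies

# Hypothetical helper: simulate one study with n samples and q_s specific factors
sim_one_study <- function(n, q_s) {
  Z  <- cbind(1, matrix(rnorm(n * (d - 1)), n, d - 1))  # covariates with intercept
  Fm <- matrix(rnorm(n * q), n, q)                      # study-shared factors
  H  <- matrix(rnorm(n * q_s), n, q_s)                  # study-specific factors
  B  <- matrix(rnorm(p * q_s), p, q_s) / 4              # study-specific loadings
  a  <- exp(runif(n, 0, 1))                             # assumed multiplicative offset
  eps <- matrix(rnorm(n * p, sd = sqrt(sigma2_eps)), n, p)
  eta <- Z %*% t(bbeta0) + Fm %*% t(A0) + H %*% t(B) + eps
  # a recycles down the rows, so row i of exp(eta) is scaled by a[i]
  X <- matrix(rpois(n * p, lambda = a * exp(eta)), n, p)
  list(X = X, Z = Z, a = a, F = Fm, H = H, B = B)
}

dat <- Map(sim_one_study, nvec, qs)
str(lapply(dat, function(s) dim(s$X)))
```

The Gaussian term eps inside the exponent is what produces overdispersion relative to a plain Poisson model in this sketch; setting sigma2_eps to a very small value would bring the counts close to ordinary Poisson variation.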
seed <- 1; nvec <- c(100,300); p <- 300; d <- 3; q <- 3; qs <- rep(2,2)
datlist <- gendata_simu_multi2(seed=seed, nvec=nvec, p=p, d=d, q=q, qs=qs)
str(datlist)
Fit the multi-study covariate-augmented linear factor model via variational inference
MSFRVI( XList, ZList, q = 15, qs = rep(2, length(XList)), rank_use = NULL, aList = NULL, epsELBO = 1e-05, maxIter = 30, verbose = TRUE, seed = 1 )
XList |
a length-M list with each component a matrix, the observed response matrix from each source/study. Ideally, each matrix should be continuous-valued. |
ZList |
a length-M list with each component a matrix, the covariate matrix from each study. |
q |
an optional integer, specifying the number of study-shared factors; defaults to 15. |
qs |
an integer vector of length M, specifying the number of study-specified factors in each study; defaults to 2 for every study. |
rank_use |
an optional integer, specifying the rank of the regression coefficient matrix; defaults to NULL, in which case the rank is the dimension of the covariates in Z. |
aList |
an optional length-M list with each component a vector, the normalization factors of each study; defaults to a vector of ones for each study. |
epsELBO |
an optional positive value, the tolerance for the relative change of the evidence lower bound (ELBO); defaults to '1e-5'. |
maxIter |
the maximum number of iterations of the VEM algorithm; defaults to 30. |
verbose |
a logical value, whether to print progress information at each iteration. |
seed |
an optional integer, specifying the random seed for reproducibility of the initialization. |
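The interplay of epsELBO and maxIter can be pictured with a small base-R sketch of a relative-change stopping rule. This is an assumption about how such a tolerance is typically applied, not the package's internal code; the helper has_converged and the toy ELBO sequence are hypothetical.

```r
# Hypothetical helper: TRUE when the relative ELBO change falls below the tolerance.
has_converged <- function(elbo_new, elbo_old, eps = 1e-5) {
  abs(elbo_new - elbo_old) / abs(elbo_old) < eps
}

# Toy "ELBO" sequence that increases monotonically toward -100
maxIter <- 30
elbo_seq <- -100 - 50 * 0.5^(0:(maxIter - 1))

# Iterate until the relative change is small or maxIter is reached
stopped_at <- maxIter
for (iter in 2:maxIter) {
  if (has_converged(elbo_seq[iter], elbo_seq[iter - 1])) {
    stopped_at <- iter
    break
  }
}
stopped_at
```

Under a rule of this kind, a larger epsELBO stops earlier with a looser fit, while maxIter caps the run time when the bound is still changing.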
Returns a list with the following components: (1) F, a list of the posterior estimates of the study-shared factor matrix for each study; (2) H, a list of the posterior estimates of the study-specified factor matrix for each study; (3) Sf, a list of the posterior estimates of the covariance matrix of the study-shared factors for each study; (4) Sh, a list of the posterior estimates of the covariance matrix of the study-specified factors for each study; (5) A, the loading matrix corresponding to the study-shared factors; (6) B, a list of the loading matrices corresponding to the study-specified factors; (7) bbeta, the estimated regression coefficient matrix; (8) invLambda, the inverse of the estimated error variances; (9) ELBO, the ELBO value when the algorithm stops; (10) ELBO_seq, the sequence of ELBO values; (11) qrlist, the number of factors and the rank of the regression coefficient matrix used in fitting; (12) time.use, the elapsed time for model fitting.
seed <- 1; nvec <- c(100,300); p <- 300; d <- 3; q <- 3; qs <- rep(2,2)
datlist <- gendata_simu_multi2(seed=seed, nvec=nvec, p=p, d=d, q=q, qs=qs)
XList <- lapply(datlist$Xlist, function(x) log(1+x))
fit_msfavi <- MSFRVI(XList, ZList = datlist$Zlist, q=q, qs=qs, rank_use = d)
str(fit_msfavi)
Fit the high-dimensional multi-study covariate-augmented overdispersed Poisson factor model via variational inference.
MultiCOAP( XcList, ZList, q = 15, qs = rep(2, length(XcList)), rank_use = NULL, aList = NULL, init = c("MSFRVI", "LFM"), epsELBO = 1e-05, maxIter = 30, verbose = TRUE, seed = 1 )
XcList |
a length-M list with each component a count matrix, the observed count matrix from each source/study. |
ZList |
a length-M list with each component a matrix, the covariate matrix from each study. |
q |
an optional integer, specifying the number of study-shared factors; defaults to 15. |
qs |
an integer vector of length M, specifying the number of study-specified factors in each study; defaults to 2 for every study. |
rank_use |
an optional integer, specifying the rank of the regression coefficient matrix; defaults to NULL, in which case the rank is the dimension of the covariates in Z. |
aList |
an optional length-M list with each component a vector, the normalization factors of each study; defaults to a vector of ones for each study. |
init |
an optional string, specifying the initialization method; defaults to "MSFRVI". |
epsELBO |
an optional positive value, the tolerance for the relative change of the evidence lower bound (ELBO); defaults to '1e-5'. |
maxIter |
the maximum number of iterations of the VEM algorithm; defaults to 30. |
verbose |
a logical value, whether to print progress information at each iteration. |
seed |
an optional integer, specifying the random seed for reproducibility of the initialization. |
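Because XcList, ZList and aList must line up study by study, a small pre-flight check can catch shape mismatches before fitting. The base-R sketch below is a hypothetical helper (check_inputs is not part of the package) and assumes that rows index samples and columns index variables, as in the simulated data; it also fills in the all-ones default for aList.

```r
# Hypothetical pre-flight check; not part of the package API.
# Assumes rows are samples and columns are variables.
check_inputs <- function(XcList, ZList, aList = NULL) {
  stopifnot(is.list(XcList), is.list(ZList), length(XcList) == length(ZList))
  # Default offsets: a vector of ones for every study
  if (is.null(aList)) {
    aList <- lapply(XcList, function(x) rep(1, nrow(x)))
  }
  stopifnot(length(aList) == length(XcList))
  for (s in seq_along(XcList)) {
    X <- XcList[[s]]
    stopifnot(
      nrow(X) == nrow(ZList[[s]]),     # same samples in counts and covariates
      length(aList[[s]]) == nrow(X),   # one offset per sample
      all(X >= 0), all(X == round(X))  # non-negative integer counts
    )
  }
  # All studies must measure the same number of count variables
  if (length(unique(vapply(XcList, ncol, integer(1)))) != 1) {
    stop("all count matrices must have the same number of columns")
  }
  invisible(aList)
}

# Toy usage with two small studies
set.seed(1)
XcList <- list(matrix(rpois(20, 3), 5, 4), matrix(rpois(32, 3), 8, 4))
ZList  <- list(cbind(1, rnorm(5)), cbind(1, rnorm(8)))
aList  <- check_inputs(XcList, ZList)
lengths(aList)
```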
If init="MSFRVI", the results from the multi-study linear factor model are used as initial values; if init="LFM", the results from a linear factor model fitted to the combined data from all studies are used as initial values.
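As a rough picture of what a pooled linear-factor initialization could look like, the base-R sketch below stacks log(1 + x)-transformed counts from all studies, removes the covariate effect by least squares, and takes the leading singular vectors of the residuals as initial shared factors and loadings. Every step here is an assumption chosen for illustration (the transform, the least-squares step, the scaling, and the names F_init and A_init); the package's actual "LFM" routine may differ.

```r
# Illustrative sketch of a pooled linear-factor initialization; assumptions throughout.
set.seed(1)
XcList <- list(matrix(rpois(100 * 20, 5), 100, 20),
               matrix(rpois(150 * 20, 5), 150, 20))
ZList  <- list(cbind(1, rnorm(100)), cbind(1, rnorm(150)))
q <- 3

# Pool all studies after a log(1 + x) transform
Y <- do.call(rbind, lapply(XcList, function(x) log(1 + x)))
Z <- do.call(rbind, ZList)

# Least-squares coefficients (d x p) and residuals
beta_init <- solve(crossprod(Z), crossprod(Z, Y))
R <- Y - Z %*% beta_init

# Leading q singular vectors give initial factors (n x q) and loadings (p x q)
sv <- svd(R, nu = q, nv = q)
n_total <- nrow(R)
F_init <- sqrt(n_total) * sv$u
A_init <- sv$v %*% diag(sv$d[1:q], q) / sqrt(n_total)

# Split the pooled factor scores back into one matrix per study
idx <- rep(seq_along(XcList), vapply(XcList, nrow, integer(1)))
F_list <- lapply(seq_along(XcList), function(s) F_init[idx == s, , drop = FALSE])
vapply(F_list, nrow, integer(1))
```

In this sketch F_init %*% t(A_init) is the best rank-q approximation of the residual matrix, which is why it is a natural starting point for the shared part of the model.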
Returns a list with the following components: (1) F, a list of the posterior estimates of the study-shared factor matrix for each study; (2) H, a list of the posterior estimates of the study-specified factor matrix for each study; (3) Sf, a list of the posterior estimates of the covariance matrix of the study-shared factors for each study; (4) Sh, a list of the posterior estimates of the covariance matrix of the study-specified factors for each study; (5) A, the loading matrix corresponding to the study-shared factors; (6) B, a list of the loading matrices corresponding to the study-specified factors; (7) bbeta, the estimated regression coefficient matrix; (8) invLambda, the inverse of the estimated error variances; (9) ELBO, the ELBO value when the algorithm stops; (10) ELBO_seq, the sequence of ELBO values; (11) qrlist, the number of factors and the rank of the regression coefficient matrix used in fitting; (12) time.use, the elapsed time for model fitting.
seed <- 1; nvec <- c(100,300); p <- 300; d <- 3; q <- 3; qs <- rep(2,2)
datlist <- gendata_simu_multi2(seed=seed, nvec=nvec, p=p, d=d, q=q, qs=qs)
fit_mcoap <- MultiCOAP(datlist$Xlist, ZList = datlist$Zlist, q=q, qs=qs, rank_use = d)
str(fit_mcoap)