Title: Expectation-Maximization Algorithm for Multivariate Normal (Gaussian) with Missing Data
Description: Initially designed to distribute code for estimating the Gaussian graphical model with Lasso regularization, also known as the graphical lasso (glasso), using an Expectation-Maximization (EM) algorithm based on work by Städler and Bühlmann (2012) <doi:10.1007/s11222-010-9219-7>. As a byproduct, code for estimating means and covariances (or the precision matrix) under a multivariate normal (Gaussian) distribution is also available.
Authors: Carl F. Falk [cre, aut]
Maintainer: Carl F. Falk <[email protected]>
License: GPL (>= 3)
Version: 0.2.1
Built: 2024-10-31 06:52:01 UTC
Source: CRAN
EM algorithm for multivariate normal, covariance matrix parameterization
em.cov(
  dat,
  max.iter = 500,
  tol = 1e-05,
  start = c("diag", "pairwise", "listwise", "full"),
  debug = 0,
  ...
)
dat: Data frame or matrix that contains the raw data.
max.iter: Maximum number of EM cycles.
tol: Tolerance for change in parameter estimates across EM cycles. If all changes are less than tol, the algorithm is considered converged and stops.
start: Starting value method (see Details).
debug: (Experimental) Set an integer value > 1 for some information as the algorithm runs.
...: Space for additional arguments, not currently used.
This function computes all means and covariances among a set of variables using the Expectation-Maximization (EM) algorithm to handle missing values, assuming multivariate normality. The EM code was originally developed for the precision matrix parameterization (em.prec), i.e., the parameters are the means and the inverse of the covariance matrix, but it is easily modified to handle a covariance matrix parameterization in which the means and covariances are the model parameters.
For starting values, the function accepts either a list with mu and S slots corresponding to the starting mean vector and covariance matrix, which is useful if the user would like to supply custom starting values, or a character string corresponding to any of the options available in the startvals.cov function, which will then be used to guess starting values.
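As a small illustration of the list-based option (a sketch only: it assumes the slots are named mu and S as described above, and the particular starting values are arbitrary):

library(psych)
data(bfi)
X <- bfi[, 1:25]

# Custom starting values (illustrative): univariate means and a diagonal
# covariance matrix built from the observed variances
svals <- list(mu = colMeans(X, na.rm = TRUE),
              S  = diag(apply(X, 2, var, na.rm = TRUE)))

fit <- em.cov(X, start = svals)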
A list with the following:

- p.est: all parameter estimates as a vector (means followed by the unique elements of the covariance matrix).
- mu: estimated means.
- S: estimated covariance matrix.
- it: number of EM cycles completed.
- conv: logical value indicating convergence (TRUE) or not (FALSE).
library(psych)
data(bfi)
test <- em.cov(bfi[, 1:25])
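The documented return slots can then be inspected directly, for example:

test$conv                  # TRUE if the algorithm converged
test$it                    # number of EM cycles completed
test$mu                    # estimated means
round(cov2cor(test$S), 2)  # correlation matrix implied by the estimated covariances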
EM algorithm for multivariate normal, precision matrix parameterization
em.prec(
  dat,
  max.iter = 500,
  tol = 1e-05,
  start = c("diag", "pairwise", "listwise", "full"),
  glassoversion = c("none", "glassoFast", "glasso", "glassonostart"),
  debug = 0,
  ...
)
dat: Data frame or matrix that contains the raw data.
max.iter: Maximum number of EM cycles.
tol: Tolerance for change in parameter estimates across EM cycles. If all changes are less than tol, the algorithm is considered converged and stops.
start: Starting value method (see Details).
glassoversion: Character indicating whether to do regularization (lasso), and if so, using which package: "glasso" uses the glasso package, "glassoFast" uses the glassoFast package, "glassonostart" uses glasso without warm starts, and "none" does no regularization.
debug: (Experimental) Set an integer value > 1 for some information as the algorithm runs.
...: Arguments passed down to any of the glasso functions.
This function computes all means and the precision matrix (inverse of the covariance matrix) among a set of variables using the Expectation-Maximization (EM) algorithm to handle missing values, assuming multivariate normality. The EM code was originally developed based on Städler and Bühlmann (2012) for use with the graphical lasso (i.e., glasso). This version allows the possibility of using a lasso by specifying something other than "none" for glassoversion. However, it can also be used without regularization to simply estimate the precision matrix.
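A rough sketch of a regularized call follows; it assumes, consistent with the ... argument described above, that the tuning parameter rho is simply forwarded to the chosen glasso routine (it is not a verbatim excerpt from the package documentation, and rho = 0.05 is an arbitrary illustrative value):

library(psych)
data(bfi)

# Precision matrix estimation with lasso regularization at the M-step
# (assumes rho is passed through ... to the glassoFast routine)
fit.reg <- em.prec(bfi[, 1:25], glassoversion = "glassoFast", rho = 0.05)
fit.reg$K  # estimated (sparse) precision matrix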
For starting values for the EM algorithm itself (not at the M-step), the function accepts either a list with mu and S slots corresponding to the starting mean vector and covariance matrix, which is useful if the user would like to supply custom starting values, or a character string corresponding to any of the options available in the startvals.cov function, which will then be used to guess starting values.
A list with the following:

- p.est: all parameter estimates as a vector (means followed by the unique elements of the precision matrix).
- mu: estimated means.
- S: estimated covariance matrix.
- K: estimated precision matrix.
- it: number of EM cycles completed.
- conv: logical value indicating convergence (TRUE) or not (FALSE).
Städler, N., & Bühlmann, P. (2012). Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, 22, 219–235. doi:10.1007/s11222-010-9219-7
library(psych)
data(bfi)
test <- em.prec(bfi[, 1:25])
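Because the precision matrix K is returned, standardized partial correlations can be obtained from it with the usual conversion (standard GGM algebra, not a function provided by this package):

K <- test$K
# partial correlation between i and j is -K[i, j] / sqrt(K[i, i] * K[j, j])
pcor <- -cov2cor(K)
diag(pcor) <- 1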
Regularized GGM under missing data with EBIC or k-fold cross validation
EMggm(
  dat,
  max.iter = 500,
  est.tol = 1e-07,
  start = c("diag", "pairwise", "listwise", "full"),
  glassoversion = c("glasso", "glassoFast", "glassonostart", "none"),
  rho = 0,
  rhoselect = c("ebic", "kfold"),
  N = NULL,
  gam = 0.5,
  zero.tol = 1e-10,
  k = 5,
  seed = NULL,
  debug = 0,
  convfail = FALSE,
  ...
)
dat: A data frame or matrix of the raw data.
max.iter: Maximum number of EM cycles for the EM algorithm, eventually passed to em.prec.
est.tol: Tolerance for change in parameter estimates across EM cycles. If all changes are less than est.tol, the algorithm is considered converged and stops.
start: Starting value method (see Details of em.prec).
glassoversion: Character indicating which function to use at the M-step for regularization; see em.prec.
rho: Vector of tuning parameter values. Defaults to a single value of 0 (i.e., no regularization).
rhoselect: Method of selecting the tuning parameter. "ebic" selects the value in rho with the best (lowest) EBIC; "kfold" selects the value with the best k-fold cross-validation criterion.
N: Sample size to use in any EBIC calculations. Defaults to the average pairwise sample size.
gam: Value of gamma in EBIC calculations. A typical choice, and the default, is .5.
zero.tol: Tolerance in EBIC calculations for declaring edges to be zero. Anything below zero.tol in absolute value is considered zero.
k: If using k-fold cross validation, an integer specifying the number of folds. Defaults to 5.
seed: Random number seed passed to the function that does k-fold cross validation. Use if you want folds to be more or less replicable.
debug: (Experimental) Pick an integer above 0 for some messages that may aid in debugging.
convfail: (Experimental) What to do if optimization fails. If TRUE, the partial correlation matrix is filled in with 0s.
...: Other arguments passed down to em.prec.
This function will estimate a regularized Gaussian graphical model (GGM) using either EBIC or k-fold cross validation to select the tuning parameter. It was written as a wrapper to the em.prec function, which is an implementation of the EM algorithm from Städler & Bühlmann (2012); those authors also used k-fold cross validation, which is implemented here. For the tuning parameter (rho), typically a grid of values is evaluated, and the one that results in the best EBIC or cross-validation metric is selected and returned as the final model. See rhogrid and the examples for some ways to pick these candidate tuning parameter values. This function is intended to be compatible (to my knowledge) with estimateNetwork in the bootnet package, in that one can input it as a custom function and then have all of the benefits of plotting, centrality indices, and so on, that bootnet provides.
Most methods in the examples were studied by Falk and Starr (under review). In particular, use of this function in conjunction with estimateNetwork worked well with both "ebic" and "kfold" model selection, though the original article used the glassoFast package for estimation, which by default penalizes the diagonal of the precision matrix. It seems slightly more common not to penalize the diagonal; therefore, the examples below use glasso. This approach performed well but occasionally got stuck while trying to find an optimal solution. In addition, the two-stage approach studied by Falk and Starr (under review) also performed well, though not as well as the present function; that approach is available in the bootnet package. Note also that cglasso has an implementation of Städler & Bühlmann (2012), but we found in simulations with a high proportion of missing data that our implementation was less likely to encounter estimation problems.
A list with the following elements:

- results: Contains the "best" or selected model, with elements mostly following those of em.prec's output. Additional elements include rho and crit, which are the grid of tuning parameter values and the corresponding criterion values; these are added here because, if used with bootnet, the other slots below may not be saved.
- rho: Grid for rho.
- crit: Vector of the same length as the grid for rho with the criterion (EBIC or cross-validation). Smaller is better.
- graph: Estimated partial correlation matrix; bootnet appears to expect this in order to do plots, compute centrality indices, and so on.
Evidence that EBIC and k-fold work well with this package (cite if you use the EMglasso R package): Falk, C. F., & Starr, J. (2023, July 19). Regularized cross-sectional network modeling with missing data: A comparison of methods. Preprint: doi:10.31219/osf.io/dk6zv
Original publication on use of the EM algorithm and k-fold cross-validation for glasso: Städler, N., & Bühlmann, P. (2012). Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, 22, 219–235. doi:10.1007/s11222-010-9219-7
If you use this package with bootnet: Epskamp, S., Borsboom, D., & Fried, E. I. (2018). Estimating psychological networks and their accuracy: A tutorial paper. Behavior Research Methods, 50(1), 195–212. doi:10.3758/s13428-017-0862-1
If you also use this package with qgraph: Epskamp, S., Cramer, A., Waldorp, L., Schmittmann, V. D., & Borsboom, D. (2012). qgraph: Network visualizations of relationships in psychometric data. Journal of Statistical Software, 48 (1), 1-18.
Foundational article on glasso: Friedman, J. H., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9 (3), 432-441.
Foundational article on use of EBIC with glasso: Foygel, R., & Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. Advances in Neural Information Processing Systems, 23.
If glasso is also the estimation method: Friedman, J. H., Hastie, T., & Tibshirani, R. (2014). glasso: Graphical lasso estimation of Gaussian graphical models. Retrieved from https://CRAN.R-project.org/package=glasso
If glassoFast is also the estimation method: Sustik M.A., Calderhead B. (2012). GLASSOFAST: An efficient GLASSO implementation. UTCS Technical Report TR-12-29:1-3.
library(psych)
library(bootnet)
data(bfi)

# Regularized estimation with just a couple of tuning parameter values
# EBIC
rho <- seq(.01, .5, length.out = 50)
ebic1 <- EMggm(bfi[, 1:25], rho = rho, glassoversion = "glasso",
               rhoselect = "ebic")
# k-fold
kfold1 <- EMggm(bfi[, 1:25], rho = rho, glassoversion = "glasso",
                rhoselect = "kfold")

plot(rho, ebic1$crit)  # values of EBIC along grid
plot(rho, kfold1$crit) # values of kfold along grid

# partial correlation matrix
# ebic1$graph
# kfold1$graph

# Integration with bootnet package
ebic2 <- estimateNetwork(bfi[, 1:25], fun = EMggm, rho = rho,
                         glassoversion = "glasso", rhoselect = "ebic")
kfold2 <- estimateNetwork(bfi[, 1:25], fun = EMggm, rho = rho,
                          glassoversion = "glasso", rhoselect = "kfold")

# ebic2 and kfold2 now do just about anything one could normally do with an
# object returned from estimateNetwork, e.g., plotting
plot(ebic2)
plot(kfold2)

# e.g., centrality indices
library(qgraph)
centralityPlot(ebic2)

# and so on

# Other ways to pick grid for tuning parameter... 100 values using the method
# that qgraph basically uses; requires an estimate of the covariance matrix.
# Here the covariance matrix is obtained directly from the data.
rho <- rhogrid(100, method = "qgraph", dat = bfi[, 1:25])

ebic3 <- estimateNetwork(bfi[, 1:25], fun = EMggm, rho = rho,
                         glassoversion = "glasso", rhoselect = "ebic")
kfold3 <- estimateNetwork(bfi[, 1:25], fun = EMggm, rho = rho,
                          glassoversion = "glasso", rhoselect = "kfold")

# look at tuning parameter values vs. criterion
plot(ebic3$result$rho, ebic3$result$crit)
plot(kfold3$result$rho, kfold3$result$crit)

# EBIC with bootnet; does listwise deletion by default
ebic.listwise <- estimateNetwork(bfi[, 1:25], default = "EBICglasso")

# EBIC with bootnet; two-stage estimation
# Note the constant added to bfi tricks lavaan into thinking the data are
# not categorical
ebic.ts <- estimateNetwork(bfi[, 1:25] + 1e-10, default = "EBICglasso",
                           corMethod = "cor_auto", missing = "fiml")
Create sequence of possible tuning parameter values
rhogrid(
  n.rho,
  method = c("qgraph", "glassopath"),
  rho.min.ratio = 0.01,
  dat = NULL,
  S = NULL,
  ...
)
n.rho: Integer corresponding to the number of tuning parameter values.
method: Character corresponding to the method used to create tuning parameter values ("qgraph" or "glassopath"); see Details.
rho.min.ratio: Numeric value that mimics the analogous argument in qgraph: the smallest tuning parameter value is rho.min.ratio times the maximum value.
dat: The raw data, used if S is not provided.
S: Covariance matrix for the data. If provided, supersedes dat.
...: Other arguments passed down to em.prec (used when the covariance matrix must be estimated from dat; see Details).
For regularized estimation of the Gaussian graphical model, a sequence or grid of possible tuning parameter values is often tried, and the tuning parameter that optimizes some criterion (EBIC, k-fold cross validation) is chosen. This function attempts to automate some of the sequence creation. Code is borrowed from qgraph and glasso, acknowledged in the references below. Both approaches require some estimate of the covariance matrix in order to construct the sequence; if one is not provided, em.prec with default options is attempted.
For "qgraph" the max value is determined by the maximum absolute value of the
difference between the covariance matrix and an identity matrix. The min is rho.min.ratio
times the max value. A sequence that is equally spaced on a log scale is then constructed between
these two values.
For "glassopath", the max value is the max absolute value of the covariance matrix. The min is the max divided by the number of desired tuning parameter values. A sequence that is equally spaced between these two values is then constructed.
A vector of possible tuning parameter values.
qgraph: Epskamp, S., Cramer, A., Waldorp, L., Schmittmann, V. D., & Borsboom, D. (2012). qgraph: Network visualizations of relationships in psychometric data. Journal of Statistical Software, 48 (1), 1-18.
glasso (i.e., glassopath option): Friedman, J. H., Hastie, T., & Tibshirani, R. (2014). glasso: Graphical lasso estimation of Gaussian graphical models. Retrieved from https://CRAN.R-project.org/package=glasso
library(psych)
data(bfi)

# pick 50 values using the approach qgraph uses; give data as input
rho <- rhogrid(50, method = "qgraph", dat = bfi[, 1:25])

emresult <- em.cov(bfi[, 1:25])
S <- emresult$S

# pick 50 values using the approach glasso uses; give S as input
rho2 <- rhogrid(50, method = "glassopath", S = S)
Starting values for means and covariances
startvals.cov(dat, start = c("diag", "pairwise", "listwise", "full"))
dat: Data frame or matrix that contains the raw data.
start: Starting value method (see Details).
Attempts to figure out starting values for the means and covariances for use with other functions that do the EM algorithm, such as em.prec or em.cov. Note that means are determined univariately using all available cases. For covariances, several options are available (a brief sketch comparing a couple of them appears after the example below):

- "diag" Use all available complete values to compute the variances of each variable and construct a diagonal covariance matrix.
- "pairwise" Pairwise (co)variances will be used to construct the starting covariance matrix.
- "listwise" Listwise deletion will be used, and only cases with complete data will contribute to the starting covariance matrix.
- "full" Cheat and use lavaan to obtain direct maximum likelihood estimates of the covariances. This defeats the purpose to some extent, but note that lavaan may be quite slow compared to this implementation.
A list consisting of:

- mustart: starting values for means.
- covstart: starting values for covariances.
library(psych)
data(bfi)
startvals.cov(bfi[, 1:25])
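Building on the example above, the different start options can be compared side by side (a brief sketch; it uses only the documented start options and return slots):

library(psych)
data(bfi)

sv.diag     <- startvals.cov(bfi[, 1:25], start = "diag")
sv.pairwise <- startvals.cov(bfi[, 1:25], start = "pairwise")

# The "diag" method starts with zero covariances; "pairwise" generally does not
round(sv.diag$covstart[1:3, 1:3], 2)
round(sv.pairwise$covstart[1:3, 1:3], 2)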