Title: | Bayesian Inference for Factor Modeling |
---|---|
Description: | Collection of procedures to perform Bayesian analysis on a variety of factor models. Currently, it includes: "Bayesian Exploratory Factor Analysis" (befa) from G. Conti, S. Frühwirth-Schnatter, J.J. Heckman, R. Piatek (2014) <doi:10.1016/j.jeconom.2014.06.008>, an approach to dedicated factor analysis with stochastic search on the structure of the factor loading matrix. The number of latent factors, as well as the allocation of the manifest variables to the factors, are not fixed a priori but determined during MCMC sampling. |
Authors: | Rémi Piatek [aut, cre] |
Maintainer: | Rémi Piatek <[email protected]> |
License: | GPL-3 |
Version: | 0.1.7 |
Built: | 2024-12-11 07:10:46 UTC |
Source: | CRAN |
The long-term goal of this package is to provide a collection of procedures to perform Bayesian inference on a variety of factor models.
Currently, this package includes: Bayesian Exploratory Factor
Analysis (befa
), as developed in Conti et al. (2014), an approach to
dedicated factor analysis with stochastic search on the structure of the
factor loading matrix. The number of latent factors, as well as the
allocation of the observed variables to the factors, are not fixed a priori
but determined during MCMC sampling. More approaches will be included in
future releases of this package.
You are very welcome to send me any comments or suggestions for improvements, and to share with me any problems you may encounter with the use of this package.
Rémi Piatek [email protected]
G. Conti, S. Frühwirth-Schnatter, J.J. Heckman, R. Piatek (2014): “Bayesian Exploratory Factor Analysis”, Journal of Econometrics, 183(1), pages 31-57, doi:10.1016/j.jeconom.2014.06.008.
This function implements the Bayesian Exploratory Factor Analysis
(befa
) approach developed in Conti et al. (CFSHP, 2014). It runs a
MCMC sampler for a factor model with dedicated factors, where each manifest
variable is allowed to load on at most one latent factor. The allocation of
the manifest variables to the latent factors is not fixed a priori but
determined stochastically during sampling. The minimum number of variables
dedicated to each factor can be controlled by the user to achieve the desired
level of identification. The manifest variables can be continuous or
dichotomous, and control variables can be introduced as covariates.
befa(model, data, burnin = 1000, iter = 10000, Nid = 3, Kmax, A0 = 10, B0 = 10, c0 = 2, C0 = 1, HW.prior = TRUE, nu0 = Kmax + 1, S0 = 1, kappa0 = 2, xi0 = 1, kappa = 1/Kmax, indp.tau0 = TRUE, rnd.step = TRUE, n.step = 5, search.delay = min(burnin, 10), R.delay = min(burnin, 100), dedic.start, alpha.start, sigma.start, beta.start, R.start, verbose = TRUE)
befa(model, data, burnin = 1000, iter = 10000, Nid = 3, Kmax, A0 = 10, B0 = 10, c0 = 2, C0 = 1, HW.prior = TRUE, nu0 = Kmax + 1, S0 = 1, kappa0 = 2, xi0 = 1, kappa = 1/Kmax, indp.tau0 = TRUE, rnd.step = TRUE, n.step = 5, search.delay = min(burnin, 10), R.delay = min(burnin, 100), dedic.start, alpha.start, sigma.start, beta.start, R.start, verbose = TRUE)
model |
This argument specifies the manifest variables and the covariates used in the model (if any). It can be specified in two different ways:
Binary manifest variables should be specified as logical vectors in
the data frame to be treated as dichotomous. |
data |
Data frame. If missing, parent data frame if used. |
burnin |
Burn-in period of the MCMC sampler. |
iter |
Number of MCMC iterations saved for posterior inference (after burn-in). |
Nid |
Minimum number of manifest variables dedicated to each latent factor for identification. |
Kmax |
Maximum number of latent factors. If missing, the maximum number of
factors that satisfies the identification condition determined by
|
A0 |
Scaling parameters of the variance of the Normal prior on the nonzero factor loadings. Either a scalar or a numeric vector of length equal to the number of manifest variables. |
B0 |
Variances of the Normal prior on the regression coefficients. Either a scalar or a numeric vector of length equal to the number of manifest variables. |
c0 |
Shape parameters of the Inverse-Gamma prior on the idiosyncratic variances. Either a scalar or a numeric vector of length equal to the number of manifest variables. |
C0 |
Scale parameters of the Inverse-Gamma prior on the idiosyncratic variances. Either a scalar or a numeric vector of length equal to the number of manifest variables. |
HW.prior |
If |
nu0 |
Degrees of freedom of the Inverse-Wishart prior on the covariance matrix of the latent factors in the expanded model. |
S0 |
Scale parameters of the Inverse-Wishart prior on the covariance matrix of latent factors in the expanded model:
Either a scalar or a numeric vector of length equal to |
kappa0 |
First shape parameter of the Beta prior distribution on the
probability |
xi0 |
Second shape parameter of the Beta prior distribution on the
probability |
kappa |
Concentration parameters of the Dirichlet prior distribution on the
indicators. Either a scalar or a numeric vector of length equal to
|
indp.tau0 |
If |
rnd.step |
If |
n.step |
Controls the number of intermediate steps in non-identified models:
|
search.delay |
Number of MCMC iterations run with fixed indicator matrix (specified
in |
R.delay |
Number of MCMC iterations run with fixed correlation matrix (specified
in |
dedic.start |
Starting values for the indicators. Vector of integers of length equal
to the number of manifest variables. Each element takes a value among
0, 1, ..., |
alpha.start |
Starting values for the factor loadings. Numeric vector of length
equal to the number of manifest variables. If missing, sampled from a
Normal distribution with zero mean and variance |
sigma.start |
Starting values for the idiosyncratic variances. Numeric vector of length equal to the number of manifest variables. Sampled from prior if missing. |
beta.start |
Starting values for the regression coefficients. Numeric vector of length equal to the total number of regression coefficients, concatenated for all the manifest variables. Sampled from prior if missing. |
R.start |
Starting values for the correlation matrix of the latent factors.
Numeric matrix with |
verbose |
If |
Model specification. The model is specified as follows, for
each observation :
where is the
-vector containing the latent
variables underlying the corresponding
manifest variables
, which can be continuous such that
, or binary with
, for
.
The
-vector
contains the latent factors, and
is the
-matrix of factor loadings.
The
-vector
is the vector of error terms.
Covariates can be included in the
-vector
and are related to
the manifest variables through the
-matrix of
regression coefficients
. Intercept terms are automatically
included, but can be omitted in some or all equations using the usual syntax
for R formulae (e.g., 'Y1 ~ X1 - 1' specifies that that Y1 is regressed on X1
and no intercept is included in the corresponding equation).
The number of latent factors is specified as
Kmax
. However,
during MCMC sampling the stochastic search process on the matrix
may produce zero columns, thereby reducing the number of active factors.
The covariance matrix of the latent factors is assumed to be a
correlation matrix for identification.
Each row of the factor loading matrix contains at most one
nonzero element (dedicated factor model). The allocation of the manifest
variables to the latent factors is indicated by the binary matrix
with same dimensions as
, such that each row
indicates which factor loading is different from zero, e.g.:
indicates that variable loads on the
th factor, where
is a
-vector that contains only zeros, besides its
th
element that equals 1.
Identification. The function verifies that the maximum number of
latent factors Kmax
does not exceed the Ledermann bound. It also
checks that Kmax
is consistent with the identification restriction
specified with Nid
(enough variables should be available to load on
the factors when Kmax
is reached). The default value for Kmax
is the minimum between the Ledermann bound and the maximum number of factors
that can be loaded by Nid
variables. The user is free to select the
level of identification, see CFSHP section 2.2 (non-identified models are
allowed with Nid = 1
).
Note that identification is achieved only with respect to the scale of the
latent factors. Non-identifiability problems may affect the posterior sample
because of column switching and sign switching of the factor loadings.
These issues can be addressed a posteriori with the functions
post.column.switch
and post.sign.switch
.
Prior specification.
The indicators are assumed to have the following probabilities,
for :
If indp.tau0 = FALSE
, the probabilities are specified as:
with =
kappa0
, =
xi0
and
=
kappa
.
Alternatively, if indp.tau0 = TRUE
, the probabilities are specified
as:
for each manifest variable .
A normal-inverse-Gamma prior distribution is assumed on the nonzero factor loadings and on the idiosyncratic variances:
where denotes the only nonzero loading in the
th
row of
.
For the regression coefficients, a multivariate Normal prior distribution is
assumed on each row of
:
The covariates can be different across manifest variables, implying zero
restrictions on the matrix . To specify covariates, use a list
of formulas as
model
(see example below). Intercept terms can be
introduced using
To sample the correlation matrix of the latent factors, marginal data
augmentation is implemented (van Dyk and Meng, 2001), see CFSHP section 2.2.
Using the transformation
, the
parameters
are used as
working parameters. These parameters correspond to the variances of
the latent factors in an expanded version of the model where the factors do
not have unit variances. Two prior distributions can be specified on the
covariance matrix
in the expanded model:
If HW.prior = FALSE
, inverse-Wishart distribution:
with =
nu0
and =
S0
.
If HW.prior = TRUE
, Huang-Wand (2013) prior:
with =
nu0
- Kmax
+ 1, and the shape and
rate parameters are specified such that the mean of the gamma distribution
is equal to , for each
.
Missing values. Missing values (NA
) are allowed in the
manifest variables. They are drawn from their corresponding conditional
distributions during MCMC sampling. Control variables with missing values
can be passed to the function. However, all the observations with at least
one missing value in the covariates are discarded from the sample (a warning
message is issued in that case).
The function returns an object of class 'befa
' containing the
MCMC draws of the model parameters saved in the following matrices (each
matrix has 'iter
' rows):
alpha
: Factor loadings.
sigma
: Idiosyncratic variances.
R
: Correlation matrix of the latent factors (off-diagonal
elements only).
beta
: regression coefficients (if any).
dedic
: indicators (integers indicating on which factors the
manifest variable load).
The returned object also contains:
nfac
: Vector of number of 'active' factors across MCMC
iterations (i.e., factors loaded by at least Nid
manifest
variables).
MHacc
: Logical vector indicating accepted proposals of
Metropolis-Hastings algorithm.
The parameters Kmax
and Nid
are saved as object attributes, as
well as the function call and the number of mcmc iterations (burnin
and iter
), and two logical variables indicating if the returned object
has been post processed to address the column switching problem
(post.column.switch
) and the sign switching problem
(post.sign.switch
).
Rémi Piatek [email protected]
G. Conti, S. Frühwirth-Schnatter, J.J. Heckman, R. Piatek (2014): “Bayesian Exploratory Factor Analysis”, Journal of Econometrics, 183(1), pages 31-57, doi:10.1016/j.jeconom.2014.06.008.
A. Huang, M.P. Wand (2013): “Simple Marginally Noninformative Prior Distributions for Covariance Matrices”, Bayesian Analysis, 8(2), pages 439-452, doi:10.1214/13-BA815.
D.A. van Dyk, X.-L. Meng (2001): “The Art of Data Augmentation”, Journal of Computational and Graphical Statistics, 10(1), pages 1-50, doi:10.1198/10618600152418584.
post.column.switch
and post.sign.switch
for column switching and sign switching of the factor loading matrix and of
the correlation matrix of the latent factors to restore identification
a posteriori.
summary.befa
and plot.befa
to summarize
and plot the posterior results.
simul.R.prior
and simul.nfac.prior
to
simulate the prior distribution of the correlation matrix of the factors and
the prior distribution of the indicator matrix, respectively. This is useful
to perform prior sensitivity analysis and to understand the role of the
corresponding parameters in the factor search.
#### model without covariates set.seed(6) # generate fake data with 15 manifest variables and 3 factors N <- 100 # number of observations Y <- simul.dedic.facmod(N, dedic = rep(1:3, each = 5)) # run MCMC sampler # notice: 1000 MCMC iterations for illustration purposes only, # increase this number to obtain reliable posterior results! mcmc <- befa(Y, Kmax = 5, iter = 1000) # post process MCMC draws to restore identification mcmc <- post.column.switch(mcmc) mcmc <- post.sign.switch(mcmc) summary(mcmc) # summarize posterior results plot(mcmc) # plot posterior results # summarize highest posterior probability (HPP) model summary(mcmc, what = 'hppm') #### model with covariates # generate covariates and regression coefficients Xcov <- cbind(1, matrix(rnorm(4*N), ncol = 4)) colnames(Xcov) <- c('(Intercept)', paste0('X', 1:4)) beta <- rbind(rnorm(15), rnorm(15), diag(3) %x% t(rnorm(5))) # add covariates to previous model Y <- Y + Xcov %*% beta # specify model model <- c('~ X1', # X1 covariate in all equations paste0('Y', 1:5, ' ~ X2'), # X2 covariate for Y1-Y5 only paste0('Y', 6:10, ' ~ X3'), # X3 covariate for Y6-Y10 only paste0('Y', 11:15, ' ~ X4')) # X4 covariate for Y11-Y15 only model <- lapply(model, as.formula) # make list of formulas # run MCMC sampler, post process and summarize mcmc <- befa(model, data = data.frame(Y, Xcov), Kmax = 5, iter = 1000) mcmc <- post.column.switch(mcmc) mcmc <- post.sign.switch(mcmc) mcmc.sum <- summary(mcmc) mcmc.sum # compare posterior mean of regression coefficients to true values beta.comp <- cbind(beta[beta != 0], mcmc.sum$beta[, 'mean']) colnames(beta.comp) <- c('true', 'mcmc') print(beta.comp, digits = 3)
#### model without covariates set.seed(6) # generate fake data with 15 manifest variables and 3 factors N <- 100 # number of observations Y <- simul.dedic.facmod(N, dedic = rep(1:3, each = 5)) # run MCMC sampler # notice: 1000 MCMC iterations for illustration purposes only, # increase this number to obtain reliable posterior results! mcmc <- befa(Y, Kmax = 5, iter = 1000) # post process MCMC draws to restore identification mcmc <- post.column.switch(mcmc) mcmc <- post.sign.switch(mcmc) summary(mcmc) # summarize posterior results plot(mcmc) # plot posterior results # summarize highest posterior probability (HPP) model summary(mcmc, what = 'hppm') #### model with covariates # generate covariates and regression coefficients Xcov <- cbind(1, matrix(rnorm(4*N), ncol = 4)) colnames(Xcov) <- c('(Intercept)', paste0('X', 1:4)) beta <- rbind(rnorm(15), rnorm(15), diag(3) %x% t(rnorm(5))) # add covariates to previous model Y <- Y + Xcov %*% beta # specify model model <- c('~ X1', # X1 covariate in all equations paste0('Y', 1:5, ' ~ X2'), # X2 covariate for Y1-Y5 only paste0('Y', 6:10, ' ~ X3'), # X3 covariate for Y6-Y10 only paste0('Y', 11:15, ' ~ X4')) # X4 covariate for Y11-Y15 only model <- lapply(model, as.formula) # make list of formulas # run MCMC sampler, post process and summarize mcmc <- befa(model, data = data.frame(Y, Xcov), Kmax = 5, iter = 1000) mcmc <- post.column.switch(mcmc) mcmc <- post.sign.switch(mcmc) mcmc.sum <- summary(mcmc) mcmc.sum # compare posterior mean of regression coefficients to true values beta.comp <- cbind(beta[beta != 0], mcmc.sum$beta[, 'mean']) colnames(beta.comp) <- c('true', 'mcmc') print(beta.comp, digits = 3)
This function makes different plots that are useful to assess the posterior results: a trace plot of the number of latent factors (also showing Metropolis-Hastings acceptance across MCMC replications), a histogram of the posterior probabilities of the number of factors, heatmaps for the inficator probabilities, the factor loading matrix, and the correlation matrix of the latent factors.
## S3 method for class 'befa' plot(x, ...)
## S3 method for class 'befa' plot(x, ...)
x |
Object of class 'befa'. |
... |
The following extra arguments can be specified:
|
This function makes graphs based on the summary results returned by
summary.befa
. It therefore accepts the same optional arguments
as this function.
No return value, called for side effects (plots the posterior results
returned by summary.befa
).
Rémi Piatek [email protected]
summary.befa
to summarize posterior results.
set.seed(6) # generate fake data with 15 manifest variables and 3 factors Y <- simul.dedic.facmod(N = 100, dedic = rep(1:3, each = 5)) # run MCMC sampler and post process output # notice: 1000 MCMC iterations for illustration purposes only, # increase this number to obtain reliable posterior results! mcmc <- befa(Y, Kmax = 5, iter = 1000) mcmc <- post.column.switch(mcmc) mcmc <- post.sign.switch(mcmc) # plot results for highest posterior probability model plot(mcmc, what = 'hppm')
set.seed(6) # generate fake data with 15 manifest variables and 3 factors Y <- simul.dedic.facmod(N = 100, dedic = rep(1:3, each = 5)) # run MCMC sampler and post process output # notice: 1000 MCMC iterations for illustration purposes only, # increase this number to obtain reliable posterior results! mcmc <- befa(Y, Kmax = 5, iter = 1000) mcmc <- post.column.switch(mcmc) mcmc <- post.sign.switch(mcmc) # plot results for highest posterior probability model plot(mcmc, what = 'hppm')
This function reorders the columns of the factor loading matrix for each MCMC draw, as well as the rows and columns of the correlation matrix of the factors, to restore the identification of the model a posteriori with respect to the column switching problem.
post.column.switch(mcmc)
post.column.switch(mcmc)
mcmc |
Object of class ' |
The reordering of the columns of the factor loading matrix is based
on the top elements of the columns (i.e., the first row containing a nonzero
factor loading in each nonzero column of , starting from the top
of the matrix). At each MCMC iteration, the nonzero columns of
are reordered such that the top elements appear in increasing order.
The rows and columns of the correlation matrix
of the factors are
switched accordingly. See section 4.3 of CFSHP (p.42) for more details.
Same 'befa
' object as the one passed to the function, where
the indicators in the matrix dedic
, as well as the rows and columns of
the correlation matrix of the factors saved in draws
, have been
switched appropriately to restore the identification of the factor model with
respect to column switching.
Rémi Piatek [email protected]
G. Conti, S. Frühwirth-Schnatter, J.J. Heckman, R. Piatek (2014): “Bayesian Exploratory Factor Analysis”, Journal of Econometrics, 183(1), pages 31-57, doi:10.1016/j.jeconom.2014.06.008.
post.sign.switch
to restore identification a
posteriori with respect to the sign switching problem.
set.seed(6) Y <- simul.dedic.facmod(N = 100, dedic = rep(1:3, each = 5)) mcmc <- befa(Y, Kmax = 5, iter = 1000) mcmc <- post.column.switch(mcmc)
set.seed(6) Y <- simul.dedic.facmod(N = 100, dedic = rep(1:3, each = 5)) mcmc <- befa(Y, Kmax = 5, iter = 1000) mcmc <- post.column.switch(mcmc)
This function performs a sign switch on the MCMC draws to restore the consistency of the signs of the factors loadings and of the correlations of the latent factors a posteriori.
post.sign.switch(mcmc, benchmark = NULL, benchmark.threshold = 0.5)
post.sign.switch(mcmc, benchmark = NULL, benchmark.threshold = 0.5)
mcmc |
Object of class ' |
benchmark |
Vector of integers of length equal to the maximum number of latent
factors. Each element indicates which factor loading is used as a
benchmark for the sign switch. If |
benchmark.threshold |
Minimum posterior probability for a factor loading to be considered as a benchmark. |
The signs of the factor loadings, as well as of the corresponding
correlations of the latent factors, are switched for each MCMC iteration such
that the factor loadings defined as benchmark
s are positive. The sign
switch can only be performed if post.column.switch
has been run
before. See section 4.3 (p.42) of CFSHP for more details.
If a latent factor has no benchmarks, or if its benchmark is equal to zero at some MCMC iteration, then no sign switch is performed on the corresponding loadings and correlations for this particular factor or MCMC iteration.
Note that in complicated models where the sampler visits several models with
different numbers of latent factors, it may not be relevant to use the
default value of benchmark
, as the posterior probabilities that the
factor loadings are different from zero would be computed across models.
Instead, the user might consider finding the highest posterior probability
model first, and use its top elements in each column of the factor loading
matrix as benchmarks to perform the sign switch.
This function returns the same 'befa
' object, where the signs
of the factor loadings and of the factor correlations have been switched
appropriately to restore the identification of the factor model with respect
to sign switching.
Rémi Piatek [email protected]
G. Conti, S. Frühwirth-Schnatter, J.J. Heckman, R. Piatek (2014): “Bayesian Exploratory Factor Analysis”, Journal of Econometrics, 183(1), pages 31-57, doi:10.1016/j.jeconom.2014.06.008.
post.column.switch
for column switching of the factor
loading matrix and of the correlation matrix of the latent factors to restore
identification a posteriori.
set.seed(6) Y <- simul.dedic.facmod(N = 100, dedic = rep(1:3, each = 5)) mcmc <- befa(Y, Kmax = 5, iter = 1000) mcmc <- post.column.switch(mcmc) # factor loadings corresponding to variables 1, 6, 11, 12 and 13 are # used as benchmarks: mcmc1 <- post.sign.switch(mcmc, benchmark = c(1, 6, 11, 12, 13)) # factor loadings with the highest posterior probability of being different # from zero in each column are used as benchmark: mcmc2 <- post.sign.switch(mcmc)
set.seed(6) Y <- simul.dedic.facmod(N = 100, dedic = rep(1:3, each = 5)) mcmc <- befa(Y, Kmax = 5, iter = 1000) mcmc <- post.column.switch(mcmc) # factor loadings corresponding to variables 1, 6, 11, 12 and 13 are # used as benchmarks: mcmc1 <- post.sign.switch(mcmc, benchmark = c(1, 6, 11, 12, 13)) # factor loadings with the highest posterior probability of being different # from zero in each column are used as benchmark: mcmc2 <- post.sign.switch(mcmc)
This function simulates data from a dedicated factor model. The parameters of the model are either passed by the user or simulated by the function.
simul.dedic.facmod(N, dedic, alpha, sigma, R, R.corr = TRUE, max.corr = 0.85, R.max.trial = 1000)
simul.dedic.facmod(N, dedic, alpha, sigma, R, R.corr = TRUE, max.corr = 0.85, R.max.trial = 1000)
N |
Number of observations in data set. |
dedic |
Vector of indicators. The number of manifest variables is equal to the length of this vector, and the number of factors is equal to the number of unique nonzero elements. Each integer element indicates on which latent factor the corresponding variable loads uniquely. |
alpha |
Vector of factor loadings, should be of same length as |
sigma |
Idiosyncratic variances, should be of same length as |
R |
Covariance matrix of the latent factors. If missing, values are simulated (see details below). |
R.corr |
If TRUE, covariance matrix |
max.corr |
Maximum correlation allowed between the latent factors. |
R.max.trial |
Maximum number of trials allowed to sample from the truncated
distribution of the covariance matrix of the latent factors
(accept/reject sampling scheme, to make sure |
The function simulates data from the following dedicated factor
model, for :
where the -vector
contains the latent factors, and
is the
-matrix of factor loadings. Each
row
of
contains only zeros, besides its element
indicated by the
th element of
dedic
that is equal to the
th element of
alpha
(denoted below).
The
-vector
is the vector of error terms, with
sigma
.
is equal to the length of
the vector
dedic
, and is equal to the maximum value of this
vector.
Only N
and dedic
are required, all the other parameters can be
missing, completely or partially. Missing values (NA
) are
independently sampled from the following distributions, for each manifest
variable :
Factor loadings:
Idiosyncratic variances:
For the variables that do not load on any factors (i.e., for which the
corresponding elements of dedic
are equal to 0), it is specified that
and
.
Covariance matrix of the latent factors:
which is rescaled to be a correlation matrix if R.corr = TRUE
:
Note that the distribution of the covariance matrix is truncated such that
all the off-diagonal elements of the implied correlation matrix are
below
max.corr
in absolute value. The truncation is also applied if
the covariance matrix is used instead of the correlation matrix (i.e., if
R.corr = FALSE
).
The distributions and the corresponding default values used to simulate the model parameters are specified as in the Monte Carlo study of CFSHP, see section 4.1 (p.43).
The function returns a data frame with N
observations
simulated from the corresponding dedicated factor model.
The parameters used to generate the data are saved as attributes:
dedic
, alpha
, sigma
and R
.
Rémi Piatek [email protected]
G. Conti, S. Frühwirth-Schnatter, J.J. Heckman, R. Piatek (2014): “Bayesian Exploratory Factor Analysis”, Journal of Econometrics, 183(1), pages 31-57, doi:10.1016/j.jeconom.2014.06.008.
# generate 1000 observations from model with 4 factors and 20 variables # (5 variables loading on each factor) dat <- simul.dedic.facmod(N = 1000, dedic = rep(1:4, each = 5)) # generate data set with 5000 observations from the following model: dedic <- rep(1:3, each = 4) # 3 factors and 12 manifest variables alpha <- rep(c(1, NA, NA, NA), 3) # set first loading to 1 for each factor, # sample remaining loadings from default sigma <- rep(0.5, 12) # idiosyncratic variances all set to 0.5 R <- toeplitz(c(1, .6, .3)) # Toeplitz matrix dat <- simul.dedic.facmod(N = 5000, dedic, alpha, sigma, R)
# generate 1000 observations from model with 4 factors and 20 variables # (5 variables loading on each factor) dat <- simul.dedic.facmod(N = 1000, dedic = rep(1:4, each = 5)) # generate data set with 5000 observations from the following model: dedic <- rep(1:3, each = 4) # 3 factors and 12 manifest variables alpha <- rep(c(1, NA, NA, NA), 3) # set first loading to 1 for each factor, # sample remaining loadings from default sigma <- rep(0.5, 12) # idiosyncratic variances all set to 0.5 R <- toeplitz(c(1, .6, .3)) # Toeplitz matrix dat <- simul.dedic.facmod(N = 5000, dedic, alpha, sigma, R)
This function produces a sample from the prior distribution of the number of latent factors. It depends on the prior parameters used for the distribution of the indicators, on the size of the model (number of manifest variables and maximum number of latent factors), and on the identification restriction (minimum number of manifest variables dedicated to each factor).
simul.nfac.prior(nvar, Kmax, Nid = 3, kappa = 1/Kmax, nrep = 10^6)
simul.nfac.prior(nvar, Kmax, Nid = 3, kappa = 1/Kmax, nrep = 10^6)
nvar |
Number of manifest variables. |
Kmax |
Maximum number of latent factors. |
Nid |
Minimum number of manifest variables dedicated to each latent factor for identification. |
kappa |
Concentration parameter of the Dirichlet prior distribution on the indicators. |
nrep |
Number of Monte Carlo replications. |
This function simulates the prior distribution of the number of
latent factors for models that fulfill the identification restriction
restriction that at least Nid
manifest variables (or no variables) are
loading on each latent factor. Several (scalar) parameters kappa
can
be passed to the function to simulate the prior for different prior parameter
values and compare the results.
An accept/reject sampling scheme is used: a vector of probabilities is drawn
from a Dirichlet distribution with concentration parameter kappa
, and
the nvar
manifest variables are randomly allocated to the Kmax
latent factors. If each latent factor has at least Nid
dedicated
variables or no variables at all, the identification requirement is fulfilled
and the draw is accepted. The number of factors loaded by at least Nid
manifest variables is returned as a draw from the prior distribution.
Note that this function does not use the two-level prior distribution implemented in CFSHP, where manifest variables can be discarded from the model according to a given probability. Therefore, this function only help understand the prior distribution conditional on all the manifest variables being included into the model.
A list of length equal to the number of parameters specified in
kappa
is returned, where each element of the list contains:
nfac
: Vector of integers of length equal to the number of
accepted draws.
acc
: Acceptance rate of the accept/reject sampling scheme.
Rémi Piatek [email protected]
G. Conti, S. Frühwirth-Schnatter, J.J. Heckman, R. Piatek (2014): “Bayesian Exploratory Factor Analysis”, Journal of Econometrics, 183(1), pages 31-57, doi:10.1016/j.jeconom.2014.06.008.
# replicate first row of table 2 in CFSHP (p.44) # note: use larger number of replications nrep to improve accuracy prior.nfac <- simul.nfac.prior(nvar = 15, Kmax = 5, kappa = c(.3, .7, 1), nrep = 10000) summary(prior.nfac) plot(prior.nfac)
# replicate first row of table 2 in CFSHP (p.44) # note: use larger number of replications nrep to improve accuracy prior.nfac <- simul.nfac.prior(nvar = 15, Kmax = 5, kappa = c(.3, .7, 1), nrep = 10000) summary(prior.nfac) plot(prior.nfac)
This function produces a sample of correlation matrices drawn from their prior distribution induced in the identified version of the factor model, given the prior distribution specified on the corresponding covariance matrices of the factors in the expanded model.
simul.R.prior(Kmax, nu0 = Kmax + 1, S0 = 1, HW.prior = TRUE, nrep = 10^5, verbose = TRUE)
simul.R.prior(Kmax, nu0 = Kmax + 1, S0 = 1, HW.prior = TRUE, nrep = 10^5, verbose = TRUE)
Kmax |
Maximum number of latent factors. |
nu0 |
Degrees of freedom of the Inverse-Wishart prior on the covariance matrix of the latent factors in the expanded model. |
S0 |
Scale parameters of the Inverse-Wishart prior on the covariance matrix of latent factors in the expanded model:
Either a scalar or a numeric vector of length equal to |
HW.prior |
If |
nrep |
Number of Monte Carlo replications. |
verbose |
If |
Covariance matrices are sampled from the prior distribution in the
expanded model, and transformed to produce the corresponding correlation
matrices. See section 2.3.5 of CFSHP (p.36-37), as well as the details of
the function befa
.
To compare several prior specifications, different values of the parameters
nu0
and S0
can be specified. The function then simulates for
each pair of these parameters. nu0
and S0
should therefore be
scalars or vectors of same length.
A list of length equal to the number of pairs of parameters
nu0
and S0
, where each element of the list is an array of
dimension (Kmax
, Kmax
, nrep
) that contains the
correlation matrices of the latent factors drawn from the prior.
Rémi Piatek [email protected]
G. Conti, S. Frühwirth-Schnatter, J.J. Heckman, R. Piatek (2014): “Bayesian Exploratory Factor Analysis”, Journal of Econometrics, 183(1), pages 31-57, doi:10.1016/j.jeconom.2014.06.008.
# partial reproduction of figure 1 in CFSHP (p.38) # note: use larger number of replications nrep to increase smoothness Kmax <- 10 Rsim <- simul.R.prior(Kmax, nu0 = Kmax + c(1, 2, 5), S0 = .5, nrep = 1000) summary(Rsim) plot(Rsim)
# partial reproduction of figure 1 in CFSHP (p.38) # note: use larger number of replications nrep to increase smoothness Kmax <- 10 Rsim <- simul.R.prior(Kmax, nu0 = Kmax + c(1, 2, 5), S0 = .5, nrep = 1000) summary(Rsim) plot(Rsim)
Generic function summarizing the posterior results of a 'befa' object. Optional arguments can be specified to customize the summary.
## S3 method for class 'befa' summary(object, ...)
## S3 method for class 'befa' summary(object, ...)
object |
Object of class 'befa'. |
... |
The following extra arguments can be specified:
|
This function summarizes the posterior distribution of the parameters.
The algorithm may visit different configurations of the indicator matrix
during sampling, where the manifest variables are allocated to
different latent factors. When the posterior distribution of the factor
loadings is summarized separately for each manifest variable
(
what = 'maxp'
or what = 'all'
), the function provides the
latent factor each manifest variable is allocated to (dedic
), and the
corresponding posterior probability (prob
). If dedic = 0
, then
prob
corresponds to the posterior probability that the manifest
variable is discarded. Discarded variables are listed last if
byfac = TRUE
. Low probability cases can be discarded by setting
min.prob
appropriately (default is 0.20).
Idiosyncratic variances, factor correlation matrix and regression
coefficients (if any) are summarized across all MCMC iterations if
what = 'all'
or what = 'maxp'
, and within each HPP model if
what = 'hppm'
.
Highest posterior probability model.
The HPP model is the model with a given allocation of the measurements to the
latent factors (i.e., a given indicator matrix ) that is visited
most often by the algorithm.
When specifying what = 'hppm'
, the function sorts the models according
to the posterior frequencies of their indicator matrices in decreasing order.
Therefore, the first model returned (labeled 'm1') corresponds to the HPP
model.
Low probability models can be discarded by setting min.prob
appropriately(default is 0.20, implying that only models with a posterior
probability larger than 0.20 are displayed).
HPP models can only be found if identification with respect to column switching has been restored a posteriori. An error message is returned if this is not the case.
If called directly, the summary is formatted and displayed on the standard output. Otherwise if saved in an object, a list of the following elements is returned:
MHacc
: Metropolis-Hastings acceptance rate.
alpha
: Data frame (or list of data frames if
what = 'hppm'
) containing posterior summary statistics for the
factor loadings.
sigma
: Data frame (or list of matrices if what = 'hppm'
)
containing posterior summary statistics for the idiosyncratic
variances.
R
: Data frame (or list of data frames if what = 'hppm'
)
containing posterior summary statistics for the factor correlations.
beta
: Data frame (or list of data frames if
what = 'hppm'
) containing posterior summary statistics for the
regression coefficients (if any).
nfac
(only if what = 'maxp'
or what = 'all'
):
Table of posterior frequencies of numbers of factors.
hppm
(only if what = 'hppm'
): List of the following
elements summarizing the different HPP models, sorted in decreasing
order of their posterior probabilities:
prob
: Vector of posterior probabilities.
nfac
: Vector of numbers of factors.
dedic
: Data frame of factor indicators.
Data frames of posterior summary statistics include the means (mean
),
standard deviations (sd
) and highest posterior density intervals
(hpd.lo
and hpd.up
, for the probability specified in
hpd.prob
) of the corresponding parameters.
For the factor loadings, the matrix may also include a column labeled
'dedic
' indicating to which factors the corresponding manifest
variables are dedicated (a zero value means that the manifest variable does
not load on any factor), as well as a column labeled 'prob
' showing
the corresponding posterior probabilities that the manifest variables load on
these factors.
Summary results are returned as lists of data frames for HPP models, where
the elements of the list are labeled as 'm1
, 'm2
', etc.
Rémi Piatek [email protected]
plot.befa
to plot posterior results.
set.seed(6) # generate fake data with 15 manifest variables and 3 factors Y <- simul.dedic.facmod(N = 100, dedic = rep(1:3, each = 5)) # run MCMC sampler and post process output # notice: 1000 MCMC iterations for illustration purposes only, # increase this number to obtain reliable posterior results! mcmc <- befa(Y, Kmax = 5, iter = 1000) mcmc <- post.column.switch(mcmc) mcmc <- post.sign.switch(mcmc) # summarize posterior results summary(mcmc) # summarize highest posterior probability (HPP) model hppm.sum <- summary(mcmc, what = 'hppm') # print summary with 6-digit precision print(hppm.sum, digits = 6) # extract posterior means of the factor loadings in HPP model alpha.mean <- hppm.sum$alpha$m1$mean print(alpha.mean)
set.seed(6) # generate fake data with 15 manifest variables and 3 factors Y <- simul.dedic.facmod(N = 100, dedic = rep(1:3, each = 5)) # run MCMC sampler and post process output # notice: 1000 MCMC iterations for illustration purposes only, # increase this number to obtain reliable posterior results! mcmc <- befa(Y, Kmax = 5, iter = 1000) mcmc <- post.column.switch(mcmc) mcmc <- post.sign.switch(mcmc) # summarize posterior results summary(mcmc) # summarize highest posterior probability (HPP) model hppm.sum <- summary(mcmc, what = 'hppm') # print summary with 6-digit precision print(hppm.sum, digits = 6) # extract posterior means of the factor loadings in HPP model alpha.mean <- hppm.sum$alpha$m1$mean print(alpha.mean)