Title: | Estimation/Multiple Imputation for Mixed Categorical and Continuous Data |
---|---|
Description: | Estimation/multiple imputation programs for mixed categorical and continuous data. |
Authors: | Joseph L. Schafer [aut], Brian Ripley [aut, trl, cre] (R port) |
Maintainer: | Brian Ripley <[email protected]> |
License: | Unlimited |
Version: | 1.0-13 |
Built: | 2024-12-11 16:43:03 UTC |
Source: | CRAN |
Markov Chain Monte Carlo method for generating posterior draws of the
parameters of the unrestricted general location model, given a matrix of
incomplete mixed data. At each step, missing data are randomly imputed
under the current parameter, and a new parameter value is drawn from its
posterior distribution given the completed data. After a suitable
number of steps are taken, the resulting value of the parameter may be
regarded as a random draw from its observed-data posterior
distribution. May be used together with imp.mix to create
multiple imputations of the missing data.
da.mix(s, start, steps=1, prior=0.5, showits=FALSE)
s |
summary list of an incomplete data matrix created by the
function |
start |
starting value of the parameter. This is a parameter list
such as one created by the function |
steps |
number of data augmentation steps to be taken. |
prior |
Optional vector or array of hyperparameter(s) for a Dirichlet prior
distribution. The default is the Jeffreys prior (all hyperparameters
= .5). If structural zeros appear in the table, prior counts for these
cells should be set to |
showits |
if |
The prior distribution used by this function is a combination of a
Dirichlet prior for the cell probabilities, an improper uniform prior
for the within-cell means, and the improper Jeffreys prior for the
covariance matrix. The posterior distribution is not guaranteed to
exist, especially in sparse-data situations. If this seems to be a
problem, then better results may be obtained by imposing restrictions
on the parameters; see ecm.mix and dabipf.mix.
A new parameter list. The parameter can be put into a more
understandable format by the function getparam.mix.
The random number generator seed must be set at least once by the
function rngseed before this function can be used.
Schafer, J. L. (1996) Analysis of Incomplete Multivariate Data. Chapman & Hall, Chapter 9.
prelim.mix, getparam.mix, em.mix, and rngseed.
data(stlouis)
s <- prelim.mix(stlouis,3)   # preliminary manipulations
thetahat <- em.mix(s)   # find ML estimate
rngseed(1234567)   # set random number generator seed
newtheta <- da.mix(s, thetahat, steps=100, showits=TRUE)   # take 100 steps
ximp1 <- imp.mix(s, newtheta)   # impute under newtheta
Markov Chain Monte Carlo method for generating posterior draws of the
parameters of the unrestricted general location model, given a matrix
of incomplete mixed data. After a suitable number of steps are taken,
the resulting value of the parameter may be regarded as a random draw
from its observed-data posterior distribution. May be used together
with imp.mix to create multiple imputations of the missing data.
dabipf.mix(s, margins, design, start, steps=1, prior=0.5, showits=FALSE)
s |
summary list of an incomplete data matrix created by the
function |
margins |
vector describing the sufficient configurations or margins in the
desired loglinear model. The variables are ordered in the original
order of the columns of |
design |
design matrix specifying the relationship of the continuous
variables to the categorical ones. The dimension is |
start |
starting value of the parameter. This is a parameter list
such as one created by this function or by |
steps |
number of steps of data augmentation-Bayesian IPF to be taken. |
prior |
Optional vector or array of hyperparameter(s) for a Dirichlet prior
distribution. The default is the Jeffreys prior (all hyperparameters
= .5). If structural zeros appear in the table, prior counts for these
cells should be set to |
showits |
if |
The prior distribution used by this function is a combination of a constrained Dirichlet prior for the cell probabilities, an improper uniform prior for the regression coefficients, and the improper Jeffreys prior for the covariance matrix. The posterior distribution is not guaranteed to exist, especially in sparse-data situations. If this seems to be a problem, then better results may be obtained by imposing further restrictions on the parameters.
A new parameter list. The parameter can be put into a more
understandable format by the function getparam.mix.
The random number generator seed must be set at least once by the
function rngseed before this function can be used.
The starting value should satisfy the restrictions of the model and
should lie in the interior of the parameter space. A suitable starting
value can be obtained by running ecm.mix, possibly with the prior
hyperparameters set to some value greater than 1, to ensure that the
mode lies in the interior.
Schafer, J. L. (1996) Analysis of Incomplete Multivariate Data. Chapman & Hall, Chapter 9.
prelim.mix, getparam.mix, ecm.mix, rngseed, imp.mix.
data(stlouis)
s <- prelim.mix(stlouis,3)   # do preliminary manipulations
margins <- c(1,2,3)   # saturated contingency table model
design <- diag(rep(1,12))   # identity matrix D=no of cells
thetahat <- ecm.mix(s,margins,design)   # find ML estimate
rngseed(1234567)   # random generator seed
newtheta <- dabipf.mix(s,margins,design,thetahat,steps=200)
ximp <- imp.mix(s,newtheta,stlouis)   # impute under newtheta
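The identity design in the example gives every cell its own mean vector. A restricted model uses a design matrix with fewer columns than cells. The following is a minimal base R sketch, assuming the 3 x 2 x 2 table of the stlouis data and assuming that rows follow R's array storage order (first categorical variable varying fastest). The names `cells`, `design_sat`, and `design_main` are illustrative, and the main-effects model is an assumed choice rather than one this page prescribes.

```r
# Enumerate the 12 cells of a 3 x 2 x 2 table; expand.grid varies the
# first variable fastest, matching R's array storage order.
cells <- expand.grid(G = factor(1:3), D1 = factor(1:2), D2 = factor(1:2))

# Saturated design: one column per cell (same as diag(rep(1, 12))).
design_sat <- diag(nrow(cells))

# Main-effects design (assumed model): intercept + G + D1 + D2.
design_main <- model.matrix(~ G + D1 + D2, data = cells)

dim(design_sat)    # one row and one column per cell
dim(design_main)   # one row per cell, one column per effect
```

A matrix such as `design_main` would take the place of `design` in the calls to ecm.mix and dabipf.mix, provided its row order agrees with the cell order used by the package.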
Computes maximum-likelihood estimates for the parameters of the general location model from an incomplete mixed dataset.
ecm.mix(s, margins, design, start, prior=1, maxits=1000, showits=TRUE, eps=0.0001)
s |
summary list of an incomplete data matrix |
margins |
vector describing the sufficient configurations or margins in the
desired loglinear model. The variables are ordered in the original
order of the columns of |
design |
design matrix specifying the relationship of the continuous
variables to the categorical ones. The dimension is |
start |
optional starting value of the parameter. This is a list such as one
created by this function or by |
prior |
Optional vector or array of hyperparameter(s) for a Dirichlet prior
distribution. By default, uses a uniform prior on the cell
probabilities. ECM finds the posterior mode, which under
a uniform prior is the same as a maximum-likelihood estimate. If
structural zeros appear in the table, hyperparameters for those cells
should be set to |
maxits |
maximum number of iterations performed. The algorithm will stop if the parameter still has not converged after this many iterations. |
showits |
if |
eps |
optional convergence criterion. The algorithm stops when the maximum relative difference in every parameter from one iteration to the next is less than or equal to this value. |
a list representing the maximum-likelihood estimates (or posterior
mode) of the normal parameters. This list contains cell probabilities,
cell means, and covariances. The parameter can be transformed back to
the original scale and put into a more understandable format by the
function getparam.mix.
If zero cell counts occur in the complete-data table, the maximum likelihood estimate may not be unique, and the algorithm may converge to different stationary values depending on the starting value. Also, if zero cell counts occur in the complete-data table, the ML estimate may lie on the boundary of the parameter space.
Schafer, J. L. (1996) Analysis of Incomplete Multivariate Data. Chapman & Hall, Chapter 9.
prelim.mix, em.mix, getparam.mix, loglik.mix.
data(stlouis)
s <- prelim.mix(stlouis,3)   # preliminary manipulations
margins <- c(1,2,3)   # saturated loglinear model
design <- diag(rep(1,12))   # identity matrix, D=no of cells
thetahat <- ecm.mix(s,margins,design)   # should be same as em.mix(s)
loglik.mix(s,thetahat)   # loglikelihood at thetahat
Computes maximum-likelihood estimates for the parameters of the unrestricted general location model from an incomplete mixed dataset.
em.mix(s, start, prior=1, maxits=1000, showits=TRUE, eps=0.0001)
s |
summary list of an incomplete data matrix produced by the function
|
start |
optional starting value of the parameter. This is a parameter list in
packed storage, such as one returned by this function or by
|
prior |
Optional vector or array of hyperparameters for a Dirichlet prior distribution. By default, uses a uniform prior on the cell probabilities (all hyperparameters set to one). The EM algorithm finds the posterior mode, which under a uniform prior is the same as the maximum-likelihood estimate. If structural zeros appear in the table, the corresponding hyperparameters should be set to NA. |
maxits |
maximum number of iterations performed. The algorithm will stop if the parameter still has not converged after this many iterations. |
showits |
if |
eps |
optional convergence criterion. The algorithm stops when the maximum relative difference in every parameter from one iteration to the next is less than or equal to this value. |
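As a small illustration of the prior argument, the sketch below builds a hyperparameter array for a hypothetical 3 x 2 x 2 table and marks one cell as a structural zero with NA. The table dimensions and the choice of cell are assumptions made for illustration only; the commented call shows where such an array would be passed.

```r
# Uniform Dirichlet prior (all hyperparameters 1) over a 3 x 2 x 2 table.
prior <- array(1, dim = c(3, 2, 2))

# Hypothetical structural zero: cell (3, 2, 2) is assumed impossible.
prior[3, 2, 2] <- NA

# With the mix package loaded and s created by prelim.mix, the array
# would be supplied as:
# thetahat <- em.mix(s, prior = prior)
```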
a list representing the maximum-likelihood estimates (or posterior
mode) of the normal parameters. This list contains cell probabilities,
cell means, and covariances. The parameter can be transformed back to
the original scale and put into a more understandable format by the
function getparam.mix.
If zero cell counts occur in the complete-data table, the maximum likelihood estimate may not be unique, and the algorithm may converge to different stationary values depending on the starting value. Also, if zero cell counts occur in the complete-data table, the ML estimate may lie on the boundary of the parameter space.
Schafer, J. L. (1996) Analysis of Incomplete Multivariate Data. Chapman & Hall, Chapter 9.
prelim.mix, getparam.mix, and ecm.mix.
data(stlouis)
s <- prelim.mix(stlouis,3)   # do preliminary manipulations
thetahat <- em.mix(s)   # compute ML estimate
getparam.mix(s,thetahat, corr=TRUE)   # look at estimated parameters
Present parameters of general location model in an understandable format.
getparam.mix(s, theta, corr=FALSE)
s |
summary list of an incomplete normal data matrix created by the
function |
theta |
list of parameters such as one produced by the function |
corr |
if |
if corr=FALSE, a list containing the components pi, mu and sigma; if
corr=TRUE, a list containing the components pi, mu, sdv, and r.
The components are:
pi |
array of cell probabilities whose dimensions correspond to the
columns of the categorical part of x. The dimension is
|
mu |
Matrix of cell means. The dimension is |
sigma |
matrix of variances and covariances corresponding to the continuous
variables in |
sdv |
vector of standard deviations corresponding to the continuous
variables in |
r |
matrix of correlations corresponding to the continuous
variables in |
In a restricted general location model, the matrix of means is
required to satisfy t(mu)=A%*%beta for a given design matrix A. To
obtain beta, perform a multivariate regression of t(mu) on A, for
example beta <- lsfit(A, t(mu), intercept=FALSE)$coef.
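This regression can be checked on synthetic values. The sketch below assumes mu is stored with one column per cell, so that t(mu) has one row per cell. The matrices `A`, `beta`, and `mu` are made up for illustration and do not come from any fitted model.

```r
# Hypothetical design: 4 cells, 2 effects (intercept and one contrast).
A <- cbind(1, c(-1, -1, 1, 1))

# Hypothetical coefficients: 2 effects x 2 continuous variables.
beta <- matrix(c(10, 2, 20, -3), nrow = 2)

# Cell means that satisfy the restriction t(mu) = A %*% beta.
mu <- t(A %*% beta)

# Recover beta by regressing t(mu) on A without an intercept.
beta_hat <- lsfit(A, t(mu), intercept = FALSE)$coef
beta_hat
```

Because the synthetic means satisfy the restriction exactly, `beta_hat` agrees with `beta` up to floating-point error. With means from getparam.mix the same call applies, with A being the design matrix used to fit the model.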
Schafer, J. L. (1996) Analysis of Incomplete Multivariate Data. Chapman & Hall, Chapter 9.
prelim.mix, em.mix, ecm.mix, da.mix, dabipf.mix.
data(stlouis)
s <- prelim.mix(stlouis,3)   # do preliminary manipulations
thetahat <- em.mix(s)   # compute ML estimate
getparam.mix(s, thetahat, corr=TRUE)$r   # look at estimated correlations
This function, when used with da.mix or dabipf.mix, can be
used to create proper multiple imputations of missing data under
the general location model with or without restrictions.
imp.mix(s, theta, x)
s |
summary list of an incomplete data matrix |
theta |
value of the parameter under which the missing data are to be
randomly imputed. This is a parameter list such as one created
by |
x |
the original data matrix used to create the summary list |
This function is essentially the I-step of data augmentation.
a matrix of the same form as x, but with all missing values filled in
with simulated values drawn from their predictive distribution given
the observed data and the specified parameter.
The random number generator seed must be set at least once by the
function rngseed before this function can be used.
Schafer, J. L. (1996) Analysis of Incomplete Multivariate Data. Chapman & Hall, Chapter 9.
prelim.mix, da.mix, dabipf.mix, rngseed.
data(stlouis)
s <- prelim.mix(stlouis,3)   # do preliminary manipulations
thetahat <- em.mix(s)   # ML estimate for unrestricted model
rngseed(1234567)   # set random number generator seed
newtheta <- da.mix(s,thetahat,steps=100)   # data augmentation
ximp <- imp.mix(s, newtheta, stlouis)   # impute under newtheta
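To obtain several imputed datasets rather than one, the data augmentation and imputation steps can be repeated. The sketch below is one possible arrangement, not a prescribed workflow. The number of imputations (m = 5) and the 20 steps between imputations are arbitrary choices made for illustration, and the loop only runs when the mix package is installed.

```r
m <- 5                                  # number of imputations (arbitrary)
imputations <- vector("list", m)

if (requireNamespace("mix", quietly = TRUE)) {
  library(mix)
  data(stlouis)
  s <- prelim.mix(stlouis, 3)           # first 3 columns are categorical
  theta <- em.mix(s, showits = FALSE)   # start the chain at the ML estimate
  rngseed(1234567)                      # required before da.mix / imp.mix

  for (i in seq_len(m)) {
    theta <- da.mix(s, theta, steps = 20)            # advance the chain
    imputations[[i]] <- imp.mix(s, theta, stlouis)   # impute under this draw
  }
}
```

Each element of `imputations` can then be analysed separately and the results combined with mi.inference. In practice the spacing between imputations should be long enough for successive draws to be approximately independent.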
Calculates the observed-data loglikelihood under the general location model at a user-specified parameter value.
loglik.mix(s, theta)
s |
summary list of an incomplete data matrix |
theta |
the value of the loglikelihood function at theta.
Schafer, J. L. (1996) Analysis of Incomplete Multivariate Data. Chapman & Hall, Chapter 9.
data(stlouis)
s <- prelim.mix(stlouis,3)   # preliminary manipulations
thetahat <- em.mix(s)   # MLE under unrestricted general location model
loglik.mix(s, thetahat)   # loglikelihood at thetahat
Combines estimates and standard errors from m complete-data analyses performed on m imputed datasets to produce a single inference. Uses the technique described by Rubin (1987) for multiple imputation inference for a scalar estimand.
mi.inference(est, std.err, confidence=0.95)
est |
a list of |
std.err |
a list of |
confidence |
desired coverage of interval estimates. |
a list with the following components, each of which is a vector of the
same length as the components of est and std.err:
est |
the average of the complete-data estimates. |
std.err |
standard errors incorporating both the between and the within-imputation uncertainty (the square root of the "total variance"). |
df |
degrees of freedom associated with the |
signif |
P-values for the two-tailed hypothesis tests that the estimated quantities are equal to zero. |
lower |
lower limits of the (100*confidence)% interval estimates. |
upper |
upper limits of the (100*confidence)% interval estimates. |
r |
estimated relative increases in variance due to nonresponse. |
fminf |
estimated fractions of missing information. |
Uses the method described on pp. 76-77 of Rubin (1987) for combining the complete-data estimates from m imputed datasets for a scalar estimand. Significance levels and interval estimates are approximately valid for each one-dimensional estimand, not for all of them jointly.
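For a single scalar estimand, the combining rules can be written out in a few lines of base R. This is a sketch of the standard formulas from Rubin (1987) as they are commonly stated, not the package's source code. The function name `pool_scalar` and the input values are hypothetical.

```r
pool_scalar <- function(est, std.err, confidence = 0.95) {
  m     <- length(est)
  qbar  <- mean(est)                 # pooled point estimate
  ubar  <- mean(std.err^2)           # within-imputation variance
  b     <- var(est)                  # between-imputation variance
  total <- ubar + (1 + 1 / m) * b    # total variance
  r     <- (1 + 1 / m) * b / ubar    # relative increase in variance
  df    <- (m - 1) * (1 + 1 / r)^2   # degrees of freedom for the t reference
  half  <- qt(1 - (1 - confidence) / 2, df) * sqrt(total)
  list(est = qbar, std.err = sqrt(total), df = df,
       signif = 2 * (1 - pt(abs(qbar) / sqrt(total), df)),
       lower = qbar - half, upper = qbar + half,
       r = r, fminf = (r + 2 / (df + 3)) / (r + 1))
}

# Hypothetical estimates and standard errors from m = 3 imputed datasets.
pooled <- pool_scalar(est = c(1, 2, 3), std.err = c(1, 1, 1))
pooled$est
pooled$std.err
```

The component names mirror those listed for mi.inference so the two can be compared side by side. The sketch handles one estimand at a time, whereas mi.inference accepts vectors of estimands.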
Rubin, D. B. (1987) Multiple Imputation for Nonresponse in Surveys. Wiley.
Schafer, J. L. (1996) Analysis of Incomplete Multivariate Data. Chapman & Hall.
This function performs grouping and sorting operations on a mixed
dataset with missing values. It creates a list that is
needed for input to em.mix, da.mix, imp.mix, etc.
prelim.mix(x, p)
x |
data matrix containing missing values. The rows of x correspond to
observational units, and the columns to variables. Missing values are
denoted by |
p |
number of categorical variables in x |
a list of twenty-nine (!) components that summarize various features of x after the data have been collapsed, centered, scaled, and sorted by missingness patterns. Components that might be of interest to the user include:
nmis |
a vector of length |
r |
matrix of response indicators showing the missing data patterns in
|
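What per-variable missing counts and a response-indicator matrix look like can be seen on a toy example in base R. The matrix below is made up, and the code only illustrates the idea. It assumes that nmis counts missing values per variable, and it does not reproduce the collapsing, scaling, or sorting done by prelim.mix.

```r
# Toy data: column 1 is fully observed, columns 2 and 3 have gaps.
x <- matrix(c(1,  1, 5.2,
              2, NA, 3.1,
              1,  2,  NA,
              2, NA,  NA),
            ncol = 3, byrow = TRUE)

# Number of missing values in each column.
nmis <- colSums(is.na(x))

# Response indicators: 1 = observed, 0 = missing.
resp <- 1 * !is.na(x)

# Distinct missingness patterns, one row per pattern.
patterns <- unique(resp)

nmis
patterns
```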
Schafer, J. L. (1996) Analysis of Incomplete Multivariate Data. Chapman & Hall, Chapter 9.
em.mix, ecm.mix, da.mix, dabipf.mix, imp.mix, getparam.mix.
data(stlouis)
s <- prelim.mix(stlouis, 3)   # do preliminary manipulations
s$nmis   # look at nmis
s$r   # look at missing data patterns
Initialize random number generator seed for mix package.
rngseed(seed)
seed |
a positive number, preferably a large integer. |
NULL.
The random number generator seed must be set at least once
by this function before the simulation or imputation functions
in this package (da.mix, imp.mix, etc.) can be used.
The St. Louis Risk Research Project was an observational study to assess the effects of parental psychological disorders on child development. In the preliminary study, 69 families with 2 children were studied.
data(stlouis)
This is a numeric matrix with 69 rows and 7 columns:
[,1] | G | Parental risk group |
---|---|---|
[,2] | D1 | Symptoms, child 1 |
[,3] | D2 | Symptoms, child 2 |
[,4] | R1 | Reading score, child 1 |
[,5] | V1 | Verbal score, child 1 |
[,6] | R2 | Reading score, child 2 |
[,7] | V2 | Verbal score, child 2 |
The parental risk group was coded 1, 2 or 3, from low to high, and the
child symptoms were coded 1 = low or 2 = high. Missing values occur on all
variables except G.
Little, R. J. A. and Schluchter, M. D. (1985), Maximum-likelihood estimation for mixed continuous and categorical data with missing values. Biometrika, 72, 492–512.
Schafer, J. L. (1996) Analysis of Incomplete Multivariate Data. Chapman & Hall. pp. 359–367.