Title: | Measurement Error Modeling Tools |
---|---|
Description: | This includes functions for analysis with error-prone covariates, including deconvolution, latent regression and errors-in-variables regression. It implements methods by Rabe-Hesketh et al. (2003) <doi:10.1191/1471082x03st056oa>, Lockwood and McCaffrey (2014) <doi:10.3102/1076998613509405>, and Lockwood and McCaffrey (2017) <doi:10.1007/s11336-017-9556-y>, among others. |
Authors: | J.R. Lockwood |
Maintainer: | J.R. Lockwood <[email protected]> |
License: | GPL (>= 2) | file LICENSE |
Version: | 0.1-8 |
Built: | 2024-12-18 06:30:35 UTC |
Source: | CRAN |
Implements a version of the Rabe-Hesketh et al. (2003) algorithm for computing the nonparametric MLE of a univariate latent variable distribution from observed error-prone measures. Allows for normal heteroskedastic measurement error with variance that depends on the latent variable, such as with estimates of latent ability from item response theory models.
deconv_npmle(W, csem, gridspec = list(fixed=FALSE, xmin=-5, xmax=5, numpoints=2000), lambda = 0.0005, lltol = 1e-7, psmall = 0.00005, discrete = FALSE, quietly = FALSE)
deconv_npmle(W, csem, gridspec = list(fixed=FALSE, xmin=-5, xmax=5, numpoints=2000), lambda = 0.0005, lltol = 1e-7, psmall = 0.00005, discrete = FALSE, quietly = FALSE)
W |
Vector of observed measures, where |
csem |
A function of a single variable |
gridspec |
A named list specifying the grid over which the NPMLE will be
computed. It must have a logical component |
lambda |
Step size, on probability scale, in Rabe-Hesketh et al. (2003) algorithm. See reference for details. |
lltol |
Algorithm stops when the improvement to the log likelihood does not
exceed |
psmall |
If a mass point in the estimated latent distribution evolves to have
probability less than |
discrete |
Not currently implemented. |
quietly |
If |
The assumed model is W = X + U
where the conditional
distribution of U
given X = x
is assumed to be normal
with mean zero and standard deviation csem(x)
. The function
uses W
to estimate a discrete latent distribution for X
that maximizes the likelihood of the observed data. The function
optimizes the mass points (among a grid of candidate values) and the
associated probabilities.
In the special case of homoskedastic error, the function csem
must be defined such that when passed a vector of length n
, it
returns a vector of length n
where each element is a common
constant.
The default values of xmin
and xmin
in gridspec
are generally appropriate only for a latent variable on a standardized
scale with mean zero and variance one, and should be set to
appropriate values given the scale of W
.
A list with elements
gridspec: Information about the initial grid
.history: Iteration-by-iteration evolution of the estimated
distribution, if gridspec$fixed
is FALSE. Otherwise it is
an empty list
px: A dataframe providing the final NPMLE distribution. There are as many rows as there are mass points in the estimated distribution; fields described below
reliability: An estimate of the reliability of W
, equal
to the estimated variance of X
divided by the sample variance
of W
simex_varfuncs: A dataframe with as many rows as there are
unique values of W
, providing estimated plug-in variance
functions to use for SIMEX data generation with latent
heteroskedastic error as described in Lockwood and McCaffrey
(forthcoming); see references. Fields described below
The fields of px
are:
x: Location of mass point
csem: Value of function csem
at mass point
p: probability at mass point
ll: log likelihood at solution
ex: Estimate of mean of latent distribution
varx: Estimate of variance of latent distribution
The fields of simex_varfuncs
are:
W: Unique observed values w
of W
gW: The square of csem
evaluated at W = w
gEXW: The square of csem
evaluated at E[X | W=w]
, the
conditional mean of X
given W=w
EgXW: The conditional mean of the square of csem
of
X
given W=w
, equal to E[g(X) | W=w]
J.R. Lockwood [email protected]
Lockwood J.R. and McCaffrey D.F. (2014). “Correcting for test score measurement error in ANCOVA models for estimating treatment effects,” Journal of Educational and Behavioral Statistics 39(1):22-52.
Lockwood J.R. and McCaffrey D.F. (2017). “Simulation-extrapolation with latent heteroskedastic variance,” Psychometrika 82(3):717-736.
Rabe-Hesketh S., Pickles A. and Skrondal A. (2003). “Correcting for covariate measurement error in logistic regression using nonparametric maximum likelihood estimation,” Statistical Modelling 3:215-232.
data(testscores) ## get the unique values of the lag 1 math score and CSEM ## values and approximate the CSEM function using approxfun() tmp <- unique(testscores[,c("math_lag1","math_lag1_csem")]) print(tmp <- tmp[order(tmp$math_lag1),]) .csem <- approxfun(tmp$math_lag1, tmp$math_lag1_csem, rule=2:2) plot(tmp$math_lag1, tmp$math_lag1_csem) lines(tmp$math_lag1, .csem(tmp$math_lag1), col="blue") ## get NPMLE distribution of latent lag 1 math achievement m <- deconv_npmle(W = testscores$math_lag1, csem = .csem, gridspec = list(fixed = FALSE, xmin = min(testscores$math_lag1), xmax = max(testscores$math_lag1), numpoints = 10000), quietly = TRUE) print(m$px) ## estimated mean is approximately the mean of W, but ## the estimated variance is less than the variance of W, ## as it should be print(c(empirical = mean(testscores$math_lag1), estimated = m$px$ex[1])) print(c(empirical = var(testscores$math_lag1), estimated = m$px$varx[1])) ## estimated reliability of W: print(m$reliability) ## if implementing SIMEX, simex_varfuncs provides plug-in ## options to use for the heteroskedastic error variance ## of each observed W print(m$simex_varfuncs) ## simple "value-added" estimates of school effects on math, ## adjusting for measurement error in the lag 1 math score. testscores$schoolid <- factor(testscores$schoolid) meiv <- eivreg(math ~ math_lag1 + sped + frl + schoolid, data = testscores, reliability = c(math_lag1 = m$reliability), contrasts = list(schoolid = "contr.sum")) print(summary(meiv)) ## alternative deconvolution with fixed grid m <- deconv_npmle(W = testscores$math_lag1, csem = .csem, gridspec = list(fixed = TRUE, xmin = min(testscores$math_lag1), xmax = max(testscores$math_lag1), numpoints = 40), quietly = TRUE) print(m$px)
data(testscores) ## get the unique values of the lag 1 math score and CSEM ## values and approximate the CSEM function using approxfun() tmp <- unique(testscores[,c("math_lag1","math_lag1_csem")]) print(tmp <- tmp[order(tmp$math_lag1),]) .csem <- approxfun(tmp$math_lag1, tmp$math_lag1_csem, rule=2:2) plot(tmp$math_lag1, tmp$math_lag1_csem) lines(tmp$math_lag1, .csem(tmp$math_lag1), col="blue") ## get NPMLE distribution of latent lag 1 math achievement m <- deconv_npmle(W = testscores$math_lag1, csem = .csem, gridspec = list(fixed = FALSE, xmin = min(testscores$math_lag1), xmax = max(testscores$math_lag1), numpoints = 10000), quietly = TRUE) print(m$px) ## estimated mean is approximately the mean of W, but ## the estimated variance is less than the variance of W, ## as it should be print(c(empirical = mean(testscores$math_lag1), estimated = m$px$ex[1])) print(c(empirical = var(testscores$math_lag1), estimated = m$px$varx[1])) ## estimated reliability of W: print(m$reliability) ## if implementing SIMEX, simex_varfuncs provides plug-in ## options to use for the heteroskedastic error variance ## of each observed W print(m$simex_varfuncs) ## simple "value-added" estimates of school effects on math, ## adjusting for measurement error in the lag 1 math score. testscores$schoolid <- factor(testscores$schoolid) meiv <- eivreg(math ~ math_lag1 + sped + frl + schoolid, data = testscores, reliability = c(math_lag1 = m$reliability), contrasts = list(schoolid = "contr.sum")) print(summary(meiv)) ## alternative deconvolution with fixed grid m <- deconv_npmle(W = testscores$math_lag1, csem = .csem, gridspec = list(fixed = TRUE, xmin = min(testscores$math_lag1), xmax = max(testscores$math_lag1), numpoints = 40), quietly = TRUE) print(m$px)
Fits errors-in-variables (EIV) linear regression given specified reliabilities, or a specified variance/covariance matrix for the measurement errors. For either case, it computes robust standard error estimates that allow for weighting and/or clustering.
eivreg(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = FALSE, contrasts = NULL, reliability = NULL, Sigma_error = NULL, cluster_varname = NULL, df_adj = FALSE, stderr = TRUE, offset, ...)
eivreg(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = FALSE, contrasts = NULL, reliability = NULL, Sigma_error = NULL, cluster_varname = NULL, df_adj = FALSE, stderr = TRUE, offset, ...)
formula , data , subset , weights , na.action , method , model , x , y , qr
|
See documentation for |
singular.ok , contrasts , offset , ...
|
See documentation for |
reliability |
Named numeric vector giving the reliability for each error-prone
covariate. If left |
Sigma_error |
Named numeric matrix giving the variance/covariance matrix of the
measurement errors for the error-prone covariate(s). If left
|
cluster_varname |
A character variable providing the name of a variable in |
df_adj |
Logical (default FALSE); if TRUE, the estimated variance/covariance
matrix of the regression parameters is multiplied by |
stderr |
Logical (default TRUE); if FALSE, does not compute estimated variance/covariance matrix of the regression parameters. |
Theory
The EIV estimator applies when one wishes to estimate the parameters
of a linear regression of on
, but
covariates
are instead observed, where
for mean zero measurement error
. Additional
assumptions are required about
for consistent estimation;
see references for details.
The standard EIV estimator of the regression coefficients is , where
is the design
matrix formed from
and
is a matrix that
adjusts
to account for elements that are distorted due
to measurement error. The value of
depends on whether
reliability
or Sigma_error
is specified. When
Sigma_error
is specified, is known. When
reliability
is specified, must be estimated using
the marginal variances of the observed error-prone covariates.
The estimated regression coefficients are solutions to a system of
estimating equations, and both the system of equations and the
solutions depend on whether reliability
or Sigma_error
is specified. For each of these two cases, standard errors for the
estimated regression coefficients are computed using standard results
from M-estimation; see references. For either case, adjustments for
clustering are provided if specified.
Syntax Details
Exactly one of reliability
or Sigma_error
must be
specified in the call. Sigma_error
need not be diagonal in the
case of correlated measurement error across multiple error-prone
covariates.
Error-prone variables must be included as linear main effects only; the
current version of the code does not allow interactions among
error-prone covariates, interactions of error-prone covariates with
error-free covariates, or nonlinear functions of error-prone
covariates. The error-prone covariates cannot be specified with any
construction involving I()
.
The current version does not allow singular.ok=TRUE
.
It is strongly encouraged to use the data
argument to pass a dataframe
containing all variables to be used in the regression, rather than
using a matrix on the right hand side of the regression formula. In
addition, if cluster_varname
is specified, everything including
the clustering variable must be passed as data
.
If weights
is specified, a weighted version of the EIV
estimator is computed using operations analogous to weighted least
squares in linear regression, and a standard error for this weighted
estimator is computed. Weights must be positive and will be
normalized inside the function to sum to the number of observations
used to fit the model. Cases with missing weights will get dropped
just like cases with missing covariates.
Different software packages that compute robust standard errors make
different choices about degrees-of-freedom adjustments intended to
improve small-sample coverage properties. The df_adj
argument
will inflate the estimated variance/covariance matrix of the estimated
regression coefficients by N/(N-p)
; see Wooldridge (2002, p. 57). In
addition, if cluster_varname
is specified, the estimated
variance/covariance matrix will be inflated by M/(M-1)
where
M
is the number of unique clusters present in the estimation sample.
An list object of class eivlm
with the following components:
coefficients |
Estimated regression coefficients from EIV model. |
residuals |
Residuals from fitted EIV model. |
rank |
Column rank of regression design matrix. |
fitted.values |
Fitted values from EIV model. |
N |
Number of observations used in fitted model. |
Sigma_error |
The measurement error covariance matrix, if supplied. |
reliability |
The vector of reliabilities, if supplied. |
relnames |
The names of the error-prone covariates. |
XpX_adj |
The cross-product matrix of the regression, adjusted for measurement error. |
varYXZ |
The maximum likelihood estimate of the covariance matrix
of the outcome |
latent_resvar |
A degrees-of-freedom adjusted estimate of the
residual variance of the latent regression. NOTE: this not an
estimate of the residual variance of the regression on the observed
covariates |
vcov |
The estimated variance/covariance matrix of the regression coefficients. |
cluster_varname , cluster_values , cluster_num
|
If
|
OTHER |
The object also includes components |
J.R. Lockwood [email protected] modified the lm
function to adapt it for EIV regression.
Carroll R.J, Ruppert D., Stefanski L.A. and Crainiceanu C.M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective (2nd edition). London: Chapman & Hall.
Fuller W. (2006). Measurement Error Models (2nd edition). New York: John Wiley & Sons.
Stefanksi L.A. and Boos D.B. (2002). “The calculus of M-estimation,” The American Statistician 56(1):29-38.
Wooldridge J. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.
lm
, summary.eivlm
, deconv_npmle
set.seed(1001) ## simulate data with covariates x1, x2 and z. .n <- 1000 .d <- data.frame(x1 = rnorm(.n)) .d$x2 <- sqrt(0.5)*.d$x1 + rnorm(.n, sd=sqrt(0.5)) .d$z <- as.numeric(.d$x1 + .d$x2 > 0) ## generate outcome ## true regression parameters are c(2,1,1,-1) .d$y <- 2.0 + 1.0*.d$x1 + 1.0*.d$x2 - 1.0*.d$z + rnorm(.n) ## generate error-prone covariates w1 and w2 Sigma_error <- diag(c(0.20, 0.30)) dimnames(Sigma_error) <- list(c("w1","w2"), c("w1","w2")) .d$w1 <- .d$x1 + rnorm(.n, sd = sqrt(Sigma_error["w1","w1"])) .d$w2 <- .d$x2 + rnorm(.n, sd = sqrt(Sigma_error["w2","w2"])) ## fit EIV regression specifying known measurement error covariance matrix .mod1 <- eivreg(y ~ w1 + w2 + z, data = .d, Sigma_error = Sigma_error) print(class(.mod1)) .tmp <- summary(.mod1) print(class(.tmp)) print(.tmp) ## fit EIV regression specifying known reliabilities. Note that ## point estimator is slightly different from .mod1 because ## the correction matrix S must be estimated when the reliability ## is known. .lambda <- c(1,1) / (c(1,1) + diag(Sigma_error)) .mod2 <- eivreg(y ~ w1 + w2 + z, data = .d, reliability = .lambda) print(summary(.mod2))
set.seed(1001) ## simulate data with covariates x1, x2 and z. .n <- 1000 .d <- data.frame(x1 = rnorm(.n)) .d$x2 <- sqrt(0.5)*.d$x1 + rnorm(.n, sd=sqrt(0.5)) .d$z <- as.numeric(.d$x1 + .d$x2 > 0) ## generate outcome ## true regression parameters are c(2,1,1,-1) .d$y <- 2.0 + 1.0*.d$x1 + 1.0*.d$x2 - 1.0*.d$z + rnorm(.n) ## generate error-prone covariates w1 and w2 Sigma_error <- diag(c(0.20, 0.30)) dimnames(Sigma_error) <- list(c("w1","w2"), c("w1","w2")) .d$w1 <- .d$x1 + rnorm(.n, sd = sqrt(Sigma_error["w1","w1"])) .d$w2 <- .d$x2 + rnorm(.n, sd = sqrt(Sigma_error["w2","w2"])) ## fit EIV regression specifying known measurement error covariance matrix .mod1 <- eivreg(y ~ w1 + w2 + z, data = .d, Sigma_error = Sigma_error) print(class(.mod1)) .tmp <- summary(.mod1) print(class(.tmp)) print(.tmp) ## fit EIV regression specifying known reliabilities. Note that ## point estimator is slightly different from .mod1 because ## the correction matrix S must be estimated when the reliability ## is known. .lambda <- c(1,1) / (c(1,1) + diag(Sigma_error)) .mod2 <- eivreg(y ~ w1 + w2 + z, data = .d, reliability = .lambda) print(summary(.mod2))
Computes a scale matrix in the BUGS parameterization that corresponds to a minimally-informative Wishart prior distribution for a precision matrix, with the property that the medians of the diagonals of the implied prior distribution for the corresponding covariance matrix are approximately equal to specified target variances.
get_bugs_wishart_scalemat(target, nsim=25000, reltol = 0.05, quietly=TRUE)
get_bugs_wishart_scalemat(target, nsim=25000, reltol = 0.05, quietly=TRUE)
target |
A |
nsim |
Number of Monte-Carlo simulations used to set target scale matrix. Default is 25,000. |
reltol |
Relative tolerance determining when the algorithm stops trying to find a better scale matrix. Default is 0.05. |
quietly |
If |
When using WinBUGS/OpenBUGS/JAGS, it is often necessary to provide a
Wishart prior distribution for the precision matrix of a
p
-dimensional random vector. It is common to use a Wishart
distribution with p+1
degrees of freedom in this case. The
question is what scale matrix to use. The BUGS languages parameterize the
Wishart distribution such that if a precision matrix M
is given
the prior distribution M ~ dwish(S,p+1)
for a pxp
scale
matrix S
and p+1
degrees of freedom, the expected value
of M
is p+1
times the inverse of S
.
The current function determines a diagonal scale matrix S
such
that the implied prior distribution for the inverse of M
, the
variance/covariance matrix of the random vector, under the
distribution M ~ dwish(S,p+1)
in the BUGS parameterization, has
medians of the diagonal elements approximately equal to some target
variances specified by target
. It iteratively tries values of
S
via Monte Carlo simulation to select a value of S
with
the desired property.
The value of reltol
determines how close the match must be.
Larger values of nsim
and smaller values of reltol
will
lead to smaller Monte Carlo error in the estimate scale matrix.
A list with elements
bugs.df: Degrees of freedom to use for Wishart prior
distribution in BUGS, equal to p+1
where p
is the
dimension of target
.
bugs.scalemat: Scale matrix to use for Wishart prior distribution in BUGS.
varsum: Summary of prior distribution of implied variances;
medians should approximately equal target
.
corsum: Summary of prior distribution of implied correlations.
J.R. Lockwood [email protected]
tmp <- get_bugs_wishart_scalemat(target = c(10,4,4,8), nsim = 30000, reltol = 0.02, quietly=FALSE) print(tmp) ## if you now model precison matrix M ~ dwish(tmp$bugs.scalemat, ## tmp$bugs.df) in a BUGS language, this will imply a prior distribution ## for the inverse of M that has medians of the diagonal elements ## approximately equal to 'target'
tmp <- get_bugs_wishart_scalemat(target = c(10,4,4,8), nsim = 30000, reltol = 0.02, quietly=FALSE) print(tmp) ## if you now model precison matrix M ~ dwish(tmp$bugs.scalemat, ## tmp$bugs.df) in a BUGS language, this will imply a prior distribution ## for the inverse of M that has medians of the diagonal elements ## approximately equal to 'target'
Uses the jags
function in R2jags to fit a
latent-variable GLM with error-prone covariates that may have
heteroskedastic normal measurement error with variance that is a
function of the latent variable, such as commonly occurs with test
scores computed using item-response-theory (IRT) models.
lr_ancova(outcome_model, Y, W, Z, G, varfuncs, plotfile=NULL, seed=12345, modelfileonly=FALSE, scalemat=NULL, blockprior=TRUE, ...)
lr_ancova(outcome_model, Y, W, Z, G, varfuncs, plotfile=NULL, seed=12345, modelfileonly=FALSE, scalemat=NULL, blockprior=TRUE, ...)
outcome_model |
A character string indicating the outcome model. Valid values are currently 'normal', 'normalME', 'poisson', 'bernoulli_probit', and 'bernoulli_logit'. |
Y |
A numeric vector of outcome values. Missing values are allowed. |
W |
A numeric matrix of error-prone covariates. Missing values
are allowed, though no column of |
Z |
A numeric matrix of error-free covariates. Missing values
are not allowed. First column must be a vector of 1s to serve as a
model intercept because effects of groups in |
G |
A numeric or factor vector indicating group memberships of units. Missing values not allowed. |
varfuncs |
A list with as many components as there are
error-prone covariates, equal to the number of columns of |
plotfile |
Character string providing full path to a PDF file that will store some diagnostic plots regarding the variance functions. Default is NULL and will be assigned to a file in a temporary directory and the name of file will be returned. |
seed |
An integer that will be passed to |
modelfileonly |
If TRUE, function will return a link to a file that contains the JAGS model code, but will not actually fit the model. Default is FALSE. |
scalemat |
When there are multiple error-prone covariates, the
specification of the Bayesian model as implemented in JAGS requires
a scale matrix for a Wishart prior distribution for a precision
matrix. The default is NULL, in which case the function will set a
value of |
blockprior |
If TRUE (the default), specifies JAGS code to encourage updating regression model parameters as a block to improve MCMC mixing. |
... |
Additional arguments to |
Theory
The outcome is assumed to depend on
where
is a vector of latent variables,
is a
vector of observed, error-free variables, and
is a grouping
variable. For example, one may be interested in the effects of some
intervention where
indicates groupings of units that
received different treatments, and the variables
are potential confounders. This function addresses the case where
is unobserved, and error-prone proxies
are
instead observed. It is assumed that
for
mean-zero, normally-distributed measurement error
, and that
may be a function
of
. Such
error structures commonly arise with the use of test scores computed
using item-response-theory (IRT) models. Details on these issues and
other model assumptions are provided in the references. The model is
a generalization of errors-in-variables linear regression.
The model assumes that the outcome depends on
through a linear function of these predictors,
and parameters for this linear function are estimated. The conditional
distribution of
given these predictors that is assumed by
the model depends on
outcome_model
. If outcome_model
is
normal
, the conditional distribution of is assumed
to be normal, and the model also estimates a residual variance for
given the covariates. If
outcome_model
is
normalME
, it is assumed that there is a latent variable (call
it Yl
) that follows the same conditional distribution as when
outcome_model
is normal
, and then measures
Yl
with normal measurement error and the known information
about this error is passed as the last component of varfuncs
.
In this way, the lr_ancova
can support models with
heteroskedastic measurement error in both the predictors and the
outcome. If outcome_model
is poisson
, must
consists of non-negative integers and a log link is assumed. If
outcome_model
is bernoulli_logit
, must take
on values of 0 and 1, and a logit link is assumed. Finally, if
outcome_model
is bernoulli_probit
, must take
on values of 0 and 1, and a probit link is assumed.
The model assumes that the conditional distribution of
given
is normal with a mean vector that depends on
and a covariance matrix that is assumed not to
depend on
. Both the regression parameters and the
residual covariance matrix of this conditional distribution are
estimated.
All parameters of the model involving are
estimated using the observed data
using
assumptions and information about the distribution of the measurement
errors
. The structure assumed here is that measurement
errors are independent across units and across dimensions of
, and that the conditional distribution of
given
is a normal distribution with mean zero and variance
. The function
must be specified and can
be constant. Additional discussions of this class of error functions
are provided in the references, and details about how information
about
is conveyed to this function are provided below.
Syntax Details
Note that this function requires the R2jags package, which in turn requires JAGS to be installed on your system.
The function will check that the only column of Z
that is in
the span of the columns of the design matrix implied by the grouping
variable G
is the first column, corresponding to an intercept.
The effects of G
are parameterized with a sum-to-zero
constraint, so that the effect of each group is expressed relative to
the average of all group effects.
The varfuncs
argument requires the most thinking. This
argument is a list with as many elements as there are error-prone
covariates, or one plus the number of error-prone covariates if
outcome_model
is normalME
. In this latter case, the
final element must be the error variance function for .
Each element of the list varfuncs
is itself a list providing
the measurement error information about one of the error-prone
covariates (or the outcome, if outcome_model
is
normalME
). For each i
, varfuncs[[i]]
must be a
list following a particular structure. First,
varfuncs[[i]]$type
must be a character string taking one of
three possible values: constant
, piecewise_linear
or
log_polynomial
. The case constant
corresponds to the
case of homoskedastic measurement error where is
constant, and the variance of this measurement error must be provided
in
varfuncs[[i]]$vtab
. The other two cases correspond to the
case where the conditional measurement error variance
is a nontrivial function of
. In both of these cases,
varfuncs[[i]]$vtab
must be a matrix or data frame with exactly
two columns and K
rows, where the first column provides values
x[1],...,x[K]
of and the second column provides
values
g(x[1]),...,g(x[K])
. That is, the function
is conveyed via a lookup table. The value of
K
is selected by
the user. Larger values of K
will make the approximation to
more accurate but will cause the model estimation to
proceed more slowly. How the values in the lookup table are used to
approximate
more generally depends whether
varfuncs[[i]]$type
is piecewise_linear
or
log_polynomial
. In the case of piecewise_linear
, the
values in the lookup table are linearly interpolated. In the case of
log_polynomial
, a polynomial of degree
varfuncs[[i]]$degree
is fitted to the logs of the values of
g(x[1]),...,g(x[K])
, and the fitted model is used to build a
smooth approximation to the function . The default
value of
varfuncs[[i]]$degree
if it is not specified is 6. For
either the piecewise linear or log polynomial approximations, the
function g(X)
g(X) is set to g(x[1])
for values of
x
smaller than x[1]
, and is set of g(x[K])
for
values of x
larger than x[K]
. Diagnostic plots of the
approximate variance functions saved in PDF file whose location is
returned by lr_ancova
. The Examples section provides examples
that will be helpful in specifying varfuncs
.
When there are two or more error-prone covariates, the model estimates
a residual variance/covariance matrix of given
. Because the model is fit in a Bayesian framework,
a prior distribution is required for this matrix. We are using JAGS
and specify a prior distribution for the inverse of the residual
variance/covariance matrix using a Wishart distribution. The degrees
of freedom parameter of this distribution is set to one plus
ncol(W)
to be minimally informative. The scale matrix of this
distribution can be set by passing an appropriate matrix via the
scalemat
argument. If scalemat
is NULL, the function
specifies a diagonal scale matrix that attempts to make the prior
medians of the unknown residual variances approximately equal to the
residual variances obtained by regressing components of on
. See
get_bugs_wishart_scalemat
.
Such variances will be somewhat inflated due to measurement error in
but the prior variance of the Wishart distribution is
sufficiently large that this lack of alignment should be minimally
consequential in most applications. The value of
scalemat
used
in the estimation is returned by the function, and users can start
with the default and then pass alternative values via the
scalemat
argument for sensitivity analyses if desired.
A object of class rjags
, with additional information
specific to this context. The additional information is stored as a
list called lr_ancova_extras
with the following components:
model.location |
Path to file containing JAGS model code. |
plot.location |
Path to file containing diagnostic plots regarding the variance functions. |
group.map |
A dataframe mapping the original group labels in
|
scalemat |
The value of |
The parameters used in the JAGS model, and thus named in the model
object, use naming conventions described here. The parameters in the
linear function of that is related to
are partitioned into
betaYXZ
and betaYG
. In
applications involving analysis of causal effects of groupings, the
parameters betaYG
will generally be of most interest. When
outcome_model
is normal
, the residual standard deviation
of given
is also estimated and is
called
sdYgivenXZG
. Similarly, when outcome_model
is
normalME
, a residual standard deviation of the latent variable
corresponding to given
is also
estimated and is also called
sdYgivenXZG
. Note in this case
that the residual standard deviation of given its
corresponding latent variable is assumed to be known and specified via
varfuncs
.
The regression parameters for the conditional distribution of
given
are partitioned as
betaXZ
and betaXG
. The residual variance/covariance matrix for
given
is named
varXgivenXG
. Additional details on these parameters can be
found by looking at the JAGS model file whose location is returned as
noted above.
J.R. Lockwood [email protected]
Battauz, M. and Bellio, R. (2011). “Structural modeling of measurement error in generalized linear models with Rasch measures as covariates,” Psychometrika 76(1):40-56.
Lockwood J.R. and McCaffrey D.F. (2014). “Correcting for test score measurement error in ANCOVA models for estimating treatment effects,” Journal of Educational and Behavioral Statistics 39(1):22-52.
Lockwood J.R. and McCaffrey D.F. (2017). “Simulation-extrapolation with latent heteroskedastic variance,” Psychometrika 82(3):717-736.
Plummer, M. (2003). “JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling.” Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), Vienna, Austria.
Rabe-Hesketh S., Pickles A. and Skrondal A. (2003). “Correcting for covariate measurement error in logistic regression using nonparametric maximum likelihood estimation,” Statistical Modelling 3:215-232.
jags
, get_bugs_wishart_scalemat
set.seed(3001) cat("NOTE: this example uses MCMC and takes a little while to run\n") ## example of estimating school "value-added" effects on math test scores, ## adjusting for lag 1 math and ELA scores and accounting for the ## heteroskedastic measurement errors in those scores. data(testscores) print(length(unique(testscores$schoolid))) ## to help interpretation of model coefficients and school effects, standardize ## current and lag 1 test scores to have mean zero and variance 1. Also adjust ## the conditional standard errors of measurement for the lag 1 scores. testscores$math <- as.vector(scale(testscores$math)) testscores$math_lag1_csem <- testscores$math_lag1_csem / sd(testscores$math_lag1) testscores$math_lag1 <- as.vector(scale(testscores$math_lag1)) testscores$lang_lag1_csem <- testscores$lang_lag1_csem / sd(testscores$lang_lag1) testscores$lang_lag1 <- as.vector(scale(testscores$lang_lag1)) ## create pieces needed to call lr_ancova. Note that first column of Z ## must be an intercept. outcome_model <- "normal" Y <- testscores$math W <- testscores[,c("math_lag1","lang_lag1")] Z <- cbind(1, testscores[,c("sped","frl")]) G <- testscores$schoolid ## create varfuncs. Need to be careful to pass conditional measurement error ## variances, which require squaring the CSEMs varfuncs <- list() tmp <- unique(testscores[,c("math_lag1","math_lag1_csem")]) names(tmp) <- c("x","gx") tmp <- tmp[order(tmp$x),] tmp$gx <- tmp$gx^2 varfuncs[[1]] <- list(type="log_polynomial", vtab=tmp) tmp <- unique(testscores[,c("lang_lag1","lang_lag1_csem")]) names(tmp) <- c("x","gx") tmp <- tmp[order(tmp$x),] tmp$gx <- tmp$gx^2 varfuncs[[2]] <- list(type="log_polynomial", vtab=tmp) ## fit the model. NOTE: in practice, larger values of n.iter and n.burnin ## would typically be used; they are kept small here so that the example ## runs relatively quickly. m1 <- lr_ancova(outcome_model, Y, W, Z, G, varfuncs, n.iter=300, n.burnin=100) ## you can check the approximation to the variance functions by looking at the ## PDF file: print(m1$lr_ancova_extras$plot.location) ## and also can look at the JAGS model file: print(m1$lr_ancova_extras$model.location) ## the model object is of class "rjags" and so inherits the appropriate methods, ## including print: print(m1) ## betaXG, betaXZ, and varXgivenZG are for the conditional distribution of X ## given (Z,G). betaYG, betaYXZ and sdYgivenXZG are for the conditional ## distribution of Y given (X,Z,G). ## ## the first two elements of betaYXZ are the coefficients for the two columns of ## X, whereas the following three elements are the coefficients for the three ## columns of Z. ## ## the school effects are in betaYG. extract their posterior means and ## posterior standard deviations: e <- m1$BUGSoutput$summary e <- as.data.frame(e[grep("betaYG",rownames(e)),c("mean","sd")]) ## check the sum-to-zero constraints: print(sum(e$mean)) ## put the actual school IDs onto "e" e$schoolid <- m1$lr_ancova_extras$group.map$G print(e) ## compare the school effect estimates to those from a simpler model that does ## not adjust for the lag 1 ELA score, and does not account for the measurement ## error in the lag 1 math score. Use sum-to-zero contrasts and recover the ## estimate for the last school as negative the sum of the other estimates. testscores$schid <- factor(testscores$schoolid) m0 <- lm(math ~ math_lag1 + sped + frl + schid, data=testscores, contrasts=list(schid = "contr.sum")) s <- coef(m0)[grep("schid", names(coef(m0)))] e$est_m0 <- c(s, -sum(s)) ## Such estimates should have some amount of omitted variable bias, which ## should manifest as the differences between the "m0" and "m1" estimates ## being positively correlated with average prior achievement. print(cor(tapply(testscores$math_lag1, testscores$schoolid, mean), e$est_m0 - e$mean)) print(cor(tapply(testscores$lang_lag1, testscores$schoolid, mean), e$est_m0 - e$mean))
set.seed(3001) cat("NOTE: this example uses MCMC and takes a little while to run\n") ## example of estimating school "value-added" effects on math test scores, ## adjusting for lag 1 math and ELA scores and accounting for the ## heteroskedastic measurement errors in those scores. data(testscores) print(length(unique(testscores$schoolid))) ## to help interpretation of model coefficients and school effects, standardize ## current and lag 1 test scores to have mean zero and variance 1. Also adjust ## the conditional standard errors of measurement for the lag 1 scores. testscores$math <- as.vector(scale(testscores$math)) testscores$math_lag1_csem <- testscores$math_lag1_csem / sd(testscores$math_lag1) testscores$math_lag1 <- as.vector(scale(testscores$math_lag1)) testscores$lang_lag1_csem <- testscores$lang_lag1_csem / sd(testscores$lang_lag1) testscores$lang_lag1 <- as.vector(scale(testscores$lang_lag1)) ## create pieces needed to call lr_ancova. Note that first column of Z ## must be an intercept. outcome_model <- "normal" Y <- testscores$math W <- testscores[,c("math_lag1","lang_lag1")] Z <- cbind(1, testscores[,c("sped","frl")]) G <- testscores$schoolid ## create varfuncs. Need to be careful to pass conditional measurement error ## variances, which require squaring the CSEMs varfuncs <- list() tmp <- unique(testscores[,c("math_lag1","math_lag1_csem")]) names(tmp) <- c("x","gx") tmp <- tmp[order(tmp$x),] tmp$gx <- tmp$gx^2 varfuncs[[1]] <- list(type="log_polynomial", vtab=tmp) tmp <- unique(testscores[,c("lang_lag1","lang_lag1_csem")]) names(tmp) <- c("x","gx") tmp <- tmp[order(tmp$x),] tmp$gx <- tmp$gx^2 varfuncs[[2]] <- list(type="log_polynomial", vtab=tmp) ## fit the model. NOTE: in practice, larger values of n.iter and n.burnin ## would typically be used; they are kept small here so that the example ## runs relatively quickly. m1 <- lr_ancova(outcome_model, Y, W, Z, G, varfuncs, n.iter=300, n.burnin=100) ## you can check the approximation to the variance functions by looking at the ## PDF file: print(m1$lr_ancova_extras$plot.location) ## and also can look at the JAGS model file: print(m1$lr_ancova_extras$model.location) ## the model object is of class "rjags" and so inherits the appropriate methods, ## including print: print(m1) ## betaXG, betaXZ, and varXgivenZG are for the conditional distribution of X ## given (Z,G). betaYG, betaYXZ and sdYgivenXZG are for the conditional ## distribution of Y given (X,Z,G). ## ## the first two elements of betaYXZ are the coefficients for the two columns of ## X, whereas the following three elements are the coefficients for the three ## columns of Z. ## ## the school effects are in betaYG. extract their posterior means and ## posterior standard deviations: e <- m1$BUGSoutput$summary e <- as.data.frame(e[grep("betaYG",rownames(e)),c("mean","sd")]) ## check the sum-to-zero constraints: print(sum(e$mean)) ## put the actual school IDs onto "e" e$schoolid <- m1$lr_ancova_extras$group.map$G print(e) ## compare the school effect estimates to those from a simpler model that does ## not adjust for the lag 1 ELA score, and does not account for the measurement ## error in the lag 1 math score. Use sum-to-zero contrasts and recover the ## estimate for the last school as negative the sum of the other estimates. testscores$schid <- factor(testscores$schoolid) m0 <- lm(math ~ math_lag1 + sped + frl + schid, data=testscores, contrasts=list(schid = "contr.sum")) s <- coef(m0)[grep("schid", names(coef(m0)))] e$est_m0 <- c(s, -sum(s)) ## Such estimates should have some amount of omitted variable bias, which ## should manifest as the differences between the "m0" and "m1" estimates ## being positively correlated with average prior achievement. print(cor(tapply(testscores$math_lag1, testscores$schoolid, mean), e$est_m0 - e$mean)) print(cor(tapply(testscores$lang_lag1, testscores$schoolid, mean), e$est_m0 - e$mean))
model.matrix
method for objects of class eivlm
.Extract model matrix from eivlm
object. Analogous to model.matrix
method for lm
objects.
## S3 method for class 'eivlm' model.matrix(object, ...)
## S3 method for class 'eivlm' model.matrix(object, ...)
object |
A model object of class |
... |
See help for |
Design matrix used in EIV regression.
print
method for objects of class eivlm
.Analogous to print
method for lm
objects.
## S3 method for class 'eivlm' print(x, digits = max(3L, getOption("digits") - 3L), ...)
## S3 method for class 'eivlm' print(x, digits = max(3L, getOption("digits") - 3L), ...)
x |
A model object of class |
digits , ...
|
See help for |
print
method for objects of class summary.eivlm
. Similar to print
method for summaries of
lm
objects, but provides additional information specific
to the EIV regression. The summary
method for objects of class
eivlm
returns an object of class summary.eivlm
.
## S3 method for class 'summary.eivlm' print(x, digits = max(3L, getOption("digits") - 3L), symbolic.cor = x$symbolic.cor, signif.stars = getOption("show.signif.stars"), ...)
## S3 method for class 'summary.eivlm' print(x, digits = max(3L, getOption("digits") - 3L), symbolic.cor = x$symbolic.cor, signif.stars = getOption("show.signif.stars"), ...)
x |
A model object of class |
digits , symbolic.cor , signif.stars , ...
|
See help for |
See help for summary.eivlm
for description of quantities
relevant to summarizing eivlm
objects.
summary
method for objects of class eivlm
.Computes summary quantities for a model of class eivlm
. The
computations include some quantities for the standard regression model
(uncorrected for covariate measurement error), as well as quantities
relevant to the EIV model.
## S3 method for class 'eivlm' summary(object, correlation = FALSE, symbolic.cor = FALSE, ...)
## S3 method for class 'eivlm' summary(object, correlation = FALSE, symbolic.cor = FALSE, ...)
object |
A model object of class |
correlation , symbolic.cor , ...
|
See help for |
An list object of class summary.eivlm
with components:
residuals , fitted.values , N , latent_resvar , vcov , relnames , coefficients
|
See |
call , terms , aliased , df , coefficients
|
See |
reliability |
If |
Sigma_error |
If |
symbolic.cor |
If |
latent_R2 |
Maximum likelihood estimate of R-squared for regression of Y on (X,Z). NOTE: This is not the R-squared of the fitted regression of Y on (W,Z). |
latent_R2_dfadj |
Estimate of R-squared for regression of Y on (X,Z) adjusted by number of estimated regression parameters. |
OTHER |
The object also includes components
|
J.R. Lockwood [email protected]
The model fitting function eivreg
, summary
.
Function coef
will extract the matrix of coefficients
with standard errors, t-statistics and p-values.
Cohort of grade 6 students with mathematics test scores from a target school year, as well as mathematics and language test scores from the prior school year (grade 5). Measurement error in test scores quantified by conditional standard error of measurement (CSEM).
data(testscores)
data(testscores)
A data frame with 4853 observations and 10 fields:
stuid
Unique identifier for each student (one record per student)
schoolid
Unique identifier for each student's grade 6 school
math
Grade 6 mathematics test score
math_csem
CSEM for grade 6 mathematics test score
math_lag1
Grade 5 mathematics test score
math_lag1_csem
CSEM for grade 5 mathematics test score
lang_lag1
Grade 5 language test score
lang_lag1_csem
CSEM for grade 5 language test score
sped
1 = student designated as special education; 0 otherwise
frl
1 = student participates in Free and Reduced Price lunch program; 0 otherwise
Anonymous
vcov
method for objects of class eivlm
.Extract variance/covariance matrix of estimated parameters from
eivlm
model object. Analogous to vcov
method for other models.
## S3 method for class 'eivlm' vcov(object, ...)
## S3 method for class 'eivlm' vcov(object, ...)
object |
A model object of class |
... |
Not currently implemented. |
Estimated variance/covariance matrix of estimated regression coefficients.