Title: | Selective Editing via Mixture Models |
---|---|
Description: | Detection of outliers and influential errors using a latent variable model. |
Authors: | Ugo Guarnera [aut], Teresa Buglielli [aut, cre] |
Maintainer: | Teresa Buglielli |
License: | EUPL |
Version: | 1.0.2 |
Built: | 2024-11-20 06:58:39 UTC |
Source: | CRAN |
Simulated data from a Gaussian contamination model
data(ex1.data)
A data frame with 500 observations and 2 variables (X1, Y1).
X1 | error-free variable (numeric) |
Y1 | contaminated variable (numeric) |
Data have been generated by a Gaussian model. The variable Y1 has been contaminated with parameters B=(-0.26, 1.26), sigma=1.21, w=0.05, lambda=10.
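As a rough illustration of how data of this kind could be generated, the sketch below simulates one contaminated variable with one covariate. It is not the code used to build ex1.data. The helper name simulate_contaminated, the distribution of the covariate, the use of the log scale, and the reading of sigma as a variance (with contaminated units having variance lambda * sigma) are all assumptions made for the example.

```r
# Hypothetical helper (not part of SeleMix): simulate one variable from a
# two-component Gaussian contamination model on the log scale.
simulate_contaminated <- function(n, B, sigma, w, lambda,
                                  x = rnorm(n, mean = 5)) {
  mu    <- B[1] + B[2] * x                    # regression mean (intercept, slope)
  bad   <- rbinom(n, size = 1, prob = w)      # 1 = contaminated unit
  sd_i  <- sqrt(ifelse(bad == 1, lambda * sigma, sigma))
  log_y <- mu + rnorm(n, mean = 0, sd = sd_i) # Gaussian residuals
  data.frame(X1 = exp(x), Y1 = exp(log_y), contaminated = bad)
}

set.seed(1)
sim <- simulate_contaminated(500, B = c(-0.26, 1.26), sigma = 1.21,
                             w = 0.05, lambda = 10)
table(sim$contaminated)   # roughly 5% of the units are contaminated
```

The extra column contaminated is kept only so that the simulated truth can be compared with the flags produced by an estimation step.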
data(ex1.data)
Simulated data from a Gaussian contamination model
data(ex2.data)
A data frame with 500 observations on the following 2 variables.
Y1 | first numeric contaminated variable |
Y2 | second numeric contaminated variable |
Data have been simulated from a Gaussian contamination model with two contaminated variables (Y1, Y2) with parameters B=(1.03, 0.96), sigma=matrix(c(1.22, 1.42, 1.42, 2.89),2,2), w=0.05, lambda=10.
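A bivariate draw from such a model can be sketched with a Cholesky factor of sigma. This is only an illustration under assumptions: B is read as the vector of means on the log scale, the contaminated component is taken to have covariance lambda * sigma, and the variable names are hypothetical. It does not reproduce ex2.data.

```r
# Hypothetical sketch: n bivariate observations from a Gaussian mixture whose
# second component has covariance lambda * sigma (log scale).
sigma  <- matrix(c(1.22, 1.42, 1.42, 2.89), 2, 2)
B      <- c(1.03, 0.96)
w      <- 0.05
lambda <- 10
n      <- 500

set.seed(2)
bad   <- rbinom(n, 1, w)                           # contamination indicator
z     <- matrix(rnorm(n * 2), n, 2) %*% chol(sigma) # rows ~ N(0, sigma)
z     <- z * sqrt(ifelse(bad == 1, lambda, 1))      # inflate contaminated rows
log_y <- sweep(z, 2, B, "+")                        # add the component means
sim2  <- data.frame(Y1 = exp(log_y[, 1]), Y2 = exp(log_y[, 2]))
```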
data(ex2.data)
Provides ML estimates of a Gaussian contamination model.
ml.est (y, x=NULL, model = "LN", lambda=3, w=0.05, lambda.fix=FALSE, w.fix=FALSE, eps=1e-7, max.iter=500, t.outl=0.5, graph=FALSE)
y | matrix or data frame containing the response variables |
x | optional matrix or data frame containing the error free covariates |
model | data distribution: LN = lognormal (default), N = normal |
lambda | starting value for the variance inflation factor (default=3) |
w | starting value for the proportion of contaminated data (default=0.05) |
lambda.fix | logical. TRUE if lambda is known |
w.fix | logical. TRUE if w is known |
eps | epsilon: tolerance parameter for the log-likelihood convergence (default=1e-7) |
max.iter | maximum number of EM iterations (default=500) |
t.outl | threshold value for posterior probabilities of identifying outliers (default=0.5) |
graph | logical. TRUE to display graphics (default=FALSE) |
This function provides the parameter estimates of a contamination model where a set of y variables is assumed to depend on a (possibly empty) set of covariates (x variables) through a mixture of two linear regressions with Gaussian residuals. The covariance matrices of the two mixture components are assumed to be proportional (the proportionality constant being lambda). If there are no x variables, a mixture of two Gaussian distributions is estimated.
BIC and AIC scores (bic.aic) are returned for both the standard Gaussian model and the contamination model, to help the user avoid possible over-parametrisation.
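The comparison rests on the usual definitions AIC = -2 logL + 2k and BIC = -2 logL + k log(n). The sketch below applies them to two made-up log-likelihood values; the helper name info_criteria, the numbers, and the parameter counts are illustrative assumptions, not output of ml.est.

```r
# Hypothetical helper: information criteria from a maximised log-likelihood.
info_criteria <- function(loglik, n_par, n_obs) {
  c(AIC = -2 * loglik + 2 * n_par,
    BIC = -2 * loglik + log(n_obs) * n_par)
}

# Illustrative numbers only. With one response and two regression
# coefficients, the plain Gaussian model is counted here as 3 parameters
# and the contamination model as 5 (it adds w and lambda).
normal  <- info_criteria(loglik = -820.4, n_par = 3, n_obs = 500)
mixture <- info_criteria(loglik = -790.1, n_par = 5, n_obs = 500)
rbind(normal, mixture)   # smaller values favour the corresponding model
```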
According to the estimated model parameters, a matrix of predictions of ‘true’ y values (ypred) is computed. Each unit in the dataset is assigned a flag (outlier) that takes value 1 if the posterior probability of being erroneous (tau) is greater than the user-specified threshold (t.outl), and 0 otherwise.
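For a single variable, the posterior probability and the resulting flag can be sketched as follows. This is a minimal illustration assuming a two-component Gaussian mixture with common mean mu, variances sigma and lambda * sigma, and mixing weight w; the helper name posterior_tau is hypothetical and the code is not taken from the package.

```r
# Hypothetical helper: posterior probability that a unit belongs to the
# contaminated component of a univariate two-component Gaussian mixture.
posterior_tau <- function(y, mu, sigma, lambda, w) {
  f_clean <- (1 - w) * dnorm(y, mean = mu, sd = sqrt(sigma))
  f_cont  <- w * dnorm(y, mean = mu, sd = sqrt(lambda * sigma))
  f_cont / (f_clean + f_cont)
}

y   <- c(0.1, -0.5, 6.0)          # values on the scale where the model is Gaussian
tau <- posterior_tau(y, mu = 0, sigma = 1, lambda = 10, w = 0.05)
outlier <- as.integer(tau > 0.5)  # same rule as the t.outl threshold
```

Only the third value lies far enough from the mean for the contaminated component to dominate, so it is the only one flagged.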
The model is estimated using complete observations. Missing values in the x variables are not allowed. However, y variables can be partly observed. Robust predictions of y variables are provided even when they are not observed. A vector of missing patterns (pattern) indicates which items are observed and which are missing.
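One way to picture such a pattern vector is a 0/1 string per unit, with one character per y variable. The encoding below is an assumption made for illustration; the exact format returned by the package may differ.

```r
# Hypothetical sketch: one 0/1 string per unit, 1 = observed, 0 = missing.
y <- cbind(Y1 = c(2.3, NA, 4.1), Y2 = c(1.0, 5.2, NA))
pattern <- apply(!is.na(y), 1,
                 function(obs) paste(as.integer(obs), collapse = ""))
pattern
```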
If the option ‘model = LN’ is specified, each zero value is replaced by 1e-7 and a warning is returned.
To monitor the EM algorithm graphically, a scatter plot is shown in which outliers are depicted as they are identified. The trajectory of the lambda parameter is also shown until convergence.
ml.est returns a list containing the following components:
ypred | matrix of predicted values for y variables |
B | matrix of estimated regression coefficients |
sigma | estimated covariance matrix |
lambda | estimated variance inflation factor |
w | estimated proportion of erroneous data |
tau | vector of posterior probabilities of being contaminated |
outlier | 1 if the observation is classified as an outlier, 0 otherwise |
n.outlier | total number of outlier observations |
pattern | vector of non-response patterns for y variables: 0 = missing, 1 = present value |
is.conv | logical value: TRUE if the EM algorithm has converged |
n.iter | number of iterations of the EM algorithm |
sing | if TRUE, iterations were stopped because there is an almost perfect fit |
bic.aic | Bayesian Information Criterion and Akaike Information Criterion for contaminated and non-contaminated Gaussian models |
M. Teresa Buglielli, Ugo Guarnera
Di Zio, M., Guarnera, U. (2013) "A Contamination Model for Selective Editing",
Journal of Official Statistics. Volume 29, Issue 4, Pages 539-555
(https://doi.org/10.2478/jos-2013-0039).
Buglielli, M.T., Di Zio, M., Guarnera, U. (2010) "Use of Contamination Models for Selective Editing", European Conference on Quality in Survey Statistics Q2010, Helsinki, 4-6 May 2010
# Parameter estimation with one contaminated variable and one covariate
data(ex1.data)
ml.par <- ml.est(y=ex1.data[,"Y1"], x=ex1.data[,"X1"], graph=TRUE)
str(ml.par)
sum(ml.par$outlier)  # number of outliers
# Parameter estimation with two contaminated variables and no covariates
## Not run:
data(ex2.data)
par.joint <- ml.est(y=ex2.data, x=NULL, graph=TRUE)
sum(par.joint$outlier)  # number of outliers
## End(Not run)
Provides predictions of y variables according to a Gaussian contamination model
pred.y (y, x=NULL, B, sigma, lambda, w, model="LN", t.outl=0.5)
y | matrix or data frame containing the response variables |
x | optional matrix or data frame containing the error free covariates |
B | matrix of regression coefficients |
sigma | covariance matrix |
lambda | variance inflation factor |
w | proportion of erroneous data |
model | data distribution: LN = lognormal (default), N = normal |
t.outl | threshold value for posterior probabilities of identifying outliers (default=0.5) |
This function provides expected values of a set of variables (y1.p, y2.p, ...) according to a mixture of two regression models with Gaussian residuals (see ml.est). If no covariates (x variables) are available, a two-component Gaussian mixture is used. Expected values (predictions) are computed on the basis of a set of parameters of appropriate dimensions (B, sigma, lambda, w) and (possibly) a matrix (or data frame) containing the error-free x variables.
Missing values in the x variables are not allowed. However, robust predictions of y variables are also provided when these variables are not observed. A vector of missing patterns (pattern) indicates which items are observed and which are missing.
For each unit in the data set, the posterior probability of being erroneous (tau) is computed and a flag (outlier) is provided that takes value 1 if tau is greater than the user-specified threshold (t.outl), and 0 otherwise.
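The idea of a prediction that combines the observed value with the model mean can be sketched for one variable as below. The sketch assumes that a contaminated value equals the true value plus an independent Gaussian error, so that the contaminated component has variance lambda * sigma and the conditional expectation of the true value shrinks the observation towards the mean by a factor 1/lambda. This additive-error reading, the helper name predict_true_value, and the choice of the model mean for unobserved values are assumptions, not a description of the internals of pred.y.

```r
# Hypothetical sketch of a univariate prediction under the assumptions above.
predict_true_value <- function(y, mu, sigma, lambda, w) {
  mu      <- rep(mu, length.out = length(y))
  f_clean <- (1 - w) * dnorm(y, mu, sqrt(sigma))
  f_cont  <- w * dnorm(y, mu, sqrt(lambda * sigma))
  tau     <- f_cont / (f_clean + f_cont)   # posterior probability of error
  shrunk  <- mu + (y - mu) / lambda        # expected true value if erroneous
  ypred   <- (1 - tau) * y + tau * shrunk  # mixture of "keep" and "shrink"
  miss    <- is.na(y)
  ypred[miss] <- mu[miss]                  # unobserved y: fall back on the mean
  data.frame(y1.p = ypred, tau = tau)
}

p <- predict_true_value(y = c(1, NA, 8), mu = 1, sigma = 1, lambda = 10, w = 0.05)
p
```

In this toy call the first value is left unchanged, the missing value is predicted by the mean, and the third value is pulled almost entirely to mu + (y - mu) / lambda.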
pred.y returns a data frame containing the following columns:
y1.p, y2.p, ... | predicted values for y variables |
tau | posterior probabilities of being contaminated |
outlier | 1 if the observation is classified as an outlier, 0 otherwise |
pattern | non-response patterns for y variables: 0 = missing, 1 = present value |
M. Teresa Buglielli, Ugo Guarnera
Buglielli, M.T., Di Zio, M., Guarnera, U. (2010) "Use of Contamination Models for Selective Editing", European Conference on Quality in Survey Statistics Q2010, Helsinki, 4-6 May 2010
# Parameter estimation with one contaminated variable and one covariate
data(ex1.data)
# Parameters estimated applying ml.est to ex1.data
B1 <- as.matrix(c(-0.152, 1.215))
sigma1 <- as.matrix(1.25)
lambda1 <- 15.5
w1 <- 0.0479
# Variable prediction
ypred <- pred.y(y=ex1.data[,"Y1"], x=ex1.data[,"X1"], B=B1,
                sigma=sigma1, lambda=lambda1, w=w1, model="LN", t.outl=0.5)
# Plot ypred vs Y1
sel.pairs(cbind(ypred[,1,drop=FALSE], ex1.data[,"Y1",drop=FALSE]),
          outl=ypred[,"outlier"])
Computes the score function and identifies influential errors
sel.edit (y, ypred, wgt=rep(1,nrow(as.matrix(y ))), tot=colSums(ypred * wgt), t.sel=0.01)
y | matrix or data frame containing the response variables |
ypred | matrix of predicted values for y variables |
wgt | optional vector of sampling weights (default=1) |
tot | optional vector containing reference estimates of totals for the y variables. If omitted, it is computed as the (possibly weighted) sum of predicted values |
t.sel | optional vector of threshold values, one for each variable, for selective editing (default=0.01) |
This function ranks observations (rank) according to the importance of their potential errors. The ordering is based on the values of the global score function (global.score). The function also selects the units to be edited (sel) so that the expected residual error of all variables is below a prefixed level of accuracy (t.sel).
The global score (global.score) is the maximum of the local scores computed for each variable (y1.score, y2.score, ...). The local scores are defined as a weighted (weights) absolute difference between the observed (y1, y2, ...) and the predicted values (y1.p, y2.p, ...), standardised with respect to the reference total estimates (tot).
The units to be edited because they are affected by an influential error (sel=1) are selected according to a two-step algorithm:
1) order the observations with respect to the global.score (decreasing order);
2) select the first k units such that, from the (k+1)th to the last observation, all the residual errors (y1.reserr, y2.reserr, ...) for each variable are below t.sel.
The function also provides an indicator function (y1.sel, y2.sel, ...) reporting which variables contain an influential error in a unit selected for revision.
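The scoring and the two-step selection can be sketched in base R as follows. This is not the package code. In particular, the residual error after editing the first k ranked units is assumed here to be the absolute sum of the weighted differences between predicted and observed values over the remaining units, divided by tot; the function name sel_edit_sketch and the reduced set of output columns are also choices made for the example.

```r
# Hypothetical re-implementation sketch (not SeleMix code).
sel_edit_sketch <- function(y, ypred, wgt = rep(1, nrow(as.matrix(y))),
                            tot = colSums(as.matrix(ypred) * wgt),
                            t.sel = 0.01) {
  y <- as.matrix(y); ypred <- as.matrix(ypred)
  n <- nrow(y); p <- ncol(y)
  t.sel <- rep(t.sel, length.out = p)

  d      <- wgt * (ypred - y)                    # weighted expected errors (n x p)
  score  <- sweep(abs(d), 2, abs(tot), "/")      # local scores
  global <- apply(score, 1, max)                 # global score
  ord    <- order(global, decreasing = TRUE)     # step 1: ranking

  # Assumed residual error after editing the first k ranked units, k = 0..n
  d_ord <- d[ord, , drop = FALSE]
  csum  <- matrix(apply(d_ord, 2, cumsum), nrow = n)
  total <- colSums(d_ord)
  res   <- abs(rbind(total, sweep(-csum, 2, total, "+")))
  res   <- sweep(res, 2, abs(tot), "/")

  # Step 2: smallest k such that every later residual error is below t.sel
  ok <- apply(sweep(res, 2, t.sel, "<"), 1, all)
  k  <- if (all(ok)) 0L else max(which(!ok))

  rank <- integer(n); rank[ord] <- seq_len(n)
  sel  <- integer(n); sel[ord[seq_len(k)]] <- 1L
  data.frame(global.score = global, rank = rank, sel = sel)
}

# Toy data: only the last unit deviates from its prediction
out <- sel_edit_sketch(y = c(10, 10, 10, 100), ypred = c(10, 10, 10, 10))
out
```

In the toy call the last unit has the only non-zero score, is ranked first, and is the only one selected, because after editing it the assumed residual error drops to zero.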
sel.edit returns a data matrix containing the following columns:
y1, y2, ... | observed variables |
y1.p, y2.p, ... | predictions of y variables |
weights | sampling weights |
y1.score, y2.score, ... | local scores |
global.score | global score |
y1.reserr, y2.reserr, ... | residual errors |
y1.sel, y2.sel, ... | influential error flags |
rank | rank according to global score |
sel | 1 if the observation contains an influential error, 0 otherwise |
M. Teresa Buglielli, Ugo Guarnera
Di Zio, M., Guarnera, U. (2013) "A Contamination Model for Selective Editing",
Journal of Official Statistics. Volume 29, Issue 4, Pages 539-555 (http://dx.doi.org/10.2478/jos-2013-0039).
Buglielli, M.T., Di Zio, M., Guarnera, U. (2010) "Use of Contamination Models for Selective Editing", European Conference on Quality in Survey Statistics Q2010, Helsinki, 4-6 May 2010.
# Example 1
# Parameter estimation with one contaminated variable and one covariate
data(ex1.data)
ml.par <- ml.est(y=ex1.data[,"Y1"], x=ex1.data[,"X1"])
# Detection of influential errors
sel <- sel.edit(y=ex1.data[,"Y1"], ypred=ml.par$ypred)
head(sel)
sum(sel[,"sel"])
# orders results for decreasing importance of score
sel.ord <- sel[order(sel[,"rank"]), ]
# adds columns to data
ex1.data <- cbind(ex1.data, tau=ml.par$tau, outlier=ml.par$outlier,
                  sel[,c("rank", "sel")])
# plot of data with outliers and influential errors
sel.pairs(ex1.data[,c("X1","Y1")], outl=ml.par$outlier, sel=sel[,"sel"])
# Example 2
data(ex2.data)
par.joint <- ml.est(y=ex2.data)
sel <- sel.edit(y=ex2.data, ypred=par.joint$ypred)
sel.pairs(ex2.data, outl=par.joint$outlier, sel=sel[,"sel"])
A scatterplot matrix with outliers and influential errors highlighted is produced.
sel.pairs(x, outl = rep(0, nrow(x)), sel = rep(0,nrow(x)), labs = NULL, log = TRUE, legend=TRUE, title=NULL)
x | matrix or data frame of the coordinates of points |
outl | vector identifying outliers (1 or TRUE means outlier) |
sel | vector identifying influential errors (1 or TRUE means influential error) |
labs | names of the variables |
log | if TRUE, the logarithm of the data is plotted |
legend | if TRUE a legend is added to first boxplot |
title | an overall title for the plot |
The ijth scatterplot contains x[,i] plotted against x[,j]. Outliers are represented as blue circles, influential errors as red circles and points that are both outlier and influential error as cyan circles.
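The colour rule can be mimicked in base R with a small helper that turns the two flag vectors into a colour vector. The helper name flag_colours and the grey colour used for ordinary points are assumptions; only the blue, red and cyan assignments come from the description above.

```r
# Hypothetical helper: colour for each point from the two 0/1 (or logical) flags.
flag_colours <- function(outl, sel, base = "grey40") {
  outl <- as.logical(outl)
  sel  <- as.logical(sel)
  ifelse(outl & sel, "cyan",
         ifelse(sel, "red",
                ifelse(outl, "blue", base)))
}

outl <- c(0, 1, 0, 1)
sel  <- c(0, 0, 1, 1)
cols <- flag_colours(outl, sel)
# A call such as pairs(x, col = cols) would then draw a base R scatterplot
# matrix of a numeric matrix x using these colours.
```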
In addition to a standard scatterplot, outliers and influential errors are highlighted.
sel.plot (data, vars=1:2, outl = rep(0, nrow(data)), sel = rep(0, nrow(data)), log = TRUE, n.identify=0, file=NULL, title=NULL)
data | named matrix or data frame containing at least the coordinates of points |
vars | vector with the names or column numbers of the two variables to plot |
outl | vector identifying outliers (1 or TRUE means outlier) |
sel | vector identifying influential errors (1 or TRUE means influential error) |
log | if TRUE, the logarithm of the data is plotted |
n.identify | number of points to be identified on the scatterplot. Corresponding data are printed on console or file (if a file name is specified) |
file | name of the output file. If |
title | an overall title for the plot |
The scatterplot contains the first variable plotted against the second. Outliers are represented as blue circles, influential errors as red circles and points that are both outlier and influential error as cyan circles.
data(ex2.data)
par.joint <- ml.est(y=ex2.data)
sel <- sel.edit(y=ex2.data, ypred=par.joint$ypred)
sel.plot(ex2.data, outl=par.joint$outlier, sel=sel[,"sel"], title="EXAMPLE 2")
## Not run:
sel.plot(ex2.data, outl=par.joint$outlier, sel=sel[,"sel"], title="EXAMPLE 2", n.identify=3)
## End(Not run)