Title: | Statistical Matching or Data Fusion |
---|---|
Description: | Integration of two data sources referring to the same target population and sharing a number of variables. Some functions can also be used to impute missing values in data sets through hot deck imputation methods. Methods to perform statistical matching when dealing with data from complex sample surveys are available too. |
Authors: | Marcello D'Orazio |
Maintainer: | Marcello D'Orazio <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.4.2 |
Built: | 2024-11-10 06:43:50 UTC |
Source: | CRAN |
Functions to perform statistical matching (aka data fusion), i.e. the integration of two data sources. Some functions can also be used to impute missing values in data sets through hot deck imputation methods.
Statistical matching (aka data fusion) aims at integrating two data sources (typically samples) referring to the same target population and sharing a number of variables. The final objective is to study the relationship between variables not jointly observed in a single data source. The integration can be performed at the micro level (the output is a synthetic or fused file) or at the macro level (the goal is the estimation of correlation coefficients, regression coefficients, contingency tables, etc.). Nonparametric hot deck imputation methods (random, rank and nearest neighbour donor) can be used to derive the synthetic data set. Alternatively, it is possible to use a mixed (parametric-nonparametric) procedure based on predictive mean matching. Methods to perform statistical matching when dealing with data from complex sample surveys are available too. Finally, some functions can be used to explore the uncertainty due to the typical matching framework. For details see D'Orazio et al. (2006) or the package vignette.
Marcello D'Orazio
Maintainer: Marcello D'Orazio <[email protected]>
D'Orazio M., Di Zio M., Scanu M. (2006) Statistical Matching: Theory and Practice. Wiley, Chichester.
This function cross-tabulates two categorical variables, Y and Z, observed separately in two independent surveys (Y is collected in survey A and Z is collected in survey B) carried out on the same target population. The two surveys share a number of common variables X. When a third survey C, carried out on the same population and collecting both Y and Z, is available, these data are used as a source of auxiliary information.
The statistical matching is performed by carrying out calibration of the survey weights, as suggested in Renssen (1998).
The function can also be used to derive the estimated probabilities that a unit falls in the various categories of the target variables (estimates are based on Linear Probability Models and are obtained as a by-product of Renssen's method).
comb.samples(svy.A, svy.B, svy.C=NULL, y.lab, z.lab, form.x, estimation=NULL, micro=FALSE, ...)
svy.A |
A svydesign R object that stores the data collected in survey A and the main characteristics of the sampling design, as returned by the function svydesign in the package survey. |
svy.B |
A svydesign R object that stores the data collected in survey B and the main characteristics of the sampling design, as returned by the function svydesign in the package survey. |
svy.C |
A svydesign R object that stores the data collected in survey C and the main characteristics of the sampling design. When svy.C=NULL (default), the table of Y vs. Z is estimated under the Conditional Independence Assumption (see Details). When svy.C is available, it must include the Y and Z variables; the X variables are also needed when estimation="synthetic". |
y.lab |
A string providing the name of the Y variable, available in survey A and in survey C (if available). The Y variable is typically a categorical variable (factor or character). |
z.lab |
A string providing the name of the Z variable, available in survey B and in survey C (if available). The Z variable is typically a categorical variable (factor or character). |
form.x |
An R formula specifying the matching variables (a subset of the X variables) collected in all the surveys, and how they have to be considered in combining the samples (see harmonize.x for how marginal or joint distributions can be specified). Due to the calibration of the survey weights, it is preferable to work with categorical X variables. In some cases the procedure may be successful when a single continuous variable is considered jointly with one or more categorical variables but, often, it may be necessary to categorize the continuous variable (see Details). |
estimation |
A character string that identifies the method to be used to estimate the table of Y vs. Z when data from survey C are available. As suggested in Renssen (1998), two alternative methods are available: (i) Incomplete Two-Way Stratification (estimation="incomplete") and (ii) Synthetic Two-Way Stratification (estimation="synthetic"); see Details. |
micro |
Logical; when micro=TRUE the function also returns the data frames Z.A and Y.B with, respectively, the estimated probabilities of the categories of Z for the units in survey A and of Y for the units in survey B (see Details). |
... |
Further arguments that may be necessary for calibration, e.g. the calibration function (calfun) and bounds for the calibrated weights (bounds). The number of iterations used in calibration can be modified by using the argument maxit. See calibrate in the package survey for details. |
This function estimates the contingency table of Y vs. Z by performing a series of calibrations of the survey weights. In practice the estimation is carried out on data in survey C by exploiting all the information from surveys A and B. When survey C is not available the table of Y vs. Z is estimated under the Conditional Independence Assumption (CIA), i.e.

$$P(Y,Z) = \sum_{X} P(Y|X)\, P(Z|X)\, P(X)$$
When data from survey C are available (Renssen, 1998), the table of Y vs. Z can be estimated by Incomplete Two-Way Stratification (ITWS) or Synthetic Two-Way Stratification (STWS). In the first case (ITWS) the weights of the units in survey C are calibrated so that the new weights reproduce both the marginal distribution of Y estimated on survey A and that of Z estimated on survey B. Note that the distributions of the X variables in survey A and in survey B must be harmonized before performing ITWS (see harmonize.x).
The Synthetic Two-Way Stratification (STWS) estimates the table of Y vs. Z by considering also the X variables observed in C. This method consists in correcting the table of Y vs. Z estimated under the CIA according to the relationship between Y and Z observed in survey C (for more details see Renssen, 1998).
When the argument micro is set to TRUE the function also provides Z.A and Y.B. The first data frame has the same rows as svy.A and as many columns as the categories of the Z variable specified via z.lab; each row provides the estimated probabilities that the unit assumes a value in the various categories of Z. The same happens for Y.B, which reports the estimated probabilities of assuming a category of y.lab for each unit in B. The estimated probabilities are obtained by applying linear probability models (for more details see Renssen, 1998). Unfortunately, such models may provide estimated probabilities smaller than 0 or greater than 1; much caution should be used when employing such predictions for practical purposes.
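As an illustration of the caution above, here is a minimal sketch (the helper fix.probs is hypothetical, and out.4 refers to the object created with micro=TRUE in the Examples below) of one ad-hoc remedy: truncating the linear-probability predictions to [0, 1] and renormalising each row. This is not part of Renssen's method, only an optional post-processing step.

# hypothetical post-processing of the micro predictions
fix.probs <- function(p) {
  p <- pmin(pmax(p, 0), 1)   # truncate predictions to [0, 1]
  p / rowSums(p)             # renormalise so that each row sums to 1
}
p <- matrix(c(1.05, -0.03, -0.02,
              0.40,  0.35,  0.25), nrow = 2, byrow = TRUE)
fix.probs(p)
# e.g. Z.A.ok <- fix.probs(as.matrix(out.4$Z.A)), with out.4 from the Examples below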
An R list with the results of the calibration procedure according to the input arguments.
yz.CIA |
The table of Y (y.lab) vs. Z (z.lab) estimated under the Conditional Independence Assumption (CIA). |
cal.C |
The survey object svy.C after the calibration of the weights; returned only when svy.C is provided. |
yz.est |
The table of Y (y.lab) vs. Z (z.lab) estimated by exploiting the auxiliary information available in svy.C; returned only when svy.C is provided. |
Z.A |
Only when micro=TRUE: a data frame with the same rows as svy.A reporting, for each unit, the estimated probabilities of falling in the categories of the Z variable (z.lab). |
Y.B |
Only when micro=TRUE: a data frame with the same rows as svy.B reporting, for each unit, the estimated probabilities of falling in the categories of the Y variable (y.lab). |
call |
Stores the call to this function with the values specified for the various arguments. |
Marcello D'Orazio [email protected]
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
Renssen, R.H. (1998) “Use of Statistical Matching Techniques in Calibration Estimation”. Survey Methodology, 24, pp. 171–183.
calibrate, svydesign, harmonize.x
data(quine, package="MASS") # loads quine from MASS
str(quine)
quine$c.Days <- cut(quine$Days, c(-1, seq(0,20,10),100))
table(quine$c.Days)

# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(124)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, c("Eth","Sex","Age","Lrn")]
quine.B <- quine[-lab.A, c("Eth","Sex","Age","c.Days")]

# create svydesign objects
require(survey)
quine.A$f <- 70/nrow(quine) # sampling fraction
quine.B$f <- (nrow(quine)-70)/nrow(quine)
svy.qA <- svydesign(~1, fpc=~f, data=quine.A)
svy.qB <- svydesign(~1, fpc=~f, data=quine.B)

# Harmonization wrt the joint distribution
# of ('Sex' x 'Age' x 'Eth')
# vector of known population totals
# estimated from the full data set
# note the formula!
tot.m <- colSums(model.matrix(~Eth:Sex:Age-1, data=quine))
tot.m
out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, x.tot=tot.m,
                      form.x=~Eth:Sex:Age-1, cal.method="linear")

# estimation of 'Lrn' vs. 'c.Days' under the CIA
svy.qA.h <- out.hz$cal.A
svy.qB.h <- out.hz$cal.B
out.1 <- comb.samples(svy.A=svy.qA.h, svy.B=svy.qB.h, svy.C=NULL,
                      y.lab="Lrn", z.lab="c.Days", form.x=~Eth:Sex:Age-1)
out.1$yz.CIA
addmargins(out.1$yz.CIA)

# incomplete two-way stratification
# select a sample C from quine
# and define a survey object
suppressWarnings(RNGversion("3.5.0"))
set.seed(4321)
lab.C <- sample(nrow(quine), 50, replace=TRUE)
quine.C <- quine[lab.C, c("Lrn","c.Days")]
quine.C$f <- 50/nrow(quine) # sampling fraction
svy.qC <- svydesign(~1, fpc=~f, data=quine.C)

# call comb.samples
out.2 <- comb.samples(svy.A=svy.qA.h, svy.B=svy.qB.h, svy.C=svy.qC,
                      y.lab="Lrn", z.lab="c.Days", form.x=~Eth:Sex:Age-1,
                      estimation="incomplete", calfun="linear", maxit=100)
summary(weights(out.2$cal.C))
out.2$yz.est # estimated table of 'Lrn' vs. 'c.Days'
# difference wrt the table of 'Lrn' vs. 'c.Days' under the CIA
addmargins(out.2$yz.est)-addmargins(out.2$yz.CIA)

# synthetic two-way stratification
# only macro estimation
quine.C <- quine[lab.C, ]
quine.C$f <- 50/nrow(quine) # sampling fraction
svy.qC <- svydesign(~1, fpc=~f, data=quine.C)
out.3 <- comb.samples(svy.A=svy.qA.h, svy.B=svy.qB.h, svy.C=svy.qC,
                      y.lab="Lrn", z.lab="c.Days", form.x=~Eth:Sex:Age-1,
                      estimation="synthetic", calfun="linear",
                      bounds=c(.5,Inf), maxit=100)
summary(weights(out.3$cal.C))
out.3$yz.est # estimated table of 'Lrn' vs. 'c.Days'
# difference wrt the table of 'Lrn' vs. 'c.Days' under the CIA
addmargins(out.3$yz.est)-addmargins(out.3$yz.CIA)
# diff wrt the table of 'Lrn' vs. 'c.Days' under incomplete 2ws
addmargins(out.3$yz.est)-addmargins(out.2$yz.CIA)

# synthetic two-way stratification
# with micro predictions
out.4 <- comb.samples(svy.A=svy.qA.h, svy.B=svy.qB.h, svy.C=svy.qC,
                      y.lab="Lrn", z.lab="c.Days", form.x=~Eth:Sex:Age-1,
                      estimation="synthetic", micro=TRUE, calfun="linear",
                      bounds=c(.5,Inf), maxit=100)
head(out.4$Z.A)
head(out.4$Y.B)
This function estimates the “closeness” of the distributions of the same continuous variable(s) estimated from different data sources.
comp.cont(data.A, data.B, xlab.A, xlab.B = NULL, w.A = NULL, w.B = NULL, ref = FALSE)
data.A |
A dataframe or matrix containing the variable of interest |
data.B |
A dataframe or matrix containing the variable of interest |
xlab.A |
Character string providing the name of the variable of interest in data.A. |
xlab.B |
Character string providing the name of the variable of interest in data.B. When xlab.B=NULL (default) it is assumed that the variable has the same name as in data.A (xlab.B=xlab.A). |
w.A |
Character string providing the name of the optional weighting variable in data.A with the units' survey weights; by default (w.A=NULL) no weights are used. |
w.B |
Character string providing the name of the optional weighting variable in data.B with the units' survey weights; by default (w.B=NULL) no weights are used. |
ref |
Logical. When ref=TRUE the distribution estimated from data.B is considered as the reference distribution (see Details). Default is ref=FALSE. |
As a first output, the function returns some well–known summary measures (min, Q1, median, mean, Q3, max and sd) estimated from the available input data sources.
Secondly, this function performs a comparison between the quantiles estimated from data.A and data.B; in particular, the average of the absolute values of the differences as well as the average of the squared differences are returned. The number of estimated percentiles depends on the minimum of the two sample sizes: only quartiles are calculated when min(n.A, n.B) <= 50; quintiles are estimated when 50 < min(n.A, n.B) <= 150; deciles are estimated when 150 < min(n.A, n.B) <= 250; finally, quantiles for probs=seq(from=0.05, to=0.95, by=0.05) are estimated when min(n.A, n.B) > 250. When the survey weights are available (indicated with the arguments w.A and/or w.B), they are used in estimating the quantiles by calling the function wtd.quantile in the package Hmisc.
The function also estimates the dissimilarities between the estimated empirical cumulative distribution functions. The measures considered are the maximum of the absolute differences, the sum of the maximum differences obtained by inverting the terms in the difference, and the average of the absolute values of the differences. When the weights are provided, they are used in estimating the empirical cumulative distribution functions. Note that when ref=TRUE the estimation of the density and of the empirical cumulative distribution function is guided by the data in data.B.

The final output is the total variation distance, the overlap and the Hellinger distance calculated on the transformed categorized variable. The breaks used to categorize the variable are decided according to the Freedman-Diaconis rule (nclass); in this case, when ref=TRUE the IQR is estimated solely on data.B, whereas with ref=FALSE it is estimated by joining the two data sources. When present, the weights are used in estimating the relative frequencies of the categorized variable.

For additional details on these distances please see comp.prop.
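The categorization step can be illustrated as follows (a sketch of the idea for ref=FALSE, not necessarily the internal code of comp.cont):

data(samp.A)
data(samp.B)
x.A <- samp.A$age
x.B <- samp.B$age
# number of classes via the Freedman-Diaconis rule on the pooled data (ref=FALSE)
k  <- nclass.FD(c(x.A, x.B))
br <- seq(min(x.A, x.B), max(x.A, x.B), length.out = k + 1)
c.A <- cut(x.A, breaks = br, include.lowest = TRUE)
c.B <- cut(x.B, breaks = br, include.lowest = TRUE)
# total variation distance, overlap and Hellinger distance on the categorized variable
comp.prop(p1 = table(c.A), p2 = table(c.B),
          n1 = length(x.A), n2 = length(x.B), ref = FALSE)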
A list object with four components.
summary |
A matrix with the summaries (min, Q1, median, mean, Q3, max and sd) of the variable of interest estimated from data.A and data.B. |
diff.Qs |
Average of the absolute and of the squared differences between the quantiles of the variable of interest estimated from data.A and the corresponding ones estimated from data.B. |
dist.ecdf |
Dissimilarity measures between the estimated empirical cumulative distribution functions. |
dist.discr |
Distance between the distributions after discretization of the target variable. |
Marcello D'Orazio [email protected]
Bellhouse D.R. and J. E. Stafford (1999). “Density Estimation from Complex Surveys”. Statistica Sinica, 9, 407–424.
data(samp.A)
data(samp.B)

comp.cont(data.A = samp.A, data.B = samp.B, xlab.A = "age")

comp.cont(data.A = samp.A, data.B = samp.B, xlab.A = "age",
          w.A = "ww", w.B = "ww")
This function compares two (estimated) distributions of the same categorical variable(s).
comp.prop(p1, p2, n1, n2=NULL, ref=FALSE)
p1 |
A vector or an array containing relative or absolute frequencies for one or more categorical variables. Usually it is the output of the function table or xtabs. |
p2 |
A vector or an array containing relative or absolute frequencies for one or more categorical variables. Usually it is the output of the function table or xtabs. When ref=TRUE, p2 is the reference (true or expected) distribution. |
n1 |
The size of the sample on which p1 has been estimated. |
n2 |
The size of the sample on which p2 has been estimated; it should be set to n2=NULL (default) when ref=TRUE, i.e. when p2 is not estimated from a sample but is a reference distribution. |
ref |
Logical. When ref=TRUE, p2 is considered as the reference (true or expected) distribution and n2 is not required (see Details). |
This function computes some similarity or dissimilarity measures between the marginal (joint) distributions of categorical variable(s). The following measures are considered:

Dissimilarity index or total variation distance:

$$\Delta_{12} = \frac{1}{2} \sum_{j=1}^{J} \left| p_{1,j} - p_{2,j} \right|$$

where $p_{s,j}$ are the relative frequencies ($0 \leq p_{s,j} \leq 1$). The dissimilarity index ranges from 0 (minimum dissimilarity) to 1. It can be interpreted as the smallest fraction of units that need to be reclassified in order to make the distributions equal. When p2 is the reference distribution (true or expected distribution under a given hypothesis), then, following Agresti's rule of thumb (Agresti 2002, pp. 329–330), values of $\Delta_{12} \leq 0.03$ denote that the estimated distribution p1 follows the true or expected pattern quite closely.

Overlap between two distributions:

$$O_{12} = \sum_{j=1}^{J} \min\left( p_{1,j},\, p_{2,j} \right)$$

It is a measure of similarity which ranges from 0 to 1 (the distributions are equal). It is worth noting that $O_{12} = 1 - \Delta_{12}$.

Bhattacharyya coefficient:

$$B_{12} = \sum_{j=1}^{J} \sqrt{p_{1,j} \times p_{2,j}}$$

It is a measure of similarity and ranges from 0 to 1 (the distributions are equal).

Hellinger's distance:

$$d_{H,12} = \sqrt{1 - B_{12}}$$

It is a dissimilarity measure ranging from 0 (distributions are equal) to 1 (maximum dissimilarity). It satisfies all the properties of a distance measure ($0 \leq d_{H,12} \leq 1$; symmetry and triangle inequality). Hellinger's distance is related to the dissimilarity index, and it is possible to show that:

$$d_{H,12}^{2} \leq \Delta_{12} \leq d_{H,12} \sqrt{2}$$

Alongside these similarity/dissimilarity measures, the Pearson's Chi-square is computed. Two formulas are considered. When p2 is the reference distribution (true or expected under some hypothesis, ref=TRUE):

$$\chi^{2}_{P} = n_1 \sum_{j=1}^{J} \frac{\left( p_{1,j} - p_{2,j} \right)^{2}}{p_{2,j}}$$

When p2 is a distribution estimated on a second sample then:

$$\chi^{2}_{P} = \sum_{i=1}^{2} n_i \sum_{j=1}^{J} \frac{\left( p_{i,j} - p_{+,j} \right)^{2}}{p_{+,j}}$$

where $p_{+,j}$ is the expected relative frequency for category j, obtained as follows:

$$p_{+,j} = \frac{n_1\, p_{1,j} + n_2\, p_{2,j}}{n_1 + n_2}$$

being $n_1$ and $n_2$ the sizes of the samples.

The Chi-square value can be used to test the hypothesis that the two distributions are equal ($df = J - 1$). Unfortunately, such a test would not be useful when the distributions are estimated from samples selected from a finite population using complex selection schemes (stratification, clustering, etc.). In such cases different corrected Chi-square tests are available (cf. Sarndal et al., 1992, Sec. 13.5). One possibility consists in dividing the Pearson's Chi-square statistic by the generalised design effect of both the surveys. Its estimation is not straightforward (sampling design variables need to be available). Generally speaking, the generalised design effect is smaller than 1 in the presence of stratified random sampling designs, while it exceeds 1 in the presence of a two-stage cluster sampling design. For the purposes of analysis, the function reports the value of the generalised design effect g that would determine the acceptance of the null hypothesis of equality of the two distributions at the significance level $\alpha = 0.05$ ($df = J - 1$), i.e. values of g such that

$$\frac{\chi^{2}_{P}}{g} \leq \chi^{2}_{J-1,\,0.05}$$
A list object with two or three components, depending on the argument ref.
meas |
A vector with the measures of similarity/dissimilarity between the distributions: the dissimilarity index (total variation distance), the overlap, the Bhattacharyya coefficient and the Hellinger's distance. |
chi.sq |
A vector with the Pearson's Chi-square statistic and related quantities (degrees of freedom and the threshold value of the generalised design effect g discussed in the Details). |
p.exp |
When ref=FALSE, the expected relative frequencies $p_{+,j}$ used in computing the Chi-square statistic (see Details). |
Marcello D'Orazio [email protected]
Agresti A (2002) Categorical Data Analysis. Second Edition. Wiley, New York.
Sarndal CE, Swensson B, Wretman JH (1992) Model Assisted Survey Sampling. Springer–Verlag, New York.
data(quine, package="MASS") # loads quine from MASS
str(quine)

# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(124)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, c("Eth","Sex","Age")]
quine.B <- quine[-lab.A, c("Eth","Sex","Age")]

# compare est. distributions from 2 samples
# 1 variable
tt.A <- xtabs(~Age, data=quine.A)
tt.B <- xtabs(~Age, data=quine.B)
comp.prop(p1=tt.A, p2=tt.B, n1=nrow(quine.A), n2=nrow(quine.B), ref=FALSE)

# joint distr. of more variables
tt.A <- xtabs(~Eth+Sex+Age, data=quine.A)
tt.B <- xtabs(~Eth+Sex+Age, data=quine.B)
comp.prop(p1=tt.A, p2=tt.B, n1=nrow(quine.A), n2=nrow(quine.B), ref=FALSE)

# compare an est. distr. with one considered as reference
tt.A <- xtabs(~Eth+Sex+Age, data=quine.A)
tt.all <- xtabs(~Eth+Sex+Age, data=quine)
comp.prop(p1=tt.A, p2=tt.all, n1=nrow(quine.A), n2=NULL, ref=TRUE)
Creates a synthetic data frame after the statistical matching of two data sources at micro level.
create.fused(data.rec, data.don, mtc.ids, z.vars, dup.x=FALSE, match.vars=NULL)
data.rec |
A matrix or data frame that plays the role of recipient in the statistical matching application. |
data.don |
A matrix or data frame that plays the role of donor in the statistical matching application. |
mtc.ids |
A matrix with two columns. Each row must contain the name or the index of the recipient record (row) in data.rec and the name or the index of the corresponding donor record (row) in data.don. Typically it is the mtc.ids component of the output of one of the hot deck functions (e.g. NND.hotdeck, see Examples). |
z.vars |
A character vector with the names of the variables available only in data.don that should be “donated” to data.rec. |
dup.x |
Logical. When dup.x=TRUE the values of the matching variables match.vars observed on the donor records are also added to the output synthetic data set. The default is dup.x=FALSE. |
match.vars |
A character vector with the names of the matching variables. It has to be specified only when dup.x=TRUE. |
This function creates the synthetic (or fused) data set after the application of statistical matching at the micro level. For details see D'Orazio et al. (2006).
The data frame data.rec with the z.vars filled in and, when dup.x=TRUE, with the values of the matching variables match.vars observed on the donor records.
Marcello D'Orazio [email protected]
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
NND.hotdeck, RANDwNND.hotdeck, rankNND.hotdeck
lab <- c(1:15, 51:65, 101:115)
iris.rec <- iris[lab, c(1:3,5)]    # recipient data.frame
iris.don <- iris[-lab, c(1:2,4:5)] # donor data.frame

# Now iris.rec and iris.don have the variables
# "Sepal.Length", "Sepal.Width" and "Species" in common.
# "Petal.Length" is available only in iris.rec
# "Petal.Width" is available only in iris.don

# find the closest donors using NND hot deck;
# distances are computed on "Sepal.Length" and "Sepal.Width"
out.NND <- NND.hotdeck(data.rec=iris.rec, data.don=iris.don,
                       match.vars=c("Sepal.Length", "Sepal.Width"),
                       don.class="Species")

# create the synthetic data set, without the
# duplication of the matching variables
fused.0 <- create.fused(data.rec=iris.rec, data.don=iris.don,
                        mtc.ids=out.NND$mtc.ids, z.vars="Petal.Width")

# create the synthetic data set, with the "duplication"
# of the matching variables
fused.1 <- create.fused(data.rec=iris.rec, data.don=iris.don,
                        mtc.ids=out.NND$mtc.ids, z.vars="Petal.Width",
                        dup.x=TRUE, match.vars=c("Sepal.Length", "Sepal.Width"))
Imputes the missing values (NAs) in the recipient data set with values observed on the donor units, after the search of donors via nearest neighbour or random hot deck.
create.imputed(data.rec, data.don, mtc.ids)
data.rec |
A matrix or data frame that has missing values. |
data.don |
A matrix or data frame that is used for donation (imputation). |
mtc.ids |
A matrix with two columns. Each row must contain the name or the index of the recipient record (row) in data.rec and the name or the index of the corresponding donor record (row) in data.don. Typically it is the mtc.ids component of the output of one of the hot deck functions (see Details and Examples). |
This function fills in the missing values (NAs) in the recipient data set with values observed in the donor data set, after the search of donors via the hot deck functions available in the package, i.e. NND.hotdeck, RANDwNND.hotdeck, rankNND.hotdeck, and mixed.mtc.
When the same record in the recipient data set presents two or more NAs, they will all be replaced with values observed on the donor chosen for that unit; this corresponds to joint hot deck imputation.
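The joint behaviour can be illustrated along the lines of the Example below (the choice of variables here is purely illustrative): records with NAs on both Petal.Length and Sepal.Length receive both values from the same donor.

set.seed(13579)
pos <- sample(1:nrow(iris), 15)
iris.rec <- iris[pos, ]
iris.rec[, c("Petal.Length", "Sepal.Length")] <- NA   # two variables with missing values
iris.don <- iris[-pos, ]
out.NND <- NND.hotdeck(data.rec = iris.rec, data.don = iris.don,
                       match.vars = c("Petal.Width", "Sepal.Width"),
                       don.class = "Species")
imp <- create.imputed(data.rec = iris.rec, data.don = iris.don,
                      mtc.ids = out.NND$mtc.ids)
head(imp[, c("Sepal.Length", "Petal.Length")])   # both filled in from the same donor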
The data frame data.rec with the missing values (NAs) filled in.
Marcello D'Orazio [email protected]
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
NND.hotdeck, RANDwNND.hotdeck, rankNND.hotdeck
# introduce missing values in Petal.Length of the iris dataset
set.seed(13579)
pos <- sample(x = 1:nrow(iris), size = 15, replace = FALSE)
iris.rec <- iris[pos, ]   # recipient data.frame with missing values
iris.rec[, "Petal.Length"] <- NA
iris.don <- iris[-pos, ]  # donor data.frame, ALL values observed

# find the closest donors using NND hot deck;
# distances are computed on "Petal.Width"
# donors only of the same Species
out.NND <- NND.hotdeck(data.rec=iris.rec, data.don=iris.don,
                       match.vars=c("Petal.Width"), don.class="Species")

# impute missing values
iris.rec.imp <- create.imputed(data.rec=iris.rec, data.don=iris.don,
                               mtc.ids=out.NND$mtc.ids)
summary(iris.rec.imp$Petal.Length)
Transforms one or more factors contained in a data frame into a set of dummy variables, while numeric variables remain unchanged.
fact2dummy(data, all=TRUE, lab="x")
data |
A factor or a data frame that contains one or more factors (columns whose class is “factor” or “ordered”) that have to be substituted by the corresponding dummy variables. |
all |
Logical. When all=TRUE (default) a dummy variable is derived for each level of the factor; when all=FALSE, the dummy variable corresponding to the last level of the factor is dropped. |
lab |
A character string with the name of the variable to be pasted with its levels. This is used only when data is a single factor not stored in a data frame (default "x"). |
This function substitutes the categorical variables in the input data frame (columns whose class is “factor” or “ordered”) with the corresponding dummy variables. Note that if a factor includes missing values (NA) then all the associated dummies will report an NA for the corresponding observation (row).
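A quick illustration of the NA propagation mentioned above (a small sketch in the spirit of the Example below):

x <- runif(5)
y <- factor(c(1, 2, NA, 2, 2))
fact2dummy(data.frame(x, y))   # the dummies derived from y show NA in the third row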
A matrix with the dummy variables instead of initial factor variables.
Marcello D'Orazio [email protected]
x <- runif(5)
y <- factor(c(1,2,1,2,2))
z <- ordered(c(1,2,3,2,2))
xyz <- data.frame(x,y,z)
fact2dummy(xyz)
fact2dummy(xyz, all=FALSE)

# example with the iris data frame
str(iris)
ir.mat <- fact2dummy(iris)
head(ir.mat)
This function computes the bounds for the cell probabilities in the contingency table Y vs. Z starting from the marginal tables (X vs. Y), (X vs. Z) and the joint distribution of the X variables, by considering all the possible subsets of the X variables. In this manner it is possible to identify which subset of the X variables produces the greatest reduction of the average width of the conditional bounds.
Fbwidths.by.x(tab.x, tab.xy, tab.xz, deal.sparse="discard", nA=NULL, nB=NULL, ...)
tab.x |
An R table crossing the X variables, e.g. obtained with xtabs or table. |
tab.xy |
An R table of X vs. Y, e.g. obtained with xtabs or table. A single categorical Y variable is allowed; one or more categorical variables can be considered as X variables (common variables). The same X variables in tab.x must appear, in the same order. |
tab.xz |
An R table of X vs. Z, e.g. obtained with xtabs or table. A single categorical Z variable is allowed; one or more categorical variables can be considered as X variables (common variables). The same X variables in tab.x must appear, in the same order. |
deal.sparse |
Text, how to estimate the cell relative frequencies when dealing with too sparse tables. With the default deal.sparse="discard", the bounds are not estimated for those combinations of the X variables producing tables that are considered too sparse (see Details). |
nA |
Integer, the size of sample A used to estimate tab.xy. |
nB |
Integer, the size of sample B used to estimate tab.xz. |
... |
Additional arguments that may be required when deriving an estimate of uncertainty by calling Frechet.bounds.cat. |
This function computes the Frechet bounds for the frequencies in the contingency table of Y vs. Z, starting from the conditional distributions P(Y|X) and P(Z|X) (for details see Frechet.bounds.cat), by considering all the possible subsets of the X variables. In this manner it is possible to identify the subset of the X variables, with the highest association with both Y and Z, that permits to reduce the uncertainty concerning the distribution of Y vs. Z.

The uncertainty is measured by the average of the widths of the bounds for the cells in the table Y vs. Z:

$$\bar{d} = \frac{1}{J \times K} \sum_{j,k} \left( p^{up}_{Y=j,Z=k} - p^{low}_{Y=j,Z=k} \right)$$

where J and K are the number of categories of Y and Z, respectively. For details see Frechet.bounds.cat.
The uncertainty, measured in terms of $\bar{d}$, tends to decrease as the number of X variables being conditioned on increases. Two penalties are therefore introduced to account for the additional number of cells to be estimated when adding an X variable. The first penalty, introduced in D'Orazio et al. (2017), is:

$$g_1 = \log\left( 1 + \frac{H_{D_m}}{H_{D_Q}} \right)$$

where $H_{D_m}$ is the number of cells in the table obtained by crossing the given subset of the X variables and $H_{D_Q}$ is the number of cells in the table obtained by crossing all the available X variables.

A second penalty takes into account the number of cells to be estimated with respect to the sample sizes (D'Orazio et al., 2019); it compares the number of cells to be estimated from each sample, $H_{D_m} \times J$ and $H_{D_m} \times K$, with the corresponding sample sizes $n_A$ and $n_B$. This criterion is considered to measure sparseness too. In particular, for the purposes of this function, tables are NOT considered sparse when:

$$\min\left( \frac{n_A}{H_{D_m} \times J},\; \frac{n_B}{H_{D_m} \times K} \right) > 1$$

This rule is applied when deciding how to proceed with the estimation in the case of sparse tables (argument deal.sparse).

Note that sparseness can be measured in different manners. The outputs also include the number of empty cells in each table (due to statistical or structural zeros) and the Cohen's effect size with respect to the case of a uniform distribution of the frequencies across cells (the value 1/no.of.cells in every cell):

$$\omega = \sqrt{ \sum_{h=1}^{H} \frac{\left( p_h - 1/H \right)^{2}}{1/H} }$$

where $p_h$ is the observed relative frequency in cell h and H is the total number of cells; large values of the effect size, jointly with a small average number of units per cell, usually indicate severe sparseness.
A list with the estimated bounds for the cells in the table of Y vs. Z for each possible subset of the X variables. The final component in the list, sum.unc, is a data.frame that summarizes the main results. In particular, it reports the number of X variables ("x.vars"), the number of cells in each of the input tables and the number of cells with frequency equal to 0 (columns ending with "freq0"). Moreover, it reports the value ("av.n") of the rule used to decide whether we are dealing with a sparse case (see Details) and the Cohen's effect size measured for the table crossing the considered combination of the X variables. Finally, it provides the average width of the uncertainty intervals ("av.width"), the penalty terms g1 and g2 ("penalty1" and "penalty2" respectively), and the penalized average widths ("av.width.pen1" and "av.width.pen2", where av.width.pen1=av.width+pen1 and av.width.pen2=av.width+pen2).
Marcello D'Orazio [email protected]
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
D'Orazio, M., Di Zio, M. and Scanu, M. (2017). “The use of uncertainty to choose matching variables in statistical matching”. International Journal of Approximate Reasoning, 90, pp. 433–440.
D'Orazio, M., Di Zio, M. and Scanu, M. (2019). “Auxiliary variable selection in a statistical matching problem”. In Zhang, L.-C. and Chambers, R. L. (eds.) Analysis of Integrated Data, Chapman & Hall/CRC (Forthcoming).
Frechet.bounds.cat, harmonize.x
data(quine, package="MASS") # loads quine from MASS
str(quine)
quine$c.Days <- cut(quine$Days, c(-1, seq(0,50,10),100))
table(quine$c.Days)

# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(4567)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, 1:4]
quine.B <- quine[-lab.A, c(1:3,6)]

# compute the tables required by Fbwidths.by.x()
freq.xA <- xtabs(~Eth+Sex+Age, data=quine.A)
freq.xB <- xtabs(~Eth+Sex+Age, data=quine.B)
freq.xy <- xtabs(~Eth+Sex+Age+Lrn, data=quine.A)
freq.xz <- xtabs(~Eth+Sex+Age+c.Days, data=quine.B)

# apply Fbwidths.by.x()
bounds.yz <- Fbwidths.by.x(tab.x=freq.xA+freq.xB, tab.xy=freq.xy,
                           tab.xz=freq.xz)
bounds.yz$sum.unc
This function derives the bounds for the cell probabilities of the table Y vs. Z starting from the marginal tables (X vs. Y), (X vs. Z) and the joint distribution of the X variables.
Frechet.bounds.cat(tab.x, tab.xy, tab.xz, print.f="tables", align.margins = FALSE, tol= 0.001, warn = TRUE)
tab.x |
An R table crossing the X variables, e.g. obtained with xtabs or table. When tab.x=NULL only the unconditional bounds for the cells of Y vs. Z are derived (see Details). |
tab.xy |
An R table of X vs. Y, e.g. obtained with xtabs or table. A single categorical Y variable is allowed; one or more categorical variables can be considered as X variables (common variables). Obviously, the same X variables in tab.x must appear, in the same order. When tab.x=NULL, then tab.xy should provide just the marginal distribution of Y. |
tab.xz |
An R table of X vs. Z, e.g. obtained with xtabs or table. A single categorical Z variable is allowed; one or more categorical variables can be considered as X variables (common variables). The same X variables in tab.x must appear, in the same order. When tab.x=NULL, then tab.xz should provide just the marginal distribution of Z. |
print.f |
A string: when print.f="tables" (default) the output is organised as a list of tables; with print.f="data.frame" the estimated bounds are arranged in a single data frame (see Value). |
align.margins |
Logical (default FALSE). When TRUE the marginal distributions of the X variables in tab.xy and tab.xz are aligned with the distribution derived from tab.x before computing the bounds (see Details and Examples). |
tol |
Tolerance used in comparing the joint distributions of the X variables in the input tables (default tol=0.001). |
warn |
Logical; when warn=TRUE (default) a warning message is issued if the distributions of the X variables in the input tables show discrepancies greater than tol. |
This function computes the expected conditional Frechet bounds for the relative frequencies in the contingency table of Y vs. Z, starting from the distributions P(Y|X), P(Z|X) and P(X). The expected conditional bounds for the relative frequencies in the table Y vs. Z are:

$$p^{low}_{Y=j,Z=k} = \sum_{i} p_{X=i}\, \max\left\{ 0;\; p_{Y=j|X=i} + p_{Z=k|X=i} - 1 \right\}$$

$$p^{up}_{Y=j,Z=k} = \sum_{i} p_{X=i}\, \min\left\{ p_{Y=j|X=i};\; p_{Z=k|X=i} \right\}$$

The relative frequencies $p_{X=i}$ are computed from the frequencies in tab.x; the conditional relative frequencies $p_{Y=j|X=i}$ are derived from tab.xy; finally, the $p_{Z=k|X=i}$ are derived from tab.xz.

Estimation requires that all the starting tables share the same marginal distribution of the X variables.

This function also returns the unconditional bounds for the relative frequencies in the contingency table of Y vs. Z, i.e. those computed without considering the X variables:

$$\max\left\{ 0;\; p_{Y=j} + p_{Z=k} - 1 \right\} \leq p_{Y=j,Z=k} \leq \min\left\{ p_{Y=j};\; p_{Z=k} \right\}$$

These bounds represent the unique output when tab.x = NULL.

Finally, the contingency table of Y vs. Z estimated under the Conditional Independence Assumption (CIA) is obtained by considering:

$$p_{Y=j,Z=k} = \sum_{i} p_{Y=j|X=i}\, p_{Z=k|X=i}\, p_{X=i}$$

When tab.x = NULL the expected table under the assumption of independence between Y and Z is provided instead:

$$p_{Y=j,Z=k} = p_{Y=j} \times p_{Z=k}$$
The presence of too many cells with zero frequency in the input contingency tables is an indication of sparseness; this is an unappealing situation when estimating the cells' relative frequencies needed to derive the bounds, and in such cases the corresponding results may be unreliable. A possible alternative consists in estimating the required parameters with a pseudo-Bayes estimator (see pBayes); in practice the input tab.x, tab.xy and tab.xz should then be the ones provided by the pBayes function.
When print.f="tables" (default) a list with the following components:
low.u |
The estimated lower bounds for the relative frequencies in the table Y vs. Z without conditioning on the X variables. |
up.u |
The estimated upper bounds for the relative frequencies in the table Y vs. Z without conditioning on the X variables. |
CIA |
The estimated relative frequencies in the table Y vs. Z under the Conditional Independence Assumption (CIA). |
low.cx |
The estimated lower bounds for the relative frequencies in the table Y vs. Z when conditioning on the X variables. |
up.cx |
The estimated upper bounds for the relative frequencies in the table Y vs. Z when conditioning on the X variables. |
uncertainty |
The uncertainty associated to input data, measured in terms of average width of uncertainty bounds with and without conditioning on the X variables. |
When print.f="data.frame" the output list contains just two components:
bounds |
A data.frame whose columns reports the estimated uncertainty bounds. |
uncertainty |
The uncertainty associated to input data, measured in terms of average width of uncertainty bounds with and without conditioning on the X variables. |
Marcello D'Orazio [email protected]
D'Orazio, M., Di Zio, M. and Scanu, M. (2006) “Statistical Matching for Categorical Data: Displaying Uncertainty and Using Logical Constraints”, Journal of Official Statistics, 22, pp. 137–157.
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
data(quine, package="MASS") # loads quine from MASS
str(quine)

# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(7654)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, 1:3]
quine.B <- quine[-lab.A, 2:4]

# compute the tables required by Frechet.bounds.cat()
freq.xA <- xtabs(~Sex+Age, data=quine.A)
freq.xB <- xtabs(~Sex+Age, data=quine.B)
freq.xy <- xtabs(~Sex+Age+Eth, data=quine.A)
freq.xz <- xtabs(~Sex+Age+Lrn, data=quine.B)

# apply Frechet.bounds.cat()
bounds.yz <- Frechet.bounds.cat(tab.x=freq.xA+freq.xB, tab.xy=freq.xy,
                                tab.xz=freq.xz, print.f="data.frame")
bounds.yz

# harmonize the distr. of Sex vs. Age during computations
# in Frechet.bounds.cat()
# compare marg. distribution of Xs in A and B vs. pooled estimate
comp.prop(p1=margin.table(freq.xy,c(1,2)), p2=freq.xA+freq.xB,
          n1=nrow(quine.A), n2=nrow(quine.A)+nrow(quine.B), ref=TRUE)
comp.prop(p1=margin.table(freq.xz,c(1,2)), p2=freq.xA+freq.xB,
          n1=nrow(quine.A), n2=nrow(quine.A)+nrow(quine.B), ref=TRUE)

bounds.yz <- Frechet.bounds.cat(tab.x=freq.xA+freq.xB, tab.xy=freq.xy,
                                tab.xz=freq.xz, print.f="data.frame",
                                align.margins=TRUE)
bounds.yz
This function computes the Gower's distance (dissimilarity) between units in a dataset or between observations in two distinct datasets.
gower.dist(data.x, data.y=data.x, rngs=NULL, KR.corr=TRUE, var.weights = NULL, robcb=NULL)
data.x |
A matrix or a data frame containing the variables that should be used in the computation of the distance. Columns of mode numeric will be considered as interval-scaled variables; columns of class factor or character will be considered as categorical nominal variables; columns of class ordered will be considered as categorical ordinal variables; columns of mode logical will be considered as binary asymmetric variables (see Details). Missing values (NA) are allowed. If only data.x is supplied, the distances between its rows are computed. |
data.y |
A numeric matrix or data frame with the same variables, of the same type, as those in data.x. |
rngs |
A vector with the ranges used to scale the numeric (interval-scaled) variables. Its length must be equal to the number of variables in data.x and data.y; for instance, for a variable "X1": rngs["X1"] <- max(data.x[,"X1"], data.y[,"X1"]) - min(data.x[,"X1"], data.y[,"X1"]). When rngs=NULL (default) the ranges are computed on the available data. |
KR.corr |
When KR.corr=TRUE (default) the Kaufman and Rousseeuw (1990) extension of the Gower's dissimilarity coefficient is used in handling ordered variables (see Details). |
var.weights |
By default (var.weights=NULL) all the variables receive the same weight (1) in computing the overall dissimilarity; alternatively, a vector of weights, one for each variable, can be provided (see Details). |
robcb |
By default is ( |
This function computes distances between records when variables of different types (categorical and continuous) have been observed. In order to handle the different types of variables, the Gower's dissimilarity coefficient (Gower, 1971) is used. By default (KR.corr=TRUE) the Kaufman and Rousseeuw (1990) extension of the Gower's dissimilarity coefficient is used.
The final dissimilarity between the ith and jth unit is obtained as a weighted sum of the dissimilarities computed on each variable:

$$d(i,j) = \frac{\sum_{k} \delta_{ijk}\, w_k\, d_{ijk}}{\sum_{k} \delta_{ijk}\, w_k}$$

In particular, $d_{ijk}$ represents the distance between the ith and jth unit computed considering the kth variable, while $w_k$ is the weight assigned to variable k (by default 1 for all the variables, unless different weights are provided by the user with the argument var.weights). The distance $d_{ijk}$ depends on the nature of the variable:

logical columns are considered as asymmetric binary variables; for such variables $d_{ijk}=0$ if $x_{ik} = x_{jk} = \mathrm{TRUE}$, 1 otherwise;

factor or character columns are considered as categorical nominal variables and $d_{ijk}=0$ if $x_{ik} = x_{jk}$, 1 otherwise;

numeric columns are considered as interval-scaled variables and

$$d_{ijk} = \frac{\left| x_{ik} - x_{jk} \right|}{R_k}$$

being $R_k$ the range of the kth variable. The range is the one supplied with the argument rngs (rngs[k]) or the one computed on the available data (when rngs=NULL);

ordered columns are considered as categorical ordinal variables and the values are substituted with the corresponding position indexes $r_{ik}$ in the factor levels. When KR.corr=FALSE these position indexes (that are different from the output of the R function rank) are transformed in the following manner:

$$z_{ik} = \frac{r_{ik} - 1}{\max\left( r_{ik} \right) - 1}$$

These new values, $z_{ik}$, are treated as observations of an interval-scaled variable.
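A small sketch of the transformation of an ordered factor described above (illustrative only, not the function's internal code):

set.seed(1234)
x4 <- ordered(cut(rnorm(10), -4:4, include.lowest = TRUE))
r  <- as.numeric(x4)            # position index in the factor levels
z  <- (r - 1) / (max(r) - 1)    # rescaled to [0, 1], then treated as interval-scaled
cbind(level = as.character(x4), r = r, z = round(z, 3))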
As far as the weight $\delta_{ijk}$ is concerned:

$\delta_{ijk}=0$ if $x_{ik}$ is NA or $x_{jk}$ is NA;

$\delta_{ijk}=0$ if the variable is asymmetric binary and $x_{ik} = x_{jk} = 0$ (or FALSE);

$\delta_{ijk}=1$ in all the other cases.

In practice, NAs and couples of cases with $x_{ik} = x_{jk} = 0$ on an asymmetric binary variable do not contribute to the distance computation.
A matrix object with the distances between the rows of data.x and those of data.y.
Marcello D'Orazio [email protected]
Gower, J. C. (1971), “A general coefficient of similarity and some of its properties”. Biometrics, 27, 623–637.
Kaufman, L. and Rousseeuw, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
x1 <- as.logical(rbinom(10,1,0.5))
x2 <- sample(letters, 10, replace=TRUE)
x3 <- rnorm(10)
x4 <- ordered(cut(x3, -4:4, include.lowest=TRUE))
xx <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

# matrix of distances between observations in xx
dx <- gower.dist(xx)
head(dx)

# matrix of distances between first obs. in xx
# and the remaining ones
gower.dist(data.x=xx[1:6,], data.y=xx[7:10,], var.weights = c(1,2,5,2))
This function harmonizes the marginal or the joint distribution of a set of variables observed independently in two sample surveys carried out on the same target population. The harmonization is carried out by calibrating the survey weights of the sample units in both the surveys, according to the procedure suggested by Renssen (1998).
harmonize.x(svy.A, svy.B, form.x, x.tot=NULL, cal.method="linear", ...)
svy.A |
A svydesign R object that stores the data collected in survey A and the main characteristics of the sampling design, as returned by the function svydesign in the package survey. |
svy.B |
A svydesign R object that stores the data collected in survey B and the main characteristics of the sampling design, as returned by the function svydesign in the package survey. |
form.x |
An R formula specifying which of the variables, common to both the surveys, have to be considered and how they have to be considered. When dealing with categorical variables, a formula such as ~x1+x2-1 harmonizes the marginal distributions of x1 and x2, while ~x1:x2-1 harmonizes their joint distribution (see Examples). Due to the calibration of the survey weights, it is preferable to work with categorical X variables. In some cases, the procedure may be successful when a single continuous variable is considered jointly with one or more categorical variables. When dealing with several continuous variables it may be preferable to categorize them. |
x.tot |
A vector or table with the known population totals for the X variables. A vector of totals of the terms in form.x (e.g. obtained via colSums on the corresponding model.matrix, see Examples) is required when cal.method="linear" or cal.method="raking"; a table is required when cal.method="poststratify". When x.tot=NULL (default) the totals are estimated by combining the estimates derived from the two surveys (see Details). |
cal.method |
A string that specifies how the calibration of the weights in svy.A and svy.B has to be performed. By default linear calibration is used (cal.method="linear"). Alternatively, it is possible to rake the original survey weights by specifying cal.method="raking", or to post-stratify them with cal.method="poststratify". |
... |
Further arguments that may be necessary for calibration or post-stratification, e.g. bounds for the calibrated weights. The number of iterations used in calibration can be modified too by using the argument maxit. See calibrate in the package survey for further details. |
This function harmonizes the totals of the X variables, observed in both survey A and survey B, to be equal to given known totals specified via x.tot. When these totals are not known (x.tot=NULL), they are estimated by combining the estimates derived from the two separate surveys. The harmonization is carried out according to a procedure suggested by Renssen (1998) based on the calibration of the survey weights (for details on calibration see Sarndal and Lundstrom, 2005). The procedure is particularly suited to deal with categorical X variables; in this case it permits to harmonize the joint or the marginal distributions of the categorical variables being considered. Note that an incomplete crossing of the X variables can be considered, i.e. harmonisation with respect to the joint distribution of some of the X variables and the marginal distribution of the others (see example (3) below).
The calibration procedure may not produce the final result due to convergence problems. In this case an error message appears. In order to reach convergence it may be necessary to launch the procedure with less constraints (i.e a reduced number of population totals) by joining adjacent categories or by discarding some variables.
In some limited cases, it could be possible to consider both categorical and continuous variables. In this situation it may happen that calibration is not successful. In order to reach convergence it may be necessary to categorize the continuous X variables.
Post-stratification is a special case of calibration; all the weights of the units in a given post-stratum are modified so as to reproduce the known total for that post-stratum. Post-stratification avoids problems of convergence but, on the other hand, it may produce final weights with a higher variability than those derived from the calibration.
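When linear calibration yields negative or extreme weights, a possible workaround (a sketch using the objects built in the Examples below; bounds is simply passed through the ... argument to calibrate in the survey package) is to bound the ratio between the new and the original weights, or to switch to raking:

# bounded linear calibration: the ratios new/old weights are kept in [0.3, 3]
out.hz.b <- harmonize.x(svy.A = svy.qA, svy.B = svy.qB, x.tot = tot.m,
                        form.x = ~Eth + Sex + Age - 1,
                        cal.method = "linear", bounds = c(0.3, 3))
summary(out.hz.b$weights.A)   # no negative weights expected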
An R list with the results of the calibration procedures carried out on survey A and survey B, respectively. In particular, the following components will be provided:
cal.A |
The survey object svy.A after the calibration of the weights; in particular, the weights are calibrated so that the totals of the X variables (as specified in form.x) equal the given or estimated population totals. |
cal.B |
The survey object svy.B after the calibration of the weights. |
weights.A |
The new calibrated weights associated to the units in svy.A. |
weights.B |
The new calibrated weights associated to the units in svy.B. |
call |
Stores the call to this function with the values specified for the various arguments. |
Marcello D'Orazio [email protected]
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
Renssen, R.H. (1998) “Use of Statistical Matching Techniques in Calibration Estimation”. Survey Methodology, 24, pp. 171–183.
Sarndal, C.E. and Lundstrom, S. (2005) Estimation in Surveys with Nonresponse. Wiley, Chichester.
comb.samples, calibrate, svydesign, postStratify
data(quine, package="MASS") # loads quine from MASS
str(quine)

# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(7654)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, c("Eth","Sex","Age","Lrn")]
quine.B <- quine[-lab.A, c("Eth","Sex","Age","Days")]

# create svydesign objects
require(survey)
quine.A$f <- 70/nrow(quine) # sampling fraction
quine.B$f <- (nrow(quine)-70)/nrow(quine)
svy.qA <- svydesign(~1, fpc=~f, data=quine.A)
svy.qB <- svydesign(~1, fpc=~f, data=quine.B)

#------------------------------------------------------
# example (1)
# Harmonization of the distr. of Sex vs. Age
# using poststratification

# (1.a) known population totals
# the population totals are computed on the full data frame
tot.sex.age <- xtabs(~Sex+Age, data=quine)
tot.sex.age

out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, form.x=~Sex+Age,
                      x.tot=tot.sex.age, cal.method="poststratify")
tot.A <- xtabs(out.hz$weights.A~Sex+Age, data=quine.A)
tot.B <- xtabs(out.hz$weights.B~Sex+Age, data=quine.B)
tot.sex.age-tot.A
tot.sex.age-tot.B

# (1.b) unknown population totals (x.tot=NULL)
# the population totals are estimated by combining totals from the
# two surveys
out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, form.x=~Sex+Age,
                      x.tot=NULL, cal.method="poststratify")
tot.A <- xtabs(out.hz$weights.A~Sex+Age, data=quine.A)
tot.B <- xtabs(out.hz$weights.B~Sex+Age, data=quine.B)
tot.A
tot.A-tot.B

#-----------------------------------------------------
# example (2)
# Harmonization wrt the marginal distribution
# of 'Eth', 'Sex' and 'Age'
# using linear calibration

# (2.a) vector of known population totals
# estimated from the full data set
# note the formula! only the marginal distributions of the
# variables are considered
tot.m <- colSums(model.matrix(~Eth+Sex+Age-1, data=quine))
tot.m

out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, x.tot=tot.m,
                      form.x=~Eth+Sex+Age-1, cal.method="linear")
summary(out.hz$weights.A) # check for negative weights
summary(out.hz$weights.B) # check for negative weights
tot.m
svytable(formula=~Eth, design=out.hz$cal.A)
svytable(formula=~Eth, design=out.hz$cal.B)
svytable(formula=~Sex, design=out.hz$cal.A)
svytable(formula=~Sex, design=out.hz$cal.B)

# Note: margins are equal but joint distributions are not!
svytable(formula=~Sex+Age, design=out.hz$cal.A)
svytable(formula=~Sex+Age, design=out.hz$cal.B)

# (2.b) vector of population totals unknown
out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, x.tot=NULL,
                      form.x=~Eth+Sex+Age-1, cal.method="linear")
svytable(formula=~Eth, design=out.hz$cal.A)
svytable(formula=~Eth, design=out.hz$cal.B)
svytable(formula=~Sex, design=out.hz$cal.A)
svytable(formula=~Sex, design=out.hz$cal.B)

#-----------------------------------------------------
# example (3)
# Harmonization wrt the joint distribution of 'Sex' vs. 'Age'
# and the marginal distribution of 'Eth'
# using raking

# vector of known population totals
# estimated from the full data set
# note the formula!
tot.m <- colSums(model.matrix(~Eth+(Sex:Age-1)-1, data=quine))
tot.m

out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, x.tot=tot.m,
                      form.x=~Eth+(Sex:Age)-1, cal.method="raking")
summary(out.hz$weights.A) # check for negative weights
summary(out.hz$weights.B) # check for negative weights
tot.m
svytable(formula=~Eth, design=out.hz$cal.A)
svytable(formula=~Eth, design=out.hz$cal.B)
svytable(formula=~Sex+Age, design=out.hz$cal.A)
svytable(formula=~Sex+Age, design=out.hz$cal.B)
This function computes the Mahalanobis distance among units in a dataset or between observations in two distinct datasets.
mahalanobis.dist(data.x, data.y=NULL, vc=NULL)
data.x |
A matrix or a data frame containing variables that should be used in the computation of the distance between units. Only continuous variables are allowed. Missing values ( When only |
data.y |
A numeric matrix or data frame with the same variables, of the same type, as those in |
vc |
Covariance matrix that should be used in distance computation. If it is not supplied ( |
The Mahalanobis distance between two observations x_a and x_b is calculated by means of:

d(a,b) = \sqrt{ (x_a - x_b)^T S^{-1} (x_a - x_b) }

The covariance matrix S is estimated from the available data when vc=NULL, otherwise the one supplied via the argument vc is used.
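For instance, the distance between the first two rows of iris can be checked by hand; the comparison below assumes that, with vc=NULL, the covariance matrix is estimated with var() on the same rows.

# by-hand Mahalanobis distance between rows 1 and 2 of iris (first 4 columns)
x <- unlist(iris[1, 1:4])
y <- unlist(iris[2, 1:4])
S <- var(iris[1:6, 1:4])                          # covariance estimated on rows 1-6
d.hand <- sqrt(t(x - y) %*% solve(S) %*% (x - y))
d.fun  <- mahalanobis.dist(iris[1:6, 1:4])[1, 2]  # should match if vc is estimated with var()
c(by.hand = as.numeric(d.hand), by.function = d.fun)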
A matrix object with distances among rows of data.x and those of data.y.
Marcello D'Orazio [email protected]
Mahalanobis, P C (1936) “On the generalised distance in statistics”. Proceedings of the National Institute of Sciences of India 2, pp. 49-55.
md1 <- mahalanobis.dist(iris[1:6,1:4])
md2 <- mahalanobis.dist(data.x=iris[1:6,1:4], data.y=iris[51:60, 1:4])

vv <- var(iris[,1:4])
md1a <- mahalanobis.dist(data.x=iris[1:6,1:4], vc=vv)
md2a <- mahalanobis.dist(data.x=iris[1:6,1:4], data.y=iris[51:60, 1:4], vc=vv)
This function computes the Maximum distance (or norm) between units in a dataset or between observations in two distinct datasets.
maximum.dist(data.x, data.y=data.x, rank=FALSE)
data.x |
A matrix or a data frame containing variables that should be used in the computation of the distance. Only continuous variables are allowed. Missing values ( When only |
data.y |
A numeric matrix or data frame with the same variables, of the same type, as those in |
rank |
Logical, when |
This function computes the distance also known as the minimax distance. In practice the distance between two records is the maximum of the absolute differences on the available variables:

d(i,j) = \max_{k=1,\ldots,K} \left| x_{ik} - x_{jk} \right|
When rank=TRUE
the original values are substituted by their ranks divided by the number of values plus one (following suggestion in Kovar et al. 1988).
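For instance, the distance between two rows of iris can be verified directly as the maximum absolute difference over the four variables:

# by-hand maximum (L-infinity) distance between rows 1 and 2 of iris
x <- unlist(iris[1, 1:4])
y <- unlist(iris[2, 1:4])
d.hand <- max(abs(x - y))
d.fun  <- maximum.dist(iris[1:10, 1:4])[1, 2]
c(by.hand = d.hand, by.function = d.fun)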
A matrix object with distances between rows of data.x and those of data.y.
Marcello D'Orazio [email protected]
Kovar, J.G., MacMillan, J. and Whitridge, P. (1988). “Overview and strategy for the Generalized Edit and Imputation System”. Statistics Canada, Methodology Branch Working Paper No. BSMD 88-007 E/F.
See also: rank.
md1 <- maximum.dist(iris[1:10,1:4])
md2 <- maximum.dist(iris[1:10,1:4], rank=TRUE)

md3 <- maximum.dist(data.x=iris[1:50,1:4], data.y=iris[51:100,1:4])
md4 <- maximum.dist(data.x=iris[1:50,1:4], data.y=iris[51:100,1:4], rank=TRUE)
This function implements some mixed methods to perform statistical matching between two data sources.
mixed.mtc(data.rec, data.don, match.vars, y.rec, z.don, method="ML", rho.yz=NULL, micro=FALSE, constr.alg="Hungarian")
data.rec |
A matrix or data frame that plays the role of recipient in the statistical matching application. This data set must contain all variables (columns) that should be used in statistical matching, i.e. the variables called by the arguments |
data.don |
A matrix or data frame that plays the role of donor in the statistical matching application. This data set must contain all the numeric variables (columns) that should be used in statistical matching, i.e. the variables called by the arguments |
match.vars |
A character vector with the names of the common variables (the columns in both the data frames) to be used as matching variables (X). |
y.rec |
A character vector with the name of the target variable Y that is observed only for units in |
z.don |
A character vector with the name of the target variable Z that is observed only for units in |
method |
A character vector that identifies the method that should be used to estimate the parameters of the regression models: Y vs. X and Z vs. X. Maximum Likelihood method is used when |
rho.yz |
A numeric value representing a guess for the correlation between the Y ( By default ( |
micro |
Logical. When |
constr.alg |
A string that has to be specified when |
This function implements some mixed methods to perform statistical matching. A mixed method consists of two steps:
(1) adoption of a parametric model for the joint distribution of (X, Y, Z) and estimation of its parameters;
(2) derivation of a complete “synthetic” data set (recipient data set filled in with values for the Z variable) using a nonparametric approach.
In this case, as far as (1) is concerned, it is assumed that (X, Y, Z) follows a multivariate normal distribution. Please note that if some of the X variables are categorical, then they are recoded into dummies before starting with the estimation. In such a case, the assumption of multivariate normal distribution may be questionable.
The whole procedure is based on the imputation method known as predictive mean matching. The procedure consists of three steps:
step 1a) Regression step: the two linear regression models Y vs. X and Z vs. X are considered and their parameters are estimated.
step 1b) Computation of intermediate values. For the units in data.rec the following intermediate values are derived:

\tilde{z}_a = \hat{\alpha}_Z + \hat{\beta}_{ZX} \, x_a + e_a

for each a = 1, \ldots, n_A, being n_A the number of units in data.rec (rows of data.rec). Note that e_a is a random draw from the multivariate normal distribution with zero mean and estimated residual variance \hat{\sigma}_{Z|X}.
Similarly, for the units in data.don the following intermediate values are derived:

\tilde{y}_b = \hat{\alpha}_Y + \hat{\beta}_{YX} \, x_b + e_b

for each b = 1, \ldots, n_B, being n_B the number of units in data.don (rows of data.don). Here e_b is a random draw from the multivariate normal distribution with zero mean and estimated residual variance \hat{\sigma}_{Y|X}.
step 2) Matching step. For each observation (row) in data.rec a donor is chosen in data.don through a nearest neighbor constrained distance hot deck procedure. The distances are computed between (y_a, \tilde{z}_a) and (\tilde{y}_b, z_b) using the Mahalanobis distance.
For further details see Sections 2.5.1 and 3.6.1 in D'Orazio et al. (2006).
In step 1a) the parameters of the regression models can be estimated by means of the Maximum Likelihood method (method="ML") (see D'Orazio et al., 2006, pp. 19–23 and 73–75) or using the Moriarity and Scheuren (2001 and 2003) approach (method="MS") (see also D'Orazio et al., 2006, pp. 75–76). The two estimation methods are compared in D'Orazio et al. (2005).
When method="MS"
, if the value specified for the argument rho.yz
is not compatible with the other correlation coefficients estimated from the data, then it is substituted with the closest value compatible with the other estimated coefficients.
When micro=FALSE
only the estimation of the parameters is performed (step 1a). Otherwise,
(micro=TRUE
) the whole procedure is carried out.
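To fix ideas, the sketch below mimics steps 1a) and 1b) with ordinary least squares on the iris data; it only illustrates the predictive mean matching logic and is not the ML or MS estimation actually carried out by mixed.mtc.

# illustrative sketch of steps 1a)-1b) using OLS (not mixed.mtc's internal estimators)
set.seed(1)
A <- iris[1:50,   c("Sepal.Length", "Petal.Length", "Petal.Width")]  # recipient: Y and X
B <- iris[51:100, c("Sepal.Width",  "Petal.Length", "Petal.Width")]  # donor: Z and X
fit.y <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = A)  # Y vs. X
fit.z <- lm(Sepal.Width  ~ Petal.Length + Petal.Width, data = B)  # Z vs. X
# intermediate values: predicted Z for recipients and predicted Y for donors,
# each perturbed by a random draw from the estimated residual distribution
z.tilde <- predict(fit.z, newdata = A) + rnorm(nrow(A), 0, summary(fit.z)$sigma)
y.tilde <- predict(fit.y, newdata = B) + rnorm(nrow(B), 0, summary(fit.y)$sigma)
# step 2) would then match (y_a, z.tilde_a) with (y.tilde_b, z_b) via Mahalanobis distance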
A list with a varying number of components depending on the values of the arguments
method
and rho.yz
.
mu |
The estimated mean vector. |
vc |
The estimated variance–covariance matrix. |
cor |
The estimated correlation matrix. |
res.var |
A vector with estimates of the residual variances |
start.prho.yz |
It is the initial guess for the partial correlation coefficient |
rho.yz |
Returned in output only when |
phi |
When |
filled.rec |
The |
mtc.ids |
when |
dist.rd |
A vector with the distances between each recipient unit and the corresponding donor, returned only in case |
call |
How the function has been called. |
Marcello D'Orazio [email protected]
D'Orazio, M., Di Zio, M. and Scanu, M. (2005). “A comparison among different estimators of regression parameters on statistically matched files through an extensive simulation study”, Contributi, 2005/10, Istituto Nazionale di Statistica, Rome.
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
Hornik K. (2012). clue: Cluster ensembles. R package version 0.3-45. https://CRAN.R-project.org/package=clue.
Moriarity, C., and Scheuren, F. (2001). “Statistical matching: a paradigm for assessing the uncertainty in the procedure”. Journal of Official Statistics, 17, 407–422.
Moriarity, C., and Scheuren, F. (2003). “A note on Rubin's statistical matching using file concatenation with adjusted weights and multiple imputation”, Journal of Business and Economic Statistics, 21, 65–73.
# reproduce the statistical matching framework
# starting from the iris data.frame
suppressWarnings(RNGversion("3.5.0"))
set.seed(98765)
pos <- sample(1:150, 50, replace=FALSE)
ir.A <- iris[pos, c(1,3:5)]
ir.B <- iris[-pos, 2:5]

xx <- intersect(colnames(ir.A), colnames(ir.B))
xx # common variables

# ML estimation method under CIA ((rho_YZ|X=0));
# only parameter estimates (micro=FALSE)
# only continuous matching variables
xx.mtc <- c("Petal.Length", "Petal.Width")
mtc.1 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                   y.rec="Sepal.Length", z.don="Sepal.Width")

# estimated correlation matrix
mtc.1$cor

# ML estimation method under CIA ((rho_YZ|X=0));
# only parameter estimates (micro=FALSE)
# categorical variable 'Species' used as matching variable
xx.mtc <- xx
mtc.2 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                   y.rec="Sepal.Length", z.don="Sepal.Width")

# estimated correlation matrix
mtc.2$cor

# ML estimation method with partial correlation coefficient
# set equal to 0.5 (rho_YZ|X=0.5)
# only parameter estimates (micro=FALSE)
mtc.3 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                   y.rec="Sepal.Length", z.don="Sepal.Width",
                   rho.yz=0.5)

# estimated correlation matrix
mtc.3$cor

# ML estimation method with partial correlation coefficient
# set equal to 0.5 (rho_YZ|X=0.5)
# with imputation step (micro=TRUE)
mtc.4 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                   y.rec="Sepal.Length", z.don="Sepal.Width",
                   rho.yz=0.5, micro=TRUE, constr.alg="Hungarian")

# first rows of data.rec filled in with z
head(mtc.4$filled.rec)

# Moriarity and Scheuren estimation method under CIA;
# only with parameter estimates (micro=FALSE)
mtc.5 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                   y.rec="Sepal.Length", z.don="Sepal.Width",
                   method="MS")

# the starting value of rho.yz and the value used
# in computations
mtc.5$rho.yz

# estimated correlation matrix
mtc.5$cor

# Moriarity and Scheuren estimation method
# with correlation coefficient set equal to -0.15 (rho_YZ=-0.15)
# with imputation step (micro=TRUE)
mtc.6 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                   y.rec="Sepal.Length", z.don="Sepal.Width",
                   method="MS", rho.yz=-0.15,
                   micro=TRUE, constr.alg="lpSolve")

# the starting value of rho.yz and the value used
# in computations
mtc.6$rho.yz

# estimated correlation matrix
mtc.6$cor

# first rows of data.rec filled in with z imputed values
head(mtc.6$filled.rec)
This function implements the distance hot deck method to match the records of two data sources that share some variables.
NND.hotdeck(data.rec, data.don, match.vars, don.class=NULL, dist.fun="Manhattan", constrained=FALSE, constr.alg="Hungarian", k=1, keep.t=FALSE, ...)
data.rec |
A matrix or data frame that plays the role of recipient. This data frame must contain the variables (columns) that should be used, directly or indirectly, in the matching application (specified via Missing values ( |
data.don |
A matrix or data frame that plays the role of donor. The variables (columns) involved, directly or indirectly, in the computation of distance must be the same and of the same type as those in |
match.vars |
A character vector with the names of the matching variables (the columns in both the data frames) that have to be used to compute distances between records (rows) in |
don.class |
A character vector with the names of the variables (columns in both the data frames) that have to be used to identify the donation classes. In this case the computation of distances is limited to those units of The variables chosen for the creation of the donation classes should NOT contain missing values (NAs). When not specified (default), no donation classes are used. This choice may require more memory to store a larger distance matrix and a higher computational effort. |
dist.fun |
A string with the name of the distance function that has to be used. The following distances are allowed: “Manhattan” (aka “City block”; default), “Euclidean”, “Mahalanobis”,“exact” or “exact matching”, “Gower”, “minimax” or one of the distance functions available in the package proxy. Note that the distance is computed using the function When |
constrained |
Logical. When |
constr.alg |
A string that has to be specified when |
k |
The number of times that a unit in |
keep.t |
Logical, when donation classes are used by setting |
... |
Additional arguments that may be required by |
This function finds a donor record in data.don
for each record in data.rec
. In the unconstrained case, it searches for the closest donor record according to the chosen distance function. When for a given recipient record there are more donors available at the minimum distance, one of them is picked at random.
In the constrained case a donor can be used just a fixed number of times, as specified by the k argument, but the whole set of donors is chosen in order to minimize the overall matching distance. When k=1 the number of units (rows) in the donor data set has to be greater than or equal to the number of units in the recipient data set; when donation classes are used, this condition must be satisfied within each donation class. For further details on nearest neighbor distance hot deck refer to Chapter 2 in D'Orazio et al. (2006).
This function can also be used to impute missing values in a data set using the nearest neighbor distance hot deck. In this case data.rec
is the part of the initial data set that contains missing values on the target variable; on the contrary, data.don
is the part of the data set without missing values on it. See R code in the Examples for details.
Please note that only the “Gower” and “minimax” distance functions allow for the presence of missing values (NAs) in the variables used in computing distances. In both cases, when one of the two observations being compared shows an NA on a variable, that variable is excluded from the computation of the distance between them.
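The core idea of the unconstrained search can be sketched with base R (NND.hotdeck adds donation classes, tie handling, constrained matching, and the various distance functions):

# conceptual sketch: closest donor for each recipient using a Manhattan distance
rec <- iris[1:5,   c("Sepal.Length", "Sepal.Width")]
don <- iris[51:60, c("Sepal.Length", "Sepal.Width")]
dm  <- as.matrix(dist(rbind(rec, don), method = "manhattan"))
dm  <- dm[1:nrow(rec), (nrow(rec) + 1):(nrow(rec) + nrow(don))]  # recipients vs. donors
apply(dm, 1, which.min)   # position of the closest donor (ties would be broken at random)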
A R list with the following components:
mtc.ids |
A matrix with the same number of rows of |
dist.rd |
A vector with the distances between each recipient unit and the corresponding donor. |
noad |
When |
call |
How the function has been called. |
Marcello D'Orazio [email protected]
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
Hornik K. (2012). clue: Cluster ensembles. R package version 0.3-45. https://CRAN.R-project.org/package=clue.
Rodgers, W.L. (1984). “An evaluation of statistical matching”. Journal of Business and Economic Statistics, 2, 91–102.
Singh, A.C., Mantel, H., Kinack, M. and Rowe, G. (1993). “Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption”. Survey Methodology, 19, 59–79.
# create the classical matching framework
lab <- c(1:15, 51:65, 101:115)
iris.rec <- iris[lab, c(1:3,5)]    # recipient data.frame
iris.don <- iris[-lab, c(1:2,4:5)] # donor data.frame

# Now iris.rec and iris.don have the variables
# "Sepal.Length", "Sepal.Width" and "Species"
# in common.
# "Petal.Length" is available only in iris.rec
# "Petal.Width" is available only in iris.don

# Find the closest donors computing distance
# on "Sepal.Length" and "Sepal.Width"
# unconstrained case, Euclidean distance
out.NND.1 <- NND.hotdeck(data.rec=iris.rec, data.don=iris.don,
                         match.vars=c("Sepal.Length", "Sepal.Width"))

# create the synthetic data.set:
# fill in "Petal.Width" in iris.rec
fused.1 <- create.fused(data.rec=iris.rec, data.don=iris.don,
                        mtc.ids=out.NND.1$mtc.ids, z.vars="Petal.Width")
head(fused.1)

# Find the closest donors computing distance
# on "Sepal.Length", "Sepal.Width" and Species;
# unconstrained case, Gower's distance
out.NND.2 <- NND.hotdeck(data.rec=iris.rec, data.don=iris.don,
                         match.vars=c("Sepal.Length", "Sepal.Width", "Species"),
                         dist.fun="Gower")

# find the closest donors using "Species" to form donation classes
# and "Sepal.Length" and "Sepal.Width" to compute distance;
# unconstrained case.
out.NND.3 <- NND.hotdeck(data.rec=iris.rec, data.don=iris.don,
                         match.vars=c("Sepal.Length", "Sepal.Width"),
                         don.class="Species")

# find the donors using "Species" to form donation classes
# and "Sepal.Length" and "Sepal.Width" to compute distance;
# constrained case, "Hungarian" algorithm
library(clue)
out.NND.4 <- NND.hotdeck(data.rec=iris.rec, data.don=iris.don,
                         match.vars=c("Sepal.Length", "Sepal.Width"),
                         don.class="Species", constrained=TRUE,
                         constr.alg="Hungarian")

# Example of Imputation of missing values.
# Introducing missing values in iris
ir.mat <- iris
miss <- rbinom(nrow(iris), 1, 0.3)
ir.mat[miss==1, "Sepal.Length"] <- NA
iris.rec <- ir.mat[miss==1, -1]
iris.don <- ir.mat[miss==0, ]

# search for NND donors
imp.NND <- NND.hotdeck(data.rec=iris.rec, data.don=iris.don,
                       match.vars=c("Sepal.Width", "Petal.Length", "Petal.Width"),
                       don.class="Species")

# imputing missing values
iris.rec.imp <- create.fused(data.rec=iris.rec, data.don=iris.don,
                             mtc.ids=imp.NND$mtc.ids, z.vars="Sepal.Length")

# rebuild the imputed data.frame
final <- rbind(iris.rec.imp, iris.don)
head(final)
Estimation of cell counts in contingency tables by means of the pseudo-Bayes estimator.
pBayes(x, method="m.ind", const=NULL)
x |
A contingency table with observed cell counts. Typically the output of |
method |
The method for estimating the final cell frequencies. The following options are available:
|
const |
Numeric value, a user defined constant |
This function estimates the frequencies in a contingency table by using the pseudo-Bayes approach. In practice the estimator being considered is a weighted average of the input (observed) cell counts and a suitable prior guess, \gamma_h, for the cell probabilities p_h:

\tilde{p}_h = \frac{n}{n+K} \hat{p}_h + \frac{K}{n+K} \gamma_h

where K depends on the parameters of the Dirichlet prior distribution being considered (for major details see Chapter 12 in Bishop et al., 1974).

It is worth noting that with a constant prior guess \gamma_h = 1/c (h = 1, \ldots, c, being c the number of cells in the table), then K = 1 and in practice this corresponds to adding 1/c to each cell before estimation of the relative frequencies (method = "invcat"); K = c/2 when the constant 0.5 is added to each cell (method = "Jeffreys"); finally K = \sqrt{n} when the quantity \sqrt{n}/c is added to each cell (method = "minimax"). All these cases correspond to adding a flattening constant; the higher the value of K, the more the estimates will be shrunk towards \gamma_h (flattening).

When method = "m.ind" the prior guess \gamma_h is estimated under the hypothesis of mutual independence between the variables crossed in the initial contingency table x, supposed to be at least a two-way table. In this case the value of K is estimated via a data-driven approach by considering:

\hat{K} = \frac{1 - \sum_h \hat{p}_h^2}{\sum_h \left( \hat{\gamma}_h - \hat{p}_h \right)^2}

On the contrary, when method = "h.assoc" the prior guess \gamma_h is estimated under the hypothesis of homogeneous association between the variables crossed in the initial contingency table x.
Please note that when the input table is estimated from sample data where a weight is assigned to each unit, the weights should be used in estimating the input table, but it is suggested to rescale them so that their sum is equal to n, the sample size.
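For instance, the flattening performed with method = "Jeffreys" can be reproduced by hand (a rough check of the description above; pBayes() also returns the prior and additional information):

data(quine, package="MASS")
tab <- xtabs(~ Sex + Age, data = quine)
n <- sum(tab)
p.jeff <- (tab + 0.5) / (n + 0.5 * length(tab))  # add 0.5 to each cell, then normalize
round(p.jeff, 4)
# compare with pBayes(x = tab, method = "Jeffreys")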
A list
object with three components.
info |
A vector with the sample size |
prior |
A table having the same dimension as |
pseudoB |
A table with having the same dimension as |
Marcello D'Orazio [email protected]
Bishop Y.M.M., Fienberg, S.E., Holland, P.W. (1974) Discrete Multivariate Analysis: Theory and Practice. The Massachusetts Institute of Technology
data(samp.A, package="StatMatch") tab <- xtabs(~ area5 + urb + c.age + sex + edu7, data = samp.A) out.pb <- pBayes(x=tab, method="m.ind") out.pb$info out.pb <- pBayes(x=tab, method="h.assoc") out.pb$info out.pb <- pBayes(x=tab, method="Jeffreys") out.pb$info # usage of weights in estimating the input table n <- nrow(samp.A) r.w <- samp.A$ww / sum(samp.A$ww) * n # rescale weights to sum up to n tab.w <- xtabs(r.w ~ area5 + urb + c.age + sex + edu7, data = samp.A) out.pbw <- pBayes(x=tab.w, method="m.ind") out.pbw$info
data(samp.A, package="StatMatch") tab <- xtabs(~ area5 + urb + c.age + sex + edu7, data = samp.A) out.pb <- pBayes(x=tab, method="m.ind") out.pb$info out.pb <- pBayes(x=tab, method="h.assoc") out.pb$info out.pb <- pBayes(x=tab, method="Jeffreys") out.pb$info # usage of weights in estimating the input table n <- nrow(samp.A) r.w <- samp.A$ww / sum(samp.A$ww) * n # rescale weights to sum up to n tab.w <- xtabs(r.w ~ area5 + urb + c.age + sex + edu7, data = samp.A) out.pbw <- pBayes(x=tab.w, method="m.ind") out.pbw$info
to produce a basic graphical representation of the uncertainty bounds related to the contingency table of Y vs. Z.
plotBounds(outFB)
outFB |
the list provided in output from |
This function represents graphically the uncertainty bounds estimated by the function Frechet.bounds.cat
for each relative frequency in the contingency table of Y vs. Z.
The dotted line indicates the width of the bounds estimated without conditioning on the Xs (the size is reported in parentheses below the line). The full line indicates the width of the estimated bounds conditional on the Xs, i.e. the expected conditional Frechet bounds for the relative frequencies in the contingency table of Y vs. Z (size reported below the line, not in parentheses).
Note that when the X variables are not used, only the width of the unconditional bounds is drawn and its size is shown below the line.
The figure on top of the line indicates the estimated relative frequency under the assumption of independence between Y and Z conditional on one or more X variables (Conditional Independence Assumption, CIA; for details see the help pages of Frechet.bounds.cat); otherwise it corresponds to the estimated relative frequency under the assumption of independence between Y and Z.
The required graphical representation is drawn using standard graphics facilities.
Marcello D'Orazio [email protected]
# compute the tables required by Frechet.bounds.cat()
# freq.xA <- xtabs(~sex+c.age, data=samp.A)
# freq.xB <- xtabs(~sex+c.age, data=samp.B)
# freq.xy <- xtabs(~sex+c.age+c.neti, data=samp.A)
# freq.xz <- xtabs(~sex+c.age+labour5, data=samp.B)
#
# apply Frechet.bounds.cat()
# bounds.yz <- Frechet.bounds.cat(tab.x=freq.xA+freq.xB, tab.xy=freq.xy,
#                                 tab.xz=freq.xz, print.f="data.frame")
#
# plot.bounds(bounds.yz)
Compares graphically the estimated distributions for the same continuous variable using data coming from two different data sources.
plotCont(data.A, data.B, xlab.A, xlab.B=NULL, w.A=NULL, w.B=NULL, type="density", ref=FALSE)
data.A |
A dataframe or matrix containing the variable of interest |
data.B |
A dataframe or matrix containing the variable of interest |
xlab.A |
Character string providing the name of the variable in |
xlab.B |
Character string providing the name of the variable in |
w.A |
Character string providing the name of the optional weighting variable in |
w.B |
Character string providing the name of the optional weighting variable in |
type |
A character string indicating the type of graphical representation that should be used to compare the estimated distributions of |
ref |
Logical, indicating whether the distribution estimated from |
This function compares graphically the distribution of the same variable but estimated from data coming from two different data sources. The graphical comparison can be done in different ways.
When type="hist"
the continuous variable is categorized and the corresponding histograms, estimated from data.A
and data.B
, are compared. When present, the weights are used in estimating the relative frequencies. Note that the breaks to categorize the variable are decided according to the Freedman-Diaconis rule (nclass
) and, in this case, with ref=TRUE
the IQR is estimated solely on data.B
, whereas with ref=FALSE
it is estimated by joining the two data sources.
With type="density"
the density plots are drawn; when available the weights are used in the estimation of the density that are based on the histograms (as suggested by Bellhouse and Stafford, 1999). Whentype="ecdf"
the comparison relies on the empirical cumulative distribution function, that can be estimated considering the weights. Note that when ref=TRUE
the estimation of the density and the empirical cumulative distribution are guided by the data in data.B
.
The comparison is based on percentiles with type="qqplot"
and type="qqshift"
. In the first case, the function draws a scatterplot (red dots) of the estimated percentiles of xlab.A
vs. those of xlab.B
; the dashed line indicates the ideal situation of equality of percentiles (points lying on the line). When type="qqshift" the scatterplot refers to (percentiles.A - percentiles.B) vs. percentiles.B; in this case the points lying on the horizontal line passing through 0 indicate equality (difference equal to 0). Note that the number of estimated percentiles depends on the minimum of the two sample sizes. Only quartiles are calculated when min(n.A, n.B)<=50; quintiles are estimated when min(n.A, n.B)>50 and min(n.A, n.B)<=150; deciles are estimated when min(n.A, n.B)>150 and min(n.A, n.B)<=250; finally quantiles for probs=seq(from = 0.05, to = 0.95, by = 0.05)
are estimated when min(n.A, n.B)>250. When survey weights are available (indicated through w.A
and/or w.B
) they are used in estimating the quantiles by calling the function wtd.quantile
in the package Hmisc.
The required graphical representation is drawn using the ggplot2 facilities.
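The percentile comparison behind type="qqplot" can be sketched as follows (unweighted for simplicity; with w.A and/or w.B the quantiles would be estimated via wtd.quantile, as described above):

data(samp.A, samp.B, package="StatMatch")
pr <- seq(from = 0.05, to = 0.95, by = 0.05)
qA <- quantile(samp.A$age, probs = pr)
qB <- quantile(samp.B$age, probs = pr)
plot(qB, qA)           # estimated percentiles of age in A vs. those in B
abline(0, 1, lty = 2)  # points on the dashed line indicate equal percentiles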
Marcello D'Orazio [email protected]
Bellhouse D.R. and J. E. Stafford (1999). “Density Estimation from Complex Surveys”. Statistica Sinica, 9, 407–424.
# plotCont(data.A = samp.A, data.B = samp.B, xlab.A="age") # plotCont(data.A = samp.A, data.B = samp.B, xlab.A="age", w.A = "ww")
# plotCont(data.A = samp.A, data.B = samp.B, xlab.A="age") # plotCont(data.A = samp.A, data.B = samp.B, xlab.A="age", w.A = "ww")
Compares graphically the estimated distributions for the same categorical variable(s) using data coming from two different data sources.
plotTab(data.A, data.B, xlab.A, xlab.B=NULL, w.A=NULL, w.B=NULL)
data.A |
A dataframe or matrix containing the variable of interest |
data.B |
A dataframe or matrix containing the variable of interest |
xlab.A |
Character string providing the name(s) of one or more variables in |
xlab.B |
Character string providing the name(s) of one or more variables in |
w.A |
Character string providing the name of the optional weighting variable in |
w.B |
Character string providing the name of the optional weighting variable in |
This function compares graphically the (joint) distribution of the same variables but estimated from data coming from two different data sources. The graphical comparison is done using barcharts.
The required graphical representation is drawn using the ggplot2 facilities.
Marcello D'Orazio [email protected]
# plotTab(data.A = samp.A, data.B = samp.B, xlab.A="edu7", w.A = "ww") # plotTab(data.A = samp.A, data.B = samp.B, xlab.A=c("urb", "sex"), w.A = "ww", w.B="ww")
# plotTab(data.A = samp.A, data.B = samp.B, xlab.A="edu7", w.A = "ww") # plotTab(data.A = samp.A, data.B = samp.B, xlab.A=c("urb", "sex"), w.A = "ww", w.B="ww")
This function computes some association and Proportional Reduction in Error (PRE) measures between a categorical nominal variable and each of the other available predictors (which are also categorical variables).
pw.assoc(formula, data, weights=NULL, out.df=FALSE)
formula |
A formula of the type |
data |
The data frame which contains the variables called by |
weights |
The name of the variable in |
out.df |
Logical. If |
This function computes some association, PRE measures, AIC and BIC for each couple response-predictor that can be created starting from argument formula
. In particular, a two-way contingency table is built for each available X variable (X in rows and Y in columns); then the following measures are considered.
Cramer's V:

V = \sqrt{ \frac{\chi^2 / n}{\min(I-1, \, J-1)} }

where n is the sample size, I is the number of rows (categories of X) and J is the number of columns (categories of Y). Cramer's V ranges from 0 to 1.

Bias-corrected Cramer's V (V_{bc}) proposed by Bergsma (2013).

Mutual information:

I(X;Y) = \sum_{i,j} p_{ij} \log \left( \frac{p_{ij}}{p_{i+} \, p_{+j}} \right)

equal to 0 in case of independence but with infinite upper bound, i.e. 0 \leq I(X;Y) < \infty. In it, 0 \times \log 0 = 0 by convention.

A normalized version of I(X;Y), ranging from 0 (independence) to 1 and not affected by the number of categories (I and J):

I(X;Y)^* = \frac{I(X;Y)}{\min(H_X, H_Y)}

being H_X and H_Y the entropy of the variable X and Y, respectively.

Goodman-Kruskal \lambda(Y|X) (i.e. response conditional on the given predictor):

\lambda(Y|X) = \frac{\sum_{i} \max_{j} p_{ij} - \max_{j} p_{+j}}{1 - \max_{j} p_{+j}}

It ranges from 0 to 1, and denotes how much the knowledge of the row variable X (predictor) helps in reducing the prediction error of the values of the column variable Y (response).

Goodman-Kruskal \tau(Y|X):

\tau(Y|X) = \frac{\sum_{i,j} p_{ij}^2 / p_{i+} - \sum_{j} p_{+j}^2}{1 - \sum_{j} p_{+j}^2}

It takes values in the interval [0,1] and has the same PRE meaning as the lambda.

Theil's uncertainty coefficient:

U(Y|X) = \frac{H_Y - H_{Y|X}}{H_Y}

It takes values in the interval [0,1] and measures the reduction of uncertainty in the column variable Y due to knowing the row variable X. Note that the numerator of U(Y|X) is the mutual information I(X;Y).

It is worth noting that \lambda, \tau and U can be viewed as measures of the proportional reduction of the variance of the Y variable when passing from its marginal distribution to its conditional distribution given the predictor X, derived from the general expression (cf. Agresti, 2002, p. 56):

\frac{V(Y) - E\left[ V(Y|X) \right]}{V(Y)}

They differ in the way of measuring the variance; in fact there is no generally accepted definition of the variance for a categorical variable.

Finally, AIC (and BIC) is calculated, as suggested in Sakamoto and Akaike (1977). In particular, AIC(Y|X) = -2 \log L(Y|X) + 2K and BIC(Y|X) = -2 \log L(Y|X) + K \log(n), being K the number of parameters (conditional probabilities) to estimate. Note that the R package catdap provides functions to identify the best subset of predictors based on AIC.
Please note that the missing values are excluded from the tables and therefore excluded from the estimation of the various measures.
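For instance, Cramer's V for a single response–predictor pair can be checked by hand using the formula given above:

data(quine, package="MASS")
tab <- xtabs(~ Eth + Lrn, data = quine)            # X in rows, Y in columns
chi2 <- chisq.test(tab, correct = FALSE)$statistic
V <- sqrt((chi2 / sum(tab)) / (min(dim(tab)) - 1))
unname(V)
# compare with pw.assoc(Lrn ~ Eth, data = quine)$V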
When out.df=FALSE (default) a list object with the following components:
V |
A vector with the estimated Cramer's V for each couple response-predictor. |
bcV |
A vector with the estimated bias-corrected Cramer's V for each couple response-predictor. |
mi |
A vector with the estimated mutual information I(X;Y) for each couple response-predictor. |
norm.mi |
A vector with the normalized mutual information I(X;Y)* for each couple response-predictor. |
lambda |
A vector with the values of Goodman-Kruskal lambda(Y|X) for each couple response-predictor. |
tau |
A vector with the values of Goodman-Kruskal tau(Y|X) for each couple response-predictor. |
U |
A vector with the values of Theil's uncertainty coefficient U(Y|X) for each couple response-predictor. |
AIC |
A vector with the values of AIC(Y|X) for each couple response-predictor. |
BIC |
A vector with the values of BIC(Y|X) for each couple response-predictor. |
npar |
A vector with the number of parameters (conditional probabilities) estimated to calculate AIC and BIC for each couple response-predictor. |
When out.df=TRUE
the output will be a data.frame with a column for each measure.
Marcello D'Orazio [email protected]
Agresti A (2002) Categorical Data Analysis. Second Edition. Wiley, New York.
Bergsma W (2013) A bias-correction for Cramer's V and Tschuprow's T. Journal of the Korean Statistical Society, 42, 323–328.
The Institute of Statistical Mathematics (2018). catdap: Categorical Data Analysis Program Package. R package version 1.3.4. https://CRAN.R-project.org/package=catdap
Sakamoto Y and Akaike, H (1977) Analysis of Cross-Classified Data by AIC. Ann. Inst. Statist. Math., 30, 185-197.
data(quine, package="MASS") #loads quine from MASS str(quine) # how Lrn is response variable pw.assoc(Lrn~Age+Sex+Eth, data=quine) # usage of units' weights quine$ww <- runif(nrow(quine), 1,4) #random gen 1<=weights<=4 pw.assoc(Lrn~Age+Sex+Eth, data=quine, weights="ww")
data(quine, package="MASS") #loads quine from MASS str(quine) # how Lrn is response variable pw.assoc(Lrn~Age+Sex+Eth, data=quine) # usage of units' weights quine$ww <- runif(nrow(quine), 1,4) #random gen 1<=weights<=4 pw.assoc(Lrn~Age+Sex+Eth, data=quine, weights="ww")
This function implements a variant of the distance hot deck method. For each recipient record a subset of the closest donors is retained and then a donor is selected at random.
RANDwNND.hotdeck(data.rec, data.don, match.vars=NULL, don.class=NULL, dist.fun="Manhattan", cut.don="rot", k=NULL, weight.don=NULL, keep.t=FALSE, ...)
data.rec |
A numeric matrix or data frame that plays the role of recipient. This data frame must contain the variables (columns), specified via Missing values ( |
data.don |
A matrix or data frame that plays the role of donor. This data frame must contain the variables (columns), specified via |
match.vars |
A character vector with the names of the variables (the columns in both the data frames) that have to be used to compute distances between records (rows) in |
don.class |
A character vector with the names of the variables (columns in both the data frames) that have to be used to identify donation classes. In this case the computation of distances is limited to those units in When not specified (default), no donation classes are used. This may result in a heavy computational effort. |
dist.fun |
A string with the name of the distance function that has to be used. The following distances can be used: “Manhattan” (aka “City block”; default), “Euclidean”, “Mahalanobis”,“exact” or “exact matching”, “Gower”, “minimax”, “difference”, or one of the distance functions available in the package proxy. Note that the distances are computed using the function By setting When |
cut.don |
A character string that, jointly with the argument
|
k |
Depends on the |
weight.don |
A character string providing the name of the variable with the weights associated to the donor units in |
keep.t |
Logical, when donation classes are used by setting |
... |
Additional arguments that may be required by |
This function finds a donor record for each record in the recipient data set. The donor is chosen at random in the subset of available donors. This procedure is known as random hot deck (cf. Andridge and Little, 2010). In RANDwNND.hotdeck, the number of closest donors retained to form the subset is determined according to the criterion specified with the argument cut.don.
The selection of the donor among those in the subset is carried out with equal probability (weight.don=NULL) or with probability proportional to a weight associated with the donors, specified via the weight.don argument. This procedure is known as weighted random hot deck (cf. Andridge and Little, 2010).
The search for the subset of the closest donors can be sped up by using the Approximate Nearest Neighbor search as implemented in the function nn2 provided by the package RANN. Note that this search can be used in all the cases with the exception of cut.don="k.dist".
Note that the same donor can be used more than once.
This function can also be used to impute missing values in a data set. In this case data.rec
is the part of the initial data set that contains missing values; on the contrary, data.don
is the part of the data set without missing values. See R code in the Examples for details.
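The selection scheme can be sketched as follows for a single recipient (illustration only, with hypothetical donor values; RANDwNND.hotdeck handles donation classes, the different cut.don criteria and the ANN search internally):

set.seed(2)
rec.age <- 35                                   # matching variable of the recipient
don <- data.frame(age = c(30, 33, 36, 40, 55),  # hypothetical donor values
                  ww  = c(1, 2, 1, 3, 1))       # donor weights
k <- 3
sub <- order(abs(don$age - rec.age))[1:k]       # subset of the k closest donors
chosen <- sample(sub, 1, prob = don$ww[sub])    # weighted random draw within the subset
don[chosen, ]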
A R list with the following components:
mtc.ids |
A matrix with the same number of rows of |
sum.dist |
A matrix with summary statistics concerning the subset of the closest donors. The first three columns report the minimum, the maximum and the standard deviation of the distances among the recipient record and the donors in the subset of the closest donors, respectively. The 4th column reports the cutting distance, i.e. the value of the distance such that donors at a higher distance are discarded. The 5th column reports the distance between the recipient and the donor chosen at random in the subset of the donors. |
noad |
For each recipient unit, reports the number of donor records in the subset of closest donors. |
call |
How the function has been called. |
Marcello D'Orazio [email protected]
Andridge, R.R., and Little, R.J.A. (2010) “A Review of Hot Deck Imputation for Survey Non-response”. International Statistical Review, 78, 40–64.
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
Rodgers, W.L. (1984). “An evaluation of statistical matching”. Journal of Business and Economic Statistics, 2, 91–102.
Singh, A.C., Mantel, H., Kinack, M. and Rowe, G. (1993). “Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption”. Survey Methodology, 19, 59–79.
data(samp.A, samp.B, package="StatMatch") #loads data sets ?samp.A ?samp.B # samp.A plays the role of recipient # samp.B plays the role of donor # find a donor in the in the same region ("area5") and with the same # gender ("sex"), then only the closest k=20 donors in terms of # "age" are cnsidered and one of them is picked up at random out.RND.1 <- RANDwNND.hotdeck(data.rec=samp.A, data.don=samp.B, don.class=c("area5", "sex"), dist.fun="ANN", match.vars="age", cut.don="exact", k=20) # create the synthetic (or fused) data.frame: # fill in "labour5" in A fused.1 <- create.fused(data.rec=samp.A, data.don=samp.B, mtc.ids=out.RND.1$mtc.ids, z.vars="labour5") head(fused.1) # weights ("ww") are used in selecting the donor in the final step out.RND.2 <- RANDwNND.hotdeck(data.rec=samp.A, data.don=samp.B, don.class=c("area5", "sex"), dist.fun="ANN", match.vars="age", cut.don="exact", k=20, weight.don="ww") fused.2 <- create.fused(data.rec=samp.A, data.don=samp.B, mtc.ids=out.RND.2$mtc.ids, z.vars="labour5") head(fused.2) # find a donor in the in the same region ("area5") and with the same # gender ("sex"), then only the donors with "age" <= to the age of the # recipient are considered, # then one of them is picked up at random out.RND.3 <- RANDwNND.hotdeck(data.rec=samp.A, data.don=samp.B, don.class=c("area5", "sex"), dist.fun="diff", match.vars="age", cut.don="<=") # create the synthetic (or fused) data.frame: # fill in "labour5" in A fused.3 <- create.fused(data.rec=samp.A, data.don=samp.B, mtc.ids=out.RND.3$mtc.ids, z.vars="labour5") head(fused.3) # Example of Imputation of missing values # introducing missing vales in iris ir.mat <- iris miss <- rbinom(nrow(iris), 1, 0.3) ir.mat[miss==1,"Sepal.Length"] <- NA iris.rec <- ir.mat[miss==1,-1] iris.don <- ir.mat[miss==0,] #search for NND donors imp.RND <- RANDwNND.hotdeck(data.rec=iris.rec, data.don=iris.don, match.vars=c("Sepal.Width","Petal.Length", "Petal.Width"), don.class="Species") # imputing missing values iris.rec.imp <- create.fused(data.rec=iris.rec, data.don=iris.don, mtc.ids=imp.RND$mtc.ids, z.vars="Sepal.Length") # rebuild the imputed data.frame final <- rbind(iris.rec.imp, iris.don) head(final)
data(samp.A, samp.B, package="StatMatch") #loads data sets ?samp.A ?samp.B # samp.A plays the role of recipient # samp.B plays the role of donor # find a donor in the in the same region ("area5") and with the same # gender ("sex"), then only the closest k=20 donors in terms of # "age" are cnsidered and one of them is picked up at random out.RND.1 <- RANDwNND.hotdeck(data.rec=samp.A, data.don=samp.B, don.class=c("area5", "sex"), dist.fun="ANN", match.vars="age", cut.don="exact", k=20) # create the synthetic (or fused) data.frame: # fill in "labour5" in A fused.1 <- create.fused(data.rec=samp.A, data.don=samp.B, mtc.ids=out.RND.1$mtc.ids, z.vars="labour5") head(fused.1) # weights ("ww") are used in selecting the donor in the final step out.RND.2 <- RANDwNND.hotdeck(data.rec=samp.A, data.don=samp.B, don.class=c("area5", "sex"), dist.fun="ANN", match.vars="age", cut.don="exact", k=20, weight.don="ww") fused.2 <- create.fused(data.rec=samp.A, data.don=samp.B, mtc.ids=out.RND.2$mtc.ids, z.vars="labour5") head(fused.2) # find a donor in the in the same region ("area5") and with the same # gender ("sex"), then only the donors with "age" <= to the age of the # recipient are considered, # then one of them is picked up at random out.RND.3 <- RANDwNND.hotdeck(data.rec=samp.A, data.don=samp.B, don.class=c("area5", "sex"), dist.fun="diff", match.vars="age", cut.don="<=") # create the synthetic (or fused) data.frame: # fill in "labour5" in A fused.3 <- create.fused(data.rec=samp.A, data.don=samp.B, mtc.ids=out.RND.3$mtc.ids, z.vars="labour5") head(fused.3) # Example of Imputation of missing values # introducing missing vales in iris ir.mat <- iris miss <- rbinom(nrow(iris), 1, 0.3) ir.mat[miss==1,"Sepal.Length"] <- NA iris.rec <- ir.mat[miss==1,-1] iris.don <- ir.mat[miss==0,] #search for NND donors imp.RND <- RANDwNND.hotdeck(data.rec=iris.rec, data.don=iris.don, match.vars=c("Sepal.Width","Petal.Length", "Petal.Width"), don.class="Species") # imputing missing values iris.rec.imp <- create.fused(data.rec=iris.rec, data.don=iris.don, mtc.ids=imp.RND$mtc.ids, z.vars="Sepal.Length") # rebuild the imputed data.frame final <- rbind(iris.rec.imp, iris.don) head(final)
This function implements the rank hot deck distance method. For each recipient record the closest donor is chosen by considering the distance between the percentage points of the empirical cumulative distribution function.
rankNND.hotdeck(data.rec, data.don, var.rec, var.don=var.rec, don.class=NULL, weight.rec=NULL, weight.don=NULL, constrained=FALSE, constr.alg="Hungarian", keep.t=FALSE)
data.rec |
A numeric matrix or data frame that plays the role of recipient. This data frame must contain the variable Missing values ( |
data.don |
A matrix or data frame that plays the role of donor. This data frame must contain the variable |
var.rec |
A character vector with the name of the variable in |
var.don |
A character vector with the name of the variable |
don.class |
A character vector with the names of the variables (columns in both the data frames) that identify donation classes. In each donation class the computation of percentage points is carried out independently. Then only distances between percentage points of the units in the same donation class are computed. The case of empty donation classes should be avoided. It would be preferable that the variables used to form donation classes are defined as When not specified (default), no donation classes are used. |
weight.rec |
Eventual name of the variable in |
weight.don |
Eventual name of the variable in |
constrained |
Logical. When |
constr.alg |
A string that has to be specified when |
keep.t |
Logical, when donation classes are used by setting |
This function finds a donor record for each record in the recipient data set. The chosen donor is the one at the closest distance in terms of empirical cumulative distribution (Singh et al., 1990). In practice the distance is computed by considering the estimated empirical cumulative distribution for the reference variable (var.rec and var.don) in data.rec and data.don. The empirical cumulative distribution function is estimated by:

\hat{F}(y) = \frac{1}{n} \sum_{i=1}^{n} I(y_i \leq y)

being I(y_i \leq y) = 1 if y_i \leq y and 0 otherwise.

In presence of weights, the empirical cumulative distribution function is estimated by:

\hat{F}(y) = \frac{\sum_{i=1}^{n} w_i \, I(y_i \leq y)}{\sum_{i=1}^{n} w_i}
In the unconstrained case, when there are more donors at the same distance, one of them is chosen at random.
When donation classes are introduced, the empirical cumulative distribution function is estimated independently in each donation class and the search for a donor is restricted to the donors in the same donation class as the recipient.
A donor can be chosen more than once. To avoid it set constrained=TRUE
. In such a case a donor can be chosen just once and the selection of the donors is carried out by solving a transportation problem with the objective of minimizing the overall matching distance (sum of the distances recipient-donor).
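The matching logic can be sketched as follows (illustration only, with hypothetical values; rankNND.hotdeck also handles weights, donation classes and the constrained case):

rec <- c(21, 34, 50)                 # values of var.rec in data.rec
don <- c(18, 25, 33, 47, 60)         # values of var.don in data.don
F.rec <- rank(rec) / length(rec)     # ecdf evaluated at the observed values
F.don <- rank(don) / length(don)
dd <- abs(outer(F.rec, F.don, "-"))  # distances between percentage points
apply(dd, 1, which.min)              # closest donor for each recipient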
A R list with the following components:
mtc.ids |
A matrix with the same number of rows of |
dist.rd |
A vector with the distances between each recipient unit and the corresponding donor. |
noad |
The number of available donors at the minimum distance for each recipient unit (only in unconstrained case) |
call |
How the function has been called. |
Marcello D'Orazio [email protected]
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
Singh, A.C., Mantel, H., Kinack, M. and Rowe, G. (1993). “Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption”. Survey Methodology, 19, 59–79.
data(samp.A, samp.B, package="StatMatch") #loads data sets # samp.A plays the role of recipient ?samp.A # samp.B plays the role of donor ?samp.B # rankNND.hotdeck() # donation classes formed using "area5" # ecdf conputed on "age" # UNCONSTRAINED case out.1 <- rankNND.hotdeck(data.rec=samp.A, data.don=samp.B, var.rec="age", don.class="area5") fused.1 <- create.fused(data.rec=samp.A, data.don=samp.B, mtc.ids=out.1$mtc.ids, z.vars="labour5") head(fused.1) # as before but ecdf estimated using weights # UNCONSTRAINED case out.2 <- rankNND.hotdeck(data.rec=samp.A, data.don=samp.B, var.rec="age", don.class="area5", weight.rec="ww", weight.don="ww") fused.2 <- create.fused(data.rec=samp.A, data.don=samp.B, mtc.ids=out.2$mtc.ids, z.vars="labour5") head(fused.2)
data(samp.A, samp.B, package="StatMatch") #loads data sets # samp.A plays the role of recipient ?samp.A # samp.B plays the role of donor ?samp.B # rankNND.hotdeck() # donation classes formed using "area5" # ecdf conputed on "age" # UNCONSTRAINED case out.1 <- rankNND.hotdeck(data.rec=samp.A, data.don=samp.B, var.rec="age", don.class="area5") fused.1 <- create.fused(data.rec=samp.A, data.don=samp.B, mtc.ids=out.1$mtc.ids, z.vars="labour5") head(fused.1) # as before but ecdf estimated using weights # UNCONSTRAINED case out.2 <- rankNND.hotdeck(data.rec=samp.A, data.don=samp.B, var.rec="age", don.class="area5", weight.rec="ww", weight.don="ww") fused.2 <- create.fused(data.rec=samp.A, data.don=samp.B, mtc.ids=out.2$mtc.ids, z.vars="labour5") head(fused.2)
This data set provides a limited number of variables observed at person level among those usually collected in the European Union Statistics on Income and Living Conditions Survey (EU–SILC). The data are artificially generated, just to show the application of the statistical matching techniques implemented in StatMatch.
data(samp.A)
A data frame with 3009 observations and the following variables:
unique unit identifier of the type aa.bb
where aa
identifies the Household while bb
identifies the household member
large geographic area, factor with 5 levels: ‘NE’=North–East, ‘NO’=North–West, ‘C’=center, ‘S’=South, ‘I’=islands
Degree of urbanization, factor with 3 levels: ‘1’=densely populated area, ‘2’=intermediate area, ‘3’=thinly populated area
integer, size of the household in which the person lives
factor with 5 levels derived from hsize
, where the 5th level ‘>=5’ denotes 5 and more people in the household
integer, the person's age
factor, age categorized in 5 classes
factor, the person's gender: ‘1’=male, ‘2’=female
factor, the person's marital status: ‘1’=never married, ‘2’=married, ‘3’=other (separated, widowed, divorced)
factor, the person's highest education level attained, follows the ISCED-97 categories: ‘0’=pre–primary education, ‘1’=primary education, ‘2’=lower secondary education, ‘3’= (upper) secondary education, ‘4’= post–secondary non tertiary education, ‘5’=first stage of tertiary education (not leading directly to an advanced research qualification), ‘6’=second stage of tertiary education (leading to an advanced research qualification)
numeric, the person's net income in Euros
factor, the person's net income categorized in 7 classes of thousand of Euros
numeric, the unit's weight
Please note that this data set is just for illustrative purposes. The units' weights do not reflect the Italian population size. The variables included are derived from those usually observed in the EU–SILC survey.
This data set is artificially created starting from the EU–SILC survey structure.
https://ec.europa.eu/eurostat/web/income-and-living-conditions/overview
This data set provides a limited number of variables observed at person level among those usually collected in the European Union Statistics on Income and Living Conditions Survey (EU–SILC). The data are artificially generated, just to show the application of the statistical matching techniques implemented in StatMatch.
data(samp.B)
A data frame with 6686 observations and the following variables:
unique unit identifier of the type aa.bb
where aa
identifies the Household while bb
identifies the household member
large geographic area, factor with 5 levels: ‘NE’=North–East, ‘NO’=North–West, ‘C’=center, ‘S’=South, ‘I’=islands
Degree of urbanization, factor with 3 levels: ‘1’=densely populated area, ‘2’=intermediate area, ‘3’=thinly populated area
integer, size of the household in which the person lives
factor with 5 levels derived from hsize
, where the 5th level ‘>=5’ denotes 5 and more people in the household
integer, the person's age
factor, age categorized in 5 classes
factor, the person's gender: ‘1’=male, ‘2’=female
factor, the person's marital status: ‘1’=never married, ‘2’=married, ‘3’=other (separated, widowed, divorced)
factor, the person's highest education level attained, follows the ISCED-97 categories: ‘0’=pre–primary education, ‘1’=primary education, ‘2’=lower secondary education, ‘3’= (upper) secondary education, ‘4’= post–secondary non tertiary education, ‘5’=first stage of tertiary education (not leading directly to an advanced research qualification), ‘6’=second stage of tertiary education (leading to an advanced research qualification)
the person's self–defined economic status, factor with 5 levels: ‘1’=employee working full–time or part–time, ‘2’=self–employed working full–time or part–time, ‘3’=unemployed, ‘4’=In retirement or in early retirement or has given up business, ‘5’=other status (student, permanent disabled, in compulsory military service, fulfilling domestic tasks, etc.)
numeric, the unit's weight
Please note that this data set is just for illustrative purposes. The units' weights do not reflect the Italian population size. The variables included are derived from those usually observed in the EU–SILC survey.
This data set is artificially created starting from the EU–SILC survey structure.
https://ec.europa.eu/eurostat/web/income-and-living-conditions/overview
This data set provides a limited number of variables, observed at the person level, among those usually collected in the European Union Statistics on Income and Living Conditions Survey (EU–SILC). The data are artificially generated, just to show the application of the statistical matching techniques implemented in StatMatch.
data(samp.C)
A data frame with 980 observations and the following variables:
unique unit identifier of the type aa.bb, where aa identifies the household while bb identifies the household member
large geographic area, factor with 5 levels: ‘NE’=North–East, ‘NO’=North–West, ‘C’=center, ‘S’=South, ‘I’=islands
Degree of urbanization, factor with 3 levels: ‘1’=densely populated area, ‘2’=intermediate area, ‘3’=thinly populated area
integer, size of the household in which the person lives
factor with 5 levels derived from hsize, where the 5th level ‘>=5’ denotes 5 or more people in the household
integer, the person's age
factor, age categorized in 5 classes
factor, the person's gender: ‘1’=male, ‘2’=female
factor, the person's marital status: ‘1’=never married, ‘2’=married, ‘3’=other (separated, widowed, divorced)
factor, the person's highest education level attained, following the ISCED-97 categories: ‘0’=pre–primary education, ‘1’=primary education, ‘2’=lower secondary education, ‘3’=(upper) secondary education, ‘4’=post–secondary non-tertiary education, ‘5’=first stage of tertiary education (not leading directly to an advanced research qualification), ‘6’=second stage of tertiary education (leading to an advanced research qualification)
the person's self–defined economic status, factor with 5 levels: ‘1’=employee working full–time or part–time, ‘2’=self–employed working full–time or part–time, ‘3’=unemployed, ‘4’=in retirement or early retirement or has given up business, ‘5’=other status (student, permanently disabled, in compulsory military service, fulfilling domestic tasks, etc.)
numeric, the person's net income in Euros
factor, the person's net income categorized in 7 classes of thousands of Euros
numeric, the unit's weight
Please note that this data set is just for illustrative purposes. The units' weights do not reflect the Italian population size. The variables included are derived from those usually observed in the EU–SILC survey.
This data set is artificially created starting from the EU–SILC survey structure.
https://ec.europa.eu/eurostat/web/income-and-living-conditions/overview
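As a minimal sketch (assuming only that the StatMatch package is installed), the three artificial samples can be loaded and inspected before matching; the column names shared by samp.A and samp.B indicate the candidate matching variables X:

library(StatMatch)

# load the three artificial EU-SILC samples shipped with the package
data(samp.A, samp.B, samp.C, package = "StatMatch")

str(samp.C)                               # the small sample observing both Y and Z
intersect(names(samp.A), names(samp.B))   # candidate matching variables X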
This function identifies the “best” subset of matching variables in terms of reduction of uncertainty when estimating the relative frequencies in the contingency table Y vs. Z. It implements the sequential procedure presented in D'Orazio et al. (2017, 2019), which avoids exploring all the possible combinations of the available X variables, as done in Fbwidths.by.x.
selMtc.by.unc(tab.x, tab.xy, tab.xz, corr.d=2, nA=NULL, nB=NULL, align.margins=FALSE)
tab.x |
An R table crossing the X variables. This table must be obtained by using the function |
tab.xy |
An R table of X vs. the Y variable. This table must be obtained by using the function A single categorical Y variable is allowed. At least three categorical variables should be considered as X variables (common variables). The same X variables in |
tab.xz |
An R table of X vs. the Z variable. This table must be obtained by using the function A single categorical Z variable is allowed. At least three categorical variables should be considered as X variables (common variables). The same X variables in |
corr.d |
Integer, indicates the penalty that should be introduced in estimating the uncertainty by means of the average width of cell bounds. When |
nA |
Integer, sample size of file A used to estimate |
nB |
Integer, sample size of file B used to estimate |
align.margins |
Logical (default |
This function follows the sequential procedure described in D'Orazio et al. (2017, 2019) to identify the combination of common variables most effective in reducing uncertainty when estimating the contingency table Y vs. Z. Initially, the available X variables are ordered according to the reduction of the average width of the uncertainty bounds achieved when conditioning on each of them. Then, at each step, one of the remaining X variables is added, and the procedure stops when the resulting table becomes too sparse (the precise stopping rule is given in the references below).
For further details see also Fbwidths.by.x.
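To illustrate the quantity driving this ordering, the hypothetical helper avg.bound.width below computes the average width of the Fréchet bounds for the relative frequencies in the table Y vs. Z when conditioning on a single categorical X variable. It ignores the penalty controlled by corr.d and the alignment of margins, and is only meant to convey the idea behind the ini.ord component; the actual computations are performed by Frechet.bounds.cat and Fbwidths.by.x.

avg.bound.width <- function(tab.x, tab.xy, tab.xz) {
  # tab.x: one-way table of a single X variable
  # tab.xy: two-way table X vs. Y; tab.xz: two-way table X vs. Z
  p.x  <- prop.table(tab.x)                # P(X=i)
  p.yx <- prop.table(tab.xy, margin = 1)   # P(Y=j | X=i)
  p.zx <- prop.table(tab.xz, margin = 1)   # P(Z=k | X=i)
  J <- ncol(p.yx); K <- ncol(p.zx)
  width <- matrix(NA, J, K)
  for (j in 1:J) {
    for (k in 1:K) {
      low <- sum(p.x * pmax(0, p.yx[, j] + p.zx[, k] - 1), na.rm = TRUE)  # lower Frechet bound
      up  <- sum(p.x * pmin(p.yx[, j], p.zx[, k]), na.rm = TRUE)          # upper Frechet bound
      width[j, k] <- up - low
    }
  }
  mean(width)   # average width over the J x K cells of Y vs. Z
}

# e.g., with the quine.A and quine.B data frames built in the Examples below,
# the average bound width when conditioning on Eth only would be:
# avg.bound.width(xtabs(~Eth, data=quine.A) + xtabs(~Eth, data=quine.B),
#                 xtabs(~Eth+Lrn, data=quine.A),
#                 xtabs(~Eth+c.Days, data=quine.B))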
A list with the main outcomes of the procedure.
ini.ord |
Average width of the uncertainty bounds when conditioning on each of the available X variables, one at a time. The variable most effective in reducing uncertainty comes first; this ordering determines the order in which the variables enter the sequential procedure. |
list.xs |
List with the various combinations of the matching variables being considered in each step. |
av.df |
Data frame with all the relevant information for each combination of X variables considered in the procedure. The last row corresponds to the combination of X variables identified as the best in reducing the average width of the uncertainty bounds (penalized or not, depending on the input argument |
Marcello D'Orazio [email protected]
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
D'Orazio, M., Di Zio, M. and Scanu, M. (2017). “The use of uncertainty to choose matching variables in statistical matching”. International Journal of Approximate Reasoning, 90, pp. 433-440.
D'Orazio, M., Di Zio, M. and Scanu, M. (2019). “Auxiliary variable selection in a statistical matching problem”. In Zhang, L.-C. and Chambers, R. L. (eds.) Analysis of Integrated Data, Chapman & Hall/CRC (forthcoming).
Fbwidths.by.x, Frechet.bounds.cat
data(quine, package="MASS")   # loads quine from MASS
str(quine)
quine$c.Days <- cut(quine$Days, c(-1, seq(0, 50, 10), 100))
table(quine$c.Days)

# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(1111)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, 1:4]
quine.B <- quine[-lab.A, c(1:3, 6)]

# compute the tables required by Fbwidths.by.x()
freq.xA <- xtabs(~Eth+Sex+Age, data=quine.A)
freq.xB <- xtabs(~Eth+Sex+Age, data=quine.B)
freq.xy <- xtabs(~Eth+Sex+Age+Lrn, data=quine.A)
freq.xz <- xtabs(~Eth+Sex+Age+c.Days, data=quine.B)

# apply Fbwidths.by.x()
bb <- Fbwidths.by.x(tab.x=freq.xA+freq.xB, tab.xy=freq.xy, tab.xz=freq.xz, warn=FALSE)
bb$sum.unc

# apply selMtc.by.unc() without penalty
cc <- selMtc.by.unc(tab.x=freq.xA+freq.xB, tab.xy=freq.xy, tab.xz=freq.xz, corr.d=0)
cc$ini.ord
cc$av.df