Surveys have long been an important way of obtaining accurate
information from a finite population. For instance, governments need to
obtain descriptive statistics of the population for purposes of
evaluating and implementing their policies. For those concerned with
official statistics in the first third of the twentieth century, the
major issue was to establish a standard of acceptable practice. Neyman
(1934) created such a framework by introducing the role of randomization
methods in the sampling process. He advocated the use of the
randomization distribution induced by the sampling design to evaluate
the frequentist properties of alternative procedures. He also introduced
the idea of stratification with optimal sample size allocation and the
use of unequal selection probabilities. His work was recognized as the
cornerstone of design-based sample survey theory and inspired many other
authors. For example, Horvitz and Thompson (1952) proposed a general
theory of unequal probability sampling and the probability-weighted
estimation method now widely known as the Horvitz–Thompson estimator.
The design-based sample survey theory has been very appealing to official statistics agencies around the world. As pointed out by Skinner, Holt and Smith (1989), page 2, the main reason is that it is essentially distribution-free. Indeed, all advances in survey sampling theory from Neyman onwards have been strongly influenced by the descriptive use of survey sampling. The consequence has been a lack of theoretical development related to the analytic use of surveys, in particular for prediction purposes. In some specific situations, the design-based approach has proved to be inefficient, providing inadequate predictors. For instance, estimation in small domains and the presence of non-response cannot be dealt with by the design-based approach without some implicit assumptions, which is equivalent to assuming a model. Supporters of the design-based approach argue that model-based inference depends heavily on the model assumptions, which might not hold. On the other hand, design-based interval inference for target population parameters (usually totals or means) relies on the Central Limit Theorem, which cannot be applied in many practical situations where the sample size is not large enough and/or the independence assumptions about the random variables involved are not realistic.
Basu (1971) did not accept estimates of population quantities that depend on the sampling rule, such as the inclusion probabilities. He argued that this estimation procedure violates the likelihood principle, to which he adhered. Basu (1971) created the circus elephant example to show that the Horvitz–Thompson estimator could lead to inappropriate estimates, and proposed an alternative estimator. The question that arises is whether it is possible to reconcile the two approaches. In the superpopulation model context, Zacks (2002) showed that some design-based estimators can be recovered by using a general regression model approach. Little (2003) claims that: “careful model specification sensitive to the survey design can address the concerns with model specifications, and Bayesian statistics provide a coherent and unified treatment of descriptive and analytic survey inference”. He gave some illustrative examples of how standard design-based inference can be derived from the Bayesian perspective, using models with non-informative prior distributions.
In the Bayesian context, another appealing attempt to reconcile the
design-based and model-based approaches was made by Smouse (1984).
The method incorporates prior information into finite population
inference models by relying on Bayesian least squares techniques, and
requires only the specification of the first and second moments of the
distributions involved, which describe prior knowledge about the
structures present in the population. The approach offers an
alternative to randomization-based methods and sits midway between two
extreme views: the design-based procedures on the one hand, and those
based on superpopulation models on the other. O’Hagan (1985), in an
unpublished report,
presented the Bayes linear estimators in some specific sample survey
contexts and O’Hagan (1987) also derived Bayes linear estimators for
some randomized response models. O’Hagan (1985) dealt with several
population structures, such as stratification and clustering, by
making suitable assumptions about the first and second moments, and
showed how some common design-based estimators can be obtained as
particular cases of his more general approach. He also pointed out that
his estimates do not take informative sampling into account. He quoted
Scott (1977) and commented that inference under informative sampling
should be carried out through a full Bayesian analysis. An important reference on
informative sampling dealing with hierarchical models can be found in
Pfeffermann, Moura and Silva (2006).
The Bayesian approach has proved successful in many applications,
particularly when the data analysis is enhanced by expert judgment.
But while Bayesian models have many appealing
features, their application often involves the full specification of a
prior distribution for a large number of parameters. Goldstein and Wooff
(2007), section 1.2, argue that as the complexity of the problem
increases, our actual ability to fully specify the prior and/or the
sampling model in detail is impaired. They conclude that in such
situations, there is a need to develop methods based on partial belief
specification.
Hartigan (1969) proposed an estimation method, termed the Bayes
linear estimation approach, that requires only the specification of
first and second moments. The resulting estimators have the
property of minimizing posterior squared error loss among all estimators
that are linear in the data and can be thought of as
approximations to posterior means. The Bayes linear estimation
approach is fully employed in this article and is briefly described
below.
Let $y_s$ be the vector of observations and $\theta$ the parameter to be estimated. With each value of $\theta$ and each possible estimate $d$, belonging to the parameter space $\Theta$, we associate the quadratic loss function $L(\theta, d) = (\theta - d)'(\theta - d) = \operatorname{tr}\,(\theta - d)(\theta - d)'$. The main interest is to find the value of $d$ that minimizes $r(d) = E[L(\theta, d) \mid y_s]$, the conditional expected value of the quadratic loss function given the data.
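Without the linearity restriction introduced below, the minimizer of $r(d)$ is the posterior mean: writing $\mu = E(\theta \mid y_s)$ and expanding the loss around $\mu$, the cross term vanishes because $E(\theta - \mu \mid y_s) = 0$, leaving
$$
r(d) = E\left[(\theta - \mu)'(\theta - \mu) \mid y_s\right] + (\mu - d)'(\mu - d),
$$
which is minimized at $d = \mu$. Computing $\mu$, however, requires a fully specified joint distribution; this motivates restricting attention to linear estimators, for which a partial (two-moment) specification suffices.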
Suppose that the joint distribution of $\theta$ and $y_s$ is partially specified by only its first two moments:
$$
\begin{pmatrix} \theta \\ y_s \end{pmatrix}
\sim
\left[
\begin{pmatrix} a \\ f \end{pmatrix},\;
\begin{pmatrix} R & AQ \\ QA' & Q \end{pmatrix}
\right],
\tag{2.1}
$$
where $a$ and $f$, respectively, denote the mean vectors, and $R$, $AQ$ and $Q$ the blocks of the joint covariance matrix of $\theta$ and $y_s$.
The Bayes linear estimator (BLE) of $\theta$ is the value of $d$ that minimizes the expected value of this quadratic loss function within the class of all linear estimates of the form $d = d(y_s) = h + H y_s$, for some vector $h$ and matrix $H$. Thus, the BLE of $\theta$, $\hat{d}$, and its associated variance, $\hat{V}(\hat{d})$, are respectively given by:
$$
\hat{d} = a + A\,(y_s - f)
\qquad \text{and} \qquad
\hat{V}(\hat{d}) = R - A\,Q\,A'.
$$
It should be noted that the BLE depends only on the first and second moments of the joint distribution partially specified in (2.1).
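As a minimal numerical sketch of these formulas, consider the following Python/NumPy function; the name `bayes_linear_estimator` and the toy moments are hypothetical illustrations of ours, not functions or data from the package, and the inputs are assumed to respect the block structure of (2.1), i.e. $\operatorname{Var}(y_s) = Q$ and $\operatorname{Cov}(\theta, y_s) = AQ$.

```python
import numpy as np

def bayes_linear_estimator(a, f, R, A, Q, ys):
    """Bayes linear estimator from a two-moment specification:
    E(theta) = a, E(ys) = f, Var(theta) = R, Var(ys) = Q and
    Cov(theta, ys) = A @ Q, as in (2.1).
    Returns the BLE d_hat and its associated variance V_hat."""
    d_hat = a + A @ (np.asarray(ys, dtype=float) - f)  # d_hat = a + A (ys - f)
    V_hat = R - A @ Q @ A.T                            # V_hat = R - A Q A'
    return d_hat, V_hat

# Toy illustration with made-up moments: scalar theta, two observations.
a = np.array([10.0])                # prior mean of theta
R = np.array([[4.0]])               # prior variance of theta
f = np.array([9.0, 11.0])           # prior mean of ys
Q = np.array([[2.0, 0.5],
              [0.5, 2.0]])          # prior variance of ys
A = np.array([[0.3, 0.3]])          # so that Cov(theta, ys) = A @ Q
ys = np.array([12.0, 10.0])         # observed sample

d_hat, V_hat = bayes_linear_estimator(a, f, R, A, Q, ys)
print(d_hat)  # BLE of theta
print(V_hat)  # its associated variance
```

Note how the observed data enter only through the discrepancy $y_s - f$, scaled by $A$: with $A = 0$ (no prior covariance between $\theta$ and $y_s$) the BLE reduces to the prior mean $a$.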
By applying the Bayes linear approach to the general linear regression model for finite population prediction, the paper shows how to obtain some particular design-based estimators, such as those used under simple random sampling and stratified simple random sampling.
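To give the flavor of this recovery, consider one simple exchangeable specification (an illustrative special case of ours, not necessarily the paper’s exact formulation): $y_i \mid m \sim [m, \sigma^2]$ independently, with $m \sim [a, \tau^2]$, so that marginally $E(y_i) = a$, $\operatorname{Var}(y_i) = \sigma^2 + \tau^2$ and $\operatorname{Cov}(y_i, y_j) = \tau^2$ for $i \neq j$. The BLE of $m$ from a sample $s$ of size $n$ with mean $\bar{y}_s$ is
$$
\hat{m} = a + \frac{n\tau^2}{\sigma^2 + n\tau^2}\,(\bar{y}_s - a),
$$
and the BLE of each non-sampled unit equals $\hat{m}$, so the BLE of the population mean is $(n/N)\,\bar{y}_s + (1 - n/N)\,\hat{m}$. As $\tau^2 \to \infty$, that is, as prior knowledge about $m$ becomes vague, $\hat{m} \to \bar{y}_s$ and the BLE of the population mean reduces to $\bar{y}_s$, the usual design-based estimator under simple random sampling.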
The package contains the following main functions: