Title: | Different Methods for Stratified Sampling |
---|---|
Description: | Integrating a stratified structure in the population in a sampling design can considerably reduce the variance of the Horvitz-Thompson estimator. We propose in this package different methods to handle the selection of a balanced sample in stratified population. For more details see Raphaël Jauslin, Esther Eustache and Yves Tillé (2021) <doi:10.1007/s42081-021-00134-y>. The package propose also a method based on optimal transport and balanced sampling, see Raphaël Jauslin and Yves Tillé <doi:10.1016/j.jspi.2022.12.003>. |
Authors: | Raphael Jauslin [aut, cre] , Esther Eustache [aut], Bardia Panahbehagh [aut] , Yves Tillé [aut] |
Maintainer: | Raphael Jauslin <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.4.2 |
Built: | 2025-01-31 18:18:28 UTC |
Source: | CRAN |
Selects at the same time a well-spread and a balanced sample using a sequential implementation.
balseq(pik, Xaux, Xspread = NULL, rord = TRUE)
balseq(pik, Xaux, Xspread = NULL, rord = TRUE)
pik |
A vector of inclusion probabilities. |
Xaux |
A matrix of auxiliary variables. The matrix must contains the |
Xspread |
An optional matrix of spatial coordinates. |
rord |
A logical variable that specify if reordering is applied. Default TRUE. |
The function selects a sample using a sequential algorithm. At the same time, it respects the balancing equations (Xaux
) and select a well-spread sample (Xspread
). Algorithm uses a
linear program to satisfy the constraints.
Return the selected indices in 1,2,...,N
BalancedSampling:lcube
, sampling:samplecube
.
N <- 100 n <- 10 p <- 10 pik <- rep(n/N,N) Xaux <- array(rnorm(N*p,3,1),c(N,p)) Xspread <- cbind(runif(N),runif(N)) Xaux <- cbind(pik,Xaux) s <- balseq(pik,Xaux) colSums(Xaux[s,]/as.vector(pik[s])) colSums(Xaux) s <- balseq(pik,Xaux,Xspread) colSums(Xaux[s,]/as.vector(pik[s])) colSums(Xaux)
N <- 100 n <- 10 p <- 10 pik <- rep(n/N,N) Xaux <- array(rnorm(N*p,3,1),c(N,p)) Xspread <- cbind(runif(N),runif(N)) Xaux <- cbind(pik,Xaux) s <- balseq(pik,Xaux) colSums(Xaux[s,]/as.vector(pik[s])) colSums(Xaux) s <- balseq(pik,Xaux,Xspread) colSums(Xaux[s,]/as.vector(pik[s])) colSums(Xaux)
Select a stratified balanced sample. The function is similar to balancedstratification
of the package sampling.
balstrat(X, strata, pik, rand = TRUE, landing = TRUE)
balstrat(X, strata, pik, rand = TRUE, landing = TRUE)
X |
A matrix of size ( |
strata |
A vector of integers that specifies the stratification. |
pik |
A vector of inclusion probabilities. |
rand |
if TRUE, the data are randomly arranged. Default TRUE |
landing |
if TRUE, landing by linear programming otherwise supression of variables. Default TRUE |
The function implements the method proposed by Chauvet (2009). Firstly, a flight phase is performed on each strata. Secondly, a flight phase is applied on the whole population by aggregating the strata. Finally, a landing phase is applied by suppression of variables.
A vector with elements equal to 0 or 1. The value 1 indicates that the unit is selected while the value 0 is for rejected units.
Chauvet, G. (2009). Stratified balanced sampling. Survey Methodology, 35:115-119.
N <- 100 n <- 10 p <- 4 X <- matrix(rgamma(N*p,4,25),ncol = p) strata <- as.matrix(rep(1:n,each = N/n)) pik <- rep(n/N,N) s <- balstrat(X,strata,pik) t(X/pik)%*%s t(X/pik)%*%pik Xcat <- disj(strata) t(Xcat)%*%s t(Xcat)%*%pik
N <- 100 n <- 10 p <- 4 X <- matrix(rgamma(N*p,4,25),ncol = p) strata <- as.matrix(rep(1:n,each = N/n)) pik <- rep(n/N,N) s <- balstrat(X,strata,pik) t(X/pik)%*%s t(X/pik)%*%pik Xcat <- disj(strata) t(Xcat)%*%s t(Xcat)%*%pik
We propose a method based on the output of the function otmatch
. The method consists of choosing a unit from sample 2 to assign to a particular unit from sample 1.
bsmatch(object, Z2)
bsmatch(object, Z2)
object |
A data.frame, output from the function |
Z2 |
A optional matrix, if we want to add some variables for the stratified balanced sampling step. |
All details of the method can be seen in the manuscript: Raphaël Jauslin and Yves Tillé (2021) <arXiv:2105.08379>.
A list of two objects, A data.frame that contains the matching and the normalized weights. The first two columns of the data.frame contain the unit identities of the two samples. The third column are the final weights. All remaining columns are the matching variables.
#--- SET UP N=1000 p=5 X=array(rnorm(N*p),c(N,p)) EPS= 1e-9 n1=100 n2=200 s1=sampling::srswor(n1,N) s2=sampling::srswor(n2,N) id1=(1:N)[s1==1] id2=(1:N)[s2==1] d1=rep(N/n1,n1) d2=rep(N/n2,n2) X1=X[s1==1,] X2=X[s2==1,] #--- HARMONIZATION re=harmonize(X1,d1,id1,X2,d2,id2) w1=re$w1 w2=re$w2 #--- STATISTICAL MATCHING WITH OT object = otmatch(X1,id1,X2,id2,w1,w2) #--- BALANCED SAMPLING out <- bsmatch(object)
#--- SET UP N=1000 p=5 X=array(rnorm(N*p),c(N,p)) EPS= 1e-9 n1=100 n2=200 s1=sampling::srswor(n1,N) s2=sampling::srswor(n2,N) id1=(1:N)[s1==1] id2=(1:N)[s2==1] d1=rep(N/n1,n1) d2=rep(N/n2,n2) X1=X[s1==1,] X2=X[s2==1,] #--- HARMONIZATION re=harmonize(X1,d1,id1,X2,d2,id2) w1=re$w1 w2=re$w2 #--- STATISTICAL MATCHING WITH OT object = otmatch(X1,id1,X2,id2,w1,w2) #--- BALANCED SAMPLING out <- bsmatch(object)
This function is returning the number of unit that we need such that some conditions are fulfilled. See Details
c_bound(pik)
c_bound(pik)
pik |
vector of the inclusion probabilities. |
The function is computing the number of unit that we need to add such that the following conditions are fulfilled :
Let be the constant such that
, we must have that
An integer value, the number of units that we need to respect the constraints.
This function is returning the number of unit that we need such that some conditions are fulfilled. See Details
c_bound2(pik)
c_bound2(pik)
pik |
vector of the inclusion probabilities. |
The function is computing the number of unit that we need to add such that the following conditions are fulfilled :
Let be the constant such that
, we must have that
An integer value, the number of units that we need to respect the constraints.
This function is inspired by the function calib
of the package sampling. It computes the g-weights of the calibration estimator.
calibRaking(Xs, d, total, q, max_iter = 500L, tol = 1e-09)
calibRaking(Xs, d, total, q, max_iter = 500L, tol = 1e-09)
Xs |
A matrix of calibration variables. |
d |
A vector, the initial weights. |
total |
A vector that represents the initial weights. |
q |
A vector of positive value that account for heteroscedasticity. |
max_iter |
An integer, the maximum number of iterations. Default = 500. |
tol |
A scalar that represents the tolerance value for the algorithm. Default = 1e-9. |
More details on the different calibration methods can be read in Tillé Y. (2020).
A vector, the value of the g-weights.
Tillé, Y. (2020). Sampling and estimation from finite populations. Wiley, New York
Maximum entropy sampling with fixed sample size. It select a sample with fixed sample size with unequal inclusion probabilities.
cps(pik, eps = 1e-06)
cps(pik, eps = 1e-06)
pik |
A vector of inclusion probabilities. |
eps |
A scalar that specify the tolerance to transform a small value to the value 0. |
Conditional Poisson sampling, the sampling design maximizes the entropy:
where s is of fixed sample size. Indeed, Poisson sampling is known for maximizing the entropy but has no fixed sample size. The function selects a sample of fixed sample that maximizes entropy.
This function is a C++ implementation of UPmaxentropy
of the package sampling
. More details could be find in Tille (2006).
A vector with elements equal to 0 or 1. The value 1 indicates that the unit is selected while the value 0 is for rejected units.
Tille, Y. (2006), Sampling Algorithms, springer
pik <- inclprob(seq(100,1,length.out = 100),10) s <- cps(pik) # simulation with piktfrompik MUCH MORE FASTER s <- rep(0,length(pik)) SIM <- 100 pikt <- piktfrompik(pik) w <- pikt/(1-pikt) q <- qfromw(w,sum(pik)) for(i in 1 :SIM){ s <- s + sfromq(q) } p <- s/SIM # estimated inclusion probabilities t <- (p-pik)/sqrt(pik*(1-pik)/SIM) 1 - sum(t > 1.6449)/length(pik) # should be approximately equal to 0.95
pik <- inclprob(seq(100,1,length.out = 100),10) s <- cps(pik) # simulation with piktfrompik MUCH MORE FASTER s <- rep(0,length(pik)) SIM <- 100 pikt <- piktfrompik(pik) w <- pikt/(1-pikt) q <- qfromw(w,sum(pik)) for(i in 1 :SIM){ s <- s + sfromq(q) } p <- s/SIM # estimated inclusion probabilities t <- (p-pik)/sqrt(pik*(1-pik)/SIM) 1 - sum(t > 1.6449)/length(pik) # should be approximately equal to 0.95
This function transforms a categorical vector into a matrix of indicators.
disj(strata_input)
disj(strata_input)
strata_input |
A vector of integers that represents the categories. |
A matrix of indicators.
strata <- rep(c(1,2,3),each = 4) disj(strata)
strata <- rep(c(1,2,3),each = 4) disj(strata)
This function transforms a categorical matrix into a matrix of indicators variables.
disjMatrix(strata_input)
disjMatrix(strata_input)
strata_input |
A matrix of integers that contains categorical vector in each column. |
A matrix of indicators.
Xcat <- matrix(c(sample(x = 1:6, size = 100, replace = TRUE), sample(x = 1:6, size = 100, replace = TRUE), sample(x = 1:6, size = 100, replace = TRUE)),ncol = 3) disjMatrix(Xcat)
Xcat <- matrix(c(sample(x = 1:6, size = 100, replace = TRUE), sample(x = 1:6, size = 100, replace = TRUE), sample(x = 1:6, size = 100, replace = TRUE)),ncol = 3) disjMatrix(Xcat)
Calculate the squared Euclidean distance from unit to the other units.
distUnitk(X, k, tore, toreBound)
distUnitk(X, k, tore, toreBound)
X |
matrix representing the spatial coordinates. |
k |
the unit index to be used. |
tore |
an optional logical value, if we are considering the distance on a tore. See Details. |
toreBound |
an optional numeric value that specify the length of the tore. |
Let be the spatial coordinates of the unit
. The classical euclidean distance is given by
When the points are distributed on a regular grid of
.
It is possible to consider the units like they were placed on a tore. It can be illustrated by Pac-Man passing through the wall to get away from ghosts. Specifically,
we could consider two units on the same column (resp. row) that are on the opposite have a small distance,
The option toreBound
specify the length of the tore in the case of .
It is omitted if the
tore
option is equal to FALSE
.
a vector of length that contains the distances from the unit
to all other units.
dist
.
N <- 5 x <- seq(1,N,1) X <- as.matrix(expand.grid(x,x)) distUnitk(X,k = 2,tore = TRUE,toreBound = 5) distUnitk(X,k = 2,tore = FALSE,toreBound = -1)
N <- 5 x <- seq(1,N,1) X <- as.matrix(expand.grid(x,x)) distUnitk(X,k = 2,tore = TRUE,toreBound = 5) distUnitk(X,k = 2,tore = FALSE,toreBound = -1)
This function implements the method proposed by Hasler and Tillé (2014). It should be used for selecting a sample from highly stratified population.
fbs(X, strata, pik, rand = TRUE, landing = TRUE)
fbs(X, strata, pik, rand = TRUE, landing = TRUE)
X |
A matrix of size ( |
strata |
A vector of integers that specifies the stratification. |
pik |
A vector of inclusion probabilities. |
rand |
if TRUE, the data are randomly arranged. Default TRUE |
landing |
if TRUE, landing by linear programming otherwise supression of variables. Default TRUE |
Firstly a flight phase is performed on each strata. Secondly, several flight phases are applied by adding one by one the stratum. By doing this, some strata are managed on-the-fly. Finally, a landing phase is applied by suppression of the variables. If the number of element selected in each stratum is not equal to an integer, the function can be very time-consuming.
A vector with elements equal to 0 or 1. The value 1 indicates that the unit is selected while the value 0 is for rejected units.
Hasler, C. and Tillé Y. (2014). Fast balanced sampling for highly stratified population. Computational Statistics and Data Analysis, 74, 81-94
N <- 100 n <- 10 x1 <- rgamma(N,4,25) x2 <- rgamma(N,4,25) strata <- rep(1:n,each = N/n) pik <- rep(n/N,N) X <- as.matrix(cbind(matrix(c(x1,x2),ncol = 2))) s <- fbs(X,strata,pik) t(X/pik)%*%s t(X/pik)%*%pik Xcat <- disj(strata) t(Xcat)%*%s t(Xcat)%*%pik
N <- 100 n <- 10 x1 <- rgamma(N,4,25) x2 <- rgamma(N,4,25) strata <- rep(1:n,each = N/n) pik <- rep(n/N,N) X <- as.matrix(cbind(matrix(c(x1,x2),ncol = 2))) s <- fbs(X,strata,pik) t(X/pik)%*%s t(X/pik)%*%pik Xcat <- disj(strata) t(Xcat)%*%s t(Xcat)%*%pik
This function computes the flight phase of the cube method proposed by Chauvet and Tillé (2006).
ffphase(Xbal, prob, order = TRUE)
ffphase(Xbal, prob, order = TRUE)
Xbal |
A matrix of size ( |
prob |
A vector of inclusion probabilities. |
order |
if the units are reordered, Default TRUE. |
This function implements the method proposed by (Chauvet and Tillé 2006). It recursively transforms the vector of inclusion probabilities pik
into a
sample that respects the balancing equations. The algorithm stops when the null space of the sub-matrix is empty.
For more information see (Chauvet and Tillé 2006).
Updated vector of pik
that contains 0 and 1 for unit that are rejected or selected.
N <- 100 n <- 10 p <- 4 pik <- rep(n/N,N) X <- cbind(pik,matrix(rgamma(N*p,4,25),ncol= p)) pikstar <- ffphase(X,pik) t(X/pik)%*%pikstar t(X/pik)%*%pik pikstar
N <- 100 n <- 10 p <- 4 pik <- rep(n/N,N) X <- cbind(pik,matrix(rgamma(N*p,4,25),ncol= p)) pikstar <- ffphase(X,pik) t(X/pik)%*%pikstar t(X/pik)%*%pik pikstar
This function is computing a sub-matrix used in stratifiedcube
.
findB(X, strata)
findB(X, strata)
X |
A matrix of size ( |
strata |
A vector of integers that specifies the stratification. |
The function finds the smallest matrix B such that it contains only one more row than the number of columns. It consecutively adds the right number of rows depending on the number of categories that is added.
A list of two components. The sub-matrix of X
and the corresponding disjunctive matrix.
If we use the function cbind
to combine the two matrices, the resulting matrix has only one more row than the number of columns.
N <- 1000 strata <- sample(x = 1:6, size = N, replace = TRUE) p <- 3 X <- matrix(rnorm(N*p),ncol = 3) findB(X,strata)
N <- 1000 strata <- sample(x = 1:6, size = N, replace = TRUE) p <- 3 X <- matrix(rnorm(N*p),ncol = 3) findB(X,strata)
This function is inspired by the function calib
of the package sampling. It computes the g-weights of the calibration estimator.
gencalibRaking(Xs, Zs, d, total, q, max_iter = 500L, tol = 1e-09)
gencalibRaking(Xs, Zs, d, total, q, max_iter = 500L, tol = 1e-09)
Xs |
A matrix of calibration variables. |
Zs |
A matrix of instrumental variables with same dimension as Xs. |
d |
A vector, the initial weights. |
total |
A vector that represents the initial weights. |
q |
A vector of positive value that account for heteroscedasticity. |
max_iter |
An integer, the maximum number of iterations. Default = 500. |
tol |
A scalar that represents the tolerance value for the algorithm. Default = 1e-9. |
More details on the different calibration methods can be read in Tillé Y. (2020).
A vector, the value of the g-weights.
Tillé, Y. (2020). Sampling and estimation from finite populations. Wiley, New York
This function harmonize the two weight schemes such that the totals are equal.
harmonize(X1, d1, id1, X2, d2, id2, totals)
harmonize(X1, d1, id1, X2, d2, id2, totals)
X1 |
A matrix, the matching variables of sample 1. |
d1 |
A numeric vector that contains the initial weights of the sample 1. |
id1 |
A character or numeric vector that contains the labels of the units in sample 1. |
X2 |
A matrix, the matching variables of sample 2. |
d2 |
A numeric vector that contains the initial weights of the sample 1. |
id2 |
A character or numeric vector that contains the labels of the units in sample 2. |
totals |
An optional numeric vector that contains the totals of the matching variables. |
All details of the method can be seen in the manuscript: Raphaël Jauslin and Yves Tillé (2021) <arXiv:>.
A list of two vectors, the new weights of sample 1 (respectively new weights of sample 2).
#--- SET UP N = 1000 p = 5 X = array(rnorm(N*p),c(N,p)) n1=100 n2=200 s1 = sampling::srswor(n1,N) s2 = sampling::srswor(n2,N) id1=(1:N)[s1==1] id2=(1:N)[s2==1] d1=rep(N/n1,n1) d2=rep(N/n2,n2) X1 = X[s1==1,] X2 = X[s2==1,] re <- harmonize(X1,d1,id1,X2,d2,id2) colSums(re$w1*X1) colSums(re$w2*X2) #--- if the true totals is known totals <- c(N,colSums(X)) re <- harmonize(X1,d1,id1,X2,d2,id2,totals) colSums(re$w1*X1) colSums(re$w2*X2) colSums(X)
#--- SET UP N = 1000 p = 5 X = array(rnorm(N*p),c(N,p)) n1=100 n2=200 s1 = sampling::srswor(n1,N) s2 = sampling::srswor(n2,N) id1=(1:N)[s1==1] id2=(1:N)[s2==1] d1=rep(N/n1,n1) d2=rep(N/n2,n2) X1 = X[s1==1,] X2 = X[s2==1,] re <- harmonize(X1,d1,id1,X2,d2,id2) colSums(re$w1*X1) colSums(re$w2*X2) #--- if the true totals is known totals <- c(N,colSums(X)) re <- harmonize(X1,d1,id1,X2,d2,id2,totals) colSums(re$w1*X1) colSums(re$w2*X2) colSums(X)
Computes first-order inclusion probabilities from a vector of positive numbers.
inclprob(x, n)
inclprob(x, n)
x |
vector of positive numbers. |
n |
sample size (could be a positive real value). |
The function is implemented in C++ so that it can be used in the code of other C++ functions. The implementation is based on the function inclusionprobabilities
of the package sampling.
A vector of inclusion probabilities proportional to x
and such that the sum is equal to the value n
.
x <- runif(100) pik <- inclprob(x,70) sum(pik)
x <- runif(100) pik <- inclprob(x,70) sum(pik)
This function performs the landing phase of the cube method using suppression of variables proposed by Chauvet and Tillé (2006).
landingRM(A, pikstar, EPS = 1e-07)
landingRM(A, pikstar, EPS = 1e-07)
A |
matrix of auxiliary variables on which the sample must be balanced. (The matrix should be divided by the original inclusion probabilities.) |
pikstar |
vector of updated inclusion probabilities by the flight phase. See |
EPS |
epsilon value |
A vector with elements equal to 0 or 1. The value 1 indicates that the unit is selected while the value 0 is for rejected units.
Chauvet, G. and Tillé, Y. (2006). A fast algorithm of balanced sampling. Computational Statistics, 21/1:53-62
N <- 1000 n <- 10 p <- 4 pik <- rep(n/N,N) X <- cbind(pik,matrix(rgamma(N*p,4,25),ncol= p)) pikstar <- ffphase(X,pik) s <- landingRM(X/pik*pikstar,pikstar) sum(s) t(X/pik)%*%pik t(X/pik)%*%pikstar t(X/pik)%*%s
N <- 1000 n <- 10 p <- 4 pik <- rep(n/N,N) X <- cbind(pik,matrix(rgamma(N*p,4,25),ncol= p)) pikstar <- ffphase(X,pik) s <- landingRM(X/pik*pikstar,pikstar) sum(s) t(X/pik)%*%pik t(X/pik)%*%pikstar t(X/pik)%*%s
This function computes the matrix of the joint inclusion of the maximum entropy sampling with fixed sample size. It can handle unequal inclusion probabilities.
maxentpi2(pikr)
maxentpi2(pikr)
pikr |
A vector of inclusion probabilities. |
The sampling design maximizes the entropy design:
This function is a C++ implementation of UPMEpik2frompikw
.
More details could be find in Tille (2006).
A matrix, the joint inclusion probabilities.
Tille, Y. (2006), Sampling Algorithms, springer
This function returns the number of factor in each column of a categorical matrix.
ncat(Xcat_input)
ncat(Xcat_input)
Xcat_input |
A matrix of integers that contains categorical vector in each column. |
A row vector that contains the number of categories in each column.
Xcat <- matrix(c(sample(x = 1:6, size = 100, replace = TRUE), sample(x = 1:6, size = 100, replace = TRUE), sample(x = 1:6, size = 100, replace = TRUE)),ncol = 3) ncat(Xcat)
Xcat <- matrix(c(sample(x = 1:6, size = 100, replace = TRUE), sample(x = 1:6, size = 100, replace = TRUE), sample(x = 1:6, size = 100, replace = TRUE)),ncol = 3) ncat(Xcat)
This function implements the One-step One Decision method. It can be used using equal or unequal inclusion probabilities. The method is particularly useful for selecting a sample from a stream.
osod(pikr, full = FALSE)
osod(pikr, full = FALSE)
pikr |
A vector of inclusion probabilities. |
full |
An optional boolean value, to specify whether the full population (the entire vector) is used to update inclusion probabilities. Default: FALSE |
The method sequentially transforms the vector of inclusion probabilities into a sample whose values are equal to 0 or 1. The method respects the inclusion probabilities and can handle equal or unequal inclusion probabilities.
The method does not take into account the whole vector of inclusion probabilities by having a sequential implementation. This means that the method is fast and can be implemented in a flow.
A vector with elements equal to 0 or 1. The value 1 indicates that the unit is selected while the value 0 is for rejected units.
N <- 1000 n <- 100 pik <- inclprob(runif(N),n) s <- osod(pik)
N <- 1000 n <- 100 pik <- inclprob(runif(N),n) s <- osod(pik)
This function computes the statistical matching between two complex survey samples with weighting schemes. The function uses the function transport
of the package transport
.
otmatch( X1, id1, X2, id2, w1, w2, dist_method = "Euclidean", transport_method = "shortsimplex", EPS = 1e-09 )
otmatch( X1, id1, X2, id2, w1, w2, dist_method = "Euclidean", transport_method = "shortsimplex", EPS = 1e-09 )
X1 |
A matrix, the matching variables of sample 1. |
id1 |
A character or numeric vector that contains the labels of the units in sample 1. |
X2 |
A matrix, the matching variables of sample 2. |
id2 |
A character or numeric vector that contains the labels of the units in sample 1. |
w1 |
A numeric vector that contains the weights of the sample 1, harmonized by the function |
w2 |
A numeric vector that contains the weights of the sample 2, harmonized by the function |
dist_method |
A string that specified the distance used by the function |
transport_method |
A string that specified the distance used by the function |
EPS |
an numeric scalar to determine if the value is rounded to 0. |
All details of the method can be seen in : Raphaël Jauslin and Yves Tillé (2021) <arXiv:2105.08379>.
A data.frame that contains the matching. The first two columns contain the unit identities of the two samples. The third column is the final weights. All remaining columns are the matching variables.
#--- SET UP N=1000 p=5 X=array(rnorm(N*p),c(N,p)) EPS= 1e-9 n1=100 n2=200 s1 = sampling::srswor(n1,N) s2 = sampling::srswor(n2,N) id1=(1:N)[s1==1] id2=(1:N)[s2==1] d1=rep(N/n1,n1) d2=rep(N/n2,n2) X1=X[s1==1,] X2=X[s2==1,] #--- HARMONIZATION re=harmonize(X1,d1,id1,X2,d2,id2) w1=re$w1 w2=re$w2 #--- STATISTICAL MATCHING WITH OT object = otmatch(X1,id1,X2,id2,w1,w2) round(colSums(object$weight*object[,4:ncol(object)]),3) round(colSums(w1*X1),3) round(colSums(w2*X2),3)
#--- SET UP N=1000 p=5 X=array(rnorm(N*p),c(N,p)) EPS= 1e-9 n1=100 n2=200 s1 = sampling::srswor(n1,N) s2 = sampling::srswor(n2,N) id1=(1:N)[s1==1] id2=(1:N)[s2==1] d1=rep(N/n1,n1) d2=rep(N/n2,n2) X1=X[s1==1,] X2=X[s2==1,] #--- HARMONIZATION re=harmonize(X1,d1,id1,X2,d2,id2) w1=re$w1 w2=re$w2 #--- STATISTICAL MATCHING WITH OT object = otmatch(X1,id1,X2,id2,w1,w2) round(colSums(object$weight*object[,4:ncol(object)]),3) round(colSums(w1*X1),3) round(colSums(w2*X2),3)
This function finds the pik
from an initial q
.
pikfromq(q)
pikfromq(q)
q |
A matrix that is computed from the function |
More details could be find in Tille (2006).
A vector of inclusion probability computed from the matrix q
.
Tille, Y. (2006), Sampling Algorithms, springer
This function finds the pikt
from an initial pik
.
piktfrompik(pik, max_iter = 500L, tol = 1e-08)
piktfrompik(pik, max_iter = 500L, tol = 1e-08)
pik |
A vector of inclusion probabilities. The vector must contains only value that are not integer. |
max_iter |
An integer that specify the maximum iteration in the Newton-Raphson algorithm. Default |
tol |
A scalar that specify the tolerance convergence for the Newton-Raphson algorithm. Default |
The management of probabilities equal to 0 or 1 is done in the cps function.
pikt
is the vector of inclusion probabilities of a Poisson sampling with the right parameter. The vector is found by Newtwon-Raphson algorithm.
More details could be find in Tille (2006).
An updated vector of inclusion probability.
Tille, Y. (2006), Sampling Algorithms, springer
This function finds the matrix q
form a particular w
.
qfromw(w, n)
qfromw(w, n)
w |
A vector of weights. |
n |
An integer that is equal to the sum of the inclusion probabilities. |
w
is generally computed by the formula pik/(1-pik)
, where n
is equal to the sum of the vector pik
.
More details could be find in Tille (2006).
A matrix of size N
x n
, where N
is equal to the length of the vector w
.
Tille, Y. (2006), Sampling Algorithms, springer
This function finds sample s
form the matrix q
.
sfromq(q)
sfromq(q)
q |
A matrix that is computed from the function |
More details could be find in Tille (2006).
A vector with elements equal to 0 or 1. The value 1 indicates that the unit is selected while the value 0 is for rejected units.
Tille, Y. (2006), Sampling Algorithms, springer
This function implements a method for selecting a stratified sample. It really improves the performance of the function fbs
and balstrat
.
stratifiedcube( X, strata, pik, EPS = 1e-07, rand = TRUE, landing = TRUE, lp = TRUE )
stratifiedcube( X, strata, pik, EPS = 1e-07, rand = TRUE, landing = TRUE, lp = TRUE )
X |
A matrix of size ( |
strata |
A vector of integers that specifies the stratification.. |
pik |
A vector of inclusion probabilities. |
EPS |
epsilon value |
rand |
if TRUE, the data are randomly arranged. Default TRUE |
landing |
if FALSE, no landing phase is done. |
lp |
if TRUE, landing by linear programming otherwise supression of variables. Default TRUE |
The function is selecting a balanced sample very quickly even if the sum of inclusion probabilities within strata are non-integer. The function should be used in preference. Firstly, a flight phase is performed on each strata. Secondly, the function findB
is used to find a particular matrix to apply a flight phase by using the cube method proposed by Chauvet, G. and Tillé, Y. (2006). Finally, a landing phase is applied by suppression of variables.
A vector with elements equal to 0 or 1. The value 1 indicates that the unit is selected while the value 0 is for rejected units.
Chauvet, G. and Tillé, Y. (2006). A fast algorithm of balanced sampling. Computational Statistics, 21/1:53-62
fbs
, balstrat
, landingRM
, ffphase
# EXAMPLE WITH EQUAL INCLUSION PROBABILITES AND SUM IN EACH STRATA INTEGER N <- 100 n <- 10 p <- 4 X <- matrix(rgamma(N*p,4,25),ncol = p) strata <- rep(1:n,each = N/n) pik <- rep(n/N,N) s <- stratifiedcube(X,strata,pik) t(X/pik)%*%s t(X/pik)%*%pik Xcat <- disj(strata) t(Xcat)%*%s t(Xcat)%*%pik # EXAMPLE WITH UNEQUAL INCLUSION PROBABILITES AND SUM IN EACH STRATA INTEGER N <- 100 n <- 10 X <- cbind(rgamma(N,4,25),rbinom(N,20,0.1),rlnorm(N,9,0.1),runif(N)) colSums(X) strata <- rbinom(N,10,0.7) strata <- sampling::cleanstrata(strata) pik <- as.vector(sampling::inclusionprobastrata(strata,ceiling(table(strata)*0.10))) EPS = 1e-7 s <- stratifiedcube(X,strata,pik) test <- stratifiedcube(X,strata,pik,landing = FALSE) t(X/pik)%*%s t(X/pik)%*%test t(X/pik)%*%pik Xcat <- disj(strata) t(Xcat)%*%s t(Xcat)%*%test t(Xcat)%*%pik # EXAMPLE WITH UNEQUAL INCLUSION PROBABILITES AND SUM IN EACH STRATA NOT INTEGER set.seed(3) N <- 100 n <- 10 X <- cbind(rgamma(N,4,25),rbinom(N,20,0.1),rlnorm(N,9,0.1),runif(N)) strata <- rbinom(N,10,0.7) strata <- sampling::cleanstrata(strata) pik <- runif(N) EPS = 1e-7 tapply(pik,strata,sum) table(strata) s <- stratifiedcube(X,strata,pik,landing = TRUE) test <- stratifiedcube(X,strata,pik,landing = FALSE) t(X/pik)%*%s t(X/pik)%*%test t(X/pik)%*%pik Xcat <- disj(strata) t(Xcat)%*%s t(Xcat)%*%pik t(Xcat)%*%test
# EXAMPLE WITH EQUAL INCLUSION PROBABILITES AND SUM IN EACH STRATA INTEGER N <- 100 n <- 10 p <- 4 X <- matrix(rgamma(N*p,4,25),ncol = p) strata <- rep(1:n,each = N/n) pik <- rep(n/N,N) s <- stratifiedcube(X,strata,pik) t(X/pik)%*%s t(X/pik)%*%pik Xcat <- disj(strata) t(Xcat)%*%s t(Xcat)%*%pik # EXAMPLE WITH UNEQUAL INCLUSION PROBABILITES AND SUM IN EACH STRATA INTEGER N <- 100 n <- 10 X <- cbind(rgamma(N,4,25),rbinom(N,20,0.1),rlnorm(N,9,0.1),runif(N)) colSums(X) strata <- rbinom(N,10,0.7) strata <- sampling::cleanstrata(strata) pik <- as.vector(sampling::inclusionprobastrata(strata,ceiling(table(strata)*0.10))) EPS = 1e-7 s <- stratifiedcube(X,strata,pik) test <- stratifiedcube(X,strata,pik,landing = FALSE) t(X/pik)%*%s t(X/pik)%*%test t(X/pik)%*%pik Xcat <- disj(strata) t(Xcat)%*%s t(Xcat)%*%test t(Xcat)%*%pik # EXAMPLE WITH UNEQUAL INCLUSION PROBABILITES AND SUM IN EACH STRATA NOT INTEGER set.seed(3) N <- 100 n <- 10 X <- cbind(rgamma(N,4,25),rbinom(N,20,0.1),rlnorm(N,9,0.1),runif(N)) strata <- rbinom(N,10,0.7) strata <- sampling::cleanstrata(strata) pik <- runif(N) EPS = 1e-7 tapply(pik,strata,sum) table(strata) s <- stratifiedcube(X,strata,pik,landing = TRUE) test <- stratifiedcube(X,strata,pik,landing = FALSE) t(X/pik)%*%s t(X/pik)%*%test t(X/pik)%*%pik Xcat <- disj(strata) t(Xcat)%*%s t(Xcat)%*%pik t(Xcat)%*%test
This function implements a method to select a sample using the Deville's systmatic algorithm.
sys_deville(pik)
sys_deville(pik)
pik |
A vector of inclusion probabilities. |
Return the selected indices in 1,2,...,N
Deville, J.-C. (1998), Une nouvelle méthode de tirage à probabilité inégales. Technical Report 9804, Ensai, France.
Chauvet, G. (2012), On a characterization of ordered pivotal sampling, Bernoulli, 18(4):1320-1340
set.seed(1) pik <- c(0.2,0.5,0.3,0.4,0.9,0.8,0.5,0.4) sys_deville(pik)
set.seed(1) pik <- c(0.2,0.5,0.3,0.4,0.9,0.8,0.5,0.4) sys_deville(pik)
This function returns the second order inclusion probabilities of Deville's systematic.
sys_devillepi2(pik)
sys_devillepi2(pik)
pik |
A vector of inclusion probabilities |
A matrix of second order inclusion probabilities.
Deville, J.-C. (1998), Une nouvelle méthode de tirage à probabilité inégales. Technical Report 9804, Ensai, France.
Chauvet, G. (2012), On a characterization of ordered pivotal sampling, Bernoulli, 18(4):1320-1340
set.seed(1) N <- 30 n <- 4 pik <- as.vector(inclprob(runif(N),n)) PI <- sys_devillepi2(pik) #image(as(as.matrix(PI),"sparseMatrix")) pik <- c(0.2,0.5,0.3,0.4,0.9,0.8,0.5,0.4) PI <- sys_devillepi2(pik) #image(as(as.matrix(PI),"sparseMatrix"))
set.seed(1) N <- 30 n <- 4 pik <- as.vector(inclprob(runif(N),n)) PI <- sys_devillepi2(pik) #image(as(as.matrix(PI),"sparseMatrix")) pik <- c(0.2,0.5,0.3,0.4,0.9,0.8,0.5,0.4) PI <- sys_devillepi2(pik) #image(as(as.matrix(PI),"sparseMatrix"))
Variance approximation calculated as the conditional variance with respect to the balancing equations of a particular Poisson design. See Tillé (2020)
vApp(Xaux, pik, y)
vApp(Xaux, pik, y)
Xaux |
A matrix of size ( |
pik |
A vector of inclusion probabilities. The vector has the size |
y |
A variable of interest. |
Approximated variance of the Horvitz-Thompson estimator.
Tillé, Y. (2020), Sampling and Estimation from finite populations, Wiley,
N <- 100 n <- 40 x1 <- rgamma(N,4,25) x2 <- rgamma(N,4,25) pik <- rep(n/N,N) Xaux <- cbind(pik,as.matrix(matrix(c(x1,x2),ncol = 2))) Xspread <- cbind(runif(N),runif(N)) s <- balseq(pik,Xaux,Xspread) y <- Xaux%*%c(1,1,3) + rnorm(N,120) # variable of interest vEst(Xaux[s,],pik[s],y[s]) vDBS(Xaux[s,],Xspread[s,],pik[s],y[s]) vApp(Xaux,pik,y)
N <- 100 n <- 40 x1 <- rgamma(N,4,25) x2 <- rgamma(N,4,25) pik <- rep(n/N,N) Xaux <- cbind(pik,as.matrix(matrix(c(x1,x2),ncol = 2))) Xspread <- cbind(runif(N),runif(N)) s <- balseq(pik,Xaux,Xspread) y <- Xaux%*%c(1,1,3) + rnorm(N,120) # variable of interest vEst(Xaux[s,],pik[s],y[s]) vDBS(Xaux[s,],Xspread[s,],pik[s],y[s]) vApp(Xaux,pik,y)
Approximated variance for balanced sampling
varApp(X, strata, pik, y)
varApp(X, strata, pik, y)
X |
A matrix of size ( |
strata |
A vector of integers that represents the categories. |
pik |
A vector of inclusion probabilities. |
y |
A variable of interest. |
This function gives an approximation of the variance of the Horvitz-Thompson total estimator presented by Hasler and Tillé (2014).
a scalar, the value of the approximated variance.
Hasler, C. and Tillé, Y. (2014). Fast balanced sampling for highly stratified population. Computational Statistics and Data Analysis, 74:81-94.
N <- 1000 n <- 400 x1 <- rgamma(N,4,25) x2 <- rgamma(N,4,25) strata <- as.matrix(rep(1:40,each = 25)) # 25 strata Xcat <- disjMatrix(strata) pik <- rep(n/N,N) X <- as.matrix(matrix(c(x1,x2),ncol = 2)) s <- stratifiedcube(X,strata,pik) y <- 20*strata + rnorm(1000,120) # variable of interest # y_ht <- sum(y[which(s==1)]/pik[which(s == 1)]) # Horvitz-Thompson estimator # (sum(y_ht) - sum(y))^2 # true variance varEst(X,strata,pik,s,y) varApp(X,strata,pik,y)
N <- 1000 n <- 400 x1 <- rgamma(N,4,25) x2 <- rgamma(N,4,25) strata <- as.matrix(rep(1:40,each = 25)) # 25 strata Xcat <- disjMatrix(strata) pik <- rep(n/N,N) X <- as.matrix(matrix(c(x1,x2),ncol = 2)) s <- stratifiedcube(X,strata,pik) y <- 20*strata + rnorm(1000,120) # variable of interest # y_ht <- sum(y[which(s==1)]/pik[which(s == 1)]) # Horvitz-Thompson estimator # (sum(y_ht) - sum(y))^2 # true variance varEst(X,strata,pik,s,y) varApp(X,strata,pik,y)
Estimator of the approximated variance for balanced sampling
varEst(X, strata, pik, s, y)
varEst(X, strata, pik, s, y)
X |
A matrix of size ( |
strata |
A vector of integers that represents the categories. |
pik |
A vector of inclusion probabilities. |
s |
A sample (vector of 0 and 1, if rejected or selected). |
y |
A variable of interest. |
This function gives an estimator of the approximated variance of the Horvitz-Thompson total estimator presented by Hasler C. and Tillé Y. (2014).
a scalar, the value of the estimated variance.
Raphaël Jauslin [email protected]
Hasler, C. and Tillé, Y. (2014). Fast balanced sampling for highly stratified population. Computational Statistics and Data Analysis, 74:81-94.
N <- 1000 n <- 400 x1 <- rgamma(N,4,25) x2 <- rgamma(N,4,25) strata <- as.matrix(rep(1:40,each = 25)) # 25 strata Xcat <- disjMatrix(strata) pik <- rep(n/N,N) X <- as.matrix(matrix(c(x1,x2),ncol = 2)) s <- stratifiedcube(X,strata,pik) y <- 20*strata + rnorm(1000,120) # variable of interest # y_ht <- sum(y[which(s==1)]/pik[which(s == 1)]) # Horvitz-Thompson estimator # (sum(y_ht) - sum(y))^2 # true variance varEst(X,strata,pik,s,y) varApp(X,strata,pik,y)
N <- 1000 n <- 400 x1 <- rgamma(N,4,25) x2 <- rgamma(N,4,25) strata <- as.matrix(rep(1:40,each = 25)) # 25 strata Xcat <- disjMatrix(strata) pik <- rep(n/N,N) X <- as.matrix(matrix(c(x1,x2),ncol = 2)) s <- stratifiedcube(X,strata,pik) y <- 20*strata + rnorm(1000,120) # variable of interest # y_ht <- sum(y[which(s==1)]/pik[which(s == 1)]) # Horvitz-Thompson estimator # (sum(y_ht) - sum(y))^2 # true variance varEst(X,strata,pik,s,y) varApp(X,strata,pik,y)
Variance estimator for sample that are at the same time well spread and balanced on auxiliary variables. See Grafstr\"om and Till\'e (2013)
vDBS(Xauxs, Xspreads, piks, ys)
vDBS(Xauxs, Xspreads, piks, ys)
Xauxs |
A matrix of size ( |
Xspreads |
Matrix of spatial coordinates. |
piks |
A vector of inclusion probabilities. The vector has the size |
ys |
A variable of interest. The vector has the size |
Estimated variance of the horvitz-thompson estimator.
Grafstr\"om, A. and Till\'e, Y. (2013), Doubly balanced spatial sampling with spreading and restitution of auxiliary totals, Environmetrics, 14(2):120-131
N <- 100 n <- 40 x1 <- rgamma(N,4,25) x2 <- rgamma(N,4,25) pik <- rep(n/N,N) Xaux <- cbind(pik,as.matrix(matrix(c(x1,x2),ncol = 2))) Xspread <- cbind(runif(N),runif(N)) s <- balseq(pik,Xaux,Xspread) y <- Xaux%*%c(1,1,3) + rnorm(N,120) # variable of interest vEst(Xaux[s,],pik[s],y[s]) vDBS(Xaux[s,],Xspread[s,],pik[s],y[s]) vApp(Xaux,pik,y)
N <- 100 n <- 40 x1 <- rgamma(N,4,25) x2 <- rgamma(N,4,25) pik <- rep(n/N,N) Xaux <- cbind(pik,as.matrix(matrix(c(x1,x2),ncol = 2))) Xspread <- cbind(runif(N),runif(N)) s <- balseq(pik,Xaux,Xspread) y <- Xaux%*%c(1,1,3) + rnorm(N,120) # variable of interest vEst(Xaux[s,],pik[s],y[s]) vDBS(Xaux[s,],Xspread[s,],pik[s],y[s]) vApp(Xaux,pik,y)
Estimated variance approximation calculated as the conditional variance with respect to the balancing equations of a particular Poisson design. See Tillé (2020)
vEst(Xauxs, piks, ys)
vEst(Xauxs, piks, ys)
Xauxs |
A matrix of size ( |
piks |
A vector of inclusion probabilities. The vector has the size of the sample |
ys |
A variable of interest.The vector has the size |
Estimated variance of the horvitz-thompson estimator.
Tillé, Y. (2020), Sampling and Estimation from finite populations, Wiley,
N <- 100 n <- 40 x1 <- rgamma(N,4,25) x2 <- rgamma(N,4,25) pik <- rep(n/N,N) Xaux <- cbind(pik,as.matrix(matrix(c(x1,x2),ncol = 2))) Xspread <- cbind(runif(N),runif(N)) s <- balseq(pik,Xaux,Xspread) y <- Xaux%*%c(1,1,3) + rnorm(N,120) # variable of interest vEst(Xaux[s,],pik[s],y[s]) vDBS(Xaux[s,],Xspread[s,],pik[s],y[s]) vApp(Xaux,pik,y)
N <- 100 n <- 40 x1 <- rgamma(N,4,25) x2 <- rgamma(N,4,25) pik <- rep(n/N,N) Xaux <- cbind(pik,as.matrix(matrix(c(x1,x2),ncol = 2))) Xspread <- cbind(runif(N),runif(N)) s <- balseq(pik,Xaux,Xspread) y <- Xaux%*%c(1,1,3) + rnorm(N,120) # variable of interest vEst(Xaux[s,],pik[s],y[s]) vDBS(Xaux[s,],Xspread[s,],pik[s],y[s]) vApp(Xaux,pik,y)