| Title: | Optimal Stratification of Univariate Populations |
|---|---|
| Description: | The stratification of univariate populations under stratified sampling designs is implemented according to Khan et al. (2002) <doi:10.1177/0008068320020518> and Khan et al. (2015) <doi:10.1080/02664763.2015.1018674> in this library. It determines the Optimum Strata Boundaries (OSB) and Optimum Sample Sizes (OSS) for the study variable, y, using the best-fit frequency distribution of a survey variable (if data is available) or a hypothetical distribution (if data is not available). The method formulates the problem of determining the OSB as mathematical programming problem which is solved by using a dynamic programming technique. If a dataset of the population is available to the surveyor, the method estimates its best-fit distribution and determines the OSB and OSS under Neyman allocation directly. When the dataset is not available, stratification is made based on the assumption that the values of the study variable, y, are available as hypothetical realizations of proxy values of y from recent surveys. Thus, it requires certain distributional assumptions about the study variable. At present, it handles stratification for the populations where the study variable follows a continuous distribution, namely, Pareto, Triangular, Right-triangular, Weibull, Gamma, Exponential, Uniform, Normal, Log-normal and Cauchy distributions. |
| Authors: | Karuna G. Reddy [aut, cre], M. G. M. Khan [aut] |
| Maintainer: | Karuna G. Reddy <[email protected]> |
| License: | GPL-2 |
| Version: | 1.0-5 |
| Built: | 2026-06-23 19:18:53 UTC |
| Source: | https://github.com/cran/stratifyR |
The Anaemia data comes from the Fiji National Nutritional Survey in 2004 on the "Micronutrient Status of Women in Fiji".
data(anaemia)data(anaemia)
A population data frame with 724 rows on some of the key components collected in the survey. The variables are:
HaemoglobinLevel of Haemoglobin (mmol/L)
IronLevel of Iron (ng/mL)
FolateLevel of Folate (mmol/L)
This survey was conducted by the Ministry of Heath in Fiji. More details can be found at: https://ghdx.healthdata.org/record/fiji-national-nutrition-survey-2004
data(anaemia) head(anaemia) Iron <- anaemia$Iron min(Iron); max(Iron) hist(anaemia$Haemoglobin) boxplot(anaemia$Folate)data(anaemia) head(anaemia) Iron <- anaemia$Iron min(Iron); max(Iron) hist(anaemia$Haemoglobin) boxplot(anaemia$Folate)
This function creates a matrix whose rows and columns depend on the range or distance of the data and the number of strata solutions that the user is seeking to compute. The matrix stores the objective function values calculated by the algorithm only to be accessed later for the purpose of presenting the OSB.
create.mat(my_env)create.mat(my_env)
my_env |
The environment my_env has various constants stored from earlier operations dealing with information on the data |
stores numerical quantities of the objective function and stores in the two matrices inside the my_env to be accessed by other functions
Karuna Reddy <[email protected]>
MGM Khan <[email protected]>
Allocate data To calculate the stratum sample sizes (nh) for a fixed sample size (n) directly based on the data
data.alloc(data, my_env)data.alloc(data, my_env)
data |
Input dataset. |
my_env |
Environment object. |
...
Uses OSB to compute stratum weights (Wh), sample variances (Vh from data), and Neyman allocations (nh). Builds the summary table once at the end.
This function uses the Dynamic Programming (DP) solution procedure in solving the objective function for the univariate stratification problem. It calculates the objective function values using the brute-force algorithm and stores those values in the matrices and keeps a copy in my_env so that a global minimum could be obtained.
data.optim(k, n, incf, minYk, maxYk, isFirstRun = TRUE, my_env)data.optim(k, n, incf, minYk, maxYk, isFirstRun = TRUE, my_env)
k |
A numeric: number of strata |
n |
A numeric: is the distance*1000 |
incf |
A numeric: 10e-3 when k=1 and 10e-5 for k>=2 |
minYk |
A numeric: index to access minimum elements in the matrix |
maxYk |
A numeric: index to access maximum elements in the matrix |
isFirstRun |
A boolean: TRUE/FALSE parameter |
my_env |
The environment my_env has various constants and calculations stored from earlier opeartions through various other functions |
returns the array filled with calculations of objective function values
Karuna Reddy <[email protected]>
M
GM Khan <[email protected]>
Used by the DP recurrences to evaluate the stratification objective at a specific remaining distance 'd' and first-stratum width 'y', given stratum cost 'c' and constants tucked in 'my_env.'
data.root(d, y, c, my_env)data.root(d, y, c, my_env)
d |
numeric: remaining distance/range on the (scaled) axis |
y |
numeric: first stratum width on the (scaled) axis |
c |
numeric: per stratum cost multiplier (Ch[k]) |
my_env |
environment: holds distribution name, parameters and scaling |
numeric: sqrt(objective) or -1 if branch is infeasible/invalid
Uses OSB to compute stratum weights (Wh), variances (Vh), Neyman allocations (nh), and related totals under a *given* population distribution. Integrations are cached per stratum; the output data.frame is built once at the end (faster).
distr.alloc(my_env)distr.alloc(my_env)
my_env |
Environment carrying all precomputed values and constants. |
Populates my_env$output, my_env$out, and totals (WhTot, NhTot, etc.)
This function uses the Dynamic Programming (DP) solution procedure in solving the objective function for the univariate stratification problem. It calculates the objective function values using the brute-force algorithm and stores those values in the matrices and keeps a copy in my_env so that a global minimum could be obtained.
distr.optim(k, n, incf, minYk, maxYk, isFirstRun = TRUE, my_env)distr.optim(k, n, incf, minYk, maxYk, isFirstRun = TRUE, my_env)
k |
A numeric: number of strata |
n |
A numeric: is the distance*1000 |
incf |
A numeric: 10e-3 when k=1 and 10e-5 for k>=2 |
minYk |
A numeric: index to access minimum elements in the matrix |
maxYk |
A numeric: index to access maximum elements in the matrix |
isFirstRun |
A boolean: TRUE/FALSE parameter |
my_env |
My environment my_env has various constants and calculations stored from earlier opeartions through various other functions |
returns the array filled with calculations of objective function values
Karuna Reddy <[email protected]>
MGM Khan <[email protected]>
Used by the DP recurrences in the "distribution-known" pathway. All distribution parameters are interpreted on the "scaled" axis (initval .. initval+dist), consistent with 'strata.distr()'.
distr.root(d, y, c, my_env)distr.root(d, y, c, my_env)
d |
numeric: remaining distance/range on the (scaled) axis |
y |
numeric: width of the first stratum on the (scaled) axis |
c |
numeric: per stratum cost multiplier (Ch[k]) |
my_env |
environment: holds distribution name, scaled parameters, and scaling constants ('initval', 'maxval', etc.) |
numeric: sqrt(objective) or -1 if branch is infeasible/invalid
Karuna Reddy <[email protected]>
MGM Khan <[email protected]>
This function calculates the value of the error according to the normally distributed variable using the idea presented in Abramowitz and Stegun (2011)
erf(x)erf(x)
x |
The data that is provided |
Gives the error for a normal variable
Karuna Reddy <[email protected]>
MGM Khan <[email protected]>
Determine best-fit distribution Identify the best-fit distribution for a univariate numeric vector
get.dist(data, my_env)get.dist(data, my_env)
data |
Input dataset. |
my_env |
Environment object. |
...
Fits several candidate distributions via MLE and selects the model with the lowest (finite) AIC. For strictly positive data, tail-sensitive families are also tried. If a triangular fit narrowly wins (Delta AIC <= 10 over a tail model, prefer the tail model (helps avoid spurious triangular wins on mildly skewed data).
Returns a list with: - distr: best model name (e.g., "gamma", "weibull", "triangle", ...) - params: named parameter vector for the best model - aic: named numeric vector of AICs for all attempted models (NA if failed) - fits_ok: named logical vector (TRUE where fit succeeded) - messages:named list of diagnostic messages per model (errors/warnings)
The hies data comes from the HIES survey conducted in Fiji in the year 2010. The data contains only two aspects of the survey.
data(hies)data(hies)
A data frame with 3566 observations on two of the major quantities collected in the survey. The variables are:
ExpenditureLevel of expenditure (FJD)
IncomeLevel of income (FJD)
This survey was conducted in 2010 by the Bureau of Statistics (FIBoS) - Fiji Government.
data(hies$Income) min(hies$Income); max(hies$Income) hist(hies$Income) boxplot(hies$Income)data(hies$Income) min(hies$Income); max(hies$Income) hist(hies$Income) boxplot(hies$Income)
The data contains the mathematics coursework marks, final examination marks and grades obtained by students in a first year mathematics course at The University level in the year 2010 in Fiji.
data(math)data(math)
A data frame with 353 observations which represent mathematics marks and grades for first year math students at university level. The variable is as follows:
cwCoursework marks in 1st year mathematics (0-50)
end_examThe end of semester examination marks maths (0-50)
final_marksFinal examination marks in maths, which is an addition of the cw and end_exam (0-100)
gradeThe grade obtained by the student based on the final marks
The data was obtained by a masters students at USP, Fiji.
data(math) min(math$final_marks); max(math$final_marks) hist(math$final_marks) boxplot(math$final_marks)data(math) min(math$final_marks); max(math$final_marks) hist(math$final_marks) boxplot(math$final_marks)
This function is called in data.optim() or distr.optim() which basically compares and returns the smaller value out of two given sets of values.
minim.val(val1, val2)minim.val(val1, val2)
val1 |
A numeric: the first value |
val2 |
A numeric: the second value |
returns the minimum value
Karuna Reddy <[email protected]>
MGM Khan <[email protected]>
This function calculates the value of the mode of the data that is provided
mode.val(x)mode.val(x)
x |
The data that is provided |
Gives the mode
Karuna Reddy <[email protected]>
MGM Khan <[email protected]>
This function re-calculates or re-allocate the stratum sample sizes (nh) after it has already been initially allocated via Neyman allocation. This is applied to resolve the problem of oversampling in one or more of the strata.
realloc(h, x, nh, Nh, nume, my_env)realloc(h, x, nh, Nh, nume, my_env)
h |
A numeric: the no. of strata |
x |
A vector: the osb that has been calculated |
nh |
A vector: the stratum sample sizes that have been initially calculated |
Nh |
A vector: the stratum population sizes that have been initially calculated |
nume |
A numeric: the numerator total |
my_env |
The environment my_env has various constants and outputs stored from earlier opeartions through various other functions |
calculates and presents the new re-allocate stratum samples
Karuna Reddy <[email protected]>
MGM Khan <[email protected]>
This function takes in the univariate population data
(argument data) and a fixed sample size (n)
to compute the optimum stratum boundaries (OSB) for a
given number of strata (L), optimum sample sizes (nh),
etc. directly from the data. The main idea used is from
Khan et al (2008) whereby the problem of stratification
is formulated into a Mathematical Programming Problem (MPP)
using the best-fit frequency distribution and its parameters
estimated from the data. This MPP is then solved for the
OSB using a Dynamic Programming (DP) solution procedure.
strata.data(data, h, n, cost = FALSE, ch = NULL)strata.data(data, h, n, cost = FALSE, ch = NULL)
data |
A vector of values of the survey variable y for which the OSB are determined |
h |
A numeric: denotes the number of strata to be created. |
n |
A numeric: denotes a fixed total sample size. |
cost |
A logical: has default cost=FALSE. If it is a stratum-cost problem, cost=TRUE, with which, one must provide the Ch parameter. |
ch |
A numeric: denotes a vector of stratum costs. When cost=FALSE, it has a default of NULL. |
strata.data returns Optimum Strata Boundaries (OSB),
stratum weights (Wh), stratum variances (Vh), Optimum Sample Sizes
(nh), stratum population sizes (Nh) and sampling fraction (fh).
Karuna Reddy <[email protected]>
MGM Khan <[email protected]>
strata.distr
data <- rweibull(1000, shape=2, scale = 1.5) hist(data) obj <- strata.data(data, h = 2, n=300) summary(obj) #------------------------------------------------------------- data(anaemia) Iron <- anaemia$Iron res <- strata.data(Iron, h = 2, n=350) summary(res) #------------------------------------------------------------- data(sugarcane) Production <- sugarcane$Production hist(Production) res <- strata.data(Production, h = 2, n=1000) summary(res) #------------------------------------------------------------- #The function be dynamically used to visualize the the strata boundaries, #for 2 strata, over the density (or observations) of the "mag" variable #from the quakes data (with purrr and ggplot2 packages loaded). output <- quakes %>% pluck("mag") %>% strata.data(h = 2, n = 300) quakes %>% ggplot(aes(x = mag)) + geom_density(fill = "blue", colour = "black", alpha = 0.3) + geom_vline(xintercept = output$OSB, linetype = "dotted", color = "red") #-------------------------------------------------------------data <- rweibull(1000, shape=2, scale = 1.5) hist(data) obj <- strata.data(data, h = 2, n=300) summary(obj) #------------------------------------------------------------- data(anaemia) Iron <- anaemia$Iron res <- strata.data(Iron, h = 2, n=350) summary(res) #------------------------------------------------------------- data(sugarcane) Production <- sugarcane$Production hist(Production) res <- strata.data(Production, h = 2, n=1000) summary(res) #------------------------------------------------------------- #The function be dynamically used to visualize the the strata boundaries, #for 2 strata, over the density (or observations) of the "mag" variable #from the quakes data (with purrr and ggplot2 packages loaded). output <- quakes %>% pluck("mag") %>% strata.data(h = 2, n = 300) quakes %>% ggplot(aes(x = mag)) + geom_density(fill = "blue", colour = "black", alpha = 0.3) + geom_vline(xintercept = output$OSB, linetype = "dotted", color = "red") #-------------------------------------------------------------
This function takes in the underlying hypothetical distribution and its parameter(s) of the survey variable, the initial value and the range of the population, the fixed sample size (n) and the fixed population size (N) to compute the optimum stratum boundaries (OSB) for a given number of strata (L), optimum sample sizes (nh), etc. The main idea used is from Khan et al. (2008) whereby the problem of stratification is fromulated into a Mathematical Programming Problem (MPP) using the best-fit frequency distribution and its parameter estimates of the data. This MPP is then solved for the optimal solutions using the Dynamic Programming (DP) solution procedure.
strata.distr( h, initval, dist, distr = c("pareto", "triangle", "rtriangle", "weibull", "gamma", "exp", "unif", "norm", "lnorm", "cauchy"), params = c(shape = 0, scale = 0, rate = 0, gamma = 0, location = 0, mean = 0, sd = 0, meanlog = 0, sdlog = 0, min = 0, max = 0, mode = 0), n, N, cost = FALSE, ch = NULL )strata.distr( h, initval, dist, distr = c("pareto", "triangle", "rtriangle", "weibull", "gamma", "exp", "unif", "norm", "lnorm", "cauchy"), params = c(shape = 0, scale = 0, rate = 0, gamma = 0, location = 0, mean = 0, sd = 0, meanlog = 0, sdlog = 0, min = 0, max = 0, mode = 0), n, N, cost = FALSE, ch = NULL )
h |
A numeric: denotes the number of strata to be created. |
initval |
A numeric: denotes the initial value of the population |
dist |
A numeric: denotes distance (or range) of the population |
distr |
A character: denotes the name of the distribution that characterizes the population |
params |
A list: contains the values of all parameters of the distribution |
n |
A numeric: denotes the fixed total sample size. |
N |
A numeric: denotes the fixed total population size. |
cost |
A logical: has default cost=FALSE. If it is a stratum-cost problem, cost=TRUE, with which one must provide the Ch parameter. |
ch |
A numeric: denotes a vector of stratum costs. |
strata.distr returns Optimum Strata Boundaries (OSB),
stratum weights (Wh), stratum costs (Ch), stratum variances (Vh), Optimum Sample Sizes
(nh), stratum population sizes (Nh).
Karuna Reddy <[email protected]>
MGM Khan <[email protected]>
strata.data
#Assume data has initial value of 1.5, distance of 33 and follows #weibull distribution with estimated parameters as shape=2.15 and scale=13.5 #To compute the OSB, OSS, etc. with fixed sample n=500, we use: res <- strata.distr(h=2, initval=1.5, dist=33, distr = "weibull", params = c(shape=2.15, scale=13.5), n=500, N=2000, cost=FALSE) summary(res) #------------------------------------------------------------- #Assume data has initial value of 1, distance of 10415 and follows #lnorm distribution with estimated parameters as meanlog=5.5 and sdlog=1.5 #To compute the OSB, OSS, etc. with fixed sample n=500, we use: res <- strata.distr(h=2, initval=1, dist=10415, distr = "lnorm", params = c(meanlog=5.5, sdlog=1.5), n=500, N=12000) summary(res) #------------------------------------------------------------- #Assume data has initial value of 2, distance of 68 and follows #gamma distribution with estimated parameters as shape=3.8 and rate=0.55 #To compute the OSB, OSS, etc. with fixed sample n=500, we use: res <- strata.distr(h=2, initval=0.65, dist=68, distr = "gamma", params = c(shape=3.8, rate=0.55), n=500, N=10000) summary(res) #------------------------------------------------------------- #The function be dynamically used to visualize the the strata boundaries, #for 2 strata, over the density (or observations) of the "mag" variable #from the quakes data (with purrr and ggplot2 packages loaded). res <- strata.distr(h=2, initval=4, dist=2.4, distr = "lnorm", params = c(meanlog=1.52681032, sdlog=0.08503554), n=300, N=1000) quakes %>% ggplot(aes(x = mag)) + geom_density(fill = "blue", colour = "black", alpha = 0.3) + geom_vline(xintercept = res$OSB, linetype = "dotted", color = "red") #-------------------------------------------------------------#Assume data has initial value of 1.5, distance of 33 and follows #weibull distribution with estimated parameters as shape=2.15 and scale=13.5 #To compute the OSB, OSS, etc. with fixed sample n=500, we use: res <- strata.distr(h=2, initval=1.5, dist=33, distr = "weibull", params = c(shape=2.15, scale=13.5), n=500, N=2000, cost=FALSE) summary(res) #------------------------------------------------------------- #Assume data has initial value of 1, distance of 10415 and follows #lnorm distribution with estimated parameters as meanlog=5.5 and sdlog=1.5 #To compute the OSB, OSS, etc. with fixed sample n=500, we use: res <- strata.distr(h=2, initval=1, dist=10415, distr = "lnorm", params = c(meanlog=5.5, sdlog=1.5), n=500, N=12000) summary(res) #------------------------------------------------------------- #Assume data has initial value of 2, distance of 68 and follows #gamma distribution with estimated parameters as shape=3.8 and rate=0.55 #To compute the OSB, OSS, etc. with fixed sample n=500, we use: res <- strata.distr(h=2, initval=0.65, dist=68, distr = "gamma", params = c(shape=3.8, rate=0.55), n=500, N=10000) summary(res) #------------------------------------------------------------- #The function be dynamically used to visualize the the strata boundaries, #for 2 strata, over the density (or observations) of the "mag" variable #from the quakes data (with purrr and ggplot2 packages loaded). res <- strata.distr(h=2, initval=4, dist=2.4, distr = "lnorm", params = c(meanlog=1.52681032, sdlog=0.08503554), n=300, N=1000) quakes %>% ggplot(aes(x = mag)) + geom_density(fill = "blue", colour = "black", alpha = 0.3) + geom_vline(xintercept = res$OSB, linetype = "dotted", color = "red") #-------------------------------------------------------------
The sugarcane data shows the disposition area (land area under cane) for individual sugarcane farms and their cane productions with the incomes/earnings for the year 2010 in Fiji.
data(sugarcane)data(sugarcane)
A data frame with 13894 observations corresponding to individual farms. The following are the variables:
DispAreaDisposition area (or land area under cane) (hactares)
ProductionThe amount of sugarcane produced in the farm (tonnes)
IncomeNet income or money paid to farmers) (in FJD)
This data was obtained from the Fiji Sugar Corporation in Fiji.
data(sugarcane$Production) head(sugarcane$Production) Production <- sugarcane$Production min(Production); max(Production) hist(Production) boxplot(Production)data(sugarcane$Production) head(sugarcane$Production) Production <- sugarcane$Production min(Production); max(Production) hist(Production) boxplot(Production)
Nicely formatted (and colored) summary for "strata" objects.
Prints an aligned ASCII table with optional crayon colors to the console,
or returns a knitr kable table in knitted documents.
## S3 method for class 'strata' summary(object, ...)## S3 method for class 'strata' summary(object, ...)
object |
A strata object. |
... |
Additional arguments. Nicely formatted (and colored) summary for "strata" objects Console: aligned ASCII table with crayon colors (if supported). HTML: kable + kableExtra with yellow TOTAL row (text only, no background). |
Invisibly returns a data frame containing the per-stratum results and
totals (columns: Strata, OSB, Wh, Vh,
WhSh, nh, Nh, fh), with a "Total" row
appended. In a knitr HTML context a kableExtra table object
is returned instead.