Type Package
Title Optimal Stratification of Univariate Populations
Version 1.0-3
Date 2021-11-23
Author Karuna G. Reddy, M.G.M. Khan
Maintainer Karuna G. Reddy <[email protected]>
Description This implements the stratification of univariate populations under stratified sampling designs using the method of Khan et al. (2002) https://doi.org/10.1177/0008068320020518, Khan et al. (2008) https://www150.statcan.gc.ca/n1/pub/12-001-x/2008002/article/10761-eng.pdf and Khan et al. (2015) https://doi.org/10.1080/02664763.2015.1018674. It determines the Optimum Strata Boundaries (OSB) and Optimum Sample Sizes (OSS) for the study variable, y, using the best-fit frequency distribution of a survey variable (if data is available) or a hypothetical distribution (if data is not available). The method formulates the problem of determining the OSB as mathematical programming problem which is solved by using a dynamic programming technique. If a dataset of the population is available to the surveyor, the method estimates its best-fit distribution and determines the OSB and OSS under Neyman allocation directly. When the dataset is not available, stratification is made based on the assumption that the values of the study variable, y, are available as hypothetical realizations of proxy values of y from recent surveys. Thus, it requires certain distributional assumptions about the study variable. At present, it handles stratification for the populations where the study variable follows a continuous distribution, namely, Pareto, Triangular, Right-triangular, Weibull, Gamma, Exponential, Uniform, Normal, Log-normal and Cauchy distributions. In this vignette, the two major functionalities in the package are applied to a number of real and simulated populations and to some hypothetical populations.
License GPL (>= 2)
LazyData yes
NeedsCompilation yes
Depends R (>= 3.4.4), MASS, fitdistrplus, actuar, triangle, mc2d, zipfR
NeedsCompilation yes
Repository CRAN
Date/Publication 2021-11-23 10:00:00
The main aim of stratification is to produce estimators with a small variance when a population characteristic (y) is under study. A simple method that can be used to create strata for this population, if y itself is the stratification variable. The ideal situation is that the distribution of such a study variable is known and the OSB can be determined by placing boundaries on the range of this distribution at suitable cut-points. This problem of determining the OSB, when both the estimation and stratification variables are the same, was first discussed by Dalenius (1950). He provided equations for the determination of stratum boundaries that minimize the variance of population estimates under optimal allocation. Dalenius (1957) futher proposed a solution to the problem by taking equal intervals of the cumulative square root of frequency scale of the stratification variable.
One of the many kinds of stratification methods that has been proposed in the literature is due to Bühler and Deutler (1975). They formulated the problem of determining OSB as an optimization problem and developed a computational technique to solve the problem by using Dynamic Programming (DP). A good review of this method can be found in M. G. M. Khan, Nand, and Ahmad (2008). The procedure is also applied by E. A. Khan, Khan, and Ahsan (2002), M. G. M. Khan, Khan, and Ahsan (2003), M. G. M. Khan, Sehar, and Ahsan (2005), M. G. M. Khan, Nand, and Ahmad (2008), M. G. M. Khan, Ahmad, and Khan (2009), M. G. M. Khan et al. (2015), M. G. M. Khan, Reddy, and Rao (2015) for determining OSB for many different distributions. With the known frequency function of the study variable, they considered the problem of finding OSB as an equivalent problem of determining Optimum Strata Width (OSW), which is formulated as a Mathematical Programming Problem (MPP) and solved by using DP technique. They applied the technique to several univariate populations where the study variables followed different probability distributions. In this package, a univariate stratification technique is developed, which is based on the probability distribution assumed by the stratification variable as discussed by the authors above.
As implemented in this package, there are many advantages of using the method. The real advantage of the stratifyR package is that when the dataset of the study variable is not available, which may occur in practice, the package is still able to construct OSB and OSS based on the distributional assumption on the data. Moreover, the population survey can be a costly and time-consuming affair, hence, this approach also has the advantage of determining OSB for the study variable that is not available prior to conducting the survey. This, of course, requires certain distributional or parametric assumptions on the study variables, which can easily be obtained from recent or past surveys.
Other advantages of the method are that it leads to substantial gains in the precision of the estimates over other available methods. Results reveal that the variances get smaller with increasing number of strata (L), and they get much smaller at a much faster rate than other available methods. Once the OSB have been determined, the Optimum Sample Sizes (OSS) can be easily calculated for each stratum using Neyman allocation (Neyman (1934)). The algorithm may be a little slow, however, it does provide very efficient results which is probably the most important objective of our survey estimation efforts.
The stratifyR package implements the DP technique (from various literature by Khan et al) as a stratification procedure for univariate populations, when the stratification variable follows a continuous probability distribution, namely: uniform, triangular, right-triangular, pareto, exponential, normal, log-normal, cauchy, weibull and gamma. The package is able to determine the OSB and OSS from real data and as well as hypothetical situations when the dataset is not available. The latter requires some assumptions about the distribution and its initial value, range, parameter values, fixed sample size, etc.
There are two major functions which basically solve the two types of stratification problems: strata.data() which carries out univariate stratification for those univariate populations where dataset is available and strata.distr() which performs stratification when dataset is not available prior to conducting the survey.
In the former case, data on the study variable, number of strata (h), fixed sample size (n) and population size (N) are used as the input arguments to the strata.data() function in the package. In the latter case, strata.distr() function is called which requires the distribution to be assumed, its parameters, the inital value and the estimated range of the distribution, fixed sample (n) and population sizes (N). When executed, both the functions output the OSB and OSS, amongst other quantities such as stratum weight (Wh), stratum variance (Sh2), stratum objective function values (WhSh), stratum sample sizes (nh), stratum population sizes (Nh) and stratum sampling fraction (fh).
The following sections show the general formulation of the problem of stratification, the DP solution procedure, the concept of Neyman allocation as the method of determining sample size and the overview of package functionalities. The package is then applied to numerous survey populations with real and simulated data to illustrate its application.
Let the target population of the variable under study be stratified into L strata where the estimation of the mean of the study variable (y) is of interest. If a simple random sample of size nh is to be drawn from hth stratum with sample mean ȳh, then the stratified sample mean, ȳst, is given by where Wh (stratum weight) is the proportion of the population contained in the hth stratum.
When the finite population correction factors are ignored, under the Neyman (1934) allocation, the variance of ȳst is given by where Sh2 is the stratum variance for the study variable in the hth (where h = 1, 2, ..., L) stratum and n is the preassigned total sample size.
Let f(y); a ≤ y ≤ b be the frequency function of the study variable, y, on which OSB are to be constructed. If the population mean of this study variable is estimated under Neyman allocation, then the problem of determining OSB is to cut up the range, d = b − a, at (L − 1) intermediate points a = y0 ≤ y1 ≤ y2≤,..., ≤ yL − 1 ≤ yL = b such that (2) is minimum. The lower and upper bounds of the study variable are denoted by a and b respectively and the cut-points y0, y1, y2, ..., yL − 1 are the OSB.
For a fixed sample size n, minimizing the expression of the right hand side of equation (2) is equivalent to minimizing
If f(y) is known and integrable, the stratum weight (Wh), stratum variance (Sh2) and stratum mean (μh) can be obtained as a function of the boundary points yh and yh − 1 by using the following expressions:
where (yh − 1, yh) are the boundaries of hth stratum.
Thus, the objective function in (3) could be expressed as a function of boundary points yh and yh − 1 only. We further define
where lh ≥ 0 denotes the range or width of the hth stratum. Then, the range of the distribution, d = b − a, is expressed as a function of stratum width as: The hth stratification point yh; h = 1, 2, ..., L is then expressed as yh = yh − 1 + lh and from (8), the problem can be treated as an equivalent problem of determining Optimum Strata Widths (OSW), l1, l2, ..., lL. Due to the special nature of functions, the problem may be treated as a function of lh alone and can be expressed as:
Many multistage decision problems can be formulated as a Mathematical Programming Problem (MPP). The Dynamic Programming technique, developed by Bellman, R E (1957), is a computational method which is well suited for solving MPPs that may be treated as a multistage decision problem. The DP technique determines the optimum solution of a multistage problem by decomposing it into stages, each stage comprising of a single stage. The advantage of the decomposition is that the optimization process at each stage involves one variable only, which simplifies the computational task by dealing with all variables simultaneously.
The solution to an MPP is achieved in a sequential manner starting from one stage problem, moving onto a two stage problem, to a three stage problem and so on until finally all stages are included. The solution for n stages is obtained by adding the nth stage to the solution of n − 1 stages.
The basic concept of DP technique is contained in the principle of optimality proclaimed by Bellman, R E (1957), which implies that given the initial state of a system, an optimal policy for the subsequent stages does not depend upon the policy adopted at the preceding stages. It determines the optimum solution of a multi-variable problem by decomposing it into stages, each stage compromising a single variable sub-problem. A dynamic programming model is basically a recursive equation which links the different stages of the problem in a manner which guarantees that each stage’s optimal feasible solution is also optimum and feasible for the entire problem.
Consider the following sub-problem of (9) for first k( < L) strata:
where dk < d is the total width available for division into strata or the state value at stage k. Note that dk = d for k = L.
The transformation functions are given by:
Let Φk(dk) denote the minimum value of the objective function of MPP (10), that is,
With the above definition of Φk(dk), (10) is equivalent to finding ΦL(d) recursively by finding Φk(dk) for k = 1, 2, ..., L and 0 ≤ dk ≤ d.
We can write:
For a fixed value of lk; 0 ≤ lk ≤ dk,
Using the Bellman’s principle of optimality, a forward recursive equation can be written as:
For the first stage, that is, for k = 1:
where l1* = d1 is the optimum width of the first stratum. The relations (11) and (12) are solved recursively for each k = 1, 2, ..., L and 0 ≤ dk ≤ d, and ΦL(d) is obtained. From ΦL(d) the optimum width of Lth stratum, lL*, is obtained. From ΦL − 1(d − lL*) the optimum width of (L − 1)th stratum, lL − 1*, is obtained and so on until l1* is obtained.
The above algorithm of the Dynamic Programming solution procedure to solve the MPP given in (9) is summarized with the following steps:
When OSB (yh, yh − 1) have been determined, the Optimum Sample Sizes (OSS) nh; h = 1, 2, ..., L that minimizes the variance of the estimate can easily be computed. The sample size nh are obtained for a fixed total sample of size n under the Neyman allocation for h = 1, 2, ..., L and given as follows:
where Wh and Sh are derived in terms of the optimum boundary points (yh, yh − 1).
In Neyman allocation, the total sample size is proportional to the stratum size multiplied by the standard deviation of the stratum. If the variances are specified correctly, Neyman allocation will give an estimator with smaller variance compared to proportional allocation (Lohr (2009)).
In equation (13), it is also worth noting that the OSB (yh, yh − 1) are so obtained that nh must satisfy the restrictions:
where Nh = NWh. Thus, the restriction (14) indicates that the hth stratum must form with at least one unit and also avoid the problem of over-sampling.
For the numerical illustrations and application of the package, some of the real datasets such as ‘sugarcane’ of M. G. M. Khan, Reddy, and Rao (2015), ‘anaemia’ of K. G. Reddy, Khan, and Rao (2014); ‘hies’ and ‘math’ data are provided in the stratifyR. The ‘quakes’ and ‘Boston’ data provided in the datasets package in R are also used for illustration purposes. The stratifyR package is also tested on some published and commonly-used datasets such as ‘UScities’ and ‘UScolleges’ data from Cochran (1961), ‘Debtors’ data of Gunning and Horgan (2004), ‘REV84’ variable for ‘Swedish municipalities’ data from Särdnal, Swensson, and Wretman (1992) and ‘MRTS’ variable from ‘Statistics Canada Monthly Retail Trade Survey’ together with ‘SHS’ data collected in ‘Statistics Canada Survey of Household Spending’. For those distributions where real data is not found in literature, data is simulated to demonstrate the application of the package in this documentaion.
For a user, there are two different routes available in the package and these are basically dependent on the type of stratification problem to be solved. It could either be a data-based (i.e., when the dataset of the stratification variable is available) or a distribution-based (i.e., when dataset is not available) stratification problem. Whether stratification is based on data or not, the idea is that the problem is formulated as an MPP using the estimated (with available data) or assumed (with unavailable data) distribution of the data set. There are numerous functions created in the package, however, there are only a few major functions that are used by the two different types of problems being studied in univariate stratification.
If it is a data-based problem, the function used is strata.data() and the user has to provide as input arguments: the data, the number of strata (L) and the fixed sample size (n). For the distribution-based problem, the function used is strata.distr() and the user has to provide the name of the assumed distribution, number of strata (L), possible range of data (distance), fixed sample size (n) and the population size (N). The following two subsections delve a little deeper into the workings surrounding the two functions: strata.data() and strata.distr().
This function computes the OSB, OSS, and other important quantities from univariate survey populations by employing the methodology proposed by E. A. Khan, Khan, and Ahsan (2002) M. G. M. Khan, Khan, and Ahsan (2003), M. G. M. Khan, Sehar, and Ahsan (2005), M. G. M. Khan, Nand, and Ahmad (2008), M. G. M. Khan, Ahmad, and Khan (2009), Nand and Khan (2009), M. G. M. Khan et al. (2015), M. G. M. Khan, Reddy, and Rao (2015), Karuna Garan Reddy, Khan, and Khan (2018) and Karuna G. Reddy and Khan (2019). Their method uses the estimated distribution of the data and formulates the problem of determining OSB as a Mathematical Programming Problem which is an optimization problem that is solved by the DP technque. The OSB obtained are optimal solutions that are used to calculate the OSS under Neyman allocation. The usage and arguments are as follows:
strata.data(data, h, n, cost=FALSE, ch=NULL)
data - A vector: data containing every unit of the survey population
h - A numeric: number of strata to be sampled. The default is 2
n - A numeric: fixed total sample size
cost - A logical: stratum cost. Default cost=FALSE.
ch - A numeric: denotes a vector of stratum costs. Default ch=NULL.
To show the application of the strata.data() function, an example of the command used and its output from the package is given below. The problem uses the ‘mag’ variable from the ‘quakes’ data (with a population of N = 1000) available from the datasets package in R. To construct a 2-strata solution with a fixed sample size of n = 300, we use the following codes:
data(quakes)
head(quakes)
#> lat long depth mag stations
#> 1 -20.42 181.62 562 4.8 41
#> 2 -20.62 181.03 650 4.2 15
#> 3 -26.00 184.10 42 5.4 43
#> 4 -17.97 181.66 626 4.1 19
#> 5 -20.42 181.96 649 4.0 11
#> 6 -19.68 184.31 195 4.0 12
mag <- quakes$mag
length(mag)
#> [1] 1000
hist(mag) #to see the distribution
library(stratifyR)
#> Loading required package: fitdistrplus
#> Loading required package: MASS
#> Loading required package: survival
#> Loading required package: zipfR
#> Loading required package: actuar
#>
#> Attaching package: 'actuar'
#> The following objects are masked from 'package:stats':
#>
#> sd, var
#> The following object is masked from 'package:grDevices':
#>
#> cm
#> Loading required package: triangle
#> Loading required package: mc2d
#> Loading required package: mvtnorm
#>
#> Attaching package: 'mc2d'
#> The following objects are masked from 'package:base':
#>
#> pmax, pmin
res <- strata.data(mag, h = 2, n=300) # a 2-strata solution
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [4, 6.4] with d = 2.4
#> Best-fit Frequency Distribution: lnorm
#> Parameter estimate(s):
#> meanlog sdlog
#> 1.52681032 0.08503554
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 4.68 0.58 0.03 0.109 140 585 0.24
#> 2 6.4 0.42 0.09 0.124 160 415 0.38
#> Total 1.00 0.12 0.233 300 1000 0.30
#> ____________________________________________________
All calculations have been rounded off to 2 decimal places, hence, the individual stratum solutions provided in the tables may not always add up to the totals.
This function is also used to compute the OSB, OSS, and other important quantities from univariate survey populations by employing the methodology proposed by Khan et al given earlier. Its algorithm is quite similar to that of the strata.data(), however, its functionality is applied to the case where the dataset of the population is not available and the distributonal assumptions of the study variable are strictly required. Another caveat for such distribution-based stratification is that the distr.alloc() function uses the probability density functions of the assumed distributions and integration rules presented by equations (4)-(6) to calculate the stratum sample sizes. It must be noted that this function works on ideal distributions that assumes the parameters chosen by the user. The usage and arguments are as follows:
strata.distr(h, initval = NULL, dist = NULL,
distr = c("pareto", "triangle", "rtriangle", "weibull", "gamma",
"exp", "unif","norm", "lnorm", "cauchy"), params = c(shape=0,
scale=0, rate=0, gamma=0, location=0, mean=0, sd=0, meanlog=0,
sdlog=0, min=0, max=0, mode=0), n, N, cost=FALSE, ch=NULL)
h - A numeric: number of strata to be sampled
initval - A numeric: initial value of the assumed distribution
dist - A numeric: distance or range of the assumed distribution
distr - A character: the assumed distribution of the hypothetical population
params - A list: parameters of the assumed distribution
n - A numeric: fixed total sample size
N - A numeric: fixed population size
cost - A logical: stratum cost. Default cost=FALSE.
ch - A numeric: denotes a vector of stratum costs. Default ch=NULL.
The algorithm for strata.distr() is quite similar to the strata.data() for the contruction of OSB. It is only at the sample allocation (OSS) stage that the two are different. strata.distr() is the main function and once the OSB have been computed, this function uses the distr.alloc() function which uses the numerical integration rules for different distibutions (i.e., the equations (4)-(6)) to compute the OSS.
To show the application of the strata.distr() function, let us construct a 2-strata solution assuming that the dataset of the stratification variable is not available but its distribution and estimated parameters are. Let us consider the ‘depth’ variable from the ‘quakes’ dataset from the datasets package, which has a Triangular distribution with parameters min = 39.99998, max = 680, mode = 39.99999. It starts at an initial value of 40 and has a distance (range) of 640 with a fixed sample size of n = 300 from a population of N = 1000 seismic events. Thus, we use the following commands:
min(depth); max(depth); d=max(depth)-min(depth);d #min, max and range of data
#> [1] 40
#> [1] 680
#> [1] 640
# the 2-strata solution is
res <- strata.distr(h=2, initval=40, dist=640, distr = "triangle",
params = c(min=39.99998, max=680, mode=39.99999), n=300, N=1000)
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [40, 680] with d = 640
#> Best-fit Frequency Distribution: triangle
#> Parameter estimate(s):
#> min max mode
#> 39.99998 680.00000 39.99999
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 266.72 0.58 4217.46 37.862 145 583 0.25
#> 2 680 0.42 9488.76 40.619 155 417 0.37
#> Total 1.00 13706.22 78.481 300 1000 0.30
#> ____________________________________________________
As discussed earlier, the stratifyR package is able to handle ten continuous distributions that are quite commonly-used in real-life situations. This section presents a brief overview of these distributions and the application of the proposed method of stratification using real or simulated data which follows a particular distribution. Examples for hypothetical distributions are also presented. For the sake of brevity, the mathematical formulations of the problem of determining the OSB and the DP solution procedure are presented only for the Pareto Type II variable. For all other distributions, only the examples are presented to illustrate the application of the package.
Let the study variable follow the Pareto Type II distribution on the domain of [0, +∞], its two-parameter probability density function with a state space y ≥ 0 is given by:
where α > 0 is the shape parameter and s > 0 is the scale parameter of the distribution.
If the study variable y follows Pareto II (or Lomax) distribution (i.e., y ∼ P(a, s)) with density function given by (15). By using equations (4)-(6), the formulated MPP given in (10) could be expressed as:
where d = yL − y0, a and s are parameters of the Pareto Type II distribution.
To solve the MPP (16) using the DP technique as a solution procedure, we apply the algorithm, that is, the solution proceure using Dynamic Progrmming technique discussed earlier in Section 4. After substituting the quatity yh − 1 = y0 + dh − lh, the recurrence relations (11) and (12) are reduced to:
For the first stage, k = 1, at l1* = d1:
And for the stages k ≥ 2:
The recurrence relations (17) and (18) are solved using the DP technique to determine the OSB.
A dataset for a univariate population of size N = 5000 with the study variable that follows Pareto Type II distribution (pareto_data) was simulated using parameters shape = 5 and scale = 8 to demonstrate the application of the strata.data() function to determine the OSB and other quantites. The data exhibits a 2-parameter Pareto Type II distribution with the MLE estimates of the parameters as shape = 5.026907 and scale = 8.191676. The minimum and maximum values in the simulated data are [y0, yL] = [0.0002193, 38.56871], which implies that d = 38.56849.
To construct the OSB (a 2-strata solution, i.e., h = 2) for the pareto_data with a fixed total sample size of 500, we use the following codes:
set.seed(8235411)
pareto_data <- rpareto(5000, shape=5, scale=8)
head(pareto_data)
#> [1] 1.8464786 0.9226109 1.1079785 9.7699864 4.5600912 0.0441124
hist(pareto_data, breaks=100)
min(pareto_data); max(pareto_data); d=max(pareto_data)-min(pareto_data);d
#> [1] 0.0002192696
#> [1] 38.56871
#> [1] 38.56849
fit <- fitdist(pareto_data, "pareto", start = list(shape = 1, scale = 500))
fit
#> Fitting of the distribution ' pareto ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> shape 5.026907 0.4337688
#> scale 8.191676 0.8358576
res <- strata.data(pareto_data, h = 2, n=500) # a 2-strata solution
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [0.0002192696, 38.56871] with d = 38.56849
#> Best-fit Frequency Distribution: pareto
#> Parameter estimate(s):
#> shape scale
#> 5.016949 8.174580
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 3.03 0.79 0.64 0.632 232 3938 0.06
#> 2 38.57 0.21 11.87 0.732 268 1062 0.25
#> Total 1.00 12.51 1.363 500 5000 0.10
#> ____________________________________________________
Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Pareto Type II population. Let us assume from past knowledge that the study variable in the population follows Pareto Type II distribution with the given attributes such as the shape = 5.05, scale = 8.20, initial value = 0.15 and distance = 38.55. Then, if a sample of size n = 500 is drawn from the population of size N = 5000, we can execute the following command to obtain the results:
res <- strata.distr(h=2, initval=0.15, dist=38.55, distr = "pareto",
params = c(shape=5.05, scale=8.20), n=500, N=5000)
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [0.15, 38.7] with d = 38.55
#> Best-fit Frequency Distribution: pareto
#> Parameter estimate(s):
#> shape scale
#> 5.05 8.20
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 3.21 0.81 0.67 0.592 241 4057 0.06
#> 2 38.7 0.19 11.42 0.637 259 943 0.27
#> Total 1.00 12.09 1.229 500 5000 0.10
#> ____________________________________________________
Let the study variable follow Triangular distribution on the domain of [a, b], its three-parameter probability density function with a state space y ≥ 0 is given by:
where a is the location parameter, b is the scale parameter and c is the shape parameter of the distribution.
Then, formulating the problem as an MPP and solving the recurrence relations as discussed above for Pareto II variate, the OSB are obtained.
To solve the MPP formulated for Triangular distribution (19), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:
For the first stage, k = 1, at l1* = d1:
And for the stages k ≥ 2:
where ak = dk − lk + y0 − a and bk = b − dk + lk − y0 + a.
Substituting the values of a, b, c, y0 and d, the OSW (lh*) and the OSB (yh* = yh − 1* − lh*) are obtained by executing the strata.data() function.
Data on Mathematics marks of first year students in a University in Fiji, with a size of N = 354 called ‘math’ data is used to demonstrate the application of the stratifyR package on a Triangular population. In this example, the variable `final_marks’ is used - it exhibits 3-parameter Triangular distribution with the parameters: min = 6.205204, max = 98.47165 and mode = 53.999996. The minimum and maximum values in the ‘math’ data are [y0, yL] = [7, 97], which implies that d = 90.
To construct the OSB (h = 2) for the `final_marks’ data with a fixed total sample size of 150, we use the following code:
res <- strata.data(final_marks, h = 2, n=150) # a 2-strata solution
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [7, 97] with d = 90
#> Best-fit Frequency Distribution: triangle
#> Parameter estimate(s):
#> min max mode
#> 6.205204 98.471650 53.999996
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 53.16 0.49 114.72 5.278 74 240 0.31
#> 2 97 0.51 116.01 5.463 76 247 0.31
#> Total 1.00 230.73 10.741 150 487 0.31
#> ____________________________________________________
Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Triangular population. Based on the assumption from past knowledge that the population follows Triangular distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:
data(math)
final_marks <- math$final_marks
eps = 1e-8; a <- min(final_marks); b <- max(final_marks); c=b-a# and with mode=54
a;b;c
#> [1] 7
#> [1] 97
#> [1] 90
#find out the estimated parameters
fit <- fitdist(final_marks, distr = "triang", method="mle", lower=c(0,0),
start = list(min = a-eps, max = b+eps, mode = 54))
#> Warning in checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, : Some
#> parameter names have no starting/fixed value but have a default value: mean.
fit
#> Fitting of the distribution ' triang ' by maximum likelihood
#> Parameters:
#> estimate
#> min 6.205364
#> max 98.469797
#> mode 53.999994
# 2-strata solution
res <- strata.distr(h=2, initval=7, dist=90, distr = "triangle",
params = c(min=6.205364, max=98.469797, mode=53.999994), n=150, N=352)
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [7, 97] with d = 90
#> Best-fit Frequency Distribution: triangle
#> Parameter estimate(s):
#> min max mode
#> 6.205364 98.469797 53.999994
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 53.16 0.5 122.23 5.525 76 176 0.43
#> 2 97 0.5 113.22 5.316 74 176 0.42
#> Total 1.0 235.45 10.841 150 352 0.43
#> ____________________________________________________
If a study variable follows the Right-Triangular distribution on the domain of [a, b], its two-parameter probability density function is given by:
where a is the location parameter and b is the scale parameter of the distribution.
Note that the Right-Triangular distribution is a special case of the Triangular distribution discussed in Section 7.2 where the parameters a = c, i.e., minimium value is equal to the mode. Thus, the density function (22) is a two-parameter distribution.
To solve the MPP formulated for Right-Triangular distribution (22), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:
For the first stage, k = 1, at l1* = d1:
And for the stages k ≥ 2:
where ak = b − dk + lk − y0.
Substituting the values of a, b, y0 and d, the OSW (lh*) and the OSB (yh* = yh − 1* − lh*) are obtained by executing the strata.data() function.
A data following Right-Triangular distribution, of size N = 5000, was simulated to demonstrate the application of the stratifyR package. The simulated data takes in three parameters where a = c to indicate that it’s a Right-Triangular distribution. Upon fitting the data, its best-fit distribution exhibits a three-parameter Triangular distribution with the parameters min = 1.987776, max = 7.935599 and mode = 2.026685. Note that because the min and mode parameters are very close to each and not exactly the same, it is treated as a Triangular distribution even thought the data simulated was for a Right-Triangular distribution. The minimum and maximum values in the simulated data are [y0, yL] = [2.000052, 7.83871] with d = 5.838658.
To construct the OSB (h = 2) for the Right-Triangular data with a fixed total sample size of 500, we use the following code:
#Generate RT data
set.seed(12546)
data <- rtriangle(n=1000, a=2, b=8, c=2) #right-triangular since a=c
hist(data)
res <- strata.data(data, h = 2, n=500) # a 2-strata solution
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [2.000052, 7.83871] with d = 5.838658
#> Best-fit Frequency Distribution: triangle
#> Parameter estimate(s):
#> min max mode
#> 1.987776 7.935599 2.026685
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 4.1 0.57 0.36 0.339 233 568 0.41
#> 2 7.84 0.43 0.81 0.388 267 432 0.62
#> Total 1.00 1.16 0.727 500 1000 0.50
#> ____________________________________________________
We see in the above example that the simulated data is a Right-Triangular distribution, however, when data is fitted using MLE method in the package, it turns out that it best-fits Triangular distribution because the min is not exactly equal to mode.
Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Right-Triangular population. Based on the assumption from past knowledge that the population follows Right-Triangular distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:
res <- strata.distr(h=2, initval=1.007202, dist=0.992781, distr = "rtriangle",
params = c(min=2, max=10, mode=2), n=500, N=1000)
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [1.007202, 1.999983] with d = 0.992781
#> Best-fit Frequency Distribution: rtriangle
#> Parameter estimate(s):
#> min max mode
#> 2 10 2
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 1.5 -0.13 0.02 0.019 -130 -130 1.00
#> 2 2 1.13 0.02 0.019 630 1130 0.56
#> Total 1.00 0.04 0.038 500 1000 0.50
#> ____________________________________________________
The results show that this fits a two-paramter Right-Triangular distribution. Do note that in the above command, one has to specify all three parameters min = 2, max = 10 and mode = 2, even for a Right-Triangular distribution, where min = mode.
If the study variable follows the Weibull distribution on the interval [y0, yL], its two-parameter probability density function with a state space y ≥ 0 is given by:
where r > 0 is the shape parameter and θ > 0 is the scale parameter of the distribution.
To solve the MPP formulated for Weibull distribution (25), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:
For the first stage, k = 1, at l1* = d1:
And for the stages k ≥ 2:
The recurrence relations (26) and (27) are solved using the DP technique to determine the OSB.
A health data of size N = 724, called ‘anaemia’ data (K. G. Reddy, Khan, and Rao (2014)), is used to demonstrate the application of the stratifyR package on Weibull population. The ‘anaemia’ data comes from the National Nutritional Survey on the ``Micronutrient Status of Women in Fiji” and has many variables such as level of Iron, Folate, Zinc, etc. In this example, the variable Iron is used since it exhibits a 2-parameter Weibull distribution with the shape and scale parameters as r = 2.144586 and θ = 13.790744 respectively. The minimum and maximum values are [y0, yL] = [1.5, 34.7], which implies that d = 33.2.
To construct the OSB (h = 2) for the Iron data with a fixed total sample size of 500, we use the following codes:
res <- strata.data(Iron, h = 2, n=500) # a 2-strata solution
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [1.5, 34.7] with d = 33.2
#> Best-fit Frequency Distribution: weibull
#> Parameter estimate(s):
#> shape scale
#> 2.144879 13.792245
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 12.74 0.6 9.03 1.789 255 431 0.59
#> 2 34.7 0.4 18.11 1.722 245 293 0.84
#> Total 1.0 27.14 3.511 500 724 0.69
#> ____________________________________________________
Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Weibull population. Based on the assumption from past knowledge that the population follows Weibull distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:
res <- strata.distr(h=2, initval=2.9, dist=55.9, distr = "weibull",
params = c(shape=2.144586, scale=13.790744), n=500, N=5000)
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [2.9, 58.8] with d = 55.9
#> Best-fit Frequency Distribution: weibull
#> Parameter estimate(s):
#> shape scale
#> 2.144586 13.790744
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 13.05 0.59 7.49 1.516 239 2943 0.08
#> 2 58.8 0.41 16.28 1.660 261 2057 0.13
#> Total 1.00 23.77 3.176 500 5000 0.10
#> ____________________________________________________
If the study variable follows the Gamma distribution (i.e., y ∼ Γ(r, θ)) on the interval [y0, yL], it has the following two-parameter probability density function:
where is a shape parameter and θ is the scale parameter and Γ(r) is a Gamma function defined by
The function in equation $(\ref{222})$ is also defined by an upper incomplete gamma function Γ(r, x) and a lower incomplete gamma function γ(r, x), respectively, as follows:
There also exist regularized/normalised incomplete Gamma functions which give a value restricted between 0 and 1 and can be stated as:
where Q(r, y) denotes the Upper Regularized Incomplete Gamma function while P(r, y) denotes regularized Lower Incomplete Gamma function (Abramowitz and Stegun (1972), Chapter 6). Note that Q(r, y) = 1 − P(r, y).
To solve the MPP formulated for Gamma distribution (28), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:
For the first stage, k = 1, at l1* = d1:
And for the stages k ≥ 2:
The recurrence relations (34) and (35) are solved using the DP technique to determine the OSB.
Again, the health data of size N = 724, derived from the ‘National Nutritional Survey’ on the ‘Micronutrient Status of Women in Fiji’ is used to demonstrate the application of the stratifyR package on Gamma population. In this example, the variable Folate is used since it exhibits a 2-parameter Gamma distribution with the shape and scale parameters as r = 6.9922 and θ = 2.5785 respectively. The minimum and maximum values are [y0, yL] = [4.9, 45.4], which implies that d = 40.5.
To construct the OSB (h = 2) for the Folate data with a fixed total sample size of 500, we use the following codes:
res <- strata.data(Folate, h = 2, n=500) # a 2-strata solution
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [4.9, 45.4] with d = 40.5
#> Best-fit Frequency Distribution: gamma
#> Parameter estimate(s):
#> shape rate
#> 6.9910643 0.3877442
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 18.83 0.61 10.78 2.005 244 442 0.55
#> 2 45.4 0.39 29.05 2.099 256 282 0.91
#> Total 1.00 39.83 4.104 500 724 0.69
#> ____________________________________________________
Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Gamma population. Based on the assumption from past knowledge that the population follows Gamma distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:
res <- strata.distr(h=2, initval=0.5, dist=50, distr = "gamma",
params = c(shape=3.835768, rate=0.340328), n=500, N=12000)
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [0.5, 50.5] with d = 50
#> Best-fit Frequency Distribution: gamma
#> Parameter estimate(s):
#> shape rate
#> 3.835768 0.340328
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 12.18 0.63 6.95 1.653 247 7522 0.03
#> 2 50.5 0.37 20.50 1.689 253 4478 0.06
#> Total 1.00 27.45 3.342 500 12000 0.04
#> ____________________________________________________
If the study variable follows the Exponential distribution on the interval [y0, yL], its one-parameter probability density function is given by:
where λ is the continuous rate parameter (or the inverse scale parameter).
To solve the MPP formulated for Exponential distribution (36), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:
For the first stage, k = 1, at l1* = d1:
And for the stages k ≥ 2:
Substituting the values of λ, y0 and d, the OSW (lh*) and the OSB (yh* = yh − 1* − lh*) are obtained by executing the strata.data() function.
A data following Exponential distribution (herein called Exp data), of size N = 10000 was simulated to demonstrate the application of the stratifyR package on an Exponential population. The data exhibits an Exponential distribution with the parameter rate = 1.359205. The minimum and maximum values in the simulated data are [y0, yL] = [5.747904e − 05, 8.016871], which implies that d = 8.016814.
To construct the OSB (h = 2) for the data that follows Exponential distribution with a fixed total sample size of 500, we use the following codes:
res <- strata.data(data, h = 2, n=500) # a 2-strata solution
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [5.747904e-05, 8.016871] with d = 8.016814
#> Best-fit Frequency Distribution: exp
#> Parameter estimate(s):
#> rate
#> 1.359205
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 0.93 0.72 0.07 0.187 235 3596 0.07
#> 2 8.02 0.28 0.57 0.212 265 1404 0.19
#> Total 1.00 0.64 0.399 500 5000 0.10
#> ____________________________________________________
Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Exponential population. Based on the assumption from past knowledge that the population follows Exponential distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:
set.seed(28951)
data <- rexp(5000, rate = 1.36)
min(data); max(data); d=max(data)-min(data);d
#> [1] 5.747904e-05
#> [1] 8.016871
#> [1] 8.016814
fit <- fitdist(data, distr="exp", method="mle")
fit
#> Fitting of the distribution ' exp ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> rate 1.359205 0.01922205
res <- strata.distr(h=2, initval=5.748e-05, dist=8.017, distr = "exp",
params = c(rate=1.36), n=500, N=5000) #a 2-strata solution
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [5.748e-05, 8.017057] with d = 8.017
#> Best-fit Frequency Distribution: exp
#> Parameter estimate(s):
#> rate
#> 1.36
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 0.93 0.72 0.07 0.185 235 3584 0.07
#> 2 8.02 0.28 0.54 0.208 265 1416 0.19
#> Total 1.00 0.60 0.392 500 5000 0.10
#> ____________________________________________________
If the study variable follows the Uniform distribution on the interval [y0, yL], its two-parameter probability density function is given by:
where a and b are the continuous boundary parameters, i.e., a and b are the minimum and maximum values respectively.
To solve the MPP formulated for Uniform distribution (39), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:
For the first stage, k = 1, at l1* = d1:
And for the stages k ≥ 2:
Substituting the values of a, b, y0 and d, the OSW (lh*) and the OSB (yh* = yh − 1* − lh*) are obtained by executing the strata.data() function.
A data following Uniform distribution of size N = 5000 was simulated to demonstrate the application of the stratifyR package on a Uniform population. The data for Uniform distribution is simulated with the parameters min = 2 and max = 15. When fitted, the minimum and maximum values in the simulated data are [y0, yL] = [2.006522, 14.99764], which implies that d = 12.99112.
To construct the OSB (h = 2) for the Uniformly-distributed data with a fixed total sample size of 450, we use the following codes:
res <- strata.data(data, h = 2, n=450) # a 2-strata solution
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [2.006522, 14.99764] with d = 12.99112
#> Best-fit Frequency Distribution: unif
#> Parameter estimate(s):
#> min max
#> 2.006522 14.997645
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 8.5 0.49 3.55 0.919 220 2437 0.09
#> 2 15 0.51 3.52 0.962 230 2563 0.09
#> Total 1.00 7.07 1.880 450 5000 0.09
#> ____________________________________________________
Note that the above results indicate that the best-fit distribution is Triangular and not Uniform. This is because when stratifyR package assesses the data to ascertain the best-fit distribution, it is not able to calculate its AIC value for Uniform distribution. Hence, the distribution that gives the lowest AIC is taken to be the best-fit distribution. Since AIC for Uniform distribution is not available, the next best-fit distribution (Triangular) is be chosen. The results, you would find, are still quite accurate!
Similarly, we can apply the strata.distr() function for the Uniform distribution where the arguments can be assumed from past knowledge. To construct the OSB for h = 2 for a hypothetical variable that follows Uniform distribution (with parameters min = 3 and max = 15) with a fixed total sample size of 450 from a population of 5000, the following command is used:
# For a hypothetical uniform distribution, it does give a result
res <- strata.distr(h=2, initval=3, dist=12, distr = "unif",
params = c(min=3, max=15), n=450, N=5000)
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [3, 15] with d = 12
#> Best-fit Frequency Distribution: unif
#> Parameter estimate(s):
#> min max
#> 3 15
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 9 0 3 0.866 225 2500 0.09
#> 2 15 0 3 0.866 225 2500 0.09
#> Total 1 6 1.732 450 5000 0.09
#> ____________________________________________________
If the study variable follows the Normal distribution on the interval [y0, yL], it has the following two-parameter probability density function:
where σ > 0 is a scale parameter and μ is the location parameter.
The following definitions of error function are worth noting since they are needed to simplify the integrations used to derive the stratum weight, mean and variance due to normal distribution. It can also be written as
To solve the MPP formulated for Normal distribution (42), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:
For the first stage, k = 1, at l1* = d1:
And for the stages k ≥ 2:
Substituting the values of μ, σ, y0 and d, the OSW (lh*) and the OSB (yh* = yh − 1* − lh*) are obtained by executing the strata.data() function.
A data following Normal distribution (herein called Norm data), of size N = 5000 was simulated to demonstrate the application of the stratifyR package on a Normal population. The data exhibits an Normal distribution with the parameters mean = 16.010776 and sd = 1.662357. The minimum and maximum values in the simulated data are [y0, yL] = [9.923816, 22.51267], which implies that d = 10.62118.
To construct the OSB for h = 2 using the Norm data with a fixed total sample size of 580, the command below can be used:
res <- strata.data(data, h = 2, n=500) #construct a 2-strata solution
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [9.923816, 22.51267] with d = 12.58885
#> Best-fit Frequency Distribution: norm
#> Parameter estimate(s):
#> mean sd
#> 16.010776 1.662357
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 16.01 0.5 0.98 0.493 244 2489 0.1
#> 2 22.51 0.5 1.05 0.515 256 2511 0.1
#> Total 1.0 2.03 1.009 500 5000 0.1
#> ____________________________________________________
Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Normal population. Based on the assumption from past knowledge that the population follows Normal distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:
set.seed(89821)
data <- rnorm(5000, mean = 16, sd = 1.65)
min(data); max(data); d=max(data)-min(data);d
#> [1] 9.923816
#> [1] 22.51267
#> [1] 12.58885
fit <- fitdist(data, distr="norm", method="mle")
fit
#> Fitting of the distribution ' norm ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> mean 16.010776 0.02350928
#> sd 1.662357 0.01662355
res <- strata.distr(h=2, initval=9.923816, dist=12.58885, distr = "norm",
params = c(mean=16.010776, sd=1.662357), n=500, N=5000)
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [9.923816, 22.51267] with d = 12.58885
#> Best-fit Frequency Distribution: norm
#> Parameter estimate(s):
#> mean sd
#> 16.010776 1.662357
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 16.01 0.5 1 0.5 250 2501 0.1
#> 2 22.51 0.5 1 0.5 250 2499 0.1
#> Total 1.0 2 1.0 500 5000 0.1
#> ____________________________________________________
If the study variable follows the Log-normal distribution on the interval [y0, yL], it has the following two-parameter probability density function:
where σ > 0 is a scale parameter and μ is the location parameter.
To solve the MPP formulated for Log-Normal distribution (47), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:
For the first stage, k = 1, at l1* = d1:
And for the stages k ≥ 2:
Substituting the values of μ, σ, y0 and d, the OSW (lh*) and the OSB (yh* = yh − 1* − lh*) are obtained by executing the strata.data() function.
The ‘hies’ data of size N = 3566 is used to demonstrate the application of the stratifyR package on Log-normal population. The ‘hies’ data comes from the HIES survey conducted in Fiji in the year 2010. The data contains only two aspects of the survey, namely Income and Expenditure. In this example, the variable Expenditure is used since it exhibits a 2-parameter Log-normal distribution with the shape and scale parameters as meanlog = 9.2804934 and sdlog = 0.6917842 respectively. The minimum and maximum values are [y0, yL] = [991.24, 136539.1], which implies that d = 135547.8.
To construct the OSB (h = 2) for the Iron data with a fixed total sample size of 500, we use the following codes:
data(hies)
Expenditure <- hies$Expenditure
head(Expenditure);length(Expenditure)
#> [1] 10880.58 3633.28 3882.84 4862.67 9237.19 9637.24
#> [1] 3566
hist(Expenditure)
min(Expenditure); max(Expenditure); d=max(Expenditure)-min(Expenditure);d
#> [1] 991.24
#> [1] 136539.1
#> [1] 135547.8
fit <- fitdist(Expenditure, distr="lnorm", method="mle")
fit
#> Fitting of the distribution ' lnorm ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> meanlog 9.2804934 0.011584571
#> sdlog 0.6917842 0.008191452
res <- strata.data(Expenditure, h = 2, n=500)
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [991.24, 136539.1] with d = 135547.8
#> Best-fit Frequency Distribution: lnorm
#> Parameter estimate(s):
#> meanlog sdlog
#> 9.2804934 0.6917842
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 17206.65 0.77 14022946 2886.776 214 2749 0.08
#> 2 136539.06 0.23 285033008 3868.016 286 817 0.35
#> Total 1.00 299055954 6754.792 500 3566 0.14
#> ____________________________________________________
Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Log-normal population. Based on the assumption from past knowledge that the population follows Log-normal distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:
res <- strata.distr(h=2, initval=10, dist=188, distr = "lnorm",
params = c(meanlog=3.23, sdlog=0.65), n=500, N=1588)
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [10, 198] with d = 188
#> Best-fit Frequency Distribution: lnorm
#> Parameter estimate(s):
#> meanlog sdlog
#> 3.23 0.65
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 39.66 0.76 63.79 5.423 247 1200 0.21
#> 2 198 0.24 516.96 5.535 253 388 0.65
#> Total 1.00 580.75 10.958 500 1588 0.31
#> ____________________________________________________
If the study variable follows the Cauchy distribution on the interval [y0, yL], its two-parameter probability density function is given by:
where μ is the location parameter which specifies the location of the peak of the distribution and σ is the scale parameter, which specifies the half-width at half maximum of the distribution.
To solve the MPP formulated for Cauchy distribution (50), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:
For the first stage, k = 1, at l1* = d1:
And for the stages k ≥ 2:
Substituting the values of μ, σ, y0 and d, the OSW (lh*) and the OSB (yh* = yh − 1* − lh*) are obtained by executing the strata.data() function.
The Boston data from the MASS package is used to demonstrate the application for Cauchy population. In this example, the variable ‘black’ is used since it exhibits a 2-parameter Cauchy distribution with the location and scale parameters as location = 393.864307 and scale = 4.710457 respectively. The minimum and maximum values are [y0, yL] = [0.32, 396.9], which implies that d = 396.58.
min(black); max(black); d=max(black)-min(black);d
#> [1] 0.32
#> [1] 396.9
#> [1] 396.58
fit <- fitdist(black, distr="cauchy", method="mle")
fit
#> Fitting of the distribution ' cauchy ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> location 393.863075 0.3060843
#> scale 4.709346 0.3395534
res <- strata.data(black, h = 2, n=500)
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [0.32, 396.9] with d = 396.58
#> Best-fit Frequency Distribution: cauchy
#> Parameter estimate(s):
#> location scale
#> 393.864307 4.710457
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 364.29 0.2 18335.38 27.028 101 101 1.00
#> 2 396.9 0.8 58.14 6.103 399 405 0.99
#> Total 1.0 18393.52 33.131 500 506 0.99
#> ____________________________________________________
Please note that for Cauchy distributions, one might get the error messages “simpleError in optim()…”. You can ignore this error message which simply indicates that when the data is fitted to all ten distributions, it gives errors while computing the parameters of those distributions which are not defined in the negative region. This happens with Cauchy distribution because it is defined on both positive and negative regions.
Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Cauchy population. Based on the assumption from past knowledge that the population follows Cauchy distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:
#for a cauchy distribution with initial value of x0=-1, d=2 and
#location and scale parameters 0 and 1 respectively
res <- strata.distr(h=2, initval=-1, dist=2, distr = "cauchy",
params = c(location=0, scale=1), n=500, N=5000)
#> The program is running, it'll take some time!
summary(res)
#> _____________________________________________
#> Optimum Strata Boundaries for h = 2
#> Data Range: [-1, 1] with d = 2
#> Best-fit Frequency Distribution: cauchy
#> Parameter estimate(s):
#> location scale
#> 0 1
#> ____________________________________________________
#> Strata OSB Wh Vh WhSh nh Nh fh
#> 1 0 0.5 0.08 0.07 250 2500 0.1
#> 2 1 0.5 0.08 0.07 250 2500 0.1
#> Total 1.0 0.16 0.14 500 5000 0.1
#> ____________________________________________________