Title: | Methods for Analyzing Binned Income Data |
---|---|
Description: | Methods for model selection, model averaging, and calculating metrics, such as the Gini, Theil, Mean Log Deviation, etc, on binned income data where the topmost bin is right-censored. We provide both a non-parametric method, termed the bounded midpoint estimator (BME), which assigns cases to their bin midpoints; except for the censored bins, where cases are assigned to an income estimated by fitting a Pareto distribution. Because the usual Pareto estimate can be inaccurate or undefined, especially in small samples, we implement a bounded Pareto estimate that yields much better results. We also provide a parametric approach, which fits distributions from the generalized beta (GB) family. Because some GB distributions can have poor fit or undefined estimates, we fit 10 GB-family distributions and use multimodel inference to obtain definite estimates from the best-fitting distributions. We also provide binned income data from all United States of America school districts, counties, and states. |
Authors: | Samuel V. Scarpino, Paul von Hippel, and Igor Holas |
Maintainer: | Samuel V. Scarpino <[email protected]> |
License: | GPL (>= 3.0) |
Version: | 1.0.4 |
Built: | 2024-12-09 06:38:47 UTC |
Source: | CRAN |
Methods for model selection, model averaging, and calculating metrics, such as the Gini, Theil, Mean Log Deviation, etc, on binned income data where the topmost bin is right-censored. We provide both a non-parametric method, termed the bounded midpoint estimator (BME), which assigns cases to their bin midpoints; except for the censored bins, where cases are assigned to an income estimated by fitting a Pareto distribution. Because the usual Pareto estimate can be inaccurate or undefined, especially in small samples, we implement a bounded Pareto estimate that yields much better results. We also provide a parametric approach, which fits distributions from the generalized beta (GB) family. Because some GB distributions can have poor fit or undefined estimates, we fit 10 GB-family distributions and use multimodel inference to obtain definite estimates from the best-fitting distributions. We also provide binned income data from all United States of America school districts, counties, and states.
Package: | binequality |
Type: | Package |
Version: | 1.0.4 |
Date: | 2018-11-05 |
License: | GPL (>= 3.0) |
The datasets are: state_bins, county_bins, and school_district_bins.
The main functions are: fitFunc, run_GB_family, getMids, getQuantilesParams, giniCoef, LRT, makeFitComb, makeInt, makeIntWeight, makeWeightsAIC, mAvg, midStats, MLD, modelAvg, paramFilt, and theilInd.
Type ?<object> to learn more about these objects, e.g. ?state_bins
Type ?<function> to see examples of the function's use, e.g. ?getMids
Samuel V. Scarpino, Paul von Hippel, and Igor Holas
Maintainer: Samuel V. Scarpino <[email protected]>
Von Hippel, P. T., Scarpino, S. V., & Holas, I. (2016). Robust estimation of inequality from binned incomes. Sociological Methodology, 46(1), 212-251.
#FIXME, write the example run of states here
#FIXME, write the example run of states here
A data set of binned income by US counties. FIXME - more info.
data(state_bins)
data(state_bins)
#FIXME add in format
data(county_bins)
data(county_bins)
This function fits a parametric distribution binned data. The data are subdivided using ID.
fitFunc(ID, hb, bin_min, bin_max, obs_mean, ID_name, distribution = "LOGNO", distName = "LNO", links = c(muLink = "identity", sigmaLink = "log", nuLink = NULL, tauLink = NULL), qFunc = qLOGNO, quantiles = seq(0.006, 0.996, length.out = 1000), linksq = c(identity, exp, NULL, NULL), con = gamlss.control(c.crit=0.1,n.cyc=200, trace=FALSE), saveQuants = FALSE, muStart = NULL, sigmaStart = NULL, nuStart = NULL, tauStart = NULL, muFix = FALSE, sigmaFix = FALSE, nuFix = FALSE, tauFix = FALSE, freeParams = c(TRUE, TRUE, FALSE, FALSE), smartStart = FALSE, tstamp = as.numeric(Sys.time()))
fitFunc(ID, hb, bin_min, bin_max, obs_mean, ID_name, distribution = "LOGNO", distName = "LNO", links = c(muLink = "identity", sigmaLink = "log", nuLink = NULL, tauLink = NULL), qFunc = qLOGNO, quantiles = seq(0.006, 0.996, length.out = 1000), linksq = c(identity, exp, NULL, NULL), con = gamlss.control(c.crit=0.1,n.cyc=200, trace=FALSE), saveQuants = FALSE, muStart = NULL, sigmaStart = NULL, nuStart = NULL, tauStart = NULL, muFix = FALSE, sigmaFix = FALSE, nuFix = FALSE, tauFix = FALSE, freeParams = c(TRUE, TRUE, FALSE, FALSE), smartStart = FALSE, tstamp = as.numeric(Sys.time()))
ID |
a (non-empty) object containing the group ID for each row. Importantly, ID, bh, bin_min, bin_max, and obs_mean MUST be the same length and be in the SAME order. |
hb |
a (non-empty) object containing the number of observations in each bin. Importantly, ID, bh, bin_min, bin_max, and obs_mean MUST be the same length and be in the SAME order. |
bin_min |
a (non-empty) object containing the lower bound of each bin. Currently, this package cannot handle data with open lower bounds. Importantly, ID, bh, bin_min, bin_max, and obs_mean MUST be the same length and be in the SAME order. |
bin_max |
a (non-empty) object the upper bound of each bin. Currently, this package can only handle the upper-most bin being open ended. Importantly, ID, bh, bin_min, bin_max, and obs_mean MUST be the same length and be in the SAME order. |
obs_mean |
a (non-empty) object containing the mean for each group. Importantly, ID, bh, bin_min, bin_max, and obs_mean MUST be the same length and be in the SAME order. |
ID_name |
a (non-empty) object containing column name for the ID column. |
distribution |
a (non-empty) character naming a gamlss family. |
distName |
a (non-empty) character object with the name of the distribution. |
links |
a (non-empty) vector of link characters naming functions with the following items: muLink, sigmaLink, nuLink, and tauLink. |
qFunc |
a (non-empty)gamlss function for calculating quantiles, this should match the distribution in distribution. |
quantiles |
a (non-empty) numeric vectors of the desired quantiles, these are used in calculating metrics. |
linksq |
a (non-empty) vector of functions, which undue the link functions. For example, if muLink = log, then the first entry in linksq should be exp. If you are using an indentity link function in links, then the corresponding entry in linksq should be indentity. |
con |
an optional lists modifying gamlss.control. |
saveQuants |
an optional logical value indicating whether to save the quantiles. |
muStart |
an optional numerical value for the starting value of mu. |
sigmaStart |
an optional numerical value for the starting value of sigma. |
nuStart |
an optional numerical value for the starting value of nu. |
tauStart |
an optional numerical value for the starting value of tau. |
muFix |
an logical value indicating whether mu is fixed or is free to vary during the fitting process. |
sigmaFix |
an logical value indicating whether sigma is fixed or is free to vary during the fitting process. |
nuFix |
an logical value indicating whether nu is fixed or is free to vary during the fitting process. |
tauFix |
an logical value indicating whether tau is fixed or is free to vary during the fitting process. |
freeParams |
a vector of logical values indicating whether each of the four parameters is free == TRUE or fixed == FALSE. |
smartStart |
a logical indicating whether a smart starting place should be chosen, this applies only when fitting the GB2 distribution. |
tstamp |
a time stamp. |
Fits a GAMLSS and estimates a number of metrics, see value.
returns a list with 'datOut' a data.frame with the IDs, observer mean, distribution, estimated mean, variance, coefficient of variation, cv squared, gini, theil, MLD, aic, bic, the results of a convergence test, log likelihood, number of parameters, median, and std. deviation; 'timeStamp' a time stamp; 'parameters' the estiamted parameter; and 'quantiles' the quantile estimates if saveQuants == TRUE)
FIXME - references
data(state_bins) use_states <- which(state_bins[,'State'] == 'Texas' | state_bins[,'State'] == 'California') ID <- state_bins[use_states,'State'] hb <- state_bins[use_states,'hb'] bmin <- state_bins[use_states,'bin_min'] bmax <- state_bins[use_states,'bin_max'] omu <- rep(NA, length(use_states)) fitFunc(ID = ID, hb = hb, bin_min = bmin, bin_max = bmax, obs_mean = omu, ID_name = 'State')
data(state_bins) use_states <- which(state_bins[,'State'] == 'Texas' | state_bins[,'State'] == 'California') ID <- state_bins[use_states,'State'] hb <- state_bins[use_states,'hb'] bmin <- state_bins[use_states,'bin_min'] bmax <- state_bins[use_states,'bin_max'] omu <- rep(NA, length(use_states)) fitFunc(ID = ID, hb = hb, bin_min = bmin, bin_max = bmax, obs_mean = omu, ID_name = 'State')
This function returns the bin midpoints for use in calculating midpoint-based statistics. The code is not written to handle left censored bins or more than one right censored bins.
getMids(ID, hb, lb, ub, alpha_bound)
getMids(ID, hb, lb, ub, alpha_bound)
ID |
A vector of group IDs |
hb |
A vector of heights for each bin |
lb |
A vector of lower bounds, which must all be real numbers. |
ub |
A vector of upper bouns, which can have one censored bin per group. |
alpha_bound |
A numeric value indicating the bound on alpha for determining the midpoint of the upper-most, censored bin. To unbound the value of alpha, set alpha_bound = numeric(0). |
For all non-censored bins, ie real number upper and lower bounds, the midpoint is simply the arithemtic mean of the bounds. This assumes the number of observations are normally distributed in each bin. However, for the right-censored bin we relax this assumption when calcuting the bin midpoint. FIXME - say something about the method we use to calculate midpoint of the upper bin.
returns a list with the following elements: mids (a data frame with the midpoint information needed for the midStats function), the "c" values, and "alpha" values from the model fit.
FIXME - references
data(state_bins) bin_mids <- getMids(ID = state_bins[,'State'], hb = state_bins[,'hb'], lb = state_bins[,'bin_min'], ub = state_bins[,'bin_max'], alpha_bound = 10/9) bin_mids_unbound <- getMids(ID = state_bins[,'State'], hb = state_bins[,'hb'], lb = state_bins[,'bin_min'], ub = state_bins[,'bin_max'], alpha_bound = numeric(0))
data(state_bins) bin_mids <- getMids(ID = state_bins[,'State'], hb = state_bins[,'hb'], lb = state_bins[,'bin_min'], ub = state_bins[,'bin_max'], alpha_bound = 10/9) bin_mids_unbound <- getMids(ID = state_bins[,'State'], hb = state_bins[,'hb'], lb = state_bins[,'bin_min'], ub = state_bins[,'bin_max'], alpha_bound = numeric(0))
This extracts the quantiles and parameters.
getQuantilesParams(fit.i, qFunc = qLOGNO, quantiles = seq(0.006, 0.996, length.out = 1000), linksq = c(identity, exp, NULL, NULL), freeParams, fixedParams)
getQuantilesParams(fit.i, qFunc = qLOGNO, quantiles = seq(0.006, 0.996, length.out = 1000), linksq = c(identity, exp, NULL, NULL), freeParams, fixedParams)
fit.i |
a (non-empty) object of class gamlss, which is the fitted distribution. |
qFunc |
a (non-empty) quantile generating function from gamlss. |
quantiles |
an optional numeric vector of the desired quantiles. |
linksq |
a (non-empty) vector is link functions. |
freeParams |
a (non-empty) logical vector inidicating whether parameters are fixed == FALSE or free == TRUE. |
fixedParams |
a (non-empty) numeric vector of fixed parameter values. |
Extracts the quantile and parameter estimates.
Returns a list with: samps = the quantiles extracted at the locations specified in quantiles and params = the parameter values of the fitted model.
FIXME - references
#not run, this function is used internally
#not run, this function is used internally
This function estimates the Gini coefficient using the quantiles sampled from a fitted GAMLSS object.
giniCoef(seq.i, samps.i)
giniCoef(seq.i, samps.i)
seq.i |
a (non-empty) numeric vector of quantile locations. |
samps.i |
a (non-empty) numeric vector of quantile values. |
Calculates the Gini coefficient from quantiles
Returns the Gini coefficient estimate.
FIXME - references
#not run, internal function
#not run, internal function
This function performs likelihood ratio tests for use in model selection.
LRT(dat, fitComb, ID)
LRT(dat, fitComb, ID)
dat |
a (non-empty) data frame containing an ID column (see ID) and a column called "hb", which contains the bin heights used in model fitting. |
fitComb |
a (non-empty) data frame returned from the function makeFitComb. |
ID |
a (non-empty) string indicating the column dat and the column in fitComb containing the group ids, these column names must match. |
Performs a likelihood ratio test.
Returns an object with the same information as was passed into the fitComb argument, but with additional columns G2 = the test statistic, p = the p value, and df = the degrees of freedom.
FIXME - references
#not run, function is used internally
#not run, function is used internally
This funciton transforms a list of fitted models into a data frame.
makeFitComb(distFitsList)
makeFitComb(distFitsList)
distFitsList |
a (non-empty) list containing the fitted models from the function fitFunc |
Transforms a list into a data frame
Returns a data frame where each row is a distribution/unique ID pair, ie a county and its fitted GB2 distribution. It also contains summary stats.
#not run, internal function
#not run, internal function
This function creates a survival object from bin counts, these data cannot be right censored. There is a separate function for the right censored bins.
makeInt(ints, hb, trans = NULL)
makeInt(ints, hb, trans = NULL)
ints |
a (non-empty) numeric matrix containing the bin end points, there must be as many rows as length(hb) and the lower bin should be on the left. |
hb |
a (non-empty) numeric vector of bin heights. |
trans |
an optional function to transform the bins. |
Creates a survival object. FIXME - reference to survival package and Surv function.
Returns an object of class "Surv"
FIXME - reference to survival package and Surv function.
#not run, internal function
#not run, internal function
This function creates a survival object from bin counts, these data cannot be right censored. There is a separate function for the right censored bins. It also returns normalized bin weights.
makeIntWeight(ints, hb, trans = NULL)
makeIntWeight(ints, hb, trans = NULL)
ints |
a (non-empty) numeric matrix containing the bin end points, there must be as many rows as length(hb) and the lower bin should be on the left. |
hb |
a (non-empty) numeric vector of bin heights. |
trans |
an optional function to transform the bins. |
Creates a survival object. FIXME - reference to survival package and Surv function.
Returns an object of class "Surv" and normalizde bin weights.
FIXME - reference to survival package and Surv function.
#not run, internal function
#not run, internal function
This function calculates AIC weights for use in model selection and model averaging.
makeWeightsAIC(aics)
makeWeightsAIC(aics)
aics |
a (non-empty) numeric vector of AIC values. |
FIXME - citation
Returns a vector of AIC weights.
FIXME - refs
#not run, internal function
#not run, internal function
This function takes weights and parameters and returns model averaged parameters using the weights.
mAvg(params, ws)
mAvg(params, ws)
params |
a (non-empty) numeric vector of parameters |
ws |
a (non-empty) numeric vector of weights corresponding to params. |
Returns a numeric vector of model averaged parameters.
#not run, iternal function
#not run, iternal function
This function calcualtes a suite of pre-defined statistics on binned distributions using pre-calculated midpoints for each bin.
midStats(data)
midStats(data)
data |
a (non-empty) data frame with columns named ID, mids, and hb. |
Currently has the following stats: 'mean','median','gini','theil','cv','MLD'. FIXME - reference to functions in gamlss
Returns a data frame with the following columns: 'ID','mean','median','gini','theil','cv','MLD', 'SLD'. "cv" is the coefficient of variation, "MLD" is the mean log deviance, and "SLD" is the squar root of the MLD.
data(state_bins) bin_mids <- getMids(ID = state_bins[,'State'], hb = state_bins[,'hb'], lb = state_bins[,'bin_min'], ub = state_bins[,'bin_max'], alpha_bound = 10/9) state_bin_mid_stats <- midStats(data = bin_mids$mids)
data(state_bins) bin_mids <- getMids(ID = state_bins[,'State'], hb = state_bins[,'hb'], lb = state_bins[,'bin_min'], ub = state_bins[,'bin_max'], alpha_bound = 10/9) state_bin_mid_stats <- midStats(data = bin_mids$mids)
This fuction calculates MLD
MLD(samps)
MLD(samps)
samps |
a (non-empty) numeric vector of values to calculate MLD over, for example, bin mid points or samples take from a fitted distribution. |
FIXME - equations
returns a numeric value representing the MLD
FIXME - references
MLD(qnorm(seq(0.001,0.999,length.out = 10), mean = 100))
MLD(qnorm(seq(0.001,0.999,length.out = 10), mean = 100))
This function calculates model averaged statistics using AIC and BIC.
modelAvg(fitList, ID, nonCon = TRUE)
modelAvg(fitList, ID, nonCon = TRUE)
fitList |
a (non-empty) list of fitted distributions |
ID |
a (non-empty) string of the ID column name. |
nonCon |
an optional logical, where nonCon == TRUE excludes models failing to converged and nonCon == FALSE includes them. |
Calculates model averaged statistics using BIC and AIC as weights.
Returns a list with aic and bic values, aic and bic averages, and the aic and bic weights.
#not run, internal function
#not run, internal function
This functions filters models, excludes them from model averaging, and sets all parameter values/statistics to NA. The filter determines which models should have undefined moments based on the estimated parameters. These are filtered. FIXME - add in specific criteria for filtering.
paramFilt(paramsList, fitComb)
paramFilt(paramsList, fitComb)
paramsList |
a (non-empty) list containing the parameter values, it should be of length(fitComb). |
fitComb |
a (non-empty) list of fitted models |
Returns fitComb, filtered based on the parameters.
FIXME - references for specific filtering criteria
#not run, internal function
#not run, internal function
This function fits a series of parametric distributions from the GB family to binned data.
run_GB_family(ID,hb,bin_min,bin_max,obs_mean,ID_name,quantiles,modelsToFit,return_params)
run_GB_family(ID,hb,bin_min,bin_max,obs_mean,ID_name,quantiles,modelsToFit,return_params)
ID |
a (non-empty) object containing the group ID for each row. Importantly, ID, bh, bin_min, bin_max, and obs_mean MUST be the same length and be in the SAME order. |
hb |
a (non-empty) object containing the number of observations in each bin. Importantly, ID, bh, bin_min, bin_max, and obs_mean MUST be the same length and be in the SAME order. |
bin_min |
a (non-empty) object containing the lower bound of each bin. Currently, this package cannot handle data with open lower bounds. Importantly, ID, bh, bin_min, bin_max, and obs_mean MUST be the same length and be in the SAME order. |
bin_max |
a (non-empty) object the upper bound of each bin. Currently, this package can only handle the upper-most bin being open ended. Importantly, ID, bh, bin_min, bin_max, and obs_mean MUST be the same length and be in the SAME order. |
obs_mean |
a (non-empty) object containing the mean for each group. Importantly, ID, bh, bin_min, bin_max, and obs_mean MUST be the same length and be in the SAME order. |
ID_name |
a (non-empty) object containing column name for the ID column. |
quantiles |
a (non-empty) vector of quantiles for use in calculating metrics. The default is seq(0.006,0.996, length.out = 1000). |
modelsToFit |
a (non-empty) vector of model names as characters, currently limited to the following distributions GB2, GG, BETA2, DAGUM, SINGMAD, LOGNO, WEI, GA, LOGLOG, PARETO2. |
return_params |
a (non-empty) logical indicating whether to retunr the parameters. The default is TRUE. |
Fits a GAMLSS and estimates a number of metrics, see value.
returns a list with "fit" = fitted model and metrics, "fit.filter" = filtered for undefined moments fitted model and metrics, "best_model" = best model results, "best_model.filter" = filtered for undefined moments best model results.
data(state_bins) use_states <- which(state_bins[,'State'] == 'Texas') TX <- state_bins[use_states,] LNO_WEI_GA <- run_GB_family(ID = TX[,'State'], hb = TX[,'hb'], bin_min = TX[,'bin_min'], bin_max = TX[,'bin_max'], obs_mean = rep(NA, length(use_states)), ID_name = "State", quantiles = seq(0.006,0.996, length.out = 1000), modelsToFit = c('LOGNO','WEI','GA'))
data(state_bins) use_states <- which(state_bins[,'State'] == 'Texas') TX <- state_bins[use_states,] LNO_WEI_GA <- run_GB_family(ID = TX[,'State'], hb = TX[,'hb'], bin_min = TX[,'bin_min'], bin_max = TX[,'bin_max'], obs_mean = rep(NA, length(use_states)), ID_name = "State", quantiles = seq(0.006,0.996, length.out = 1000), modelsToFit = c('LOGNO','WEI','GA'))
This is a list where each entry is a different year.
data(state_bins)
data(state_bins)
#FIXME - add in format
data(school_district_bins) head(school_district_bins[[1]]) names(school_district_bins)
data(school_district_bins) head(school_district_bins[[1]]) names(school_district_bins)
This fuction calculates SDL
SDL(samps)
SDL(samps)
samps |
a (non-empty) numeric vector of values to calculate SDL over, for example, bin mid points or samples take from a fitted distribution. |
FIXME - equations
returns a numeric value representing the SLD
FIXME - references
SDL(qnorm(seq(0.001,0.999,length.out = 10), mean = 100))
SDL(qnorm(seq(0.001,0.999,length.out = 10), mean = 100))
A data set of binned income by US counties. FIXME - more info.
data(state_bins)
data(state_bins)
data(state_bins)
data(state_bins)
This fuction calculates the Theil Index.
theilInd(samps)
theilInd(samps)
samps |
a (non-empty) numeric vector of values to calculate MLD over, for example, bin mid points or samples take from a fitted distribution. |
FIXME - equations
returns a numeric value representing the Theil Index
FIXME - references
theilInd(qnorm(seq(0.001,0.999,length.out = 10), mean = 100))
theilInd(qnorm(seq(0.001,0.999,length.out = 10), mean = 100))