Title: | Fit Statistic for Binary Dependent Variable Models |
---|---|
Description: | Generates a fit plot for diagnosing misspecification in models of binary dependent variables, and calculates the related heatmap fit statistic described in Esarey and Pierce (2012) <DOI:10.1093/pan/mps026>. |
Authors: | Justin Esarey [aut, cre], Andrew Pierce [aut], Jericho Du [aut] |
Maintainer: | Justin Esarey <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.0.4 |
Built: | 2024-12-13 06:40:25 UTC |
Source: | CRAN |
heatmap.fit
Reduces the size of large binary data sets by binning observations according to their predicted probability on [0, 1].
heatmap.compress(y, pred, init.grid)
y |
A vector of observations of the dependent variable (in {0,1}). |
pred |
A vector of predicted Pr(y = 1) corresponding to each element of |
init.grid |
The number of bins on the interval [0, 1] to use for compression of |
A list with the elements:
y.out |
The value of |
pred.out |
The (binned) predicted Pr(y = 1) matching each observation. |
weight.out |
A weight parameter indicating the proportion of observations in the bin; sums to one. |
pred.total.out |
A vector of unique Pr(y = 1) bin values. |
n.out |
The number of observations (non-empty bins) after the data are collapsed. |
retained.obs |
A vector of indices for non-empty candidate bins (for internal use by |
Justin Esarey <[email protected]>
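The binning idea behind heatmap.compress can be illustrated with a minimal base R sketch. This is not the package's implementation: the function name compress_sketch, its n_bins argument, the use of equal-width bins, and the use of bin midpoints as the binned predictions are all assumptions made for illustration. It collapses the data to one row per non-empty (bin, outcome) cell and attaches a weight equal to that cell's share of the observations.

```r
# Hypothetical helper for illustration; NOT the code used by heatmap.compress.
compress_sketch <- function(y, pred, n_bins = 10) {
  stopifnot(length(y) == length(pred),
            all(y %in% c(0, 1)),
            all(pred >= 0 & pred <= 1))
  # Assign each prediction to one of n_bins equal-width bins on [0, 1].
  breaks <- seq(0, 1, length.out = n_bins + 1)
  bin <- cut(pred, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  # Assumption: represent each bin by its midpoint.
  mids <- (breaks[-1] + breaks[-(n_bins + 1)]) / 2
  # Count observations in each (bin, outcome) cell and drop empty cells.
  cells <- as.data.frame(table(bin = bin, y = y), stringsAsFactors = FALSE)
  cells <- cells[cells$Freq > 0, ]
  data.frame(
    y      = as.numeric(cells$y),
    pred   = mids[as.integer(cells$bin)],
    weight = cells$Freq / length(y)   # weights sum to one
  )
}

set.seed(1)
pred <- runif(1000)
y <- as.numeric(runif(1000) < pred)
out <- compress_sketch(y, pred, n_bins = 10)
```

With 10 bins the 1000 observations collapse to at most 20 weighted rows, and the weighted mean of y in the collapsed data equals the mean of y in the original data.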
Generates a fit plot for diagnosing misspecification in models of binary dependent variables, and calculates the related heatmap fit statistic (Esarey and Pierce, 2012).
heatmap.fit(y, pred, calc.boot = TRUE, reps = 1000, span.l = "aicc", color = FALSE, compress.obs = TRUE, init.grid = 2000, ret.obs = FALSE, legend = TRUE)
y |
A vector of observations of the dependent variable (in {0,1}). |
pred |
A vector of model-predicted Pr(y = 1) corresponding to each element of |
calc.boot |
Calculate bootstrap-based p-values (default = |
reps |
Number of bootstrap replicates to generate (default = 1000). |
span.l |
Bandwidth for the nonparametric fit between |
color |
Whether the plot should be in color ( |
compress.obs |
Whether large data sets should be compressed by pre-binning to save computing time (default |
init.grid |
If |
ret.obs |
Return the one-tailed bootstrap p-value for each observation in |
legend |
Print the legend on the heat map plot (the default, |
This function plots the degree to which a binary dependent variable (BDV) model generates predicted probabilities that accurately match the observed empirical probabilities of the BDV, in-sample or out-of-sample. For example, if a model predicts that Pr(y = 1) = k%, about k% of observations with this predicted probability should have y = 1. Loess smoothing (with an automatically selected optimum bandwidth) is used to estimate empirical probabilities in the data set and to overcome sparseness of the data. Systematic deviations are distinguished from sampling variation by bootstrapping the distribution under the null that the model is an accurate predictor, with p-values indicating the one-tailed proportion of bootstrap samples that are less extreme than the observed deviation. The plot shows model-predicted probabilities on the x-axis and smoothed empirical probabilities on the y-axis, with a histogram indicating the location and frequency of observations. The ideal fit is a 45-degree line. The shading of the plotted line indicates the degree to which fit deviations are larger than expected due to sampling variation.
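The logic of this procedure can be sketched in base R. This is a simplified illustration, not the code used by heatmap.fit. It assumes a fixed loess span instead of an automatically selected one, a local-linear fit, and a one-tailed p-value formed by folding the bootstrap percentile of the observed deviation; the helper name smooth_fit is hypothetical.

```r
set.seed(42)
n <- 500
x <- runif(n)
y <- as.numeric(runif(n) < pnorm(2 * x - 1))
mod <- glm(y ~ x, family = binomial(link = "probit"))
pred <- as.numeric(predict(mod, type = "response"))

# Hypothetical helper: smoothed empirical Pr(y = 1) at each predicted value.
# Assumption: fixed span and degree 1, with fitted values clipped to [0, 1].
smooth_fit <- function(y, pred, span = 0.75) {
  fit <- loess(y ~ pred, span = span, degree = 1)
  pmin(pmax(fitted(fit), 0), 1)
}

# Observed deviation of the smoothed empirical probability from the 45-degree line.
obs_dev <- smooth_fit(y, pred) - pred

# Parametric bootstrap under the null that pred is the true Pr(y = 1).
reps <- 200
boot_dev <- replicate(reps, {
  y_sim <- as.numeric(runif(n) < pred)
  smooth_fit(y_sim, pred) - pred
})  # n x reps matrix

# Percentile of the observed deviation in the bootstrap distribution,
# folded so that small values indicate a deviation in either tail (assumption).
pct <- rowMeans(boot_dev <= obs_dev)
p_one <- pmin(pct, 1 - pct)
```

Because the simulated model here is correctly specified, most of the resulting p-values should be well away from zero.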
A summary statistic for fit (the "heatmap statistic") is also reported. This statistic is the proportion of the sample that lies in a region with a one-tailed p-value less than or equal to 10%. Finding more than 20% of the data set in such a region is diagnostic of misspecification in the model.
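The arithmetic of the heatmap statistic can be shown with a short sketch. The function name heatmap_stat_sketch and the example p-values are made up for illustration; only the 10% and 20% thresholds come from the text above. The optional weight argument is an assumption that allows the same calculation on binned data whose weights sum to one.

```r
# Hypothetical helper: share of the sample with one-tailed p-value <= alpha.
heatmap_stat_sketch <- function(p_one,
                                weight = rep(1 / length(p_one), length(p_one)),
                                alpha = 0.10) {
  stopifnot(length(p_one) == length(weight),
            abs(sum(weight) - 1) < 1e-8)
  sum(weight[p_one <= alpha])
}

# Illustrative (made-up) one-tailed p-values for ten observations.
p_example <- c(0.02, 0.08, 0.10, 0.30, 0.45, 0.25, 0.40, 0.12, 0.33, 0.50)
stat <- heatmap_stat_sketch(p_example)   # three of ten values are <= 0.10
flag <- stat > 0.20                      # exceeds the 20% diagnostic threshold
```

Here three of the ten observations have p-values at or below 10%, so the statistic is 0.3 and the 20% threshold is exceeded.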
More details for the technique are given in Esarey and Pierce 2012, "Assessing Fit Quality and Testing for Misspecification in Binary Dependent Variable Models," Political Analysis 20(4): 480-500.
If ret.obs = TRUE, a list with the element:
heatmap.obs.p |
The one-tailed bootstrap p-value corresponding to each observation in |
Code to calculate AICc and GCV written by Michael Friendly (http://tolstoy.newcastle.edu.au/R/help/05/11/15899.html).
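For readers unfamiliar with AICc-based bandwidth selection, the following is a minimal sketch of the idea. It is not the code referenced above. It assumes the corrected AIC of Hurvich, Simonoff and Tsai (1998), log(sigma^2) + 1 + 2(tr(H) + 1) / (n - tr(H) - 2), with sigma^2 taken as the residual sum of squares divided by n, and it minimizes that criterion over the loess span with optimize(). The names loess_aicc_sketch and aicc_for_span, the local-linear fit, and the search interval are all hypothetical choices.

```r
# Hypothetical helper: AICc of a fitted loess object (assumed formula, see lead-in).
loess_aicc_sketch <- function(fit) {
  stopifnot(inherits(fit, "loess"))
  n <- fit$n
  trace_hat <- fit$trace.hat            # trace of the smoother ("hat") matrix
  sigma2 <- sum(residuals(fit)^2) / n   # assumption: RSS / n
  log(sigma2) + 1 + 2 * (trace_hat + 1) / (n - trace_hat - 2)
}

set.seed(7)
n <- 400
pred <- runif(n)
y <- as.numeric(runif(n) < pred)

# AICc as a function of the span, for a local-linear loess fit of y on pred.
aicc_for_span <- function(s) {
  loess_aicc_sketch(loess(y ~ pred, span = s, degree = 1))
}

# Search for the span that minimizes AICc on an illustrative interval.
best <- optimize(aicc_for_span, interval = c(0.2, 0.95))
best$minimum   # selected span
```

The selected span can then be passed as a numeric bandwidth to a smoother; whether it matches what span.l = "aicc" selects in heatmap.fit is not guaranteed by this sketch.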
Justin Esarey <[email protected]>
Andrew Pierce <[email protected]>
Jericho Du <[email protected]>
Esarey, Justin and Andrew Pierce (2012). "Assessing Fit Quality and Testing for Misspecification in Binary Dependent Variable Models." Political Analysis 20(4): 480-500. DOI:10.1093/pan/mps026.
## Not run:
## a correctly specified model
###############################
set.seed(123456)
x <- runif(20000)
y <- as.numeric( runif(20000) < pnorm(2*x - 1) )
mod <- glm( y ~ x, family=binomial(link="probit"))
pred <- predict(mod, type="response")
heatmap.fit(y, pred, reps=1000)

## out-of-sample prediction w/o bootstrap p-values
set.seed(654321)
x <- runif(1000)
y <- as.numeric( runif(1000) < pnorm(2*x - 1) )
pred <- predict(mod, type="response", newdata=data.frame(x))
heatmap.fit(y, pred, calc.boot=FALSE)

## a misspecified model
########################
set.seed(13579)
x <- runif(20000)
y <- as.numeric( runif(20000) < pnorm(sin(10*x)) )
mod <- glm( y ~ x, family=binomial(link="probit"))
pred <- predict(mod, type="response")
heatmap.fit(y, pred, reps=1000)

## Comparison with and without data compression
system.time(heatmap.fit(y, pred, reps=100))
system.time(heatmap.fit(y, pred, reps=100, compress.obs=FALSE))
## End(Not run)