Package 'binsreg' reference manual

Title:	Binscatter Estimation and Inference
Description:	Provides tools for statistical analysis using the binscatter methods developed by Cattaneo, Crump, Farrell and Feng (2024a) <doi:10.48550/arXiv.1902.09608>, Cattaneo, Crump, Farrell and Feng (2024b) <https://nppackages.github.io/references/Cattaneo-Crump-Farrell-Feng_2024_NonlinearBinscatter.pdf> and Cattaneo, Crump, Farrell and Feng (2024c) <doi:10.48550/arXiv.1902.09615>. Binscatter provides a flexible way of describing the relationship between two variables based on partitioning/binning of the independent variable of interest. binsreg(), binsqreg() and binsglm() implement binscatter least squares regression, quantile regression and generalized linear regression respectively, with particular focus on constructing binned scatter plots. They also implement robust (pointwise and uniform) inference of regression functions and derivatives thereof. binstest() implements hypothesis testing procedures for parametric functional forms of and nonparametric shape restrictions on the regression function. binspwc() implements hypothesis testing procedures for pairwise group comparison of binscatter estimators. binsregselect() implements data-driven procedures for selecting the number of bins for binscatter estimation. All the commands allow for covariate adjustment, smoothness restrictions and clustering.
Authors:	Matias D. Cattaneo, Richard K. Crump, Max H. Farrell, Yingjie Feng
Maintainer:	Yingjie Feng
License:	GPL-2
Version:	1.1
Built:	2025-01-20 06:47:28 UTC
Source:	CRAN

Binsreg Package Document

Description

Binscatter provides a flexible, yet parsimonious way of visualizing and summarizing large data sets and has been a popular methodology in applied microeconomics and other social sciences. The binsreg package provides tools for statistical analysis using the binscatter methods developed in Cattaneo, Crump, Farrell and Feng (2024a) and Cattaneo, Crump, Farrell and Feng (2024b). binsreg implements binscatter least squares regression with robust inference and plots, including curve estimation, pointwise confidence intervals and uniform confidence band. binsqreg implements binscatter quantile regression with robust inference and plots, including curve estimation, pointwise confidence intervals and uniform confidence band. binsglm implements binscatter generalized linear regression with robust inference and plots, including curve estimation, pointwise confidence intervals and uniform confidence band. binstest implements binscatter-based hypothesis testing procedures for parametric specifications of and shape restrictions on the unknown function of interest. binspwc implements hypothesis testing procedures for pairwise group comparison of binscatter estimators and plots confidence bands for the difference in binscatter parameters between each pair of groups. binsregselect implements data-driven number of bins selectors for binscatter implementation using either quantile-spaced or evenly-spaced binning/partitioning. All the commands allow for covariate adjustment, smoothness restrictions, and clustering, among other features.

The companion software article, Cattaneo, Crump, Farrell and Feng (2024c), provides further implementation details and empirical illustration. For related Stata, R and Python packages useful for nonparametric data analysis and statistical inference, visit https://nppackages.github.io/.

Author(s)

Matias D. Cattaneo, Princeton University, Princeton, NJ.

Richard K. Crump, Federal Reserve Bank of New York, New York, NY.

Max H. Farrell, UC Santa Barbara, Santa Barbara, CA.

Yingjie Feng (maintainer), Tsinghua University, Beijing, China.

References

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024a: On Binscatter. American Economic Review 114(5): 1488-1514.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024b: Nonlinear Binscatter Methods. Working Paper.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024c: Binscatter Regressions. Working Paper.

Data-Driven Binscatter Generalized Linear Regression with Robust Inference Procedures and Plots

Description

binsglm implements binscatter generalized linear regression with robust inference procedures and plots, following the results in Cattaneo, Crump, Farrell and Feng (2024a) and Cattaneo, Crump, Farrell and Feng (2024b). Binscatter provides a flexible way to describe the relationship between two variables, after possibly adjusting for other covariates, based on partitioning/binning of the independent variable of interest. The main purpose of this function is to generate binned scatter plots with curve estimation with robust pointwise confidence intervals and uniform confidence band. If the binning scheme is not set by the user, the companion function binsregselect is used to implement binscatter in a data-driven way. Hypothesis testing about the function of interest can be conducted via the companion function binstest.
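
The idea of partitioning `x` and fitting a generalized linear model within the bins can be illustrated with base R alone. The following is a minimal sketch, not the package's implementation: it assumes a hand-picked number of bins `J`, quantile-spaced bins, a logistic link and a piecewise constant fit (degree 0, no smoothness constraints). All variable names are illustrative.

```r
set.seed(42)
n <- 2000
x <- runif(n)
d <- rbinom(n, 1, plogis(-1 + 2 * x))   # binary outcome, P(d = 1 | x) increasing in x

J <- 10                                  # number of bins, fixed by hand in this sketch
knots <- quantile(x, probs = seq(0, 1, length.out = J + 1))
bin <- cut(x, breaks = knots, include.lowest = TRUE)   # quantile-spaced bins

# Logistic regression on bin indicators only (piecewise constant in x)
fit <- glm(d ~ 0 + bin, family = binomial())
p_hat <- unname(plogis(coef(fit)))       # fitted P(d = 1) within each bin

# With one indicator per bin, the fitted probability equals the within-bin mean of d
bin_means <- as.vector(tapply(d, bin, mean))

# "Dots" of a canonical binscatter: fitted value plotted at the within-bin mean of x
dots <- data.frame(x_mean = as.vector(tapply(x, bin, mean)), fit = p_hat)
print(dots)
```

In this special case the generalized linear fit reduces to within-bin sample means; higher polynomial degrees, smoothness constraints, covariate adjustment and inference are what `binsglm` adds on top.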

Usage

binsglm(y, x, w = NULL, data = NULL, at = NULL, family = gaussian(),
  deriv = 0, nolink = F, dots = NULL, dotsgrid = 0, dotsgridmean = T,
  line = NULL, linegrid = 20, ci = NULL, cigrid = 0, cigridmean = T,
  cb = NULL, cbgrid = 20, polyreg = NULL, polyreggrid = 20,
  polyregcigrid = 0, by = NULL, bycolors = NULL, bysymbols = NULL,
  bylpatterns = NULL, legendTitle = NULL, legendoff = F, nbins = NULL,
  binspos = "qs", binsmethod = "dpi", nbinsrot = NULL, pselect = NULL,
  sselect = NULL, samebinsby = F, randcut = NULL, nsims = 500,
  simsgrid = 20, simsseed = NULL, vce = "HC1", cluster = NULL,
  asyvar = F, level = 95, noplot = F, dfcheck = c(20, 30),
  masspoints = "on", weights = NULL, subset = NULL, plotxrange = NULL,
  plotyrange = NULL, ...)

Arguments

`y`	outcome variable. A vector.
`x`	independent variable of interest. A vector.
`w`	control variables. A matrix, a vector or a `formula`.
`data`	an optional data frame containing variables in the model.
`at`	value of `w` at which the estimated function is evaluated. The default is `at="mean"`, which corresponds to the mean of `w`. Other options are: `at="median"` for the median of `w`, `at="zero"` for a vector of zeros. `at` can also be a vector of the same length as the number of columns of `w` (if `w` is a matrix) or a data frame containing the same variables as specified in `w` (when `data` is specified). Note that when `at="mean"` or `at="median"`, all factor variables (if specified) are excluded from the evaluation (set as zero).
`family`	a description of the error distribution and link function to be used in the generalized linear model. (See `family` for details of family functions.)
`deriv`	derivative order of the regression function for estimation, testing and plotting. The default is `deriv=0`, which corresponds to the function itself. If `nolink=FALSE`, `deriv` cannot be greater than 1.
`nolink`	if true, the function within the inverse link function is reported instead of the conditional mean function for the outcome.
`dots`	a vector or a logical value. If `dots=c(p,s)`, a piecewise polynomial of degree `p` with `s` smoothness constraints is used for point estimation and plotting as "dots". The default is `dots=c(0,0)`, which corresponds to piecewise constant (canonical binscatter). If `dots=T`, the default `dots=c(0,0)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`). If `dots=F` is specified, the dots are not included in the plot.
`dotsgrid`	number of dots within each bin to be plotted. Given the choice, these dots are point estimates evaluated over an evenly-spaced grid within each bin. The default is `dotsgrid=0`, and only the point estimates at the mean of `x` within each bin are presented.
`dotsgridmean`	If true, the dots corresponding to the point estimates evaluated at the mean of `x` within each bin are presented. By default, they are presented, i.e., `dotsgridmean=T`.
`line`	a vector or a logical value. If `line=c(p,s)`, a piecewise polynomial of degree `p` with `s` smoothness constraints is used for plotting as a "line". If `line=T` is specified, `line=c(0,0)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`). If `line=F` or `line=NULL` is specified, the line is not included in the plot. The default is `line=NULL`.
`linegrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the `line=c(p,s)` option. The default is `linegrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for fitting/plotting the line.
`ci`	a vector or a logical value. If `ci=c(p,s)` a piecewise polynomial of degree `p` with `s` smoothness constraints is used for constructing confidence intervals. If `ci=T` is specified, `ci=c(1,1)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`). If `ci=F` or `ci=NULL` is specified, the confidence intervals are not included in the plot. The default is `ci=NULL`.
`cigrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the `ci=c(p,s)` option. The default is `cigrid=0`, in which case only the confidence intervals at the mean of `x` within each bin are presented (see `cigridmean`).
`cigridmean`	If true, the confidence intervals corresponding to the point estimates evaluated at the mean of `x` within each bin are presented. The default is `cigridmean=T`.
`cb`	a vector or a logical value. If `cb=c(p,s)`, a piecewise polynomial of degree `p` with `s` smoothness constraints is used for constructing the confidence band. If the option `cb=T` is specified, `cb=c(1,1)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`). If `cb=F` or `cb=NULL` is specified, the confidence band is not included in the plot. The default is `cb=NULL`.
`cbgrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the `cb=c(p,s)` option. The default is `cbgrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for confidence band construction.
`polyreg`	degree of a global polynomial regression model for plotting. By default, this fit is not included in the plot unless explicitly specified. Recommended specification is `polyreg=3`, which adds a cubic (global) polynomial fit of the regression function of interest to the binned scatter plot.
`polyreggrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the `polyreg=p` option. The default is `polyreggrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for plotting the global polynomial fit.
`polyregcigrid`	number of evaluation points of an evenly-spaced grid within each bin used for constructing confidence intervals based on polynomial regression set by the `polyreg=p` option. The default is `polyregcigrid=0`, which corresponds to not plotting confidence intervals for the global polynomial regression approximation.
`by`	a vector containing the group indicator for subgroup analysis; both numeric and string variables are supported. When `by` is specified, `binsglm` implements estimation and inference for each subgroup separately, but produces a common binned scatter plot. By default, the binning structure is selected for each subgroup separately, but see the option `samebinsby` below for imposing a common binning structure across subgroups.
`bycolors`	an ordered list of colors for plotting each subgroup series defined by the option `by`.
`bysymbols`	an ordered list of symbols for plotting each subgroup series defined by the option `by`.
`bylpatterns`	an ordered list of line patterns for plotting each subgroup series defined by the option `by`.
`legendTitle`	String, title of legend.
`legendoff`	If true, no legend is added.
`nbins`	number of bins for partitioning/binning of `x`. If `nbins=T` or `nbins=NULL` (default) is specified, the number of bins is selected via the companion command `binsregselect` in a data-driven, optimal way whenever possible. If a vector with more than one number is specified, the number of bins is selected within this vector via the companion command `binsregselect`.
`binspos`	position of binning knots. The default is `binspos="qs"`, which corresponds to quantile-spaced binning (canonical binscatter). The other options are `"es"` for evenly-spaced binning, or a vector for manual specification of the positions of inner knots (which must be within the range of `x`).
`binsmethod`	method for data-driven selection of the number of bins. The default is `binsmethod="dpi"`, which corresponds to the IMSE-optimal direct plug-in rule. The other option is: `"rot"` for rule of thumb implementation.
`nbinsrot`	initial number of bins value used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.
`pselect`	vector of numbers within which the degree of polynomial `p` for point estimation is selected. Piecewise polynomials of the selected optimal degree `p` are used to construct dots or line if `dots=T` or `line=T` is specified, whereas piecewise polynomials of degree `p+1` are used to construct confidence intervals or confidence band if `ci=T` or `cb=T` is specified. Note: To implement the degree or smoothness selection, in addition to `pselect` or `sselect`, `nbins=#` must be specified.
`sselect`	vector of numbers within which the number of smoothness constraints `s` for point estimation is selected. Piecewise polynomials with the selected optimal `s` smoothness constraints are used to construct dots or line if `dots=T` or `line=T` is specified, whereas piecewise polynomials with `s+1` constraints are used to construct confidence intervals or confidence band if `ci=T` or `cb=T` is specified. If not specified, for each value `p` supplied in the option `pselect`, only the piecewise polynomial with the maximum smoothness is considered, i.e., `s=p`.
`samebinsby`	if true, a common partitioning/binning structure across all subgroups specified by the option `by` is forced. The knots positions are selected according to the option `binspos` and using the full sample. If `nbins` is not specified, then the number of bins is selected via the companion command `binsregselect` and using the full sample.
`randcut`	upper bound on a uniformly distributed variable used to draw a subsample for bins/degree/smoothness selection. Observations for which `runif()<=#` are used. # must be between 0 and 1. By default, `max(5000, 0.01n)` observations are used if the sample size `n>5000`.
`nsims`	number of random draws for constructing confidence bands. The default is `nsims=500`, which corresponds to 500 draws from a standard Gaussian random vector of size `[(p+1)J - (J-1)s]`. Setting at least `nsims=2000` is recommended to obtain the final results.
`simsgrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum operation needed to construct confidence bands. The default is `simsgrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum operator. Setting at least `simsgrid=50` is recommended to obtain the final results.
`simsseed`	seed for simulation.
`vce`	procedure to compute the variance-covariance matrix estimator. Options are: `"const"`, homoskedastic variance estimator; `"HC0"`, heteroskedasticity-robust plug-in residuals variance estimator without weights; `"HC1"` (default), heteroskedasticity-robust plug-in residuals variance estimator with hc1 weights; `"HC2"`, heteroskedasticity-robust plug-in residuals variance estimator with hc2 weights; `"HC3"`, heteroskedasticity-robust plug-in residuals variance estimator with hc3 weights.
`cluster`	cluster ID. Used to compute cluster-robust standard errors.
`asyvar`	if true, the standard error of the nonparametric component is computed and the uncertainty related to control variables is omitted. Default is `asyvar=FALSE`, that is, the uncertainty related to control variables is taken into account.
`level`	nominal confidence level for confidence interval and confidence band estimation. Default is `level=95`.
`noplot`	if true, no plot is produced.
`dfcheck`	adjustments for minimum effective sample size checks, which take into account number of unique values of `x` (i.e., number of mass points), number of clusters, and degrees of freedom of the different statistical models considered. The default is `dfcheck=c(20, 30)`. See Cattaneo, Crump, Farrell and Feng (2024c) for more details.
`masspoints`	how mass points in `x` are handled. Available options: `"on"` (default), all mass point and degrees of freedom checks are implemented; `"noadjust"`, mass point checks and the corresponding effective sample size adjustments are omitted; `"nolocalcheck"`, within-bin mass point and degrees of freedom checks are omitted; `"off"`, `"noadjust"` and `"nolocalcheck"` are set simultaneously; `"veryfew"`, forces the function to proceed as if `x` has only a few mass points (i.e., distinct values), that is, as if the mass point and degrees of freedom checks had failed.
`weights`	an optional vector of weights to be used in the fitting process. Should be `NULL` or a numeric vector. For more details, see `lm`.
`subset`	optional rule specifying a subset of observations to be used.
`plotxrange`	a vector. `plotxrange=c(min, max)` specifies a range of the x-axis for binscatter plot. Observations outside the range are dropped in the plot.
`plotyrange`	a vector. `plotyrange=c(min, max)` specifies a range of the y-axis for binscatter plot. Observations outside the range are dropped in the plot.
`...`	optional arguments used by `glm`.
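
Two of the options above lend themselves to a small base R illustration: the two automatic binning schemes of `binspos`, and the dimension `(p+1)J - (J-1)s` of a piecewise polynomial of degree `p` on `J` bins with `s` smoothness constraints (the size of the Gaussian vector mentioned under `nsims`). This is a sketch only: `bin_knots` and `n_coef` are hypothetical helper names, not package functions, and the package's exact quantile rule for knot placement may differ from the default of `quantile()` used here.

```r
# Knot positions (including both endpoints) for J bins under the two automatic schemes
bin_knots <- function(x, J, binspos = c("qs", "es")) {
  binspos <- match.arg(binspos)
  if (binspos == "qs") {
    unname(quantile(x, probs = seq(0, 1, length.out = J + 1)))  # equal share of observations
  } else {
    seq(min(x), max(x), length.out = J + 1)                      # equal-width bins
  }
}

# Number of coefficients: (p + 1) per bin, minus s constraints at each of the J - 1 inner knots
n_coef <- function(p, s, J) (p + 1) * J - (J - 1) * s

set.seed(7)
x <- rexp(1000)                       # skewed regressor, so the two schemes differ visibly
qs <- bin_knots(x, J = 5, binspos = "qs")
es <- bin_knots(x, J = 5, binspos = "es")

counts_qs <- table(cut(x, qs, include.lowest = TRUE))   # balanced bin counts
counts_es <- table(cut(x, es, include.lowest = TRUE))   # counts concentrate in the lower bins
print(counts_qs)
print(counts_es)

print(n_coef(p = 0, s = 0, J = 10))   # one constant per bin (canonical binscatter)
print(n_coef(p = 1, s = 1, J = 10))   # continuous piecewise linear fit
```

Quantile-spaced binning keeps the number of observations per bin roughly constant, which is why it is the default; evenly-spaced binning can leave sparse bins in the tails of a skewed `x`.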

Value

`bins_plot`	A `ggplot` object for binscatter plot.
`data.plot`	A list containing data for plotting. Each item is a sublist of data frames for each group. Each sublist may contain the following data frames: `data.dots`, data for dots, containing `x` (evaluation points), `bin` (indicator of bins), `isknot` (indicator of inner knots), `mid` (midpoint of each bin) and `fit` (fitted values); `data.line`, data for the line, with the same columns as `data.dots`; `data.ci`, data for confidence intervals, containing `x`, `bin`, `isknot`, `mid`, and `ci.l` and `ci.r` (left and right boundaries of each confidence interval); `data.cb`, data for the confidence band, containing `x`, `bin`, `isknot`, `mid`, and `cb.l` and `cb.r` (left and right boundaries of the confidence band); `data.poly`, data for the polynomial regression, with the same columns as `data.dots`; `data.polyci`, data for confidence intervals based on polynomial regression, containing `x`, `bin`, `isknot`, `mid`, and `polyci.l` and `polyci.r` (left and right boundaries of each confidence interval); `data.bin`, data for the binning structure, containing `bin.id` (ID of each bin) and `left.endpoint` and `right.endpoint` (left and right endpoints of each bin).
`imse.var.rot`	Variance constant in IMSE, ROT selection.
`imse.bsq.rot`	Bias constant in IMSE, ROT selection.
`imse.var.dpi`	Variance constant in IMSE, DPI selection.
`imse.bsq.dpi`	Bias constant in IMSE, DPI selection.
`cval.by`	A vector of critical values for constructing confidence band for each group.
`opt`	A list containing options passed to the function, as well as `N.by` (total sample size for each group), `Ndist.by` (number of distinct values in `x` for each group), `Nclust.by` (number of clusters for each group), and `nbins.by` (number of bins for each group), and `byvals` (number of distinct values in `by`). The degree and smoothness of polynomials for dots, line, confidence intervals and confidence band for each group are saved in `dots`, `line`, `ci`, and `cb`.
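
The returned components can be inspected directly, for example to rebuild a plot by hand. The sketch below assumes the `binsreg` package is installed and relies only on the component names documented above; the simulated data and the object name `est` are illustrative.

```r
if (requireNamespace("binsreg", quietly = TRUE)) {
  library(binsreg)
  set.seed(123)
  x <- runif(500)
  d <- rbinom(500, 1, plogis(-1 + 2 * x))

  # Fix the number of bins so that no data-driven selection is run; suppress the plot
  est <- binsglm(d, x, family = binomial(), nbins = 10, noplot = TRUE)

  dots <- est$data.plot[[1]]$data.dots      # first (here: only) group
  print(head(dots[, c("x", "bin", "fit")])) # evaluation points, bin index, fitted values
  print(est$opt$nbins.by)                   # number of bins used for each group
}
```

Indexing `data.plot` by position avoids depending on how the package labels the group when `by` is not specified.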

Author(s)

Matias D. Cattaneo, Princeton University, Princeton, NJ.

Richard K. Crump, Federal Reserve Bank of New York, New York, NY.

Max H. Farrell, UC Santa Barbara, Santa Barbara, CA.

Yingjie Feng (maintainer), Tsinghua University, Beijing, China.

References

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024a: On Binscatter. American Economic Review 114(5): 1488-1514.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024b: Nonlinear Binscatter Methods. Working Paper.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024c: Binscatter Regressions. Working Paper.

Examples

library(binsreg)
## Simulate a binary outcome whose success probability equals x
x <- runif(500)
d <- 1*(runif(500) <= x)
## Binned scatterplot
binsglm(d, x, family = binomial())
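
A further sketch, not part of the package's shipped examples, combines several of the documented options: a fixed number of bins, a continuous piecewise linear line, pointwise confidence intervals and a uniform confidence band, with the larger `nsims` and `simsgrid` values that the argument descriptions recommend for final results. It assumes `binsreg` is installed; the data-generating process is made up for illustration.

```r
if (requireNamespace("binsreg", quietly = TRUE)) {
  library(binsreg)
  set.seed(2024)
  n <- 1000
  x <- runif(n)
  d <- rbinom(n, 1, plogis(-1 + 2 * x))

  est <- binsglm(d, x, family = binomial(),
                 nbins = 10,          # number of bins fixed by the user
                 line = c(1, 1),      # degree 1, 1 smoothness constraint
                 ci = c(1, 1),        # pointwise confidence intervals
                 cb = c(1, 1),        # uniform confidence band
                 nsims = 2000, simsgrid = 50, simsseed = 123,
                 noplot = TRUE)

  print(est$cval.by)                  # critical value used for the confidence band
}
```

Setting `simsseed` makes the simulated critical value reproducible across runs.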

Data-Driven Pairwise Group Comparison using Binscatter Methods

Description

binspwc implements hypothesis testing procedures for pairwise group comparison of binscatter estimators and plots confidence bands for the difference in binscatter parameters between each pair of groups, following the results in Cattaneo, Crump, Farrell and Feng (2024a) and Cattaneo, Crump, Farrell and Feng (2024b). If the binning scheme is not set by the user, the companion function binsregselect is used to implement binscatter in a data-driven way. Binned scatter plots based on different methods can be constructed using the companion functions binsreg, binsqreg or binsglm. Hypothesis testing for parametric functional forms of and shape restrictions on the regression function of interest can be conducted via the companion function binstest.
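
The logic of a sup-norm pairwise comparison can be sketched in base R for the simplest possible case. This is a simplified illustration under stated assumptions, not the procedure implemented by `binspwc`: it uses piecewise constant fits (whereas the default for testing is `pwc=c(1,1)`), a common hand-picked binning, no covariates, and a critical value simulated from independent standard normal draws, which is appropriate here only because estimates from different bins and groups are independent. All names are illustrative.

```r
set.seed(1)
n <- 1000
x <- runif(n)
g <- sample(c("A", "B"), n, replace = TRUE)              # two groups
y <- sin(2 * pi * x) + 0.5 * (g == "B") * x + rnorm(n)   # group B drifts away from A as x grows

J <- 10
knots <- quantile(x, probs = seq(0, 1, length.out = J + 1))  # common bins from the full sample
bin <- cut(x, breaks = knots, include.lowest = TRUE)

# Within-bin mean and squared standard error of the mean, by group
cell_mean <- tapply(y, list(bin, g), mean)
cell_se2  <- tapply(y, list(bin, g), function(v) var(v) / length(v))

# t-statistic for mu_A(x) - mu_B(x) in each bin, and its sup-norm over bins
tstat <- (cell_mean[, "A"] - cell_mean[, "B"]) / sqrt(cell_se2[, "A"] + cell_se2[, "B"])
sup_t <- max(abs(tstat))

# Simulated critical value: 95% quantile of the maximum of J independent |N(0,1)| draws
nsims <- 2000
sim_max <- replicate(nsims, max(abs(rnorm(J))))
cval <- unname(quantile(sim_max, 0.95))

reject <- sup_t > cval    # two-sided test of H0: mu_A(x) = mu_B(x) for all x
print(c(sup_t = sup_t, cval = cval))
```

The simulated critical value exceeds the pointwise normal quantile because it controls the maximum deviation over all bins at once, which is the same reason a uniform confidence band is wider than pointwise confidence intervals.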

Usage

binspwc(y, x, w = NULL, data = NULL, estmethod = "reg",
  family = gaussian(), quantile = NULL, deriv = 0, at = NULL,
  nolink = F, by = NULL, pwc = NULL, testtype = "two-sided",
  lp = Inf, bins = NULL, bynbins = NULL, binspos = "qs",
  pselect = NULL, sselect = NULL, binsmethod = "dpi", nbinsrot = NULL,
  samebinsby = FALSE, randcut = NULL, nsims = 500, simsgrid = 20,
  simsseed = NULL, vce = NULL, cluster = NULL, asyvar = F,
  dfcheck = c(20, 30), masspoints = "on", weights = NULL,
  subset = NULL, numdist = NULL, numclust = NULL, estmethodopt = NULL,
  plot = FALSE, dotsngrid = 0, plotxrange = NULL, plotyrange = NULL,
  colors = NULL, symbols = NULL, level = 95, ...)

Arguments

`y`	outcome variable. A vector.
`x`	independent variable of interest. A vector.
`w`	control variables. A matrix, a vector or a `formula`.
`data`	an optional data frame containing variables used in the model.
`estmethod`	estimation method. The default is `estmethod="reg"` for tests based on binscatter least squares regression. Other options are `"qreg"` for quantile regression and `"glm"` for generalized linear regression. If `estmethod="glm"`, the option `family` must be specified.
`family`	a description of the error distribution and link function to be used in the generalized linear model when `estmethod="glm"`. (See `family` for details of family functions.)
`quantile`	the quantile to be estimated. A number strictly between 0 and 1.
`deriv`	derivative order of the regression function for estimation, testing and plotting. The default is `deriv=0`, which corresponds to the function itself.
`at`	value of `w` at which the estimated function is evaluated. The default is `at="mean"`, which corresponds to the mean of `w`. Other options are: `at="median"` for the median of `w`, `at="zero"` for a vector of zeros. `at` can also be a vector of the same length as the number of columns of `w` (if `w` is a matrix) or a data frame containing the same variables as specified in `w` (when `data` is specified). Note that when `at="mean"` or `at="median"`, all factor variables (if specified) are excluded from the evaluation (set as zero).
`nolink`	if true, the function within the inverse link function is reported instead of the conditional mean function for the outcome.
`by`	a vector containing the group indicator for subgroup analysis; both numeric and string variables are supported. When `by` is specified, `binsreg` implements estimation and inference for each subgroup separately, but produces a common binned scatter plot. By default, the binning structure is selected for each subgroup separately, but see the option `samebinsby` below for imposing a common binning structure across subgroups.
`pwc`	a vector or a logical value. If `pwc=c(p,s)`, a piecewise polynomial of degree `p` with `s` smoothness constraints is used for testing the difference between groups. If `pwc=T` or `pwc=NULL` (default) is specified, `pwc=c(1,1)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`).
`testtype`	type of pairwise comparison test. The default is `testtype="two-sided"`, which corresponds to a two-sided test of the form `H0: mu_1(x)=mu_2(x)`. Other options are: `testtype="left"` for the one-sided test form `H0: mu_1(x)<=mu_2(x)` and `testtype="right"` for the one-sided test of the form `H0: mu_1(x)>=mu_2(x)`.
`lp`	an Lp metric used for pairwise comparison tests. The default is `lp=Inf`, which corresponds to the sup-norm of the t-statistic. Other options are `lp=q` for a positive number `q>=1`. Note that `lp=Inf` ("sup-norm") has to be used for one-sided tests (`testtype="left"` or `testtype="right"`).
`bins`	A vector. If `bins=c(p,s)`, it sets the piecewise polynomial of degree `p` with `s` smoothness constraints for data-driven (IMSE-optimal) selection of the partitioning/binning scheme. The default is `bins=c(0,0)`, which corresponds to the piecewise constant.
`bynbins`	a vector of the number of bins for partitioning/binning of `x`, which is applied to the binscatter estimation for each group. If a single number is specified, it is applied to the estimation for all groups. If `bynbins=T` or `bynbins=NULL` (default), the number of bins is selected via the companion function `binsregselect` in a data-driven way whenever possible. Note: If a vector with more than one number is supplied, it is understood as the number of bins applied to binscatter estimation for each subgroup rather than the range for selecting the number of bins.
`binspos`	position of binning knots. The default is `binspos="qs"`, which corresponds to quantile-spaced binning (canonical binscatter). The other options are `"es"` for evenly-spaced binning, or a vector for manual specification of the positions of inner knots (which must be within the range of `x`).
`pselect`	vector of numbers within which the degree of polynomial `p` for point estimation is selected. If the selected optimal degree is `p`, then piecewise polynomials of degree `p+1` are used to conduct pairwise group comparison. Note: To implement the degree or smoothness selection, in addition to `pselect` or `sselect`, `bynbins=#` must be specified.
`sselect`	vector of numbers within which the number of smoothness constraints `s` for point estimation is selected. If the selected optimal smoothness is `s`, then piecewise polynomials with `s+1` smoothness constraints are used to conduct pairwise group comparison. If not specified, for each value `p` supplied in the option `pselect`, only the piecewise polynomial with the maximum smoothness is considered, i.e., `s=p`.
`binsmethod`	method for data-driven selection of the number of bins. The default is `binsmethod="dpi"`, which corresponds to the IMSE-optimal direct plug-in rule. The other option is: `"rot"` for rule of thumb implementation.
`nbinsrot`	initial number of bins value used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.
`samebinsby`	if true, a common partitioning/binning structure across all subgroups specified by the option `by` is forced. The knots positions are selected according to the option `binspos` and using the full sample. If `bynbins` is not specified, then the number of bins is selected via the companion command `binsregselect` and using the full sample.
`randcut`	upper bound on a uniformly distributed variable used to draw a subsample for bins/degree/smoothness selection. Observations for which `runif()<=#` are used. # must be between 0 and 1. By default, `max(5000, 0.01n)` observations are used if the sample size `n>5000`.
`nsims`	number of random draws for hypothesis testing. The default is `nsims=500`, which corresponds to 500 draws from a standard Gaussian random vector of size `[(p+1)J - (J-1)s]`. Setting at least `nsims=2000` is recommended to obtain the final results.
`simsgrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum (infimum or Lp metric) operation needed to construct hypothesis testing procedures. The default is `simsgrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum (infimum or Lp metric) operator. Setting at least `simsgrid=50` is recommended to obtain the final results.
`simsseed`	seed for simulation.
`vce`	procedure to compute the variance-covariance matrix estimator. For least squares regression and generalized linear regression, the allowed options are the same as those for `binsreg` or `binsglm`. For quantile regression, the allowed options are the same as those for `binsqreg`.
`cluster`	cluster ID. Used to compute cluster-robust standard errors.
`asyvar`	if true, the standard error of the nonparametric component is computed and the uncertainty related to control variables is omitted. Default is `asyvar=FALSE`, that is, the uncertainty related to control variables is taken into account.
`dfcheck`	adjustments for minimum effective sample size checks, which take into account number of unique values of `x` (i.e., number of mass points), number of clusters, and degrees of freedom of the different stat models considered. The default is `dfcheck=c(20, 30)`. See Cattaneo, Crump, Farrell and Feng (2024c) for more details.
`masspoints`	how mass points in `x` are handled. Available options: `"on"`: all mass point and degrees of freedom checks are implemented (default). `"noadjust"`: mass point checks and the corresponding effective sample size adjustments are omitted. `"nolocalcheck"`: within-bin mass point and degrees of freedom checks are omitted. `"off"`: `"noadjust"` and `"nolocalcheck"` are set simultaneously. `"veryfew"`: forces the function to proceed as if `x` has only a few mass points (i.e., distinct values). In other words, it forces the function to proceed as if the mass point and degrees of freedom checks had failed.
`weights`	an optional vector of weights to be used in the fitting process. Should be `NULL` or a numeric vector. For more details, see `lm`.
`subset`	optional rule specifying a subset of observations to be used.
`numdist`	number of distinct values for selection. Used to speed up computation.
`numclust`	number of clusters for selection. Used to speed up computation.
`estmethodopt`	a list of optional arguments used by `rq` (for quantile regression) or `glm` (for fitting generalized linear models).
`plot`	if true, the confidence bands for all pairwise group comparisons (the difference between each pair of groups) are plotted. The degree and smoothness of polynomials used to construct the bands are the same as those specified for testing. The default is `plot=F`, i.e., no plot is generated.
`dotsngrid`	number of dots to be added to the plot for confidence bands. Given the choice, these dots are point estimates of the difference between groups evaluated over an evenly-spaced grid within the common support of all groups. The default is `dotsngrid=0`, i.e., no point estimates are added. Whenever possible, the degree and smoothness of the polynomial for these point estimates are the same as those for selecting the number of bins; otherwise, the degree and smoothness specified for testing are used.
`plotxrange`	a vector. `plotxrange=c(min, max)` specifies a range of the x-axis for plotting. Observations outside the range are dropped in the plot.
`plotyrange`	a vector. `plotyrange=c(min, max)` specifies a range of the y-axis for plotting. Observations outside the range are dropped in the plot.
`colors`	an ordered list of colors for plotting the difference between each pair of groups.
`symbols`	an ordered list of symbols for plotting the difference between each pair of groups.
`level`	nominal confidence level for confidence band estimation. Default is `level=95`.
`...`	optional arguments to control bootstrapping if `estmethod="qreg"` and `vce="boot"`. See `boot.rq`.
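
As a worked illustration of the `nsims` entry above: the size of each simulated Gaussian vector, `(p+1)J - (J-1)s`, can be read as `p+1` polynomial coefficients in each of the `J` bins, minus `s` smoothness constraints at each of the `J-1` inner knots. The helper name `gauss_dim` below is hypothetical and is not part of the package; it only evaluates the formula.

```r
# Hypothetical helper (not part of binsreg): dimension of the Gaussian vector
# drawn in each simulation, following the formula (p+1)J - (J-1)s.
gauss_dim <- function(p, s, J) {
  stopifnot(p >= 0, s >= 0, s <= p, J >= 1)
  (p + 1) * J - (J - 1) * s
}

gauss_dim(p = 0, s = 0, J = 10)  # 10: piecewise constant, one coefficient per bin
gauss_dim(p = 1, s = 1, J = 10)  # 11: piecewise linear with one constraint per knot
gauss_dim(p = 2, s = 2, J = 10)  # 12: piecewise quadratic with two constraints per knot
```

With `nsims=500`, the procedure would therefore use 500 draws of a vector of this length.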

Value

`stat`	A matrix. Each row corresponds to the comparison between two groups. The first column is the test statistic. The second and third columns give the corresponding group numbers. The null hypothesis is `mu_i(x)<=mu_j(x)`, `mu_i(x)=mu_j(x)`, or `mu_i(x)>=mu_j(x)` for group i (given in the second column) and group j (given in the third column). The group number corresponds to the list of group names given by `opt$byvals`.
`pval`	A vector of p-values for all pairwise group comparisons.
`bins_plot`	A `ggplot` object for confidence bands plot.
`data.plot`	A list containing data for plotting. Each item is a sublist of data frames for comparison between each pair of groups. Each sublist may contain the following data frames: `data.dots` Data for dots. It contains: `pair`, the name for the pair of groups; `x`, evaluation points; `diff.fit`, point estimates of the group difference. `data.cb` Data for confidence bands. It contains: `pair`, the name for the pair of groups; `x`, evaluation points; `cb.fit`, point estimates of the group difference; `cb.se`, standard errors; `cb.l` and `cb.r`, left and right boundaries of the confidence band.
`cval.cb`	A vector of critical values for all pairwise group comparisons.
`imse.var.rot`	Variance constant in IMSE expansion, ROT selection.
`imse.bsq.rot`	Bias constant in IMSE expansion, ROT selection.
`imse.var.dpi`	Variance constant in IMSE expansion, DPI selection.
`imse.bsq.dpi`	Bias constant in IMSE expansion, DPI selection.
`opt`	A list containing options passed to the function, as well as `N.by` (total sample size for each group), `Ndist.by` (number of distinct values in `x` for each group), `Nclust.by` (number of clusters for each group), `nbins.by` (number of bins for each group), and `byvals` (distinct values in `by`).
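
The returned components can be combined into one table per pair of groups. The sketch below relies only on the layout described above (the three columns of `stat`, the vector `pval`, and the group names in `opt$byvals`); the object names `res` and `cmp` are illustrative, and the code is skipped when the binsreg package is not installed.

```r
# Sketch only: the body runs when the binsreg package is installed.
if (requireNamespace("binsreg", quietly = TRUE)) {
  set.seed(42)
  x <- runif(500); y <- sin(x) + rnorm(500); t <- 1 * (runif(500) > 0.5)
  res <- binsreg::binspwc(y, x, by = t)

  # Each row of `stat`: test statistic, then the two group numbers (i, j).
  # The group numbers index the group names stored in opt$byvals.
  groups <- res$opt$byvals
  cmp <- data.frame(
    group_i = groups[res$stat[, 2]],
    group_j = groups[res$stat[, 3]],
    stat    = res$stat[, 1],
    pval    = as.numeric(res$pval)
  )
  print(cmp)
}
```

With two groups there is a single pairwise comparison, so the table should have one row.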

Author(s)

Matias D. Cattaneo, Princeton University, Princeton, NJ. [email protected].

Richard K. Crump, Federal Reserve Bank of New York, New York, NY. [email protected].

Max H. Farrell, UC Santa Barbara, Santa Barbara, CA. [email protected].

Yingjie Feng (maintainer), Tsinghua University, Beijing, China. [email protected].

References

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024a: On Binscatter. American Economic Review 114(5): 1488-1514.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024b: Nonlinear Binscatter Methods. Working Paper.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024c: Binscatter Regressions. Working Paper.

Examples

x <- runif(500); y <- sin(x) + rnorm(500); t <- 1 * (runif(500) > 0.5)
## Pairwise group comparison of binscatter estimators
binspwc(y, x, by = t)

Data-Driven Binscatter Quantile Regression with Robust Inference Procedures and Plots

Description

binsqreg implements binscatter quantile regression with robust inference procedures and plots, following the results in Cattaneo, Crump, Farrell and Feng (2024a) and Cattaneo, Crump, Farrell and Feng (2024b). Binscatter provides a flexible way to describe the quantile relationship between two variables, after possibly adjusting for other covariates, based on partitioning/binning of the independent variable of interest. The main purpose of this function is to generate binned scatter plots with curve estimation, robust pointwise confidence intervals and a uniform confidence band. If the binning scheme is not set by the user, the companion function binsregselect is used to implement binscatter in a data-driven way. Hypothesis testing about the function of interest can be conducted via the companion function binstest.
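
To make the idea of partitioning/binning concrete, the following base-R sketch builds a canonical (piecewise constant) quantile binscatter by hand. It does not call `binsqreg` and is not the package's implementation: the number of bins `J` is fixed manually, and covariate adjustment, inference and data-driven bin selection are all omitted.

```r
# Base-R sketch of a canonical (p = 0, s = 0) quantile binscatter.
set.seed(1)
n <- 500
x <- runif(n); y <- sin(x) + rnorm(n)

J   <- 10     # number of bins, fixed by hand here (an assumption)
tau <- 0.5    # quantile of interest, playing the role of the `quantile` argument

# Quantile-spaced bins: knots at the empirical quantiles of x.
knots <- quantile(x, probs = seq(0, 1, length.out = J + 1))
bin   <- cut(x, breaks = knots, include.lowest = TRUE)

# One "dot" per bin: the within-bin tau-quantile of y,
# placed at the within-bin mean of x.
dots <- data.frame(
  x   = as.numeric(tapply(x, bin, mean)),
  fit = as.numeric(tapply(y, bin, quantile, probs = tau))
)
dots
```

Plotting `dots$fit` against `dots$x` gives the familiar binned scatter plot for the chosen quantile.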

Usage

binsqreg(y, x, w = NULL, data = NULL, at = NULL, quantile = 0.5,
  deriv = 0, dots = NULL, dotsgrid = 0, dotsgridmean = T,
  line = NULL, linegrid = 20, ci = NULL, cigrid = 0, cigridmean = T,
  cb = NULL, cbgrid = 20, polyreg = NULL, polyreggrid = 20,
  polyregcigrid = 0, by = NULL, bycolors = NULL, bysymbols = NULL,
  bylpatterns = NULL, legendTitle = NULL, legendoff = F, nbins = NULL,
  binspos = "qs", binsmethod = "dpi", nbinsrot = NULL, pselect = NULL,
  sselect = NULL, samebinsby = F, randcut = NULL, nsims = 500,
  simsgrid = 20, simsseed = NULL, vce = "nid", cluster = NULL,
  asyvar = F, level = 95, noplot = F, dfcheck = c(20, 30),
  masspoints = "on", weights = NULL, subset = NULL, plotxrange = NULL,
  plotyrange = NULL, qregopt = NULL, ...)

Arguments

`y`	outcome variable. A vector.
`x`	independent variable of interest. A vector.
`w`	control variables. A matrix, a vector or a `formula`.
`data`	an optional data frame containing variables in the model.
`at`	value of `w` at which the estimated function is evaluated. The default is `at="mean"`, which corresponds to the mean of `w`. Other options are: `at="median"` for the median of `w`, `at="zero"` for a vector of zeros. `at` can also be a vector of the same length as the number of columns of `w` (if `w` is a matrix) or a data frame containing the same variables as specified in `w` (when `data` is specified). Note that when `at="mean"` or `at="median"`, all factor variables (if specified) are excluded from the evaluation (set as zero).
`quantile`	the quantile to be estimated. A number strictly between 0 and 1.
`deriv`	derivative order of the regression function for estimation, testing and plotting. The default is `deriv=0`, which corresponds to the function itself.
`dots`	a vector or a logical value. If `dots=c(p,s)`, a piecewise polynomial of degree `p` with `s` smoothness constraints is used for point estimation and plotting as "dots". The default is `dots=c(0,0)`, which corresponds to piecewise constant (canonical binscatter). If `dots=T`, the default `dots=c(0,0)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`). If `dots=F` is specified, the dots are not included in the plot.
`dotsgrid`	number of dots within each bin to be plotted. Given the choice, these dots are point estimates evaluated over an evenly-spaced grid within each bin. The default is `dotsgrid=0`, and only the point estimates at the mean of `x` within each bin are presented.
`dotsgridmean`	If true, the dots corresponding to the point estimates evaluated at the mean of `x` within each bin are presented. By default, they are presented, i.e., `dotsgridmean=T`.
`line`	a vector or a logical value. If `line=c(p,s)`, a piecewise polynomial of degree `p` with `s` smoothness constraints is used for plotting as a "line". If `line=T` is specified, `line=c(0,0)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`). If `line=F` or `line=NULL` is specified, the line is not included in the plot. The default is `line=NULL`.
`linegrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the `line=c(p,s)` option. The default is `linegrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for fitting/plotting the line.
`ci`	a vector or a logical value. If `ci=c(p,s)` a piecewise polynomial of degree `p` with `s` smoothness constraints is used for constructing confidence intervals. If `ci=T` is specified, `ci=c(1,1)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`). If `ci=F` or `ci=NULL` is specified, the confidence intervals are not included in the plot. The default is `ci=NULL`.
`cigrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the `ci=c(p,s)` option. The default is `cigrid=0` (as in the usage above), so confidence intervals are constructed only at the mean of `x` within each bin when `cigridmean=T`.
`cigridmean`	If true, the confidence intervals corresponding to the point estimates evaluated at the mean of `x` within each bin are presented. The default is `cigridmean=T`.
`cb`	a vector or a logical value. If `cb=c(p,s)`, a piecewise polynomial of degree `p` with `s` smoothness constraints is used for constructing the confidence band. If the option `cb=T` is specified, `cb=c(1,1)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`). If `cb=F` or `cb=NULL` is specified, the confidence band is not included in the plot. The default is `cb=NULL`.
`cbgrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the `cb=c(p,s)` option. The default is `cbgrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for confidence band construction.
`polyreg`	degree of a global polynomial regression model for plotting. By default, this fit is not included in the plot unless explicitly specified. Recommended specification is `polyreg=3`, which adds a cubic (global) polynomial fit of the regression function of interest to the binned scatter plot.
`polyreggrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the `polyreg=p` option. The default is `polyreggrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for fitting/plotting the global polynomial.
`polyregcigrid`	number of evaluation points of an evenly-spaced grid within each bin used for constructing confidence intervals based on polynomial regression set by the `polyreg=p` option. The default is `polyregcigrid=0`, which corresponds to not plotting confidence intervals for the global polynomial regression approximation.
`by`	a vector containing the group indicator for subgroup analysis; both numeric and string variables are supported. When `by` is specified, `binsqreg` implements estimation and inference for each subgroup separately, but produces a common binned scatter plot. By default, the binning structure is selected for each subgroup separately, but see the option `samebinsby` below for imposing a common binning structure across subgroups.
`bycolors`	an ordered list of colors for plotting each subgroup series defined by the option `by`.
`bysymbols`	an ordered list of symbols for plotting each subgroup series defined by the option `by`.
`bylpatterns`	an ordered list of line patterns for plotting each subgroup series defined by the option `by`.
`legendTitle`	String, title of legend.
`legendoff`	If true, no legend is added.
`nbins`	number of bins for partitioning/binning of `x`. If `nbins=T` or `nbins=NULL` (default) is specified, the number of bins is selected via the companion command `binsregselect` in a data-driven, optimal way whenever possible. If a vector with more than one number is specified, the number of bins is selected within this vector via the companion command `binsregselect`.
`binspos`	position of binning knots. The default is `binspos="qs"`, which corresponds to quantile-spaced binning (canonical binscatter). The other options are `"es"` for evenly-spaced binning, or a vector for manual specification of the positions of inner knots (which must be within the range of `x`).
`binsmethod`	method for data-driven selection of the number of bins. The default is `binsmethod="dpi"`, which corresponds to the IMSE-optimal direct plug-in rule. The other option is: `"rot"` for rule of thumb implementation.
`nbinsrot`	initial number of bins value used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.
`pselect`	vector of numbers within which the degree of polynomial `p` for point estimation is selected. Piecewise polynomials of the selected optimal degree `p` are used to construct dots or line if `dots=T` or `line=T` is specified, whereas piecewise polynomials of degree `p+1` are used to construct confidence intervals or confidence band if `ci=T` or `cb=T` is specified. Note: To implement the degree or smoothness selection, in addition to `pselect` or `sselect`, `nbins=#` must be specified.
`sselect`	vector of numbers within which the number of smoothness constraints `s` for point estimation is selected. Piecewise polynomials with the selected optimal `s` smoothness constraints are used to construct dots or line if `dots=T` or `line=T` is specified, whereas piecewise polynomials with `s+1` constraints are used to construct confidence intervals or confidence band if `ci=T` or `cb=T` is specified. If not specified, for each value `p` supplied in the option `pselect`, only the piecewise polynomial with the maximum smoothness is considered, i.e., `s=p`.
`samebinsby`	if true, a common partitioning/binning structure across all subgroups specified by the option `by` is forced. The knots positions are selected according to the option `binspos` and using the full sample. If `nbins` is not specified, then the number of bins is selected via the companion command `binsregselect` and using the full sample.
`randcut`	upper bound on a uniformly distributed variable used to draw a subsample for bins/degree/smoothness selection. Observations for which `runif()<=#` are used. `#` must be between 0 and 1. By default, `max(5000, 0.01n)` observations are used if the sample size `n>5000`.
`nsims`	number of random draws for constructing confidence bands. The default is `nsims=500`, which corresponds to 500 draws from a standard Gaussian random vector of size `[(p+1)J - (J-1)s]`. Setting at least `nsims=2000` is recommended to obtain the final results.
`simsgrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum operation needed to construct confidence bands. The default is `simsgrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum operator. Setting at least `simsgrid=50` is recommended to obtain the final results.
`simsseed`	seed for simulation.
`vce`	procedure to compute the variance-covariance matrix estimator (see `summary.rq` for more details). Options are: `"iid"`, which presumes that the errors are iid and computes an estimate of the asymptotic covariance matrix as in Koenker and Bassett (1978); `"nid"`, which presumes local (in quantile) linearity of the conditional quantile functions and computes a Huber sandwich estimate using a local estimate of the sparsity; `"ker"`, which uses a kernel estimate of the sandwich as proposed by Powell (1991); and `"boot"`, which implements one of several possible bootstrapping alternatives for estimating standard errors, including a variant of the wild bootstrap for clustered response. See `boot.rq` for further details.
`cluster`	cluster ID. Used to compute cluster-robust standard errors.
`asyvar`	if true, the standard error of the nonparametric component is computed and the uncertainty related to control variables is omitted. Default is `asyvar=FALSE`, that is, the uncertainty related to control variables is taken into account.
`level`	nominal confidence level for confidence interval and confidence band estimation. Default is `level=95`.
`noplot`	if true, no plot is produced.
`dfcheck`	adjustments for minimum effective sample size checks, which take into account number of unique values of `x` (i.e., number of mass points), number of clusters, and degrees of freedom of the different statistical models considered. The default is `dfcheck=c(20, 30)`. See Cattaneo, Crump, Farrell and Feng (2024c) for more details.
`masspoints`	how mass points in `x` are handled. Available options: `"on"`: all mass point and degrees of freedom checks are implemented (default). `"noadjust"`: mass point checks and the corresponding effective sample size adjustments are omitted. `"nolocalcheck"`: within-bin mass point and degrees of freedom checks are omitted. `"off"`: `"noadjust"` and `"nolocalcheck"` are set simultaneously. `"veryfew"`: forces the function to proceed as if `x` has only a few mass points (i.e., distinct values). In other words, it forces the function to proceed as if the mass point and degrees of freedom checks had failed.
`weights`	an optional vector of weights to be used in the fitting process. Should be `NULL` or a numeric vector. For more details, see `lm`.
`subset`	optional rule specifying a subset of observations to be used.
`plotxrange`	a vector. `plotxrange=c(min, max)` specifies a range of the x-axis for plotting. Observations outside the range are dropped in the plot.
`plotyrange`	a vector. `plotyrange=c(min, max)` specifies a range of the y-axis for plotting. Observations outside the range are dropped in the plot.
`qregopt`	a list of optional arguments used by `rq`.
`...`	optional arguments to control bootstrapping. See `boot.rq`.
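
The subsampling rule described under `randcut` can be sketched in base R. The sketch assumes that the default cutoff equals the target subsample size `max(5000, 0.01n)` divided by `n`; that reading is an assumption made for illustration, not something stated in this manual, and the variable names are hypothetical.

```r
# Sketch of the randcut rule: keep observations with runif() <= cutoff.
set.seed(123)
n <- 200000

target  <- max(5000, 0.01 * n)   # 5000 here, since 0.01 * n = 2000
randcut <- target / n            # assumed default cutoff: 0.025
keep    <- runif(n) <= randcut   # observations entering the selection step

sum(keep)                        # close to `target` on average, not exactly equal
```

Because the draw is random, the realized subsample size varies from run to run unless a seed is set.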

Value

`bins_plot`	A `ggplot` object for binscatter plot.
`data.plot`	A list containing data for plotting. Each item is a sublist of data frames for each group. Each sublist may contain the following data frames: `data.dots` Data for dots. It contains: `x`, evaluation points; `bin`, the indicator of bins; `isknot`, indicator of inner knots; `mid`, midpoint of each bin; and `fit`, fitted values. `data.line` Data for line. It contains: `x`, evaluation points; `bin`, the indicator of bins; `isknot`, indicator of inner knots; `mid`, midpoint of each bin; and `fit`, fitted values. `data.ci` Data for CI. It contains: `x`, evaluation points; `bin`, the indicator of bins; `isknot`, indicator of inner knots; `mid`, midpoint of each bin; `ci.l` and `ci.r`, left and right boundaries of each confidence interval. `data.cb` Data for CB. It contains: `x`, evaluation points; `bin`, the indicator of bins; `isknot`, indicator of inner knots; `mid`, midpoint of each bin; `cb.l` and `cb.r`, left and right boundaries of the confidence band. `data.poly` Data for polynomial regression. It contains: `x`, evaluation points; `bin`, the indicator of bins; `isknot`, indicator of inner knots; `mid`, midpoint of each bin; and `fit`, fitted values. `data.polyci` Data for confidence intervals based on polynomial regression. It contains: `x`, evaluation points; `bin`, the indicator of bins; `isknot`, indicator of inner knots; `mid`, midpoint of each bin; `polyci.l` and `polyci.r`, left and right boundaries of each confidence interval. `data.bin` Data for the binning structure. It contains: `bin.id`, ID for each bin; `left.endpoint` and `right.endpoint`, left and right endpoints of each bin.
`imse.var.rot`	Variance constant in IMSE, ROT selection.
`imse.bsq.rot`	Bias constant in IMSE, ROT selection.
`imse.var.dpi`	Variance constant in IMSE, DPI selection.
`imse.bsq.dpi`	Bias constant in IMSE, DPI selection.
`cval.by`	A vector of critical values for constructing confidence band for each group.
`opt`	A list containing options passed to the function, as well as `N.by` (total sample size for each group), `Ndist.by` (number of distinct values in `x` for each group), `Nclust.by` (number of clusters for each group), `nbins.by` (number of bins for each group), and `byvals` (distinct values in `by`). The degree and smoothness of polynomials for dots, line, confidence intervals and confidence band for each group are saved in `dots`, `line`, `ci`, and `cb`.
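
The plotting data can be pulled out of the returned object for custom figures. The sketch below uses only the components listed above (`data.plot`, its `data.dots` data frame, and `opt$nbins.by`); the object names `est` and `dots` are illustrative, and the code is skipped when the binsreg package is not installed.

```r
# Sketch only: the body runs when the binsreg package is installed.
if (requireNamespace("binsreg", quietly = TRUE)) {
  set.seed(7)
  x <- runif(500); y <- sin(x) + rnorm(500)
  est <- binsreg::binsqreg(y, x, quantile = 0.25)

  # data.plot holds one sublist per group; without `by` there is a single group.
  dots <- est$data.plot[[1]]$data.dots
  print(head(dots[, c("x", "bin", "fit")]))

  # Number of bins used for each group, as stored in the options list.
  print(est$opt$nbins.by)
}
```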

Author(s)

Matias D. Cattaneo, Princeton University, Princeton, NJ. [email protected].

Richard K. Crump, Federal Reserve Bank of New York, New York, NY. [email protected].

Max H. Farrell, UC Santa Barbara, Santa Barbara, CA. [email protected].

Yingjie Feng (maintainer), Tsinghua University, Beijing, China. [email protected].

References

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024a: On Binscatter. American Economic Review 114(5): 1488-1514.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024b: Nonlinear Binscatter Methods. Working Paper.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024c: Binscatter Regressions. Working Paper.

Examples

 x <- runif(500); y <- sin(x)+rnorm(500)
 ## Binned scatterplot
 binsqreg(y,x)

Data-Driven Binscatter Least Squares Regression with Robust Inference Procedures and Plots

Description

binsreg implements binscatter least squares regression with robust inference procedures and plots, following the results in Cattaneo, Crump, Farrell and Feng (2024a) and Cattaneo, Crump, Farrell and Feng (2024b). Binscatter provides a flexible way to describe the mean relationship between two variables, after possibly adjusting for other covariates, based on partitioning/binning of the independent variable of interest. The main purpose of this function is to generate binned scatter plots with curve estimation, robust pointwise confidence intervals and a uniform confidence band. If the binning scheme is not set by the user, the companion function binsregselect is used to implement binscatter in a data-driven (optimal) way. Hypothesis testing about the regression function can be conducted via the companion function binstest.
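
In the canonical case (piecewise constant fit, no covariates), least squares on a set of bin indicators reproduces the within-bin sample means. The base-R sketch below checks this numerically; it does not call `binsreg`, and the number of bins `J` is fixed by hand purely for illustration.

```r
# Base-R check: least squares on bin indicators equals within-bin means.
set.seed(2)
n <- 500
x <- runif(n); y <- sin(x) + rnorm(n)

J     <- 10   # number of bins, fixed by hand here (an assumption)
knots <- quantile(x, probs = seq(0, 1, length.out = J + 1))
bin   <- cut(x, breaks = knots, include.lowest = TRUE)

fit_ls   <- unname(coef(lm(y ~ 0 + bin)))     # least squares on J bin indicators
fit_mean <- as.numeric(tapply(y, bin, mean))  # within-bin sample means

all.equal(fit_ls, fit_mean)
```

Higher-degree or smoother fits (`p > 0`, `s > 0`) replace the indicators with a piecewise polynomial basis, which this sketch does not attempt.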

Usage

binsreg(y, x, w = NULL, data = NULL, at = NULL, deriv = 0,
  dots = NULL, dotsgrid = 0, dotsgridmean = T, line = NULL,
  linegrid = 20, ci = NULL, cigrid = 0, cigridmean = T, cb = NULL,
  cbgrid = 20, polyreg = NULL, polyreggrid = 20, polyregcigrid = 0,
  by = NULL, bycolors = NULL, bysymbols = NULL, bylpatterns = NULL,
  legendTitle = NULL, legendoff = F, nbins = NULL, binspos = "qs",
  binsmethod = "dpi", nbinsrot = NULL, pselect = NULL, sselect = NULL,
  samebinsby = F, randcut = NULL, nsims = 500, simsgrid = 20,
  simsseed = NULL, vce = "HC1", cluster = NULL, asyvar = F,
  level = 95, noplot = F, dfcheck = c(20, 30), masspoints = "on",
  weights = NULL, subset = NULL, plotxrange = NULL, plotyrange = NULL)

Arguments

`y`	outcome variable. A vector.
`x`	independent variable of interest. A vector.
`w`	control variables. A matrix, a vector or a `formula`.
`data`	an optional data frame containing variables used in the model.
`at`	value of `w` at which the estimated function is evaluated. The default is `at="mean"`, which corresponds to the mean of `w`. Other options are: `at="median"` for the median of `w`, `at="zero"` for a vector of zeros. `at` can also be a vector of the same length as the number of columns of `w` (if `w` is a matrix) or a data frame containing the same variables as specified in `w` (when `data` is specified). Note that when `at="mean"` or `at="median"`, all factor variables (if specified) are excluded from the evaluation (set as zero).
`deriv`	derivative order of the regression function for estimation, testing and plotting. The default is `deriv=0`, which corresponds to the function itself.
`dots`	a vector or a logical value. If `dots=c(p,s)`, a piecewise polynomial of degree `p` with `s` smoothness constraints is used for point estimation and plotting as "dots". The default is `dots=c(0,0)`, which corresponds to piecewise constant (canonical binscatter). If `dots=T`, the default `dots=c(0,0)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`). If `dots=F` is specified, the dots are not included in the plot.
`dotsgrid`	number of dots within each bin to be plotted. Given the choice, these dots are point estimates evaluated over an evenly-spaced grid within each bin. The default is `dotsgrid=0`, and only the point estimates at the mean of `x` within each bin are presented.
`dotsgridmean`	If true, the dots corresponding to the point estimates evaluated at the mean of `x` within each bin are presented. By default, they are presented, i.e., `dotsgridmean=T`.
`line`	a vector or a logical value. If `line=c(p,s)`, a piecewise polynomial of degree `p` with `s` smoothness constraints is used for plotting as a "line". If `line=T` is specified, `line=c(0,0)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`). If `line=F` or `line=NULL` is specified, the line is not included in the plot. The default is `line=NULL`.
`linegrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the `line=c(p,s)` option. The default is `linegrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for fitting/plotting the line.
`ci`	a vector or a logical value. If `ci=c(p,s)` a piecewise polynomial of degree `p` with `s` smoothness constraints is used for constructing confidence intervals. If `ci=T` is specified, `ci=c(1,1)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`). If `ci=F` or `ci=NULL` is specified, the confidence intervals are not included in the plot. The default is `ci=NULL`.
`cigrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the `ci=c(p,s)` option. The default is `cigrid=0` (as in the usage above), so confidence intervals are constructed only at the mean of `x` within each bin when `cigridmean=T`.
`cigridmean`	If true, the confidence intervals corresponding to the point estimates evaluated at the mean of `x` within each bin are presented. The default is `cigridmean=T`.
`cb`	a vector or a logical value. If `cb=c(p,s)`, a piecewise polynomial of degree `p` with `s` smoothness constraints is used for constructing the confidence band. If the option `cb=T` is specified, `cb=c(1,1)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`). If `cb=F` or `cb=NULL` is specified, the confidence band is not included in the plot. The default is `cb=NULL`.
`cbgrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the `cb=c(p,s)` option. The default is `cbgrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for confidence band construction.
`polyreg`	degree of a global polynomial regression model for plotting. By default, this fit is not included in the plot unless explicitly specified. Recommended specification is `polyreg=3`, which adds a cubic (global) polynomial fit of the regression function of interest to the binned scatter plot.
`polyreggrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the `polyreg=p` option. The default is `polyreggrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for fitting/plotting the global polynomial.
`polyregcigrid`	number of evaluation points of an evenly-spaced grid within each bin used for constructing confidence intervals based on polynomial regression set by the `polyreg=p` option. The default is `polyregcigrid=0`, which corresponds to not plotting confidence intervals for the global polynomial regression approximation.
`by`	a vector containing the group indicator for subgroup analysis; both numeric and string variables are supported. When `by` is specified, `binsreg` implements estimation and inference for each subgroup separately, but produces a common binned scatter plot. By default, the binning structure is selected for each subgroup separately, but see the option `samebinsby` below for imposing a common binning structure across subgroups.
`bycolors`	an ordered list of colors for plotting each subgroup series defined by the option `by`.
`bysymbols`	an ordered list of symbols for plotting each subgroup series defined by the option `by`.
`bylpatterns`	an ordered list of line patterns for plotting each subgroup series defined by the option `by`.
`legendTitle`	String, title of legend.
`legendoff`	If true, no legend is added.
`nbins`	number of bins for partitioning/binning of `x`. If `nbins=T` or `nbins=NULL` (default) is specified, the number of bins is selected via the companion command `binsregselect` in a data-driven, optimal way whenever possible. If a vector with more than one number is specified, the number of bins is selected within this vector via the companion command `binsregselect`.
`binspos`	position of binning knots. The default is `binspos="qs"`, which corresponds to quantile-spaced binning (canonical binscatter). The other options are `"es"` for evenly-spaced binning, or a vector for manual specification of the positions of inner knots (which must be within the range of `x`).
`binsmethod`	method for data-driven selection of the number of bins. The default is `binsmethod="dpi"`, which corresponds to the IMSE-optimal direct plug-in rule. The other option is: `"rot"` for rule of thumb implementation.
`nbinsrot`	initial number of bins value used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.
`pselect`	vector of numbers within which the degree of polynomial `p` for point estimation is selected. Piecewise polynomials of the selected optimal degree `p` are used to construct dots or line if `dots=T` or `line=T` is specified, whereas piecewise polynomials of degree `p+1` are used to construct confidence intervals or confidence band if `ci=T` or `cb=T` is specified. Note: To implement the degree or smoothness selection, in addition to `pselect` or `sselect`, `nbins=#` must be specified.
`sselect`	vector of numbers within which the number of smoothness constraints `s` for point estimation is selected. Piecewise polynomials with the selected optimal `s` smoothness constraints are used to construct dots or line if `dots=T` or `line=T` is specified, whereas piecewise polynomials with `s+1` constraints are used to construct confidence intervals or confidence band if `ci=T` or `cb=T` is specified. If not specified, for each value `p` supplied in the option `pselect`, only the piecewise polynomial with the maximum smoothness is considered, i.e., `s=p`.
`samebinsby`	if true, a common partitioning/binning structure across all subgroups specified by the option `by` is forced. The knots positions are selected according to the option `binspos` and using the full sample. If `nbins` is not specified, then the number of bins is selected via the companion command `binsregselect` and using the full sample.
`randcut`	upper bound on a uniformly distributed variable used to draw a subsample for bins/degree/smoothness selection. Observations for which `runif()<=#` are used. # must be between 0 and 1. By default, `max(5000, 0.01n)` observations are used if the sample size `n>5000`.
`nsims`	number of random draws for constructing confidence bands. The default is `nsims=500`, which corresponds to 500 draws from a standard Gaussian random vector of size `[(p+1)J - (J-1)s]`. Setting at least `nsims=2000` is recommended to obtain the final results.
`simsgrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum operation needed to construct confidence bands. The default is `simsgrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum operator. Setting at least `simsgrid=50` is recommended to obtain the final results.
`simsseed`	seed for simulation.
`vce`	procedure to compute the variance-covariance matrix estimator. Options are: `"const"`, homoskedastic variance estimator; `"HC0"`, heteroskedasticity-robust plug-in residuals variance estimator without weights; `"HC1"`, heteroskedasticity-robust plug-in residuals variance estimator with hc1 weights (default); `"HC2"`, heteroskedasticity-robust plug-in residuals variance estimator with hc2 weights; `"HC3"`, heteroskedasticity-robust plug-in residuals variance estimator with hc3 weights.
`cluster`	cluster ID. Used to compute cluster-robust standard errors.
`asyvar`	If true, the standard error of the nonparametric component is computed and the uncertainty related to control variables is omitted. Default is `asyvar=FALSE`, that is, the uncertainty related to control variables is taken into account.
`level`	nominal confidence level for confidence interval and confidence band estimation. Default is `level=95`.
`noplot`	if true, no plot is produced.
`dfcheck`	adjustments for minimum effective sample size checks, which take into account number of unique values of `x` (i.e., number of mass points), number of clusters, and degrees of freedom of the different statistical models considered. The default is `dfcheck=c(20, 30)`. See Cattaneo, Crump, Farrell and Feng (2024c) for more details.
`masspoints`	how mass points in `x` are handled. Available options: `"on"`, all mass point and degrees of freedom checks are implemented (default); `"noadjust"`, mass point checks and the corresponding effective sample size adjustments are omitted; `"nolocalcheck"`, within-bin mass point and degrees of freedom checks are omitted; `"off"`, `"noadjust"` and `"nolocalcheck"` are set simultaneously; `"veryfew"`, forces the function to proceed as if `x` has only a few mass points (i.e., distinct values), that is, as if the mass point and degrees of freedom checks had failed.
`weights`	an optional vector of weights to be used in the fitting process. Should be `NULL` or a numeric vector. For more details, see `lm`.
`subset`	Optional rule specifying a subset of observations to be used.
`plotxrange`	a vector. `plotxrange=c(min, max)` specifies a range of the x-axis for plotting. Observations outside the range are dropped in the plot.
`plotyrange`	a vector. `plotyrange=c(min, max)` specifies a range of the y-axis for plotting. Observations outside the range are dropped in the plot.
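
The two automatic choices for `binspos` described above can be pictured with a short base-R sketch that computes quantile-spaced and evenly-spaced knots for a fixed number of bins. This is only an illustration of the idea, not the package's internal code: the helper names `knots_qs` and `knots_es` are hypothetical, and details such as the quantile type or the handling of repeated knots are assumptions.

```r
# Hypothetical helpers illustrating the two automatic knot placements.
knots_qs <- function(x, J) {
  # quantile-spaced: knots at the 0, 1/J, ..., 1 sample quantiles of x
  unname(quantile(x, probs = seq(0, 1, length.out = J + 1)))
}
knots_es <- function(x, J) {
  # evenly-spaced: J + 1 equidistant knots over the range of x
  seq(min(x), max(x), length.out = J + 1)
}

set.seed(1)
x <- rexp(500)   # skewed design, so the two schemes differ visibly
J <- 5
k_qs <- knots_qs(x, J)
k_es <- knots_es(x, J)

# Observations per bin under each scheme
n_qs <- table(cut(x, breaks = k_qs, include.lowest = TRUE))
n_es <- table(cut(x, breaks = k_es, include.lowest = TRUE))
n_qs
n_es
```

With a skewed `x`, the quantile-spaced bins hold roughly equal numbers of observations, while the evenly-spaced bins have equal width but unequal counts.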

Value

`bins_plot`	A `ggplot` object for binscatter plot.
`data.plot`	A list containing data for plotting. Each item is a sublist of data frames for one group. Each sublist may contain the following data frames. `data.dots`: data for dots, containing `x`, evaluation points; `bin`, the indicator of bins; `isknot`, indicator of inner knots; `mid`, midpoint of each bin; and `fit`, fitted values. `data.line`: data for the line, containing `x`, evaluation points; `bin`, the indicator of bins; `isknot`, indicator of inner knots; `mid`, midpoint of each bin; and `fit`, fitted values. `data.ci`: data for confidence intervals, containing `x`, evaluation points; `bin`, the indicator of bins; `isknot`, indicator of inner knots; `mid`, midpoint of each bin; `ci.l` and `ci.r`, left and right boundaries of each confidence interval. `data.cb`: data for the confidence band, containing `x`, evaluation points; `bin`, the indicator of bins; `isknot`, indicator of inner knots; `mid`, midpoint of each bin; `cb.l` and `cb.r`, left and right boundaries of the confidence band. `data.poly`: data for polynomial regression, containing `x`, evaluation points; `bin`, the indicator of bins; `isknot`, indicator of inner knots; `mid`, midpoint of each bin; and `fit`, fitted values. `data.polyci`: data for confidence intervals based on polynomial regression, containing `x`, evaluation points; `bin`, the indicator of bins; `isknot`, indicator of inner knots; `mid`, midpoint of each bin; `polyci.l` and `polyci.r`, left and right boundaries of each confidence interval. `data.bin`: data for the binning structure, containing `bin.id`, ID for each bin; `left.endpoint` and `right.endpoint`, left and right endpoints of each bin.
`imse.var.rot`	Variance constant in IMSE, ROT selection.
`imse.bsq.rot`	Bias constant in IMSE, ROT selection.
`imse.var.dpi`	Variance constant in IMSE, DPI selection.
`imse.bsq.dpi`	Bias constant in IMSE, DPI selection.
`cval.by`	A vector of critical values for constructing confidence band for each group.
`opt`	A list containing options passed to the function, as well as `N.by` (total sample size for each group), `Ndist.by` (number of distinct values in `x` for each group), `Nclust.by` (number of clusters for each group), `nbins.by` (number of bins for each group), and `byvals` (number of distinct values in `by`). The degree and smoothness of polynomials for dots, line, confidence intervals and confidence band for each group are saved in `dots`, `line`, `ci`, and `cb`.
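
The returned components listed above can be inspected directly. The sketch below assumes the `binsreg` package is installed; the simulated data and the choice of `nbins = 10` with piecewise linear intervals and band are illustrative only.

```r
library(binsreg)   # assumes the package is installed

set.seed(42)
x <- runif(500)
y <- sin(x) + rnorm(500)

# Fixed number of bins; piecewise linear confidence intervals and band
est <- binsreg(y, x, nbins = 10, ci = c(1, 1), cb = c(1, 1))

grp <- est$data.plot[[1]]   # data frames for the first (here, only) group
head(grp$data.dots)         # x, bin, isknot, mid, fit
head(grp$data.ci)           # ..., ci.l, ci.r
head(grp$data.cb)           # ..., cb.l, cb.r
est$cval.by                 # critical value used for the confidence band
est$opt$nbins.by            # number of bins for each group
```

Since `bins_plot` is a `ggplot` object, it can be modified further with the usual `ggplot2` layers before printing.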

Author(s)

Matias D. Cattaneo, Princeton University, Princeton, NJ.

Richard K. Crump, Federal Reserve Bank of New York, New York, NY.

Max H. Farrell, UC Santa Barbara, Santa Barbara, CA.

Yingjie Feng (maintainer), Tsinghua University, Beijing, China.

References

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024a: On Binscatter. American Economic Review 114(5): 1488-1514.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024b: Nonlinear Binscatter Methods. Working Paper.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024c: Binscatter Regressions. Working Paper.

Examples

library(binsreg)
x <- runif(500)
y <- sin(x) + rnorm(500)
## Binned scatterplot
binsreg(y, x)
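
A slightly richer call can combine several of the options documented above: subgroup analysis via `by`, a common binning structure via `samebinsby`, a confidence band, and a global cubic fit. The sketch assumes the `binsreg` package is installed; the simulated two-group data and the object name `est_by` are illustrative.

```r
library(binsreg)   # assumes the package is installed

set.seed(7)
n <- 1000
x <- runif(n)
g <- sample(c("A", "B"), n, replace = TRUE)   # illustrative group indicator
y <- sin(x) + 0.5 * (g == "B") + rnorm(n)

# One plot, two subgroups, shared bins, confidence band and cubic fit.
# nsims and simsgrid follow the values recommended for final results.
est_by <- binsreg(y, x, by = g, samebinsby = TRUE,
                  cb = c(1, 1), polyreg = 3,
                  nsims = 2000, simsgrid = 50, simsseed = 1)

length(est_by$data.plot)   # one sublist of plotting data per group
```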

Data-Driven IMSE-Optimal Partitioning/Binning Selection for Binscatter

Description

binsregselect implements data-driven procedures for selecting the number of bins for binscatter estimation. The selected number is optimal in minimizing integrated mean squared error (IMSE).

Usage

binsregselect(y, x, w = NULL, data = NULL, deriv = 0, bins = NULL,
  pselect = NULL, sselect = NULL, binspos = "qs", nbins = NULL,
  binsmethod = "dpi", nbinsrot = NULL, simsgrid = 20, savegrid = F,
  vce = "HC1", useeffn = NULL, randcut = NULL, cluster = NULL,
  dfcheck = c(20, 30), masspoints = "on", weights = NULL,
  subset = NULL, norotnorm = F, numdist = NULL, numclust = NULL)

Arguments

`y`	outcome variable. A vector.
`x`	independent variable of interest. A vector.
`w`	control variables. A matrix, a vector or a `formula`.
`data`	an optional data frame containing variables used in the model.
`deriv`	derivative order of the regression function for estimation, testing and plotting. The default is `deriv=0`, which corresponds to the function itself.
`bins`	a vector. `bins=c(p,s)` sets a piecewise polynomial of degree `p` with `s` smoothness constraints for data-driven (IMSE-optimal) selection of the partitioning/binning scheme. By default, the function sets `bins=c(0,0)`, which corresponds to piecewise constant (canonical binscatter).
`pselect`	vector of numbers within which the degree of polynomial `p` for point estimation is selected. Note: To implement the degree or smoothness selection, in addition to `pselect` or `sselect`, `nbins=#` must be specified.
`sselect`	vector of numbers within which the number of smoothness constraints `s` for point estimation is selected. If not specified, for each value `p` supplied in the option `pselect`, only the piecewise polynomial with the maximum smoothness is considered, i.e., `s=p`.
`binspos`	position of binning knots. The default is `binspos="qs"`, which corresponds to quantile-spaced binning (canonical binscatter). The other option is `binspos="es"` for evenly-spaced binning.
`nbins`	number of bins for degree/smoothness selection. If `nbins=T` or `nbins=NULL` (default) is specified, the function selects the number of bins instead, given the specified degree and smoothness. If a vector with more than one number is specified, the command selects the number of bins within this vector.
`binsmethod`	method for data-driven selection of the number of bins. The default is `binsmethod="dpi"`, which corresponds to the IMSE-optimal direct plug-in rule. The other option is: `"rot"` for rule of thumb implementation.
`nbinsrot`	initial number of bins value used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.
`simsgrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum (infimum or Lp metric) operation needed to construct confidence bands and hypothesis testing procedures. The default is `simsgrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum (infimum or Lp metric) operator.
`savegrid`	if true, a data frame containing the grid is produced.
`vce`	procedure to compute the variance-covariance matrix estimator. Options are: `"const"`, homoskedastic variance estimator; `"HC0"`, heteroskedasticity-robust plug-in residuals variance estimator without weights; `"HC1"`, heteroskedasticity-robust plug-in residuals variance estimator with hc1 weights (default); `"HC2"`, heteroskedasticity-robust plug-in residuals variance estimator with hc2 weights; `"HC3"`, heteroskedasticity-robust plug-in residuals variance estimator with hc3 weights.
`useeffn`	effective sample size to be used when computing the (IMSE-optimal) number of bins. This option is useful for extrapolating the optimal number of bins to larger (or smaller) datasets than the one used to compute it.
`randcut`	upper bound on a uniformly distributed variable used to draw a subsample for bins/degree/smoothness selection. Observations for which `runif()<=#` are used. # must be between 0 and 1.
`cluster`	cluster ID. Used to compute cluster-robust standard errors.
`dfcheck`	adjustments for minimum effective sample size checks, which take into account number of unique values of `x` (i.e., number of mass points), number of clusters, and degrees of freedom of the different statistical models considered. The default is `dfcheck=c(20, 30)`. See Cattaneo, Crump, Farrell and Feng (2024c) for more details.
`masspoints`	how mass points in `x` are handled. Available options: `"on"`, all mass point and degrees of freedom checks are implemented (default); `"noadjust"`, mass point checks and the corresponding effective sample size adjustments are omitted; `"nolocalcheck"`, within-bin mass point and degrees of freedom checks are omitted; `"off"`, `"noadjust"` and `"nolocalcheck"` are set simultaneously; `"veryfew"`, forces the function to proceed as if `x` has only a few mass points (i.e., distinct values), that is, as if the mass point and degrees of freedom checks had failed.
`weights`	an optional vector of weights to be used in the fitting process. Should be `NULL` or a numeric vector. For more details, see `lm`.
`subset`	optional rule specifying a subset of observations to be used.
`norotnorm`	if true, a uniform density rather than a normal density is used for ROT selection.
`numdist`	number of distinct values for selection. Used to speed up computation.
`numclust`	number of clusters for selection. Used to speed up computation.
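
The subsampling rule behind `randcut` can be sketched in a few lines of base R: draw one uniform variate per observation and keep the observations at or below the cutoff. This mirrors the description above and is not the package's own code; the variable names and the cutoff value are illustrative.

```r
set.seed(123)
n <- 20000
x <- runif(n)
y <- sin(x) + rnorm(n)

randcut <- 0.25               # must lie between 0 and 1
keep <- runif(n) <= randcut   # TRUE for observations entering the subsample

x_sub <- x[keep]
y_sub <- y[keep]
length(x_sub) / n             # share kept, close to randcut on average
```

Because each observation is kept independently, the subsample size is random; only its expected share equals `randcut`.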

Value

`nbinsrot.poly`	ROT number of bins, unregularized.
`nbinsrot.regul`	ROT number of bins, regularized.
`nbinsrot.uknot`	ROT number of bins, unique knots.
`nbinsdpi`	DPI number of bins.
`nbinsdpi.uknot`	DPI number of bins, unique knots.
`prot.poly`	ROT degree of polynomials, unregularized.
`prot.regul`	ROT degree of polynomials, regularized.
`prot.uknot`	ROT degree of polynomials, unique knots.
`pdpi`	DPI degree of polynomials.
`pdpi.uknot`	DPI degree of polynomials, unique knots.
`srot.poly`	ROT number of smoothness constraints, unregularized.
`srot.regul`	ROT number of smoothness constraints, regularized.
`srot.uknot`	ROT number of smoothness constraints, unique knots.
`sdpi`	DPI number of smoothness constraints.
`sdpi.uknot`	DPI number of smoothness constraints, unique knots.
`imse.var.rot`	Variance constant in IMSE expansion, ROT selection.
`imse.bsq.rot`	Bias constant in IMSE expansion, ROT selection.
`imse.var.dpi`	Variance constant in IMSE expansion, DPI selection.
`imse.bsq.dpi`	Bias constant in IMSE expansion, DPI selection.
`int.result`	Intermediate results, including a matrix of degree and smoothness (`deg_mat`), the selected numbers of bins (`vec.nbinsrot.poly`,`vec.nbinsrot.regul`, `vec.nbinsrot.uknot`, `vec.nbinsdpi`, `vec.nbinsdpi.uknot`), and the bias and variance constants in IMSE (`vec.imse.b.rot`, `vec.imse.v.rot`, `vec.imse.b.dpi`, `vec.imse.v.dpi`) under each rule (ROT or DPI), corresponding to each pair of degree and smoothness (each row in `deg_mat`).
`opt`	A list containing options passed to the function, as well as total sample size `n`, number of distinct values `Ndist` in `x`, and number of clusters `Nclust`.
`data.grid`	A data frame containing the grid.
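
As a sketch of how the returned values might be used, the following selects the number of bins for a continuous piecewise linear fit and then, separately, selects the degree for a fixed number of bins. It assumes the `binsreg` package is installed; the data, the object names, and the particular `nbins` and `pselect` values are illustrative.

```r
library(binsreg)   # assumes the package is installed

set.seed(42)
x <- runif(500)
y <- sin(x) + rnorm(500)

# Number-of-bins selection for degree 1 with 1 smoothness constraint
sel <- binsregselect(y, x, bins = c(1, 1))
sel$nbinsrot.regul   # ROT number of bins, regularized
sel$nbinsdpi         # DPI number of bins
sel$imse.var.dpi     # variance constant in the IMSE expansion, DPI
sel$imse.bsq.dpi     # bias constant in the IMSE expansion, DPI
sel$opt$n            # sample size stored with the options

# Degree selection: fix the number of bins and search over p
sel_p <- binsregselect(y, x, nbins = 20, pselect = 0:3)
sel_p$pdpi           # DPI degree of polynomials
sel_p$sdpi           # DPI number of smoothness constraints
```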

Author(s)

Matias D. Cattaneo, Princeton University, Princeton, NJ.

Richard K. Crump, Federal Reserve Bank of New York, New York, NY.

Max H. Farrell, UC Santa Barbara, Santa Barbara, CA.

Yingjie Feng (maintainer), Tsinghua University, Beijing, China.

References

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024a: On Binscatter. American Economic Review 114(5): 1488-1514.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024b: Nonlinear Binscatter Methods. Working Paper.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024c: Binscatter Regressions. Working Paper.

Examples

library(binsreg)
x <- runif(500)
y <- sin(x) + rnorm(500)
est <- binsregselect(y, x)
summary(est)

Data-Driven Nonparametric Shape Restriction and Parametric Model Specification Testing using Binscatter

Description

binstest implements binscatter-based hypothesis testing procedures for parametric functional forms of and nonparametric shape restrictions on the regression function of interest, following the results in Cattaneo, Crump, Farrell and Feng (2024a) and Cattaneo, Crump, Farrell and Feng (2024b). If the binning scheme is not set by the user, the companion function binsregselect is used to implement binscatter in a data-driven way and inference procedures are based on robust bias correction. Binned scatter plots based on different methods can be constructed using the companion functions binsreg, binsqreg or binsglm.

Usage

binstest(y, x, w = NULL, data = NULL, estmethod = "reg",
  family = gaussian(), quantile = NULL, deriv = 0, at = NULL,
  nolink = F, testmodel = NULL, testmodelparfit = NULL,
  testmodelpoly = NULL, testshape = NULL, testshapel = NULL,
  testshaper = NULL, testshape2 = NULL, lp = Inf, bins = NULL,
  nbins = NULL, pselect = NULL, sselect = NULL, binspos = "qs",
  binsmethod = "dpi", nbinsrot = NULL, randcut = NULL, nsims = 500,
  simsgrid = 20, simsseed = NULL, vce = NULL, cluster = NULL,
  asyvar = F, dfcheck = c(20, 30), masspoints = "on", weights = NULL,
  subset = NULL, numdist = NULL, numclust = NULL, estmethodopt = NULL,
  ...)

Arguments

`y`	outcome variable. A vector.
`x`	independent variable of interest. A vector.
`w`	control variables. A matrix, a vector or a `formula`.
`data`	an optional data frame containing variables used in the model.
`estmethod`	estimation method. The default is `estmethod="reg"` for tests based on binscatter least squares regression. Other options are `"qreg"` for quantile regression and `"glm"` for generalized linear regression. If `estmethod="glm"`, the option `family` must be specified.
`family`	a description of the error distribution and link function to be used in the generalized linear model when `estmethod="glm"`. (See `family` for details of family functions.)
`quantile`	the quantile to be estimated. A number strictly between 0 and 1.
`deriv`	derivative order of the regression function for estimation, testing and plotting. The default is `deriv=0`, which corresponds to the function itself.
`at`	value of `w` at which the estimated function is evaluated. The default is `at="mean"`, which corresponds to the mean of `w`. Other options are: `at="median"` for the median of `w`, `at="zero"` for a vector of zeros. `at` can also be a vector of the same length as the number of columns of `w` (if `w` is a matrix) or a data frame containing the same variables as specified in `w` (when `data` is specified). Note that when `at="mean"` or `at="median"`, all factor variables (if specified) are excluded from the evaluation (set as zero).
`nolink`	if true, the function within the inverse link function is reported instead of the conditional mean function for the outcome.
`testmodel`	a vector or a logical value. It sets the degree of polynomial and the number of smoothness constraints for parametric model specification testing. If `testmodel=c(p,s)` is specified, a piecewise polynomial of degree `p` with `s` smoothness constraints is used. If `testmodel=T` or `testmodel=NULL` (default) is specified, `testmodel=c(1,1)` is used unless the degree `p` or the smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`).
`testmodelparfit`	a data frame or matrix which contains the evaluation grid and fitted values of the model(s) to be tested against. One column contains the evaluation points at which the binscatter model and the parametric model of interest are compared with each other. Each parametric model is represented by one of the other columns, which must contain the fitted values at the corresponding evaluation points.
`testmodelpoly`	degree of a global polynomial model to be tested against.
`testshape`	a vector or a logical value. It sets the degree of polynomial and the number of smoothness constraints for nonparametric shape restriction testing. If `testshape=c(p,s)` is specified, a piecewise polynomial of degree `p` with `s` smoothness constraints is used. If `testshape=T` or `testshape=NULL` (default) is specified, `testshape=c(1,1)` is used unless the degree `p` or smoothness `s` selection is requested via the option `pselect` or `sselect` (see more details in the explanation of `pselect` and `sselect`).
`testshapel`	a vector of null boundary values for hypothesis testing. Each number `a` in the vector corresponds to one boundary of a one-sided hypothesis test to the left of the form `H0: sup_x mu(x)<=a`.
`testshaper`	a vector of null boundary values for hypothesis testing. Each number `a` in the vector corresponds to one boundary of a one-sided hypothesis test to the right of the form `H0: inf_x mu(x)>=a`.
`testshape2`	a vector of null boundary values for hypothesis testing. Each number `a` in the vector corresponds to one boundary of a two-sided hypothesis test of the form `H0: sup_x |mu(x)-a|=0`.
`lp`	an Lp metric used for parametric model specification testing and/or shape restriction testing. The default is `lp=Inf`, which corresponds to the sup-norm of the t-statistic. Other options are `lp=q` for a positive number `q>=1`. Note that `lp=Inf` ("sup-norm") has to be used for testing one-sided shape restrictions.
`bins`	a vector. If `bins=c(p,s)`, it sets the piecewise polynomial of degree `p` with `s` smoothness constraints for data-driven (IMSE-optimal) selection of the partitioning/binning scheme. The default is `bins=c(0,0)`, which corresponds to the piecewise constant.
`nbins`	number of bins for partitioning/binning of `x`. If `nbins=T` or `nbins=NULL` (default) is specified, the number of bins is selected via the companion command `binsregselect` in a data-driven, optimal way whenever possible. If a vector with more than one number is specified, the number of bins is selected within this vector via the companion command `binsregselect`.
`pselect`	vector of numbers within which the degree of polynomial `p` for point estimation is selected. If the selected optimal degree is `p`, then piecewise polynomials of degree `p+1` are used to conduct testing for nonparametric shape restrictions or parametric model specifications. Note: To implement the degree or smoothness selection, in addition to `pselect` or `sselect`, `nbins=#` must be specified.
`sselect`	vector of numbers within which the number of smoothness constraints `s` for point estimation is selected. If the selected optimal smoothness is `s`, then piecewise polynomials of `s+1` smoothness constraints are used to conduct testing for nonparametric shape restrictions or parametric model specifications. If not specified, for each value `p` supplied in the option `pselect`, only the piecewise polynomial with the maximum smoothness is considered, i.e., `s=p`.
`binspos`	position of binning knots. The default is `binspos="qs"`, which corresponds to quantile-spaced binning (canonical binscatter). The other options are `"es"` for evenly-spaced binning, or a vector for manual specification of the positions of inner knots (which must be within the range of `x`).
`binsmethod`	method for data-driven selection of the number of bins. The default is `binsmethod="dpi"`, which corresponds to the IMSE-optimal direct plug-in rule. The other option is: `"rot"` for rule of thumb implementation.
`nbinsrot`	initial number of bins value used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.
`randcut`	upper bound on a uniformly distributed variable used to draw a subsample for bins/degree/smoothness selection. Observations for which `runif()<=#` are used. # must be between 0 and 1. By default, `max(5000, 0.01n)` observations are used if the sample size `n>5000`.
`nsims`	number of random draws for hypothesis testing. The default is `nsims=500`, which corresponds to 500 draws from a standard Gaussian random vector of size `[(p+1)J - (J-1)s]`. Setting at least `nsims=2000` is recommended to obtain the final results.
`simsgrid`	number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum (infimum or Lp metric) operation needed to construct hypothesis testing procedures. The default is `simsgrid=20`, which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum (infimum or Lp metric) operator. Setting at least `simsgrid=50` is recommended to obtain the final results.
`simsseed`	seed for simulation.
`vce`	procedure to compute the variance-covariance matrix estimator. For least squares regression and generalized linear regression, the allowed options are the same as those for `binsreg` or `binsglm`. For quantile regression, the allowed options are the same as those for `binsqreg`.
`cluster`	cluster ID. Used to compute cluster-robust standard errors.
`asyvar`	if true, the standard error of the nonparametric component is computed and the uncertainty related to control variables is omitted. Default is `asyvar=FALSE`, that is, the uncertainty related to control variables is taken into account.
`dfcheck`	adjustments for minimum effective sample size checks, which take into account number of unique values of `x` (i.e., number of mass points), number of clusters, and degrees of freedom of the different statistical models considered. The default is `dfcheck=c(20, 30)`. See Cattaneo, Crump, Farrell and Feng (2024c) for more details.
`masspoints`	how mass points in `x` are handled. Available options: `"on"`, all mass point and degrees of freedom checks are implemented (default); `"noadjust"`, mass point checks and the corresponding effective sample size adjustments are omitted; `"nolocalcheck"`, within-bin mass point and degrees of freedom checks are omitted; `"off"`, `"noadjust"` and `"nolocalcheck"` are set simultaneously; `"veryfew"`, forces the function to proceed as if `x` has only a few mass points (i.e., distinct values), that is, as if the mass point and degrees of freedom checks had failed.
`weights`	an optional vector of weights to be used in the fitting process. Should be `NULL` or a numeric vector. For more details, see `lm`.
`subset`	optional rule specifying a subset of observations to be used.
`numdist`	number of distinct values for selection. Used to speed up computation.
`numclust`	number of clusters for selection. Used to speed up computation.
`estmethodopt`	a list of optional arguments used by `rq` (for quantile regression) or `glm` (for fitting generalized linear models).
`...`	optional arguments to control bootstrapping if `estmethod="qreg"` and `vce="boot"`. See `boot.rq`.
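
The size of the Gaussian vector quoted under `nsims`, `(p+1)J - (J-1)s`, is the number of free coefficients of a piecewise polynomial of degree `p` on `J` bins once `s` smoothness constraints are imposed at each of the `J-1` inner knots. A small helper makes the arithmetic explicit; the function name `n_coef` is hypothetical and not part of the package.

```r
# Hypothetical helper: number of free coefficients for degree p,
# s smoothness constraints per inner knot, and J bins.
n_coef <- function(p, s, J) {
  stopifnot(s <= p, J >= 1)
  (p + 1) * J - (J - 1) * s
}

n_coef(p = 0, s = 0, J = 10)   # piecewise constant: 10
n_coef(p = 1, s = 1, J = 10)   # continuous piecewise linear: 11
n_coef(p = 3, s = 3, J = 10)   # cubic with maximum smoothness: 13
```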

Value

`testshapeL`	Results for `testshapel`, including: `testvalL`, null boundary values; `stat.shapeL`, test statistics; and `pval.shapeL`, p-value.
`testshapeR`	Results for `testshaper`, including: `testvalR`, null boundary values; `stat.shapeR`, test statistics; and `pval.shapeR`, p-value.
`testshape2`	Results for `testshape2`, including: `testval2`, null boundary values; `stat.shape2`, test statistics; and `pval.shape2`, p-value.
`testpoly`	Results for `testmodelpoly`, including: `testpoly`, the degree of global polynomial; `stat.poly`, test statistic; `pval.poly`, p-value.
`testmodel`	Results for `testmodelparfit`, including: `stat.model`, test statistics; `pval.model`, p-values.
`imse.var.rot`	Variance constant in IMSE, ROT selection.
`imse.bsq.rot`	Bias constant in IMSE, ROT selection.
`imse.var.dpi`	Variance constant in IMSE, DPI selection.
`imse.bsq.dpi`	Bias constant in IMSE, DPI selection.
`opt`	A list containing options passed to the function, as well as total sample size `n`, number of distinct values `Ndist` in `x`, number of clusters `Nclust`, and number of bins `nbins`.
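
The test results listed above are stored as named components of the returned object. The sketch below assumes the `binsreg` package is installed; the simulated data, the null boundary value 2, and the object name `tst` are illustrative.

```r
library(binsreg)   # assumes the package is installed

set.seed(42)
x <- runif(500)
y <- sin(x) + rnorm(500)

# Test a global linear specification and the one-sided
# restriction H0: sup_x mu(x) <= 2 in a single call
tst <- binstest(y, x, testmodelpoly = 1, testshapel = 2,
                nsims = 2000, simsgrid = 50, simsseed = 1)

tst$testpoly$stat.poly       # test statistic for the linear model
tst$testpoly$pval.poly       # its p-value
tst$testshapeL$testvalL      # null boundary value(s)
tst$testshapeL$pval.shapeL   # corresponding p-value(s)
tst$opt$nbins                # number of bins used
```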

Author(s)

Matias D. Cattaneo, Princeton University, Princeton, NJ.

Richard K. Crump, Federal Reserve Bank of New York, New York, NY.

Max H. Farrell, UC Santa Barbara, Santa Barbara, CA.

Yingjie Feng (maintainer), Tsinghua University, Beijing, China.

References

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024a: On Binscatter. American Economic Review 114(5): 1488-1514.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024b: Nonlinear Binscatter Methods. Working Paper.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024c: Binscatter Regressions. Working Paper.

Examples

library(binsreg)
x <- runif(500)
y <- sin(x) + rnorm(500)
est <- binstest(y, x, testmodelpoly = 1)
summary(est)
