Title: | Sample Selection Models |
---|---|
Description: | Two-step and maximum likelihood estimation of Heckman-type sample selection models: standard sample selection models (Tobit-2), endogenous switching regression models (Tobit-5), sample selection models with binary dependent outcome variable, interval regression with sample selection (only ML estimation), and endogenous treatment effects models. These methods are described in the three vignettes that are included in this package and in econometric textbooks such as Greene (2011, Econometric Analysis, 7th edition, Pearson). |
Authors: | Arne Henningsen [aut, cre], Ott Toomet [aut], Sebastian Petersen [ctb] |
Maintainer: | Arne Henningsen <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.2-12 |
Built: | 2025-01-17 07:58:58 UTC |
Source: | CRAN |
This function extracts coefficients from sample selection models
## S3 method for class 'selection' coef(object, part = "full", ...) ## S3 method for class 'summary.selection' coef(object, part = "full", ...) ## S3 method for class 'coef.selection' print( x, prefix = TRUE, digits = max(3, getOption("digits") - 3), ... )
## S3 method for class 'selection' coef(object, part = "full", ...) ## S3 method for class 'summary.selection' coef(object, part = "full", ...) ## S3 method for class 'coef.selection' print( x, prefix = TRUE, digits = max(3, getOption("digits") - 3), ... )
object |
object of class |
part |
character string indicating
which parts of the coefficients to extract:
|
x |
object returned by |
prefix |
logical. Add a prefix to the names of the coefficients that indicates to which equation the coefficient belongs. |
digits |
numeric, (suggested) number of significant digits. |
... |
currently not used. |
coef.selection
returns a vector of the estimated coefficients.
coef.summary.selection
returns a matrix of the estimated coefficients,
their standard errors, t-values, and p-values.
Arne Henningsen, Ott Toomet ([email protected])
coef
, selection
, vcov.selection
,
and selection-methods
.
## Estimate a simple female wage model taking into account the labour ## force participation data(Mroz87) a <- heckit(lfp ~ huswage + kids5 + mtr + fatheduc + educ + city, log(wage) ~ educ + city, data=Mroz87) ## extract all coefficients of the model: coef( a ) ## now extract the coefficients of the outcome model only: coef( a, part="outcome") ## extract all coefficients, standard errors, t-values ## and p-values of the model: coef( summary( a ) ) ## now extract the coefficients, standard errors, t-values ## and p-values of the outcome model only: coef( summary( a ), part="outcome")
## Estimate a simple female wage model taking into account the labour ## force participation data(Mroz87) a <- heckit(lfp ~ huswage + kids5 + mtr + fatheduc + educ + city, log(wage) ~ educ + city, data=Mroz87) ## extract all coefficients of the model: coef( a ) ## now extract the coefficients of the outcome model only: coef( a, part="outcome") ## extract all coefficients, standard errors, t-values ## and p-values of the model: coef( summary( a ) ) ## now extract the coefficients, standard errors, t-values ## and p-values of the outcome model only: coef( summary( a ), part="outcome")
Calculate fitted values of sample selection models
## S3 method for class 'selection' fitted(object, part = "outcome", ... )
## S3 method for class 'selection' fitted(object, part = "outcome", ... )
object |
object of class |
part |
character string indication which fitted values to extract: "outcome" for the fitted values of the outcome equation(s) or "selection" for the fitted values of the selection equation. |
... |
further arguments passed to other methods
(e.g. |
If the model was estimated by the 2-step method,
the fitted values of the outcome equation are calculated
using all regressors of this equation
including the inverse Mill's ratios,
i.e. the fitted values correspond to the conditional expectations
(see predict.selection
).
If the model was estimated by the maximum likelihood method,
the fitted values of the outcome equation are calculated
by disregarding the selection equation,
i.e. the fitted values correspond to the unconditional expectations
(see predict.selection
).
The fitted values of the selection equation are probabilities. If the dependent variable of the outcome equation is binary, also the fitted values of the outcome equation are probabilities.
A numeric vector of the fitted values.
Arne Henningsen
fitted
, selection
,
residuals.selection
, and selection-methods
.
Calculate the asymptotic covariance matrix for the coefficients of a Heckit estimation
heckitVcov( xMat, wMat, vcovProbit, rho, delta, sigma, saveMemory = TRUE )
heckitVcov( xMat, wMat, vcovProbit, rho, delta, sigma, saveMemory = TRUE )
xMat |
model matrix of the 2nd step estimation. |
wMat |
model matrix of the 1st step probit estimation. |
vcovProbit |
variance covariance matrix of the 1st step probit estimation. |
rho |
the estimated |
delta |
the estimated |
sigma |
the estimated |
saveMemory |
logical. Save memory by using a different implementation of the formula? (this should not influence the results). |
The formula implemented in heckitVcov
is available,
e.g., in Greene (2003), last formula on page 785.
the variance covariance matrix of the coefficients.
Arne Henningsen
Greene, W. H. (2003) Econometric Analysis, Fifth Edition, Prentice Hall.
Lee, L., G. Maddala and R. Trost (1980) Asymetric covariance matrices of two-stage probit and two-stage tobit methods for simultaneous equations models with selectivity. Econometrica, 48, p. 491-503.
Calculates the 'Inverse Mill's Ratios' of univariate and bivariate probit models.
invMillsRatio( x, all = FALSE )
invMillsRatio( x, all = FALSE )
x |
|
all |
a logical value indicating whether the inverse Mill's Ratios should be calculated for all observations. |
The formula to calculate the inverse Mill's ratios for univariate probit models is taken from Greene (2003, p. 785), whereas the formulas for bivariate probit models are derived in Henning and Henningsen (2005).
A data frame that contains the Inverse Mill's Ratios (IMR) and the delta values (see Greene, 2003, p. 784).
If a univariate probit estimation is provided, the variables
IMR1
and IMR0
are the Inverse Mill's Ratios to correct
for a sample selection bias of y = 1 and y = 0, respectively.
Accordingly, 'delta1' and 'delta0' are the corresponding delta values.
If a bivariate probit estimation is provided, the variables
IMRa1
, IMRa0
, IMRb1
, and IMRb0
are the
Inverse Mills Ratios to correct for a sample selection bias
of y = 1 and y = 0 in equations 'a' and 'b', respectively.
Accordingly, 'deltaa1', 'deltaa0', 'deltab1' and 'deltab0' are the
corresponding delta values.
Arne Henningsen
Greene, W. H. (2003) Econometric Analysis, Fifth Edition, Prentice Hall.
Henning, C.H.C.A and A. Henningsen (2005) Modeling Price Response of Farm Households in Imperfect Labor Markets in Poland: Incorporating Transaction Costs and Heterogeneity into a Farm Household Approach. Unpublished, University of Kiel, Germany.
## Wooldridge( 2003 ): example 17.5, page 590 data(Mroz87) myProbit <- glm( lfp ~ nwifeinc + educ + exper + I( exper^2 ) + age + kids5 + kids618, family = binomial( link = "probit" ), data=Mroz87 ) Mroz87$IMR <- invMillsRatio( myProbit )$IMR1 myHeckit <- lm( log( wage ) ~ educ + exper + I( exper^2 ) + IMR, data = Mroz87[ Mroz87$lfp == 1, ] ) # using NO labor force participation as endogenous variable Mroz87$nolfp <- 1 - Mroz87$lfp myProbit2 <- glm( nolfp ~ nwifeinc + educ + exper + I( exper^2 ) + age + kids5 + kids618, family = binomial( link = "probit" ), data=Mroz87 ) all.equal( invMillsRatio( myProbit )$IMR1, invMillsRatio( myProbit2 )$IMR0 ) # should be true # example for bivariate probit library( "mvtnorm" ) library( "VGAM" ) nObs <- 1000 # error terms (trivariate normal) sigma <- symMatrix( c( 2, 0.7, 1.2, 1, 0.5, 1 ) ) myData <- as.data.frame( rmvnorm( nObs, c( 0, 0, 0 ), sigma ) ) names( myData ) <- c( "e0", "e1", "e2" ) # exogenous variables (indepently normal) myData$x0 <- rnorm( nObs ) myData$x1 <- rnorm( nObs ) myData$x2 <- rnorm( nObs ) # endogenous variables myData$y0 <- -1.5 + 0.8 * myData$x1 + myData$e0 myData$y1 <- ( 0.3 + 0.4 * myData$x1 + 0.3 * myData$x2 + myData$e1 ) > 0 myData$y2 <- ( -0.1 + 0.6 * myData$x1 + 0.7 * myData$x2 + myData$e2 ) > 0 # bivariate probit (using rhobit transformation) bProbit <- vglm( cbind( y1, y2 ) ~ x1 + x2, family = binom2.rho, data = myData ) summary( bProbit ) # bivariate probit (NOT using rhobit transformation) bProbit2 <- vglm( cbind( y1, y2 ) ~ x1 + x2, family = binom2.rho( lrho = "identitylink" ), data = myData ) summary( bProbit2 ) # inverse Mills Ratios imr <- invMillsRatio( bProbit ) imr2 <- invMillsRatio( bProbit2 ) all.equal( imr, imr2, tolerance = .Machine$double.eps ^ 0.25) # tests # E[ e0 | y1* > 0 & y2* > 0 ] mean( myData$e0[ myData$y1 & myData$y2 ] ) mean( sigma[1,2] * imr$IMR11a + sigma[1,3] * imr$IMR11b, na.rm = TRUE ) # E[ e0 | y1* > 0 & y2* <= 0 ] mean( myData$e0[ myData$y1 & !myData$y2 ] ) mean( sigma[1,2] * imr$IMR10a + sigma[1,3] * imr$IMR10b, na.rm = TRUE ) # E[ e0 | y1* <= 0 & y2* > 0 ] mean( myData$e0[ !myData$y1 & myData$y2 ] ) mean( sigma[1,2] * imr$IMR01a + sigma[1,3] * imr$IMR01b, na.rm = TRUE ) # E[ e0 | y1* <= 0 & y2* <= 0 ] mean( myData$e0[ !myData$y1 & !myData$y2 ] ) mean( sigma[1,2] * imr$IMR00a + sigma[1,3] * imr$IMR00b, na.rm = TRUE ) # E[ e0 | y1* > 0 ] mean( myData$e0[ myData$y1 ] ) mean( sigma[1,2] * imr$IMR1X, na.rm = TRUE ) # E[ e0 | y1* <= 0 ] mean( myData$e0[ !myData$y1 ] ) mean( sigma[1,2] * imr$IMR0X, na.rm = TRUE ) # E[ e0 | y2* > 0 ] mean( myData$e0[ myData$y2 ] ) mean( sigma[1,3] * imr$IMRX1, na.rm = TRUE ) # E[ e0 | y2* <= 0 ] mean( myData$e0[ !myData$y2 ] ) mean( sigma[1,3] * imr$IMRX0, na.rm = TRUE ) # estimation for y1* > 0 and y2* > 0 selection <- myData$y1 & myData$y2 # OLS estimation ols11 <- lm( y0 ~ x1, data = myData, subset = selection ) summary( ols11 ) # heckman type estimation heckit11 <- lm( y0 ~ x1 + IMR11a + IMR11b, data = cbind( myData, imr ), subset = selection ) summary( heckit11 ) # estimation for y1* > 0 and y2* <= 0 selection <- myData$y1 & !myData$y2 # OLS estimation ols10 <- lm( y0 ~ x1, data = myData, subset = selection ) summary( ols10 ) # heckman type estimation heckit10 <- lm( y0 ~ x1 + IMR10a + IMR10b, data = cbind( myData, imr ), subset = selection ) summary( heckit10 ) # estimation for y1* <= 0 and y2* > 0 selection <- !myData$y1 & myData$y2 # OLS estimation ols01 <- lm( y0 ~ x1, data = myData, subset = selection ) summary( ols01 ) # heckman type estimation heckit01 <- lm( y0 ~ x1 + IMR01a + IMR01b, data = cbind( myData, imr ), subset = selection ) summary( heckit01 ) # estimation for y1* <= 0 and y2* <= 0 selection <- !myData$y1 & !myData$y2 # OLS estimation ols00 <- lm( y0 ~ x1, data = myData, subset = selection ) summary( ols00 ) # heckman type estimation heckit00 <- lm( y0 ~ x1 + IMR00a + IMR00b, data = cbind( myData, imr ), subset = selection ) summary( heckit00 ) # estimation for y1* > 0 selection <- myData$y1 # OLS estimation ols1X <- lm( y0 ~ x1, data = myData, subset = selection ) summary( ols1X ) # heckman type estimation heckit1X <- lm( y0 ~ x1 + IMR1X, data = cbind( myData, imr ), subset = selection ) summary( heckit1X ) # estimation for y1* <= 0 selection <- !myData$y1 # OLS estimation ols0X <- lm( y0 ~ x1, data = myData, subset = selection ) summary( ols0X ) # heckman type estimation heckit0X <- lm( y0 ~ x1 + IMR0X, data = cbind( myData, imr ), subset = selection ) summary( heckit0X ) # estimation for y2* > 0 selection <- myData$y2 # OLS estimation olsX1 <- lm( y0 ~ x1, data = myData, subset = selection ) summary( olsX1 ) # heckman type estimation heckitX1 <- lm( y0 ~ x1 + IMRX1, data = cbind( myData, imr ), subset = selection ) summary( heckitX1 ) # estimation for y2* <= 0 selection <- !myData$y2 # OLS estimation olsX0 <- lm( y0 ~ x1, data = myData, subset = selection ) summary( olsX0 ) # heckman type estimation heckitX0 <- lm( y0 ~ x1 + IMRX0, data = cbind( myData, imr ), subset = selection ) summary( heckitX0 )
## Wooldridge( 2003 ): example 17.5, page 590 data(Mroz87) myProbit <- glm( lfp ~ nwifeinc + educ + exper + I( exper^2 ) + age + kids5 + kids618, family = binomial( link = "probit" ), data=Mroz87 ) Mroz87$IMR <- invMillsRatio( myProbit )$IMR1 myHeckit <- lm( log( wage ) ~ educ + exper + I( exper^2 ) + IMR, data = Mroz87[ Mroz87$lfp == 1, ] ) # using NO labor force participation as endogenous variable Mroz87$nolfp <- 1 - Mroz87$lfp myProbit2 <- glm( nolfp ~ nwifeinc + educ + exper + I( exper^2 ) + age + kids5 + kids618, family = binomial( link = "probit" ), data=Mroz87 ) all.equal( invMillsRatio( myProbit )$IMR1, invMillsRatio( myProbit2 )$IMR0 ) # should be true # example for bivariate probit library( "mvtnorm" ) library( "VGAM" ) nObs <- 1000 # error terms (trivariate normal) sigma <- symMatrix( c( 2, 0.7, 1.2, 1, 0.5, 1 ) ) myData <- as.data.frame( rmvnorm( nObs, c( 0, 0, 0 ), sigma ) ) names( myData ) <- c( "e0", "e1", "e2" ) # exogenous variables (indepently normal) myData$x0 <- rnorm( nObs ) myData$x1 <- rnorm( nObs ) myData$x2 <- rnorm( nObs ) # endogenous variables myData$y0 <- -1.5 + 0.8 * myData$x1 + myData$e0 myData$y1 <- ( 0.3 + 0.4 * myData$x1 + 0.3 * myData$x2 + myData$e1 ) > 0 myData$y2 <- ( -0.1 + 0.6 * myData$x1 + 0.7 * myData$x2 + myData$e2 ) > 0 # bivariate probit (using rhobit transformation) bProbit <- vglm( cbind( y1, y2 ) ~ x1 + x2, family = binom2.rho, data = myData ) summary( bProbit ) # bivariate probit (NOT using rhobit transformation) bProbit2 <- vglm( cbind( y1, y2 ) ~ x1 + x2, family = binom2.rho( lrho = "identitylink" ), data = myData ) summary( bProbit2 ) # inverse Mills Ratios imr <- invMillsRatio( bProbit ) imr2 <- invMillsRatio( bProbit2 ) all.equal( imr, imr2, tolerance = .Machine$double.eps ^ 0.25) # tests # E[ e0 | y1* > 0 & y2* > 0 ] mean( myData$e0[ myData$y1 & myData$y2 ] ) mean( sigma[1,2] * imr$IMR11a + sigma[1,3] * imr$IMR11b, na.rm = TRUE ) # E[ e0 | y1* > 0 & y2* <= 0 ] mean( myData$e0[ myData$y1 & !myData$y2 ] ) mean( sigma[1,2] * imr$IMR10a + sigma[1,3] * imr$IMR10b, na.rm = TRUE ) # E[ e0 | y1* <= 0 & y2* > 0 ] mean( myData$e0[ !myData$y1 & myData$y2 ] ) mean( sigma[1,2] * imr$IMR01a + sigma[1,3] * imr$IMR01b, na.rm = TRUE ) # E[ e0 | y1* <= 0 & y2* <= 0 ] mean( myData$e0[ !myData$y1 & !myData$y2 ] ) mean( sigma[1,2] * imr$IMR00a + sigma[1,3] * imr$IMR00b, na.rm = TRUE ) # E[ e0 | y1* > 0 ] mean( myData$e0[ myData$y1 ] ) mean( sigma[1,2] * imr$IMR1X, na.rm = TRUE ) # E[ e0 | y1* <= 0 ] mean( myData$e0[ !myData$y1 ] ) mean( sigma[1,2] * imr$IMR0X, na.rm = TRUE ) # E[ e0 | y2* > 0 ] mean( myData$e0[ myData$y2 ] ) mean( sigma[1,3] * imr$IMRX1, na.rm = TRUE ) # E[ e0 | y2* <= 0 ] mean( myData$e0[ !myData$y2 ] ) mean( sigma[1,3] * imr$IMRX0, na.rm = TRUE ) # estimation for y1* > 0 and y2* > 0 selection <- myData$y1 & myData$y2 # OLS estimation ols11 <- lm( y0 ~ x1, data = myData, subset = selection ) summary( ols11 ) # heckman type estimation heckit11 <- lm( y0 ~ x1 + IMR11a + IMR11b, data = cbind( myData, imr ), subset = selection ) summary( heckit11 ) # estimation for y1* > 0 and y2* <= 0 selection <- myData$y1 & !myData$y2 # OLS estimation ols10 <- lm( y0 ~ x1, data = myData, subset = selection ) summary( ols10 ) # heckman type estimation heckit10 <- lm( y0 ~ x1 + IMR10a + IMR10b, data = cbind( myData, imr ), subset = selection ) summary( heckit10 ) # estimation for y1* <= 0 and y2* > 0 selection <- !myData$y1 & myData$y2 # OLS estimation ols01 <- lm( y0 ~ x1, data = myData, subset = selection ) summary( ols01 ) # heckman type estimation heckit01 <- lm( y0 ~ x1 + IMR01a + IMR01b, data = cbind( myData, imr ), subset = selection ) summary( heckit01 ) # estimation for y1* <= 0 and y2* <= 0 selection <- !myData$y1 & !myData$y2 # OLS estimation ols00 <- lm( y0 ~ x1, data = myData, subset = selection ) summary( ols00 ) # heckman type estimation heckit00 <- lm( y0 ~ x1 + IMR00a + IMR00b, data = cbind( myData, imr ), subset = selection ) summary( heckit00 ) # estimation for y1* > 0 selection <- myData$y1 # OLS estimation ols1X <- lm( y0 ~ x1, data = myData, subset = selection ) summary( ols1X ) # heckman type estimation heckit1X <- lm( y0 ~ x1 + IMR1X, data = cbind( myData, imr ), subset = selection ) summary( heckit1X ) # estimation for y1* <= 0 selection <- !myData$y1 # OLS estimation ols0X <- lm( y0 ~ x1, data = myData, subset = selection ) summary( ols0X ) # heckman type estimation heckit0X <- lm( y0 ~ x1 + IMR0X, data = cbind( myData, imr ), subset = selection ) summary( heckit0X ) # estimation for y2* > 0 selection <- myData$y2 # OLS estimation olsX1 <- lm( y0 ~ x1, data = myData, subset = selection ) summary( olsX1 ) # heckman type estimation heckitX1 <- lm( y0 ~ x1 + IMRX1, data = cbind( myData, imr ), subset = selection ) summary( heckitX1 ) # estimation for y2* <= 0 selection <- !myData$y2 # OLS estimation olsX0 <- lm( y0 ~ x1, data = myData, subset = selection ) summary( olsX0 ) # heckman type estimation heckitX0 <- lm( y0 ~ x1 + IMRX0, data = cbind( myData, imr ), subset = selection ) summary( heckitX0 )
Calculates the (unobservable) linear predictors for probability models.
linearPredictors(x, ...) ## S3 method for class 'probit' linearPredictors( x, ... )
linearPredictors(x, ...) ## S3 method for class 'probit' linearPredictors( x, ... )
x |
model of an appropriate class |
... |
other arguments depending on the method |
It is a generic function with a method for 'probit'.
A matrix with nrow equal to the number of observations and one column: the linear predictors for observations
Ott Toomet [email protected], Arne Henningsen
probit
and probit-methods
.
data(Mroz87) Mroz87$kids <- ( Mroz87$kids5 + Mroz87$kids618 > 0 ) a <- probit(lfp ~ kids + educ + hushrs + huseduc + huswage + mtr + motheduc, data=Mroz87) b <- linearPredictors(a) cor(Mroz87$lfp, b) # should be positive and highly significant
data(Mroz87) Mroz87$kids <- ( Mroz87$kids5 + Mroz87$kids618 > 0 ) a <- probit(lfp ~ kids + educ + hushrs + huseduc + huswage + mtr + motheduc, data=Mroz87) b <- linearPredictors(a) cor(Mroz87$lfp, b) # should be positive and highly significant
Return the variables used for estimating a binary choice model
## S3 method for class 'binaryChoice' model.frame( formula, ... )
## S3 method for class 'binaryChoice' model.frame( formula, ... )
formula |
object of class |
... |
further arguments passed to other methods
(e.g. |
A data.frame containing all variables used for the estimation.
Arne Henningsen, Ott Toomet [email protected]
binaryChoice
, probit
,
model.frame
, model.matrix.binaryChoice
,
and probit-methods
.
Return the variables used for estimating a sample selection model
## S3 method for class 'selection' model.frame(formula, ... )
## S3 method for class 'selection' model.frame(formula, ... )
formula |
object of class |
... |
further arguments passed to other methods
(e.g. |
A data.frame containing all variables used for the estimation. The “terms” attribute contains the terms for the selection equation.
Arne Henningsen
model.frame
, selection
,
model.matrix.selection
, and selection-methods
.
Create design matrix of binary choice models
## S3 method for class 'binaryChoice' model.matrix( object, ... )
## S3 method for class 'binaryChoice' model.matrix( object, ... )
object |
object of class |
... |
currently not used. |
The design matrix of binary choice models.
Arne Henningsen, Ott Toomet [email protected]
binaryChoice
, probit
,
model.matrix
, model.frame.binaryChoice
,
and probit-methods
.
Create design matrix of sample selection models
## S3 method for class 'selection' model.matrix(object, part = "outcome", ... )
## S3 method for class 'selection' model.matrix(object, part = "outcome", ... )
object |
object of class |
part |
character string indication which design matrix/matrices to extract: "outcome" for the design matrix/matrices of the outcome equation(s) or "selection" for the design matrix of the selection equation. |
... |
further arguments passed to other methods
(e.g. |
If argument part
is "selection"
,
the design matrix of the selection equation is returned.
If argument part
is "outcome"
,
the design matrix of the outcome equation (tobit-2 or treatment
model)
or a list of two outcome matrices (tobit-5 model) is returned.
All unobserved outcomes, including the corresponding explanatory
variables
are set to NA
, even in case where
valid values were supplied
for estimation.
Arne Henningsen
model.matrix
, selection
,
model.frame.selection
, and selection-methods
.
The Mroz87
data frame contains data about 753 married women.
These data are collected within the "Panel Study of Income Dynamics" (PSID).
Of the 753 observations, the first 428 are for women with positive hours
worked in 1975, while the remaining 325 observations are for women who
did not work for pay in 1975. A more complete discussion of the data is
found in Mroz (1987), Appendix 1.
data(Mroz87)
data(Mroz87)
This data frame contains the following columns:
Dummy variable for labor-force participation.
Wife's hours of work in 1975.
Number of children 5 years old or younger.
Number of children 6 to 18 years old.
Wife's age.
Wife's educational attainment, in years.
Wife's average hourly earnings, in 1975 dollars.
Wife's wage reported at the time of the 1976 interview.
Husband's hours worked in 1975.
Husband's age.
Husband's educational attainment, in years.
Husband's wage, in 1975 dollars.
Family income, in 1975 dollars.
Marginal tax rate facing the wife.
Wife's mother's educational attainment, in years.
Wife's father's educational attainment, in years.
Unemployment rate in county of residence, in percentage points.
Dummy variable = 1 if live in large city, else 0.
Actual years of wife's previous labor market experience.
Non-wife income.
Dummy variable for wife's college attendance.
Dummy variable for husband's college attendance.
Mroz, T. A. (1987) The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. Econometrica 55, 765–799.
PSID Staff, The Panel Study of Income Dynamics, Institute for Social ResearchPanel Study of Income Dynamics, University of Michigan, https://psidonline.isr.umich.edu.
## Wooldridge( 2003 ): example 17.5, page 590 data( Mroz87 ) # Two-step estimation summary( heckit( lfp ~ nwifeinc + educ + exper + I( exper^2 ) + age + kids5 + kids618, log( wage ) ~ educ + exper + I( exper^2 ), Mroz87, method = "2step" ) )
## Wooldridge( 2003 ): example 17.5, page 590 data( Mroz87 ) # Two-step estimation summary( heckit( lfp ~ nwifeinc + educ + exper + I( exper^2 ) + age + kids5 + kids618, log( wage ) ~ educ + exper + I( exper^2 ), Mroz87, method = "2step" ) )
The nlswork
data frame contains data about 4711 young working women
who had an age of 14–26 years in 1968.
These data are collected within the "National Longitudinal Survey"
over the years 1968-1988 (with gaps).
There are 28534 observations in total.
data(nlswork)
data(nlswork)
This data frame contains the following columns:
NLS ID.
interview year.
birth year.
age in current year.
1=white, 2=black, 3=other.
1 if married, spouse present.
1 if never married.
current grade completed.
1 if college graduate.
1 if not SMSA.
1 if central city.
1 if south.
industry of employment.
occupation.
1 if union.
weeks unemployed last year.
total work experience.
job tenure, in years.
usual hours worked.
weeks worked last year.
ln(wage/GNP deflator).
Two different versions of this data set are available on the internet.
They are slighly different:
The variable wks_work
(weeks worked last year)
is 101
in this version (from Stata),
but NA
in the version provided by the Boston College
for the observation with idcode = 1
and year = 83
.
Moreover, this variable
is NA
in this version (from Stata),
but 104
in the version provided by the Boston College
for the observation with idcode = 2
and year = 87
.
Datasets for Stata Longitudinal/Panel-Data Reference Manual, Release 10: National Longitudinal Survey. Young Women 14-26 years of age in 1968, https://www.stata-press.com/data/r10/nlswork.dta.
Boston College, National Longitudinal Survey. Young Women 14-26 years of age in 1968, https://fmwww.bc.edu/ec-p/data/stata/nlswork.dta.
data( "nlswork" ) summary( nlswork ) ## Not run: library( "plm" ) nlswork <- plm.data( nlswork, c( "idcode", "year" ) ) plmResult <- plm( ln_wage ~ union + age + grade + not_smsa + south + occ_code, data = nlswork, model = "random" ) summary( plmResult ) ## End(Not run)
data( "nlswork" ) summary( nlswork ) ## Not run: library( "plm" ) nlswork <- plm.data( nlswork, c( "idcode", "year" ) ) plmResult <- plm( ln_wage ~ union + age + grade + not_smsa + south + occ_code, data = nlswork, model = "random" ) summary( plmResult ) ## End(Not run)
Calculate predicted values for fitted probit
models.
## S3 method for class 'probit' predict( object, newdata = NULL, type = "link", ... )
## S3 method for class 'probit' predict( object, newdata = NULL, type = "link", ... )
object |
a fitted object of class |
newdata |
optionally, a data frame in which to look for variables with which to predict. If omitted, the fitted linear predictors or the fitted response values are returned. |
type |
the type of prediction.
If this argument is |
... |
further arguments (currently ignored). |
A numeric vector of the predicted values.
Arne Henningsen and the R Core Team (the code of predict.probit
is partly based on the code of predict.lm
and predict.glm
).
probit
, predict
,
predict.glm
, residuals.probit
,
and probit-methods
.
## female labour force participation probability data( "Mroz87" ) m <- probit( lfp ~ kids5 + kids618 + educ + hushrs + huseduc + huswage + mtr + motheduc, data=Mroz87 ) predict( m ) # equal to linearPredictors(m) predict( m, type = "response" ) # equal to fitted(m) predict( m, newdata = Mroz87[ 3:9, ] ) # equal to linearPredictors(m)[3:9] predict( m, newdata = Mroz87[ 3:9, ], type = "response" ) # equal to fitted(m)[3:9]
## female labour force participation probability data( "Mroz87" ) m <- probit( lfp ~ kids5 + kids618 + educ + hushrs + huseduc + huswage + mtr + motheduc, data=Mroz87 ) predict( m ) # equal to linearPredictors(m) predict( m, type = "response" ) # equal to fitted(m) predict( m, newdata = Mroz87[ 3:9, ] ) # equal to linearPredictors(m)[3:9] predict( m, newdata = Mroz87[ 3:9, ], type = "response" ) # equal to fitted(m)[3:9]
Calculate predicted values for sample selection models
fitted with function selection
.
## S3 method for class 'selection' predict( object, newdata = NULL, part = ifelse( type %in% c( "unconditional", "conditional" ), "outcome", "selection" ), type = "unconditional", ... )
## S3 method for class 'selection' predict( object, newdata = NULL, part = ifelse( type %in% c( "unconditional", "conditional" ), "outcome", "selection" ), type = "unconditional", ... )
object |
a fitted object of class |
newdata |
optionally, a data frame in which to look for variables with which to predict. If omitted, the fitted linear predictors or the fitted response values are returned. |
part |
character string indicating for which equation
the predicted variables should be calculated:
|
type |
if argument |
... |
further arguments (currently ignored). |
In most cases, a numeric vector of the predicted values is returned. However, there are three exceptions: (i) when predicting the unconditional expectations of a Tobit-5 model, a matrix with two columns is returned, where the two columns correspond to the two outcome equations (E[yo1] and E[yo2]); (ii) when predicting the conditional expectations of a Tobit-2 model, a matrix with two columns is returned, where the first column returns the expectations conditional on the observation being not selected (E[yo|ys=0]), while the second column returns the expectations conditional on the observation being selected (E[yo|ys=1]); (iii) when predicting the conditional expectations of a Tobit-5 model, a matrix with four columns is returned, where the first two columns return the conditional expectations of the first outcome equation (E[yo1|ys=0] and E[yo1|ys=1]) and the last two columns return the conditional expectations of the second outcome equation (E[yo2|ys=0] and E[yo2|ys=1]).
Arne Henningsen and ‘fg nu’ (the code is partly based on the code posted by ‘fg nu’ at https://stackoverflow.com/questions/14005788/predict-function-for-heckman-model)
selection
, predict
,
predict.probit
, residuals.selection
,
and selection-methods
.
## Greene( 2003 ): example 22.8, page 786 data( Mroz87 ) Mroz87$kids <- ( Mroz87$kids5 + Mroz87$kids618 > 0 ) # ML estimation m <- selection( lfp ~ age + I( age^2 ) + faminc + kids + educ, wage ~ exper + I( exper^2 ) + educ + city, Mroz87 ) predict( m ) predict( m, type = "conditional" ) predict( m, type = "link" ) predict( m, type = "response" ) predict( m, newdata = Mroz87[ 3:9, ] )
## Greene( 2003 ): example 22.8, page 786 data( Mroz87 ) Mroz87$kids <- ( Mroz87$kids5 + Mroz87$kids618 > 0 ) # ML estimation m <- selection( lfp ~ age + I( age^2 ) + faminc + kids + educ, wage ~ exper + I( exper^2 ) + educ + city, Mroz87 ) predict( m ) predict( m, type = "conditional" ) predict( m, type = "link" ) predict( m, type = "response" ) predict( m, newdata = Mroz87[ 3:9, ] )
Binary Choice models. These models are estimated by
binaryChoice
, intended to be called by wrappers like
probit
.
probit(formula, weights = NULL, ...) binaryChoice(formula, subset, na.action, start = NULL, data = sys.frame(sys.parent()), x=FALSE, y = FALSE, model = FALSE, method="ML", userLogLik=NULL, cdfLower, cdfUpper=function(x) 1 - cdfLower(x), logCdfLower=NULL, logCdfUpper=NULL, pdf, logPdf=NULL, gradPdf, maxMethod="Newton-Raphson", ... )
probit(formula, weights = NULL, ...) binaryChoice(formula, subset, na.action, start = NULL, data = sys.frame(sys.parent()), x=FALSE, y = FALSE, model = FALSE, method="ML", userLogLik=NULL, cdfLower, cdfUpper=function(x) 1 - cdfLower(x), logCdfLower=NULL, logCdfUpper=NULL, pdf, logPdf=NULL, gradPdf, maxMethod="Newton-Raphson", ... )
formula |
a symbolic description of the model to be fit, in the
form |
weights |
an optional vector of ‘prior weights’ to be used in the fitting process. Should be NULL or a numeric vector. |
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
na.action |
a function which indicates what should happen when the data contain 'NA's. The default is set by the 'na.action' setting of 'options', and is 'na.fail' if that is unset. The 'factory-fresh' default is 'na.omit'. Another possible value is 'NULL', no action. Value 'na.exclude' can be useful. |
start |
inital value of parameters. |
data |
an optional data frame containing the variables in the
model. If not found in data, the
variables are taken from environment(formula), typically the
environment from which |
x , y , model
|
logicals. If TRUE the corresponding components of the fit (the model matrix, the response, the model frame) are returned. |
method |
the method to use; for fitting, currently only method = "ML" (Maximum Likelihood) is supported; method = "model.frame" returns the model frame (the same as with model = TRUE, see below). |
userLogLik |
log-likelihood function. A function of the
parameter to be estimated, which computes
the log likelihood. If supplied, it will be used instead of
|
cdfLower , cdfUpper , pdf , gradPdf
|
function, lower and upper tail of the cumulative distribution function of the disturbance term, corresponding probability density function, and gradient of the density function. These functions must take a numeric vector as the argument, and return numeric vector of the probability/gradient values. |
logCdfLower , logCdfUpper , logPdf
|
logs of the corresponding functions. Providing these may improve precision in extreme tail. If not provided, simply logs are takes of the corresponding non-log values. |
maxMethod |
character, a maximisation method supported by
|
... |
further arguments for |
The dependent variable for the binary choice models must have exactly two levels (e.g. '0' and '1', 'FALSE' and 'TRUE', or 'no' and 'yes'). Internally, the first level is always coded '0' ('failure') and the second level as '1' ('success'), no matter of the actual value. However, by default the levels are ordered alphabetically and this makes puts '1' after '0', 'TRUE' after 'FALSE' nad 'yes' after 'no'.
Via the distribution function parameters, binaryChoice
supports generic latent linear index binary choice models with
additive disturbance terms. It is intended to be called by wrappers
like probit
. However, it is also visible in the namespace as
the user may want to implement her own models using another
distribution of the disturbance term.
The model is estimated using Maximum Likelihood and Newton-Raphson optimizer.
probit
implements an outlier-robust log-likelihood (Demidenko,
2001). In case of large outliers the analytic Hessian is
singular while Fisher scoring approximation (used, for instance, by
glm
) is invertible. Those values are not
reliable in case of outliers.
No attempt is made to establish the existence of the estimator.
An object of class "binaryChoice". It is a list with following components:
LRT |
Likelihood ration test. The full model is tested against H0: the parameters (besides constant) have no effect on the result. This is a list with components
LRT is distributed by chi2(df) under H0. |
param |
A list with following background information:
|
df.residual |
degrees of freedom of the residuals. |
x |
if requested, the model matrix used. |
y |
if requested, the model response used. The response is represented internally as 0/1 integer vector. |
model |
the model frame, only if |
na.action |
information returned by |
Other components are inherited from maxLik
.
probit
adds class "probit" and following components to
the "binaryChoice" object:
family |
the family object used ( |
Ott Toomet [email protected], Arne Henningsen
Demidenko, Eugene (2001) “Computational aspects of probit model”, Mathematical Communications 6, 233-247
maxLik
for ready-packaged likelihood maximisation
routines and methods, glm
for generalised linear models,
including probit, binomial
, and probit-methods
.
## A simple MC trial: note probit assumes normal errors x <- runif(100) e <- 0.5*rnorm(100) y <- x + e summary(probit((y > 0) ~ x)) ## female labour force participation probability data(Mroz87) Mroz87$kids <- Mroz87$kids5 > 0 | Mroz87$kids618 > 0 Mroz87$age30.39 <- Mroz87$age < 40 Mroz87$age50.60 <- Mroz87$age >= 50 summary(probit(lfp ~ kids + age30.39 + age50.60 + educ + hushrs + huseduc + huswage + mtr + motheduc, data=Mroz87))
## A simple MC trial: note probit assumes normal errors x <- runif(100) e <- 0.5*rnorm(100) y <- x + e summary(probit((y > 0) ~ x)) ## female labour force participation probability data(Mroz87) Mroz87$kids <- Mroz87$kids5 > 0 | Mroz87$kids618 > 0 Mroz87$age30.39 <- Mroz87$age < 40 Mroz87$age50.60 <- Mroz87$age >= 50 summary(probit(lfp ~ kids + age30.39 + age50.60 + educ + hushrs + huseduc + huswage + mtr + motheduc, data=Mroz87))
Methods for probit models
## S3 method for class 'probit' fitted(object, ... ) ## S3 method for class 'probit' logLik(object, ... ) ## S3 method for class 'probit' nobs(object, ... ) ## S3 method for class 'probit' nObs(x, ... ) ## S3 method for class 'probit' print( x, digits = max(3, getOption("digits") - 3), ... )
## S3 method for class 'probit' fitted(object, ... ) ## S3 method for class 'probit' logLik(object, ... ) ## S3 method for class 'probit' nobs(object, ... ) ## S3 method for class 'probit' nObs(x, ... ) ## S3 method for class 'probit' print( x, digits = max(3, getOption("digits") - 3), ... )
object , x
|
object of class |
digits |
the minimum number of significant digits of the coefficients to be printed. |
... |
further arguments (currently ignored). |
The fitted
method returns a vector of fitted values (probabilities).
The logLik
method returns the log likelihood value of the model.
The nobs
and nObs
methods return the number of observations.
The print
method prints the call and the estimated coefficients.
Furthermore, some standard methods can be applied to probit models:
the coef
method returns the vector
of the estimated parameters.
The df.residual
method returns the degrees of freedom
of the residuals.
The lrtest
method can be used to perform
likelihood-ratio tests.
The stdEr
method returns the vector
of the standard errors of the estimated parameters.
The vcov
method returns the variance covariance matrix
of the estimated coefficients.
The methods linearPredictors.probit
,
model.frame.binaryChoice
,
model.matrix.binaryChoice
,
residuals.probit
, and summary.probit
are described at seperate help pages.
Arne Henningsen
probit
, summary.probit
,
and selection-methods
.
'The RAND Health Insurance Experiment (RAND HIE) was a comprehensive study of health care cost, utilization and outcome in the United States. It is the only randomized study of health insurance, and the only study which can give definitive evidence as to the causal effects of different health insurance plans. [...] Although the fieldwork of the study was conducted between 1974 and 1982, the results are still highly relevant, since RAND HIE is the only study which can make causal statements.' (Wikipedia, RAND Health Insurance Experiment, https://en.wikipedia.org/w/index.php?title=RAND_Health_Insurance_Experiment&oldid=110166949, accessed April 8, 2007).
data(RandHIE)
data(RandHIE)
This data frame contains the following columns:
HIE plan number.
Participant's place of residence when the participant was initially enrolled.
Coinsurance rate.
Took baseline physical.
Study year.
Person identifier.
1 if race of household head is black.
Family income.
Age in years.
1 if person is female.
Education of household head in years.
Time eligible during the year.
Outpatient expenses: all covered outpatient medical services excluding dental care, outpatient psychotherapy, outpatient drugs or supplies.
Drug expenses: all covered outpatient and dental drugs.
Supply expenses: all covered outpatient supplies including dental.
Psychotherapy expenses: all covered outpatient psychotherapy services including injections excluding charges for visits in excess of 52 per year, prescription drugs, and inpatient care.
Inpatient expenses: all covered inpatient expenses in a hospital, mental hospital, or nursing home, excluding outpatient care and renal dialysis.
Medical expenses: all covered inpatient and outpatient services, including drugs, supplies, and inpatient costs of newborns excluding dental care and outpatient psychotherapy.
Hospital admissions: annual number of covered hospitalizations.
Incomplete Hospital Records: missing inpatient records.
Psychotherapy visits: indicates the annual number of outpatient visits for psychotherapy. It includes billed visits only. The limit was 52 covered visits per person per year. The count includes an initial visit to a psychiatrist or psychologist.
Face-to-Face visits to physicians: annual covered outpatient visits with physician providers (excludes dental, psychotherapy, and radiology/anesthesiology/pathology-only visits).
Face-to-Face visits to nonphysicians: annual covered outpatient visits with nonphysician providers such as speech and physical therapists, chiropractors, podiatrists, acupuncturists, Christian Science etc. (excludes dental, healers, psychotherapy, and radiology/anesthesiology/pathology-only visits).
Family size.
Mental health index.
Number of chronic diseases.
Physical limitations.
General health index.
Maximum expenditure offer.
Participation incentive payment.
1 if age is less than 18 years.
female * child
.
log of num
(family size).
log of pioff
(participation incentive payment).
1 if individual deductible plan.
log(coins+1)
.
0 if idp=1
,
ln(max(1,mdeoff/(0.01*coins)))
otherwise.
1 if self-rated health is good – baseline is excellent self-rated health.
1 if self-rated health is fair – baseline is excellent self-rated health.
1 if self-rated health is poor – baseline is excellent self-rated health.
ghindx
(general healt index)
with imputations of missing values.
log of income
(family income).
log of num
(family size).
log of meddol
(medical expenses).
1 if meddol
> 0.
Data sets of Cameron and Trivedi (2005), http://cameron.econ.ucdavis.edu/mmabook/mmadata.html.
Additional information of variables from Table 20.4 of Cameron and Trivedi (2005) and from Newhouse (1999).
Cameron, A. C. and Trivedi, P. K. (2005) Microeconometrics: Methods and Applications, Cambridge University Press.
Newhouse, J. P. (1999) RAND Health Insurance Experiment [in Metropolitan and Non-Metropolitan Areas of the United States], 1974–1982, ICPSR Inter-university Consortium for Political and Social Research, Aggregated Claims Series, Volume 1: Codebook for Fee-for-Service Annual Expenditures and Visit Counts, ICPSR 6439.
Wikipedia, RAND Health Insurance Experiment, https://en.wikipedia.org/wiki/RAND_Health_Insurance_Experiment.
## Cameron and Trivedi (2005): Section 16.6, page 553ff data( RandHIE ) subsample <- RandHIE$year == 2 & !is.na( RandHIE$educdec ) selectEq <- binexp ~ logc + idp + lpi + fmde + physlm + disea + hlthg + hlthf + hlthp + linc + lfam + educdec + xage + female + child + fchild + black outcomeEq <- lnmeddol ~ logc + idp + lpi + fmde + physlm + disea + hlthg + hlthf + hlthp + linc + lfam + educdec + xage + female + child + fchild + black # ML estimation cameron <- selection( selectEq, outcomeEq, data = RandHIE[ subsample, ] ) summary( cameron )
## Cameron and Trivedi (2005): Section 16.6, page 553ff data( RandHIE ) subsample <- RandHIE$year == 2 & !is.na( RandHIE$educdec ) selectEq <- binexp ~ logc + idp + lpi + fmde + physlm + disea + hlthg + hlthf + hlthp + linc + lfam + educdec + xage + female + child + fchild + black outcomeEq <- lnmeddol ~ logc + idp + lpi + fmde + physlm + disea + hlthg + hlthf + hlthp + linc + lfam + educdec + xage + female + child + fchild + black # ML estimation cameron <- selection( selectEq, outcomeEq, data = RandHIE[ subsample, ] ) summary( cameron )
Calculate residuals of probit
models.
## S3 method for class 'probit' residuals( object, type = "deviance", ... )
## S3 method for class 'probit' residuals( object, type = "deviance", ... )
object |
an object of class |
type |
the type of residuals which should be returned. The alternatives are: "deviance" (default), "pearson", and "response" (see details). |
... |
further arguments (currently ignored). |
The residuals are calculated with following formulas:
Response residuals:
Pearson residuals:
Deviance residuals:
if
,
if
Here, is the
th residual,
is the
th response,
is the estimated probability
that
is one,
is the cumulative distribution function of the standard normal
distribution,
is the vector of regressors of the
th observation, and
is the vector of estimated coefficients.
More details are available in Davison & Snell (1991).
A numeric vector of the residuals.
Arne Henningsen
Davison, A. C. and Snell, E. J. (1991) Residuals and diagnostics. In: Statistical Theory and Modelling. In Honour of Sir David Cox, edited by Hinkley, D. V., Reid, N. and Snell, E. J., Chapman & Hall, London.
probit
, residuals
,
residuals.glm
, and probit-methods
.
Calculate residuals of sample selection models
## S3 method for class 'selection' residuals(object, part = "outcome", type = "deviance", ... )
## S3 method for class 'selection' residuals(object, part = "outcome", type = "deviance", ... )
object |
object of class |
part |
character string indication which residuals to extract: "outcome" for the fitted values of the outcome equation(s) or "selection" for the fitted values of the selection equation. |
type |
the type of residuals (see section ‘Details’).
The alternatives are: "deviance" (default), "pearson", and "response"
(see |
... |
further arguments passed to other methods
(e.g. |
The calculation of the fitted values
that are used to calculate the residuals
is described in the details section of the documentation
of fitted.selection
.
Argument type
is only used for binary dependent variables,
i.e. if argument part
is equal "selection"
or the dependent variable of the outcome model is binary.
Hence, argument type
is ignored
if argument part
is equal "outcome"
and the dependent variable of the outcome model is numeric.
A numeric vector of the residuals.
Arne Henningsen
residuals
, selection
,
fitted.selection
, residuals.probit
,
and selection-methods
.
This is the frontend for estimating Heckman-style selection models either with one or two outcomes (also known as generalized tobit models). It supports binary outcomes and interval outcomes in the single-outcome case. It also supports normal-distribution based treatment effect models.
For model specification and more details, see Toomet and Henningsen (2008) and the included vignettes “Sample Selection Models”, “Interval Regression with Sample Selection”, and “All-Normal Treatment Effects”.
selection(selection, outcome, data = sys.frame(sys.parent()), weights = NULL, subset, method = "ml", type = NULL, start = NULL, boundaries = NULL, ys = FALSE, xs = FALSE, yo = FALSE, xo = FALSE, mfs = FALSE, mfo = FALSE, printLevel = print.level, print.level=0, ...) heckit( selection, outcome, data = sys.frame(sys.parent()), method = "2step", ... ) treatReg(selection, outcome, data=sys.frame(sys.parent()), mfs=TRUE, mfo=TRUE, ...)
selection(selection, outcome, data = sys.frame(sys.parent()), weights = NULL, subset, method = "ml", type = NULL, start = NULL, boundaries = NULL, ys = FALSE, xs = FALSE, yo = FALSE, xo = FALSE, mfs = FALSE, mfo = FALSE, printLevel = print.level, print.level=0, ...) heckit( selection, outcome, data = sys.frame(sys.parent()), method = "2step", ... ) treatReg(selection, outcome, data=sys.frame(sys.parent()), mfs=TRUE, mfo=TRUE, ...)
selection |
formula, the selection equation. |
outcome |
the outcome equation(s). Either a single equation (for tobit 2 models), or a list of two equations (tobit 5 models). |
data |
an optional data frame, list or environment (or object
coercible by |
weights |
an optional vector of ‘prior weights’ to be used in the fitting process. Should be NULL or a numeric vector. Weights are currently only supported in type-2 models. |
subset |
an optional index vector specifying a subset of observations to be used in the fitting process. |
method |
how to estimate the model. Either |
type |
model type. |
start |
vector, initial values for the ML estimation. If
|
boundaries |
an optional vector of boundaries of the intervals of the dependent variable of the outcome equation for sample selection models with interval regression of the outcome equation. |
ys , yo , xs , xo , mfs , mfo
|
logicals. If true, the response ( |
printLevel , print.level
|
integer. Various debugging information, higher value gives more information. The preferred option is ‘printLevel’. |
... |
additional parameters for the corresponding fitting
functions |
The dependent variable of of the selection equation
(specified by argument selection
) must have exactly
two levels (e.g., 'FALSE' and 'TRUE', or '0' and '1').
By default the levels are sorted in increasing order
('FALSE' is before 'TRUE', and '0' is before '1').
If the dependent variable of the outcome equation
(specified by argument outcome
) has exactly two levels,
this variable is modelled as a binary variable.
If argument boundaries
is specified,
the outcome equation is estimated as interval regression model
and the dependent variable of the outcome equation
must be a categorical (factor) variable
or a variable of strictly positive integer values,
whereas the vector specified by argument boundaries
must have one more element than the number of levels
or the largest integer, respectively.
In all other cases, it is assumed
that the dependent variable of the outcome equation is continuous
and an ordinary sample selection model is estimated.
For tobit-2 (sample selection) models, only those observations are included in the second step estimation (argument 'outcome'), where the dependent variable variable of the selection equation equals the second element of its levels (e.g., 'TRUE' or '1').
For tobit-5 (switching regression) models, in the second step the first outcome equation (first element of argument 'outcome') is estimated only for those observations, where the dependent variable of the selection equation equals the first element of its levels (e.g., 'FALSE' or '0'). The second outcome equation is estimated only for those observations, where this variable equals the second element of its levels (e.g., 'TRUE' or '1').
Treatment effect models are a version of tobit-5 models where the two
outcomes are “participation” and “non-participation”.
treatReg
takes an equal set of explanatory variables for both of these choices
and assumes that the corresponding parameters are equal. In typical
treatment effect model the selection outcome variable (participation
decision) is included as an explanatory variable for the outcome.
If this is not done, treatReg
amounts to estimating two
equations with correlated error structure.
NA
-s are allowed in the data. These are ignored if the
corresponding outcome is unobserved, otherwise observations which
contain NA
(either in selection or outcome) are
removed.
These selection models assume a known (multivariate normal) distribution of error terms. Because of this, the instruments (exclusion restrictions) are not necessary. However, if no instruments are supplied, the results are based solely on the assumption on multivariate normality. This may or may not be an appropriate assumption for a particular problem. Note also that standard errors tend to be large without a strong exclusion restriction.
If argument method
is equal to "ml"
(the default),
the estimation is done by the maximum likelihood method,
where the Newton-Raphson algorithm is used by default.
Argument maxMethod
(see tobit2fit
)
can be used to chose other algorithms for the maximisation
of the (log) likelihood function.
If argument method
is equal to "ml"
(the default)
and argument start
is NULL
(the default),
the starting values for the maximum-likelihood (ML) estimation
of a tobit-2 or tobit-5 model
are obtained by an initial two-step estimation of this model.
The two-step estimation of interval-regression models with sample-selection
has not yet been implemented.
If no starting values for a maximum-likelihood (ML) estimation
of an interval-regression model with sample-selection
are specified (i.e., argument start
is NULL
, the default),
starting values are obtained by an initial estimation of a tobit-2 model,
where the dependent variable of the outcome equation
is set to the mid points of the boudaries of the intervals.
By default, the starting values are obtained
by a maximum-likelihood (ML) estimation of the tobit-2 model,
whereas the starting values
for the maximum-likelihood (ML) estimation of the tobit-2 model
are obtained by a 2-step estimation of the tobit-2 model.
If argument start
is set to "2step"
,
the starting values for the maximum-likelihood (ML) estimation
of an interval-regression model with sample-selection
are directly obtained by a 2-step estimation of the tobit-2 model
(i.e., without a subsequent ML estimation of the tobit-2 model).
Methods that can be applied to objects returned by selection()
are described on the help page selection-methods
.
'selection' returns an object of class "selection". If the model estimated by Maximum Likelihood (argument method = "ml"), this object is a list, which has all the components of a 'maxLik' object, and in addition the elements 'twoStep', 'start, 'param', 'termS', 'termO', 'outcomeVar', and if requested 'ys', 'xs', 'yo', 'xo', 'mfs', and 'mfo'. If a tobit-2 (sample selection) model is estimated by the two-step method (argument method = "2step"), the returned object is a list with components 'probit', 'coefficients', 'param', 'vcov', 'lm', 'sigma', 'rho', 'invMillsRatio', and 'imrDelta'. If a tobit-5 (switching regression) model is estimated by the two-step method (argument method = "2step"), the returned object is a list with components 'coefficients', 'vcov', 'probit', 'lm1', 'lm2', 'rho1', 'rho2', 'sigma1', 'sigma2', 'termsS', 'termsO', 'param', and if requested 'ys', 'xs', 'yo', 'xo', 'mfs', and 'mfo'.
probit |
object of class 'probit' that contains the results of the 1st step (probit estimation) (only for two-step estimations). |
twoStep |
(only if initial values not given) results of the 2-step estimation, used for initial values |
start |
initial values for ML estimation |
termsS , termsO
|
terms for the selection and outcome equation |
ys , xs , yo , xo , mfs , mfo
|
response, matrix and frame of the selection- and outcome equations (as a list of two for the latter). NULL, if not requested. The response is represented internally as 0/1 integer vector with 0 denoting either the unobservable outcome (tobit 2) or the first selection (tobit 5). |
coefficients |
estimated coefficients, the complete model.
coefficient for the Inverse Mills ratio is treated as a parameter
( |
vcov |
variance covariance matrix of the estimated coefficients. |
param |
a list with following components: |
lm , lm1 , lm2
|
objects of class 'lm' that contain the results
of the 2nd step estimation(s) of the outcome equation(s).
Note: the standard errors of this
estimation are biased, because they do not account for the
estimation of |
sigma , sigma1 , sigma2
|
the standard error(s) of the error terms of the outcome equation(s). |
rho , rho1 , rho2
|
the estimated correlation coefficient(s) between the error term of the selection equation and the outcome equation(s). |
invMillsRatio |
the inverse Mills Ratios calculated from the results of the 1st step probit estimation. |
imrDelta |
the |
outcomeVar |
character string indicating whether the dependent variable
of the outcome equation is |
The 2-step estimate of 'rho' may be outside of the
interval. In that case the standard errors of
invMillsRatio may be meaningless.
Arne Henningsen, Ott Toomet [email protected]
Cameron, A. C. and Trivedi, P. K. (2005) Microeconometrics: Methods and Applications, Cambridge University Press.
Greene, W. H. (2003) Econometric Analysis, Fifth Edition, Prentice Hall.
Heckman, J. (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models, Annals of Economic and Social Measurement, 5(4), p. 475-492.
Johnston, J. and J. DiNardo (1997) Econometric Methods, Fourth Edition, McGraw-Hill.
Lee, L., G. Maddala and R. Trost (1980) Asymetric covariance matrices of two-stage probit and two-stage tobit methods for simultaneous equations models with selectivity. Econometrica, 48, p. 491-503.
Petersen, S., G. Henningsen and A. Henningsen (2017) Which Households Invest in Energy-Saving Home Improvements? Evidence From a Danish Policy Intervention. Unpublished Manuscript. Department of Management Engineering, Technical University of Denmark.
Toomet, O. and A. Henningsen, (2008) Sample Selection Models in R: Package sampleSelection. Journal of Statistical Software 27(7), https://www.jstatsoft.org/v27/i07/
Wooldridge, J. M. (2003) Introductory Econometrics: A Modern Approach, 2e, Thomson South-Western.
summary.selection
, selection-methods
,
probit
, lm
,
and Mroz87
and RandHIE
for further examples.
## Greene( 2003 ): example 22.8, page 786 data( Mroz87 ) Mroz87$kids <- ( Mroz87$kids5 + Mroz87$kids618 > 0 ) # Two-step estimation summary( heckit( lfp ~ age + I( age^2 ) + faminc + kids + educ, wage ~ exper + I( exper^2 ) + educ + city, Mroz87 ) ) # ML estimation summary( selection( lfp ~ age + I( age^2 ) + faminc + kids + educ, wage ~ exper + I( exper^2 ) + educ + city, Mroz87 ) ) ## Example using binary outcome for selection model. ## We estimate the probability of womens' education on their ## chances to get high wage (> $5/hr in 1975 USD), using PSID data ## We use education as explanatory variable ## and add age, kids, and non-work income as exclusion restrictions. data(Mroz87) m <- selection(lfp ~ educ + age + kids5 + kids618 + nwifeinc, wage >= 5 ~ educ, data = Mroz87 ) summary(m) ## example using random numbers library( "mvtnorm" ) nObs <- 1000 sigma <- matrix( c( 1, -0.7, -0.7, 1 ), ncol = 2 ) errorTerms <- rmvnorm( nObs, c( 0, 0 ), sigma ) myData <- data.frame( no = c( 1:nObs ), x1 = rnorm( nObs ), x2 = rnorm( nObs ), u1 = errorTerms[ , 1 ], u2 = errorTerms[ , 2 ] ) myData$y <- 2 + myData$x1 + myData$u1 myData$s <- ( 2 * myData$x1 + myData$x2 + myData$u2 - 0.2 ) > 0 myData$y[ !myData$s ] <- NA myOls <- lm( y ~ x1, data = myData) summary( myOls ) myHeckit <- heckit( s ~ x1 + x2, y ~ x1, myData, print.level = 1 ) summary( myHeckit ) ## example using random numbers with IV/2SLS estimation library( "mvtnorm" ) nObs <- 1000 sigma <- matrix( c( 1, 0.5, 0.1, 0.5, 1, -0.3, 0.1, -0.3, 1 ), ncol = 3 ) errorTerms <- rmvnorm( nObs, c( 0, 0, 0 ), sigma ) myData <- data.frame( no = c( 1:nObs ), x1 = rnorm( nObs ), x2 = rnorm( nObs ), u1 = errorTerms[ , 1 ], u2 = errorTerms[ , 2 ], u3 = errorTerms[ , 3 ] ) myData$w <- 1 + myData$x1 + myData$u1 myData$y <- 2 + myData$w + myData$u2 myData$s <- ( 2 * myData$x1 + myData$x2 + myData$u3 - 0.2 ) > 0 myData$y[ !myData$s ] <- NA myHeckit <- heckit( s ~ x1 + x2, y ~ w, data = myData ) summary( myHeckit ) # biased! myHeckitIv <- heckit( s ~ x1 + x2, y ~ w, data = myData, inst = ~ x1 ) summary( myHeckitIv ) # unbiased ## tobit-5 example N <- 500 library(mvtnorm) vc <- diag(3) vc[lower.tri(vc)] <- c(0.9, 0.5, 0.6) vc[upper.tri(vc)] <- vc[lower.tri(vc)] eps <- rmvnorm(N, rep(0, 3), vc) xs <- runif(N) ys <- xs + eps[,1] > 0 xo1 <- runif(N) yo1 <- xo1 + eps[,2] xo2 <- runif(N) yo2 <- xo2 + eps[,3] a <- selection(ys~xs, list(yo1 ~ xo1, yo2 ~ xo2)) summary(a) ## tobit2 example vc <- diag(2) vc[2,1] <- vc[1,2] <- -0.7 eps <- rmvnorm(N, rep(0, 2), vc) xs <- runif(N) ys <- xs + eps[,1] > 0 xo <- runif(N) yo <- (xo + eps[,2])*(ys > 0) a <- selection(ys~xs, yo ~xo) summary(a) ## Example for treatment regressions ## Estimate the effect of treatment on income ## selection outcome: treatment participation, logical (treatment) ## selection explanatory variables: age, education (years) ## unemployment in 1974, 1975, race ## outcome: log real income 1978 ## outcome explanatory variables: treatment, age, education, race. ## unemployment variables are treated as exclusion restriction data(Treatment, package="Ecdat") a <- treatReg(treat~poly(age,2) + educ + u74 + u75 + ethn, log(re78)~treat + poly(age,2) + educ + ethn, data=Treatment) print(summary(a))
## Greene( 2003 ): example 22.8, page 786 data( Mroz87 ) Mroz87$kids <- ( Mroz87$kids5 + Mroz87$kids618 > 0 ) # Two-step estimation summary( heckit( lfp ~ age + I( age^2 ) + faminc + kids + educ, wage ~ exper + I( exper^2 ) + educ + city, Mroz87 ) ) # ML estimation summary( selection( lfp ~ age + I( age^2 ) + faminc + kids + educ, wage ~ exper + I( exper^2 ) + educ + city, Mroz87 ) ) ## Example using binary outcome for selection model. ## We estimate the probability of womens' education on their ## chances to get high wage (> $5/hr in 1975 USD), using PSID data ## We use education as explanatory variable ## and add age, kids, and non-work income as exclusion restrictions. data(Mroz87) m <- selection(lfp ~ educ + age + kids5 + kids618 + nwifeinc, wage >= 5 ~ educ, data = Mroz87 ) summary(m) ## example using random numbers library( "mvtnorm" ) nObs <- 1000 sigma <- matrix( c( 1, -0.7, -0.7, 1 ), ncol = 2 ) errorTerms <- rmvnorm( nObs, c( 0, 0 ), sigma ) myData <- data.frame( no = c( 1:nObs ), x1 = rnorm( nObs ), x2 = rnorm( nObs ), u1 = errorTerms[ , 1 ], u2 = errorTerms[ , 2 ] ) myData$y <- 2 + myData$x1 + myData$u1 myData$s <- ( 2 * myData$x1 + myData$x2 + myData$u2 - 0.2 ) > 0 myData$y[ !myData$s ] <- NA myOls <- lm( y ~ x1, data = myData) summary( myOls ) myHeckit <- heckit( s ~ x1 + x2, y ~ x1, myData, print.level = 1 ) summary( myHeckit ) ## example using random numbers with IV/2SLS estimation library( "mvtnorm" ) nObs <- 1000 sigma <- matrix( c( 1, 0.5, 0.1, 0.5, 1, -0.3, 0.1, -0.3, 1 ), ncol = 3 ) errorTerms <- rmvnorm( nObs, c( 0, 0, 0 ), sigma ) myData <- data.frame( no = c( 1:nObs ), x1 = rnorm( nObs ), x2 = rnorm( nObs ), u1 = errorTerms[ , 1 ], u2 = errorTerms[ , 2 ], u3 = errorTerms[ , 3 ] ) myData$w <- 1 + myData$x1 + myData$u1 myData$y <- 2 + myData$w + myData$u2 myData$s <- ( 2 * myData$x1 + myData$x2 + myData$u3 - 0.2 ) > 0 myData$y[ !myData$s ] <- NA myHeckit <- heckit( s ~ x1 + x2, y ~ w, data = myData ) summary( myHeckit ) # biased! myHeckitIv <- heckit( s ~ x1 + x2, y ~ w, data = myData, inst = ~ x1 ) summary( myHeckitIv ) # unbiased ## tobit-5 example N <- 500 library(mvtnorm) vc <- diag(3) vc[lower.tri(vc)] <- c(0.9, 0.5, 0.6) vc[upper.tri(vc)] <- vc[lower.tri(vc)] eps <- rmvnorm(N, rep(0, 3), vc) xs <- runif(N) ys <- xs + eps[,1] > 0 xo1 <- runif(N) yo1 <- xo1 + eps[,2] xo2 <- runif(N) yo2 <- xo2 + eps[,3] a <- selection(ys~xs, list(yo1 ~ xo1, yo2 ~ xo2)) summary(a) ## tobit2 example vc <- diag(2) vc[2,1] <- vc[1,2] <- -0.7 eps <- rmvnorm(N, rep(0, 2), vc) xs <- runif(N) ys <- xs + eps[,1] > 0 xo <- runif(N) yo <- (xo + eps[,2])*(ys > 0) a <- selection(ys~xs, yo ~xo) summary(a) ## Example for treatment regressions ## Estimate the effect of treatment on income ## selection outcome: treatment participation, logical (treatment) ## selection explanatory variables: age, education (years) ## unemployment in 1974, 1975, race ## outcome: log real income 1978 ## outcome explanatory variables: treatment, age, education, race. ## unemployment variables are treated as exclusion restriction data(Treatment, package="Ecdat") a <- treatReg(treat~poly(age,2) + educ + u74 + u75 + ethn, log(re78)~treat + poly(age,2) + educ + ethn, data=Treatment) print(summary(a))
Methods for selection models
## S3 method for class 'selection' logLik(object, ... ) ## S3 method for class 'selection' nobs(object, ... ) ## S3 method for class 'selection' nObs(x, ... ) ## S3 method for class 'selection' print( x, digits = max(3, getOption("digits") - 3), ... )
## S3 method for class 'selection' logLik(object, ... ) ## S3 method for class 'selection' nobs(object, ... ) ## S3 method for class 'selection' nObs(x, ... ) ## S3 method for class 'selection' print( x, digits = max(3, getOption("digits") - 3), ... )
object , x
|
object of class |
digits |
the minimum number of significant digits of the coefficients to be printed. |
... |
further arguments (currently ignored). |
The logLik
method returns the log likelihood value of the model.
The nobs
and nObs
methods return the number of observations.
The print
method prints the call and the estimated coefficients.
Furthermore, some standard methods can be applied to selection models:
The lrtest
method can be used to perform
likelihood-ratio tests.
The stdEr
method returns the vector
of the standard errors of the estimated parameters.
The methods coef.selection
,
fitted.selection
model.frame.selection
,
model.matrix.selection
,
residuals.selection
,
summary.selection
,
and vcov.selection
are described at seperate help pages.
Arne Henningsen
selection
, summary.selection
,
and probit-methods
.
'Instructional dataset, N=807, cross-sectional individual data on smoking accompanying Introductory Econometrics: A Modern Approach, Jeffrey M. Wooldridge, South-Western College Publishing, (c) 2000 and Jeffrey M. Wooldridge, Econometric Analysis of Cross Section and Panel Data, MIT Press, (c) 2001.' (https://ideas.repec.org/p/boc/bocins/smoke.html#biblio, accessed February 27, 2017). This dataset is a subset of data used in Mullahy (1997). Data was collected in 1979 and 1980 through the Smoking Supplement to the US National Health Interview Survey.
data(Smoke)
data(Smoke)
This data frame contains the following columns:
Years of schooling.
Respondents age in years.
State cigarette price, cents per pack.
Annual income in USD.
Dummy variable indicating if state restaurant smoking restrictions are in place.
Dummy variable indicating if person has smoked at least one cigarette.
Number of cigarettes smoked per day, coded in intervals with intervals boundaries: (0,5,10,20,50)
Number of cigarettes smoked per day.
Wooldridge(2009)'s dataset also available in other formats at https://ideas.repec.org/p/boc/bocins/smoke.html#biblio.
Original data used in Mullahy (1985) and Mullahy (1997).
Jeffrey, M. Wooldridge (2009), Introductory Econometrics: A modern approach, Canada: South-Western Cengage Learning.
Mullahy, John (1997), Instrumental-Variable Estimation of Count Data Models: Applications to Models of Cigarette Smoking Behavior, Review of Economics and Statistics 79, 596-593.
Mullahy, John (1985) Cigarette Smoking: Habits, Health Concerns, and Heterogeneous Unobservables in a Microeconometric Analysis of Consumer Demand, Ph.D. dissertation, University of Virginia.
data( Smoke ) # boundaries of the intervals bounds <- c(0,5,10,20,50,Inf) ## Not run: # estimation with starting values obtained by a ML estimation # of a standard tobit-2 model with the dependent variable # of the outcome equation equal to the mid-points of the intervals res <- selection( smoker ~ educ + age, cigs_intervals ~ educ, data = Smoke, boundaries = bounds ) summary( res ) # estimation with starting values obtained by a two-step estimation # of a standard tobit-2 model with the dependent variable # of the outcome equation equal to the mid-points of the intervals res2 <- selection( smoker ~ educ + age, cigs_intervals ~ educ, data = Smoke, boundaries = bounds, start = "2step" ) summary( res2 ) ## End(Not run) # estimation with starting values that are very close to the estimates # (in order to reduce the execution time of running this example) resS <- selection( smoker ~ educ + age, cigs_intervals ~ educ, data = Smoke, boundaries = bounds, start = c( 0.527, -0.0482, -0.0057, 4.23, -0.319, 2.97, 2.245 ) ) summary( resS )
data( Smoke ) # boundaries of the intervals bounds <- c(0,5,10,20,50,Inf) ## Not run: # estimation with starting values obtained by a ML estimation # of a standard tobit-2 model with the dependent variable # of the outcome equation equal to the mid-points of the intervals res <- selection( smoker ~ educ + age, cigs_intervals ~ educ, data = Smoke, boundaries = bounds ) summary( res ) # estimation with starting values obtained by a two-step estimation # of a standard tobit-2 model with the dependent variable # of the outcome equation equal to the mid-points of the intervals res2 <- selection( smoker ~ educ + age, cigs_intervals ~ educ, data = Smoke, boundaries = bounds, start = "2step" ) summary( res2 ) ## End(Not run) # estimation with starting values that are very close to the estimates # (in order to reduce the execution time of running this example) resS <- selection( smoker ~ educ + age, cigs_intervals ~ educ, data = Smoke, boundaries = bounds, start = c( 0.527, -0.0482, -0.0057, 4.23, -0.319, 2.97, 2.245 ) ) summary( resS )
Print or return a summary of a probit estimation.
## S3 method for class 'probit' summary( object, ... ) ## S3 method for class 'summary.probit' print( x, ... )
## S3 method for class 'probit' summary( object, ... ) ## S3 method for class 'summary.probit' print( x, ... )
object |
an object of class |
x |
an object of class |
... |
currently not used. |
The summary
method returns an object of class summary.probit
;
the print
method prints summary results and returns
the argument invisibly.
Arne Henningsen
probit
and probit-methods
.
Print or return a summary of a selection estimation.
## S3 method for class 'selection' summary(object, ...) ## S3 method for class 'summary.selection' print(x, digits = max(3, getOption("digits") - 3), part = "full", ...)
## S3 method for class 'selection' summary(object, ...) ## S3 method for class 'summary.selection' print(x, digits = max(3, getOption("digits") - 3), part = "full", ...)
object |
an object of class ' |
x |
an object of class ' |
part |
character string: which parts of the summary to print: "full" for all the estimated parameters (probit selection, outcome estimates, correlation and residual variance), or "outcome" for the outcome results only. |
digits |
numeric, (suggested) number of significant digits. |
... |
currently not used. |
The variance-covariance matrix of the two-step estimator is currently implemented only for tobit-2 (sample selection) models, but not for the tobit-5 (switching regression) model.
Summary methods return an object of class summary.selection
.
Print methods return the argument invisibly.
Arne Henningsen, Ott Toomet [email protected]
summary
, selection
,
and selection-methods
.
## Wooldridge( 2003 ): example 17.5, page 590 data( Mroz87 ) wooldridge <- selection( lfp ~ nwifeinc + educ + exper + I( exper^2 ) + age + kids5 + kids618, log( wage ) ~ educ + exper + I( exper^2 ), data = Mroz87, method = "2step" ) # summary of the 1st step probit estimation (Example 17.1, p. 562f) # and the 2nd step OLS regression (example 17.5, page 590) summary( wooldridge ) # summary of the outcome equation only print(summary(wooldridge), part="outcome")
## Wooldridge( 2003 ): example 17.5, page 590 data( Mroz87 ) wooldridge <- selection( lfp ~ nwifeinc + educ + exper + I( exper^2 ) + age + kids5 + kids618, log( wage ) ~ educ + exper + I( exper^2 ), data = Mroz87, method = "2step" ) # summary of the 1st step probit estimation (Example 17.1, p. 562f) # and the 2nd step OLS regression (example 17.5, page 590) summary( wooldridge ) # summary of the outcome equation only print(summary(wooldridge), part="outcome")
This function extracts the coefficient variance-covariance matrix from sample selection models.
## S3 method for class 'selection' vcov(object, part = "full", ...)
## S3 method for class 'selection' vcov(object, part = "full", ...)
object |
object of class "selection". |
part |
character string indicating which parts
of the variance-covariance matrix to extract:
|
... |
currently not used. |
The variance-covariance matrix of a two-step estimate is currently only partly implemented. The unimplemented part of the matrix is filled with NAs.
the estimated variance covariance matrix of the coefficients.
Arne Henningsen, Ott Toomet [email protected]
vcov
, selection
,
coef.selection
, and selection-methods
.
## Estimate a simple female wage model taking into account the labour ## force participation data(Mroz87) a <- heckit(lfp ~ huswage + kids5 + mtr + fatheduc + educ + city, log(wage) ~ educ + city, data=Mroz87) ## extract the full variance-covariance matrix: vcov( a ) ## now extract the variance-covariance matrix of the outcome model only: vcov( a, part = "outcome" )
## Estimate a simple female wage model taking into account the labour ## force participation data(Mroz87) a <- heckit(lfp ~ huswage + kids5 + mtr + fatheduc + educ + city, log(wage) ~ educ + city, data=Mroz87) ## extract the full variance-covariance matrix: vcov( a ) ## now extract the variance-covariance matrix of the outcome model only: vcov( a, part = "outcome" )