Title: | Recursive Partitioning for Modeling Survey Data |
---|---|
Description: | Functions to allow users to build and analyze design-consistent tree and random forest models using survey data from a complex sample design. The tree model algorithm can fit a linear model to survey data in each node obtained by recursively partitioning the data. The splitting variables and selected splits are obtained using a randomized permutation test procedure that adjusts for the complex sample design features used to obtain the data. Likewise, the model fitting algorithm produces design-consistent coefficients for any specified least squares linear model between the dependent and independent variables used in the end nodes. The main functions return the resulting binary tree or random forest as an object of "rpms" or "rpms_forest" type. The package also provides methods for modeling a "boosted" tree or forest model and a tree model for zero-inflated data, as well as a number of functions and methods for use with these object types. |
Authors: | Daniell Toth [aut, cre] |
Maintainer: | Daniell Toth <[email protected]> |
License: | CC0 |
Version: | 0.5.1 |
Built: | 2024-12-22 06:32:48 UTC |
Source: | CRAN |
This package provides a function rpms to produce an rpms object, and method functions that operate on such objects.
The rpms object is a representation of a regression tree obtained by recursively partitioning the dataset and fitting the specified linear model on each node separately.
The recursive partitioning algorithm has unbiased variable selection and accounts for the sample design.
The algorithm accounts for one stage of stratification and clustering as well as unequal probability of selection.
There are also functions for producing a random forest estimator (a list of rpms objects), a boosted regression tree, and a tree-based zero-inflated model.
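As a rough illustration of what fitting a least squares model with sample weights involves, the sketch below computes weighted least squares coefficients in base R, once in closed form and once with lm(). The toy data and the names x, y, and w are made up for this example. It covers point estimates only and is not the package's own fitting routine, which also uses the strata and cluster information.

```r
# Hypothetical toy data: response y, one covariate x, and sample weights w
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n)
w <- runif(n, 1, 5)          # stand-in for final sample weights (assumed)

# Closed-form weighted least squares: (X'WX)^{-1} X'Wy
X <- cbind(1, x)
beta_closed <- solve(t(X) %*% (w * X), t(X) %*% (w * y))

# The same point estimates from base R's lm() with a weights argument
beta_lm <- coef(lm(y ~ x, weights = w))

# Both approaches should agree up to numerical tolerance
all.equal(as.numeric(beta_closed), as.numeric(beta_lm))
```

The weights change the point estimates; standard errors that respect stratification and clustering need the design information as well, which this sketch ignores.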
For each row of data, returns a vector of indicators showing whether the observation is in each box.
box_ind(x, newdata)
x |
|
newdata |
dataframe containing the variables used for the recursive partitioning. |
Matrix in which each row is a vector of indicators showing whether the observation is in each box.
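To make the shape of such an indicator matrix concrete, here is a small base R sketch. It assumes that each box corresponds to one end node and uses made-up node labels, not output from box_ind.

```r
# Hypothetical end-node labels for six observations (not output from rpms)
labels <- c(2, 6, 7, 6, 2, 7)
nodes  <- sort(unique(labels))

# One row per observation, one column per box; 1 if the observation is in the box
ind <- outer(labels, nodes, "==") * 1L
colnames(ind) <- paste0("node_", nodes)
ind
```

Because end nodes partition the data, each row of this sketch matrix contains exactly one 1.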
returns end boxes that partition the data
boxes(x)
x |
|
data.frame including end_node, sample size, splits, and values for each end node
A dataset containing consumer unit characteristics, assets and expenditure data from the Bureau of Labor Statistics' Consumer Expenditure Survey public use interview data file.
CE
A data frame with 68,415 observations on 47 variables:
Consumer unit identifying variable, constructed using the first seven digits of NEWID BLS derived
Primary Sampling Unit code for the 21 biggest clusters
Cluster Identifier for all clusters, (created using PSU, REGION, STATE, and POPSIZE) not part of CE data
Month for which data was collected
Final sample weight to make inference to total population
State FIPS code
Region code: 1 Northeast; 2 Midwest; 3 South; 4 West
Urban = 1, Rural = 2
Population size class of PSU: 1-biggest 5-smallest
Housing tenure: 1 Owned with mortgage; 2 Owned without mortgage; 3 Owned mortgage not reported; 4 Rented; 5 Occupied without payment of cash rent; 6 Student housing
Number of rooms, including finished living areas and excluding all baths
Number of bathrooms
Number of bedrooms
Number of owned vehicles
Number of leased vehicles
CU code based on relationship of members to reference person (children include blood-related, step and adopted): 1 Married Couple only; 2 Married Couple, children (oldest < 6 years old); 3 Married Couple, children (oldest 6 to 17 years old); 4 Married Couple, children (oldest > 17 years old); 5 All other Married Couple CUs; 6 One parent (male), children (at least one child < 18 years old); 7 One parent (female), children (at least one child < 18 years old); 8 Single consumers; 9 Other CUs
Number of members in CU
Number of people <18 yrs old
Number of people >64 yrs old
Number of earners
Age of primary earner
Education level coded: 1 None; 2 1st-8th Grade; 3 some HS; 4 HS; 5 Some college; 6 AA degree; 7 Bachelors degree; 8 Advanced degree
Gender Code: F (Female); M (Male)
Marital Status Coded: 1 Married; 2 Widowed; 3 Divorced; 4 Separated; 5 Never Married
Race code: 1 White; 2 Black; 3 Native American; 4 Asian; 5 Pacific Islander; 6 Multi-race
Hispanic, Latino, or Spanish origin? Y (Yes); N (No)
Member of armed forces? Y (Yes); N (No)
Currently enrolled in college? Full (full time); Part (part time); No
Earn income: Y (Yes); N (No)
1 Full time all year; 2 Part time all year; 3 Full time part of the year; 4 Part time part of the year
The job in which the member received the most earnings during the past 12 months fits best in the following category: 01 Administrator, manager; 02 Teacher; 03 Professional Administrative support, technical, sales; 04 Administrative support, including clerical; 05 Sales, retail; 06 Sales, business goods and services; 07 Technician; 08 Protective service; 09 Private household service; 10 Other service; 11 Machine operator, assembler, inspector; 12 Transportation operator; 13 Handler, helper, laborer; 14 Mechanic, repairer, precision production; 15 Construction, mining; 16 Farming; 17 Forestry, fishing, grounds-keeping; 18 Armed forces
Type of employment: 1 An employee of a PRIVATE company, business, or individual; 2 A Federal government employee; 3 A State government employee; 4 A local government employee; 5 Self-employed in OWN business, professional practice or farm; 6 Working WITHOUT PAY in family business or farm
Reason did not work during the past 12 months: 1 Retired; 2 Home maker; 3 School; 4 Health; 5 Unable to find work; 6 Doing something else
Amount of CU income before taxes in past 12 months
Amount of wage or salary income received in past 12 months, before any deductions
Amount income received from Social Security and Railroad Retirement in past 12 months
Total value of all retirement accounts
Value of liquid assets
Total value of all directly-held stocks, bonds
Amount owed on all student loans
Total expenditures for current quarter
Total taxes paid (estimated)
Total expenditures for housing paid this quarter
Expenditures on health care this quarter
Expenditure on food this quarter
Tobacco and smoking supplies this quarter
Expenditure on footwear this quarter
https://www.bls.gov/cex/pumd_data.htm
Returns either a vector of end-node labels for each observation in newdata, or a vector of the end-nodes in the tree model if newdata is not provided.
end_nodes(object, newdata = NULL)
object |
|
newdata |
data.frame |
vector of end_node labels
{
# Model the mean retirement account value for households with reported
# retirement account values > 0 using a binary tree, while accounting for
# clustered data and sample weights.

s1 <- which(CE$IRAX > 0)
r1 <- rpms(IRAX~EDUCA+AGE+BLS_URBN, data = CE[s1,], weights=~FINLWT21, clusters=~CID)

end_nodes(r1)
}
grow an rpms tree from a given node
grow_rpms( x, node, data, weights = ~1, strata = ~1, clusters = ~1, pval = NA, bin_size = NA )
x |
|
node |
node from which to grow tree further |
data |
data.frame that includes variables used in rp_equ, e_equ, and design information |
weights |
formula or vector of sample weights for each observation |
strata |
formula or vector of strata labels |
clusters |
formula or vector of cluster labels |
pval |
numeric p-value used to reject null hypothesis in permutation test |
bin_size |
numeric minimum number of observations in each node |
rpms tree expanded from node.
Get the indices of the elements in a data frame that are in the specified end-node of an rpms object. A "which" function for end-nodes.
in_node(x, node, data)
x |
|
node |
integer label of the desired end-node. |
data |
dataframe containing the variables used for the recursive partitioning. |
vector of indexes for observations in the end-node.
{
# Model the mean retirement account value for households with reported
# retirement account values > 0 using a binary tree, while accounting for
# clustered data and sample weights.

s1 <- which(CE$IRAX > 0)
r1 <- rpms(IRAX~EDUCA+AGE+BLS_URBN, data = CE[s1,], weights=~FINLWT21, clusters=~CID)

# Get summary statistics of CUTENURE for households in end-nodes 7 and 8 of the tree.
# Assuming in_node() returns positions within the data passed to it (CE[s1,]),
# CUTENURE is subset by s1 first so that the indices line up.
if(7 %in% end_nodes(r1))
  summary(CE$CUTENURE[s1][in_node(r1, node=7, data=CE[s1,])])

if(8 %in% end_nodes(r1))
  summary(CE$CUTENURE[s1][in_node(r1, node=8, data=CE[s1,])])
}
Returns a linearized version of the splits. The coefficients represent the effect that each split has on the mean.
linearize(x, data, weights = ~1, strata = ~1, clusters = ~1, type = "part")
x |
|
data |
data.frame |
weights |
formula or vector of sample weights for each observation |
strata |
formula or vector of strata labels |
clusters |
formula or vector of cluster labels |
type |
one of "part" or "lin" |
data.frame including splits and estimates for the coefficient and their standard errors
plots end-node of object of class rpms
node_plot(object, node, data, variable = NA, ...)
object |
|
node |
integer label of the desired end-node. |
data |
data.frame that includes variables used in rp_equ, e_equ, and design information |
variable |
string name of variable in data to use as x-axis in plot |
... |
further arguments passed to plot function. |
{
# Model the mean retirement account value for households with reported
# retirement account values > 0 using a binary tree, while accounting for
# clustered data and sample weights.

s1 <- which(CE$IRAX > 0)
r1 <- rpms(IRAX~EDUCA+AGE+BLS_URBN, data = CE[s1,], weights=~FINLWT21, clusters=~CID)

# plot node 6 if it is an end-node of the tree
if(6 %in% end_nodes(r1))
  node_plot(object=r1, node=6, data=CE[s1,])

# plot node 8 if it is an end-node of the tree
if(8 %in% end_nodes(r1))
  node_plot(object=r1, node=8, data=CE[s1,])
}
Predicted values based on an rpms object.
## S3 method for class 'rpms'
predict(object, newdata, ...)
object |
Object inheriting from |
newdata |
data frame with variables to use for predicting new values. |
... |
further arguments passed to or from other methods. |
vector of predicted values for each row of newdata
{
# Get an rpms model of mean Social Security income for families headed by a
# retired person, by several factors.
r1 <- rpms(SOCRRX~EDUCA+AGE+BLS_URBN+REGION,
           data=CE[which(CE$INCNONWK==1),], clusters=~CID)

r1

# predicted means for rows 10 through 20 of CE
predict(r1, CE[10:20, ])
}
Predicted values based on an rpms_boost object.
## S3 method for class 'rpms_boost'
predict(object, newdata, ...)
object |
Object inheriting from |
newdata |
data frame with variables to use for predicting new values. |
... |
further arguments passed to or from other methods. |
vector of predicted values for each row of newdata
Gets predicted values given new data based on an rpms_forest model.
## S3 method for class 'rpms_forest'
predict(object, newdata, ...)
object |
Object inheriting from |
newdata |
data frame with variables to use for predicting new values. |
... |
further arguments passed to or from other methods. |
vector of predicted values for each row of newdata
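A common way for a forest to form a prediction is to average the predictions of its individual trees. The base R sketch below illustrates that idea with three hand-written stand-in "trees". Whether rpms_forest combines its trees exactly this way is an assumption here, and the variables age and educ are hypothetical.

```r
# Three hypothetical "trees", each a function returning one prediction per row
trees <- list(
  function(d) ifelse(d$age < 40, 10, 20),
  function(d) ifelse(d$age < 50, 12, 22),
  function(d) ifelse(d$educ > 4, 18, 11)
)
newdata <- data.frame(age = c(30, 45, 60), educ = c(5, 3, 6))

# Column i holds tree i's predictions; the ensemble prediction is the row mean
per_tree    <- sapply(trees, function(f) f(newdata))
forest_pred <- rowMeans(per_tree)
forest_pred
```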
Predicted values based on an rpms_proj model.
## S3 method for class 'rpms_proj'
predict(object, newdata, ...)
object |
Object inheriting from |
newdata |
data frame with variables to use for predicting new values. |
... |
further arguments passed to or from other methods. |
vector of predicted values for each row of newdata
Predicted values based on an rpms_zinf model.
## S3 method for class 'rpms_zinf'
predict(object, newdata, ...)
object |
Object inheriting from |
newdata |
data frame with variables to use for predicting new values. |
... |
further arguments passed to or from other methods. |
vector of predicted values for each row of newdata
print method for class rpms
## S3 method for class 'rpms'
print(x, ...)
x |
|
... |
further arguments passed to or from other methods. |
Prints information for a given rpms_forest model.
## S3 method for class 'rpms_forest'
print(x, ...)
x |
Object inheriting from |
... |
further arguments passed to or from other methods. |
vector of predicted values for each row of newdata
print method for class rpms_zinf
## S3 method for class 'rpms_zinf'
print(x, ...)
x |
|
... |
further arguments passed to or from other methods. |
prune rpms tree to given node
prune_rpms(x, node)
x |
|
node |
number of node to prune to. |
Subtree with any splits after the given node clipped off.
Writes LaTeX code for a qtree plot. Takes an rpms object and returns the LaTeX code that produces the tree structure in a TeX document, using linearize as a guide. Requires the following commands in the preamble of the LaTeX document: \usepackage{lscape} and \usepackage{tikz-qtree}
qtree( t1, title = NULL, label = NA, caption = "", digits = 2, s_size = TRUE, scale = 1, lscape = FALSE, subnode = 1 )
t1 |
rpms object created by rpms function |
title |
string for the top node of the tree |
label |
string used for labeling the tree figure |
caption |
string used for caption |
digits |
integer number of displayed digits |
s_size |
boolean indicating whether or not to include sample size |
scale |
numeric factor for scaling size of tree |
lscape |
boolean to display tree in landscape mode |
subnode |
starting node of subtree to plot |
{
# Model the mean retirement account value for households with reported
# retirement account values > 0 using a binary tree, while accounting for
# clustered data and sample weights.

s1 <- which(CE$IRAX > 0)
r1 <- rpms(IRAX~EDUCA+AGE+BLS_URBN, data = CE[s1,], weights=~FINLWT21, clusters=~CID)

# get LaTeX code
qtree(r1)
}
Returns the estimated R^2 statistic for determining the fit of the given model to the data
r2stat(t1, data, adjusted = TRUE)
t1 |
Object inheriting from |
data |
data frame with variables used to estimate model |
adjusted |
TRUE/FALSE whether to compute adjusted R^2 |
R^2 statistic computed using the model and provided data
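For reference, the textbook R^2 and adjusted R^2 can be computed by hand as in the base R sketch below. The observed values, predictions, and predictor count p are made up, and the formulas are the standard unweighted ones. The statistic returned by r2stat may be computed differently (for example, with the design taken into account), so this is a sketch of the idea only.

```r
# Hypothetical observed values and model predictions
y    <- c(1, 2, 3, 4, 5)
yhat <- c(1.1, 1.9, 3.2, 3.8, 5.0)
p    <- 1                      # number of predictors (assumed)

sse <- sum((y - yhat)^2)       # residual sum of squares
sst <- sum((y - mean(y))^2)    # total sum of squares
r2  <- 1 - sse / sst

# Adjusted R^2 penalizes for the number of predictors
n      <- length(y)
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adjusted = r2_adj)
```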
main function producing a regression tree using variables from rp_equ to partition the data and fit the model e_equ on each node. Currently only uses data with complete cases of continuous variables.
rpms( rp_equ, data, weights = ~1, strata = ~1, clusters = ~1, e_equ = ~1, e_fn = "survLm", l_fn = NULL, bin_size = NULL, gridpts = 3, perm_reps = 1000L, pval = 0.05 )
rp_equ |
formula containing all variables for partitioning |
data |
data.frame that includes variables used in rp_equ, e_equ, and design information |
weights |
formula or vector of sample weights for each observation |
strata |
formula or vector of strata labels |
clusters |
formula or vector of cluster labels |
e_equ |
formula for modeling data in each node |
e_fn |
string name of function to use for modeling (only "survLm" is operational) |
l_fn |
loss function (ignored) |
bin_size |
integer specifying minimum number of observations in each node |
gridpts |
integer number of middle points to do in search; set to n for categorical variables when e_equ is used. |
perm_reps |
integer specifying the number of thousands of permutation replications to use to estimate p-value |
pval |
numeric p-value used to reject null hypothesis in permutation test |
object of class "rpms"
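The perm_reps and pval arguments control a permutation test. The base R sketch below shows the general shape of such a test for one candidate splitting variable: compute a statistic, recompute it on permuted responses, and compare the resulting p-value to a threshold. It is a plain, unweighted permutation test on simulated data with a hypothetical statistic split_stat. It ignores weights, strata, and clusters, so it is not the design-adjusted procedure the package uses.

```r
set.seed(42)
n <- 100
x <- rnorm(n)                       # hypothetical candidate splitting variable
y <- 2 * x + rnorm(n)               # hypothetical response related to x

# Test statistic: absolute difference in mean response across a median split
split_stat <- function(y, x) {
  left <- x <= median(x)
  abs(mean(y[left]) - mean(y[!left]))
}

perm_reps <- 1000L
observed  <- split_stat(y, x)
# Permuting y breaks any association between y and x
permuted  <- replicate(perm_reps, split_stat(sample(y), x))

# Permutation p-value with the usual +1 correction
p_value <- (1 + sum(permuted >= observed)) / (perm_reps + 1)
p_value < 0.05                      # compare against a pval-style threshold
```

A smaller threshold makes splits harder to accept and so tends to give smaller trees.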
{
# Model the mean retirement account value for households with reported
# retirement account values > 0 using a binary tree, while accounting for
# clustered data and sample weights.

s1 <- which(CE$IRAX > 0)
rpms(IRAX~EDUCA+AGE+BLS_URBN, data=CE[s1,], weights=~FINLWT21, clusters=~CID)

# Model the linear fit between retirement account value and amount of income,
# conditioning on education and accounting for clustered data, for households
# with reported retirement account values > 0.

rpms(IRAX~EDUCA, e_equ=IRAX~FINCBTAX, data=CE[s1,], weights=~FINLWT21, clusters=~CID)
}
function for producing boosted rpms models (trees or random forests)
rpms_boost( rp_equ, data, weights = ~1, strata = ~1, clusters = ~1, e_equ = ~1, bin_size = NULL, gridpts = 3, perm_reps = 100L, pval = 0.05, f_size = 200L, model_type = "tree", times = 2L )
rp_equ |
formula containing all variables for partitioning |
data |
data.frame that includes variables used in rp_equ, e_equ, and design information |
weights |
formula or vector of sample weights for each observation |
strata |
formula or vector of strata labels |
clusters |
formula or vector of cluster labels |
e_equ |
formula for modeling data in each node |
bin_size |
numeric minimum number of observations in each node |
gridpts |
integer number of middle points to do in search |
perm_reps |
integer specifying the number of thousands of permutation replications to use to estimate p-value |
pval |
numeric p-value used to reject null hypothesis in permutation test |
f_size |
integer specifying the number of trees in the forest (only used if model_type is "forest") |
model_type |
string: one of "tree" or "forest" |
times |
integer specifying number of boosting levels to try. |
object of class "rpms_boost"
{
# Model the mean of retirement contributions with a binary tree while accounting
# for clustered data and sample weights.

rpms_boost(IRAX~EDUCA+AGE+BLS_URBN, data = CE, weights=~FINLWT21, clusters=~CID, pval=.01)
}
produces a random forest using rpms to create the individual trees.
rpms_forest( rp_equ, data, weights = ~1, strata = ~1, clusters = ~1, e_fn = "survLm", l_fn = NULL, bin_size = 5, f_size = 500, cores = 1 )
rp_equ |
formula containing all variables for partitioning |
data |
data.frame that includes variables used in rp_equ, e_equ, and design information |
weights |
formula or vector of sample weights for each observation |
strata |
formula or vector of strata labels |
clusters |
formula or vector of cluster labels |
e_fn |
string name of function to use for modeling (only "survLm" is operational) |
l_fn |
loss function (ignored) |
bin_size |
numeric minimum number of observations in each node |
f_size |
integer specifying the number of trees in the forest |
cores |
integer number of cores to use in parallel if > 1 (doesn't work with Windows operating systems) |
object of class "rpms_forest"
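A possible usage sketch, modeled on the rpms examples elsewhere in this manual and using only the arguments listed above. The small f_size of 20 is chosen here just to shorten the run and is not a recommendation. The call is wrapped in a requireNamespace() check so the block does nothing if the package is not installed. It has not been checked against a particular package version.

```r
forest_pred <- NULL

if (requireNamespace("rpms", quietly = TRUE)) {
  library(rpms)

  # households with reported retirement account values > 0
  s1 <- which(CE$IRAX > 0)

  # small forest (f_size = 20) to keep the run time short
  f1 <- rpms_forest(IRAX ~ EDUCA + AGE + BLS_URBN, data = CE[s1, ],
                    weights = ~FINLWT21, clusters = ~CID, f_size = 20)

  # predictions for the first 10 households used to build the forest
  forest_pred <- predict(f1, newdata = CE[s1[1:10], ])
}

forest_pred
```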
Returns a survLm_fit object with coefficients projecting new data onto splits from the given rpms model.
rpms_proj(object, newdata, weights = ~1, strata = ~1, clusters = ~1)
object |
Object inheriting from |
newdata |
data frame with variables used to estimate model |
weights |
formula or vector of sample weights for each observation |
strata |
formula or vector of strata labels |
clusters |
formula or vector of cluster labels |
survLm_fit object
Produces a regression tree model for zero-inflated data, using variables from rp_equ to partition the data and fitting the model e_equ on each node. Currently only uses data with complete cases.
rpms_zinf( rp_equ, data, weights = ~1, strata = ~1, clusters = ~1, e_equ = ~1, e_fn = "survLm", l_fn = NULL, bin_size = NULL, gridpts = 3, perm_reps = 1000L, pval = 0.05 )
rp_equ |
formula containing all variables for partitioning |
data |
data.frame that includes variables used in rp_equ, e_equ, and design information |
weights |
formula or vector of sample weights for each observation |
strata |
formula or vector of strata labels |
clusters |
formula or vector of cluster labels |
e_equ |
formula for modeling data in each node |
e_fn |
string name of function to use for modeling (only "survLm" is operational) |
l_fn |
loss function (does nothing yet) |
bin_size |
numeric minimum number of observations in each node |
gridpts |
integer number of middle points to do in search |
perm_reps |
integer specifying the number of thousands of permutation replications to use to estimate p-value |
pval |
numeric p-value used to reject null hypothesis in permutation test |
object of class "rpms_zinf"
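Zero-inflated data are often handled with a two-part model: one part for the probability of a non-zero value and one for the mean of the non-zero values. The base R sketch below shows how the two parts combine into an overall mean. The node-level numbers are made up, and treating rpms_zinf as a two-part model of this kind is an assumption for illustration, not a description of its internals.

```r
# Hypothetical end-node summaries (not output from rpms_zinf)
p_positive    <- c(0.20, 0.75)      # estimated P(y > 0) in each end node
mean_positive <- c(5000, 42000)     # estimated mean of y given y > 0

# Unconditional mean: E[y] = P(y > 0) * E[y | y > 0]
expected_y <- p_positive * mean_positive
expected_y
```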