--- title: "Getting Started with ModeLLtest" author: "Shana Scogin" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{getting_started} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", width = 68) options(width = 68, cli.unicode = FALSE, cli.width = 68) ``` ## Introduction `modeLLtest` is an R package which implements model comparison tests. This package includes functions for the cross-validated difference in means (CVDM) test and the cross-validated median fit (CVMF) test. The CVDM and CVMF tests assist researchers and students in selecting among models describing the same process. Selection among estimation methods describing the same process is a crucial methodological step within the social sciences. Other tools in `modeLLtest` include a function to output a vector of cross-validated log-likelihood (CVLL) values and a function to perform the CVDM test on an input of two CVLL vectors. The relevant papers including details can be found below: * Harden, J. J., & Desmarais, B. A. (2011). Linear models with outliers: Choosing between conditional-mean and conditional-median methods. State Politics & Policy Quarterly, 11(4), 371-389. * Desmarais, B. A., & Harden, J. J. (2012). Comparing partial likelihood and robust estimation methods for the Cox regression model. Political Analysis, 20(1), 113-135. * Desmarais, B. A., & Harden, J. J. (2014). An unbiased model comparison test using cross-validation. Quality & Quantity, 48(4), 2155-2173. For more background information, see also the following: * Clarke, K. (2007). A simple distribution-free test for nonnested hypotheses. Political Analysis 15(3):347–63. * Johnson, N. J. (1978). Modified t Tests and Confidence Intervals for Asymmetrical Populations. Journal of the American Statistical Association 73:536–44. * Smyth, P. (2000). Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing 10: 63–72. * Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307-333. ## Basic Usage This package has four main functions: `cvdm()`, `cvmf()`, `cvll()`, and `cvlldiff()`. The function `cvdm()` deploys the CVDM test, which uses a bias-corrected Johnson's t-test to choose between the leave-one-out cross-validated log-likelihood outputs between two non-nested models. The function `cvll()` outputs a vector of leave-one-out cross-validated log-likelihoods for a given method. The`cvlldiff()` function performs the bias-corrected Johnson's t-test on two vectors of cross-validated log-likelihoods. After loading the package, you can find the documentation for the functions with `?cvdm`, `?cvll`, `?cvlldiff`, or `?cvmf`. For the output, type `?cvdm_object`, `?cvll_object`, `?cvlldiff_object`, or `?cvmf_object` to view the documentation. If you encounter a bug or have comments or questions, please email me at sscogin@nd.edu. ## Examples Here are some examples of the functions in this package. First, we'll look at `cvdm()`, which applies cross-validated log-likelihood difference in means (CVDM) test to compare two methods of estimating a formula. The example compares ordinary least squares (OLS) estimation to median regression (MR). ``` library(modeLLtest) set.seed(123456) b0 <- .2 # True value for the intercept b1 <- .5 # True value for the slope n <- 500 # Sample size X <- runif(n, -1, 1) Y <- b0 + b1 * X + rnorm(n, 0, 1) # N(0, 1 error) obj_cvdm <- cvdm(Y ~ X, data.frame(cbind(Y, X)), method1 = "OLS", method2 = "MR") obj_cvdm ``` Next, let's do the same as we did above, but with `cvll()` and `cvlldiff()`. These are general functions that create vectors of the cross-validated log-likelihood functions and then compute the bias-corrected Johnson's t-test, respectively. ``` library(modeLLtest) set.seed(123456) b0 <- .2 # True value for the intercept b1 <- .5 # True value for the slope n <- 500 # Sample size X <- runif(n, -1, 1) Y <- b0 + b1 * X + rnorm(n, 0, 1) # N(0, 1 error) obj_cvll_ols <- cvll(Y ~ X, data.frame(cbind(Y, X)), method = "OLS") obj_cvll_mr <- cvll(Y ~ X, data.frame(cbind(Y, X)), method = "MR") obj_cvlldiff <- cvlldiff(obj_cvll_ols$cvll, obj_cvll_mr$cvll, obj_cvll_ols$df) obj_cvlldiff ``` Finally, let's look at the `cvmf()` function. this function compares the partial likelihood maximization (PLM) and the iteratively reweighted robust (IRR) method of estimation for a given application of the Cox model. ``` library(modeLLtest) library(survival) set.seed(12345) x1 <- rnorm(100) x2 <- rnorm(100) x2e <- x2 + rnorm(100, 0, 0.5) y <- rexp(100, exp(x1 + x2)) y <- Surv(y) # Changing y into a survival obj with survival package dat <- data.frame(y, x1, x2e) form <- y ~ x1 + x2e obj_cvfm <- cvmf(formula = form, data = dat) obj_cvfm ``` ## Data This package includes two datasets from real-world analyses to facilitate examples. These datasets are publicly available and have been included in this package with the endorsement of the authors. More on the data and original analysis can be found in the following papers: * Joshi, M., & Mason, T. D. (2008). Between democracy and revolution: peasant support for insurgency versus democracy in Nepal. Journal of Peace Research, 45(6), 765-782. * Golder, S. N. (2010). Bargaining delays in the government formation process. Comparative Political Studies, 43(1), 3-32. ## Examples with Replication Data For an example of the CVDM test utilizing real-world analysis, we can look at a study by Joshi and Mason (2008, Journal of Peace Research 45(6): 765-782). This study employs robust regression to analyze district-level election turnout among peasants in Nepal. Specifically, Joshi and Mason hypothesize that peasant dependence on landed elite for survival will result in higher voter turnout. Using their model of the 1999 parliamentary elections, we can see how the use of a robust regression is supported by the CVDM test. These data are available on the [Journal of Peace Research Replication Datasets website](https://www.prio.org/JPR/Datasets/) and have been included in the package for ease of replication. For full replication and discussion of the CVDM test, see Desmarais and Harden (2014, Quality and Quantity 48(4): 2155-2173). ``` library(MASS) library(modeLLtest) data(nepaldem) obj_cvdm_jm <- cvdm(percent_regvote1999 ~ landless_gap + below1pa_gap + sharecrop_gap + service_gap + fixmoney_gap + fixprod_gap + per_without_instcredit + totoalkilled_1000 + hdi_gap1 + ln_pop2001 + totalcontestants1999 + cast_eth_fract, data = nepaldem, method1 = "OLS", method2 = "RLM-MM") obj_cvdm_jm model_1999 <- rlm(percent_regvote1999 ~ landless_gap + below1pa_gap + sharecrop_gap + service_gap + fixmoney_gap + fixprod_gap + per_without_instcredit + totoalkilled_1000 + hdi_gap1 + ln_pop2001 + totalcontestants1999 + cast_eth_fract, data = nepaldem) model_1999 rm(nepaldem, obj_cvdm_jm, model_1999) # removing the data and objects ``` Next, we can look at a study by Golder (2010, Comparative Political Studies 43(1): 3-32) to see an example of the CVMF test. Golder employs the PLM method of estimating a Cox model to investigate Western European government formation duration. She hypothesizes that bargaining complexity leads to increasing delays in government formation as uncertainty increases. The CVMF test indicates that her choice of PLM is the better performing estimator (p < .05) compared to IRR. These data are available on the [Harvard Dataverse page (https://doi.org/10.7910/DVN/BUWZBA)](https://doi.org/10.7910/DVN/BUWZBA) and have been included in the package for ease of replication. For full replication and discussion of the CVMF test, see Desmarais and Harden (2012, Political Analysis 20(1): 113-135). ``` library(survival) library(coxrobust) library(modeLLtest) data(govtform) golder_surv <- Surv(govtform$bargainingdays) golder_x <- cbind(govtform$postelection, govtform$legislative_parties, govtform$polarization, govtform$positive_parl, govtform$post_legislative_parties, govtform$post_polariz, govtform$post_positive, govtform$continuation, govtform$singleparty_majority) colnames(golder_x) <- c("govtform$postelection", "govtform$legislative_parties", "govtform$polarization", "govtform$positive_parl", "govtform$post_legislative_parties", "govtform$post_polariz", "govtform$post_positive", "govtform$continuation", "govtform$singleparty_majority") obj_cvmf_golder <- cvmf(golder_surv ~ golder_x, method = "efron") obj_cvmf_golder govtform_plm <- coxph(golder_surv ~ golder_x, method = "efron") govtform_plm rm(govtform, golder_surv, golder_x, obj_cvmf_golder, govtform_plm) # removing the data and objects ``` ## Future Improvements Next steps for this package include adding more methods to the `cvdm()` and `cvll()` functions and optimizing functions - especially `cvmf()` - to improve speed.