---
title: "Take a `moderndive` into introductory linear regression with R"
author: "Albert Y. Kim, Chester Ismay, and Max Kuhn"
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{Take a `moderndive` into introductory linear regression with R}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
tags:
  - R
  - Rstats
  - tidyverse
  - regression
  - moderndive
  - modeling
  - modelling
  - kaggle
authors:
  - name: Albert Y. Kim
    orcid: 0000-0001-7824-306X
    affiliation: 1 
  - name: Chester Ismay
    orcid: 0000-0003-2820-2547
    affiliation: 2
  - name: Max Kuhn
    orcid: 0000-0003-2402-136X
    affiliation: 3   
affiliations:
 - name: Assistant Professor of Statistical and Data Sciences, Smith College, Northampton, MA, USA.
   index: 1
 - name: Data Science Evangelist, DataRobot, Portland, OR, USA.
   index: 2
 - name: Software Engineer, RStudio, USA.
   index: 3
bibliography: paper.bib
# output:
#   rticles::joss_article:
#     keep_md: yes
#     number_sections: yes
---

```{r, include = FALSE}
# knitr settings
knitr::opts_chunk$set(
  # Code output:
  warning = FALSE,
  message = FALSE,
  echo = TRUE,
  # Figure:
  out.width = "100%",
  fig.width = 16 / 2.5,
  fig.height = 9 / 2.5,
  fig.align = "center",
  fig.show = "hold",
  # Etc:
  collapse = TRUE,
  comment = "##"
  # tidy = FALSE
)

# Needed packages in vignette
library(moderndive)
library(ggplot2)
library(dplyr)
library(knitr)
library(broom)

# Needed packages internally
library(patchwork)

# Random number generator seed value
set.seed(76)

# Set ggplot defaults for rticles output:
if (!knitr::is_html_output()) {
  # Grey theme:
  theme_set(theme_light())

  scale_colour_discrete <- ggplot2::scale_colour_viridis_d
}


# Set output width for rticles:
options(width = 70)
```


# Summary

We present the [`moderndive`](https://moderndive.github.io/moderndive/){target="_blank"} R package of datasets and functions for [tidyverse](https://www.tidyverse.org/)-friendly introductory linear regression [@tidyverse2019]. These tools leverage the well-developed `tidyverse` and `broom` packages to facilitate 1) working with regression tables that include confidence intervals, 2) accessing regression outputs on an observation level (e.g. fitted/predicted values and residuals), 3) inspecting scalar summaries of regression fit (e.g. $R^2$, $R^2_{adj}$, and mean squared error), and 4) visualizing parallel slopes regression models using `ggplot2`-like syntax [@R-ggplot2; @R-broom]. This R package is designed to supplement the book "Statistical Inference via Data Science: A ModernDive into R and the Tidyverse" [@ismay2019moderndive]. Note that the book is also available online at https://moderndive.com and is referred to as "ModernDive" for short.


# Statement of Need

Linear regression has long been a staple of introductory statistics courses. While the curricula of introductory statistics courses has much evolved of late, the overall importance of regression remains the same [@ASAGuidelines]. Furthermore, while the use of the R statistical programming language for statistical analysis is not new, recent developments such as the `tidyverse` suite of packages have made statistical computation with R accessible to a broader audience [@tidyverse2019]. We go one step further by leveraging the `tidyverse` and the `broom` packages to make linear regression accessible to students taking an introductory statistics course [@R-broom]. Such students are likely to be new to statistical computation with R; we designed `moderndive` with these students in mind. 


# Introduction

Let's load all the R packages we are going to need.

```{r}
library(moderndive)
library(ggplot2)
library(dplyr)
library(knitr)
library(broom)
```

Let's consider data gathered from end of semester student evaluations for a sample of 463 courses taught by 94 professors from the University of Texas at Austin [@diez2015openintro]. This data is included in the `evals` data frame from the `moderndive` package.

```{r, echo=FALSE}
evals_sample <- evals %>%
  select(ID, prof_ID, score, age, bty_avg, gender, ethnicity, language, rank) %>%
  sample_n(5)
```

In the following table, we present a subset of `r ncol(evals_sample)` of the `r ncol(evals)` variables included for a random sample of `r nrow(evals_sample)` courses^[For details on the remaining `r ncol(evals) - ncol(evals_sample)` variables, see the help file by running `?evals`.]:

1. `ID` uniquely identifies the course whereas `prof_ID` identifies the professor who taught this course. This distinction is important since many professors taught more than one course.
1. `score` is the outcome variable of interest: average professor evaluation score out of 5 as given by the students in this course.
1. The remaining variables are demographic variables describing that course's instructor, including `bty_avg` (average "beauty" score) for that professor as given by a panel of 6 students.^[Note that `gender` was collected as a binary variable at the time of the study (2005).]

```{r random-sample-courses, echo=FALSE}
evals_sample %>%
  kable()
```


## Regression analysis the "good old-fashioned" way

Let's fit a simple linear regression model of teaching `score` as a function of instructor `age` using the `lm()` function.

```{r}
score_model <- lm(score ~ age, data = evals)
```

Let's now study the output of the fitted model `score_model` "the good old-fashioned way": using `summary()` which calls `summary.lm()` behind the scenes (we'll refer to them interchangeably throughout this paper).

```{r}
summary(score_model)
```


## Regression analysis using `moderndive`

As an improvement to base R's regression functions, we've included three functions in the `moderndive` package that take a fitted model object as input and return the same information as `summary.lm()`, but output them in tidyverse-friendly format [@tidyverse2019]. As we'll see later, while these three functions are thin wrappers to existing functions in the `broom` package for converting statistical objects into tidy tibbles, we modified them with the introductory statistics student in mind [@R-broom].

1. Get a tidy regression table **with confidence intervals**:
    ```{r}
get_regression_table(score_model)
    ```
2. Get information on each point/observation in your regression, including fitted/predicted values and residuals, in a single data frame:
    ```{r}
get_regression_points(score_model)
    ```
3. Get scalar summaries of a regression fit including $R^2$ and $R^2_{adj}$ but also the (root) mean-squared error:
    ```{r}
get_regression_summaries(score_model)
    ```

Furthermore, say you would like to create a visualization of the relationship between two numerical variables and a third categorical variable with $k$ levels. Let's create this using a colored scatterplot via the `ggplot2` package for data visualization [@R-ggplot2]. Using `geom_smooth(method = "lm", se = FALSE)` yields a visualization of an *interaction model* where each of the $k$ regression lines has their own intercept and slope. For example in \autoref{fig:interaction-model}, we extend our previous regression model by now mapping the categorical variable `ethnicity` to the `color` aesthetic.

```{r interaction-model, fig.cap="Visualization of interaction model."}
# Code to visualize interaction model:
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Age", y = "Teaching score", color = "Ethnicity")
```

However, many introductory statistics courses start with the easier to teach "common slope, different intercepts" regression model, also known as the *parallel slopes* model. However, no argument to plot such models exists within `geom_smooth()`.

[Evgeni Chasnovski](https://github.com/echasnovski){target="_blank"} thus wrote a custom `geom_` extension to `ggplot2` called `geom_parallel_slopes()`; this extension is included in the `moderndive` package. Much like `geom_smooth()` from the `ggplot2` package, you add `geom_parallel_slopes()` as a layer to the code, resulting in \autoref{fig:parallel-slopes-model}.

```{r parallel-slopes-model, fig.cap="Visualization of parallel slopes model."}
# Code to visualize parallel slopes model:
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE) +
  labs(x = "Age", y = "Teaching score", color = "Ethnicity")
```


# Repository README

In the GitHub repository README, we present an in-depth discussion of six features of the `moderndive` package:

1. Focus less on p-value stars, more confidence intervals
2. Outputs as tibbles
3. Produce residual analysis plots from scratch using `ggplot2`
4. A quick-and-easy Kaggle predictive modeling competition submission!
5. Visual model selection: plot parallel slopes & interaction regression models
6. Produce metrics on the quality of regression model fits

Furthermore, we discuss the inner-workings of the `moderndive` package:

1. It leverages the `broom` package in its wrappers
1. It builds a custom `ggplot2` geometry for the `geom_parallel_slopes()` function that allows for quick visualization of parallel slopes models in regression. 


# Author contributions

Albert Y. Kim and Chester Ismay contributed equally to the development of the `moderndive` package. Albert Y. Kim wrote a majority of the initial version of this manuscript with Chester Ismay writing the rest. Max Kuhn provided guidance and feedback at various stages of the package development and manuscript writing. 


# Acknowledgments

Many thanks to Jenny Smetzer [\@smetzer180](https://github.com/smetzer180){target="_blank"}, Luke W. Johnston [\@lwjohnst86](https://github.com/lwjohnst86){target="_blank"}, and Lisa Rosenthal [\@lisamr](https://github.com/lisamr){target="_blank"} for their helpful feedback for this paper and to Evgeni Chasnovski [\@echasnovski](https://github.com/echasnovski){target="_blank"} for contributing the `geom_parallel_slopes()` function via GitHub [pull request](https://github.com/moderndive/moderndive/pull/55){target="_blank"}. The authors do not have any financial support to disclose.

# References