---
title: "Comparison with dplyr Tabulation"
author: "Gabriel Becker and Adrian Waddell"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Comparison with dplyr Tabulation}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

```{r, include = FALSE}
suggested_dependent_pkgs <- c("dplyr", "tibble")
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = all(vapply(
    suggested_dependent_pkgs,
    requireNamespace,
    logical(1),
    quietly = TRUE
  ))
)
```

```{r, echo=FALSE}
knitr::opts_chunk$set(comment = "#")
```

```{css, echo=FALSE}
.reveal .r code {
    white-space: pre;
}
```

## Introduction

In this vignette, we would like to discuss the similarities and differences between `dplyr` and `rtable`.

Much of the `rtables` framework focuses on tabulation/summarizing of data and then the visualization of the table. In this vignette, we focus on summarizing data using `dplyr` and contrast it to `rtables`. We won't pay attention to the table visualization/markup and just derive the cell content.

Using `dplyr` to summarize data and `gt` to visualize the table is a good way if the tabulation is of a certain nature or complexity. However, there are tables such as the table created in the [`introduction`](https://insightsengineering.github.io/rtables/latest-tag/articles/introduction.html) vignette that take some effort to create with `dplyr`. Part of the effort is due to fact that when using `dplyr` the table data is stored in `data.frame`s or `tibble`s which is not the most natural way to represent a table as we will show in this vignette.

If you know a more elegant way of deriving the table content with `dplyr` please let us know and we will update the vignette.


```{r, message=FALSE}
library(rtables)
library(dplyr)
```

Here is the table and data used in the [`introduction`](https://insightsengineering.github.io/rtables/latest-tag/articles/introduction.html) vignette:

```{r}
n <- 400

set.seed(1)

df <- tibble(
  arm = factor(sample(c("Arm A", "Arm B"), n, replace = TRUE), levels = c("Arm A", "Arm B")),
  country = factor(sample(c("CAN", "USA"), n, replace = TRUE, prob = c(.55, .45)), levels = c("CAN", "USA")),
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE), levels = c("Female", "Male")),
  handed = factor(sample(c("Left", "Right"), n, prob = c(.6, .4), replace = TRUE), levels = c("Left", "Right")),
  age = rchisq(n, 30) + 10
) %>% mutate(
  weight = 35 * rnorm(n, sd = .5) + ifelse(gender == "Female", 140, 180)
)

lyt <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("arm") %>%
  split_cols_by("gender") %>%
  split_rows_by("country") %>%
  summarize_row_groups() %>%
  split_rows_by("handed") %>%
  summarize_row_groups() %>%
  analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
```


## Getting Started

We will start by deriving the first data cell on row 3 (note, row 1 and 2 have content cells, see the [`introduction`](https://insightsengineering.github.io/rtables/latest-tag/articles/introduction.html) vignette). Cell 3,1 contains the mean age for left handed & female Canadians in "Arm A":


```{r}
mean(df$age[df$country == "CAN" & df$arm == "Arm A" & df$gender == "Female" & df$handed == "Left"])
```

or with `dplyr`:

```{r}
df %>%
  filter(country == "CAN", arm == "Arm A", gender == "Female", handed == "Left") %>%
  summarise(mean_age = mean(age))
```

Further, `dplyr` gives us other verbs to easily get the average age of left handed Canadians for each group defined by the 4 columns:

```{r}
df %>%
  group_by(arm, gender) %>%
  filter(country == "CAN", handed == "Left") %>%
  summarise(mean_age = mean(age))
```

We can further get to all the average age cell values with:

```{r}
average_age <- df %>%
  group_by(arm, gender, country, handed) %>%
  summarise(mean_age = mean(age))

average_age
```

In `rtable` syntax, we need the following code to get to the same content:

```{r}
lyt <- basic_table() %>%
  split_cols_by("arm") %>%
  split_cols_by("gender") %>%
  split_rows_by("country") %>%
  split_rows_by("handed") %>%
  analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
```

As mentioned in the introduction to this vignette, please ignore the difference in arranging and formatting the data: it's possible to condense the `rtable` more and it is possible to make the `tibble` look more like the reference table using the `gt` R package.

In terms of tabulation for this example there was arguably not much added by `rtables` over `dplyr`.

## Content Information
Unlike in `rtables` the different levels of summarization are discrete computations in `dplyr` which we will then need to combine

We first focus on the count and percentage information for handedness within each country (for each arm-gender pair), along with the analysis row mean values:


```{r}
c_h_df <- df %>%
  group_by(arm, gender, country, handed) %>%
  summarize(mean = mean(age), c_h_count = n()) %>%
  ## we need the sum below to *not* be by country, so that we're dividing by the column counts
  ungroup(country) %>%
  # now the `handed` grouping has been removed, therefore we can calculate percent now:
  mutate(n_col = sum(c_h_count), c_h_percent = c_h_count / n_col)
c_h_df
```

which has 16 rows (cells) like the `average_age` data frame defined above. Next, we will derive the group information for countries:

```{r}
c_df <- df %>%
  group_by(arm, gender, country) %>%
  summarize(c_count = n()) %>%
  # now the `handed` grouping has been removed, therefore we can calculate percent now:
  mutate(n_col = sum(c_count), c_percent = c_count / n_col)
c_df
```

Finally, we `left_join()` the two levels of summary to get a data.frame containing the full
set of values which make up the body of our table (note, however, they are not in the same order):

```{r}
full_dplyr <- left_join(c_h_df, c_df) %>% ungroup()
```

Alternatively, we could calculate only the counts in `c_h_df`, and use
`mutate()` after the `left_join()` to divide the counts by the `n_col`
values which are more naturally calculated within `c_df`. This would
simplify `c_h_df`'s creation somewhat by not requiring the explicit
`ungroup()`, but it prevents each level of summarization from being a
self-contained set of computations.


The `rtables` call in contrast is:

```{r}
lyt <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("arm") %>%
  split_cols_by("gender") %>%
  split_rows_by("country") %>%
  summarize_row_groups() %>%
  split_rows_by("handed") %>%
  summarize_row_groups() %>%
  analyze("age", afun = mean, format = "xx.x")

tbl <- build_table(lyt, df)
tbl
```

We can now spot check that the values are the same

```{r}
frm_rtables_h <- cell_values(
  tbl,
  rowpath = c("country", "CAN", "handed", "Right", "@content"),
  colpath = c("arm", "Arm B", "gender", "Female")
)[[1]]
frm_rtables_h

frm_dplyr_h <- full_dplyr %>%
  filter(country == "CAN" & handed == "Right" & arm == "Arm B" & gender == "Female") %>%
  select(c_h_count, c_h_percent)

frm_dplyr_h


frm_rtables_c <- cell_values(
  tbl,
  rowpath = c("country", "CAN", "@content"),
  colpath = c("arm", "Arm A", "gender", "Male")
)[[1]]

frm_rtables_c

frm_dplyr_c <- full_dplyr %>%
  filter(country == "CAN" & arm == "Arm A" & gender == "Male") %>%
  select(c_count, c_percent)

frm_dplyr_c
```

```{r echo = FALSE, result="hidden"}
stopifnot(isTRUE(all.equal(frm_rtables_h, unname(unlist(frm_dplyr_h)))))
stopifnot(isTRUE(all.equal(frm_rtables_c, unname(unlist(frm_dplyr_c[1, ])))))
```

Further, the `rtable` syntax has hopefully also become a bit more straightforward to derive the cell values than with `dplyr` for this particular table.


## Summary

In this vignette learned that:

* many tables are quite easily created with `dplyr` and `data.frame` or `tibble` as data structure
  * `dplyr` keeps simple things simple
* if tables have group summaries then repeating of information is required
* `rtables` streamlines the construction of complex tables

We recommend that you continue reading the [`clinical_trials`](https://insightsengineering.github.io/rtables/latest-tag/articles/clinical_trials.html) vignette where we create a number of more advanced tables using layouts.