--- title: "Descriptive Analysis with freqtables" author: "Brad Cannell" date: "Created: 2017-11-10
Updated: `r Sys.Date()`" output: rmarkdown::html_vignette: css: styles.css vignette: > %\VignetteIndexEntry{Descriptive Analysis in a dplyr Pipeline} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- Table of contents: [Overview](#overview) [Univariate percentages and 95% log transformed confidence intervals](#one-way-log) [Univariate percentages and 95% Wald confidence intervals](#one-way-wald) [Bivariate percentages and 95% log transformed confidence intervals](#two-way-log) [Interpretation of confidence intervals](#interpretation) # Overview The `freqtables` package is designed to quickly make tables of descriptive statistics for categorical variables (i.e., counts, percentages, confidence intervals). This package is designed to work in a Tidyverse pipeline and consideration has been given to get results from R to Microsoft Word ® with minimal pain. The package currently consistes of the following functions: 1. `freq_table()`: Estimate Percentages and 95 Percent Confidence Intervals in `dplyr` Pipelines. 2. `freq_test()`: Hypothesis Testing For Frequency Tables. 3. `freq_format()`: Format freq_table Output for Publication and Dissemination. 4. `freq_group_n()`: Formatted Group Sample Sizes for Tables This vignette is not intended to be representative of every possible descriptive analysis that one may want to carry out on a given data set. Rather, it is intended to be representative of descriptive analyses that are commonly used when conducting epidemiologic research. ```{r message=FALSE} library(dplyr) library(freqtables) ``` ```{r} data(mtcars) ``` # Univariate percentages and 95% confidence intervals {#one-way-log} In this section we provide an example of calculating common univariate descriptive statistics for a single categorical variable. Again, we are assuming that we are working in a `dplyr` pipeline and that we are passing a grouped data frame to the `freq_table()` function. ## Logit transformed confidence intervals The default confidence intervals are logit transformed - matching the method used by Stata: https://www.stata.com/manuals13/rproportion.pdf ```{r} mtcars %>% freq_table(am) ``` **Interpretation of results:** `var` contains the name of the variable passed to the `freq_table()` function. `cat` contains the unique levels (values) of the variable in `var`. `n` contains a count of the number of rows in the data frame that have the value `cat` for the variable `var`. `n_total` contains the sum of `n`. `percent` = `n` / `n_total`. `se` = $\sqrt{proportion * (1 - proportion) / (n - 1)}$ `t_crit` is the critical value from Student's t distribution with `n_total` - 1 degrees of freedom. The default probability value is 0.975, which corresponds to an alpha of 0.05. `lcl` is the lower bound of the confidence interval. By default, it is a 95% confidence interval. `ucl` is the upper bound of the confidence interval. By default, it is a 95% confidence interval. **Compare to Stata:** ![](freq_am_stata.png){width=600px} ## Wald confidence intervals {#one-way-wald} Optionally, the `ci_type = "wald"` argument can be used to calculate Wald confidence intervals that match those returned by SAS. 
## Wald confidence intervals {#one-way-wald}

Optionally, the `ci_type = "wald"` argument can be used to calculate Wald confidence intervals that match those returned by SAS. The exact methods are documented here:

https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_surveyfreq_a0000000221.htm

https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_surveyfreq_a0000000217.htm

```{r}
mtcars %>% 
  freq_table(am, ci_type = "wald")
```

**Compare to SAS:**

![](freq_am_sas.png){width=600}

[top](#top)

## Returning arbitrary confidence intervals

The default behavior of `freq_table()` is to return two-sided 95% confidence intervals. However, this behavior can be adjusted to return any confidence level. For example, to return 99% confidence intervals instead, just pass 99 to the `percent_ci` parameter of `freq_table()`, as demonstrated below.

```{r}
mtcars %>% 
  freq_table(am, percent_ci = 99)
```

Notice that the lower bounds of the 99% confidence limits (34.88730 and 20.05315) are _less than_ the lower bounds of the 95% confidence limits (40.94225 and 24.50235). Likewise, the upper bounds of the 99% confidence limits (79.94685 and 65.11270) are _greater than_ the upper bounds of the 95% confidence limits (75.49765 and 59.05775).

# Bivariate percentages and 95% log transformed confidence intervals {#two-way-log}

In this section we provide an example of calculating common bivariate descriptive statistics for categorical variables. Currently, all confidence intervals for (grouped) row percentages are logit transformed, matching the method used by Stata: https://www.stata.com/manuals13/rproportion.pdf

At this time, you may pass two variables to the `...` argument of the `freq_table()` function. The first variable is labeled `row_var` in the resulting frequency table, and the second variable is labeled `col_var`. These labels are somewhat arbitrary and uninformative, but they are used to match common naming conventions. Having said that, they may change in the future.

The resulting frequency table is organized so that the n and percent of observations where the value of `col_var` equals `col_cat` are calculated within levels of `row_cat`. For example, the frequency table below tells us that there are 11 rows (`n_row`) with a value of `4` (`row_cat`) for the variable `cyl` (`row_var`). Among those 11 rows, there are 3 rows (`n`) with a value of `0` (`col_cat`) for the variable `am` (`col_var`), and 8 rows (`n`) with a value of `1` (`col_cat`) for the variable `am` (`col_var`).

```{r}
mtcars %>% 
  freq_table(cyl, am)
```
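As a quick check on the counting logic described above, the following `dplyr` sketch (an illustration only, not part of `freqtables`) reproduces `n`, `n_row`, and the row percent; the standard errors and confidence limits are left to `freq_table()`.

```{r}
# n is the count within each cyl-by-am cell; n_row is the count within each
# level of cyl; the row percent is their ratio expressed as a percentage.
mtcars %>% 
  count(cyl, am) %>% 
  group_by(cyl) %>% 
  mutate(
    n_row       = sum(n),
    percent_row = 100 * n / n_row
  ) %>% 
  ungroup()
```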
**Interpretation of results:**

* `row_var` contains the name of the first variable passed to the `...` argument of the `freq_table()` function.
* `row_cat` contains the levels (values) of the variable in `row_var`.
* `col_var` contains the name of the second variable passed to the `...` argument of the `freq_table()` function.
* `col_cat` contains the levels (values) of the variable in `col_var`.
* `n` contains a count of the number of rows in the data frame that have the value `row_cat` for the variable `row_var` _AND_ the value `col_cat` for the variable `col_var`.
* `n_row` contains the sum of `n` for each level of `row_cat`.
* `n_total` contains the sum of `n`.
* `percent_total` = `n` / `n_total` * 100.
* `se_total` = $\sqrt{proportion_{overall} * (1 - proportion_{overall}) / (n_{overall} - 1)}$
* `t_crit_total` is the critical value from Student's t distribution with `n_total` - 1 degrees of freedom. The default probability value is 0.975, which corresponds to an alpha of 0.05.
* `lcl_total` is the lower bound of the confidence interval around `percent_total`. By default, it is a 95% confidence interval.
* `ucl_total` is the upper bound of the confidence interval around `percent_total`. By default, it is a 95% confidence interval.
* `percent_row` = `n` / `n_row` * 100.
* `se_row` = $\sqrt{proportion_{row} * (1 - proportion_{row}) / (n_{row} - 1)}$
* `t_crit_row` is the critical value from Student's t distribution with `n_total` - 1 degrees of freedom. The default probability value is 0.975, which corresponds to an alpha of 0.05.
* `lcl_row` is the lower bound of the confidence interval around `percent_row`. By default, it is a 95% confidence interval.
* `ucl_row` is the upper bound of the confidence interval around `percent_row`. By default, it is a 95% confidence interval.

**Compare to Stata:**

![](freq_am_by_cyl_stata.png){width=600}

These estimates do not match those generated by SAS, which uses a different variance estimation method (https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_surveyfreq_a0000000217.htm).

![](freq_am_by_cyl_sas.png){width=600}

[top](#top)
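The Overview also lists `freq_format()`, which is intended to help move tables like those above into publication-ready form (e.g., pasting an estimate and its confidence limits into a single column for a Word table). The sketch below is an illustration only; the `recipe` string and the `name` and `digits` arguments are assumptions here, so check `?freq_format` for the authoritative interface.

```{r}
# Illustration only: format the one-way table as "percent (lcl - ucl)".
# The recipe syntax and the name/digits arguments are assumed -- see
# ?freq_format for the exact interface before relying on this.
mtcars %>% 
  freq_table(am) %>% 
  freq_format(recipe = "percent (lcl - ucl)", name = "percent_95", digits = 2) %>% 
  select(var, cat, percent_95)
```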
-------------------------------------------------------------------------------

# Interpretation of confidence intervals {#interpretation}

The following are frequentist interpretations for 95% confidence intervals taken from relevant texts and peer-reviewed journal articles.

**Biostatistics: A foundation for analysis in the health sciences**

> In repeated sampling, from a normally distributed population with a known standard deviation, 95% of all intervals will in the long run include the population mean.

Daniel, W. W., & Cross, C. L. (2013). Biostatistics: A foundation for analysis in the health sciences (Tenth). Hoboken, NJ: Wiley.

**Fundamentals of biostatistics**

> You may be puzzled at this point as to what a CI is. The parameter µ is a fixed unknown constant. How can we state that the probability that it lies within some specific interval is, for example, 95%? The important point to understand is that the boundaries of the interval depend on the sample mean and sample variance and vary from sample to sample. Furthermore, 95% of such intervals that could be constructed from repeated random samples of size n contain the parameter µ.

> The idea is that over a large number of hypothetical samples of size 10, 95% of such intervals contain the parameter µ. Any one interval from a particular sample may or may not contain the parameter µ. In Figure 6.7, by chance all five intervals contain the parameter µ. However, with additional random samples this need not be the case. Therefore, we cannot say there is a 95% chance that the parameter µ will fall within a particular 95% CI. However, we can say the following: The length of the CI gives some idea of the precision of the point estimate x. In this particular case, the length of each CI ranges from 20 to 47 oz, which makes the precision of the point estimate x doubtful and implies that a larger sample size is needed to get a more precise estimate of µ.

Rosner, B. (2015). Fundamentals of biostatistics (Eighth). MA: Cengage Learning.

**Statistical modeling: A fresh approach**

> Treat the confidence interval just as an indication of the precision of the measurement. If you do a study that finds a statistic of 17 ± 6 and someone else does a study that gives 23 ± 5, then there is little reason to think that the two studies are inconsistent. On the other hand, if your study gives 17 ± 2 and the other study is 23 ± 1, then something seems to be going on; you have a genuine disagreement on your hands.

Kaplan, D. T. (2017). Statistical modeling: A fresh approach (Second). Project MOSAIC Books.

**Modern epidemiology**

> If the underlying statistical model is correct and there is no bias, a confidence interval derived from a valid test will, over unlimited repetitions of the study, contain the true parameter with a frequency no less than its confidence level. This definition specifies the coverage property of the method used to generate the interval, not the probability that the true parameter value lies within the interval. For example, if the confidence level of a valid confidence interval is 90%, the frequency with which the interval will contain the true parameter will be at least 90%, if there is no bias. Consequently, under the assumed model for random variability (e.g., a binomial model, as described in Chapter 14) and with no bias, we should expect the confidence interval to include the true parameter value in at least 90% of replications of the process of obtaining the data. Unfortunately, this interpretation for the confidence interval is based on probability models and sampling properties that are seldom realized in epidemiologic studies; consequently, it is preferable to view the confidence limits as only a rough estimate of the uncertainty in an epidemiologic result due to random error alone. Even with this limited interpretation, the estimate depends on the correctness of the statistical model, which may be incorrect in many epidemiologic settings (Greenland, 1990).

> Furthermore, exact 95% confidence limits for the true rate ratio are 0.7–13. The fact that the null value (which, for the rate ratio, is 1.0) is within the interval tells us the outcome of the significance test: The estimate would not be statistically significant at the 1 - 0.95 = 0.05 alpha level. The confidence limits, however, indicate that these data, although statistically compatible with no association, are even more compatible with a strong association — assuming that the statistical model used to construct the limits is correct. Stating the latter assumption is important because confidence intervals, like P-values, do nothing to address biases that may be present.

> Indeed, because statistical hypothesis testing promotes so much misinterpretation, we recommend avoiding its use in epidemiologic presentations and research reports. Such avoidance requires that P-values (when used) be presented without reference to alpha levels or “statistical significance,” and that careful attention be paid to the confidence interval, especially its width and its endpoints (the confidence limits) (Altman et al., 2000; Poole, 2001c).

> An astute investigator may properly ask what frequency interpretations have to do with the single study under analysis. It is all very well to say that an interval estimation procedure will, in 95% of repetitions, produce limits that contain the true parameter. But in analyzing a given study, the relevant scientific question is this: Does the single pair of limits produced from this one study contain the true parameter? The ordinary (frequentist) theory of confidence intervals does not answer this question. The question is so important that many (perhaps most) users of confidence intervals mistakenly interpret the confidence level of the interval as the probability that the answer to the question is “yes.” It is quite tempting to say that the 95% confidence limits computed from a study contain the true parameter with 95% probability. Unfortunately, this interpretation can be correct only for Bayesian interval estimates (discussed later and in Chapter 18), which often diverge from ordinary confidence intervals.

Rothman, K. J., Greenland, S., & Lash, T. L. (2008). Modern epidemiology (Third). Philadelphia, PA: Lippincott Williams & Wilkins.

**Greenland, 2016**

> The specific 95 % confidence interval presented by a study has a 95% chance of containing the true effect size. No! A reported confidence interval is a range between two numbers. The frequency with which an observed interval (e.g., 0.72–2.88) contains the true effect is either 100% if the true effect is within the interval or 0% if not; the 95% refers only to how often 95% confidence intervals computed from very many studies would contain the true size if all the assumptions used to compute the intervals were correct.

> The 95 % confidence intervals from two subgroups or studies may overlap substantially and yet the test for difference between them may still produce P < 0.05. Suppose, for example, two 95 % confidence intervals for means from normal populations with known variances are (1.04, 4.96) and (4.16, 19.84); these intervals overlap, yet the test of the hypothesis of no difference in effect across studies gives P = 0.03. As with P values, comparison between groups requires statistics that directly test and estimate the differences across groups. It can, however, be noted that if the two 95 % confidence intervals fail to overlap, then when using the same assumptions used to compute the confidence intervals we will find P < 0.05 for the difference; and if one of the 95% intervals contains the point estimate from the other group or study, we will find P > 0.05 for the difference.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3
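The overlap example in the quote above can be checked with a few lines of R. This is only a verification sketch; following the quote, it assumes each 95% interval was built as mean ± 1.96 × SE with known variances.

```{r}
# Recover the means and standard errors implied by the two intervals
# (assuming 95% CI = mean +/- 1.96 * SE), then test the difference in means.
m1 <- mean(c(1.04, 4.96));  se1 <- (4.96 - 1.04) / (2 * 1.96)
m2 <- mean(c(4.16, 19.84)); se2 <- (19.84 - 4.16) / (2 * 1.96)
z  <- (m2 - m1) / sqrt(se1^2 + se2^2)
2 * pnorm(-abs(z))  # Approximately 0.03, as stated in the quote
```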
**Bottom Line**

Give the point estimate along with the 95% confidence interval. Say NOTHING about statistical significance. Write some kind of statement about the data's compatibility with the model. For example, "the confidence limits, however, indicate that these data, although statistically compatible with no association, are even more compatible with a strong association — assuming that the statistical model used to construct the limits is correct."