Statistical test workflows

Introduction

testflow provides statistical testing workflows organized by study design.

One numerical variable

library(testflow)
cardio <- make_cardio_data()
test_one_sample(cardio, sbp_3m, mu = 140)
#> Statistical test workflow
#> 
#> Outcome: sbp_3m
#> Design: one numerical sample
#> 
#> Assumptions
#> * Independence of observations: assumed: Assumed from study design.
#> * Normality: sbp_3m: acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.99; p=0.308)
#> * Symmetry of deviations: not checked: Normality made the symmetry check unnecessary. (method=Signed-rank teaching note)
#> 
#> Recommended test
#> One-sample t-test
#> 
#> Result
#> H0: the population mean or location of sbp_3m equals the reference value.
#> statistic = -0.46, df = 179.00, p = 0.646, 95% CI [136.47, 142.20]
#> 
#> Effect size
#> Cohen's d: -0.03, negligible
#> 
#> Report
#> The one numerical sample workflow for sbp_3m did not show a statistically significant result using One-sample t-test, statistic = -0.46, df = 179.00, p = 0.646. The 95% confidence interval was [136.47, 142.20]. The effect size was negligible (Cohen's d = -0.03). H0: the population mean or location of sbp_3m equals the reference value.
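
The decision path in the output above (check normality, then pick a test) can be approximated in base R. This is a minimal sketch, not testflow's implementation: the simulated vector `sbp`, the 0.05 cutoff, and the fallback to a Wilcoxon signed-rank test are assumptions made for illustration.

```r
# Simulated systolic blood pressure values (hypothetical, not the cardio data).
set.seed(1)
sbp <- rnorm(180, mean = 139, sd = 19)

# Step 1: check approximate normality with Shapiro-Wilk.
normality <- shapiro.test(sbp)

# Step 2: choose the test. The 0.05 cutoff is an assumed convention.
if (normality$p.value > 0.05) {
  result <- t.test(sbp, mu = 140)
} else {
  result <- wilcox.test(sbp, mu = 140)
}

# One common definition of Cohen's d for one sample:
# distance of the sample mean from the reference value, in SD units.
cohens_d <- (mean(sbp) - 140) / sd(sbp)

result$method
cohens_d
```

The exact effect-size formulas used by the package are documented separately, so treat `cohens_d` here as one common definition rather than the package's own.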

Two independent groups

test_two_groups(sbp_3m ~ sex, data = cardio)
#> Statistical test workflow
#> 
#> Outcome: sbp_3m
#> Group: sex
#> Design: two independent groups
#> 
#> Assumptions
#> * Independence of observations: assumed: Assumed from study design.
#> * Normality: sbp_3m (female): acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.99; p=0.913)
#> * Normality: sbp_3m (male): acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.98; p=0.233)
#> * Variance homogeneity: acceptable: Variance homogeneity looks reasonable. (method=Levene test; statistic=1.57)
#> * Extreme outliers: warning: 4 potential outlier(s) flagged by IQR. (IQR rule, n = 4)
#> * Variance ratio check: acceptable: Variance ratio looks reasonable. (statistic=1.27)
#> 
#> Recommended test
#> Student independent t-test
#> 
#> Result
#> H0: the population mean or location of sbp_3m is equal across levels of sex.
#> statistic = -1.91, df = 178.00, p = 0.058, 95% CI [-11.22, 0.18]
#> 
#> Effect size
#> Cohen's d: -0.29, small
#> 
#> Report
#> The two independent groups workflow for sbp_3m did not show a statistically significant result using Student independent t-test, statistic = -1.91, df = 178.00, p = 0.058. The 95% confidence interval was [-11.22, 0.18]. The effect size was small (Cohen's d = -0.29). H0: the population mean or location of sbp_3m is equal across levels of sex.
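
A rough base R counterpart of the checks above is sketched below. It assumes simulated data, a median-centred Levene-type check computed by hand, and a 0.05 cutoff for choosing between the Student and Welch t-tests; none of this is taken from testflow's source.

```r
# Simulated two-group data (hypothetical, not the cardio data).
set.seed(2)
dat <- data.frame(
  sex = factor(rep(c("female", "male"), each = 90)),
  sbp = c(rnorm(90, 136, 19), rnorm(90, 141, 20))
)

# Levene-type check: one-way ANOVA on absolute deviations from group medians.
dat$abs_dev <- abs(dat$sbp - ave(dat$sbp, dat$sex, FUN = median))
levene_p <- anova(lm(abs_dev ~ sex, data = dat))[["Pr(>F)"]][1]

# Assumed rule: equal variances -> Student t-test, otherwise Welch.
equal_var <- levene_p > 0.05
result <- t.test(sbp ~ sex, data = dat, var.equal = equal_var)

# Cohen's d using the pooled standard deviation.
n <- tapply(dat$sbp, dat$sex, length)
m <- tapply(dat$sbp, dat$sex, mean)
v <- tapply(dat$sbp, dat$sex, var)
pooled_sd <- sqrt(sum((n - 1) * v) / (sum(n) - 2))
cohens_d <- unname((m["female"] - m["male"]) / pooled_sd)

result$method
cohens_d
```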

Paired measurements

test_paired(sbp_3m ~ sbp_baseline, data = cardio)
#> Statistical test workflow
#> 
#> Outcome: sbp_3m - sbp_baseline
#> Design: paired measurements
#> 
#> Assumptions
#> * Independence of observations: assumed: Paired observations from the same subjects are assumed by design.
#> * Normality: diff: acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.99; p=0.557)
#> * Symmetry of paired differences: not checked: Normality made the symmetry check unnecessary.
#> * Extreme outliers: warning: 1 potential outlier(s) flagged by IQR. (IQR rule, n = 1)
#> 
#> Recommended test
#> Paired t-test
#> 
#> Result
#> H0: the mean or median paired difference (sbp_3m - sbp_baseline) equals 0.
#> statistic = -9.20, df = 179.00, p = <0.001, 95% CI [-9.53, -6.16]
#> 
#> Effect size
#> Cohen's dz: -0.69, moderate
#> 
#> Report
#> The paired measurements workflow for sbp_3m - sbp_baseline showed a statistically significant result using Paired t-test, statistic = -9.20, df = 179.00, p = <0.001. The 95% confidence interval was [-9.53, -6.16]. The effect size was moderate (Cohen's dz = -0.69). H0: the mean or median paired difference (sbp_3m - sbp_baseline) equals 0.
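
For paired data, Cohen's dz and the paired t statistic are tied together by t = dz * sqrt(n), which is consistent with the values printed above (-9.20 / sqrt(180) is about -0.69). The sketch below checks that identity on simulated data; the variable names are hypothetical.

```r
# Simulated paired measurements (hypothetical, not the cardio data).
set.seed(9)
baseline <- rnorm(60, 146, 18)
followup <- baseline - 8 + rnorm(60, 0, 11)

# Paired differences and the paired t-test.
d <- followup - baseline
res <- t.test(followup, baseline, paired = TRUE)

# Cohen's dz: mean difference divided by the SD of the differences.
dz <- mean(d) / sd(d)

# The t statistic equals dz * sqrt(n).
all.equal(unname(res$statistic), dz * sqrt(length(d)))
```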

More than two groups

test_groups(sbp_3m ~ treatment, data = cardio)
#> Statistical test workflow
#> 
#> Outcome: sbp_3m
#> Group: treatment
#> Design: more than two independent groups
#> 
#> Assumptions
#> * Independence of observations: assumed: Assumed from study design.
#> * Normality: sbp_3m (lifestyle): not acceptable: Normality may be violated. (method=Shapiro-Wilk; statistic=0.96; p=0.030)
#> * Normality: sbp_3m (medication): acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.98; p=0.647)
#> * Normality: sbp_3m (usual care): acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.98; p=0.349)
#> * Variance homogeneity: acceptable: Variance homogeneity looks reasonable. (method=Levene test; statistic=1.20)
#> * Bartlett test: acceptable: Variance homogeneity looks reasonable. (method=Bartlett test; statistic=0.66)
#> * Extreme outliers: warning: 4 potential outlier(s) flagged by IQR. (IQR rule, n = 4)
#> 
#> Recommended test
#> Kruskal-Wallis test
#> 
#> Result
#> H0: the population mean or location of sbp_3m is equal across levels of treatment.
#> statistic = 7.58, df = 2.00, p = 0.023
#> 
#> Effect size
#> Kruskal epsilon squared: 0.03, small
#> 
#> Report
#> The more than two independent groups workflow for sbp_3m showed a statistically significant result using Kruskal-Wallis test, statistic = 7.58, df = 2.00, p = 0.023. The effect size was small (Kruskal epsilon squared = 0.03). H0: the population mean or location of sbp_3m is equal across levels of treatment.
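
In the output above, one group fails the normality check and the workflow recommends Kruskal-Wallis. One plausible reading of that behaviour is sketched below in base R. The rule "use ANOVA only if every group passes Shapiro-Wilk at 0.05" is an assumption, as is the simulated data.

```r
# Simulated three-group data with one skewed group (hypothetical).
set.seed(3)
dat <- data.frame(
  treatment = factor(rep(c("lifestyle", "medication", "usual care"), each = 60)),
  sbp = c(rexp(60, 1 / 20) + 120, rnorm(60, 135, 18), rnorm(60, 142, 18))
)

# Shapiro-Wilk p-value within each treatment group.
normal_p <- tapply(dat$sbp, dat$treatment, function(x) shapiro.test(x)$p.value)

# Assumed rule: one-way ANOVA only if every group passes at 0.05.
if (all(normal_p > 0.05)) {
  result <- oneway.test(sbp ~ treatment, data = dat, var.equal = TRUE)
} else {
  result <- kruskal.test(sbp ~ treatment, data = dat)
}

normal_p
result$method
```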

Factorial designs

test_factorial(sbp_3m ~ sex * treatment, data = cardio)
#> Statistical test workflow
#> 
#> Outcome: sbp_3m
#> Group: sex, treatment
#> Design: factorial design
#> 
#> Assumptions
#> * Independence of observations: assumed: Assumed from study design.
#> * Normality of residuals: acceptable: Residuals appear approximately normal. (method=Shapiro-Wilk; statistic=0.99; p=0.560)
#> * Variance homogeneity: acceptable: Variance homogeneity looks reasonable. (method=Levene test; statistic=1.57; p=0.211; Df1=1; Df2=178)
#> * Balanced design: not required: Cell sizes are unbalanced; the workflow still reports the design.
#> 
#> Recommended test
#> Factorial ANOVA
#> 
#> Result
#> H0: the population mean or location of sbp_3m is equal across levels of sex, treatment.
#> statistic = 3.78, df = 1.00, p = 0.053
#> 
#> Effect size
#> eta squared: 0.02, small
#> 
#> Report
#> The factorial design workflow for sbp_3m did not show a statistically significant result using Factorial ANOVA, statistic = 3.78, df = 1.00, p = 0.053. The effect size was small (eta squared = 0.02). H0: the population mean or location of sbp_3m is equal across levels of sex, treatment.

Repeated measurements

test_repeated(cardio, c(sbp_baseline, sbp_3m, sbp_6m), id = id)
#> Statistical test workflow
#> 
#> Outcome: sbp_baseline, sbp_3m, sbp_6m
#> Group: time
#> Design: repeated numeric measurements
#> 
#> Assumptions
#> * Independence of observations: assumed: Repeated measurements from the same subjects are assumed by design.
#> * Normality: sbp_3m: acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.99; p=0.308)
#> * Normality: sbp_6m: acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=1.00; p=0.842)
#> * Normality: sbp_baseline: acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.99; p=0.732)
#> * Sphericity: not checked: Sphericity is not checked here; use this as a teaching note unless a formal test is added.
#> 
#> Recommended test
#> Repeated-measures ANOVA
#> 
#> Result
#> H0: the population mean or location of sbp_baseline, sbp_3m, sbp_6m is equal across levels of time.
#> statistic = 3.76, df = 2.00, p = 0.024
#> 
#> Effect size
#> eta squared: 0.05, small
#> 
#> Report
#> The repeated numeric measurements workflow for sbp_baseline, sbp_3m, sbp_6m showed a statistically significant result using Repeated-measures ANOVA, statistic = 3.76, df = 2.00, p = 0.024. The effect size was small (eta squared = 0.05). H0: the population mean or location of sbp_baseline, sbp_3m, sbp_6m is equal across levels of time.

The repeated numeric workflow chooses repeated-measures ANOVA when the normality check at every time point is acceptable, and the Friedman test otherwise. Post-hoc comparisons use paired t-tests in the parametric branch and paired Wilcoxon signed-rank tests in the non-parametric branch.
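
The branching described here can be sketched in base R as follows. The simulated data, the 0.05 cutoff, and the Holm adjustment of the post-hoc p-values are assumptions for illustration, not documented testflow behaviour.

```r
# Simulated repeated measurements in wide format (hypothetical).
set.seed(4)
n <- 40
subject_effect <- rnorm(n, 0, 10)
wide <- cbind(
  baseline = 145 + subject_effect + rnorm(n, 0, 8),
  m3       = 138 + subject_effect + rnorm(n, 0, 8),
  m6       = 136 + subject_effect + rnorm(n, 0, 8)
)

# Long format: one row per subject and time point, ordered by time then id.
long <- data.frame(
  id   = factor(rep(seq_len(n), times = ncol(wide))),
  time = factor(rep(colnames(wide), each = n), levels = colnames(wide)),
  sbp  = as.vector(wide)
)

# Normality check at each time point (assumed 0.05 cutoff).
normal_ok <- all(apply(wide, 2, function(x) shapiro.test(x)$p.value) > 0.05)

if (normal_ok) {
  # Parametric branch: repeated-measures ANOVA, then paired t-tests.
  omnibus <- summary(aov(sbp ~ time + Error(id / time), data = long))
  posthoc <- pairwise.t.test(long$sbp, long$time, paired = TRUE,
                             p.adjust.method = "holm")
} else {
  # Non-parametric branch: Friedman test, then paired Wilcoxon tests.
  omnibus <- friedman.test(wide)
  posthoc <- pairwise.wilcox.test(long$sbp, long$time, paired = TRUE,
                                  p.adjust.method = "holm")
}

posthoc$p.value
```

The paired post-hoc calls rely on the rows being in the same subject order within each time point, which the construction of `long` guarantees.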

Categorical outcomes

test_categorical(treatment ~ controlled_3m, data = cardio)
#> Statistical test workflow
#> 
#> Outcome: treatment
#> Group: controlled_3m
#> Design: two categorical variables
#> 
#> Assumptions
#> * Independence of observations: assumed: Assumed from study design.
#> * Expected cell counts: acceptable: Chi-square approximation is reasonable. (method=Pearson chi-square approximation; Min expected = 26.1)
#> 
#> Recommended test
#> Chi-square test of independence
#> 
#> Result
#> H0: treatment and controlled_3m are independent.
#> statistic = 5.02, df = 2.00, p = 0.081
#> 
#> Effect size
#> Cramer's V: 0.17, small
#> 
#> Report
#> The two categorical variables workflow for treatment did not show a statistically significant result using Chi-square test of independence, statistic = 5.02, df = 2.00, p = 0.081. The effect size was small (Cramer's V = 0.17). H0: treatment and controlled_3m are independent.
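
The expected-count check and Cramer's V can be reproduced by hand in base R. The sketch below uses simulated factors; the "all expected counts at least 5" rule and the fallback to Fisher's exact test are assumptions, not something the output above confirms. The V formula agrees with the printed values: sqrt(5.02 / (180 * 1)) is about 0.17.

```r
# Simulated categorical data (hypothetical, not the cardio data).
set.seed(5)
treatment <- factor(sample(c("lifestyle", "medication", "usual care"),
                           180, replace = TRUE))
controlled <- factor(sample(c("no", "yes"), 180, replace = TRUE))
tab <- table(treatment, controlled)

# Pearson chi-square test; the expected counts come with the result.
chi <- chisq.test(tab, correct = FALSE)

# Assumed rule of thumb: fall back to Fisher's exact test if any
# expected count is below 5.
if (min(chi$expected) >= 5) {
  result <- chi
} else {
  result <- fisher.test(tab)
}

# Cramer's V from the chi-square statistic.
n <- sum(tab)
cramers_v <- sqrt(unname(chi$statistic) / (n * (min(dim(tab)) - 1)))

min(chi$expected)
cramers_v
```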

Repeated categorical outcomes

test_repeated_categorical(cardio, c(controlled_baseline, controlled_3m, controlled_6m))
#> Statistical test workflow
#> 
#> Outcome: controlled_baseline, controlled_3m, controlled_6m
#> Design: repeated categorical measurements
#> 
#> Assumptions
#> * Repeated binary measurements: assumed: Same subjects should be measured at 3 or more time points.
#> * Complete repeated data: acceptable: Missingness should be handled explicitly or via complete-case analysis.
#> 
#> Recommended test
#> Cochran Q test
#> 
#> Result
#> H0: the success proportions are equal across repeated categorical measures.
#> statistic = 39.58, df = 2.00, p = <0.001
#> 
#> Effect size
#> Cochran Q Kendall's W: 0.11, small
#> 
#> Report
#> The repeated categorical measurements workflow for controlled_baseline, controlled_3m, controlled_6m showed a statistically significant result using Cochran Q test, statistic = 39.58, df = 2.00, p = <0.001. The effect size was small (Cochran Q Kendall's W = 0.11). H0: the success proportions are equal across repeated categorical measures.

The repeated categorical workflow uses Cochran Q for binary repeated outcomes and pairwise McNemar tests for follow-up comparisons.
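
Base R has no built-in Cochran Q function, so the sketch below computes the statistic from its textbook formula and follows up with pairwise McNemar tests. The simulated 0/1 matrix, the Holm adjustment, and the conversion to Kendall's W are assumptions for illustration; the package's exact formulas may differ.

```r
# Simulated binary outcomes at three time points (hypothetical).
set.seed(6)
n <- 100
x <- cbind(
  baseline = rbinom(n, 1, 0.35),
  m3       = rbinom(n, 1, 0.55),
  m6       = rbinom(n, 1, 0.65)
)

# Cochran's Q from column totals, row totals, and the grand total.
k <- ncol(x)
col_tot <- colSums(x)
row_tot <- rowSums(x)
N <- sum(x)
Q <- (k - 1) * (k * sum(col_tot^2) - N^2) / (k * N - sum(row_tot^2))
p_value <- pchisq(Q, df = k - 1, lower.tail = FALSE)

# One common conversion of Q to Kendall's W (assumed, not confirmed).
kendall_w <- Q / (n * (k - 1))

# Pairwise McNemar tests on each pair of time points.
pairs <- combn(colnames(x), 2, simplify = FALSE)
mcnemar_p <- sapply(pairs, function(p) {
  tab <- table(factor(x[, p[1]], levels = 0:1),
               factor(x[, p[2]], levels = 0:1))
  mcnemar.test(tab)$p.value
})
names(mcnemar_p) <- sapply(pairs, paste, collapse = " vs ")
mcnemar_adj <- p.adjust(mcnemar_p, method = "holm")

c(Q = Q, p = p_value, W = kendall_w)
mcnemar_adj
```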

Correlation

test_correlation(sbp_3m ~ age, data = cardio)
#> Statistical test workflow
#> 
#> Outcome: sbp_3m
#> Group: age
#> Design: two numeric variables
#> 
#> Assumptions
#> * Monotonic relationship: warning: Relationship may be non-monotonic. (method=Spearman correlation; statistic=793638.65; p=0.014)
#> * Extreme outliers: warning: 7 potential outlier(s) flagged by IQR. (IQR rule applied to age, sbp_3m)
#> * Normality: not required: Normality is not required for Spearman correlation.
#> 
#> Recommended test
#> Spearman Correlation
#> 
#> Result
#> H0: the correlation between age and sbp_3m is 0.
#> statistic = 793638.65, p = 0.014
#> 
#> Effect size
#> Spearman Correlation r: 0.18, small
#> 
#> Report
#> The two numeric variables workflow for sbp_3m showed a statistically significant result using Spearman Correlation, statistic = 793638.65, p = 0.014. The effect size was small (Spearman Correlation r = 0.18). H0: the correlation between age and sbp_3m is 0.
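
The statistic printed above is large because, in base R's `cor.test()`, the Spearman statistic S is on the scale of summed squared rank differences rather than the correlation itself. The two are related by rho = 1 - 6S / (n^3 - n), which is consistent with the output (S = 793638.65 and n = 180 give about 0.18). The sketch below checks this on simulated data; whether testflow calls `cor.test()` internally is an assumption.

```r
# Simulated age and blood pressure values (hypothetical).
set.seed(8)
age <- round(runif(180, 40, 80))
sbp <- 120 + 0.3 * age + rnorm(180, 0, 18)

# exact = FALSE avoids the warning about exact p-values with tied ranks.
res <- cor.test(sbp, age, method = "spearman", exact = FALSE)

# Recover rho from the reported S statistic.
n <- length(age)
rho_from_s <- 1 - 6 * unname(res$statistic) / (n^3 - n)

all.equal(rho_from_s, unname(res$estimate))
```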

Outliers

test_outliers(c(sbp_3m, ldl, crp), data = cardio)
#> Warning: `outliers` is a screening workflow, not a single hypothesis test.
#> Statistical test workflow
#> 
#> Outcome: sbp_3m, ldl, crp
#> Design: outlier screening
#> 
#> Assumptions
#> * Numeric variable: acceptable: IQR outlier detection is univariate and does not require normality.
#> * Skewness sensitivity: warning: Interpret IQR outliers with care when the distribution is strongly skewed.
#> 
#> Recommended test
#> IQR outlier detection
#> 
#> Result
#> flagged rows = 11
#> 
#> Effect size
#> * Effect size not reported.
#> 
#> Report
#> The outlier workflow flagged 11 rows for review.
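
A univariate IQR screen of the kind described above can be written in a few lines of base R. The helper `flag_iqr()` and the multiplier of 1.5 (the usual Tukey convention) are assumptions; the package may use a different multiplier or count flagged rows differently.

```r
# Hypothetical helper: TRUE where a value lies outside the IQR fences.
flag_iqr <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Simulated data with two planted extreme blood pressure values.
set.seed(7)
dat <- data.frame(
  sbp = c(rnorm(98, 138, 18), 230, 60),
  crp = rexp(100, 1 / 3)
)

# Screen each column separately, then collect rows flagged in any column.
flags <- sapply(dat, flag_iqr)
flagged_rows <- which(rowSums(flags, na.rm = TRUE) > 0)

colSums(flags)
length(flagged_rows)
```

Because the rule is applied per column, a skewed variable such as the simulated `crp` tends to produce many flags in its long tail, which is the caution the skewness warning above refers to.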

Reporting and plotting

Every workflow returns a testflow object. Pass that object to report(x), plot(x), or as_tibble(x) to get the written report, a plot, or the results as a tibble. See effect-size-formulas.Rmd for the exact formulas behind the reported effect sizes.