Statistical test workflows

Introduction

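testflow provides statistical testing workflows organized by study design. Each workflow checks the assumptions relevant to its design, recommends a test, and reports the result together with an effect size and a summary sentence.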

One numerical variable

library(testflow)
cardio <- make_cardio_data()
test_one_sample(cardio, sbp_3m, mu = 140)
#> Statistical test workflow
#> 
#> Outcome: sbp_3m
#> Design: one numerical sample
#> 
#> Assumptions
#> * Independence of observations: assumed: Assumed from study design.
#> * Normality: sbp_3m: acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.99; p=0.308)
#> * Symmetry of deviations: not checked: Normality made the symmetry check unnecessary. (method=Signed-rank teaching note)
#> 
#> Recommended test
#> One-sample t-test
#> 
#> Result
#> H0: the population mean or location of sbp_3m equals the reference value.
#> statistic = -0.46, df = 179.00, p = 0.646, 95% CI [136.47, 142.20]
#> 
#> Effect size
#> Cohen's d: -0.03, negligible
#> 
#> Report
#> The one numerical sample workflow for sbp_3m did not show a statistically significant result using One-sample t-test, statistic = -0.46, df = 179.00, p = 0.646. The 95% confidence interval was [136.47, 142.20]. The effect size was negligible (Cohen's d = -0.03). H0: the population mean or location of sbp_3m equals the reference value.
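
The output above pairs a normality check with a one-sample t-test and mentions the signed-rank test as the alternative. As a rough illustration only, the same idea can be sketched in base R. The 0.05 cut-off, the simulated data, and the variable names are assumptions made for this sketch; it is not testflow's internal code.

```r
# Simulated stand-in for a blood pressure column (not the cardio data).
set.seed(1)
sbp <- rnorm(180, mean = 139, sd = 19)
mu0 <- 140

# Assumed rule: Shapiro-Wilk p >= 0.05 -> t-test, otherwise signed-rank test.
normal_ok <- shapiro.test(sbp)$p.value >= 0.05
fit <- if (normal_ok) {
  t.test(sbp, mu = mu0)
} else {
  wilcox.test(sbp, mu = mu0)
}

# One-sample Cohen's d: distance from the reference value in SD units.
d <- (mean(sbp) - mu0) / sd(sbp)
```

A negative d means the sample mean lies below the reference value, as in the printed result.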

Two independent groups

test_two_groups(sbp_3m ~ sex, data = cardio)
#> Statistical test workflow
#> 
#> Outcome: sbp_3m
#> Group: sex
#> Design: two independent groups
#> 
#> Assumptions
#> * Independence of observations: assumed: Assumed from study design.
#> * Normality: sbp_3m (female): acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.99; p=0.913)
#> * Normality: sbp_3m (male): acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.98; p=0.233)
#> * Variance homogeneity: acceptable: Variance homogeneity looks reasonable. (method=Levene test; statistic=1.57)
#> * Extreme outliers: warning: 4 potential outlier(s) flagged by IQR. (IQR rule, n = 4)
#> * Variance ratio check: acceptable: Variance ratio looks reasonable. (statistic=1.27)
#> 
#> Recommended test
#> Student independent t-test
#> 
#> Result
#> H0: the population mean or location of sbp_3m is equal across levels of sex.
#> statistic = -1.91, df = 178.00, p = 0.058, 95% CI [-11.22, 0.18]
#> 
#> Effect size
#> Cohen's d: -0.29, small
#> 
#> Report
#> The two independent groups workflow for sbp_3m did not show a statistically significant result using Student independent t-test, statistic = -1.91, df = 178.00, p = 0.058. The 95% confidence interval was [-11.22, 0.18]. The effect size was small (Cohen's d = -0.29). H0: the population mean or location of sbp_3m is equal across levels of sex.
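
The printed assumptions suggest that the variance check decides between the Student and Welch forms of the t-test. A minimal base R sketch of that choice follows, assuming a median-centred Levene-type test and a 0.05 threshold; both are assumptions of this sketch, and the data are simulated.

```r
# Simulated two-group data (hypothetical values, not the cardio data).
set.seed(2)
dat <- data.frame(
  sex = factor(rep(c("female", "male"), each = 90)),
  sbp = c(rnorm(90, 136, 19), rnorm(90, 141, 20))
)

# Levene-type check: one-way ANOVA on absolute deviations from group medians.
abs_dev <- abs(dat$sbp - ave(dat$sbp, dat$sex, FUN = median))
levene_p <- anova(lm(abs_dev ~ dat$sex))[["Pr(>F)"]][1]

# Assumed rule: variances look equal -> Student t-test, otherwise Welch.
equal_var <- levene_p >= 0.05
fit <- t.test(sbp ~ sex, data = dat, var.equal = equal_var)
```

`fit$method` shows which of the two variants was actually run.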

Paired measurements

test_paired(sbp_3m ~ sbp_baseline, data = cardio)
#> Statistical test workflow
#> 
#> Outcome: sbp_3m - sbp_baseline
#> Design: paired measurements
#> 
#> Assumptions
#> * Independence of observations: assumed: Paired observations from the same subjects are assumed by design.
#> * Normality: diff: acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.99; p=0.557)
#> * Symmetry of paired differences: not checked: Normality made the symmetry check unnecessary.
#> * Extreme outliers: warning: 1 potential outlier(s) flagged by IQR. (IQR rule, n = 1)
#> 
#> Recommended test
#> Paired t-test
#> 
#> Result
#> H0: the mean or median paired difference (sbp_3m - sbp_baseline) equals 0.
#> statistic = -9.20, df = 179.00, p = <0.001, 95% CI [-9.53, -6.16]
#> 
#> Effect size
#> Cohen's dz: -0.69, moderate
#> 
#> Report
#> The paired measurements workflow for sbp_3m - sbp_baseline showed a statistically significant result using Paired t-test, statistic = -9.20, df = 179.00, p = <0.001. The 95% confidence interval was [-9.53, -6.16]. The effect size was moderate (Cohen's dz = -0.69). H0: the mean or median paired difference (sbp_3m - sbp_baseline) equals 0.
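
For a paired t-test, Cohen's dz is the mean difference divided by the standard deviation of the differences, which equals the t statistic divided by the square root of the number of pairs. The sketch below checks that identity on simulated data (hypothetical values) and then applies it to the numbers printed above, where df = 179 implies 180 pairs.

```r
# Simulated paired measurements (hypothetical values, not the cardio data).
set.seed(3)
baseline <- rnorm(180, 147, 18)
follow_up <- baseline - 8 + rnorm(180, 0, 11)

diffs <- follow_up - baseline
fit <- t.test(follow_up, baseline, paired = TRUE)

# Two equivalent ways to obtain dz.
dz <- mean(diffs) / sd(diffs)
dz_from_t <- unname(fit$statistic) / sqrt(length(diffs))

# Applied to the printed result: t = -9.20 with n = 180 pairs.
dz_reported <- round(-9.20 / sqrt(180), 2)  # -0.69
```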

More than two groups

test_groups(sbp_3m ~ treatment, data = cardio)
#> Statistical test workflow
#> 
#> Outcome: sbp_3m
#> Group: treatment
#> Design: more than two independent groups
#> 
#> Assumptions
#> * Independence of observations: assumed: Assumed from study design.
#> * Normality: sbp_3m (lifestyle): not acceptable: Normality may be violated. (method=Shapiro-Wilk; statistic=0.96; p=0.030)
#> * Normality: sbp_3m (medication): acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.98; p=0.647)
#> * Normality: sbp_3m (usual care): acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.98; p=0.349)
#> * Variance homogeneity: acceptable: Variance homogeneity looks reasonable. (method=Levene test; statistic=1.20)
#> * Bartlett test: acceptable: Variance homogeneity looks reasonable. (method=Bartlett test; statistic=0.66)
#> * Extreme outliers: warning: 4 potential outlier(s) flagged by IQR. (IQR rule, n = 4)
#> 
#> Recommended test
#> Kruskal-Wallis test
#> 
#> Result
#> H0: the population mean or location of sbp_3m is equal across levels of treatment.
#> statistic = 7.58, df = 2.00, p = 0.023
#> 
#> Effect size
#> Kruskal epsilon squared: 0.03, small
#> 
#> Report
#> The more than two independent groups workflow for sbp_3m showed a statistically significant result using Kruskal-Wallis test, statistic = 7.58, df = 2.00, p = 0.023. The effect size was small (Kruskal epsilon squared = 0.03). H0: the population mean or location of sbp_3m is equal across levels of treatment.
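
In the output above, one group fails the normality check and the workflow recommends Kruskal-Wallis. A plausible reading is that any failed per-group check triggers the rank-based test. The base R sketch below illustrates that reading with simulated data; the rule, the 0.05 threshold, and the variable names are assumptions, not a description of testflow's implementation.

```r
# Simulated three-group data; the first group is deliberately skewed.
set.seed(4)
dat <- data.frame(
  treatment = factor(rep(c("lifestyle", "medication", "usual care"), each = 60)),
  sbp = c(rexp(60, 1 / 20) + 120, rnorm(60, 135, 18), rnorm(60, 142, 18))
)

# Shapiro-Wilk p-value within each group.
shapiro_p <- tapply(dat$sbp, dat$treatment, function(x) shapiro.test(x)$p.value)

# Assumed rule: all groups pass -> one-way ANOVA, otherwise Kruskal-Wallis.
use_anova <- all(shapiro_p >= 0.05)
fit <- if (use_anova) {
  oneway.test(sbp ~ treatment, data = dat, var.equal = TRUE)
} else {
  kruskal.test(sbp ~ treatment, data = dat)
}
```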

Factorial designs

test_factorial(sbp_3m ~ sex * treatment, data = cardio)
#> Statistical test workflow
#> 
#> Outcome: sbp_3m
#> Group: sex, treatment
#> Design: factorial design
#> 
#> Assumptions
#> * Independence of observations: assumed: Assumed from study design.
#> * Normality of residuals: acceptable: Residuals appear approximately normal. (method=Shapiro-Wilk; statistic=0.99; p=0.560)
#> * Variance homogeneity: acceptable: Variance homogeneity looks reasonable. (method=Levene test; statistic=1.57; p=0.211; Df1=1; Df2=178)
#> * Balanced design: not required: Cell sizes are unbalanced; the workflow still reports the design.
#> 
#> Recommended test
#> Factorial ANOVA
#> 
#> Result
#> H0: the population mean or location of sbp_3m is equal across levels of sex, treatment.
#> statistic = 3.78, df = 1.00, p = 0.053
#> 
#> Effect size
#> eta squared: 0.02, small
#> 
#> Report
#> The factorial design workflow for sbp_3m did not show a statistically significant result using Factorial ANOVA, statistic = 3.78, df = 1.00, p = 0.053. The effect size was small (eta squared = 0.02). H0: the population mean or location of sbp_3m is equal across levels of sex, treatment.

Repeated measurements

test_repeated(cardio, c(sbp_baseline, sbp_3m, sbp_6m), id = id)
#> Statistical test workflow
#> 
#> Outcome: sbp_baseline, sbp_3m, sbp_6m
#> Group: time
#> Design: repeated numeric measurements
#> 
#> Assumptions
#> * Independence of observations: assumed: Repeated measurements from the same subjects are assumed by design.
#> * Normality: sbp_3m: acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.99; p=0.308)
#> * Normality: sbp_6m: acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=1.00; p=0.842)
#> * Normality: sbp_baseline: acceptable: Approximate normality looks reasonable. (method=Shapiro-Wilk; statistic=0.99; p=0.732)
#> * Sphericity: not checked: Sphericity is not checked here; use this as a teaching note unless a formal test is added.
#> 
#> Recommended test
#> Repeated-measures ANOVA
#> 
#> Result
#> H0: the population mean or location of sbp_baseline, sbp_3m, sbp_6m is equal across levels of time.
#> statistic = 3.76, df = 2.00, p = 0.024
#> 
#> Effect size
#> eta squared: 0.05, small
#> 
#> Report
#> The repeated numeric measurements workflow for sbp_baseline, sbp_3m, sbp_6m showed a statistically significant result using Repeated-measures ANOVA, statistic = 3.76, df = 2.00, p = 0.024. The effect size was small (eta squared = 0.05). H0: the population mean or location of sbp_baseline, sbp_3m, sbp_6m is equal across levels of time.

The repeated numeric workflow chooses repeated-measures ANOVA when the normality checks at every time point are acceptable and the Friedman test otherwise. Post-hoc comparisons use paired t-tests in the parametric branch and paired Wilcoxon signed-rank tests in the non-parametric branch.
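
The branching described here can be approximated in base R as follows. This is a sketch under stated assumptions: the data are simulated, the 0.05 threshold is assumed, and the Holm adjustment is simply the default of `pairwise.t.test()` and `pairwise.wilcox.test()`, not something the text above specifies.

```r
# Simulated wide-format data: one row per subject (hypothetical values).
set.seed(5)
n <- 40
wide <- data.frame(id = factor(seq_len(n)), sbp_baseline = rnorm(n, 147, 15))
wide$sbp_3m <- wide$sbp_baseline - 6 + rnorm(n, 0, 8)
wide$sbp_6m <- wide$sbp_baseline - 4 + rnorm(n, 0, 8)

# Long format: columns are stacked, so subjects keep the same order per time.
times <- c("sbp_baseline", "sbp_3m", "sbp_6m")
long <- data.frame(
  id = rep(wide$id, times = length(times)),
  time = factor(rep(times, each = n), levels = times),
  sbp = unlist(wide[times], use.names = FALSE)
)

# Normality check at each time point.
normal_ok <- all(sapply(wide[times], function(x) shapiro.test(x)$p.value >= 0.05))

if (normal_ok) {
  # Parametric branch: repeated-measures ANOVA, then paired t-tests.
  omnibus <- summary(aov(sbp ~ time + Error(id), data = long))
  posthoc <- pairwise.t.test(long$sbp, long$time, paired = TRUE)
} else {
  # Non-parametric branch: Friedman test, then paired Wilcoxon tests.
  omnibus <- friedman.test(as.matrix(wide[times]))
  posthoc <- pairwise.wilcox.test(long$sbp, long$time, paired = TRUE)
}
```

`posthoc$p.value` holds the adjusted p-values for the three pairwise comparisons.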

Categorical outcomes

test_categorical(treatment ~ controlled_3m, data = cardio)
#> Statistical test workflow
#> 
#> Outcome: treatment
#> Group: controlled_3m
#> Design: two categorical variables
#> 
#> Assumptions
#> * Independence of observations: assumed: Assumed from study design.
#> * Expected cell counts: acceptable: Chi-square approximation is reasonable. (method=Pearson chi-square approximation; Min expected = 26.1)
#> 
#> Recommended test
#> Chi-square test of independence
#> 
#> Result
#> H0: treatment and controlled_3m are independent.
#> statistic = 5.02, df = 2.00, p = 0.081
#> 
#> Effect size
#> Cramer's V: 0.17, small
#> 
#> Report
#> The two categorical variables workflow for treatment did not show a statistically significant result using Chi-square test of independence, statistic = 5.02, df = 2.00, p = 0.081. The effect size was small (Cramer's V = 0.17). H0: treatment and controlled_3m are independent.
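
Cramér's V rescales the chi-square statistic by the sample size and the smaller table dimension: V = sqrt(chi2 / (n * min(r - 1, c - 1))). The sketch below computes it in base R on simulated data and then on the printed statistic. The helper name `cramers_v` is hypothetical, and n = 180 is an assumption inferred from the degrees of freedom shown in the earlier sections.

```r
# Simulated 3 x 2 contingency table (hypothetical values).
set.seed(6)
treatment <- factor(sample(c("lifestyle", "medication", "usual care"), 180, replace = TRUE))
controlled <- factor(sample(c("no", "yes"), 180, replace = TRUE))
tab <- table(treatment, controlled)

fit <- chisq.test(tab, correct = FALSE)
min_expected <- min(fit$expected)  # smallest expected cell count

# Hypothetical helper: Cramer's V from a chi-square statistic.
cramers_v <- function(chi2, n, n_rows, n_cols) {
  sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
}
v <- cramers_v(unname(fit$statistic), sum(tab), nrow(tab), ncol(tab))

# Applied to the printed result: chi-square = 5.02, 3 x 2 table, n assumed 180.
v_reported <- round(cramers_v(5.02, 180, 3, 2), 2)  # 0.17
```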

Repeated categorical outcomes

test_repeated_categorical(cardio, c(controlled_baseline, controlled_3m, controlled_6m))
#> Statistical test workflow
#> 
#> Outcome: controlled_baseline, controlled_3m, controlled_6m
#> Design: repeated categorical measurements
#> 
#> Assumptions
#> * Repeated binary measurements: assumed: Same subjects should be measured at 3 or more time points.
#> * Complete repeated data: acceptable: Missingness should be handled explicitly or via complete-case analysis.
#> 
#> Recommended test
#> Cochran Q test
#> 
#> Result
#> H0: the success proportions are equal across repeated categorical measures.
#> statistic = 39.58, df = 2.00, p = <0.001
#> 
#> Effect size
#> Cochran Q Kendall's W: 0.11, small
#> 
#> Report
#> The repeated categorical measurements workflow for controlled_baseline, controlled_3m, controlled_6m showed a statistically significant result using Cochran Q test, statistic = 39.58, df = 2.00, p = <0.001. The effect size was small (Cochran Q Kendall's W = 0.11). H0: the success proportions are equal across repeated categorical measures.

The repeated categorical workflow uses Cochran Q for binary repeated outcomes and pairwise McNemar tests for follow-up comparisons.
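
Base R has `mcnemar.test()` but no Cochran Q function, so the sketch below computes Q by hand from the row and column totals of a 0/1 matrix and follows it with pairwise McNemar tests. The helper `cochran_q`, the simulated data, and the Holm adjustment are assumptions of this sketch rather than details taken from testflow.

```r
# Hypothetical helper: Cochran's Q for a subjects x occasions 0/1 matrix.
cochran_q <- function(x) {
  x <- as.matrix(x)
  k <- ncol(x)
  col_tot <- colSums(x)  # successes per occasion
  row_tot <- rowSums(x)  # successes per subject
  total <- sum(x)
  q <- (k - 1) * (k * sum(col_tot^2) - total^2) / (k * total - sum(row_tot^2))
  list(statistic = q, df = k - 1, p.value = pchisq(q, k - 1, lower.tail = FALSE))
}

# Simulated binary outcomes at three occasions (hypothetical values).
set.seed(7)
n <- 180
x <- cbind(
  baseline = rbinom(n, 1, 0.30),
  m3 = rbinom(n, 1, 0.50),
  m6 = rbinom(n, 1, 0.55)
)
omnibus <- cochran_q(x)

# Follow-up: McNemar test for every pair of occasions, Holm-adjusted.
pairs <- combn(colnames(x), 2, simplify = FALSE)
p_raw <- sapply(pairs, function(p) {
  tab <- table(factor(x[, p[1]], levels = 0:1), factor(x[, p[2]], levels = 0:1))
  mcnemar.test(tab)$p.value
})
names(p_raw) <- sapply(pairs, paste, collapse = " vs ")
p_holm <- p.adjust(p_raw, method = "holm")
```

With only two occasions, Q reduces to the McNemar statistic without continuity correction, which is a convenient sanity check. One common conversion to Kendall's W is Q / (n * (k - 1)); with the printed Q = 39.58, k = 3, and an assumed n = 180, this gives about 0.11, in line with the reported effect size.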

References

Fisher, R. A. (1925).
Gosset, W. S. (1908). The probable error of a mean.
Welch, B. L. (1947). Generalization of Student’s problem with unequal variances.
Wilcoxon, F. (1945). Individual comparisons by ranking methods.
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other.
Levene, H. (1960). Robust tests for equality of variances.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis.
Tukey, J. W. (1949). Comparing individual means in the analysis of variance.
Dunn, O. J. (1964). Multiple comparisons using rank sums.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance.
Cochran, W. G. (1950). The comparison of percentages in matched samples.
McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages.
Pearson, K. (1895, 1900).
Spearman, C. (1904). The proof and measurement of association between two things.
Kendall, M. G. (1938). A new measure of rank correlation.
Cramer, H. (1946).
Clopper, C. J., & Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial.
Cohen, J. (1988).

Correlation

test_correlation(sbp_3m ~ age, data = cardio)
#> Statistical test workflow
#> 
#> Outcome: sbp_3m
#> Group: age
#> Design: two numeric variables
#> 
#> Assumptions
#> * Monotonic relationship: warning: Relationship may be non-monotonic. (method=Spearman correlation; statistic=793638.65; p=0.014)
#> * Extreme outliers: warning: 7 potential outlier(s) flagged by IQR. (IQR rule applied to age, sbp_3m)
#> * Normality: not required: Normality is not required for Spearman correlation.
#> 
#> Recommended test
#> Spearman Correlation
#> 
#> Result
#> H0: the correlation between age and sbp_3m is 0.
#> statistic = 793638.65, p = 0.014
#> 
#> Effect size
#> Spearman Correlation r: 0.18, small
#> 
#> Report
#> The two numeric variables workflow for sbp_3m showed a statistically significant result using Spearman Correlation, statistic = 793638.65, p = 0.014. The effect size was small (Spearman Correlation r = 0.18). H0: the correlation between age and sbp_3m is 0.
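
The large statistic printed above is the S statistic that R's `cor.test()` reports for Spearman correlation. It is tied to the coefficient by rho = 1 - 6 S / (n (n^2 - 1)). The sketch below confirms the relationship on simulated data and then applies it to the printed value, assuming n = 180 as in the earlier sections.

```r
# Simulated data with ties in age (hypothetical values).
set.seed(8)
age <- round(rnorm(180, 58, 10))
sbp <- 125 + 0.25 * age + rnorm(180, 0, 18)

# Ties trigger a warning about exact p-values; it is silenced here.
fit <- suppressWarnings(cor.test(age, sbp, method = "spearman"))
n <- length(age)

# Recover rho from the S statistic.
rho_from_s <- 1 - 6 * unname(fit$statistic) / (n * (n^2 - 1))

# Applied to the printed result: S = 793638.65 with n assumed to be 180.
rho_reported <- round(1 - 6 * 793638.65 / (180 * (180^2 - 1)), 2)  # 0.18
```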

Outliers

test_outliers(c(sbp_3m, ldl, crp), data = cardio)
#> Warning: `outliers` is a screening workflow, not a single hypothesis test.
#> Statistical test workflow
#> 
#> Outcome: sbp_3m, ldl, crp
#> Design: outlier screening
#> 
#> Assumptions
#> * Numeric variable: acceptable: IQR outlier detection is univariate and does not require normality.
#> * Skewness sensitivity: warning: Interpret IQR outliers with care when the distribution is strongly skewed.
#> 
#> Recommended test
#> IQR outlier detection
#> 
#> Result
#> flagged rows = 11
#> 
#> Effect size
#> * Effect size not reported.
#> 
#> Report
#> The outlier workflow flagged 11 rows for review.
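
A common form of the IQR rule flags values more than 1.5 interquartile ranges outside the quartiles. The sketch below applies that rule column by column and counts a row as flagged when any column is flagged. The 1.5 multiplier, the any-column rule, the helper name `iqr_flags`, and the toy data are assumptions of this sketch; the output above does not state which settings testflow uses.

```r
# Hypothetical helper: TRUE where a value lies outside the IQR fences.
iqr_flags <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Toy data: the last sbp value is far from the rest.
dat <- data.frame(
  sbp = c(118, 122, 125, 128, 130, 131, 133, 135, 138, 210),
  ldl = c(2.1, 2.4, 2.6, 2.7, 2.9, 3.0, 3.1, 3.3, 3.4, 3.6)
)

# One logical column per variable; a row is flagged if any column is TRUE.
flags <- sapply(dat, iqr_flags)
flagged_rows <- which(rowSums(flags) > 0)  # row 10
```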

Reporting and plotting

Every workflow returns a testflow object. Pass that object, here called x, to report(x), plot(x), or as_tibble(x) to work with the results further. See effect-size-formulas.Rmd for the exact formulas behind the reported effect-size estimates.