Categorical Association Measures

library(moderncor)

The moderncor_cat() function provides a unified interface for computing association measures between categorical (factor) variables. All measures require the DescTools package.

Basic Usage

moderncor_cat() accepts two vectors, which may be factors or character/numeric vectors that are treated as categorical:

set.seed(42)
x <- factor(sample(c("A", "B", "C"), 100, replace = TRUE))
y <- factor(sample(c("X", "Y"), 100, replace = TRUE))

moderncor_cat(x, y, method = "cramers_v")
#> 
#>    Cramer's V 
#> 
#>   Estimate:  0.0173
#>   Statistic: 0.03
#>   P-value:   0.9851
#>   Sample size (n): 100

The output is an S3 object of class "moderncor_cat" with the same structure as moderncor() output:

  • $estimate: the association coefficient
  • $statistic: the chi-square test statistic (for nominal methods)
  • $p.value: the p-value (for nominal methods; NULL for ordinal methods)
  • $n: the sample size
  • $method_label: human-readable method name

Querying Available Methods

available_methods_cat()
#>        method                   label   package    type
#> 1   cramers_v              Cramer's V DescTools nominal
#> 2         phi         Phi Coefficient DescTools nominal
#> 3       gamma   Goodman-Kruskal Gamma DescTools ordinal
#> 4    somers_d               Somers' D DescTools ordinal
#> 5 contingency Contingency Coefficient DescTools nominal
#> 6   tschuprow           Tschuprow's T DescTools nominal

Methods fall into two categories:

  • Nominal: for unordered categories (Cramér’s V, Phi, Contingency Coefficient, Tschuprow’s T)
  • Ordinal: for ordered categories (Goodman-Kruskal Gamma, Somers’ D)

Nominal Association Measures

Nominal measures are appropriate when categories have no natural ordering. They are all based on the chi-square statistic and return a p-value.

Cramér’s V

Cramér’s V is the most widely used measure of nominal association. It ranges from 0 (no association) to 1 (perfect association) and is symmetric:

moderncor_cat(x, y, method = "cramers_v")
#> 
#>    Cramer's V 
#> 
#>   Estimate:  0.0173
#>   Statistic: 0.03
#>   P-value:   0.9851
#>   Sample size (n): 100

For a 2×2 table, Cramér’s V equals the absolute value of the Phi coefficient.
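To make the estimate less of a black box, Cramér’s V can be reproduced from the chi-square statistic using base R only. The sketch below is not moderncor's implementation: the helper name cramers_v_manual is hypothetical, and it assumes the Pearson statistic is used without a continuity correction. It applies the standard definition \(V = \sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}\), where \(r\) and \(c\) are the numbers of rows and columns of the contingency table.

```r
# Hypothetical helper (not part of moderncor): Cramer's V from two factors.
cramers_v_manual <- function(x, y) {
  tab <- table(x, y)
  # Pearson chi-square without continuity correction (an assumption here);
  # suppressWarnings() hides the small-expected-count warning.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  unname(sqrt(chi2 / (n * (k - 1))))
}

v <- cramers_v_manual(factor(mtcars$cyl), factor(mtcars$gear))
round(v, 4)
#> [1] 0.5309
```

The mtcars columns are used here because their result does not depend on a random seed.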

Phi Coefficient

The Phi coefficient is designed for 2×2 contingency tables. For larger tables it can exceed 1, so prefer Cramér’s V in that case:

x_bin <- factor(sample(c("Yes", "No"), 100, replace = TRUE))
y_bin <- factor(sample(c("Pass", "Fail"), 100, replace = TRUE))

moderncor_cat(x_bin, y_bin, method = "phi")
#> 
#>    Phi Coefficient 
#> 
#>   Estimate:  0.1386
#>   Statistic: 1.9218
#>   P-value:   0.1657
#>   Sample size (n): 100
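As a quick consistency check on the printed output, the textbook relation between Phi and the chi-square statistic can be evaluated by hand. The symbols \(\chi^2\) (the Pearson statistic) and \(n\) (the sample size) are introduced here for illustration, and the statistic is taken as printed, so the last digit is subject to rounding:

\[
|\phi| = \sqrt{\frac{\chi^2}{n}} = \sqrt{\frac{1.9218}{100}} \approx 0.1386
\]

This relation gives only the magnitude of Phi, which is all the chi-square statistic can determine.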

Contingency Coefficient

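The contingency coefficient (Pearson’s C) is bounded between 0 and \(\sqrt{(k-1)/k}\), where \(k\) is the smaller of the number of rows and the number of columns in the contingency table. Because this upper bound depends on the table dimensions, values are not comparable across tables of different sizes: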

moderncor_cat(x, y, method = "contingency")
#> 
#>    Contingency Coefficient 
#> 
#>   Estimate:  0.0173
#>   Statistic: 0.03
#>   P-value:   0.9851
#>   Sample size (n): 100
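For reference, the standard definition of Pearson’s C and its upper bound can be written out as follows. The notation is introduced here for illustration: \(\chi^2\) is the Pearson statistic, \(n\) the sample size, and \(k\) the smaller of the row and column counts.

\[
C = \sqrt{\frac{\chi^2}{\chi^2 + n}}, \qquad 0 \le C \le \sqrt{\frac{k-1}{k}}
\]

For the 3×2 table above, \(k = 2\), so C cannot exceed \(\sqrt{1/2} \approx 0.707\) even under perfect association. Plugging in the printed (rounded) values gives \(\sqrt{0.03 / 100.03} \approx 0.0173\), in line with the estimate shown.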

Tschuprow’s T

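Tschuprow’s T is similar to Cramér’s V but normalizes by the geometric mean of \(r-1\) and \(c-1\), where \(r\) and \(c\) are the numbers of row and column categories, rather than by their minimum. It is symmetric and ranges from 0 to 1, although it can reach 1 only for square tables: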

moderncor_cat(x, y, method = "tschuprow")
#> 
#>    Tschuprow's T 
#> 
#>   Estimate:  0.0146
#>   Statistic: 0.03
#>   P-value:   0.9851
#>   Sample size (n): 100
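Writing the two standard definitions side by side shows where T and V differ. The symbols are introduced here for illustration: \(\chi^2\) is the Pearson statistic, \(n\) the sample size, and \(r\), \(c\) the numbers of rows and columns.

\[
T = \sqrt{\frac{\chi^2}{n\,\sqrt{(r-1)(c-1)}}}, \qquad
V = \sqrt{\frac{\chi^2}{n\,\min(r-1,\,c-1)}}
\]

Since \(\sqrt{(r-1)(c-1)} \ge \min(r-1, c-1)\), T never exceeds V, and the two coincide for square tables. For the 3×2 table above, the printed values give \(\sqrt{0.03 / (100\sqrt{2})} \approx 0.0146\), slightly below the Cramér’s V of 0.0173.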

Ordinal Association Measures

Ordinal measures are appropriate when categories have a natural ordering (e.g., Likert scales, severity grades). They do not return p-values by default.

Goodman-Kruskal Gamma

Goodman-Kruskal Gamma (\(\gamma\)) measures the tendency for pairs of observations to be concordant (ordered the same way on both variables) rather than discordant (ordered in opposite ways); tied pairs are ignored. It ranges from −1 to 1 and is symmetric:

# Simulate ordinal survey data
set.seed(1)
quality  <- factor(sample(c("Low", "Medium", "High"), 100, replace = TRUE,
                           prob = c(0.3, 0.4, 0.3)),
                   levels = c("Low", "Medium", "High"), ordered = TRUE)
satisfaction <- factor(sample(c("Dissatisfied", "Neutral", "Satisfied"), 100,
                               replace = TRUE, prob = c(0.3, 0.4, 0.3)),
                       levels = c("Dissatisfied", "Neutral", "Satisfied"), ordered = TRUE)

moderncor_cat(quality, satisfaction, method = "gamma")
#> 
#>    Goodman-Kruskal Gamma 
#> 
#>   Estimate:  0.0808
#>   Sample size (n): 100
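The concordant/discordant idea can be made concrete with a small base R sketch. This is an illustration of the textbook definition \(\gamma = (C - D) / (C + D)\), not moderncor's or DescTools' code; the helper name gamma_manual and the toy counts are hypothetical, and the table's rows and columns are assumed to be in increasing order.

```r
# Hypothetical helper (not part of moderncor): Goodman-Kruskal gamma
# from a two-way table whose rows and columns are in increasing order.
gamma_manual <- function(tab) {
  nr <- nrow(tab)
  nc <- ncol(tab)
  conc <- 0
  disc <- 0
  for (i in seq_len(nr - 1)) {
    below <- (i + 1):nr
    for (j in seq_len(nc)) {
      # Cells below and to the right form concordant pairs with cell (i, j).
      if (j < nc) conc <- conc + tab[i, j] * sum(tab[below, (j + 1):nc])
      # Cells below and to the left form discordant pairs with cell (i, j).
      if (j > 1) disc <- disc + tab[i, j] * sum(tab[below, 1:(j - 1)])
    }
  }
  (conc - disc) / (conc + disc)
}

# Toy 3 x 3 table with most mass on the diagonal (illustrative counts).
tab <- matrix(c(20,  5,  0,
                 5, 20,  5,
                 0,  5, 20), nrow = 3, byrow = TRUE)
g <- gamma_manual(tab)
g
#> [1] 0.9411765
```

Here there are 1650 concordant and 50 discordant pairs, so gamma is 1600/1700. Pairs tied on either variable never enter the calculation.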

Somers’ D

Somers’ D is an asymmetric ordinal measure: it treats x as the predictor and y as the response, so it quantifies how well y can be predicted from x but not the reverse. Values range from −1 to 1:

moderncor_cat(quality, satisfaction, method = "somers_d")
#> 
#>    Somers' D 
#> 
#>   Estimate:  0.0548
#>   Sample size (n): 100

Note that swapping x and y generally gives a different result, although here the difference is small:

moderncor_cat(satisfaction, quality, method = "somers_d")
#> 
#>    Somers' D 
#> 
#>   Estimate:  0.0549
#>   Sample size (n): 100
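The asymmetry comes from how ties enter the denominator. In the textbook definition below, the notation is introduced here for illustration: \(C\) and \(D\) are the numbers of concordant and discordant pairs, and \(T_y\) is the number of pairs tied on y but not on x.

\[
D_{y \mid x} = \frac{C - D}{C + D + T_y}, \qquad
\gamma = \frac{C - D}{C + D}
\]

Swapping the roles of x and y replaces \(T_y\) with \(T_x\), which is why the two directions can differ. Because \(T_y \ge 0\), Somers’ D is never larger in magnitude than Gamma on the same data, consistent with the 0.0548 and 0.0808 reported above.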

Pairwise Matrix for Multiple Variables

Pass a data.frame of factor columns to compute pairwise associations across all pairs:

df <- data.frame(
  cyl   = factor(mtcars$cyl),
  gear  = factor(mtcars$gear),
  am    = factor(mtcars$am)
)

res_mat <- moderncor_cat(df, method = "cramers_v")
res_mat
#> 
#>    Cramer's V 
#> 
#>   Association Matrix (n = 32):
#> 
#>         cyl   gear     am
#> cyl  1.0000 0.5309 0.5226
#> gear 0.5309 1.0000 0.8090
#> am   0.5226 0.8090 1.0000
#> 
#>   P-value Matrix:
#> 
#>         cyl   gear     am
#> cyl  0.0000 0.0012 0.0126
#> gear 0.0012 0.0000 0.0000
#> am   0.0126 0.0000 0.0000

The result holds a matrix of association coefficients. For nominal methods, the corresponding p-value matrix is also stored in $p.value:

res_mat$p.value
#>              cyl         gear           am
#> cyl  0.000000000 1.214066e-03 1.264661e-02
#> gear 0.001214066 0.000000e+00 2.830889e-05
#> am   0.012646605 2.830889e-05 0.000000e+00

Use as.data.frame() to convert the result to a long (tidy) format with one row per ordered pair of variables:

as.data.frame(res_mat)
#>   var1 var2 association      p.value
#> 1 gear  cyl   0.5308655 1.214066e-03
#> 2   am  cyl   0.5226355 1.264661e-02
#> 3  cyl gear   0.5308655 1.214066e-03
#> 4   am gear   0.8090247 2.830889e-05
#> 5  cyl   am   0.5226355 1.264661e-02
#> 6 gear   am   0.8090247 2.830889e-05
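For readers who want to see what a pairwise computation involves, the sketch below builds a symmetric association matrix with base R. It is only an illustration under stated assumptions, not how moderncor_cat() is implemented: the helpers cramers_v_manual and pairwise_assoc are hypothetical names, and the chi-square statistic is computed without a continuity correction.

```r
# Hypothetical helpers (not moderncor internals), base R only.
cramers_v_manual <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

pairwise_assoc <- function(data, fun) {
  vars <- names(data)
  # Start from 1s: each variable is perfectly associated with itself.
  out <- matrix(1, length(vars), length(vars), dimnames = list(vars, vars))
  # Visit each unordered pair once and mirror the value across the diagonal.
  for (pair in combn(vars, 2, simplify = FALSE)) {
    est <- fun(data[[pair[1]]], data[[pair[2]]])
    out[pair[1], pair[2]] <- est
    out[pair[2], pair[1]] <- est
  }
  out
}

df_demo <- data.frame(
  cyl  = factor(mtcars$cyl),
  gear = factor(mtcars$gear),
  am   = factor(mtcars$am)
)
m <- pairwise_assoc(df_demo, cramers_v_manual)
round(m, 4)
```

Under these assumptions the matrix should agree with the association matrix printed earlier in this section. Computing each pair once and mirroring it is a design choice that only makes sense for symmetric measures; an asymmetric measure such as Somers’ D would need both orderings.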

Handling Missing Values

The use argument controls how missing values are handled, mirroring the interface of moderncor():

  • "complete.obs" (default): remove every row that contains an NA in any variable before computing
  • "pairwise.complete.obs": remove rows with an NA separately for each pair of variables
  • "everything": propagate NAs (returns NA for any pair with missing values)

x_na <- factor(c("A", "B", NA, "A", "B", "C"))
y_na <- factor(c("X", "Y", "X", NA, "Y", "X"))

moderncor_cat(x_na, y_na, method = "cramers_v", use = "complete.obs")
#> 
#>    Cramer's V 
#> 
#>   Estimate:  1
#>   Statistic: 4
#>   P-value:   0.1353
#>   Sample size (n): 4
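The sample size of 4 in this output can be traced by hand. The following base R sketch mimics the documented "complete.obs" behaviour for two vectors; it is an illustration rather than the package's internal code, and the names x_demo, y_demo, and keep are introduced here.

```r
# Illustrative only: the documented "complete.obs" behaviour, done by hand.
x_demo <- factor(c("A", "B", NA, "A", "B", "C"))
y_demo <- factor(c("X", "Y", "X", NA, "Y", "X"))

# Keep only the positions where both vectors are observed.
keep <- complete.cases(x_demo, y_demo)
sum(keep)
#> [1] 4

# Contingency table of the remaining observations.
tab_demo <- table(x_demo[keep], y_demo[keep])
tab_demo
```

In the resulting table every level of x co-occurs with exactly one level of y, which is why the estimate above is 1 despite the tiny sample.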

Choosing the Right Method

Situation                                            Recommended method
Two unordered categorical variables (general)        cramers_v
Two binary variables (2×2 table)                     phi
Two ordered categorical (Likert) variables           gamma
Predicting one ordered variable from another         somers_d
Comparing association across different table sizes   cramers_v or tschuprow

For continuous variables, use moderncor() instead. See vignette("introduction") for a full overview.