library(smdi)
library(gt)
suppressPackageStartupMessages(library(dplyr))

Multivariate missing data in `smdi`

In this article, we briefly highlight two aspects of multivariate missingness:

How does smdi handle multivariate missingness?
What is the link between missing data patterns and missing data mechanisms and how does this affect the behavior and performance of the smdi functionality?

Established taxonomies

In general, there are two established taxonomies of missing data:

Mechanisms: Missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR)
Patterns: Monotone versus Non-monotone missingness
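To make the three mechanisms more concrete, here is a minimal base R sketch that simulates them for a single variable. The variable names (`x`, `y`), the sample size and the logistic coefficients are arbitrary assumptions chosen for illustration; nothing here is taken from the smdi simulation code.

```r
# Illustrative only: simulate missingness indicators for a variable y under
# the three mechanisms. All names and coefficients are assumptions.
set.seed(1)
n <- 10000
x <- rnorm(n)            # fully observed covariate
y <- 0.5 * x + rnorm(n)  # variable that will become partially observed

# MCAR: every value has the same 30% chance of being missing
miss_mcar <- rbinom(n, 1, 0.3)

# MAR: the probability of missingness depends only on the observed x
miss_mar <- rbinom(n, 1, plogis(-1 + 1.5 * x))

# MNAR: the probability of missingness depends on the value of y itself
miss_mnar <- rbinom(n, 1, plogis(-1 + 1.5 * y))

# Difference in mean x between rows with and without a missing y
mean_diff_x <- function(miss) mean(x[miss == 1]) - mean(x[miss == 0])
round(c(MCAR = mean_diff_x(miss_mcar),
        MAR  = mean_diff_x(miss_mar),
        MNAR = mean_diff_x(miss_mnar)), 2)
```

Under MCAR the difference should be close to zero, whereas under MAR it is clearly positive. In this toy setup the MNAR indicator is also related to `x`, but only because `y` and `x` are correlated.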

How does `smdi` handle multivariate missingness?

In all smdi functions, except smdi_little(), binary missing indicator variables are created for each partially observed variable (either specified by the analyst using the covar parameter or automatically identified via smdi_check_covar() if covar = NULL) and the columns with the actual variable values are dropped. For the variable importance visualization in smdi_rf(), these variables are indicated with a *“_NA”* suffix. Missing values are accordingly indicated with a 1 and complete observations with a 0. This functionality is controlled via the smdi_na_indicator() utility function.

smdi_data %>% 
  smdi_na_indicator(
    drop_NA_col = FALSE # usually TRUE, but for demonstration purposes set to FALSE
    ) %>% 
  select(
    ecog_cat, ecog_cat_NA, 
    egfr_cat, egfr_cat_NA, 
    pdl1_num, pdl1_num_NA
    ) %>% 
  head() %>% 
  gt()

ecog_cat	ecog_cat_NA	egfr_cat	egfr_cat_NA	pdl1_num	pdl1_num_NA
1	0	NA	1	45.03	0
NA	1	0	0	NA	1
0	0	1	0	41.74	0
1	0	NA	1	45.51	0
NA	1	1	0	31.28	0
0	0	NA	1	NA	1
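The indicator logic described above can be sketched in a few lines of base R. The helper `add_na_indicators()` and the toy data frame are hypothetical and only mimic the behavior described for `smdi_na_indicator()`; they are not the package's implementation.

```r
# Hypothetical helper, not part of smdi: add a 0/1 "<name>_NA" column for every
# column that contains at least one NA, optionally dropping the original column.
add_na_indicators <- function(data, drop_original = TRUE) {
  partially_observed <- names(data)[vapply(data, anyNA, logical(1))]
  for (v in partially_observed) {
    # 1 = value is missing, 0 = value is observed
    data[[paste0(v, "_NA")]] <- as.integer(is.na(data[[v]]))
  }
  if (drop_original) {
    data <- data[, setdiff(names(data), partially_observed), drop = FALSE]
  }
  data
}

# Toy data with made-up values
toy <- data.frame(
  age      = c(61, 70, 55, 48),
  ecog_cat = c(1, NA, 0, 1),
  pdl1_num = c(45.0, NA, 41.7, NA)
)

add_na_indicators(toy, drop_original = FALSE)  # keep values and indicators
add_na_indicators(toy)                         # indicators only
```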

Now, let’s assume we have three partially observed covariates X, Y and Z, which we would like to include in our missingness diagnostics. All smdi diagnostic functions, except smdi_little(), create X_NA, Y_NA and Z_NA and discard X, Y and Z from the dataset. The functions then iterate the diagnostics through X_NA, Y_NA and Z_NA one at a time. For example, when X_NA is assessed, Y_NA and Z_NA serve as predictor variables along with all other covariates in the dataset. When Y_NA is assessed, X_NA and Z_NA are included as predictor variables, and so forth.
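The one-at-a-time iteration can be illustrated with a small base R sketch. Logistic regression via `glm()` is used here purely as a stand-in for the actual smdi diagnostics, and the toy data (`age`, `X_NA`, `Y_NA`, `Z_NA`) are simulated assumptions.

```r
# Illustrative only: each missing indicator is assessed in turn, with the
# other indicators and the remaining covariates acting as predictors.
set.seed(123)
n <- 500
toy <- data.frame(
  age  = rnorm(n, 60, 10),
  X_NA = rbinom(n, 1, 0.2),
  Y_NA = rbinom(n, 1, 0.3),
  Z_NA = rbinom(n, 1, 0.1)
)

indicators <- c("X_NA", "Y_NA", "Z_NA")

fits <- lapply(indicators, function(target) {
  # all columns except the indicator currently being assessed
  predictors <- setdiff(names(toy), target)
  glm(reformulate(predictors, response = target),
      data = toy, family = binomial())
})
names(fits) <- indicators

# Predictors used when X_NA is the target
attr(terms(fits[["X_NA"]]), "term.labels")
```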

Important! This strategy is the default way of handling multivariate missingness in the smdi package. An alternative is to leave the other partially observed variables out altogether (e.g. by dropping them before applying any smdi function) and to stack the diagnostics, focusing on one partially observed variable at a time. That strategy is advisable when missing data patterns are monotone (see next section).

`smdi` in case of monotone missing data patterns

While in the smdi package we mainly focus on missing data mechanisms, missing data patterns always need to be considered, too. Please refer to the routine structural missing data diagnostics article, where we highlight the importance of describing missingness proportions and patterns before running any of the smdi diagnostics.

As mentioned in the previous section, the smdi output may be misleading when missing data patterns are monotone.

Monotonicity: A missing data pattern is said to be monotone if the variables Y_j can be ordered such that if Y_j is missing then all variables Y_k with k > j are also missing (taken from Stef van Buuren ¹).
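The definition above translates into a short check in base R. The helper `is_monotone()` is hypothetical and not part of smdi. It relies on the fact that under a monotone pattern the sets of missing rows are nested, so sorting the columns by their number of missing values gives a valid ordering.

```r
# Hypothetical helper, not part of smdi: TRUE if the columns can be ordered so
# that a missing value in one column implies missing values in all later ones.
is_monotone <- function(data) {
  miss <- is.na(data)
  # order columns from fewest to most missing values
  miss <- miss[, order(colSums(miss)), drop = FALSE]
  # within each row, indicators must never switch from missing back to observed
  all(apply(miss, 1, function(r) all(diff(r) >= 0)))
}

# lab2 missing implies lab1 missing: monotone
panel_strict <- data.frame(lab1 = c(5.1, NA, NA, 4.8),
                           lab2 = c(9.7, 10.2, NA, 11.0))

# each lab is missing where the other is observed: not monotone
panel_mixed <- data.frame(lab1 = c(5.1, NA, 4.9, 4.8),
                          lab2 = c(NA, 10.2, 9.9, 11.0))

is_monotone(panel_strict)
is_monotone(panel_mixed)
```

This check is strict: a single row that breaks the nesting makes it return FALSE, even if the pattern is monotone for almost all rows.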

A good example for monotone missing data could be clinical blood laboratory tests (“labs”) which are often tested together in a lab panel. If one lab is missing, typically the other labs of this panel are also missing.

# we simulate a monotone missingness pattern
# following an MCAR mechanism

set.seed(42)

data_monotone <- smdi_data_complete %>% 
  mutate(
    lab1 = rnorm(nrow(smdi_data_complete), mean = 5, sd = 0.5),
    lab2 = rnorm(nrow(smdi_data_complete), mean = 10, sd = 2.25)
    )

data_monotone[3:503, "lab1"] <- NA
data_monotone[1:500, "lab2"] <- NA

smdi::gg_miss_upset(data = data_monotone)

smdi::md.pattern(data_monotone[, c("lab1", "lab2")], plot = FALSE)
#>      lab2 lab1     
#> 1997    1    1    0
#> 3       1    0    1
#> 2       0    1    1
#> 498     0    0    2
#>       500  501 1001

In extreme cases of perfect linearity, this can lead to multiple warnings and errors such as "system is exactly singular" or "-InfWarning: Variable has only NA's in at least one stratum".

In cases in which monotonicity is clearly present but not as extreme (as in the example above), smdi prints a message to alert the analyst, because the smdi output can be highly misleading in such instances.

diagnostics_jointly <- smdi_diagnose(
  data = data_monotone,
  covar = NULL, # NULL includes all covariates with at least one NA
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
  )
#> Important note:
#> AUC for predicting covariate lab1 is very high (>0.9).
#> Predictor with highest importance: lab2_NA.
#> Check for potentially underlying monotone missing data pattern.
#> Important note:
#> AUC for predicting covariate lab2 is very high (>0.9).
#> Predictor with highest importance: lab1_NA.
#> Check for potentially underlying monotone missing data pattern.

diagnostics_jointly %>% 
  smdi_style_gt()

Covariate	ASMD (min/max)¹	p Hotelling¹	AUC²	beta univariate (95% CI)³	beta (95% CI)³
lab1	0.060 (0.008, 0.843)	<.001	0.998	-0.01 (95% CI -0.12, 0.09)	1.06 (95% CI 0.01, 2.11)
lab2	0.054 (0.003, 0.341)	<.001	0.993	-0.03 (95% CI -0.14, 0.08)	-1.11 (95% CI -2.16, -0.05)
p little: 0.245, Abbreviations: ASMD = Median absolute standardized mean difference across all covariates, AUC = Area under the curve, beta = beta coefficient, CI = Confidence interval, max = Maximum, min = Minimum
¹ Group 1 diagnostic: Differences in patient characteristics between patients with and without covariate
² Group 2 diagnostic: Ability to predict missingness
³ Group 3 diagnostic: Assessment if missingness is associated with the outcome (univariate, adjusted)
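The very high AUC values follow almost directly from the overlap of the two missing indicators. The base R sketch below rebuilds the indicators from the row indices used in the simulation and computes a rank-based (Mann-Whitney) AUC using lab2_NA as the only predictor of lab1_NA. The sample size of 2,500 is taken from the pattern counts shown earlier (1997 + 3 + 2 + 498), and the `auc()` helper is a generic assumption, not the smdi_rf() implementation.

```r
# Illustrative only: rebuild the two missing indicators from the row indices
# used in the simulation (rows 3:503 for lab1, rows 1:500 for lab2).
n <- 2500
lab1_NA <- integer(n); lab1_NA[3:503] <- 1L
lab2_NA <- integer(n); lab2_NA[1:500] <- 1L

# Rank-based (Mann-Whitney) AUC of a score for a binary outcome
auc <- function(score, outcome) {
  r <- rank(score)  # average ranks, so ties count as one half
  n_pos <- sum(outcome == 1)
  n_neg <- sum(outcome == 0)
  (sum(r[outcome == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Cross-tabulation of the two indicators
table(lab1_NA, lab2_NA)

# AUC when lab2_NA alone is used to predict lab1_NA
auc(score = lab2_NA, outcome = lab1_NA)
```

Because the two indicators agree for nearly all rows, the other lab's indicator alone is enough to push the AUC well above the 0.9 threshold mentioned in the message, regardless of the true missingness mechanism.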

In such cases, it may be advisable to exclude lab2 from the missingness diagnostics of lab1 and vice versa, and to stack the diagnostics, focusing on one partially observed variable at a time.
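With more than two affected variables, preparing these per-variable datasets by hand becomes tedious. The following base R sketch shows one possible way to automate the split. The helper `split_by_partial()` and the toy data are hypothetical. Note that the helper drops all other partially observed columns, which may be too aggressive if only some of them belong to the monotone pattern.

```r
# Hypothetical helper, not part of smdi: return one data frame per partially
# observed variable, each with the other partially observed variables dropped.
split_by_partial <- function(data) {
  partial <- names(data)[vapply(data, anyNA, logical(1))]
  out <- lapply(partial, function(v) {
    others <- setdiff(partial, v)
    data[, setdiff(names(data), others), drop = FALSE]
  })
  names(out) <- partial
  out
}

# Toy data with made-up values
toy <- data.frame(
  age  = c(61, 70, 55, 48),
  lab1 = c(5.1, NA, NA, 4.8),
  lab2 = c(NA, NA, 9.9, 11.0)
)

subsets <- split_by_partial(toy)
lapply(subsets, names)
```

Each element of the resulting list could then be passed as the `data` argument to a separate diagnostics call.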

Lab 1 analyzed without Lab 2

# lab 1
lab1_diagnostics <- smdi_diagnose(
  data = data_monotone %>% select(-lab2),
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
  )

lab1_diagnostics %>% 
  smdi_style_gt()

Covariate	ASMD (min/max)¹	p Hotelling¹	AUC²	beta univariate (95% CI)³	beta (95% CI)³
lab1	0.050 (0.008, 0.083)	0.296	0.500	-0.01 (95% CI -0.12, 0.09)	-0.04 (95% CI -0.15, 0.07)
p little: 0.295, Abbreviations: ASMD = Median absolute standardized mean difference across all covariates, AUC = Area under the curve, beta = beta coefficient, CI = Confidence interval, max = Maximum, min = Minimum
¹ Group 1 diagnostic: Differences in patient characteristics between patients with and without covariate
² Group 2 diagnostic: Ability to predict missingness
³ Group 3 diagnostic: Assessment if missingness is associated with the outcome (univariate, adjusted)

Lab 2 analyzed without Lab 1

# lab 2
lab2_diagnostics <- smdi_diagnose(
  data = data_monotone %>% select(-lab1),
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
  )

lab2_diagnostics %>% 
  smdi_style_gt()

Covariate	ASMD (min/max)¹	p Hotelling¹	AUC²	beta univariate (95% CI)³	beta (95% CI)³
lab2	0.051 (0.003, 0.082)	0.291	0.500	-0.03 (95% CI -0.14, 0.08)	-0.05 (95% CI -0.16, 0.06)
p little: 0.291, Abbreviations: ASMD = Median absolute standardized mean difference across all covariates, AUC = Area under the curve, beta = beta coefficient, CI = Confidence interval, max = Maximum, min = Minimum
¹ Group 1 diagnostic: Differences in patient characteristics between patients with and without covariate
² Group 2 diagnostic: Ability to predict missingness
³ Group 3 diagnostic: Assessment if missingness is associated with the outcome (univariate, adjusted)

Presented in one table using `smdi_style_gt()`

We can also combine the individual smdi_diagnose() tables into one stacked table and add a global Little’s test that takes into account the multivariate missingness of the entire dataset.

# computing a global p-value for Little's test including both lab1 and lab2
little_global <- smdi_little(data = data_monotone)

# combining two individual lab smdi tables and global Little's test
smdi_style_gt(
  smdi_object = rbind(lab1_diagnostics$smdi_tbl, lab2_diagnostics$smdi_tbl), 
  include_little = little_global
  )

Covariate	ASMD (min/max)¹	p Hotelling¹	AUC²	beta univariate (95% CI)³	beta (95% CI)³
lab1	0.050 (0.008, 0.083)	0.296	0.500	-0.01 (95% CI -0.12, 0.09)	-0.04 (95% CI -0.15, 0.07)
lab2	0.051 (0.003, 0.082)	0.291	0.500	-0.03 (95% CI -0.14, 0.08)	-0.05 (95% CI -0.16, 0.06)
p little: 0.245, Abbreviations: ASMD = Median absolute standardized mean difference across all covariates, AUC = Area under the curve, beta = beta coefficient, CI = Confidence interval, max = Maximum, min = Minimum
¹ Group 1 diagnostic: Differences in patient characteristics between patients with and without covariate
² Group 2 diagnostic: Ability to predict missingness
³ Group 3 diagnostic: Assessment if missingness is associated with the outcome (univariate, adjusted)

Since the missingness follows an MCAR mechanism, smdi_diagnose() now shows the diagnostic patterns one would expect from an MCAR mechanism.

¹ For more information on missing data patterns, see https://stefvanbuuren.name/fimd/missing-data-pattern.html
