---
title: "Multivariate missingness and monotonicity"
author: "Janick Weberpals"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Multivariate missingness and monotonicity}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dpi = 150,
  fig.width = 6,
  fig.height = 4.5
)
```

```{r setup}
library(smdi)
library(gt)
suppressPackageStartupMessages(library(dplyr))
```

# Multivariate missing data in `smdi`

In this article, we briefly highlight two aspects of **multivariate missingness**:

1. How does `smdi` handle multivariate missingness?
2. What is the link between missing data patterns and missing data mechanisms, and how does this affect the behavior and performance of the `smdi` functionality?

## Established taxonomies

In general, there are two classic, established missing data taxonomies:

* Mechanisms: Missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR)
* Patterns: Monotone versus non-monotone missingness

# How does `smdi` handle multivariate missingness?

In all `smdi` functions except `smdi_little()`, binary missing indicator variables are created for each partially observed variable (either specified by the analyst via the `covar` parameter or automatically identified via `smdi_check_covar()` if `covar = NULL`), and the columns with the actual variable values are dropped. In the variable importance visualization of `smdi_rf()`, these indicator variables carry a *"_NA"* suffix. Missing values are coded as *1* and observed values as *0*. This functionality is controlled via the `smdi_na_indicator()` utility function.

```{r, fig.cap="Illustrating missing indicator variable generation within `smdi` functions"}
smdi_data %>% 
  smdi_na_indicator(
    drop_NA_col = FALSE # usually TRUE, but for demonstration purposes set to FALSE
    ) %>% 
  select(
    ecog_cat, ecog_cat_NA, 
    egfr_cat, egfr_cat_NA, 
    pdl1_num, pdl1_num_NA
    ) %>% 
  head() %>% 
  gt()
```

Now, let's assume we have three partially observed covariates *X*, *Y* and *Z*, which we would like to include in our missingness diagnostics. All `smdi_diagnose()` functions, except `smdi_little()`, create *X_NA*, *Y_NA* and *Z_NA*, and *X*, *Y* and *Z* are discarded from the dataset. The functions then iterate the diagnostics through *X_NA*, *Y_NA* and *Z_NA* one by one. That is, if, for example, *X_NA* is assessed, *Y_NA* and *Z_NA* serve as predictor variables along with all other covariates in the dataset. If *Y_NA* is assessed, *X_NA* and *Z_NA* are included as predictor variables, and so forth.

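The one-by-one iteration described above can be sketched in base R. This is only an illustration of the idea and not the internal `smdi` implementation: the toy data, the object names (`toy`, `toy_ind`, `fits`) and the use of a plain logistic regression as the prediction model are all assumptions made for this example.

```{r}
# Hypothetical toy data: one fully observed covariate (age) and
# three partially observed covariates (X, Y, Z)
set.seed(1)
n <- 200
toy <- data.frame(
  age = rnorm(n, mean = 60, sd = 10),
  X = rnorm(n),
  Y = rnorm(n),
  Z = rnorm(n)
)
for (v in c("X", "Y", "Z")) toy[sample(n, 40), v] <- NA

# identify partially observed columns and build 0/1 missing indicators
partially_observed <- names(toy)[colSums(is.na(toy)) > 0]
indicators <- as.data.frame(
  lapply(toy[partially_observed], function(x) as.integer(is.na(x)))
)
names(indicators) <- paste0(partially_observed, "_NA")

# drop the original columns and keep only the indicators
toy_ind <- cbind(toy[setdiff(names(toy), partially_observed)], indicators)

# assess each indicator in turn; all remaining columns
# (including the other indicators) act as predictors
fits <- lapply(names(indicators), function(target) {
  glm(reformulate(".", response = target), data = toy_ind, family = binomial())
})
names(fits) <- names(indicators)

# predictors used when X_NA is the variable being assessed
attr(terms(fits[["X_NA"]]), "term.labels")
```

When *X_NA* is the target, the predictor set consists of `age`, *Y_NA* and *Z_NA*, which mirrors the behavior described in the paragraph above.
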
**Important:** This strategy is the default way of dealing with multivariate missingness in the `smdi` package. Another possible approach is to *not* consider the other partially observed variables in the first place (e.g., by dropping them before applying any `smdi` function) and to stack the diagnostics, focusing on one partially observed variable at a time. Such a strategy would be advisable in scenarios with monotone missing data patterns (see next section).

# `smdi` in case of monotone missing data patterns

While in the `smdi` package we mainly focus on missing data mechanisms, missing data patterns always need to be considered, too. Please refer to the [routine structural missing data diagnostics article](https://janickweberpals.gitlab-pages.partners.org/smdi/articles/b_routine_diagnostics.html#descriptives), where we highlight the importance of describing missingness proportions and patterns before running any of the `smdi` diagnostics. As mentioned in the section before, in case of monotone missing data patterns, the `smdi` functionality may be misleading.

**Monotonicity:** A missing data pattern is said to be monotone if the variables $Y_j$ can be ordered such that if $Y_j$ is missing then all variables $Y_k$ with $k > j$ are also missing (taken from Stef van Buuren [^1]).

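The definition above can be turned into a small check. The following is a minimal sketch in base R. The helper `is_monotone()` and the two toy data frames are hypothetical and not part of `smdi`. The check relies on the fact that a pattern is monotone exactly when the sets of missing rows are nested, so it orders the variables by their number of missing values and verifies that each variable's missing rows are also missing in the next variable.

```{r}
# Hypothetical helper (not an smdi function): TRUE if the columns can be
# ordered such that a missing value implies missingness in all later columns
is_monotone <- function(data) {
  miss <- is.na(data)
  # order columns from fewest to most missing values
  miss <- miss[, order(colSums(miss)), drop = FALSE]
  if (ncol(miss) < 2) return(TRUE)
  for (j in seq_len(ncol(miss) - 1)) {
    # violation: column j is missing while column j + 1 is observed
    if (any(miss[, j] & !miss[, j + 1])) return(FALSE)
  }
  TRUE
}

# whenever lab1 is missing, lab2 is missing as well
labs_monotone <- data.frame(
  lab1 = c(1, 2, NA, NA),
  lab2 = c(1, NA, NA, NA)
)

# each lab is missing in a row where the other one is observed
labs_non_monotone <- data.frame(
  lab1 = c(1, NA, 3, NA),
  lab2 = c(NA, 2, 3, NA)
)

is_monotone(labs_monotone)     # TRUE
is_monotone(labs_non_monotone) # FALSE
```

This check is strict: a single row that breaks the nesting makes the pattern non-monotone, even if the overall pattern is close to monotone.
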
[^1]: For more information on missing data patterns see

A good example of monotone missing data are clinical blood laboratory tests ("labs"), which are often ordered together in a lab panel. If one lab is missing, the other labs of this panel are typically missing as well.

```{r}
# we simulate a monotone missingness pattern
# following an MCAR mechanism
set.seed(42)

data_monotone <- smdi_data_complete %>% 
  mutate(
    lab1 = rnorm(nrow(smdi_data_complete), mean = 5, sd = 0.5),
    lab2 = rnorm(nrow(smdi_data_complete), mean = 10, sd = 2.25)
    )

data_monotone[3:503, "lab1"] <- NA
data_monotone[1:500, "lab2"] <- NA
```

```{r}
smdi::gg_miss_upset(data = data_monotone)
```

```{r}
smdi::md.pattern(data_monotone[, c("lab1", "lab2")], plot = FALSE)
```

In extreme cases of perfect linearity between the missing indicator variables, running the diagnostics can lead to multiple warnings and errors such as `system is exactly singular` or `-InfWarning: Variable has only NA's in at least one stratum`. In cases in which monotonicity is still clearly present but not as extreme (like in the example above), `smdi` prompts a message to make the analyst aware of this issue, as the `smdi` output can be **highly misleading** in such instances.

```{r}
diagnostics_jointly <- smdi_diagnose(
  data = data_monotone,
  covar = NULL, # NULL includes all covariates with at least one NA
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
  )
```

```{r, fig.cap="Diagnostics of lab 1 and lab 2 if analyzed jointly."}
diagnostics_jointly %>% 
  smdi_style_gt()
```

In such cases, it may be advisable to *not* include *lab2* in the missingness diagnostics of *lab1* (and vice versa) and instead to stack the diagnostics, focusing on one partially observed variable at a time.

## Lab 1 analyzed without Lab 2

```{r, fig.cap="Diagnostics of lab 1 if analyzed separately."}
# lab 1
lab1_diagnostics <- smdi_diagnose(
  data = data_monotone %>% select(-lab2),
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
  )

lab1_diagnostics %>% 
  smdi_style_gt()
```

## Lab 2 analyzed without Lab 1

```{r, fig.cap="Diagnostics of lab 2 if analyzed separately."}
# lab 2
lab2_diagnostics <- smdi_diagnose(
  data = data_monotone %>% select(-lab1),
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
  )

lab2_diagnostics %>% 
  smdi_style_gt()
```

## Presented in one table using `smdi_style_gt()`

We can also combine the output of the individually stacked `smdi_diagnose()` tables and enhance it with a global Little's test that takes into account the multivariate missingness of the entire dataset.

```{r}
# computing a global p-value for Little's test including both lab1 and lab2
little_global <- smdi_little(data = data_monotone)

# combining two individual lab smdi tables and global Little's test
smdi_style_gt(
  smdi_object = rbind(lab1_diagnostics$smdi_tbl, lab2_diagnostics$smdi_tbl),
  include_little = little_global
  )
```

Since the missingness follows an MCAR mechanism, `smdi_diagnose()` now shows the diagnostic patterns one would expect from an MCAR mechanism.