---
title: 'Survival tests as differences-of-means'
author: "Dominic Magirr"
date: "6/23/2022"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Survival tests as differences-of-means}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## A simple randomized controlled trial

Consider a simple randomized controlled trial in which $N$ patients are randomized between treatment $1$ and treatment $0$ (let $Z_i$ denote the treatment assignment for patient $i$). Each patient contributes a response $Y_i$, and we are primarily interested in the difference in means on the two treatments,

$$S:= \frac{\sum_{i=1}^N Y_i \mathbb{I}(Z_i = 1)}{\sum_{i=1}^N \mathbb{I}(Z_i = 1)} - \frac{\sum_{i=1}^N Y_i \mathbb{I}(Z_i = 0)}{\sum_{i=1}^N \mathbb{I}(Z_i = 0)}.$$

If this statistic is large then we would consider that evidence in favour of treatment $1$.

## Time-to-event outcome

If we have a time-to-event outcome $(T_i, \delta_i)$, where $T_i = \min(\tilde{T}_i, C_i)$ is the follow-up time (the minimum of the time to the event of interest and the time to censoring) and $\delta_i = \mathbb{I}\{\tilde{T}_i < C_i\}$ is the event indicator, then, since the outcome is two-dimensional, we cannot simply compare the treatments via a difference in means.

## ...Unless

Unless we first transform the two-dimensional outcome into a one-dimensional "score", and then compare the mean "score" on the two arms. It turns out that many test statistics commonly used in time-to-event settings can be expressed in this way:

- the log-rank statistic
- weighted log-rank statistics
- the difference in restricted mean survival times
- the difference in milestone survival probabilities

## What do the scores look like?

Let's take the example of the log-rank statistic applied to the data from the POPLAR study [(Gandara et al.)](https://www.nature.com/articles/s41591-018-0134-3).

```{r echo=FALSE, message=FALSE, warning=FALSE}
library(nphRCT)
library(dplyr)
library(survival)
library(ggplot2)

data("moderate_cross")
dat <- moderate_cross

km <- survfit(Surv(time, event) ~ arm, data = dat)

p_km <- survminer::ggsurvplot(km,
                              data = dat,
                              risk.table = TRUE,
                              break.x.by = 6,
                              legend.title = "",
                              xlab = "Time (months)",
                              ylab = "Overall survival",
                              risk.table.fontsize = 4,
                              legend = c(0.8, 0.8))

df_lr <- find_scores(formula = Surv(time, event) ~ arm,
                     data = dat,
                     method = "lr")

p_lrt <- plot(df_lr, title = "Log rank")

cowplot::plot_grid(p_km[[1]], p_lrt, rel_widths = c(2, 3))
```

In the graph on the right-hand side, each dot corresponds to a patient in the trial. On the x-axis is the patient's follow-up time and on the y-axis is the score assigned to that observation. The dots form two approximately parallel lines. The top line corresponds to observed events; the bottom line corresponds to censored observations. An observed event close to time zero receives a score close to 1; a censored observation at around month 24 receives a score of -1; intermediate outcomes receive intermediate scores (the scores have been shifted and scaled to range between 1 and -1). The mean scores on the two treatment arms are indicated with horizontal lines. The difference in mean score *is* the log-rank statistic (or, more precisely, a re-scaled version of it).

## Why is this interesting?

I think this perspective becomes helpful when we start to compare alternative test statistics.
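
Before comparing alternatives, a minimal sketch of the idea in code may help. The helper below is not part of nphRCT, and the column names `score` and `arm` used when applying it to the `find_scores()` output are assumptions about that object's structure; inspect `str(df_lr)` and adjust as needed.

```{r eval=FALSE}
# Sketch only: difference in mean scores between the two arms.
diff_mean_score <- function(score, arm) {
  arms <- sort(unique(arm))
  mean(score[arm == arms[2]]) - mean(score[arm == arms[1]])
}

# Hypothetical usage -- assumes the find_scores() output can be coerced to a
# data frame with columns named `score` and `arm`; check str(df_lr) first.
scores_lr <- as.data.frame(df_lr)
diff_mean_score(scores_lr$score, scores_lr$arm)
```

The same helper could be applied to the scores underlying any of the statistics discussed below.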
One popular approach in the context of non-proportional hazards is to base inference on the difference in restricted mean survival times (RMST) on the two arms (based on the Kaplan-Meier estimates). If we express the corresponding test statistic as a difference in scores (following [Andersen et al.](https://pubmed.ncbi.nlm.nih.gov/28384840/)), we see that every observation after the restriction time (in this case 18 months) gets the same score. So an observed event at month 18 receives the same score as a censored observation at 24 months, for example.

```{r echo=FALSE, message=FALSE, warning=FALSE}
df_rmst_pseudo <- find_scores(formula = Surv(time, event) ~ arm,
                              tau = 18,
                              data = dat,
                              method = "rmst")

p_rmst <- plot(df_rmst_pseudo, title = "RMST")

cowplot::plot_grid(p_km[[1]], p_rmst, rel_widths = c(2, 3))
```

Next we might try a statistic based on the difference in milestone survival probabilities at 12 months (via the Kaplan-Meier estimates). If there were no censoring prior to 12 months, this would be like giving everyone with an event prior to 12 months a score of 1 and everyone still alive at 12 months a score of -1. Owing to censoring, however, events closer to 12 months are up-weighted.

```{r echo=FALSE, message=FALSE, warning=FALSE}
df_milestone_pseudo <- find_scores(formula = Surv(time, event) ~ arm,
                                   tau = 12,
                                   data = dat,
                                   method = "ms")

p_surv <- plot(df_milestone_pseudo, title = "Milestone")

cowplot::plot_grid(p_km[[1]], p_surv, rel_widths = c(2, 3))
```

## Weighted log-rank tests

Weighted log-rank tests can also be compared in this framework. For example, here is a [modestly-weighted log-rank test](https://arxiv.org/abs/1807.11097) with $t^*= 9$.

```{r echo=FALSE, message=FALSE, warning=FALSE}
df_mwlrt <- find_scores(formula = Surv(time, event) ~ arm,
                        data = dat,
                        t_star = 9,
                        method = "mw")

p_mwlrt <- plot(df_mwlrt, title = "MWLRT")

cowplot::plot_grid(p_km[[1]], p_mwlrt, rel_widths = c(2, 3))
```

This might be thought of as intermediate between a log-rank test and a milestone test. It is similar to a milestone analysis in the sense that it is a contrast of the later parts of the survival curves, with early events all getting a score of 1. However, unlike the milestone analysis, the contrast does not focus on just one timepoint. Rather, it is an average contrast over late follow-up times, with better outcomes receiving gradually better scores, and in this sense it is more like the log-rank test.

## Early versus late contrasts

By looking carefully at these graphs we can get a sense of whether a particular test puts more emphasis on early or late parts of the follow-up period. Roughly speaking, if, at a particular time, there is a large difference between the score given to an observed event and the score given to a censored observation, then heavy emphasis is being placed on that timepoint. Here, for example, we see that the RMST scores place high emphasis on early follow-up times, and vice versa for the milestone and modestly-weighted tests. The log-rank test has more gradually changing scores across the whole follow-up period.

```{r echo=FALSE, message=FALSE, warning=FALSE}
cowplot::plot_grid(p_rmst, p_lrt, p_mwlrt, p_surv, nrow = 2)
```

# Appendix

## Where do these numbers come from?

In all cases, we can think in the permutation test framework, considering the scores as fixed and permuting the treatment labels.
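
As a minimal sketch of what that means in code (the vector of scores `a` and the 0/1 treatment labels `z` are placeholders for quantities derived as described below), the observed difference in mean scores is compared with its distribution under random re-assignment of the labels:

```{r}
# Sketch of a permutation test with fixed scores `a` and 0/1 treatment labels `z`.
permutation_test_scores <- function(a, z, n_perm = 10000) {
  diff_means <- function(lab) mean(a[lab == 1]) - mean(a[lab == 0])
  observed <- diff_means(z)
  # Permute the treatment labels, keeping the scores fixed.
  perm <- replicate(n_perm, diff_means(sample(z)))
  # Two-sided permutation p-value.
  p_value <- mean(abs(perm) >= abs(observed))
  list(observed = observed, p_value = p_value)
}
```

Note that shifting or positively rescaling the scores changes the observed statistic and every permuted value by the same shift and scale, so the p-value is unchanged; this is why the scores in the plots above can be standardized to lie between -1 and 1.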
### (Weighted) log-rank test

Following Leton \& Zuluaga (2000), letting $l_{1,j}$ and $l_{0,j}$ denote the number of patients censored on the test treatment and control treatment, respectively, during $\left[\left.t_j, t_{j+1}\right)\right.$, we can express the (weighted) log-rank statistic as

$$\begin{align*}
U_W &:=\sum_{j = 1}^{k} w_j\left( O_{1,j} - O_j\frac{n_{1,j}}{n_j} \right)\\
&=\sum_{j = 1}^{k} w_jO_{1,j} - \sum_{j = 1}^{k}w_j\frac{O_{j}}{n_j} \times n_{1,j}\\
&=\sum_{j = 1}^{k} w_jO_{1,j} - \sum_{j = 1}^{k}w_j\frac{O_{j}}{n_j} \times \sum_{i = j}^{k}(O_{1,i} + l_{1,i})\\
&=\sum_{j = 1}^{k} w_jO_{1,j} - \sum_{j = 1}^{k}(O_{1,j} + l_{1,j}) \times \sum_{i = 1}^{j}w_i\frac{O_{i}}{n_i}\\
&=\sum_{j = 1}^{k} O_{1,j}\left( w_j - \sum_{i = 1}^{j}w_i\frac{O_{i}}{n_i} \right) + \sum_{j = 1}^{k}l_{1,j} \left( - \sum_{i = 1}^{j}w_i\frac{O_{i}}{n_i}\right).
\end{align*}$$

This means that an observed event at time $t_j$ is given a score of $a_i = w_j - \sum_{i = 1}^{j}w_i\frac{O_{i}}{n_i}$, and an observation censored during $\left[\left.t_j, t_{j+1}\right)\right.$ is given a score of $a_i = - \sum_{i = 1}^{j}w_i\frac{O_{i}}{n_i}$.

Note that instead of using $U_W$ one could also use

\begin{equation*}
\tilde{U}_W = \frac{\sum_{i= 1}^n \mathbb{I}\left\lbrace z_{i} = 1 \right\rbrace a_{i}}{\sum_{i= 1}^n \mathbb{I}\left\lbrace z_{i} = 1 \right\rbrace} - \frac{\sum_{i= 1}^n \mathbb{I}\left\lbrace z_{i} = 0 \right\rbrace a_{i}}{\sum_{i= 1}^n \mathbb{I}\left\lbrace z_{i} = 0 \right\rbrace}
\end{equation*}

as the test statistic in a permutation test, since $U_W$ and $\tilde{U}_W$ are equivalent up to a (positive) scale and shift transformation, i.e.,

\begin{equation*}
\tilde{U}_W = \frac{n}{\sum_{i= 1}^n \mathbb{I}\left\lbrace z_{i} = 1 \right\rbrace\sum_{i= 1}^n \mathbb{I}\left\lbrace z_{i} = 0 \right\rbrace} \times U_W - \frac{\sum_{i= 1}^n a_{i} \sum_{i= 1}^n \mathbb{I}\left\lbrace z_{i} = 1 \right\rbrace}{\sum_{i= 1}^n \mathbb{I}\left\lbrace z_{i} = 1 \right\rbrace\sum_{i= 1}^n \mathbb{I}\left\lbrace z_{i} = 0 \right\rbrace}.
\end{equation*}

Furthermore, re-scaling the scores to

$$b_i = \frac{2a_i - \max{a} - \min{a}}{\max{a} - \min{a}},$$

so that $b_i \in [-1,1]$, would also leave the p-value of the permutation test unchanged.

### RMST / Milestone

We use the concept of pseudo-values, following [Andersen et al. (2017)](https://pubmed.ncbi.nlm.nih.gov/28384840/). For the RMST, without any adjustment for covariates, the $i$-th pseudo-value (at time $\tau$) is defined as

$$\widehat{\theta}_{i}^{\tiny \mbox{RMST}} = n \int_0^\tau \widehat{S}(t) dt - (n-1) \int_0^\tau \widehat{S}^{(-i)}(t) dt,$$

where $\widehat{S}(t)$ is the Kaplan-Meier estimator based on all $n$ observations and $\widehat{S}^{(-i)}(t)$ is the Kaplan-Meier estimator excluding observation (or subject) $i$. For the milestone survival probability at time $\tau$, also without any adjustment for covariates, the $i$-th pseudo-value is defined as

$$\widehat{\theta}_{i}^{\tiny \mbox{MLST}} = n \widehat{S}(\tau) - (n-1) \widehat{S}^{(-i)}(\tau).$$

Having found pseudo-values for each patient, they can be used just like any other (continuous) outcome. For example, they could be fed into a linear model with a treatment term only. In this case, the resultant test statistic for testing the null hypothesis of zero difference between treatments would be a difference in mean pseudo-values, i.e., the pseudo-values play the same role as the "scores" in the weighted log-rank tests. Again, these scores could be standardized to lie between -1 and 1 without affecting the p-value.
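
For concreteness, here is a minimal sketch (not the nphRCT implementation) that computes RMST pseudo-values directly from the definition above, using leave-one-out Kaplan-Meier fits from the survival package, and then reads off the difference in mean pseudo-values as the treatment coefficient of a linear model with a treatment term only. It assumes the `dat` object (with columns `time`, `event` and `arm`) created earlier in this vignette.

```{r eval=FALSE}
library(survival)

# Area under the Kaplan-Meier curve up to tau (i.e. the restricted mean survival time).
km_area <- function(time, event, tau) {
  fit <- survfit(Surv(time, event) ~ 1)
  keep <- fit$time <= tau
  t_pts <- c(0, fit$time[keep], tau)   # interval endpoints, truncated at tau
  s_pts <- c(1, fit$surv[keep])        # S(t) on each interval [t_j, t_{j+1})
  sum(diff(t_pts) * s_pts)
}

# RMST pseudo-values: theta_i = n * RMST(all) - (n - 1) * RMST(leaving out i).
rmst_pseudo_values <- function(time, event, tau) {
  n <- length(time)
  full <- km_area(time, event, tau)
  sapply(seq_len(n), function(i) {
    n * full - (n - 1) * km_area(time[-i], event[-i], tau)
  })
}

# Difference in mean pseudo-values via a linear model with a treatment term only.
dat$pseudo_rmst <- rmst_pseudo_values(dat$time, dat$event, tau = 18)
coef(lm(pseudo_rmst ~ arm, data = dat))
```

The milestone version is analogous: replace the integral of the Kaplan-Meier curve up to $\tau$ with its value at $\tau$.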