Techniques of survival analysis are needed once you have right-censored data. Such data is the result of clinical trials or retrospective studies that observe a defined endpoint such as progression free survival or overall survival: At time of analysis, the endpoint has not occurred for all subjects. There are two possible reasons: A subject is lost to follow-up at some point in time and you cannot find out if the endpoint occurred or not, or your analysis is done when the subject is still under follow-up. In any case, you need standard methods of analysis, which you will find in any paper reporting trial result in clinical oncology. (Please note that this introduction is very short and insufficient to cover many important statistical aspects)
The survivalAnalysis package assumes your data to be contained in one data frame with one row per subject and at least two columns: - A numeric column giving the survival time. - A logical or numeric (0/1) column with the censoring indicator. TRUE or 1 is interpreted such that the endpoint occurred for the subject at the given time, FALSE or 0 is interpreted such that the endpoint did not occur during the given time under observation.
From a dataset containing only survival data, you can generate descriptive statistics such as median survival, but usually you want to compare groups by subject characteristics. Thus, the data frame will usually contain further columns containing covariates of interest. If one such covariate (or a computational result using these covariates) result in a logical or categorical variable, continue reading this vignette. If you want to analyse multiple covariates at once, accounting for possible interactions, or need to analyse continuous variables (truly as a continous variable, not categorical by specifying cut-offs), please read the companion vignette on multivariate analysis.
For the purpose of this vignette, we use the lung
dataset from the survival
package:
inst | time | status | age | sex | ph.ecog | ph.karno | pat.karno | meal.cal | wt.loss |
---|---|---|---|---|---|---|---|---|---|
3 | 306 | 2 | 74 | 1 | 1 | 90 | 100 | 1175 | NA |
3 | 455 | 2 | 68 | 1 | 0 | 90 | 90 | 1225 | 15 |
3 | 1010 | 1 | 56 | 1 | 0 | 90 | 90 | NA | 15 |
5 | 210 | 2 | 57 | 1 | 1 | 90 | 60 | 1150 | 11 |
1 | 883 | 2 | 60 | 1 | 0 | 100 | 90 | NA | 0 |
12 | 1022 | 1 | 74 | 1 | 1 | 50 | 80 | 513 | 0 |
7 | 310 | 2 | 68 | 2 | 2 | 70 | 60 | 384 | 10 |
11 | 361 | 2 | 71 | 2 | 2 | 60 | 80 | 538 | 1 |
1 | 218 | 2 | 53 | 1 | 1 | 70 | 80 | 825 | 16 |
7 | 166 | 2 | 61 | 1 | 2 | 70 | 70 | 271 | 34 |
Survival data is contained in the “time” and “status” columns, with all other columns as presumed or potential covariates.
First of all, we load the tidyverse, for some tools the tidytidbits package and finally and the survivalAnalysis package:
The survivalAnalysis package is built around the notion of performing
an analysis generating a result object, and then performing
output such as printing or plotting using this object. Of course,
everything can be done in a magrittr
pipeline.
As a first step, we want to get a feeling for the data and have a
look at the median survival. You need to tell the
analyse_survival
method which columns to regard as time and
censoring indicator. We do that with the vars()
method from
dplyr
, which allows to give the plain variable names.
Passing column names as strings is also possible.
survival::lung %>%
analyse_survival(vars(time, status))
#> Overall Analysis:
#> log.rank.p n median Lower.CI Upper.CI min max
#> overall: <NA> 228 310.0 285.0 363.0 5.0 1022.0
#>
#> <only one group in analysis>
The analyse_survival method actually returns a list of class
‘SurvivalAnalysisUnivariateResults’. There are reimplementations for
print()
and format()
provided. For the
timespans that apply here we are more used to thinking in months than in
days, so let’s print it in months:
survival::lung %>%
analyse_survival(vars(time, status)) %>%
print(timespan_unit="months")
#> Overall Analysis:
#> log.rank.p n median Lower.CI Upper.CI min max
#> overall: <NA> 228 10.2 9.3 11.9 0.2 33.5
#>
#> <only one group in analysis>
As you see, there is a comment that only one group was analyzed and a log-rank p is not defined. You get a median survival with confidence interval and range.
The ECOG performance status is a classical prognostic indicator. To
get a feeling for the data, let’s first have a look how it is
distributed. The count_by
method from
tidytidbits
comes handy.
survival::lung %>%
count_by(ph.ecog)
#> # A tibble: 5 × 4
#> ph.ecog n rel percent
#> <dbl> <int> <dbl> <chr>
#> 1 0 63 0.276 "27.6%"
#> 2 1 113 0.496 "49.6%"
#> 3 2 50 0.219 "21.9%"
#> 4 3 1 0.00439 " 0.4%"
#> 5 NA 1 0.00439 " 0.4%"
There is only one subject with ECOG 3. It makes sense to group 2 und 3 together. Then we perform the survival analysis comparing the subgroups formed by the ECOG status. We store the result object for later use.
survival::lung %>%
mutate(ecog=recode_factor(ph.ecog, `0`="0", `1`="1", `2`="2-3", `3`="2-3")) %>%
analyse_survival(vars(time, status), by=ecog) ->
result
print(result)
#> Analysis by ecog:
#> Overall Analysis:
#> log.rank.p n median Lower.CI Upper.CI min max
#> overall: p<0.001 227 310.0 285.0 363.0 5.0 1022.0
#>
#> Descriptive Statistics per Subgroup:
#> records events median Lower.CI Upper.CI
#> 0 63 37 394.0 348.0 574.0
#> 1 113 82 306.0 268.0 429.0
#> 2-3 51 45 183.0 153.0 288.0
#>
#> Pair-wise p-values (log-rank, uncorrected):
#> 0 1
#> 1 0.063 NA
#> 2-3 0.0 0.003
#>
#> Pair-wise Hazard Ratios:
#> Lower.CI HR Upper.CI p
#> 1 vs. 0 0.98 1.45 2.13 0.064
#> 0 vs. 1 0.47 0.69 1.02 0.064
#> 2-3 vs. 0 1.64 2.54 3.93 <0.001
#> 0 vs. 2-3 0.25 0.39 0.61 <0.001
We get much more information now. ECOG status is a highly significant prognostic factor.
Please note that the textual presentation can be customized (see help
page of format.SurvivalAnalysisUnivariateResult
) and is
sufficient for many purposes, but if you need the numbers, the
survival_data_frames
function allows to extract the same
information in a structured list as data frames.
You may have noted that the by
argument also takes plain
variable names. If you are used to the tidyverse style, just skip this,
but if not: You can use the columns of the data frame as if they were
local variables. You can also pass expressions. We may, for example,
want to compare ECOG 0-1 with ECOG 2-3. Instead of inserting a
mutate
call, we just pass the logical expression
ph.ecog <= 1
. In this case, you should give
interpretation of the logical values so that the result can easily be
understood. There is the by_label_map
parameter in
analyse_survival
which allows to give labels to the values
of by
. These labels will also be used in plots.
survival::lung %>%
analyse_survival(vars(time, status),
by=ph.ecog <= 1,
by_label_map=c(`TRUE`="ECOG 1-2",
`FALSE`="ECOG 2-3"))
#> Analysis by ph.ecog <= 1:
#> Overall Analysis:
#> log.rank.p n median Lower.CI Upper.CI min max
#> overall: p<0.001 227 310.0 285.0 363.0 5.0 1022.0
#>
#> Descriptive Statistics per Subgroup:
#> records events median Lower.CI Upper.CI
#> ECOG 1-2 176 119 353.0 306.0 433.0
#> ECOG 2-3 51 45 183.0 153.0 288.0
#>
#> Hazard Ratio:
#> Lower.CI HR Upper.CI p
#> ECOG 2-3 vs. ECOG 1-2 1.41 2.0 2.82 <0.001
#> ECOG 1-2 vs. ECOG 2-3 0.35 0.5 0.71 <0.001
For your paper you will want to provide Kaplan-Meier plots of your results. The survminer package has revolutionized drawing of Kaplan-Meier plots in R. The survivalAnalysis package fully integrates plotting with survminer, providing sane defaults:
Apparently, some details need tweaking. The relevant function,
ggsurvplot()
, is highly customizable, please refer to its
documentation! All parameters supported by ggsurvplot can simply be
passed to kaplan_meier_plot()
. In addition, there are some
shortcuts which we will now use.
We want to add * x axis label for OS * x axis breaks by quarter year * legend label * display the hazard ratios * a risk table, but “clean” only with minimal overhead * use a theme of our choice
kaplan_meier_plot(result,
break.time.by="breakByQuarterYear",
xlab=".OS.months",
legend.title="ECOG Status",
hazard.ratio=TRUE,
risk.table=TRUE,
table.layout="clean",
ggtheme=ggplot2::theme_bw(10))
What about differences by sex? Let’s draw two KM plots side-by-side
using kaplan_meier_grid()
. This piece of code also
demonstrates the possible use of lists of default arguments, which can
be overridden. If arguments differ per-plot, there is the
mapped_plot_args argument.
Internally, gridExtra::marrangeGrob()
does the layout.
For some reason, the returned value needs an explicit
print()
. Normally you will want to save it to PDF. I
recommend using tidytidbits
’ save_pdf
, into
which you can simply pipe the kaplan_meier_grid output.
default_args <- list(break.time.by="breakByMonth",
xlab=".OS.months",
legend.title="ECOG Status",
hazard.ratio=TRUE,
risk.table=TRUE,
table.layout="clean",
ggtheme=ggplot2::theme_bw(10))
list(result,
survival::lung %>%
analyse_survival(vars(time, status),
by=sex,
by_label_map=c(`1`="Male", `2`="Female"))
) %>%
kaplan_meier_grid(nrow=2,
default_args,
break.time.by="breakByQuarterYear",
mapped_plot_args=list(
legend.title=c("ECOG Status", "Sex"),
title=c("A", "B")
)) %>%
print