The LTRCtrees
package is designed to fit survival trees
for left-truncated and right censored (LTRC) data as described in
Fu and Simonoff (2017a)
[1], and interval-censored survival
data as described in Fu and Simonoff (2017b)
[2]. For right
censored data, the tree algorithms in rpart
[3] and
partykit
[4], respectively, can be used to fit a survival
tree. The LTRCtrees
package extends the survival tree
algorithms in these two packages to fit trees for LTRC data, as well as
interval-censored survival data. Since the package uses the fitting
functions in rpart
and partykit
, these
packages will be installed if they have not already been installed on
the user’s machine, but they are not loaded into the session when the
LTRCtrees
package is loaded. In addition to fitting trees
to LTRC data, a major benefit of these extensions is that they can be
used to fit survival trees for data with time-varying covariates.
This document gives several examples of how to use the LTRC trees and
ICtree (interval-censored tree) implemented in the
LTRCtrees
package.
The assay of serum free light chain data for 7874 subjects in the R
package survival
[5] is used as a data example to show how
to fit survival trees for LTRC data. The original data were obtained
from the residents of Olmsted County aged 50 or greater, and therefore
the data are left-truncated at age 50. The original analysis of this
data set was based on the Cox model with age as a covariate and time
after enrollment in the study as response. However, in a mortality study
such as this one, since greater age is almost always associated with
higher risk of death, it is not that meaningful (or surprising) to have
age as a (significant) covariate. Also, it is often the case that the
real response of interest should be the subject’s life length, not the
time from enrollment in the study to death/censoring.
In this example, we use age as a left-truncation point and death/censored time as the event time.
## Adjust data & clean data
library(survival)
set.seed(0)
## Since LTRCART uses cross-validation to prune the tree, specifying the seed
## guarantees that the results given here will be duplicated in other analyses
Data <- flchain
Data <- Data[!is.na(Data$creatinine),]
Data$End <- Data$age + Data$futime/365
DATA <- Data[Data$End > Data$age,]
names(DATA)[6] <- "FLC"
The key here is to have a (start, stop, event)
triplet
in the data, where start
is the left-truncation time,
stop
is the right censored/event time and
event
is the event indicator, where 1 means death and 0
means censored. In our case, the triplet is
(age, End, death)
.
Here we set up a training set and a test set to illustrate predictions using the fitted trees.
## Fit LTRCART and LTRCIT survival tree
library(LTRCtrees)
LTRCART.obj <- LTRCART(Surv(age, End, death) ~ sex + FLC + creatinine, Train)
LTRCIT.obj <- LTRCIT(Surv(age, End, death) ~ sex + FLC + creatinine, Train)
## Loading required namespace: inum
## Putting Surv(End, death) in formula would result an error message
## since both LTRCART and LTRCIT are expecting Surv(time1, time2, event)
## Plot the fitted LTRCART tree using rpart.plot function in rpart.plot[6] package
library(rpart.plot)
## Loading required package: rpart
Note that the terminal node of the LTRCART tree (left panel) displays
the relative risk on that node (relative to an estimated baseline
hazard), while the terminal node of the LTRCIT tree (right panel)
displays the fitted Kaplan-Meier curve for the observations in that
node. One can also plot Kaplan-Meier curves on terminal nodes of the
LTRCART tree by converting it to party
object.
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
LTRCART.obj.party <- as.party(LTRCART.obj)
LTRCART.obj.party$fitted[["(response)"]]<- Surv(Train$age, Train$End, Train$death)
plot(LTRCART.obj.party)
Since the class of an LTRCIT object is party
, the
prediction method on an LTRCIT object is the same as those for any trees
of class party
using the partykit
package.
## predict median survival time on test data using fitted LTRCIT tree
LTRCIT.pred <- predict(LTRCIT.obj, newdata=Test, type = "response")
head(LTRCIT.pred)
## 1057 1058 1059 1060 1061 1062
## 90.47123 90.47123 90.47123 90.47123 90.47123 83.41644
## predict Kaplan Meier survival curve on test data
## return a list of survfit objects -- the predicted KM curves
LTRCIT.pred <- predict(LTRCIT.obj, newdata=Test, type = "prob")
head(LTRCIT.pred,2)
## $`1057`
## Call: survfit(formula = y ~ 1, weights = w, subset = w > 0)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## [1,] 248 144 36 189 90.5 88.4 91.4
##
## $`1058`
## Call: survfit(formula = y ~ 1, weights = w, subset = w > 0)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## [1,] 248 144 36 189 90.5 88.4 91.4
The prediction function on a survival rpart
object can
only predict the relative risk, so we can only predict the relative risk
on an LTRCART object.
## Predict relative risk on test set
LTRCART.pred <- predict(LTRCART.obj, newdata=Test)
head(LTRCART.pred)
## 1057 1058 1059 1060 1061 1062
## 0.7212083 0.7212083 0.7212083 0.7212083 0.7212083 1.5257057
In order to predict the median survival time and Kaplan-Meier curves
based on an LTRCART object, one can use the Pred.rpart
function available in the LTRCtrees
package.
## Predict median survival time and Kaplan Meier survival curve
## on test data using Pred.rpart
LTRCART.pred <- Pred.rpart(Surv(age, End, death) ~ sex + FLC + creatinine, Train, Test)
head(LTRCART.pred$KMcurves, 2) ## list of predicted KM curves
## [[1]]
## Call: survfit(formula = Formula, data = subset)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## [1,] 248 144 36 189 90.5 88.4 91.4
##
## [[2]]
## Call: survfit(formula = Formula, data = subset)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## [1,] 248 144 36 189 90.5 88.4 91.4
## [1] 90.5 90.5 90.5 90.5 90.5 83.8
Note that in order to use the Pred.rpart
function one
needs to pass both the Training and Test sets into the function. If the
formula passed to Pred.rpart
has the form Surv(time1,
time2, event), then the function recognizes this as LTRC data and
LTRCART is called internally; if the formula has the form Surv(time,
event), then the function recognizes this as ordinary right censored
data, and the rpart
function in the rpart
package is called internally.
To fit time-varying covariates survival trees using
LTRCART
or LTRCIT
, data need to be
prepared/transformed into the so-called Andersen-Gill style. This is
exactly the same data format needed to fit the Cox model with
time-varying covariates.
Patient.ID | Sex | Blood.pressure | Start | End | Death |
---|---|---|---|---|---|
1 | F | 100 | 0 | 10 | 0 |
1 | F | 89 | 10 | 20 | 0 |
1 | F | 120 | 20 | 27 | 1 |
2 | M | 110 | 0 | 10 | 0 |
2 | M | 105 | 10 | 19 | 0 |
This table gives an example of the form of Andersen-Gill style data. Information for two patients is presented in this table, with Sex being a time-independent covariate and Blood pressure being a time-varying covariate. For the first patient (patient.ID=1), the event-death is observed at time 27, while the 3 measurements of Sex and blood pressure are recorded at the beginning and time 10 and 20 respectively. Information for the second patient has a similar interpretation except he is censored at time 19.
Note that each row represents a time interval, where within each interval all covariates are assumed to have constant values. The first row means the first patient is a female with blood pressure equal to 100 during the interval (0,10]. Of course, blood pressure is unlikely to keep at a fixed level during this time interval (since it is time-varying), but in practice a patient’s blood pressure is measured only at the time he/she visits the clinic, so it is common practice to treat it as constant between the two visits.
Note that as a result each row has the form of LTRC data – left-truncated at Start and right-censored or experience event at End, with Death being the event indicator.
The pbcseq
dataset in the survival
package
is used as an example to illustrate fitting survival trees with
time-varying covariates. These data were obtained from 312 patients with
primary biliary cirrhosis (PBC) enrolled in a double-blind,
placebo-controlled, randomized trial conducted between January, 1974 and
May, 1984 at the Mayo Clinic to evaluate the use of D-penicillamine for
treating PBC. A comprehensive clinical and laboratory database was
established on each patient. Follow-up was extended to April, 1988, by
which time 140 of the patients had died and 29 had undergone orthotopic
liver transplantation. These patients generated 1,945 patient visits
that enable us to study the change in the prognostic variables of
PBC.
set.seed(0)
library(survival)
## Create the start-stop-event triplet needed for coxph and LTRC trees
first <- with(pbcseq, c(TRUE, diff(id) !=0)) #first id for each subject
last <- c(first[-1], TRUE) #last id
time1 <- with(pbcseq, ifelse(first, 0, day))
time2 <- with(pbcseq, ifelse(last, futime, c(day[-1], 0)))
event <- with(pbcseq, ifelse(last, status, 0))
event <- 1*(event==2)
pbcseq$time1 <- time1
pbcseq$time2 <- time2
pbcseq$event <- event
## Fit the Cox model and LTRC trees with time-varying covariates
fit.cox <- coxph(Surv(time1, time2, event) ~ age + sex + log(bili), pbcseq)
LTRCIT.fit <- LTRCIT(Surv(time1, time2, event) ~ age + sex + log(bili), pbcseq)
LTRCART.fit <- LTRCART(Surv(time1, time2, event) ~ age + sex + log(bili), pbcseq)
## Result of the Cox model with time-varying covariates
fit.cox
## Call:
## coxph(formula = Surv(time1, time2, event) ~ age + sex + log(bili),
## data = pbcseq)
##
## coef exp(coef) se(coef) z p
## age 0.068285 1.070670 0.008624 7.918 2.42e-15
## sexf 0.182856 1.200641 0.234190 0.781 0.435
## log(bili) 1.465486 4.329646 0.094157 15.564 < 2e-16
##
## Likelihood ratio test=352.8 on 3 df, p=< 2.2e-16
## n= 1945, number of events= 140
## plots of fitted survival trees with time-varying covariates
prp(LTRCART.fit,type=0, roundint=FALSE)
plot(LTRCIT.fit)
Note that both survival trees fitted using LTRCART and LTRCIT,
respectively, identify age
and log(bili)
as
risk factors, which is consistent with the Cox model result.
In the previous example, the pbcseq
data are in the long
format – each row represents a start-stop-event triplet, which contains
the information in the time interval (start, stop]. Note that all
covariates have constant values inside each such interval, while each
subject could have multiple rows (multiple intervals) to represent the
time-varying covariates effect, i.e. different rows (intervals) of the
same subject contain different covariates’ value. The long format data
can be directly used to fit survival trees with time-varying covariates
using either LTRCART or LTRCIT.
It is common to have time-varying covariates survival data that is in
wide format – each row contains all of the information of one subject.
To use the tree algorithms in LTRCtrees
, we first need to
transform the wide format into the long format. The tmerge
function in the survival
package can be used to serve this
purpose.
library(survival)
### transform the wide format data into the long format data using tmerge function
### from survival package on Stanford Heart Transplant data
jasa$subject <- 1:nrow(jasa)
tdata <- with(jasa, data.frame(subject = subject,
futime= pmax(.5, fu.date - accept.dt),
txtime= ifelse(tx.date== fu.date,
(tx.date -accept.dt) -.5,
(tx.date - accept.dt)),
fustat = fustat))
sdata <- tmerge(jasa, tdata, id=subject,death = event(futime, fustat),trt = tdc(txtime), options= list(idname="subject"))
sdata$age <- sdata$age - 48
sdata$year <- as.numeric(sdata$accept.dt - as.Date("1967-10-01"))/365.25
Cox.fit <- coxph(Surv(tstart, tstop, death) ~ age+ surgery, data= sdata)
LTRCART.fit <- LTRCART(Surv(tstart, tstop, death) ~ age + transplant, data = sdata)
LTRCIT.fit <- LTRCIT(Surv(tstart, tstop, death) ~ age + transplant, data = sdata)
## results
Cox.fit
## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ age + surgery, data = sdata)
##
## coef exp(coef) se(coef) z p
## age 0.03068 1.03115 0.01364 2.25 0.0245
## surgery -0.77284 0.46170 0.35953 -2.15 0.0316
##
## Likelihood ratio test=10.72 on 2 df, p=0.00471
## n= 170, number of events= 75
## Loading required package: perm
## Loading required package: Icens
## Loading required package: MLEcens
## Depends on Icens package available on bioconductor.
## To install use for example:
## install.packages('BiocManager')
## BiocManager::install('Icens')
data(bcos)
## Fit ICtree survival tree
Ctree <- ICtree(Surv(left,right,type="interval2")~treatment, bcos)
## ans < 0 || ans > 1. t_diff = 12.000000, pStep = 0.055206, intLen = 12.000000, ind = 19, k = 20
[1] Fu, W. and Simonoff, J.S. (2017a). Survival trees for left-truncated and right-censored data, with application to time-varying covariate data. Biostatistics 18 (2), 352-369
[2] Fu, W. and Simonoff, J.S. (2017b). Survival trees for Interval Censored Survival data. Statistics in medicine 36 (30), 4831-4842
[3] Therneau, T., Atkinson, B., Ripley, B. and Ripley, M.B., (2015). rpart: Recursive Partitioning and Regression Trees.
[4] Hothorn, T. and Zeileis, A., (2015). partykit: A Modular Toolkit for Recursive Partytioning in R.
[5] Therneau, T., (2015). survival: A Package for Survival Analysis in S. version 2.38.
[6] Milborrow, S., (2011). rpart.plot: Plot ‘rpart’ models: An Enhanced Version of ‘plot.rpart’.