The coxrt
package
accompanies the paper of Vakulenko-Lagun, Mandel and
Betensky (2019) and is designed for analysis of right-truncated
data.
When we are interested in the analysis of lifetimes that start at an initiating event and end up when an event of interest occurs, but we cannot obtain a population of interest to follow them prospectively, retrospective ascertainment of cases could be a good alternative. However, this might result in right-truncated survival data.
Right truncation often happens when we cannot identify the population of interest since their initiating events are latent (like the times of infections), and the population becomes “visible” only when another event, that can be more easily identified, happens. Here we assume that for those who are sampled, the times of initiating events are known or can be reliably estimated.
One example of such type of data is the data set on AIDS cases resulted from contaminated blood transfusion (Kalbfeisch and Lawless 1989). Here, the time of interest is the incubation period of AIDS, i.e., the time from HIV infection to development of AIDS. Figure 1 shows schematically how these data were collected. The individuals who developed AIDS by June 30, 1986 (the date when the AIDS cases were retrospectively ascertained) were selected, satisfying by this the condition Ti < Ri, where Ti is an incubation period for the ith person and Ri is his/her truncation time, defined as the time from HIV infection to June 30, 1986.
In the two functions of the package, coxph.RT
and
coxph.RT.a0
, we assume that Ti are observed
exactly and there is no censoring. The analysis is based only on sampled
events. Those who belong to the population but have not developed AIDS
by the time of collection of data remain unobserved.
The sampling condition Ti < Ri means that we tend to sample shorter lifetimes than longer ones. And an important complication of such sampling scheme is that the follow-up is limited: on Figure 1, r*, defined as the time from the earliest HIV infection (January 1983) to June 30 1986, is the maximum follow-up time in this data set. This is an important statistic, since it allows us to realize whether our sampling covers the whole range of possible incubation periods or not. If the follow-up is too short, most likely we do not capture longer incubation periods, and they will not only be underrepresented in our sample, but even worse, they are likely to be non-represented at all. We shall call the condition P(T > r*) = 0 a positivity assumption.
For example in the AIDS data, positivity does not hold since the incubation period of AIDS can be longer than the maximum observed follow-up time in this data set that equals 3.5 years. Under such short retrospective follow-up there was no chance to observe longer incubation times, which means that P(T > r*) > 0, or equivalently, the support of observed T* is shorter than the support of the original T in the population.
Even if positivity holds, truncation anyway results in distorted
distribution, and it is important to account for this bias. For example,
on Figure 2 we illustrate the problem of distortion of the original
distribution using simulated data. Here, the population lifetime T follows Weibull
distribution as rweibull(20000, shape=3, scale=1)
. And the
population truncation time R
was generated (independently of T) from
rweibull(20000, shape=2.5, scale=1)
. The sampled lifetimes
T* comprise all
observations from the population that satisfy Ti < Ri.
This resulted in truncation probability P(T > R) = 0.5.
Figure 2 shows densities of T,
R and T*. It is clear that
distributions of T and T* are different.
Importantly, in this example positivity holds since the supports of
T and R are the same, so that every
possible lifetime has a chance to be sampled, and this kind of bias can
be easily corrected.
Even if we do not aim to estimate the whole distribution and are interested only in estimation of the effects of some covariates on the incubation period, we still need to adjust for right truncation, since, as we illustrate below, the right truncation blurs the difference between subgroups.
Suppose we have one binary covariate Z. Figure 3 shows distributions in the two subpopulations defined by Z. The lifetime T is generated from the proportional hazard regression model h(t; z) = h0(t)exp (βz) with β = 2:
library(ggplot2)
library(survival)
set.seed(17)
N <- 20000
Z <- as.numeric(runif(N,0,1) < 0.5) # binary explanatory variable
X <-rweibull(N, shape=3, scale=1*exp(-(2*Z)/3)) # shape a=3, scale b=1, beta=2
m <- max(X)
d.tr <- data.frame(time=c(X), variable=c(Z))
d.tr$variable=as.factor(d.tr$variable)
levels(d.tr$variable) <- c( "T | Z=0", "T | Z=1", " ")
p <- ggplot(d.tr) +
geom_density(aes(x=time, color=variable, fill=variable), alpha=0.5) +
scale_color_brewer(palette = "Set1")+
scale_fill_brewer(palette = "Set1")+
theme_classic() +
labs(x = "time")+
ggtitle("In the population")+
theme(axis.text.x = element_text(face="bold"),
axis.text.y = element_text(face="bold"),
plot.title = element_text(hjust = 0.5) )+ xlim(0, m)
p
Figure 4 shows sample distributions under right truncation with positivity. For this illustration we use the same population as in Figure 3, but truncate it using uniformly distributed R:
set.seed(0)
r <- runif(N, 0, m)
ind <- X<r
cat("P(T<R) = selection probability = ", mean(ind), "\n")
## P(T<R) = selection probability = 0.6922
d.tr <- data.frame(time=c(X[ind],r), variable=c(Z[ind], rep(3, N)))
d.tr$variable=as.factor(d.tr$variable)
levels(d.tr$variable) <- c( "T* | Z*=0", "T* | Z*=1", "R")
Despite the 30% chance of truncation, both Figures 3 and 4 look relatively similar to each other.
On Figure 5, we shorten the follow-up time by using r* = 1.1. This leads to violation of positivity, since longer lifetimes have no chance to be sampled, and the sub-groups become more similar to each other. In such case, without any adjustment, the estimators of β will be biased and it is likely we will not able to discover the difference between subgroups.
On Figure 6, severe violation of positivity with R ∼ U(0, 0.4) causes the groups to be almost identical:
And the situation becomes more complicated when there are several dependent covariates, as described in Vakulenko-Lagun, Mandel, and Betensky (2019).
coxrt
suggest for right-truncated data?The coxrt
package allows to estimate regression
coefficients β in the
proportional hazards model h(t; z) = h0(t)exp (βz).
coxph.RT
Function coxph.RT
implements estimating equations
denoted as IPW-S
in Vakulenko-Lagun,
Mandel, and Betensky (2019). This function assumes positivity
(as, e.g., on Figure 4).
For example, for AIDS data we can fit proportional hazards regression of incubation time on age:
library(gss)
library(coxrt)
data(aids)
all <- data.frame(age=aids$age, ageg=as.numeric(aids$age<=59), T=aids$incu, R=aids$infe, hiv.mon =102-aids$infe)
all$T[all$T==0] <- 0.5 # as in Kalbfeisch and Lawless (1989)
s <- all[all$hiv.mon>60,] # select those who were infected in 1983 or later
# analysis assuming positivity
# we request bootstrap estimate of Asymptotic Standard Error (ASE) as well:
sol <- try(coxph.RT(T~ageg, right=R, data=s, bs=TRUE, nbs.rep=500) )
knitr::kable(sol$summary) # print the summary of fit based on the analytic ASE
coef | exp.coef | SE | CI.L | CI.U | pvalue | pvalue.H1.b.gr0 | pvalue.H1.b.le0 | |
---|---|---|---|---|---|---|---|---|
ageg | 1.175433 | 3.239545 | 0.3823836 | 0.425975 | 1.924891 | 0.0021124 | 0.0010562 | 0.9989438 |
SE | CI.L | CI.U | CI.L.H1.b.gr0 | CI.U.H1.b.le0 | |
---|---|---|---|---|---|
ageg | 0.4310668 | 0.3915876 | 2.142297 | 0.5049887 | 1.957792 |
Technically, for point estimation coxph.RT
uses the
standard coxph
function from the survival
package. However, the estimator of the standard error (SE) obtained from
coxph
even with robust
option might
underestimate for smaller samples (of < 1000 observations). Although
coxph.RT
includes both analytic and bootstrap estimates of
a SE, the inference based on the analytic SE estimate, implemented in
coxph.RT
, is more reliable than based on its bootstrap
counterpart, since bootstrap resampling does not always capture the
variability in data that comes from near-zero sampling probabilities
P(R > Ti).
Since the Inverse-Probability-Weighting (IPW) approach, implemented
in coxrt
, relies on “blowing up” the sampled lifetimes
Ti in
order to obtain the unbiased pseudo-population, which should recreate
the same probability law as in the original population,
coxph.RT
in general will give biased estimators in case
there is an unobservable right tail of the distribution. Hence the
function coxph.RT.a0
aims to adjust for this violation of
positivity by using the structural assumption of the Cox model in the
population and the information on a0 = P(T > r* ∣ Z = 0).
coxph.RT.a0
For the situations without positivity (as in Figures 5 and 6), if
a0 can be
hypothesized, based on the subject matter knowledge, or if a0 is unknown but its
interval can be guessed, this information can be used in
coxph.RT.a0
. This function implements estimating equations
denoted as IPW-SA
in the original paper.
IPW-SA
uses stabilized weights. Technically,
coxph.RT.a0
solves non-linear equations using
BBsolve
function from BB package using
default parameter settings. For example, in order to estimate β, the effect of age at HIV on the
incubation period, given that a0 = 0.2, we run:
# analysis using adjusted estimating equations for a0=0.2
sol.02 <- try(coxph.RT.a0(T~ageg, right=R, data=s, a0=0.2, bs=TRUE, nbs.rep = 500))
knitr::kable(round(sol.02$bs$summary,2))
coef | exp.coef | SE | CI.L | CI.U | CI.L.H1.b.gr0 | CI.U.H1.b.le0 | pvalue | pvalue.H1.b.gr0 | pvalue.H1.b.le0 | |
---|---|---|---|---|---|---|---|---|---|---|
ageg | 1.42 | 4.13 | 0.42 | 0.63 | 2.33 | 0.73 | 2.15 | 0 | 0 | 1 |
We can do a sensitivity analysis for β̂ for different values of a sensitivity parameter a0:
# for a0=0:
sol <- try(coxph.RT(T~ageg, right=R, data=s, bs=FALSE) )
# senstivity analysis for different values of a0
a_ <- seq(0.05, 0.55, by=0.05)
est <- CI.L <- CI.U <- NULL
start_time <- Sys.time()
for(q in 1:length(a_))
{
sol.a <- try(coxph.RT.a0(T~ageg, right=R, data=s, a0=a_[q], bs=TRUE))
if (sol.a$convergence!=0)
{
cat("a0=", a_[q], ". Error occurred in BBsolve.\n")
} else
{
est <- c(est, sol.a$coef)
CI.L <- c(CI.L, sol.a$bs$summary$CI.L)
CI.U <- c(CI.U, sol.a$bs$summary$CI.U)
}
}
end_time <- Sys.time()
end_time - start_time # It takes some time to solve nonlinear estimating equations and to run bootstrap for each value of a0 ...
## Time difference of 18.77543 secs
require(ggplot2)
res.d <- data.frame(a0=c(0, a_), beta=c(sol$coef, est),
L.95=c(sol$summary$CI.L, CI.L), U.95=c(sol$summary$CI.U, CI.U))
knitr::kable(round(res.d, 2))
a0 | beta | L.95 | U.95 |
---|---|---|---|
0.00 | 1.18 | 0.43 | 1.92 |
0.05 | 1.21 | 0.35 | 1.86 |
0.10 | 1.28 | 0.47 | 2.00 |
0.15 | 1.35 | 0.58 | 2.22 |
0.20 | 1.42 | 0.57 | 2.19 |
0.25 | 1.49 | 0.64 | 2.25 |
0.30 | 1.57 | 0.76 | 2.52 |
0.35 | 1.65 | 0.77 | 2.26 |
0.40 | 1.74 | 0.86 | 2.48 |
0.45 | 1.83 | 1.05 | 2.67 |
0.50 | 1.93 | 0.95 | 2.82 |
0.55 | 2.04 | 1.10 | 2.79 |
p <- ggplot(res.d, aes(x=a0, y=beta)) +
geom_ribbon(aes(ymin=L.95, ymax=U.95), alpha=0.2) +
geom_line() + geom_point() +
geom_hline(yintercept=0)
p + xlab(expression( paste(a[0], "=P(T>", r['*']," | z=0)" , sep="")) )+
ylab(expression( paste(hat(beta), "(", a[0], ")" , sep="")) ) +
scale_x_continuous(breaks=res.d$a0, labels=res.d$a0) +
theme(axis.text.x = element_text(face="bold", angle=45),
axis.text.y = element_text(face="bold"))
In coxph.RT.a0
, only the bootstrap estimator of SE is
available.
Positivity cannot be tested. This assumption is based on the subject matter knowledge.
The approach assumes independence between T and (R, Z) in the original population. If there is dependence between T and R, it can be discovered by Kendall’s tau test which can be done using packages tranSurv or SurvTrunc, or by more powerful permutation tests which can be done using package permDep.
If a non-terminal event of interest cannot be observed because death preceded it, that is the event of interest is truncated by death, the approach implemented here can be used only if two lifetimes, time to an intermediate event (=T) and time to death (=R), are independent, otherwise the results might be biased.
coxrt
packageAlthough in practice, it is possible to have additional left or interval censoring in right-truncated data, the package assumes that the exact lifetimes, without any censoring, are observed.
The current implementation allows for only time-independent covariates.
The package estimates only covariate effects β, and does not provide estimates for the baseline hazard h0(t) in the Cox model.