Cox regression for right-truncated data

The coxrt package accompanies the paper of Vakulenko-Lagun, Mandel and Betensky (2019) and is designed for analysis of right-truncated data.

What are right-truncated data?

When we are interested in the analysis of lifetimes that start at an initiating event and end up when an event of interest occurs, but we cannot obtain a population of interest to follow them prospectively, retrospective ascertainment of cases could be a good alternative. However, this might result in right-truncated survival data.

Right truncation often happens when we cannot identify the population of interest since their initiating events are latent (like the times of infections), and the population becomes “visible” only when another event, that can be more easily identified, happens. Here we assume that for those who are sampled, the times of initiating events are known or can be reliably estimated.

One example of such type of data is the data set on AIDS cases resulted from contaminated blood transfusion (Kalbfeisch and Lawless 1989). Here, the time of interest is the incubation period of AIDS, i.e., the time from HIV infection to development of AIDS. Figure 1 shows schematically how these data were collected. The individuals who developed AIDS by June 30, 1986 (the date when the AIDS cases were retrospectively ascertained) were selected, satisfying by this the condition T_i < R_i, where T_i is an incubation period for the i^th person and R_i is his/her truncation time, defined as the time from HIV infection to June 30, 1986.

In the two functions of the package, coxph.RT and coxph.RT.a0, we assume that T_i are observed exactly and there is no censoring. The analysis is based only on sampled events. Those who belong to the population but have not developed AIDS by the time of collection of data remain unobserved.

Figure 1. Illustration of retrospective ascertainment of AIDS cases

The sampling condition T_i < R_i means that we tend to sample shorter lifetimes than longer ones. And an important complication of such sampling scheme is that the follow-up is limited: on Figure 1, r_*, defined as the time from the earliest HIV infection (January 1983) to June 30 1986, is the maximum follow-up time in this data set. This is an important statistic, since it allows us to realize whether our sampling covers the whole range of possible incubation periods or not. If the follow-up is too short, most likely we do not capture longer incubation periods, and they will not only be underrepresented in our sample, but even worse, they are likely to be non-represented at all. We shall call the condition P(T > r_*) = 0 a positivity assumption.

For example in the AIDS data, positivity does not hold since the incubation period of AIDS can be longer than the maximum observed follow-up time in this data set that equals 3.5 years. Under such short retrospective follow-up there was no chance to observe longer incubation times, which means that P(T > r_*) > 0, or equivalently, the support of observed T^* is shorter than the support of the original T in the population.

Right truncation causes distortion of the original distribution

Even if positivity holds, truncation anyway results in distorted distribution, and it is important to account for this bias. For example, on Figure 2 we illustrate the problem of distortion of the original distribution using simulated data. Here, the population lifetime T follows Weibull distribution as rweibull(20000, shape=3, scale=1). And the population truncation time R was generated (independently of T) from rweibull(20000, shape=2.5, scale=1). The sampled lifetimes T^* comprise all observations from the population that satisfy T_i < R_i. This resulted in truncation probability P(T > R) = 0.5. Figure 2 shows densities of T, R and T^*. It is clear that distributions of T and T^* are different. Importantly, in this example positivity holds since the supports of T and R are the same, so that every possible lifetime has a chance to be sampled, and this kind of bias can be easily corrected.

Figure 2. Right truncation causes distortion of the original distribution of T

Why is it important to adjust for right truncation in regression?

Even if we do not aim to estimate the whole distribution and are interested only in estimation of the effects of some covariates on the incubation period, we still need to adjust for right truncation, since, as we illustrate below, the right truncation blurs the difference between subgroups.

Suppose we have one binary covariate Z. Figure 3 shows distributions in the two subpopulations defined by Z. The lifetime T is generated from the proportional hazard regression model h(t; z) = h₀(t)exp (βz) with β = 2:

library(ggplot2)
library(survival)
set.seed(17)
N <- 20000
Z <- as.numeric(runif(N,0,1) < 0.5)             # binary explanatory variable     
X <-rweibull(N, shape=3, scale=1*exp(-(2*Z)/3)) # shape a=3, scale b=1, beta=2
m <- max(X)
d.tr <- data.frame(time=c(X), variable=c(Z))
d.tr$variable=as.factor(d.tr$variable)
levels(d.tr$variable) <- c( "T | Z=0", "T | Z=1", " ")
p <- ggplot(d.tr) +
  geom_density(aes(x=time, color=variable, fill=variable), alpha=0.5) + 
  scale_color_brewer(palette = "Set1")+
  scale_fill_brewer(palette = "Set1")+
  theme_classic() +
  labs(x = "time")+
  ggtitle("In the population")+
  theme(axis.text.x = element_text(face="bold"),
        axis.text.y = element_text(face="bold"),
        plot.title = element_text(hjust = 0.5) )+ xlim(0, m)
p

Figure 3. Two sub-populations defined by a covariate Z.

Figure 4 shows sample distributions under right truncation with positivity. For this illustration we use the same population as in Figure 3, but truncate it using uniformly distributed R:

  set.seed(0)
  r <- runif(N, 0, m)
  ind <- X<r
  cat("P(T<R) = selection probability = ", mean(ind), "\n")

## P(T<R) = selection probability =  0.6922

  d.tr <- data.frame(time=c(X[ind],r), variable=c(Z[ind], rep(3, N)))
  d.tr$variable=as.factor(d.tr$variable)
  levels(d.tr$variable) <- c( "T* | Z*=0", "T* | Z*=1", "R")

Figure 4. Sample distributions under right truncation and positivity.

Despite the 30% chance of truncation, both Figures 3 and 4 look relatively similar to each other.

On Figure 5, we shorten the follow-up time by using r_* = 1.1. This leads to violation of positivity, since longer lifetimes have no chance to be sampled, and the sub-groups become more similar to each other. In such case, without any adjustment, the estimators of β will be biased and it is likely we will not able to discover the difference between subgroups.

Figure 5. Sample distributions under mild violation of positivity.

On Figure 6, severe violation of positivity with R ∼ U(0, 0.4) causes the groups to be almost identical:

Figure 6. Sample distributions under severe violation of positivity.

And the situation becomes more complicated when there are several dependent covariates, as described in Vakulenko-Lagun, Mandel, and Betensky (2019).

What does `coxrt` suggest for right-truncated data?

The coxrt package allows to estimate regression coefficients β in the proportional hazards model h(t; z) = h₀(t)exp (βz).

Under positivity use `coxph.RT`

Function coxph.RT implements estimating equations denoted as IPW-S in Vakulenko-Lagun, Mandel, and Betensky (2019). This function assumes positivity (as, e.g., on Figure 4).

For example, for AIDS data we can fit proportional hazards regression of incubation time on age:

library(gss)
library(coxrt)
data(aids)
all <- data.frame(age=aids$age, ageg=as.numeric(aids$age<=59), T=aids$incu, R=aids$infe, hiv.mon =102-aids$infe)
all$T[all$T==0] <- 0.5 # as in Kalbfeisch and Lawless (1989)
s <- all[all$hiv.mon>60,] # select those who were infected in 1983 or later
# analysis assuming positivity
# we request bootstrap estimate of Asymptotic Standard Error (ASE) as well:
sol <- try(coxph.RT(T~ageg, right=R, data=s, bs=TRUE, nbs.rep=500) )
knitr::kable(sol$summary) # print the summary of fit based on the analytic ASE

	coef	exp.coef	SE	CI.L	CI.U	pvalue	pvalue.H1.b.gr0	pvalue.H1.b.le0
ageg	1.175433	3.239545	0.3823836	0.425975	1.924891	0.0021124	0.0010562	0.9989438

knitr::kable(sol$bs$summary) # print the summary of fit based on the bootstrap sample distribution

	SE	CI.L	CI.U	CI.L.H1.b.gr0	CI.U.H1.b.le0
ageg	0.4310668	0.3915876	2.142297	0.5049887	1.957792

Technically, for point estimation coxph.RT uses the standard coxph function from the survival package. However, the estimator of the standard error (SE) obtained from coxph even with robust option might underestimate for smaller samples (of < 1000 observations). Although coxph.RT includes both analytic and bootstrap estimates of a SE, the inference based on the analytic SE estimate, implemented in coxph.RT, is more reliable than based on its bootstrap counterpart, since bootstrap resampling does not always capture the variability in data that comes from near-zero sampling probabilities P(R > T_i).

Since the Inverse-Probability-Weighting (IPW) approach, implemented in coxrt, relies on “blowing up” the sampled lifetimes T_i in order to obtain the unbiased pseudo-population, which should recreate the same probability law as in the original population, coxph.RT in general will give biased estimators in case there is an unobservable right tail of the distribution. Hence the function coxph.RT.a0 aims to adjust for this violation of positivity by using the structural assumption of the Cox model in the population and the information on a₀ = P(T > r_* ∣ Z = 0).

Without positivity use `coxph.RT.a0`

For the situations without positivity (as in Figures 5 and 6), if a₀ can be hypothesized, based on the subject matter knowledge, or if a₀ is unknown but its interval can be guessed, this information can be used in coxph.RT.a0. This function implements estimating equations denoted as IPW-SA in the original paper. IPW-SA uses stabilized weights. Technically, coxph.RT.a0 solves non-linear equations using BBsolve function from BB package using default parameter settings. For example, in order to estimate β, the effect of age at HIV on the incubation period, given that a₀ = 0.2, we run:

# analysis using adjusted estimating equations for a0=0.2
sol.02 <- try(coxph.RT.a0(T~ageg, right=R, data=s, a0=0.2, bs=TRUE, nbs.rep = 500))
knitr::kable(round(sol.02$bs$summary,2))

	coef	exp.coef	SE	CI.L	CI.U	CI.L.H1.b.gr0	CI.U.H1.b.le0	pvalue	pvalue.H1.b.gr0	pvalue.H1.b.le0
ageg	1.42	4.13	0.42	0.63	2.33	0.73	2.15	0	0	1

We can do a sensitivity analysis for β̂ for different values of a sensitivity parameter a₀:

# for a0=0:
sol <- try(coxph.RT(T~ageg, right=R, data=s, bs=FALSE) )
# senstivity analysis for different values of a0
a_ <- seq(0.05, 0.55, by=0.05)
est <- CI.L <- CI.U <- NULL
start_time <- Sys.time()
for(q in 1:length(a_))
{
 sol.a <- try(coxph.RT.a0(T~ageg, right=R, data=s, a0=a_[q], bs=TRUE))
 if (sol.a$convergence!=0)
 {
   cat("a0=", a_[q], ". Error occurred in BBsolve.\n")
 } else
 {
   est <- c(est, sol.a$coef)
   CI.L <- c(CI.L, sol.a$bs$summary$CI.L)
   CI.U <- c(CI.U, sol.a$bs$summary$CI.U)
 }
}
end_time <- Sys.time()
end_time - start_time # It takes some time to solve nonlinear estimating equations and to run bootstrap for each value of a0 ...

## Time difference of 18.94293 secs

require(ggplot2)
res.d <- data.frame(a0=c(0, a_), beta=c(sol$coef, est),
                   L.95=c(sol$summary$CI.L, CI.L), U.95=c(sol$summary$CI.U, CI.U))
knitr::kable(round(res.d, 2))

a0	beta	L.95	U.95
0.00	1.18	0.43	1.92
0.05	1.21	0.35	1.86
0.10	1.28	0.47	2.00
0.15	1.35	0.58	2.22
0.20	1.42	0.57	2.19
0.25	1.49	0.64	2.25
0.30	1.57	0.76	2.52
0.35	1.65	0.77	2.26
0.40	1.74	0.86	2.48
0.45	1.83	1.05	2.67
0.50	1.93	0.95	2.82
0.55	2.04	1.10	2.79

p <- ggplot(res.d, aes(x=a0, y=beta)) +
 geom_ribbon(aes(ymin=L.95, ymax=U.95), alpha=0.2) +
 geom_line() + geom_point() +
 geom_hline(yintercept=0)
p + xlab(expression( paste(a[0], "=P(T>", r['*']," | z=0)" , sep="")) )+
 ylab(expression( paste(hat(beta), "(", a[0], ")" , sep="")) ) +
 scale_x_continuous(breaks=res.d$a0, labels=res.d$a0) +
 theme(axis.text.x = element_text(face="bold", angle=45),
         axis.text.y = element_text(face="bold"))

Figure 7. Sensitivity analysis for AIDS data

In coxph.RT.a0, only the bootstrap estimator of SE is available.

Some remarks

Positivity cannot be tested. This assumption is based on the subject matter knowledge.
The approach assumes independence between T and (R, Z) in the original population. If there is dependence between T and R, it can be discovered by Kendall’s tau test which can be done using packages tranSurv or SurvTrunc, or by more powerful permutation tests which can be done using package permDep.
If a non-terminal event of interest cannot be observed because death preceded it, that is the event of interest is truncated by death, the approach implemented here can be used only if two lifetimes, time to an intermediate event (=T) and time to death (=R), are independent, otherwise the results might be biased.

Some limitations of the `coxrt` package

Although in practice, it is possible to have additional left or interval censoring in right-truncated data, the package assumes that the exact lifetimes, without any censoring, are observed.
The current implementation allows for only time-independent covariates.
The package estimates only covariate effects β, and does not provide estimates for the baseline hazard h₀(t) in the Cox model.

References

Kalbfeisch, J. D., and J. F. Lawless. 1989. “Inference Based on Retrospective Ascertainment: An Analysis of the Data on Transfusion-Related AIDS.” Journal of the American Statistical Association 84 (406): 360–72.

Vakulenko-Lagun, B., M. Mandel, and R. A. Betensky. 2019. “Inverse Probability Weighting Methods for Cox Regression with Right-Truncated Data.” Biometrics. https://doi.org/10.1111/biom.13162.