Cox regression is not very suitable in the analysis of huge data sets with a lot of events (e.g., deaths). For instance, consider analyzing the mortality of the Swedish population aged 60–110 during the years 1968-2019, where we can count to more than four million deaths.
The obvious way to handle that situation is by tabulation and applying a piecewise constant hazard function, because it is a well-known fact that any continuous function can arbitrary well be approximated by a step function, simply by taking small enough steps.
The data sets swepop
and swedeaths
in
eha
contain age and sex specific yearly information on
population size and number of deaths, respectively. They both cover the
full Swedish population the years 1968–2019.
The first few rows of each:
## age sex year pop id
## 102.1969 0 women 1969 52673 102
## 1.1969 0 men 1969 55728 1
## 103.1969 1 women 1969 56831 103
## 2.1969 1 men 1969 59924 2
## 104.1969 2 women 1969 58994 104
## 3.1969 2 men 1969 62502 3
## age sex year deaths id
## 2.1969 0 women 1969 491 2
## 1.1969 0 men 1969 773 1
## 4.1969 1 women 1969 33 4
## 3.1969 1 men 1969 45 3
## 6.1969 2 women 1969 22 6
## 5.1969 2 men 1969 43 5
The funny rownames and the column id
are created by the
function reshape
, which was used to transform the original
tables, given in wide format, to long format. In the
original data, downloaded from Statistics
Sweden, the population size refers to the last day, December 31, of
the given year, but here it refers to an average of that value and the
corresponding one the previous year. In that way we get an estimate of
the number of person years, which allows us to consider number
of occurrences and exposure time in each cell of the
data. This information will allow us to fit proportional
hazards survival models. So we start by joining the two data sets
and remove irrelevant stuff:
dat <- swepop[, c("age", "sex", "year", "pop")]
dat$deaths <- swedeaths$deaths
rownames(dat) <- 1:NROW(dat) # Simplify rownames.
head(dat)
## age sex year pop deaths
## 1 0 women 1969 52673 491
## 2 0 men 1969 55728 773
## 3 1 women 1969 56831 33
## 4 1 men 1969 59924 45
## 5 2 women 1969 58994 22
## 6 2 men 1969 62502 43
## age sex year pop deaths
## 10499 98 women 2020 2033.5 764
## 10500 98 men 2020 578.5 301
## 10501 99 women 2020 1445.0 581
## 10502 99 men 2020 367.0 178
## 10503 100 women 2020 1933.5 1004
## 10504 100 men 2020 394.5 243
We note that the age column ends with age == 100
, which
in fact means age >= 100
. There are in total 4745063
observed deaths during the years 1968–2019, or 91251 deaths per year on
average. There are 101 age groups, two sexes, and 52 years, in all 10504
cells (rows in the data frame).
Assuming a piecewise constant hazards model on the 101 age groups, we
can fit a proportional hazards model by Poisson regression,
utilizing the fact that two likelihood functions in fact are identical.
In R, we use glm
.
fit.glm <- glm(deaths ~ offset(log(pop)) + I(year - 2000) + sex +
factor(age), data = dat, family = poisson)
summary(fit.glm)$coefficients[2:3, ]
## Estimate Std. Error z value Pr(>|z|)
## I(year - 2000) -0.01497 3.136e-05 -477.5 0
## sexmen 0.43997 9.304e-04 472.9 0
The 101 coefficients corresponding to the intercept and the age factor can be used to estimate the hazard function: The intercept, -5.6181, is the log of the hazard in the age interval 0-1, and the rest are differences to that value, so we can reconstruct the baseline hazard by
lhaz <- coefficients(fit.glm)[-(2:3)]
n <- length(lhaz)
lhaz[-1] <- lhaz[-1] + lhaz[1]
haz <- exp(lhaz)
and plot the result, see Figure @ref(fig:glmbasefig).
oldpar <- par(las = 1, lwd = 1.5, mfrow = c(1, 2))
plot(0:(n-1), haz, type = "s", main = "log(hazards)",
xlab = "Age", ylab = "", log = "y")
plot(0:(n-1), haz, type = "s", main = "hazards",
xlab = "Age", ylab = "Deaths / Year")
While it straightforward to use glm and Poisson regression to fit the
model, it takes some efforts to get it right. That is the reason for the
creation of the function tpchreg
(“Tabular Piecewise
Constant Hazards REGression”), and with it, the “Poisson analysis” is
performed by
Note:
The function oe
(“occurrence/exposure”) takes two
arguments, the first is the number of events (deaths in our example),
and the second is exposure time, or person years.
The argument time
is the defining time intervals
variable. It can be either character, like c(“0-1”, “1-2”, …, “100-101”)
or numeric (as here). If numeric, the value refers to the start of the
corresponding interval, and the next start is the end of the previous
interval. This leaves the last interval’s endpoint undefined, and if not
given by the last
argument (see below), it is chosen so
that the length of the last interval is one.
The argument last
closes the last interval, if is
not already closed, see above. The exact value of last is only important
for plotting and for the calculation of the restricted mean survival
time, (RMST) see the summary result below.
## Covariate Mean Coef Rel.Risk S.E. LR p
## I(year - 2000) -4.480 -0.015 0.985 0.000 0.000
## sex 0.000
## women 0.503 0 1 (reference)
## men 0.497 0.440 1.553 0.001
##
## Events 4745063
## Total time at risk 459651727
## Max. log. likelihood -19283067
## LR test statistic 443018.82
## Degrees of freedom 2
## Overall p-value 0
##
## Restricted mean survival: 81.58 in (0, 101]
The restricted mean survival time is defined as the integral of the survivor function over the given time interval. Note that if the lower limit of the interval is larger than zero, it gives the conditional restricted mean survival time, given survival to the lower endpoint.
Graphs of the hazards and the log(hazards) functions are shown in Figure @ref(fig:tpplot).
oldpar <- par(mfrow = c(1, 2), las = 1, lwd = 1.5)
plot(fit, fn = "haz", log = "y", main = "log(hazards)",
xlab = "Age")
plot(fit, fn = "haz", log = "", main = "hazards",
xlab = "Age", ylab = "Deaths / Year")
Same results as with glm
and Poisson regression, but a
lot simpler.
Sometimes you have a large data file in classical, individual form,
suitable for Cox regression with coxreg
, but the
mere size makes it impractical, or even impossible. Then help is close
by tabulating and assuming a piecewise constant hazard function,
returning to the method in the previous section, that is, using
tpchreg
.
The helper function is toTpch
, and we illustrate its use
on the oldmort
data frame:
## enter exit event sex civ birthdate
## 1 94.51 95.81 TRUE female widow 1765
## 2 94.27 95.76 TRUE female unmarried 1766
## 3 91.09 91.95 TRUE female widow 1769
## 4 89.01 89.59 TRUE female widow 1771
## 5 90.00 90.21 TRUE female widow 1770
## 6 88.43 89.76 TRUE female widow 1772
oldmort$birthyear <- floor(oldmort$birthdate) - 1800
om <- toTpch(Surv(enter, exit, event) ~ sex + civ + birthyear,
cuts = seq(60, 100, by = 2), data = oldmort)
head(om)
## sex civ birthyear age event exposure
## 1 male unmarried -2 60-62 0 0.578
## 2 female unmarried -2 60-62 0 4.109
## 3 male married -2 60-62 0 17.366
## 4 female married -2 60-62 0 14.129
## 5 male widow -2 60-62 0 0.148
## 6 female widow -2 60-62 0 5.805
Note two things:
The creation of a new variable, birthyear
. The
original birthdate
is given with precision days and
contains 3570 unique values, which will contribute to creating a very
large table, so the transformation gives birth year with 52
unique values. Further, the new variable is given a reference
value of 1800 by subtraction, necessary so that the baseline does
not coincide with the birth of Christ. Will foremost affect plotting of
the estimated survivor function. However, regression parameter estimates
are unaffected, as long as no interaction effect including
birthyear
is present.
The length of the time pieces is set to two years, in order to avoid empty intervals in the upper age range. This choice has only a marginal effect on the final results of the analyses. You can try it out yourself. Note that it is not necessary to use equidistant cut points, it is chosen here only for convenience.
Now we can run tpchreg
as before
## Covariate Mean Coef Rel.Risk S.E. LR p
## sex 0.000
## male 0.406 0 1 (reference)
## female 0.594 -0.245 0.783 0.047
## civ 0.000
## unmarried 0.080 0 1 (reference)
## married 0.530 -0.397 0.672 0.081
## widow 0.390 -0.258 0.773 0.079
## birthyear 2.114 -0.006 0.994 0.004 0.150
##
## Events 1971
## Total time at risk 37824
## Max. log. likelihood -7265.5
## LR test statistic 43.45
## Degrees of freedom 4
## Overall p-value 8.34423e-09
##
## Restricted mean survival: 12.65 in (60, 100]
And the hazards graphs are shown in Figure @ref(fig:plotom).
oldpar <- par(mfrow = c(1, 2), las = 1, lwd = 1.5)
plot(fit3, fn = "haz", log = "y", main = "log(hazards)",
xlab = "Age", ylab = "log(Deaths / Year)", col = "blue")
plot(fit3, fn = "haz", log = "", main = "hazards",
xlab = "Age", ylab = "Deaths / Year", col = "blue")
The plots of the survivor and cumulative hazards functions are “smoother”, see Figure @ref(fig:plotsurcum).
oldpar <- par(mfrow = c(1, 2), las = 1, lwd = 1.5)
plot(fit3, fn = "cum", log = "y", main = "Cum. hazards",
xlab = "Age", col = "blue")
plot(fit3, fn = "sur", log = "", main = "Survivor function",
xlab = "Age", col = "blue")
A comparison with Cox regression on the original data.
fit4 <- coxreg(Surv(enter, exit, event) ~ sex + civ + I(birthdate - 1800),
data = oldmort)
summary(fit4)
## Covariate Mean Coef Rel.Risk S.E. LR p
## sex 0.000
## male 0.406 0 1 (reference)
## female 0.594 -0.244 0.783 0.047
## civ 0.000
## unmarried 0.080 0 1 (reference)
## married 0.530 -0.397 0.673 0.081
## widow 0.390 -0.259 0.772 0.079
## I(birthdate - 18 2.602 -0.005 0.995 0.004 0.212
##
## Events 1971
## Total time at risk 37824
## Max. log. likelihood -13557
## LR test statistic 42.79
## Degrees of freedom 4
## Overall p-value 1.14378e-08
And the graphs, see Figure @ref(fig:coxgraphs).
oldpar <- par(mfrow = c(1, 2), lwd = 1.5, las = 1)
plot(fit4, main = "Cumulative hazards", xlab = "Age",
col = "blue")
plot(fit4, main = "Survivor function", xlab = "Age",
fn = "surv", col = "blue")