Title: | Discrete Time Survival Analysis |
---|---|
Description: | Provides data transformations, estimation utilities, predictive evaluation measures and simulation functions for discrete time survival analysis. |
Authors: | Thomas Welchowski <[email protected]> and Moritz Berger <[email protected]> and David Koehler <[email protected]> and Matthias Schmid <[email protected]> |
Maintainer: | Thomas Welchowski <[email protected]> |
License: | GPL-3 |
Version: | 2.0.0 |
Built: | 2025-02-17 07:02:45 UTC |
Source: | CRAN |
Includes functions for data transformations, estimation, evaluation and simulation of discrete survival analysis. The most important functions are listed below:
contToDisc
: Discretizes continuous time variable into a
specified grid of censored data for discrete survival analysis.
dataLong
: Transform data from short format into long
format for discrete survival analysis and right censoring.
dataLongCompRisks
: Transforms short data format to long
format for discrete survival modelling in the case of competing risks with
right censoring.
dataLongTimeDep
: Transforms short data
format to long format for discrete survival modelling of single event
analysis with right censoring.
cIndex
: Calculates
the concordance index for discrete survival models (independent measure of
time).
dataLongSubDist
: Converts
the data to long format suitable for applying discrete subdistribution
hazard modelling (competing risks).
"DataShort" format is defined as data without repeated measurements. "DataSemiLong" format consists of repeated measurements, but there are gaps between the discrete time intervals. "DataLong" format is expanded to include all time intervals up to the last observation per individual.
Package: | discSurv |
Type: | Package |
Version: | 2.0.0 |
Date: | 2022-03-02 |
License: | GPL-3 |
Thomas Welchowski [email protected]
Moritz Berger [email protected]
David Koehler [email protected]
Matthias Schmid [email protected]
Berger M, Schmid M (2018).
“Semiparametric regression for discrete time-to-event data.”
Statistical Modelling, 18, 322–345.
Berger M, Welchowski T, Schmitz-Valckenberg S, Schmid M (2019).
“A classification tree approach for the modeling of competing risks in discrete time.”
Advances in Data Analysis and Classification, 13, 965-990.
Berger M, Schmid M, Welchowski T, Schmitz-Valckenberg S, Beyersmann J (2020).
“Subdistribution Hazard Models for Competing Risks in Discrete Time.”
Biostatistics, 21, 449-466.
Schmid M, Tutz G, Welchowski T (2018).
“Discrimination Measures for Discrete Time-to-Event Predictions.”
Econometrics and Statistics, 7, 153-164.
Tutz G, Schmid M (2016).
Modeling discrete time-to-event data.
Springer Series in Statistics.
Calculates the adjusted deviance residuals for arbitrary prediction models. The adjusted deviance residuals should be approximately normal distributed, in the case of a well fitting model.
adjDevResid(dataLong, hazards)
adjDevResid(dataLong, hazards)
dataLong |
Data set in long format ("class data.frame"). |
hazards |
Estimated discrete hazards of the data in long format("numeric vector"). Hazard rates are probabilities and therefore restricted to the interval [0, 1]. |
Output List with objects:
AdjDevResid Adjusted deviance residuals as numeric vector
Input A list of given argument input values (saved for reference)
Thomas Welchowski [email protected]
Tutz G, Schmid M (2016).
Modeling discrete time-to-event data.
Springer Series in Statistics.
Tutz G (2012).
Regression for Categorical Data.
Cambridge University Press.
library(survival) # Transform data to long format heart[, "stop"] <- ceiling(heart[, "stop"]) set.seed(0) Indizes <- sample(unique(heart$id), 25) randSample <- heart[unlist(sapply(1:length(Indizes), function(x) which(heart$id == Indizes[x]))),] heartLong <- dataLongTimeDep(dataSemiLong = randSample, timeColumn = "stop", eventColumn = "event", idColumn = "id", timeAsFactor = FALSE) # Fit a generalized, additive model and predict discrete hazards on data in long format library(mgcv) gamFit <- gam(y ~ timeInt + surgery + transplant + s(age), data = heartLong, family = "binomial") hazPreds <- predict(gamFit, type = "response") # Calculate adjusted deviance residuals devResiduals <- adjDevResid(dataLong = heartLong, hazards = hazPreds)$Output$AdjDevResid devResiduals
library(survival) # Transform data to long format heart[, "stop"] <- ceiling(heart[, "stop"]) set.seed(0) Indizes <- sample(unique(heart$id), 25) randSample <- heart[unlist(sapply(1:length(Indizes), function(x) which(heart$id == Indizes[x]))),] heartLong <- dataLongTimeDep(dataSemiLong = randSample, timeColumn = "stop", eventColumn = "event", idColumn = "id", timeAsFactor = FALSE) # Fit a generalized, additive model and predict discrete hazards on data in long format library(mgcv) gamFit <- gam(y ~ timeInt + surgery + transplant + s(age), data = heartLong, family = "binomial") hazPreds <- predict(gamFit, type = "response") # Calculate adjusted deviance residuals devResiduals <- adjDevResid(dataLong = heartLong, hazards = hazPreds)$Output$AdjDevResid devResiduals
Calibration plot based on predictions. Overall root mean squared error (RMSE) of predicted and observed discrete hazards is calculated.
calibrationPlot( testPreds, testDataLong, weights = NULL, K = 10, event = "e1", ... )
calibrationPlot( testPreds, testDataLong, weights = NULL, K = 10, event = "e1", ... )
testPreds |
Predictions on the validation data with model fitted on training data ("numeric vector"). |
testDataLong |
Validation data set in long format ("class data.frame"). |
weights |
optional vector of weights ("numeric vector"). The length of weights must be equal to the number of observations of the validation data set. |
K |
Number of subsets for plotting ("integer vector"). |
event |
Column names of the event to be considered for plotting (only in case of cause-specific hazards) ("character vector"). |
... |
Additional arguments passed to |
Calibration plot
Moritz Berger [email protected]
https://www.imbie.uni-bonn.de/personen/dr-moritz-berger/
Berger M, Schmid M (2018).
“Semiparametric regression for discrete time-to-event data.”
Statistical Modelling, 18, 322–345.
Heyard R, Timsit J, Held L, COMBACTE-MAGNET,consortium (2019).
“Validation of discrete time-to-event prediction models in the presence of competing risks.”
Biometrical Journal, 62, 643-657.
Berger M, Schmid M (2020).
“Assessing the calibration of subdistribution hazard models in discrete time.”
arXiv:2001.11240.
estRecal
, dataLong
, dataLongCompRisks
, dataLongSubDist
#################### # Data preprocessing # Example unemployment data library(Ecdat) data(UnempDur) # Select subsample selectInd1 <- 1:100 selectInd2 <- 101:200 trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ] valSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ] #################### # One event # Convert to long format trainSet_long <- dataLong(dataShort = trainSet, timeColumn = "spell", eventColumn = "censor1") valSet_long <- dataLong(dataShort = valSet, timeColumn = "spell", eventColumn = "censor1") # Estimate continuation ratio model with logit link glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, family = binomial()) # Calculate predicted hazards predHazards <- predict(glmFit, newdata = valSet_long, type = "response") # Calibration plot calibrationPlot(predHazards, testDataLong = valSet_long) ############################ # Two cause specific hazards # Convert to long format trainSet_long <- dataLongCompRisks(dataShort = trainSet, timeColumn = "spell", eventColumns = c("censor1", "censor4")) valSet_long <- dataLongCompRisks(dataShort = valSet, timeColumn = "spell", eventColumns = c("censor1", "censor4")) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = trainSet_long, family = VGAM::multinomial(refLevel = "e0")) # Calculate predicted hazards predHazards <- VGAM::predictvglm(vglmFit, newdata = valSet_long, type = "response") # Calibration plots calibrationPlot(predHazards, testDataLong = valSet_long) calibrationPlot(predHazards, testDataLong = valSet_long, event = "e2") ############################### # Subdistribution hazards model # Convert to long format trainSet_long <- dataLongSubDist(dataShort = trainSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") valSet_long <- dataLongSubDist(dataShort = valSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") # Estimate continuation ratio model with logit link glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, family = binomial(), weights = trainSet_long$subDistWeights) # Calculate predicted hazards predHazards <- predict(glmFit, newdata = valSet_long, type = "response") # Calibration plot calibrationPlot(predHazards, testDataLong = valSet_long, weights = valSet_long$subDistWeights)
#################### # Data preprocessing # Example unemployment data library(Ecdat) data(UnempDur) # Select subsample selectInd1 <- 1:100 selectInd2 <- 101:200 trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ] valSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ] #################### # One event # Convert to long format trainSet_long <- dataLong(dataShort = trainSet, timeColumn = "spell", eventColumn = "censor1") valSet_long <- dataLong(dataShort = valSet, timeColumn = "spell", eventColumn = "censor1") # Estimate continuation ratio model with logit link glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, family = binomial()) # Calculate predicted hazards predHazards <- predict(glmFit, newdata = valSet_long, type = "response") # Calibration plot calibrationPlot(predHazards, testDataLong = valSet_long) ############################ # Two cause specific hazards # Convert to long format trainSet_long <- dataLongCompRisks(dataShort = trainSet, timeColumn = "spell", eventColumns = c("censor1", "censor4")) valSet_long <- dataLongCompRisks(dataShort = valSet, timeColumn = "spell", eventColumns = c("censor1", "censor4")) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = trainSet_long, family = VGAM::multinomial(refLevel = "e0")) # Calculate predicted hazards predHazards <- VGAM::predictvglm(vglmFit, newdata = valSet_long, type = "response") # Calibration plots calibrationPlot(predHazards, testDataLong = valSet_long) calibrationPlot(predHazards, testDataLong = valSet_long, event = "e2") ############################### # Subdistribution hazards model # Convert to long format trainSet_long <- dataLongSubDist(dataShort = trainSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") valSet_long <- dataLongSubDist(dataShort = valSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") # Estimate continuation ratio model with logit link glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, family = binomial(), weights = trainSet_long$subDistWeights) # Calculate predicted hazards predHazards <- predict(glmFit, newdata = valSet_long, type = "response") # Calibration plot calibrationPlot(predHazards, testDataLong = valSet_long, weights = valSet_long$subDistWeights)
Calculates the concordance index for discrete survival models, which does not depend on time. This is the probability that, for a pair of randomly chosen comparable samples, the sample with the higher risk prediction will experience an event before the other sample or belongs to a higher binary class.
cIndex(marker, testTime, testEvent, trainTime, trainEvent)
cIndex(marker, testTime, testEvent, trainTime, trainEvent)
marker |
Gives the predicted values of the linear predictor of a regression model ("numeric vector"). May also be on the response scale. |
testTime |
New time intervals in the test data ("integer vector"). |
testEvent |
Event indicators in the test data ("binary vector"). |
trainTime |
Time intervals in the training data ("integer vector"). |
trainEvent |
Event indicators in the training data ("binary vector"). |
Value of discrete concordance index between zero and one ("numeric vector").
It is assumed that all time points up to the last observed interval [a_q-1, a_q) are available.
Thomas Welchowski [email protected]
Schmid M, Tutz G, Welchowski T (2018).
“Discrimination Measures for Discrete Time-to-Event Predictions.”
Econometrics and Statistics, 7, 153-164.
Uno H, Cai T, Tian L, Wei LJ (2012).
“Evaluating Prediction Rules fort-Year Survivors With Censored Regression Models.”
Journal of the American Statistical Association, 102, 527-537.
Heagerty PJ, Zheng Y (2005).
“Survival Model Predictive Accuracy and ROC Curves.”
Biometrics, 61, 92-105.
################################################## # Example with unemployment data and prior fitting library(Ecdat) library(caret) library(mgcv) data(UnempDur) summary(UnempDur$spell) # Extract subset of data set.seed(635) IDsample <- sample(1:dim(UnempDur)[1], 100) UnempDurSubset <- UnempDur [IDsample, ] set.seed(-570) TrainingSample <- sample(1:100, 75) UnempDurSubsetTrain <- UnempDurSubset [TrainingSample, ] UnempDurSubsetTest <- UnempDurSubset [-TrainingSample, ] # Convert to long format UnempDurSubsetTrainLong <- dataLong(dataShort = UnempDurSubsetTrain, timeColumn = "spell", eventColumn = "censor1") # Estimate gam with smooth baseline gamFit <- gam(formula = y ~ s(I(as.numeric(as.character(timeInt)))) + s(age) + s(logwage), data = UnempDurSubsetTrainLong, family = binomial()) gamFitPreds <- predict(gamFit, newdata = cbind(UnempDurSubsetTest, timeInt = UnempDurSubsetTest$spell)) # Evaluate C-Index based on short data format cIndex(marker = gamFitPreds, testTime = UnempDurSubsetTest$spell, testEvent = UnempDurSubsetTest$censor1, trainTime = UnempDurSubsetTrain$spell, trainEvent = UnempDurSubsetTrain$censor1) ##################################### # Example National Wilm's Tumor Study library(survival) head(nwtco) summary(nwtco$rel) # Select subset set.seed(-375) Indices <- sample(1:dim(nwtco)[1], 500) nwtcoSub <- nwtco [Indices, ] # Convert time range to 30 intervals intLim <- quantile(nwtcoSub$edrel, prob = seq(0, 1, length.out = 30)) intLim [length(intLim)] <- intLim [length(intLim)] + 1 nwtcoSubTemp <- contToDisc(dataShort = nwtcoSub, timeColumn = "edrel", intervalLimits = intLim) nwtcoSubTemp$instit <- factor(nwtcoSubTemp$instit) nwtcoSubTemp$histol <- factor(nwtcoSubTemp$histol) nwtcoSubTemp$stage <- factor(nwtcoSubTemp$stage) # Split in training and test sample set.seed(-570) TrainingSample <- sample(1:dim(nwtcoSubTemp)[1], round(dim(nwtcoSubTemp)[1]*0.75)) nwtcoSubTempTrain <- nwtcoSubTemp [TrainingSample, ] nwtcoSubTempTest <- nwtcoSubTemp [-TrainingSample, ] # Convert to long format nwtcoSubTempTrainLong <- dataLong(dataShort = nwtcoSubTempTrain, timeColumn = "timeDisc", eventColumn = "rel", timeAsFactor=TRUE) # Estimate glm inputFormula <- y ~ timeInt + histol + instit + stage glmFit <- glm(formula = inputFormula, data = nwtcoSubTempTrainLong, family = binomial()) linPreds <- predict(glmFit, newdata = cbind(nwtcoSubTempTest, timeInt = factor(nwtcoSubTempTest$timeDisc, levels=levels(nwtcoSubTempTrainLong$timeInt)))) # Evaluate C-Index based on short data format cIndex(marker = linPreds, testTime = as.numeric(as.character(nwtcoSubTempTest$timeDisc)), testEvent = nwtcoSubTempTest$rel, trainTime = as.numeric(as.character(nwtcoSubTempTrain$timeDisc)), trainEvent = nwtcoSubTempTrain$rel)
################################################## # Example with unemployment data and prior fitting library(Ecdat) library(caret) library(mgcv) data(UnempDur) summary(UnempDur$spell) # Extract subset of data set.seed(635) IDsample <- sample(1:dim(UnempDur)[1], 100) UnempDurSubset <- UnempDur [IDsample, ] set.seed(-570) TrainingSample <- sample(1:100, 75) UnempDurSubsetTrain <- UnempDurSubset [TrainingSample, ] UnempDurSubsetTest <- UnempDurSubset [-TrainingSample, ] # Convert to long format UnempDurSubsetTrainLong <- dataLong(dataShort = UnempDurSubsetTrain, timeColumn = "spell", eventColumn = "censor1") # Estimate gam with smooth baseline gamFit <- gam(formula = y ~ s(I(as.numeric(as.character(timeInt)))) + s(age) + s(logwage), data = UnempDurSubsetTrainLong, family = binomial()) gamFitPreds <- predict(gamFit, newdata = cbind(UnempDurSubsetTest, timeInt = UnempDurSubsetTest$spell)) # Evaluate C-Index based on short data format cIndex(marker = gamFitPreds, testTime = UnempDurSubsetTest$spell, testEvent = UnempDurSubsetTest$censor1, trainTime = UnempDurSubsetTrain$spell, trainEvent = UnempDurSubsetTrain$censor1) ##################################### # Example National Wilm's Tumor Study library(survival) head(nwtco) summary(nwtco$rel) # Select subset set.seed(-375) Indices <- sample(1:dim(nwtco)[1], 500) nwtcoSub <- nwtco [Indices, ] # Convert time range to 30 intervals intLim <- quantile(nwtcoSub$edrel, prob = seq(0, 1, length.out = 30)) intLim [length(intLim)] <- intLim [length(intLim)] + 1 nwtcoSubTemp <- contToDisc(dataShort = nwtcoSub, timeColumn = "edrel", intervalLimits = intLim) nwtcoSubTemp$instit <- factor(nwtcoSubTemp$instit) nwtcoSubTemp$histol <- factor(nwtcoSubTemp$histol) nwtcoSubTemp$stage <- factor(nwtcoSubTemp$stage) # Split in training and test sample set.seed(-570) TrainingSample <- sample(1:dim(nwtcoSubTemp)[1], round(dim(nwtcoSubTemp)[1]*0.75)) nwtcoSubTempTrain <- nwtcoSubTemp [TrainingSample, ] nwtcoSubTempTest <- nwtcoSubTemp [-TrainingSample, ] # Convert to long format nwtcoSubTempTrainLong <- dataLong(dataShort = nwtcoSubTempTrain, timeColumn = "timeDisc", eventColumn = "rel", timeAsFactor=TRUE) # Estimate glm inputFormula <- y ~ timeInt + histol + instit + stage glmFit <- glm(formula = inputFormula, data = nwtcoSubTempTrainLong, family = binomial()) linPreds <- predict(glmFit, newdata = cbind(nwtcoSubTempTest, timeInt = factor(nwtcoSubTempTest$timeDisc, levels=levels(nwtcoSubTempTrainLong$timeInt)))) # Evaluate C-Index based on short data format cIndex(marker = linPreds, testTime = as.numeric(as.character(nwtcoSubTempTest$timeDisc)), testEvent = nwtcoSubTempTest$rel, trainTime = as.numeric(as.character(nwtcoSubTempTrain$timeDisc)), trainEvent = nwtcoSubTempTrain$rel)
Estimates the discrete concordance index in the case of competing risks.
cIndexCompRisks(markers, testTime, testEvents, trainTime, trainEvents)
cIndexCompRisks(markers, testTime, testEvents, trainTime, trainEvents)
markers |
Predictions on the test data with model fitted on training data ("numeric matrix"). Predictions are stored in the rows and the number of columns equal to the number of events. |
testTime |
New time intervals in the test data ("integer vector"). |
testEvents |
New event indicators (0 or 1) in the test data ("binary matrix"). Number of columns are equal to the number of events. |
trainTime |
Time intervals in the training data ("integer vector"). |
trainEvents |
Event indicators (0 or 1) in the training data ("binary matrix"). Number of columns are equal to the number of events. |
Value of discrete concordance index between zero and one ("numeric vector").
It is assumed that all time points up to the last observed interval [a_q-1, a_q) are available.
Moritz Berger [email protected]
https://www.imbie.uni-bonn.de/personen/dr-moritz-berger/
Heyard R, Timsit J, Held L, COMBACTE-MAGNET,consortium (2019). “Validation of discrete time-to-event prediction models in the presence of competing risks.” Biometrical Journal, 62, 643-657.
################################################## # Example with unemployment data and prior fitting library(Ecdat) data(UnempDur) summary(UnempDur$spell) # Extract subset of data set.seed(635) IDsample <- sample(1:dim(UnempDur)[1], 100) UnempDurSubset <- UnempDur [IDsample, ] set.seed(-570) TrainingSample <- sample(1:100, 75) UnempDurSubsetTrain <- UnempDurSubset [TrainingSample, ] UnempDurSubsetTest <- UnempDurSubset [-TrainingSample, ] # Convert to long format UnempDurSubsetTrainLong <- dataLongCompRisks(dataShort = UnempDurSubsetTrain, timeColumn = "spell", eventColumns = c("censor1", "censor4"), timeAsFactor = TRUE) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = UnempDurSubsetTrainLong, family=VGAM::multinomial(refLevel = "e0")) gamFitPreds <- VGAM::predictvglm(vglmFit , newdata = cbind(UnempDurSubsetTest, timeInt = as.factor(UnempDurSubsetTest$spell))) # Evaluate C-Index based on short data format cIndexCompRisks(markers = gamFitPreds, testTime = UnempDurSubsetTest$spell, testEvents = UnempDurSubsetTest[, c("censor1", "censor4")], trainTime = UnempDurSubsetTrain$spell, trainEvents = UnempDurSubsetTrain[, c("censor1", "censor4")])
################################################## # Example with unemployment data and prior fitting library(Ecdat) data(UnempDur) summary(UnempDur$spell) # Extract subset of data set.seed(635) IDsample <- sample(1:dim(UnempDur)[1], 100) UnempDurSubset <- UnempDur [IDsample, ] set.seed(-570) TrainingSample <- sample(1:100, 75) UnempDurSubsetTrain <- UnempDurSubset [TrainingSample, ] UnempDurSubsetTest <- UnempDurSubset [-TrainingSample, ] # Convert to long format UnempDurSubsetTrainLong <- dataLongCompRisks(dataShort = UnempDurSubsetTrain, timeColumn = "spell", eventColumns = c("censor1", "censor4"), timeAsFactor = TRUE) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = UnempDurSubsetTrainLong, family=VGAM::multinomial(refLevel = "e0")) gamFitPreds <- VGAM::predictvglm(vglmFit , newdata = cbind(UnempDurSubsetTest, timeInt = as.factor(UnempDurSubsetTest$spell))) # Evaluate C-Index based on short data format cIndexCompRisks(markers = gamFitPreds, testTime = UnempDurSubsetTest$spell, testEvents = UnempDurSubsetTest[, c("censor1", "censor4")], trainTime = UnempDurSubsetTrain$spell, trainEvents = UnempDurSubsetTrain[, c("censor1", "censor4")])
Estimates generalized estimation equation model for each competing event separately. Dependence within person IDs is accounted for by assuming a working covariance structure.
compRisksGEE( datShort, dataTransform = "dataLongCompRisks", corstr = "independence", formulaVariable = ~timeInt, ... ) ## S3 method for class 'dCRGEE' predict(object, newdata, ...)
compRisksGEE( datShort, dataTransform = "dataLongCompRisks", corstr = "independence", formulaVariable = ~timeInt, ... ) ## S3 method for class 'dCRGEE' predict(object, newdata, ...)
datShort |
Original data set in short format with each row corresponding to one independent observation("class data.frame"). |
dataTransform |
Specification of the data transformation function from short to long format("character vector"). There are two available options: Without time dependent covariates ("dataLongCompRisks") and with time dependent covariates ("dataLongCompRisksTimeDep"). The default is set to the former. |
corstr |
Assumption of correlation structure ("character vector"). The following are permitted: '"independence"', '"exchangeable"', '"ar1"', '"unstructured"' and '"userdefined". |
formulaVariable |
Specifies the right hand side of the regression formula ("class formula"). The default is to use the discrete time variable, which corresponds to a covariate free hazard. It is recommended to always include the discrete time variable "timeInt". |
... |
Additional arguments to data transformation (compRisksGEE) or prediction function (predict). Preprocessing function argument responseAsFactor has to be set to FALSE (Default). |
object |
Discrete time competing risks GEE model prediction model ("class dCRGEE"). |
newdata |
("class data.set") New data set to be used for prediction (class data.frame). |
Variables in argument formulaVariable need to be separated by "+ ". For example if the two variables timeInt and X1 should be included the formula would be "~ timeInt + X1". The variable timeInt is constructed before estimation of the model.
Returns an object of class "geeglm".
Thomas Welchowski [email protected]
Lee M, Feuer EJ, Fine JP (2018). “On the analysis of discrete time competing risks data.” Biometrics, 74, 1468-1481.
covarGEE
, dataLongCompRisks
, dataLongCompRisksTimeDep
,
geeglm
# Example with unemployment data library(Ecdat) data(UnempDur) # Select subsample SubUnempDur <- UnempDur [1:100, ] # Estimate GEE models for all events estGEE <- compRisksGEE(datShort = SubUnempDur, dataTransform = "dataLongCompRisks", corstr = "independence", formulaVariable =~ timeInt + age + ui + logwage * ui, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") names(estGEE) estGEE[[1]] # Predictions SubUnempDurLong <- dataLongCompRisks(dataShort = SubUnempDur, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") preds <- predict(estGEE, newdata = SubUnempDurLong) head(preds)
# Example with unemployment data library(Ecdat) data(UnempDur) # Select subsample SubUnempDur <- UnempDur [1:100, ] # Estimate GEE models for all events estGEE <- compRisksGEE(datShort = SubUnempDur, dataTransform = "dataLongCompRisks", corstr = "independence", formulaVariable =~ timeInt + age + ui + logwage * ui, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") names(estGEE) estGEE[[1]] # Predictions SubUnempDurLong <- dataLongCompRisks(dataShort = SubUnempDur, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") preds <- predict(estGEE, newdata = SubUnempDurLong) head(preds)
Discretizes continuous time variable into a specified grid of censored data for discrete survival analysis. It is a data preprocessing step, before the data can be extendend in long format and further analysed with discrete survival models.
contToDisc( dataShort, timeColumn, intervalLimits, equi = FALSE, timeAsFactor = FALSE )
contToDisc( dataShort, timeColumn, intervalLimits, equi = FALSE, timeAsFactor = FALSE )
dataShort |
Original data in short format ("class data.frame"). |
timeColumn |
Name of the column of discrete survival times ("character vector"). |
intervalLimits |
Right interval borders ("numeric vector"), e. g. if the intervals are [0, a_1), [a_1, a_2), [a_2, a_max), then intervalLimits = c(a_1, a_2, a_max) |
equi |
Specifies if argument intervalLimits should be interpreted as number of equidistant intervals ("logical vector"). |
timeAsFactor |
Specifies if the computed discrete time intervals should be converted to a categorical variable ("logical vector"). Default is FALSE. In the default settings the discret time intervals are treated as quantitative ("numeric vector"). |
Gives the data set expanded with a first column "timeDisc". This column includes the discrete time intervals ("class factor").
In discrete survival analysis the survival times have to be categorized in time intervals. Therefore this function is required, if there are observed continuous survival times.
Thomas Welchowski [email protected]
Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.
dataLong
, dataLongTimeDep
,
dataLongCompRisks
# Example copenhagen stroke study data library(pec) data(cost) head(cost) # Convert observed times to months # Right borders of intervals [0, a_1), [a_1, a_2), ... , [a_{\max-1}, a_{\max}) IntBorders <- 1:ceiling(max(cost$time)/30)*30 # Select subsample subCost <- cost [1:100, ] CostMonths <- contToDisc(dataShort=subCost, timeColumn = "time", intervalLimits = IntBorders) head(CostMonths) # Select subsample giving number of equidistant intervals CostMonths <- contToDisc(dataShort = subCost, timeColumn = "time", intervalLimits = 10, equi = TRUE) head(CostMonths)
# Example copenhagen stroke study data library(pec) data(cost) head(cost) # Convert observed times to months # Right borders of intervals [0, a_1), [a_1, a_2), ... , [a_{\max-1}, a_{\max}) IntBorders <- 1:ceiling(max(cost$time)/30)*30 # Select subsample subCost <- cost [1:100, ] CostMonths <- contToDisc(dataShort=subCost, timeColumn = "time", intervalLimits = IntBorders) head(CostMonths) # Select subsample giving number of equidistant intervals CostMonths <- contToDisc(dataShort = subCost, timeColumn = "time", intervalLimits = 10, equi = TRUE) head(CostMonths)
Estimates covariance of estimated parameters of all competing events generalized estimation equation models using sandwich approach.
covarGEE(modelEst)
covarGEE(modelEst)
modelEst |
Discrete time competing risks GEE model prediction model ("class dCRGEE"). |
Returns symmetric matrix of rows and columns dimension "number of competing risks" * "number of regression parameters" ("numeric matrix").
Thomas Welchowski [email protected]
Lee M, Feuer EJ, Fine JP (2018). “On the analysis of discrete time competing risks data.” Biometrics, 74, 1468-1481.
compRisksGEE
, dataLongCompRisks
, dataLongCompRisksTimeDep
,
geeglm
# Example with unemployment data library(Ecdat) data(UnempDur) # Select subsample SubUnempDur <- UnempDur [1:100, ] # Estimate GEE models for all events estGEE <- compRisksGEE(datShort = SubUnempDur, dataTransform = "dataLongCompRisks", corstr = "independence", formulaVariable =~ timeInt + age + ui + logwage * ui, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") ## Not run: # Estimate covariance matrix of estimated parameters and competing events estCovar <- covarGEE(modelEst=estGEE) estCovar # Covariances of estimated parameters of one event equal the diagonal blocks lengthParameters <- length(estGEE[[1]]$coefficients) noCompEvents <- length(estGEE) meanAbsError <- rep(NA, noCompEvents) for( k in 1:noCompEvents ){ relInd <- (1 + (k-1) * lengthParameters) : (k * lengthParameters) meanAbsError[k] <- mean(abs(estCovar[relInd, relInd] - estGEE[[k]]$geese$vbeta)) } mean(meanAbsError) # -> Covariance estimates within each event are equal to diagonal blocks in # complete covariance matrix with very small differences due to numerical accuracy. ## End(Not run)
# Example with unemployment data library(Ecdat) data(UnempDur) # Select subsample SubUnempDur <- UnempDur [1:100, ] # Estimate GEE models for all events estGEE <- compRisksGEE(datShort = SubUnempDur, dataTransform = "dataLongCompRisks", corstr = "independence", formulaVariable =~ timeInt + age + ui + logwage * ui, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") ## Not run: # Estimate covariance matrix of estimated parameters and competing events estCovar <- covarGEE(modelEst=estGEE) estCovar # Covariances of estimated parameters of one event equal the diagonal blocks lengthParameters <- length(estGEE[[1]]$coefficients) noCompEvents <- length(estGEE) meanAbsError <- rep(NA, noCompEvents) for( k in 1:noCompEvents ){ relInd <- (1 + (k-1) * lengthParameters) : (k * lengthParameters) meanAbsError[k] <- mean(abs(estCovar[relInd, relInd] - estGEE[[k]]$geese$vbeta)) } mean(meanAbsError) # -> Covariance estimates within each event are equal to diagonal blocks in # complete covariance matrix with very small differences due to numerical accuracy. ## End(Not run)
Adapted version of the crash2 trial data as availlable in the package Hmisc. Both death or survival and main cause of death are included. Death times are discretized into days. Included covariates are sex and age of patient, elapsed time between injury and hospitalization, type of injury, systolic blood pressure, heart rate, respiratory rate, central capillary refill time and total glascow coma score. #'
Column "time" is time until death or hospital discharge/transfer in weeks.
Column "Status" denotes type of death
bleeding
head injury
vascular occlusion
multi organ failure
other
NA if patient is not dead (see "statusSE")
Column "statusSE" denotes death or discharge/transfer from hospital
0 - Transfer/Discharge
1 - Death
Column "sex" denotes sex of the patient.
Column "age" denotes age of the patient in years.
Column "injurytime" gives time in hours between injury and hospitalization.
Column "injurytype" denotes type of injury, one in
blunt
penetrating
blunt and penetrating
Column "sbp" denotes systolic blood pressure in mmHg.
Column "rr" denotes respiratory rate per minute.
Column "cc" denotes central capillary refill time in seconds.
Column "hr" denotes heart rate per minute.
Column "gcs" denotes total Glascow Coma Score.
data(crash2)
data(crash2)
David Koehler [email protected]
Function for transformation of discrete survival times in censoring encoding. The original data is expanded to include the censoring process. Alternatively the long data format can also be augmented. With the new generated variable "yCens", the discrete censoring process can be analyzed instead of the discrete survival process. In discrete survival analysis this information is used to constructs weights for predictive evaluation measures. It is applicable in single event survival analysis.
dataCensoring(dataShort, eventColumns, timeColumn, shortFormat = TRUE)
dataCensoring(dataShort, eventColumns, timeColumn, shortFormat = TRUE)
dataShort |
Original data set in short format ("class data.frame"). |
eventColumns |
Name of event columns ("character vector"). The event columns have to be in binary format. If the sum of all events equals zero in a row, then this observation is interpreted as censored. |
timeColumn |
Name of column with discrete time intervals ("character vector"). |
shortFormat |
Is the supplied data set dataShort not preprocessed with function dataLong() ("logical vector")? Default is TRUE. If shortFormat=FALSE then it is assumed that the data set was augmented with function dataLong(). |
Original data set as argument dataShort, but with added censoring process as first variable in column "yCens".
Thomas Welchowski [email protected]
Tutz G, Schmid M (2016).
Modeling discrete time-to-event data.
Springer Series in Statistics.
Fahrmeir L (2005).
“Discrete Survival-Time Models.”
In Encyclopedia of Biostatistics, chapter Survival Analysis.
John Wiley \& Sons.
Thompson Jr. WA (1977).
“On the Treatment of Grouped Observations in Life Studies.”
Biometrics, 33, 463-470.
contToDisc
,
dataLong
, dataLongTimeDep
,
dataLongCompRisks
library(pec) data(cost) head(cost) IntBorders <- 1:ceiling(max(cost$time)/30)*30 subCost <- cost [1:100, ] # Convert from days to months CostMonths <- contToDisc(dataShort=subCost, timeColumn="time", intervalLimits=IntBorders) head(CostMonths) # Generate censoring process variable in short format CostMonthsCensorShort <- dataCensoring (dataShort = CostMonths, eventColumns = "status", timeColumn = "time", shortFormat = TRUE) head(CostMonthsCensorShort) ################################ # Example with long data format library(pec) data(cost) head(cost) IntBorders <- 1:ceiling(max(cost$time)/30)*30 subCost <- cost [1:100, ] # Convert from days to months CostMonths <- contToDisc(dataShort = subCost, timeColumn = "time", intervalLimits = IntBorders) head(CostMonths) # Convert to long format based on months CostMonthsLong <- dataLong(dataShort = CostMonths, timeColumn = "timeDisc", eventColumn = "status") head(CostMonthsLong, 20) # Generate censoring process variable CostMonthsCensor <- dataCensoring (dataShort = CostMonthsLong, timeColumn = "timeInt", shortFormat = FALSE) head(CostMonthsCensor) tail(CostMonthsCensor [CostMonthsCensor$obj==1, ], 10) tail(CostMonthsCensor [CostMonthsCensor$obj==3, ], 10)
library(pec) data(cost) head(cost) IntBorders <- 1:ceiling(max(cost$time)/30)*30 subCost <- cost [1:100, ] # Convert from days to months CostMonths <- contToDisc(dataShort=subCost, timeColumn="time", intervalLimits=IntBorders) head(CostMonths) # Generate censoring process variable in short format CostMonthsCensorShort <- dataCensoring (dataShort = CostMonths, eventColumns = "status", timeColumn = "time", shortFormat = TRUE) head(CostMonthsCensorShort) ################################ # Example with long data format library(pec) data(cost) head(cost) IntBorders <- 1:ceiling(max(cost$time)/30)*30 subCost <- cost [1:100, ] # Convert from days to months CostMonths <- contToDisc(dataShort = subCost, timeColumn = "time", intervalLimits = IntBorders) head(CostMonths) # Convert to long format based on months CostMonthsLong <- dataLong(dataShort = CostMonths, timeColumn = "timeDisc", eventColumn = "status") head(CostMonthsLong, 20) # Generate censoring process variable CostMonthsCensor <- dataCensoring (dataShort = CostMonthsLong, timeColumn = "timeInt", shortFormat = FALSE) head(CostMonthsCensor) tail(CostMonthsCensor [CostMonthsCensor$obj==1, ], 10) tail(CostMonthsCensor [CostMonthsCensor$obj==3, ], 10)
Transform data from short format into long format for discrete survival analysis and right censoring. Data is assumed to include no time varying covariates, e. g. no follow up visits are allowed. It is assumed that the covariates stay constant over time, in which no information is available.
dataLong( dataShort, timeColumn, eventColumn, timeAsFactor = FALSE, remLastInt = FALSE, aggTimeFormat = FALSE, lastTheoInt = NULL )
dataLong( dataShort, timeColumn, eventColumn, timeAsFactor = FALSE, remLastInt = FALSE, aggTimeFormat = FALSE, lastTheoInt = NULL )
dataShort |
Original data in short format ("class data.frame"). |
timeColumn |
Character giving the column name of the observed times. It is required that the observed times are discrete ("integer vector"). |
eventColumn |
Column name of the event indicator ("character vector"). It is required that this is a binary variable with 1=="event" and 0=="censored". |
timeAsFactor |
Should the time intervals be coded as factor ("logical vector")? Default is FALSE. In the default settings the column is treated as quantitative variable ("numeric vector"). |
remLastInt |
Should the last theoretical interval be removed in long format ("logical vector")? Default setting (FALSE) is no deletion. This is only important, if the short format data includes the last theoretic interval [a_q, Inf). There are only events in the last theoretic interval, so the discrete hazard is always one and these observations have to be excluded for estimation. |
aggTimeFormat |
Instead of the usual long format, should every observation have all time intervals ("logical vector")? Default is standard long format (FALSE). In the case of nonlinear risk score models, the time effect has to be integrated out before these can be applied to the C-index. |
lastTheoInt |
Gives the number of the last theoretic interval ("integer vector"). Only used, if argument aggTimeFormat is set to TRUE. |
If the data has continuous survival times, the response may be transformed
to discrete intervals using function contToDisc
. If the data
set has time varying covariates the function dataLongTimeDep
should be used instead. In the case of competing risks and no time varying
covariates see function dataLongCompRisks
.
Original data.frame with three additional columns:
obj Index of persons as integer vector
timeInt Index of time intervals (factor)
y Response in long format as binary vector. 1=="event happens in period timeInt" and zero otherwise. If argument responseAsFactor is set to TRUE, then responses will be coded as factor in one column.
Thomas Welchowski [email protected]
Matthias Schmid [email protected]
Tutz G, Schmid M (2016).
Modeling discrete time-to-event data.
Springer Series in Statistics.
Fahrmeir L (2005).
“Discrete Survival-Time Models.”
In Encyclopedia of Biostatistics, chapter Survival Analysis.
John Wiley \& Sons.
Thompson Jr. WA (1977).
“On the Treatment of Grouped Observations in Life Studies.”
Biometrics, 33, 463-470.
contToDisc
, dataLongTimeDep
,
dataLongCompRisks
# Example unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] head(subUnempDur) # Convert to long format UnempLong <- dataLong (dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1") head(UnempLong, 20) # Is there exactly one observed event of y for each person? splitUnempLong <- split(UnempLong, UnempLong$obj) all(sapply(splitUnempLong, function (x) sum(x$y))==subUnempDur$censor1) # TRUE # Second example: Acute Myelogenous Leukemia survival data library(survival) head(leukemia) leukLong <- dataLong(dataShort = leukemia, timeColumn = "time", eventColumn = "status", timeAsFactor=TRUE) head(leukLong, 30) # Estimate discrete survival model estGlm <- glm(formula = y ~ timeInt + x, data=leukLong, family = binomial()) summary(estGlm) # Estimate survival curves for non-maintained chemotherapy newDataNonMaintained <- data.frame(timeInt = factor(1:161), x = rep("Nonmaintained")) predHazNonMain <- predict(estGlm, newdata = newDataNonMaintained, type = "response") predSurvNonMain <- cumprod(1-predHazNonMain) # Estimate survival curves for maintained chemotherapy newDataMaintained <- data.frame(timeInt = factor(1:161), x = rep("Maintained")) predHazMain <- predict(estGlm, newdata = newDataMaintained, type = "response") predSurvMain <- cumprod(1-predHazMain) # Compare survival curves plot(x = 1:50, y = predSurvMain [1:50], xlab = "Time", ylab = "S(t)", las = 1, type = "l", main = "Effect of maintained chemotherapy on survival of leukemia patients") lines(x = 1:161, y = predSurvNonMain, col = "red") legend("topright", legend = c("Maintained chemotherapy", "Non-maintained chemotherapy"), col = c("black", "red"), lty = rep(1, 2)) # The maintained therapy has clearly a positive effect on survival over the time range ############################################## # Simulation # Single event in case of right-censoring # Simulate multivariate normal distribution library(discSurv) library(mvnfast) set.seed(-1980) X <- mvnfast::rmvn(n = 1000, mu = rep(0, 10), sigma = diag(10)) # Specification of discrete hazards with 11 theoretical intervals betaCoef <- seq(-1, 1, length.out = 11)[-6] timeInt <- seq(-1, 1, length.out = 10) linPred <- c(X %*% betaCoef) hazTimeX <- cbind(sapply(1:length(timeInt), function(x) exp(linPred+timeInt[x]) / (1+exp(linPred+timeInt[x])) ), 1) # Simulate discrete survival and censoring times in 10 observed intervals discT <- rep(NA, dim(hazTimeX)[1]) discC <- rep(NA, dim(hazTimeX)[1]) for( i in 1:dim(hazTimeX)[1] ){ discT[i] <- sample(1:11, size = 1, prob = estMargProb(haz=hazTimeX[i, ])) discC[i] <- sample(1:11, size = 1, prob = c(rep(1/11, 11))) } # Calculate observed times, event indicator and specify short data format eventInd <- discT <= discC obsT <- ifelse(eventInd, discT, discC) eventInd[obsT == 11] <- 0 obsT[obsT == 11] <- 10 simDatShort <- data.frame(obsT = obsT, event = as.numeric(eventInd), X) # Convert data to discrete data long format simDatLong <- dataLong(dataShort = simDatShort, timeColumn = "obsT", eventColumn = "event", timeAsFactor=TRUE) # Estimate discrete-time continuation ratio model formSpec <- as.formula(paste("y ~ timeInt + ", paste(paste("X", 1:10, sep=""), collapse = " + "), sep = "")) modelFit <- glm(formula = formSpec, data = simDatLong, family = binomial(link = "logit")) summary(modelFit) # Compare estimated to true coefficients coefModel <- coef(modelFit) MSE_covariates <- mean((coefModel[11:20]-timeInt)^2) MSE_covariates # -> Estimated coefficients are near true coefficients
# Example unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] head(subUnempDur) # Convert to long format UnempLong <- dataLong (dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1") head(UnempLong, 20) # Is there exactly one observed event of y for each person? splitUnempLong <- split(UnempLong, UnempLong$obj) all(sapply(splitUnempLong, function (x) sum(x$y))==subUnempDur$censor1) # TRUE # Second example: Acute Myelogenous Leukemia survival data library(survival) head(leukemia) leukLong <- dataLong(dataShort = leukemia, timeColumn = "time", eventColumn = "status", timeAsFactor=TRUE) head(leukLong, 30) # Estimate discrete survival model estGlm <- glm(formula = y ~ timeInt + x, data=leukLong, family = binomial()) summary(estGlm) # Estimate survival curves for non-maintained chemotherapy newDataNonMaintained <- data.frame(timeInt = factor(1:161), x = rep("Nonmaintained")) predHazNonMain <- predict(estGlm, newdata = newDataNonMaintained, type = "response") predSurvNonMain <- cumprod(1-predHazNonMain) # Estimate survival curves for maintained chemotherapy newDataMaintained <- data.frame(timeInt = factor(1:161), x = rep("Maintained")) predHazMain <- predict(estGlm, newdata = newDataMaintained, type = "response") predSurvMain <- cumprod(1-predHazMain) # Compare survival curves plot(x = 1:50, y = predSurvMain [1:50], xlab = "Time", ylab = "S(t)", las = 1, type = "l", main = "Effect of maintained chemotherapy on survival of leukemia patients") lines(x = 1:161, y = predSurvNonMain, col = "red") legend("topright", legend = c("Maintained chemotherapy", "Non-maintained chemotherapy"), col = c("black", "red"), lty = rep(1, 2)) # The maintained therapy has clearly a positive effect on survival over the time range ############################################## # Simulation # Single event in case of right-censoring # Simulate multivariate normal distribution library(discSurv) library(mvnfast) set.seed(-1980) X <- mvnfast::rmvn(n = 1000, mu = rep(0, 10), sigma = diag(10)) # Specification of discrete hazards with 11 theoretical intervals betaCoef <- seq(-1, 1, length.out = 11)[-6] timeInt <- seq(-1, 1, length.out = 10) linPred <- c(X %*% betaCoef) hazTimeX <- cbind(sapply(1:length(timeInt), function(x) exp(linPred+timeInt[x]) / (1+exp(linPred+timeInt[x])) ), 1) # Simulate discrete survival and censoring times in 10 observed intervals discT <- rep(NA, dim(hazTimeX)[1]) discC <- rep(NA, dim(hazTimeX)[1]) for( i in 1:dim(hazTimeX)[1] ){ discT[i] <- sample(1:11, size = 1, prob = estMargProb(haz=hazTimeX[i, ])) discC[i] <- sample(1:11, size = 1, prob = c(rep(1/11, 11))) } # Calculate observed times, event indicator and specify short data format eventInd <- discT <= discC obsT <- ifelse(eventInd, discT, discC) eventInd[obsT == 11] <- 0 obsT[obsT == 11] <- 10 simDatShort <- data.frame(obsT = obsT, event = as.numeric(eventInd), X) # Convert data to discrete data long format simDatLong <- dataLong(dataShort = simDatShort, timeColumn = "obsT", eventColumn = "event", timeAsFactor=TRUE) # Estimate discrete-time continuation ratio model formSpec <- as.formula(paste("y ~ timeInt + ", paste(paste("X", 1:10, sep=""), collapse = " + "), sep = "")) modelFit <- glm(formula = formSpec, data = simDatLong, family = binomial(link = "logit")) summary(modelFit) # Compare estimated to true coefficients coefModel <- coef(modelFit) MSE_covariates <- mean((coefModel[11:20]-timeInt)^2) MSE_covariates # -> Estimated coefficients are near true coefficients
Transforms short data format to long format for discrete survival modelling in the case of competing risks with right censoring. It is assumed that the covariates are not time varying.
dataLongCompRisks( dataShort, timeColumn, eventColumns, eventColumnsAsFactor = FALSE, timeAsFactor = FALSE, aggTimeFormat = FALSE, lastTheoInt = NULL, responseAsFactor = FALSE )
dataLongCompRisks( dataShort, timeColumn, eventColumns, eventColumnsAsFactor = FALSE, timeAsFactor = FALSE, aggTimeFormat = FALSE, lastTheoInt = NULL, responseAsFactor = FALSE )
dataShort |
Original data in short format ("class data.frame"). |
timeColumn |
Character giving the column name of the observed times ("character vector"). It is required that the observed times are discrete ("integer vector"). |
eventColumns |
Character vector giving the column names of the event indicators (excluding censoring column)("character vector"). It is required that all events are binary encoded. If the sum of all event indicators is zero, then this is interpreted as a censored observation. Alternatively a column name of a factor representing competing events can be given. In this case the argument eventColumnsAsFactor has to be set TRUE and the first level is assumed to represent censoring. |
eventColumnsAsFactor |
Should the argument eventColumns be interpreted as column name of a factor variable ("logical vector")? Default is FALSE. |
timeAsFactor |
Should the time intervals be coded as factor ("logical vector")? Default is FALSE. In the default settings the discrete time variable are treated as quantitative. |
aggTimeFormat |
Instead of the usual long format, should every observation have all time intervals ("logical vector")? Default is standard long format. In the case of nonlinear risk score models, the time effect has to be integrated out before these can be applied to the C-index. |
lastTheoInt |
Gives the number of the last theoretic interval ("integer vector"). Only used, if aggTimeFormat is set to TRUE. |
responseAsFactor |
Should the response columns be given as factor ("logical vector")? Default is FALSE. |
It is assumed, that only one event happens at a specific time point (competing risks). Either the observation is censored or one of the possible events takes place.
In contrast to continuous survival (see e. g. Surv
)
the start and stop time notation is not used here. In discrete time survival analysis the only relevant
information is to use the stop time. Start time does not matter, because all discrete intervals need to be
included in the long data set format to ensure consistent estimation. It is assumed that the supplied
data set dataShort contains all repeated measurements of each cluster (e. g. persons).
For further information see example Start-stop notation.
Original data set in long format with additional columns
obj Gives identification number of objects (row index in short format) (integer)
timeInt Gives number of discrete time intervals (factor)
responses Columns with dimension count of events + 1 (censoring)
e0 No event (observation censored in specific interval)
e1 Indicator of first event, 1 if event takes place and 0 otherwise
... ...
ek Indicator of last k-th event, 1 if event takes place and zero otherwise
If argument responseAsFactor=TRUE, then responses will be coded as factor in one column.
Thomas Welchowski [email protected]
Tutz G, Schmid M (2016).
Modeling discrete time-to-event data.
Springer Series in Statistics.
Steele F, Goldstein H, Browne W (2004).
“A general multilevel multistate competing risks model for event history data, with an application to a study of contraceptive use dynamics.”
Statistical Modelling, 4, 145-159.
Narendranathan W, Stewart MB (1993).
“Modelling the Probability of Leaving Unemployment: Competing Risks Models with Flexible Base-Line Hazards.”
Journal of the Royal Statistical Society Series C, 42, 63-83.
contToDisc
, dataLongTimeDep
,
dataLongCompRisksTimeDep
# Example with unemployment data library(Ecdat) data(UnempDur) # Select subsample SubUnempDur <- UnempDur [1:100, ] # Convert competing risk data to long format SubUnempDurLong <- dataLongCompRisks (dataShort = SubUnempDur, timeColumn = "spell", eventColumns = c("censor1", "censor2", "censor3", "censor4")) head(SubUnempDurLong, 20) # Fit multinomial logit model with VGAM package # with one coefficient per response library(VGAM) multLogitVGM <- vgam(cbind(e0, e1, e2, e3, e4) ~ timeInt + ui + age + logwage, family = multinomial(refLevel = 1), data = SubUnempDurLong) coef(multLogitVGM) # Alternative: Use nnet # Convert response to factor rawResponseMat <- SubUnempDurLong[, c("e0", "e1", "e2", "e3", "e4")] NewFactor <- factor(unname(apply(rawResponseMat, 1, function(x) which(x == 1))), labels = colnames(rawResponseMat)) # Include recoded response in data SubUnempDurLong <- cbind(SubUnempDurLong, NewResp = NewFactor) # Construct formula of mlogit model mlogitFormula <- formula(NewResp ~ timeInt + ui + age + logwage) # Fit multinomial logit model # with one coefficient per response library(nnet) multLogitNNET <- multinom(formula = mlogitFormula, data = SubUnempDurLong) coef(multLogitNNET) ########################################################### # Simulation # Cause specific competing risks in case of right-censoring # Discrete subdistribution hazards model # Simulate covariates as multivariate normal distribution library(mvnfast) set.seed(1980) X <- mvnfast::rmvn(n = 1000, mu = rep(0, 4), sigma = diag(4)) # Specification of two discrete cause specific hazards with four intervals # Event 1 theoInterval <- 4 betaCoef_event1 <- seq(-1, 1, length.out = 5)[-3] timeInt_event1 <- seq(0.1, -0.1, length.out = theoInterval-1) linPred_event1 <- c(X %*% betaCoef_event1) # Event 2 betaCoef_event2 <- seq(-0.5, 0.5, length.out = 5)[-3] timeInt_event2 <- seq(-0.1, 0.1, length.out = theoInterval-1) linPred_event2 <- c(X %*% betaCoef_event2) # Discrete cause specific hazards in last theoretical interval theoHaz_event1 <- 0.5 theoHaz_event2 <- 0.5 haz_event1_X <- cbind(sapply(1:length(timeInt_event1), function(x) exp(linPred_event1 + timeInt_event1[x]) / (1 + exp(linPred_event1 + timeInt_event1[x]) + exp(linPred_event2 + timeInt_event2[x])) ), theoHaz_event1) haz_event2_X <- cbind(sapply(1:length(timeInt_event2), function(x) exp(linPred_event2 + timeInt_event2[x]) / (1 + exp(linPred_event1 + timeInt_event1[x]) + exp(linPred_event2 + timeInt_event2[x]) ) ), theoHaz_event2) allCauseHaz_X <- haz_event1_X + haz_event2_X pT_X <- t(sapply(1:dim(allCauseHaz_X)[1], function(i) estMargProb(allCauseHaz_X[i, ]) )) pR_T_X_event1 <- haz_event1_X / (haz_event1_X + haz_event2_X) survT <- sapply(1:dim(pT_X)[1], function(i) sample(x = 1:(length(timeInt_event1) + 1), size = 1, prob = pT_X[i, ]) ) censT <- sample(x = 1:(length(timeInt_event1)+1), size = dim(pT_X)[1], prob = rep(1/(length(timeInt_event1) + 1), (length(timeInt_event1) + 1)), replace = TRUE) obsT <- ifelse(survT <= censT, survT, censT) obsEvent <- rep(0, length(obsT)) obsEvent <- sapply(1:length(obsT), function(i) if(survT[i] <= censT[i]){ return(sample(x = c(1, 2), size=1, prob = c(pR_T_X_event1[i, obsT[i] ], 1 - pR_T_X_event1[i, obsT[i] ]) )) } else{ return(0) } ) # Recode last interval to censored lastInterval <- obsT == theoInterval obsT[lastInterval] <- theoInterval - 1 obsEvent[lastInterval] <- 0 obsT <- factor(obsT) obsEvent <- factor(obsEvent) datShort <- data.frame(event = factor(obsEvent), time = obsT, X) datLong <- dataLongCompRisks(dataShort = datShort, timeColumn = "time", eventColumns = "event", responseAsFactor = TRUE, eventColumnsAsFactor = TRUE, timeAsFactor = TRUE) # Estimate discrete cause specific hazard model library(VGAM) estModel <- vglm(formula=responses ~ timeInt + X1 + X2 + X3 + X4, data=datLong, family = multinomial(refLevel = 1)) # Mean squared errors per event coefModels <- coef(estModel) mean((coefModels[seq(7, length(coefModels), 2)] - betaCoef_event1)^2) # Event 1 mean((coefModels[seq(8, length(coefModels), 2)] - betaCoef_event2)^2) # Event 2 # -> Estimated coefficients are near true coefficients for each event type
# Example with unemployment data library(Ecdat) data(UnempDur) # Select subsample SubUnempDur <- UnempDur [1:100, ] # Convert competing risk data to long format SubUnempDurLong <- dataLongCompRisks (dataShort = SubUnempDur, timeColumn = "spell", eventColumns = c("censor1", "censor2", "censor3", "censor4")) head(SubUnempDurLong, 20) # Fit multinomial logit model with VGAM package # with one coefficient per response library(VGAM) multLogitVGM <- vgam(cbind(e0, e1, e2, e3, e4) ~ timeInt + ui + age + logwage, family = multinomial(refLevel = 1), data = SubUnempDurLong) coef(multLogitVGM) # Alternative: Use nnet # Convert response to factor rawResponseMat <- SubUnempDurLong[, c("e0", "e1", "e2", "e3", "e4")] NewFactor <- factor(unname(apply(rawResponseMat, 1, function(x) which(x == 1))), labels = colnames(rawResponseMat)) # Include recoded response in data SubUnempDurLong <- cbind(SubUnempDurLong, NewResp = NewFactor) # Construct formula of mlogit model mlogitFormula <- formula(NewResp ~ timeInt + ui + age + logwage) # Fit multinomial logit model # with one coefficient per response library(nnet) multLogitNNET <- multinom(formula = mlogitFormula, data = SubUnempDurLong) coef(multLogitNNET) ########################################################### # Simulation # Cause specific competing risks in case of right-censoring # Discrete subdistribution hazards model # Simulate covariates as multivariate normal distribution library(mvnfast) set.seed(1980) X <- mvnfast::rmvn(n = 1000, mu = rep(0, 4), sigma = diag(4)) # Specification of two discrete cause specific hazards with four intervals # Event 1 theoInterval <- 4 betaCoef_event1 <- seq(-1, 1, length.out = 5)[-3] timeInt_event1 <- seq(0.1, -0.1, length.out = theoInterval-1) linPred_event1 <- c(X %*% betaCoef_event1) # Event 2 betaCoef_event2 <- seq(-0.5, 0.5, length.out = 5)[-3] timeInt_event2 <- seq(-0.1, 0.1, length.out = theoInterval-1) linPred_event2 <- c(X %*% betaCoef_event2) # Discrete cause specific hazards in last theoretical interval theoHaz_event1 <- 0.5 theoHaz_event2 <- 0.5 haz_event1_X <- cbind(sapply(1:length(timeInt_event1), function(x) exp(linPred_event1 + timeInt_event1[x]) / (1 + exp(linPred_event1 + timeInt_event1[x]) + exp(linPred_event2 + timeInt_event2[x])) ), theoHaz_event1) haz_event2_X <- cbind(sapply(1:length(timeInt_event2), function(x) exp(linPred_event2 + timeInt_event2[x]) / (1 + exp(linPred_event1 + timeInt_event1[x]) + exp(linPred_event2 + timeInt_event2[x]) ) ), theoHaz_event2) allCauseHaz_X <- haz_event1_X + haz_event2_X pT_X <- t(sapply(1:dim(allCauseHaz_X)[1], function(i) estMargProb(allCauseHaz_X[i, ]) )) pR_T_X_event1 <- haz_event1_X / (haz_event1_X + haz_event2_X) survT <- sapply(1:dim(pT_X)[1], function(i) sample(x = 1:(length(timeInt_event1) + 1), size = 1, prob = pT_X[i, ]) ) censT <- sample(x = 1:(length(timeInt_event1)+1), size = dim(pT_X)[1], prob = rep(1/(length(timeInt_event1) + 1), (length(timeInt_event1) + 1)), replace = TRUE) obsT <- ifelse(survT <= censT, survT, censT) obsEvent <- rep(0, length(obsT)) obsEvent <- sapply(1:length(obsT), function(i) if(survT[i] <= censT[i]){ return(sample(x = c(1, 2), size=1, prob = c(pR_T_X_event1[i, obsT[i] ], 1 - pR_T_X_event1[i, obsT[i] ]) )) } else{ return(0) } ) # Recode last interval to censored lastInterval <- obsT == theoInterval obsT[lastInterval] <- theoInterval - 1 obsEvent[lastInterval] <- 0 obsT <- factor(obsT) obsEvent <- factor(obsEvent) datShort <- data.frame(event = factor(obsEvent), time = obsT, X) datLong <- dataLongCompRisks(dataShort = datShort, timeColumn = "time", eventColumns = "event", responseAsFactor = TRUE, eventColumnsAsFactor = TRUE, timeAsFactor = TRUE) # Estimate discrete cause specific hazard model library(VGAM) estModel <- vglm(formula=responses ~ timeInt + X1 + X2 + X3 + X4, data=datLong, family = multinomial(refLevel = 1)) # Mean squared errors per event coefModels <- coef(estModel) mean((coefModels[seq(7, length(coefModels), 2)] - betaCoef_event1)^2) # Event 1 mean((coefModels[seq(8, length(coefModels), 2)] - betaCoef_event2)^2) # Event 2 # -> Estimated coefficients are near true coefficients for each event type
Transforms short data format to long format for discrete survival modelling in the case of competing risks with right censoring. Covariates may vary over time.
dataLongCompRisksTimeDep( dataSemiLong, timeColumn, eventColumns, eventColumnsAsFactor = FALSE, idColumn, timeAsFactor = FALSE, responseAsFactor = FALSE )
dataLongCompRisksTimeDep( dataSemiLong, timeColumn, eventColumns, eventColumnsAsFactor = FALSE, idColumn, timeAsFactor = FALSE, responseAsFactor = FALSE )
dataSemiLong |
Original data in semi-long format ("class data.frame"). |
timeColumn |
Character giving the column name of the observed times("logical vector"). It is required that the observed times are discrete ("integer vector"). |
eventColumns |
Character vector giving the column names of the event indicators (excluding censoring column)("character vector"). It is required that all events are binary encoded. If the sum of all event indicators is zero, then this is interpreted as a censored observation. Alternatively a column name of a factor representing competing events can be given. In this case the argument eventColumnsAsFactor has to be set TRUE and the first level is assumed to represent censoring. |
eventColumnsAsFactor |
Should the argument eventColumns be intepreted as column name of a factor variable ("logical vector")? Default is FALSE. |
idColumn |
Name of column of identification number of persons as character("character vector"). |
timeAsFactor |
Should the time intervals be coded as factor ("logical vector")? Default is FALSE. In the default settings the discrete time intervals are treated as quantitative ("numeric vector"). |
responseAsFactor |
Should the response columns be given as factor ("logical vector")? Default is FALSE. |
There may be some intervals, where no additional information on the covariates is observed (e. g. observed values in interval one and three but two is missing). In this case it is assumed, that the values from the last observation stay constant over time until a new measurement was done.
In contrast to continuous survival (see e. g. Surv
)
the start and stop time notation is not used here. In discrete time survival analysis the only relevant
information is to use the stop time. Start time does not matter, because all discrete intervals need to be
included in the long data set format to ensure consistent estimation. It is assumed that the supplied
data set dataSemiLong contains all repeated measurements of each cluster in semi-long format (e. g. persons).
For further information see example Start-stop notation.
Original data set in long format with additional columns
obj Gives identification number of objects (row index in short format) (integer)
timeInt Gives number of discrete time intervals (factor)
responses Columns with dimension count of events + 1 (censoring)
e0 No event (observation censored in specific interval)
e1 Indicator of first event, 1 if event takes place and 0 otherwise
... ...
ek Indicator of last k-th event, 1 if event takes place and 0 otherwise
If argument responseAsFactor=TRUE, then responses will be coded as factor in one column.
Thomas Welchowski [email protected]
Fahrmeir L (2005).
“Discrete Survival-Time Models.”
In Encyclopedia of Biostatistics, chapter Survival Analysis.
John Wiley \& Sons.
Thompson Jr. WA (1977).
“On the Treatment of Grouped Observations in Life Studies.”
Biometrics, 33, 463-470.
contToDisc
, dataLong
,
dataLongCompRisks
# Example Primary Biliary Cirrhosis data library(survival) pbcseq_example <- pbcseq # Convert to months pbcseq_example$day <- ceiling(pbcseq_example$day/30) + 1 names(pbcseq_example)[7] <- "month" pbcseq_example$status <- factor(pbcseq_example$status) # Convert to long format for time varying effects pbcseq_exampleLong <- dataLongCompRisksTimeDep(dataSemiLong = pbcseq_example, timeColumn = "month", eventColumns = "status", eventColumnsAsFactor = TRUE, idColumn = "id", timeAsFactor = TRUE) head(pbcseq_exampleLong) ##################### # Start-stop notation library(survival) ?pbcseq # Choose subset of patients subsetID <- unique(pbcseq$id)[1:100] pbcseq_mod <- pbcseq[pbcseq$id %in% subsetID, ] # Convert to start stop notation pbcseq_mod_split <- split(pbcseq_mod, pbcseq_mod$id) pbcseq_mod_split <- lapply(1:length(pbcseq_mod_split), function(x) { cbind(pbcseq_mod_split[[x]], start_time=c(0, pbcseq_mod_split[[x]][ - dim(pbcseq_mod_split[[x]])[1], "day"]), stop_time=pbcseq_mod_split[[x]][, "day"]) }) pbcseq_mod <- do.call(rbind, pbcseq_mod_split) # Convert stop time to months intervalDef <- c(quantile(pbcseq_mod$stop_time, probs = seq(0.1, 0.9, by=0.1)), Inf) names(pbcseq_mod) pbcseq_mod <- contToDisc(dataShort = pbcseq_mod, timeColumn = "stop_time", intervalLimits = intervalDef, equi = FALSE) pbcseq_mod$status <- factor(pbcseq_mod$status) # Conversion to data long format pbcseq_mod_long <- dataLongCompRisksTimeDep(dataSemiLong = pbcseq_mod, timeColumn = "timeDisc", eventColumns = "status", idColumn = "id", eventColumnsAsFactor = TRUE, responseAsFactor = TRUE, timeAsFactor = TRUE) head(pbcseq_mod_long)
# Example Primary Biliary Cirrhosis data library(survival) pbcseq_example <- pbcseq # Convert to months pbcseq_example$day <- ceiling(pbcseq_example$day/30) + 1 names(pbcseq_example)[7] <- "month" pbcseq_example$status <- factor(pbcseq_example$status) # Convert to long format for time varying effects pbcseq_exampleLong <- dataLongCompRisksTimeDep(dataSemiLong = pbcseq_example, timeColumn = "month", eventColumns = "status", eventColumnsAsFactor = TRUE, idColumn = "id", timeAsFactor = TRUE) head(pbcseq_exampleLong) ##################### # Start-stop notation library(survival) ?pbcseq # Choose subset of patients subsetID <- unique(pbcseq$id)[1:100] pbcseq_mod <- pbcseq[pbcseq$id %in% subsetID, ] # Convert to start stop notation pbcseq_mod_split <- split(pbcseq_mod, pbcseq_mod$id) pbcseq_mod_split <- lapply(1:length(pbcseq_mod_split), function(x) { cbind(pbcseq_mod_split[[x]], start_time=c(0, pbcseq_mod_split[[x]][ - dim(pbcseq_mod_split[[x]])[1], "day"]), stop_time=pbcseq_mod_split[[x]][, "day"]) }) pbcseq_mod <- do.call(rbind, pbcseq_mod_split) # Convert stop time to months intervalDef <- c(quantile(pbcseq_mod$stop_time, probs = seq(0.1, 0.9, by=0.1)), Inf) names(pbcseq_mod) pbcseq_mod <- contToDisc(dataShort = pbcseq_mod, timeColumn = "stop_time", intervalLimits = intervalDef, equi = FALSE) pbcseq_mod$status <- factor(pbcseq_mod$status) # Conversion to data long format pbcseq_mod_long <- dataLongCompRisksTimeDep(dataSemiLong = pbcseq_mod, timeColumn = "timeDisc", eventColumns = "status", idColumn = "id", eventColumnsAsFactor = TRUE, responseAsFactor = TRUE, timeAsFactor = TRUE) head(pbcseq_mod_long)
Transform data from short format into long format for discrete multi spell survival analysis and right censoring.
dataLongMultiSpell( dataSemiLong, timeColumn, eventColumn, idColumn, timeAsFactor = FALSE, spellAsFactor = FALSE )
dataLongMultiSpell( dataSemiLong, timeColumn, eventColumn, idColumn, timeAsFactor = FALSE, spellAsFactor = FALSE )
dataSemiLong |
Original data in semi-long format ("class data.frame"). |
timeColumn |
Character giving the column name of the observed times. It is required that the observed times are discrete ("character vector"). |
eventColumn |
Column name of the event status ("character vector"). The events can take multiple values on a discrete scale (0, 1, 2, ...) and repetition of events is allowed (integer vector or class factor). It is assumed that the number zero corresponds to censoring and all number > 0 represent the observed states between transitions. |
idColumn |
Name of column of identification number of persons as character("character vector"). |
timeAsFactor |
Should the time intervals be coded as factor ("logical vector")? Default is FALSE. In the default settings the discrete time intervals are treated as quantitative ("numeric vector"). |
spellAsFactor |
Should the spells be coded as factor ("logical vector")? Default is not to use factor. If the argument is false, the column is coded as numeric. |
If the data has continuous survival times, the response may be transformed
to discrete intervals using function contToDisc
. The discrete
time variable needs to be strictly increasing for each person, because
otherwise the order of the events is not distinguishable. Here is an example
data structure in short format prior augmentation with three possible
states: \ idColumn=1, 1, ... , 1, 2, 2, ... , n \ timeColumn= t_ID1_1 <
t_ID1_1 < ... < t_ID1_k, t_ID2_1 < t_ID2_2 < ... < t_ID2_k, ... \
eventColumn = 0, 1, ... , 2, 1, 0, ... , 0
The starting state of each individual is assumed to given with time interval equals zero. For example in an illness-death model with three states ("healthy", "illness", "death") if an individual was healthy at the beginning of the study this has to be encoded with discrete time interval set to zero and event state "healthy".
Original data.frame with three additional columns:
obj Index of persons as integer vector
timeInt Index of time intervals (factor or integer vector)
spell The spell gives the actual state of each individual within a given discrete interval.
e0 Response transition in long format as binary vector. Column e0 represents censoring. If e0 is coded one in the in the last observed time interval timeInt of a person, then this observation was censored.
e1 Response in long format as binary vector. The column e1 represents the transition to the first event state.
eX Response in long format as binary vector. The column eX represents the transition to the last event state out of the set of possible states "1, 2, 3, ..., X".
... Expanded columns of original data set.
Thomas Welchowski [email protected]
Tutz G, Schmid M (2016).
Modeling discrete time-to-event data.
Springer Series in Statistics.
Fahrmeir L (2005).
“Discrete Survival-Time Models.”
In Encyclopedia of Biostatistics, chapter Survival Analysis.
John Wiley \& Sons.
Thompson Jr. WA (1977).
“On the Treatment of Grouped Observations in Life Studies.”
Biometrics, 33, 463-470.
contToDisc
, dataLongTimeDep
,
dataLongCompRisks
, dataLongCompRisks
################################ # Example with unemployment data data(unempMultiSpell) # Select subsample of first 500 persons unempSub <- unempMultiSpell[unempMultiSpell$id %in% 1:250,] # Expansion from semi-long to long format unempLong <- dataLongMultiSpell(dataSemiLong=unempSub, timeColumn = "year", eventColumn="spell", idColumn="id", spellAsFactor=TRUE, timeAsFactor=FALSE) head(unempLong, 25) # Fit discrete multi-state model regression model library(VGAM) model <- vgam(cbind(e0, e1, e2, e3, e4) ~ 0 + s(timeInt) + age:spell, data = unempLong, family = multinomial(refLevel="e0")) ############################ # Example with artificial data # Seed specification set.seed(-2578) # Construction of data set # Censoring and three possible states (0, 1, 2, 3) # Discrete time intervals (1, 2, ... , 10) # Noninfluential variable x ~ N(0, 1) datFrame <- data.frame( ID = c(rep(1, 6), rep(2, 4), rep(3, 3), rep(4, 2), rep(5, 4), rep(6, 5), rep(7, 7), rep(8, 8)), time = c(c(0, 2, 5, 6, 8, 10), c(0, 1, 6, 7), c(0, 9, 10), c(0, 6), c(0, 2, 3, 4), c(0, 3, 4, 7, 9), c(0, 2, 3, 5, 7, 8, 10), c(0, 1, 3, 4, 6, 7, 8, 9) ), state = c(c(2, 1, 3, 2, 1, 0), c(3, 1, 2, 2), c(2, 2, 1), c(1, 2), c(3, 2, 2, 0), c(1, 3, 2, 1, 3), c(1, 1, 2, 3, 2, 1, 3), c(3, 2, 3, 2, 1, 1, 2, 3) ), x = rnorm(n=6+4+3+2+4+5+7+8) ) # Transformation to long format datFrameLong <- dataLongMultiSpell(dataSemiLong=datFrame, timeColumn="time", eventColumn="state", idColumn="ID", spellAsFactor=TRUE) head(datFrameLong, 25) library(VGAM) cRm <- vglm(cbind(e0, e1, e2, e3) ~ 0 + timeInt + x:spell, data = datFrameLong, family = "multinomial") summary(cRm)
################################ # Example with unemployment data data(unempMultiSpell) # Select subsample of first 500 persons unempSub <- unempMultiSpell[unempMultiSpell$id %in% 1:250,] # Expansion from semi-long to long format unempLong <- dataLongMultiSpell(dataSemiLong=unempSub, timeColumn = "year", eventColumn="spell", idColumn="id", spellAsFactor=TRUE, timeAsFactor=FALSE) head(unempLong, 25) # Fit discrete multi-state model regression model library(VGAM) model <- vgam(cbind(e0, e1, e2, e3, e4) ~ 0 + s(timeInt) + age:spell, data = unempLong, family = multinomial(refLevel="e0")) ############################ # Example with artificial data # Seed specification set.seed(-2578) # Construction of data set # Censoring and three possible states (0, 1, 2, 3) # Discrete time intervals (1, 2, ... , 10) # Noninfluential variable x ~ N(0, 1) datFrame <- data.frame( ID = c(rep(1, 6), rep(2, 4), rep(3, 3), rep(4, 2), rep(5, 4), rep(6, 5), rep(7, 7), rep(8, 8)), time = c(c(0, 2, 5, 6, 8, 10), c(0, 1, 6, 7), c(0, 9, 10), c(0, 6), c(0, 2, 3, 4), c(0, 3, 4, 7, 9), c(0, 2, 3, 5, 7, 8, 10), c(0, 1, 3, 4, 6, 7, 8, 9) ), state = c(c(2, 1, 3, 2, 1, 0), c(3, 1, 2, 2), c(2, 2, 1), c(1, 2), c(3, 2, 2, 0), c(1, 3, 2, 1, 3), c(1, 1, 2, 3, 2, 1, 3), c(3, 2, 3, 2, 1, 1, 2, 3) ), x = rnorm(n=6+4+3+2+4+5+7+8) ) # Transformation to long format datFrameLong <- dataLongMultiSpell(dataSemiLong=datFrame, timeColumn="time", eventColumn="state", idColumn="ID", spellAsFactor=TRUE) head(datFrameLong, 25) library(VGAM) cRm <- vglm(cbind(e0, e1, e2, e3) ~ 0 + timeInt + x:spell, data = datFrameLong, family = "multinomial") summary(cRm)
Generates the augmented data matrix and the weights required for discrete subdistribution hazard modeling with right censoring.
dataLongSubDist( dataShort, timeColumn, eventColumns, eventColumnsAsFactor = FALSE, eventFocus, timeAsFactor = FALSE, aggTimeFormat = FALSE, lastTheoInt = NULL )
dataLongSubDist( dataShort, timeColumn, eventColumns, eventColumnsAsFactor = FALSE, eventFocus, timeAsFactor = FALSE, aggTimeFormat = FALSE, lastTheoInt = NULL )
dataShort |
Original data in short format ("class data.frame"). |
timeColumn |
Character specifying the column name of the observed event times ("logical vector"). It is required that the observed times are discrete ("integer vector"). |
eventColumns |
Character vector specifying the column names of the event indicators (excluding censoring events) ("logical vector"). It is required that a 0-1 coding is used for all events. The algorithm treats row sums of zero of all event columns as censored. |
eventColumnsAsFactor |
Should the argument eventColumns be interpreted as column name of a factor variable ("logical vector")? Default is FALSE. |
eventFocus |
Column name of the event of interest (type 1 event) ("character vector"). |
timeAsFactor |
Logical indicating whether time should be coded as a factor in the augmented data matrix("logical vector"). If FALSE, a numeric coding will be used. |
aggTimeFormat |
Instead of the usual long format, should every observation have all time intervals? ("logical vector") Default is standard long format. In the case of nonlinear risk score models, the time effect has to be integrated out before these can be applied to the C-index. |
lastTheoInt |
Gives the number of the last theoretic interval ("integer vector"). Only used, if aggTimeFormat==TRUE. |
This function sets up the augmented data matrix and the weights that are needed for weighted maximum likelihood (ML) estimation of the discrete subdistribution model proposed by Berger et al. (2018). The model is a discrete-time extension of the original subdistribution model proposed by Fine and Gray (1999).
Data frame with additional column "subDistWeights". The latter column contains the weights that are needed for fitting a weighted binary regression model, as described in Berger et al. (2018). The weights are calculated by a life table estimator for the censoring event.
Thomas Welchowski [email protected]
Berger M, Schmid M, Welchowski T, Schmitz-Valckenberg S, Beyersmann J (2020).
“Subdistribution Hazard Models for Competing Risks in Discrete Time.”
Biostatistics, 21, 449-466.
Fine JP, Gray RJ (2012).
“A Proportional Hazards Model for the Subdistribution of a Competing Risk.”
Journal of the American Statistical Association, 94, 496-509.
################################ # Example with unemployment data library(Ecdat) data(UnempDur) # Generate subsample, reduce number of intervals to k = 5 SubUnempDur <- UnempDur [1:500, ] SubUnempDur$time <- as.numeric(cut(SubUnempDur$spell, c(0,4,8,16,28))) # Convert competing risks data to long format # The event of interest is re-employment at full job SubUnempDurLong <- dataLongSubDist (dataShort=SubUnempDur, timeColumn = "time", eventColumns=c("censor1", "censor2", "censor3"), eventFocus="censor1") head(SubUnempDurLong) # Fit discrete subdistribution hazard model with logistic link function logisticSubDistr <- glm(y ~ timeInt + ui + age + logwage, family=binomial(), data = SubUnempDurLong, weights = SubUnempDurLong$subDistWeights) summary(logisticSubDistr) ######################################## # Simulation # Discrete subdistribution hazards model # Simulate covariates as multivariate normal distribution library(mvnfast) set.seed(1980) X <- mvnfast::rmvn(n = 1000, mu = rep(0, 4), sigma = diag(4)) # Specification of two discrete cause specific hazards with four intervals # Event 1 theoInterval <- 4 betaCoef_event1 <- seq(-1, 1, length.out = 5)[-3] timeInt_event1 <- seq(0.1, -0.1, length.out = theoInterval-1) linPred_event1 <- c(X %*% betaCoef_event1) # Event 2 betaCoef_event2 <- seq(-0.5, 0.5, length.out = 5)[-3] timeInt_event2 <- seq(-0.1, 0.1, length.out = theoInterval-1) linPred_event2 <- c(X %*% betaCoef_event2) # Discrete cause specific hazards in last theoretical interval theoHaz_event1 <- 0.5 theoHaz_event2 <- 0.5 # Derive discrete all cause hazard haz_event1_X <- cbind(sapply(1:length(timeInt_event1), function(x) exp(linPred_event1 + timeInt_event1[x]) / (1 + exp(linPred_event1 + timeInt_event1[x]) + exp(linPred_event2 + timeInt_event2[x])) ), theoHaz_event1) haz_event2_X <- cbind(sapply(1:length(timeInt_event2), function(x) exp(linPred_event2 + timeInt_event2[x]) / (1 + exp(linPred_event1 + timeInt_event1[x]) + exp(linPred_event2 + timeInt_event2[x]) ) ), theoHaz_event2) allCauseHaz_X <- haz_event1_X + haz_event2_X # Derive discrete cumulative incidence function of event 1 given covariates p_T_event1_X <- haz_event1_X * cbind(1, (1-allCauseHaz_X)[, -dim(allCauseHaz_X)[2]]) cumInc_event1_X <- t(sapply(1:dim(p_T_event1_X)[1], function(x) cumsum(p_T_event1_X[x, ]))) # Calculate all cause probability P(T=t | X) pT_X <- t(sapply(1:dim(allCauseHaz_X)[1], function(i) estMargProb(allCauseHaz_X[i, ]) )) # Calculate event probability given time interval P(R=r | T=t, X) pR_T_X_event1 <- haz_event1_X / (haz_event1_X + haz_event2_X) # Simulate discrete survival times survT <- sapply(1:dim(pT_X)[1], function(i) sample(x = 1:(length(timeInt_event1)+1), size = 1, prob = pT_X[i, ]) ) censT <- sample(x = 1:(length(timeInt_event1)+1), size = dim(pT_X)[1], prob = rep(1/(length(timeInt_event1) + 1), (length(timeInt_event1) + 1)), replace = TRUE) # Calculate observed times obsT <- ifelse(survT <= censT, survT, censT) obsEvent <- rep(0, length(obsT)) obsEvent <- sapply(1:length(obsT), function(i) if(survT[i] <= censT[i]){ return(sample(x = c(1, 2), size = 1, prob = c(pR_T_X_event1[i, obsT[i] ], 1 - pR_T_X_event1[i, obsT[i] ]) )) } else{ return(0) } ) # Recode last interval to censored lastInterval <- obsT == theoInterval obsT[lastInterval] <- theoInterval-1 obsEvent[lastInterval] <- 0 obsT <- factor(obsT) obsEvent <- factor(obsEvent) # Data preparation datShort <- data.frame(event = factor(obsEvent), time=obsT, X) # Conversion to long data format datLongSub <- dataLongSubDist(dataShort = datShort, timeColumn = "time", eventColumns = "event", eventFocus = 1, eventColumnsAsFactor = TRUE) # Estimate discrete subdistribution hazard model estSubModel <- glm(formula = y ~ timeInt + X1 + X2 + X3 + X4, data = datLongSub, family = binomial(link = "logit"), weights = datLongSub$subDistWeights) # Predict cumulative incidence function of first event predSubHaz1 <- predict(estSubModel, newdata = datLongSub[datLongSub$obj == 2, ], type = "response") mean(((1 - estSurv(predSubHaz1)) - cumInc_event1_X[2, 1:3])^2)
################################ # Example with unemployment data library(Ecdat) data(UnempDur) # Generate subsample, reduce number of intervals to k = 5 SubUnempDur <- UnempDur [1:500, ] SubUnempDur$time <- as.numeric(cut(SubUnempDur$spell, c(0,4,8,16,28))) # Convert competing risks data to long format # The event of interest is re-employment at full job SubUnempDurLong <- dataLongSubDist (dataShort=SubUnempDur, timeColumn = "time", eventColumns=c("censor1", "censor2", "censor3"), eventFocus="censor1") head(SubUnempDurLong) # Fit discrete subdistribution hazard model with logistic link function logisticSubDistr <- glm(y ~ timeInt + ui + age + logwage, family=binomial(), data = SubUnempDurLong, weights = SubUnempDurLong$subDistWeights) summary(logisticSubDistr) ######################################## # Simulation # Discrete subdistribution hazards model # Simulate covariates as multivariate normal distribution library(mvnfast) set.seed(1980) X <- mvnfast::rmvn(n = 1000, mu = rep(0, 4), sigma = diag(4)) # Specification of two discrete cause specific hazards with four intervals # Event 1 theoInterval <- 4 betaCoef_event1 <- seq(-1, 1, length.out = 5)[-3] timeInt_event1 <- seq(0.1, -0.1, length.out = theoInterval-1) linPred_event1 <- c(X %*% betaCoef_event1) # Event 2 betaCoef_event2 <- seq(-0.5, 0.5, length.out = 5)[-3] timeInt_event2 <- seq(-0.1, 0.1, length.out = theoInterval-1) linPred_event2 <- c(X %*% betaCoef_event2) # Discrete cause specific hazards in last theoretical interval theoHaz_event1 <- 0.5 theoHaz_event2 <- 0.5 # Derive discrete all cause hazard haz_event1_X <- cbind(sapply(1:length(timeInt_event1), function(x) exp(linPred_event1 + timeInt_event1[x]) / (1 + exp(linPred_event1 + timeInt_event1[x]) + exp(linPred_event2 + timeInt_event2[x])) ), theoHaz_event1) haz_event2_X <- cbind(sapply(1:length(timeInt_event2), function(x) exp(linPred_event2 + timeInt_event2[x]) / (1 + exp(linPred_event1 + timeInt_event1[x]) + exp(linPred_event2 + timeInt_event2[x]) ) ), theoHaz_event2) allCauseHaz_X <- haz_event1_X + haz_event2_X # Derive discrete cumulative incidence function of event 1 given covariates p_T_event1_X <- haz_event1_X * cbind(1, (1-allCauseHaz_X)[, -dim(allCauseHaz_X)[2]]) cumInc_event1_X <- t(sapply(1:dim(p_T_event1_X)[1], function(x) cumsum(p_T_event1_X[x, ]))) # Calculate all cause probability P(T=t | X) pT_X <- t(sapply(1:dim(allCauseHaz_X)[1], function(i) estMargProb(allCauseHaz_X[i, ]) )) # Calculate event probability given time interval P(R=r | T=t, X) pR_T_X_event1 <- haz_event1_X / (haz_event1_X + haz_event2_X) # Simulate discrete survival times survT <- sapply(1:dim(pT_X)[1], function(i) sample(x = 1:(length(timeInt_event1)+1), size = 1, prob = pT_X[i, ]) ) censT <- sample(x = 1:(length(timeInt_event1)+1), size = dim(pT_X)[1], prob = rep(1/(length(timeInt_event1) + 1), (length(timeInt_event1) + 1)), replace = TRUE) # Calculate observed times obsT <- ifelse(survT <= censT, survT, censT) obsEvent <- rep(0, length(obsT)) obsEvent <- sapply(1:length(obsT), function(i) if(survT[i] <= censT[i]){ return(sample(x = c(1, 2), size = 1, prob = c(pR_T_X_event1[i, obsT[i] ], 1 - pR_T_X_event1[i, obsT[i] ]) )) } else{ return(0) } ) # Recode last interval to censored lastInterval <- obsT == theoInterval obsT[lastInterval] <- theoInterval-1 obsEvent[lastInterval] <- 0 obsT <- factor(obsT) obsEvent <- factor(obsEvent) # Data preparation datShort <- data.frame(event = factor(obsEvent), time=obsT, X) # Conversion to long data format datLongSub <- dataLongSubDist(dataShort = datShort, timeColumn = "time", eventColumns = "event", eventFocus = 1, eventColumnsAsFactor = TRUE) # Estimate discrete subdistribution hazard model estSubModel <- glm(formula = y ~ timeInt + X1 + X2 + X3 + X4, data = datLongSub, family = binomial(link = "logit"), weights = datLongSub$subDistWeights) # Predict cumulative incidence function of first event predSubHaz1 <- predict(estSubModel, newdata = datLongSub[datLongSub$obj == 2, ], type = "response") mean(((1 - estSurv(predSubHaz1)) - cumInc_event1_X[2, 1:3])^2)
Transforms short data format to long format for discrete survival modelling of single event analysis with right censoring. Covariates may vary over time.
dataLongTimeDep( dataSemiLong, timeColumn, eventColumn, idColumn, timeAsFactor = FALSE )
dataLongTimeDep( dataSemiLong, timeColumn, eventColumn, idColumn, timeAsFactor = FALSE )
dataSemiLong |
Original data in semi-long format ("class data.frame"). |
timeColumn |
Character giving the column name of the observed times ("character vector"). It is required that the observed times are discrete ("integer vector"). |
eventColumn |
Column name of the event indicator ("character vector"). It is required that this is a binary variable with 1=="event" and 0=="censored". |
idColumn |
Name of column of identification number of persons ("character vector"). |
timeAsFactor |
Should the time intervals be coded as factor ("logical vector")? Default is FALSE. In case of default settings the discrete time intervals are treated as quantitative ("numeric vector"). |
There may be some intervals, where no additional information on the covariates is observed (e. g. observed values in interval one and three but two is missing). In this case it is assumed, that the values from the last observation stay constant over time until a new measurement was done.
In contrast to continuous survival (see e. g. Surv
)
the start and stop time notation is not used here. In discrete time survival analysis the only relevant
information is to use the stop time. Start time does not matter, because all discrete intervals need to be
included in the long data set format to ensure consistent estimation. It is assumed that the supplied
data set "dataSemiLong" contains all repeated measurements of each cluster in semi-long format (e. g. persons).
For further information see example Start-stop notation.
Original data in long format with three additional columns:
obj Index of persons as integer vector
timeInt Index of time intervals (factor)
y Response in long format as binary vector. 1=="event happens in period timeInt" and zero otherwise
Thomas Welchowski [email protected]
Tutz G, Schmid M (2016).
Modeling discrete time-to-event data.
Springer Series in Statistics.
Fahrmeir L (2005).
“Discrete Survival-Time Models.”
In Encyclopedia of Biostatistics, chapter Survival Analysis.
John Wiley \& Sons.
Thompson Jr. WA (1977).
“On the Treatment of Grouped Observations in Life Studies.”
Biometrics, 33, 463-470.
contToDisc
, dataLong
,
dataLongCompRisks
# Example Primary Biliary Cirrhosis data library(survival) dataSet1 <- pbcseq # Only event death is of interest dataSet1$status [dataSet1$status == 1] <- 0 dataSet1$status [dataSet1$status == 2] <- 1 table(dataSet1$status) # Convert to months dataSet1$day <- ceiling(dataSet1$day/30) + 1 names(dataSet1) [7] <- "month" # Convert to long format for time varying effects pbcseqLong <- dataLongTimeDep (dataSemiLong = dataSet1, timeColumn = "month", eventColumn = "status", idColumn = "id") pbcseqLong [pbcseqLong$obj == 1, ] ##################### # Start-stop notation library(survival) ?survival::heart # Assume that time was measured on a discrete scale. # Discrete interval lengths are assumed to vary. intervalLimits <- quantile(heart$stop, probs = seq(0.1, 1, by=0.1)) intervalLimits[length(intervalLimits)] <- intervalLimits[length(intervalLimits)] + 1 heart_disc <- contToDisc(dataShort = heart, timeColumn = "stop", intervalLimits = intervalLimits, equi = FALSE) table(heart_disc$timeDisc) # Conversion to long format heart_disc_long <- dataLongTimeDep(dataSemiLong = heart_disc, timeColumn = "timeDisc", eventColumn = "event", idColumn = "id") head(heart_disc_long)
# Example Primary Biliary Cirrhosis data library(survival) dataSet1 <- pbcseq # Only event death is of interest dataSet1$status [dataSet1$status == 1] <- 0 dataSet1$status [dataSet1$status == 2] <- 1 table(dataSet1$status) # Convert to months dataSet1$day <- ceiling(dataSet1$day/30) + 1 names(dataSet1) [7] <- "month" # Convert to long format for time varying effects pbcseqLong <- dataLongTimeDep (dataSemiLong = dataSet1, timeColumn = "month", eventColumn = "status", idColumn = "id") pbcseqLong [pbcseqLong$obj == 1, ] ##################### # Start-stop notation library(survival) ?survival::heart # Assume that time was measured on a discrete scale. # Discrete interval lengths are assumed to vary. intervalLimits <- quantile(heart$stop, probs = seq(0.1, 1, by=0.1)) intervalLimits[length(intervalLimits)] <- intervalLimits[length(intervalLimits)] + 1 heart_disc <- contToDisc(dataShort = heart, timeColumn = "stop", intervalLimits = intervalLimits, equi = FALSE) table(heart_disc$timeDisc) # Conversion to long format heart_disc_long <- dataLongTimeDep(dataSemiLong = heart_disc, timeColumn = "timeDisc", eventColumn = "event", idColumn = "id") head(heart_disc_long)
Computes the root of the deviance residuals for evaluation of performance in discrete survival analysis.
devResid(dataLong, hazards)
devResid(dataLong, hazards)
dataLong |
Original data in long format ("class data.frame").
The correct format can be specified with data preparation, see e. g.
|
hazards |
Estimated discrete hazards of the data in long format("numeric vector"). Discrete discrete hazards are probabilities and therefore restricted to the interval [0, 1]. |
Output List with objects:
DevResid Square root of deviance residuals as numeric vector.
Input A list of given argument input values (saved for reference)
Thomas Welchowski [email protected]
Tutz G, Schmid M (2016).
Modeling discrete time-to-event data.
Springer Series in Statistics.
Tutz G (2012).
Regression for Categorical Data.
Cambridge University Press.
library(survival) # Transform data to long format heart[, "stop"] <- ceiling(heart[, "stop"]) set.seed(0) Indizes <- sample(unique(heart$id), 25) randSample <- heart[unlist(sapply(1:length(Indizes), function(x) which(heart$id == Indizes[x]))),] heartLong <- dataLongTimeDep(dataSemiLong = randSample, timeColumn = "stop", eventColumn = "event", idColumn = "id", timeAsFactor = FALSE) # Fit a generalized, additive model and predict discrete hazards on data in long format library(mgcv) gamFit <- gam(y ~ timeInt + surgery + transplant + s(age), data = heartLong, family = "binomial") hazPreds <- predict(gamFit, type = "response") # Calculate the deviance residuals devResiduals <- devResid (dataLong = heartLong, hazards = hazPreds)$Output$DevResid # Compare with estimated normal distribution plot(density(devResiduals), main = "Empirical density vs estimated normal distribution", las = 1, ylim = c(0, 0.5)) tempFunc <- function (x) dnorm(x, mean = mean(devResiduals), sd = sd(devResiduals)) curve(tempFunc, xlim = c(-10, 10), add = TRUE, col = "red") # The empirical density seems like a mixture distribution, # but is not too far off in with values greater than 3 and less than 1
library(survival) # Transform data to long format heart[, "stop"] <- ceiling(heart[, "stop"]) set.seed(0) Indizes <- sample(unique(heart$id), 25) randSample <- heart[unlist(sapply(1:length(Indizes), function(x) which(heart$id == Indizes[x]))),] heartLong <- dataLongTimeDep(dataSemiLong = randSample, timeColumn = "stop", eventColumn = "event", idColumn = "id", timeAsFactor = FALSE) # Fit a generalized, additive model and predict discrete hazards on data in long format library(mgcv) gamFit <- gam(y ~ timeInt + surgery + transplant + s(age), data = heartLong, family = "binomial") hazPreds <- predict(gamFit, type = "response") # Calculate the deviance residuals devResiduals <- devResid (dataLong = heartLong, hazards = hazPreds)$Output$DevResid # Compare with estimated normal distribution plot(density(devResiduals), main = "Empirical density vs estimated normal distribution", las = 1, ylim = c(0, 0.5)) tempFunc <- function (x) dnorm(x, mean = mean(devResiduals), sd = sd(devResiduals)) curve(tempFunc, xlim = c(-10, 10), add = TRUE, col = "red") # The empirical density seems like a mixture distribution, # but is not too far off in with values greater than 3 and less than 1
Estimates the cumulative incidence function of a discrete time competing risks model given covariates P(T <= t, event = k | x).
estCumInz(hazards, eventFocus)
estCumInz(hazards, eventFocus)
hazards |
Estimated discrete hazard rates of all events ("numeric matrix"). Each column represents one event. The first column is assumed to contain the censoring case and the discrete hazards should only vary over time in each row. |
eventFocus |
Column that represent the discrete hazards of the primary event ("integer vector"). |
The covariates set is required to be constant across rows.
Returns cumulative incidence function of the primary event. If argument nonparCI is set to TRUE, then a list is returned: The first element includes the cumulative incidence function. The second list element contains the lower and the third list element the upper bound of the pointwise confidence intervals.
Thomas Welchowski [email protected]
Lee M, Feuer EJ, Fine JP (2018). “On the analysis of discrete time competing risks data.” Biometrics, 74, 1468-1481.
compRisksGEE
, dataLongCompRisks
,
dataLongCompRisksTimeDep
, geeglm
,
# Example with unemployment data library(Ecdat) data(UnempDur) # Select subsample SubUnempDur <- UnempDur [1:100, ] # Estimate GEE models for all events estGEE <- compRisksGEE(datShort = SubUnempDur, dataTransform = "dataLongCompRisks", corstr = "independence", formulaVariable =~ timeInt + age + ui + logwage * ui, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") # Estimate hazards of all events given the covariates of third person SubUnempDurLong <- dataLongCompRisks(dataShort = SubUnempDur, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") preds <- predict(estGEE, subset(SubUnempDurLong, obj == 3)) # Estimate cumulative incidence function cumInzGEE <- estCumInz(preds, eventFocus = 2) cumInzGEE
# Example with unemployment data library(Ecdat) data(UnempDur) # Select subsample SubUnempDur <- UnempDur [1:100, ] # Estimate GEE models for all events estGEE <- compRisksGEE(datShort = SubUnempDur, dataTransform = "dataLongCompRisks", corstr = "independence", formulaVariable =~ timeInt + age + ui + logwage * ui, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") # Estimate hazards of all events given the covariates of third person SubUnempDurLong <- dataLongCompRisks(dataShort = SubUnempDur, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") preds <- predict(estGEE, subset(SubUnempDurLong, obj == 3)) # Estimate cumulative incidence function cumInzGEE <- estCumInz(preds, eventFocus = 2) cumInzGEE
Estimates the marginal probability P(T=t|x) based on estimated discrete hazards. The discrete hazards may or may not depend on covariates. The covariates have to be equal across all estimated hazard rates. Therefore the given discrete hazards should only vary over time.
estMargProb(hazards)
estMargProb(hazards)
hazards |
Estimated discrete hazards ("numeric vector") |
The argument hazards must be given for all intervals [a_0, a_1), [a_1, a_2), ..., [a_q-1, a_q), [a_q, Inf).
Estimated marginal probabilities ("numeric vector")
It is assumed that all time points up to the last interval [a_q, Inf) are available. If not already present, these can be added manually.
Thomas Welchowski [email protected]
Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.
# Example unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] # Convert to long format UnempLong <- dataLong(dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1") head(UnempLong) # Estimate binomial model with logit link Fit <- glm(formula = y ~ timeInt + age + logwage, data = UnempLong, family = binomial()) # Estimate discrete survival function given age, logwage of first person hazard <- predict(Fit, newdata = subset(UnempLong, obj == 1), type = "response") # Estimate marginal probabilities given age, logwage of first person MarginalProbCondX <- estMargProb (c(hazard, 1)) MarginalProbCondX sum(MarginalProbCondX)==1 # TRUE: Marginal probabilities must sum to 1!
# Example unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] # Convert to long format UnempLong <- dataLong(dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1") head(UnempLong) # Estimate binomial model with logit link Fit <- glm(formula = y ~ timeInt + age + logwage, data = UnempLong, family = binomial()) # Estimate discrete survival function given age, logwage of first person hazard <- predict(Fit, newdata = subset(UnempLong, obj == 1), type = "response") # Estimate marginal probabilities given age, logwage of first person MarginalProbCondX <- estMargProb (c(hazard, 1)) MarginalProbCondX sum(MarginalProbCondX)==1 # TRUE: Marginal probabilities must sum to 1!
Estimates the marginal probability P(T = t,R = r|x) based on estimated discrete hazards of a competing risks model. The discrete hazards may or may not depend on covariates. The covariates have to be equal across all estimated discrete hazards. Therefore the given discrete hazards should only vary over time.
estMargProbCompRisks(hazards)
estMargProbCompRisks(hazards)
hazards |
Estimated discrete hazards ("numeric matrix"). Discrete hazards of each time interval are stored in the rows and the number of columns equal to the number of events. |
The argument hazards must be given for all intervals [a_0, a_1), [a_1, a_2), ..., [a_q-1, a_q), [a_q, Inf).
Estimated marginal probabilities ("numeric matrix")
It is assumed that all time points up to the last interval [a_q, Inf) are available. If not already present, these can be added manually. In competing risk settings the marginal probabilities of the last theoretical interval depend on the assumptions on the discrete hazards in the last theoretical interval. However the estimation of all previous discrete intervals is not affected by those assumptions.
Moritz Berger [email protected]
https://www.imbie.uni-bonn.de/personen/dr-moritz-berger/
Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.
# Example unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] # Convert to long format UnempLong <- dataLongCompRisks(dataShort = subUnempDur, timeColumn = "spell", eventColumns = c("censor1", "censor4")) head(UnempLong) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = UnempLong, family = VGAM::multinomial(refLevel = "e0")) # Estimate discrete survival function given age, logwage of first person hazards <- VGAM::predictvglm(vglmFit, newdata = subset(UnempLong, obj == 1), type = "response")[,-1] # Estimate marginal probabilities given age, logwage of first person # Example 1 # Assumption: Discrete hazards in last theoretical interval are equal for both event types MarginalProbCondX <- estMargProbCompRisks(rbind(hazards, 0.5)) MarginalProbCondX all.equal(sum(MarginalProbCondX), 1) # TRUE: Marginal probabilities must sum to 1! # Example 2 # Assumption: Discrete hazards in last theoretical interval are event1=, event2= MarginalProbCondX2 <- estMargProbCompRisks(rbind(hazards, c(0.75, 0.25))) MarginalProbCondX2 all.equal(sum(MarginalProbCondX2), 1) # TRUE: Marginal probabilities must sum to 1! # Compare marginal probabilities given X all.equal(MarginalProbCondX[1:5, ], MarginalProbCondX2[1:5, ]) all.equal(MarginalProbCondX[6, ], MarginalProbCondX2[6, ])
# Example unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] # Convert to long format UnempLong <- dataLongCompRisks(dataShort = subUnempDur, timeColumn = "spell", eventColumns = c("censor1", "censor4")) head(UnempLong) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = UnempLong, family = VGAM::multinomial(refLevel = "e0")) # Estimate discrete survival function given age, logwage of first person hazards <- VGAM::predictvglm(vglmFit, newdata = subset(UnempLong, obj == 1), type = "response")[,-1] # Estimate marginal probabilities given age, logwage of first person # Example 1 # Assumption: Discrete hazards in last theoretical interval are equal for both event types MarginalProbCondX <- estMargProbCompRisks(rbind(hazards, 0.5)) MarginalProbCondX all.equal(sum(MarginalProbCondX), 1) # TRUE: Marginal probabilities must sum to 1! # Example 2 # Assumption: Discrete hazards in last theoretical interval are event1=, event2= MarginalProbCondX2 <- estMargProbCompRisks(rbind(hazards, c(0.75, 0.25))) MarginalProbCondX2 all.equal(sum(MarginalProbCondX2), 1) # TRUE: Marginal probabilities must sum to 1! # Compare marginal probabilities given X all.equal(MarginalProbCondX[1:5, ], MarginalProbCondX2[1:5, ]) all.equal(MarginalProbCondX[6, ], MarginalProbCondX2[6, ])
Fits a logistic recalibration model to independent test data. It updates the intercept and slope parameters. It is assumed that the factor levels of time are equal in both training and validation data. Time dependent covariates, discrete cause specific competing risks and subdistribution hazards are also supported.
estRecal(testLinPred, testDataLong, weights = NULL)
estRecal(testLinPred, testDataLong, weights = NULL)
testLinPred |
Calculated linear predictor on the validation data with model fitted on training data ("numeric vector"). |
testDataLong |
Validation data set in long format ("class data.frame"). |
weights |
Weights used in estimation of the logistic recalibration model ("numeric vector"). Default is no weighting (NULL). |
Updates estimated hazards of discrete survival models to better adapt to a different environment. If there are substantial environment changes the predicted probabilities will differ between two environments. Logistic recalibration may be used to improve the calibration of predicted probabilities by incorperating information from the existing model and data from the environment. This approach works for any survival prediction model with one event that provides linear predictors.
Continuation ratio model that calibrates estimated discrete hazards to new validation environment ("class glm, lm").
Thomas Welchowski [email protected]
Heyard R, Timsit J, Held L, COMBACTE-MAGNET,consortium (2019). “Validation of discrete time-to-event prediction models in the presence of competing risks.” Biometrical Journal, 62, 643-657.
(Calibration plots links) dataLong
, dataLongTimeDep
, dataLongCompRisks
,
dataLongCompRisksTimeDep
, dataLongSubDist
#################### # Data preprocessing # Example unemployment data library(Ecdat) data(UnempDur) # Select subsample selectInd1 <- 1:100 selectInd2 <- 101:200 trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ] valSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ] #################### # One event # Convert to long format trainSet_long <- dataLong(dataShort = trainSet, timeColumn = "spell", eventColumn = "censor1") valSet_long <- dataLong(dataShort = valSet, timeColumn = "spell", eventColumn = "censor1") # Estimate continuation ratio model with logit link glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, family = binomial()) # Calculate linear predictors on validation set linPred <- predict(glmFit, newdata = valSet_long, type = "link") # Estimate logistic recalibration model recalModel <- estRecal(testLinPred = linPred, testDataLong = valSet_long) summary(recalModel) # Calibration plots hazOrg <- predict(glmFit, newdata = valSet_long, type = "response") hazRecal <- predict(recalModel, newdata = data.frame(linPred), type = "response") # Before logistic recalibration calibrationPlot(hazOrg, testDataLong = valSet_long) # After logistic recalibration calibrationPlot(hazRecal, testDataLong = valSet_long) ############################ # Two cause specific hazards library(VGAM) # Convert to long format trainSet_long <- dataLongCompRisks(dataShort = trainSet, timeColumn = "spell", eventColumns = c("censor1", "censor4")) valSet_long <- dataLongCompRisks(dataShort = valSet, timeColumn = "spell", eventColumns = c("censor1", "censor4")) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = trainSet_long, family = VGAM::multinomial(refLevel = "e0")) # Calculate linear predictors on training and test set linPred <- VGAM::predictvglm(vglmFit, newdata = valSet_long, type = "link") # Estimate logistic recalibration model recalModel <- estRecal(testLinPred = linPred, testDataLong = valSet_long) recalModel # Calibration plots hazOrg <- predict(vglmFit, newdata = valSet_long, type = "response") predDat <- as.data.frame(linPred) names(predDat) <- recalModel@misc$colnames.x[-1] hazRecal <- predictvglm(recalModel, newdata = predDat, type = "response") # Before logistic recalibration calibrationPlot(hazOrg, testDataLong = valSet_long, event = "e1") calibrationPlot(hazOrg, testDataLong = valSet_long, event = "e2") # After logistic recalibration calibrationPlot(hazRecal, testDataLong = valSet_long, event = "e1") calibrationPlot(hazRecal, testDataLong = valSet_long, event = "e2") ############################### # Subdistribution hazards model # Convert to long format trainSet_long <- dataLongSubDist(dataShort = trainSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") valSet_long <- dataLongSubDist(dataShort = valSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") # Estimate continuation ratio model with logit link glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, family = binomial(), weights = trainSet_long$subDistWeights) # Calculate linear predictors on training and test set linPred <- predict(glmFit, newdata = valSet_long, type = "link") # Estimate logistic recalibration model recalModel <- estRecal(testLinPred = linPred, testDataLong = valSet_long, weights = valSet_long$subDistWeights) recalModel # Calibration plots hazOrg <- predict(glmFit, newdata = valSet_long, type = "response", weights = valSet_long$subDistWeights) hazRecal <- predict(recalModel, newdata = data.frame(linPred), type = "response", weights = valSet_long$subDistWeights) # Before logistic recalibration calibrationPlot(hazOrg, testDataLong = valSet_long, weights=valSet_long$subDistWeights) # After logistic recalibration calibrationPlot(hazRecal, testDataLong = valSet_long, weights=valSet_long$subDistWeights)
#################### # Data preprocessing # Example unemployment data library(Ecdat) data(UnempDur) # Select subsample selectInd1 <- 1:100 selectInd2 <- 101:200 trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ] valSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ] #################### # One event # Convert to long format trainSet_long <- dataLong(dataShort = trainSet, timeColumn = "spell", eventColumn = "censor1") valSet_long <- dataLong(dataShort = valSet, timeColumn = "spell", eventColumn = "censor1") # Estimate continuation ratio model with logit link glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, family = binomial()) # Calculate linear predictors on validation set linPred <- predict(glmFit, newdata = valSet_long, type = "link") # Estimate logistic recalibration model recalModel <- estRecal(testLinPred = linPred, testDataLong = valSet_long) summary(recalModel) # Calibration plots hazOrg <- predict(glmFit, newdata = valSet_long, type = "response") hazRecal <- predict(recalModel, newdata = data.frame(linPred), type = "response") # Before logistic recalibration calibrationPlot(hazOrg, testDataLong = valSet_long) # After logistic recalibration calibrationPlot(hazRecal, testDataLong = valSet_long) ############################ # Two cause specific hazards library(VGAM) # Convert to long format trainSet_long <- dataLongCompRisks(dataShort = trainSet, timeColumn = "spell", eventColumns = c("censor1", "censor4")) valSet_long <- dataLongCompRisks(dataShort = valSet, timeColumn = "spell", eventColumns = c("censor1", "censor4")) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = trainSet_long, family = VGAM::multinomial(refLevel = "e0")) # Calculate linear predictors on training and test set linPred <- VGAM::predictvglm(vglmFit, newdata = valSet_long, type = "link") # Estimate logistic recalibration model recalModel <- estRecal(testLinPred = linPred, testDataLong = valSet_long) recalModel # Calibration plots hazOrg <- predict(vglmFit, newdata = valSet_long, type = "response") predDat <- as.data.frame(linPred) names(predDat) <- recalModel@misc$colnames.x[-1] hazRecal <- predictvglm(recalModel, newdata = predDat, type = "response") # Before logistic recalibration calibrationPlot(hazOrg, testDataLong = valSet_long, event = "e1") calibrationPlot(hazOrg, testDataLong = valSet_long, event = "e2") # After logistic recalibration calibrationPlot(hazRecal, testDataLong = valSet_long, event = "e1") calibrationPlot(hazRecal, testDataLong = valSet_long, event = "e2") ############################### # Subdistribution hazards model # Convert to long format trainSet_long <- dataLongSubDist(dataShort = trainSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") valSet_long <- dataLongSubDist(dataShort = valSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") # Estimate continuation ratio model with logit link glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, family = binomial(), weights = trainSet_long$subDistWeights) # Calculate linear predictors on training and test set linPred <- predict(glmFit, newdata = valSet_long, type = "link") # Estimate logistic recalibration model recalModel <- estRecal(testLinPred = linPred, testDataLong = valSet_long, weights = valSet_long$subDistWeights) recalModel # Calibration plots hazOrg <- predict(glmFit, newdata = valSet_long, type = "response", weights = valSet_long$subDistWeights) hazRecal <- predict(recalModel, newdata = data.frame(linPred), type = "response", weights = valSet_long$subDistWeights) # Before logistic recalibration calibrationPlot(hazOrg, testDataLong = valSet_long, weights=valSet_long$subDistWeights) # After logistic recalibration calibrationPlot(hazRecal, testDataLong = valSet_long, weights=valSet_long$subDistWeights)
Estimates the survival function S(T = t|x) based on estimated hazard rates. The hazard rates may or may not depend on covariates. The covariates have to be equal across all estimated hazard rates. Therefore the given hazard rates should only vary over time.
estSurv(haz)
estSurv(haz)
haz |
Estimated hazard rates ("numeric vector") |
The argument haz must be given for all intervals [a_0, a_1), [a_1, a_2), ..., [a_q-1, a_q), [a_q, Inf).
Estimated probabilities of survival ("numeric vector")
It is assumed that all time points up to the last theoretical interval [a_q, Inf) are available. If not already present, these can be added manually.
Thomas Welchowski [email protected]
Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.
# Example unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] # Convert to long format UnempLong <- dataLong(dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1") head(UnempLong) # Estimate binomial model with logit link Fit <- glm(formula = y ~ timeInt + age + logwage, data=UnempLong, family = binomial()) # Estimate discrete survival function given age, logwage of first person hazard <- predict(Fit, newdata = subset(UnempLong, obj == 1), type = "response") SurvivalFuncCondX <- estSurv(c(hazard, 1)) SurvivalFuncCondX
# Example unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] # Convert to long format UnempLong <- dataLong(dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1") head(UnempLong) # Estimate binomial model with logit link Fit <- glm(formula = y ~ timeInt + age + logwage, data=UnempLong, family = binomial()) # Estimate discrete survival function given age, logwage of first person hazard <- predict(Fit, newdata = subset(UnempLong, obj == 1), type = "response") SurvivalFuncCondX <- estSurv(c(hazard, 1)) SurvivalFuncCondX
Estimates the marginal survival function G(T=t) of the censoring process based on a life table estimator. Compatible with single event and competing risks data.
estSurvCens(dataShort, timeColumn, eventColumns)
estSurvCens(dataShort, timeColumn, eventColumns)
dataShort |
Data in original short format ("class data.frame"). |
timeColumn |
Name of column with discrete time intervals ("character vector"). |
eventColumns |
Names of the event columns of |
Named vector of estimated survival function of the censoring process for all time points except the last theoretical interval.
In the censoring survival function the last time interval [a_q, Inf) has the probability of zero.
Thomas Welchowski [email protected]
Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.
# Load unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] ###################### # Single event example # Estimate censoring survival function G(t) estG <- estSurvCens(dataShort = subUnempDur, timeColumn = "spell", eventColumns = "censor1") estG ######################### # Competing risks example # Estimate censoring survival function G(t) estG <- estSurvCens(dataShort = subUnempDur, timeColumn = "spell", eventColumns = c("censor1", "censor2", "censor3", "censor4")) estG
# Load unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] ###################### # Single event example # Estimate censoring survival function G(t) estG <- estSurvCens(dataShort = subUnempDur, timeColumn = "spell", eventColumns = "censor1") estG ######################### # Competing risks example # Estimate censoring survival function G(t) estG <- estSurvCens(dataShort = subUnempDur, timeColumn = "spell", eventColumns = c("censor1", "censor2", "censor3", "censor4")) estG
Computes the survival function S(T>t|x) based on estimated hazards of a competing risks model. The discrete hazards may or may not depend on covariates. The covariates have to be equal across all estimated hazards. Therefore the given discrete hazards should only vary over time.
estSurvCompRisks(hazards)
estSurvCompRisks(hazards)
hazards |
Estimated discrete hazards ("numeric matrix"). Discrete hazards of each time interval are stored in the rows and the number of columns equal to the number of events. |
The argument hazards must be given for all intervals [a_0, a_1), [a_1, a_2), ..., [a_q-1, a_q), [a_q, Inf).
Estimated survival probabilities ("numeric vector")
It is assumed that all time points up to the last interval [a_q, Inf) are available. If not already present, these can be added manually.
Moritz Berger [email protected]
https://www.imbie.uni-bonn.de/personen/dr-moritz-berger/
Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.
# Example unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] # Convert to long format UnempLong <- dataLongCompRisks(dataShort = subUnempDur, timeColumn = "spell", eventColumns = c("censor1", "censor4")) head(UnempLong) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = UnempLong, family = VGAM::multinomial(refLevel = "e0")) # Estimate discrete survival function given age, logwage of first person hazards <- VGAM::predictvglm(vglmFit, newdata = subset(UnempLong, obj == 1), type = "response")[,-1] SurvivalFuncCondX <- estSurvCompRisks(rbind(hazards, 0.5)) SurvivalFuncCondX
# Example unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] # Convert to long format UnempLong <- dataLongCompRisks(dataShort = subUnempDur, timeColumn = "spell", eventColumns = c("censor1", "censor4")) head(UnempLong) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = UnempLong, family = VGAM::multinomial(refLevel = "e0")) # Estimate discrete survival function given age, logwage of first person hazards <- VGAM::predictvglm(vglmFit, newdata = subset(UnempLong, obj == 1), type = "response")[,-1] SurvivalFuncCondX <- estSurvCompRisks(rbind(hazards, 0.5)) SurvivalFuncCondX
Constructs the link function with gumbel distribution in approriate format for use in generalized, linear models.
gumbel()
gumbel()
Insert this function into a binary regression model
Thomas Welchowski [email protected]
Matthias Schmid [email protected]
Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.
# Example with copenhagen stroke study library(pec) data(cost) head(cost) # Take subsample and convert time to months costSub <- cost [1:50, ] costSub$time <- ceiling(costSub$time/30) costLong <- dataLong(dataShort = costSub, timeColumn = "time", eventColumn = "status", timeAsFactor=TRUE) gumbelModel <- glm(formula = y ~ timeInt + diabetes, data = costLong, family = binomial(link = gumbel())) # Estimate hazard given prevStroke and no prevStroke hazPrevStroke <- predict(gumbelModel, newdata=data.frame(timeInt = factor(1:143), diabetes = factor(rep("yes", 143), levels = c("no", "yes"))), type = "response") hazWoPrevStroke <- predict(gumbelModel, newdata = data.frame(timeInt = factor(1:143), diabetes=factor(rep("no", 143), levels = c("no", "yes"))), type = "response") # Estimate survival function SurvPrevStroke <- cumprod(1 - hazPrevStroke) SurvWoPrevStroke <- cumprod(1 - hazWoPrevStroke) # Example graphics of survival curves with and without diabetes plot(x = 1:143, y = SurvWoPrevStroke, type = "l", xlab = "Months", ylab = "S (t|x)", las = 1, lwd = 2, ylim = c(0,1)) lines(x = 1:143, y = SurvPrevStroke, col = "red", lwd = 2) legend("topright", legend = c("Without diabetes", "Diabetes"), lty = 1, lwd =2, col = c("black", "red"))
# Example with copenhagen stroke study library(pec) data(cost) head(cost) # Take subsample and convert time to months costSub <- cost [1:50, ] costSub$time <- ceiling(costSub$time/30) costLong <- dataLong(dataShort = costSub, timeColumn = "time", eventColumn = "status", timeAsFactor=TRUE) gumbelModel <- glm(formula = y ~ timeInt + diabetes, data = costLong, family = binomial(link = gumbel())) # Estimate hazard given prevStroke and no prevStroke hazPrevStroke <- predict(gumbelModel, newdata=data.frame(timeInt = factor(1:143), diabetes = factor(rep("yes", 143), levels = c("no", "yes"))), type = "response") hazWoPrevStroke <- predict(gumbelModel, newdata = data.frame(timeInt = factor(1:143), diabetes=factor(rep("no", 143), levels = c("no", "yes"))), type = "response") # Estimate survival function SurvPrevStroke <- cumprod(1 - hazPrevStroke) SurvWoPrevStroke <- cumprod(1 - hazWoPrevStroke) # Example graphics of survival curves with and without diabetes plot(x = 1:143, y = SurvWoPrevStroke, type = "l", xlab = "Months", ylab = "S (t|x)", las = 1, lwd = 2, ylim = c(0,1)) lines(x = 1:143, y = SurvPrevStroke, col = "red", lwd = 2) legend("topright", legend = c("Without diabetes", "Diabetes"), lty = 1, lwd =2, col = c("black", "red"))
Computes the integrated prediction error curve for discrete survival models.
intPredErr( hazards, testTime, testEvent, trainTime, trainEvent, testDataLong, tmax = NULL )
intPredErr( hazards, testTime, testEvent, trainTime, trainEvent, testDataLong, tmax = NULL )
hazards |
Predicted discrete hazards in the test data ("numeric vector"). |
testTime |
Discrete time intervals in short format of the test set ("integer vector"). |
testEvent |
Events in short format in the test set ("binary vector"). |
trainTime |
Discrete time intervals in short format of the training data set ("integer vector"). |
trainEvent |
Events in short format in the training set ("binary vector"). |
testDataLong |
Test data in long format("class data.frame"). The discrete survival function is
calculated based on the predicted hazards. It is assumed that the data was
preprocessed with a function with prefix "dataLong", see e. g.
|
tmax |
Gives the maximum time interval for which prediction errors are calculated ("integer vector"). It must be smaller than the maximum observed time in the training data of the object produced by function. The default setting NULL means, that all observed intervals are used. |
Integrated prediction error ("numeric vector").
Thomas Welchowski [email protected]
Tutz G, Schmid M (2016).
Modeling discrete time-to-event data.
Springer Series in Statistics.
Gneiting T, Raftery AE (2007).
“Strictly Proper Scoring Rules, Prediction, and Estimation.”
Journal of the American Statistical Association, 102, 359-378.
########################## # Example with cancer data library(survival) head(cancer) # Data preparation and convertion to 30 intervals cancerPrep <- cancer cancerPrep$status <- cancerPrep$status-1 intLim <- quantile(cancerPrep$time, prob = seq(0, 1, length.out = 30)) intLim [length(intLim)] <- intLim [length(intLim)] + 1 # Cut discrete time in smaller number of intervals cancerPrep <- contToDisc(dataShort = cancerPrep, timeColumn = "time", intervalLimits = intLim) # Generate training and test data set.seed(753) TrainIndices <- sample (x = 1:dim(cancerPrep) [1], size = dim(cancerPrep) [1] * 0.75) TrainCancer <- cancerPrep [TrainIndices, ] TestCancer <- cancerPrep [-TrainIndices, ] TrainCancer$timeDisc <- as.numeric(as.character(TrainCancer$timeDisc)) TestCancer$timeDisc <- as.numeric(as.character(TestCancer$timeDisc)) # Convert to long format LongTrain <- dataLong(dataShort = TrainCancer, timeColumn = "timeDisc", eventColumn = "status", timeAsFactor=FALSE) LongTest <- dataLong(dataShort = TestCancer, timeColumn = "timeDisc", eventColumn = "status", timeAsFactor=FALSE) # Convert factors LongTrain$timeInt <- as.numeric(as.character(LongTrain$timeInt)) LongTest$timeInt <- as.numeric(as.character(LongTest$timeInt)) LongTrain$sex <- factor(LongTrain$sex) LongTest$sex <- factor(LongTest$sex) # Estimate, for example, a generalized, additive model in discrete survival analysis library(mgcv) gamFit <- gam (formula = y ~ s(timeInt) + s(age) + sex + ph.ecog, data = LongTrain, family = binomial()) summary(gamFit) # 1. Specification of predicted discrete hazards # Estimate survival function of each person in the test data testPredHaz <- predict(gamFit, newdata = LongTest, type = "response") # 2. Calculate integrated prediction error intPredErr(hazards = testPredHaz, testTime = TestCancer$timeDisc, testEvent = TestCancer$status, trainTime = TrainCancer$timeDisc, trainEvent = TrainCancer$status, testDataLong = LongTest)
########################## # Example with cancer data library(survival) head(cancer) # Data preparation and convertion to 30 intervals cancerPrep <- cancer cancerPrep$status <- cancerPrep$status-1 intLim <- quantile(cancerPrep$time, prob = seq(0, 1, length.out = 30)) intLim [length(intLim)] <- intLim [length(intLim)] + 1 # Cut discrete time in smaller number of intervals cancerPrep <- contToDisc(dataShort = cancerPrep, timeColumn = "time", intervalLimits = intLim) # Generate training and test data set.seed(753) TrainIndices <- sample (x = 1:dim(cancerPrep) [1], size = dim(cancerPrep) [1] * 0.75) TrainCancer <- cancerPrep [TrainIndices, ] TestCancer <- cancerPrep [-TrainIndices, ] TrainCancer$timeDisc <- as.numeric(as.character(TrainCancer$timeDisc)) TestCancer$timeDisc <- as.numeric(as.character(TestCancer$timeDisc)) # Convert to long format LongTrain <- dataLong(dataShort = TrainCancer, timeColumn = "timeDisc", eventColumn = "status", timeAsFactor=FALSE) LongTest <- dataLong(dataShort = TestCancer, timeColumn = "timeDisc", eventColumn = "status", timeAsFactor=FALSE) # Convert factors LongTrain$timeInt <- as.numeric(as.character(LongTrain$timeInt)) LongTest$timeInt <- as.numeric(as.character(LongTest$timeInt)) LongTrain$sex <- factor(LongTrain$sex) LongTest$sex <- factor(LongTest$sex) # Estimate, for example, a generalized, additive model in discrete survival analysis library(mgcv) gamFit <- gam (formula = y ~ s(timeInt) + s(age) + sex + ph.ecog, data = LongTrain, family = binomial()) summary(gamFit) # 1. Specification of predicted discrete hazards # Estimate survival function of each person in the test data testPredHaz <- predict(gamFit, newdata = LongTest, type = "response") # 2. Calculate integrated prediction error intPredErr(hazards = testPredHaz, testTime = TestCancer$timeDisc, testEvent = TestCancer$status, trainTime = TrainCancer$timeDisc, trainEvent = TrainCancer$status, testDataLong = LongTest)
Estimates integrated prediction errors of arbitrary prediction models in the case of competing risks.
intPredErrCompRisks( testPreds, testDataShort, trainDataShort, timeColumn, eventColumns, tmax = NULL )
intPredErrCompRisks( testPreds, testDataShort, trainDataShort, timeColumn, eventColumns, tmax = NULL )
testPreds |
Predictions on the test data with model fitted on training data. Must be a matrix, with the predictions in the rows and the number of columns equal to the number of events. |
testDataShort |
Test data in short format. |
trainDataShort |
Train data in short format. |
timeColumn |
Character giving the column name of the observed times. |
eventColumns |
Character vector giving the column names of the event indicators (excluding censoring column). |
tmax |
Gives the maximum time interval for which prediction errors are calculated. It must not be higher than the maximum observed time in the training data. |
Integrated prediction errors for each competing event. Matrix, with the integrated predictions in the rows and the number of columns equal to the number of events.
Moritz Berger [email protected]
https://www.imbie.uni-bonn.de/personen/dr-moritz-berger/
Heyard R, Timsit J, Held L, COMBACTE-MAGNET,consortium (2019). “Validation of discrete time-to-event prediction models in the presence of competing risks.” Biometrical Journal, 62, 643-657.
predErrCompRisks
, predErrCurve
########################### # Example unemployment data library(Ecdat) data(UnempDur) # Select subsample selectInd1 <- 1:200 selectInd2 <- 201:400 trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ] testSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ] # Convert to long format trainSet_long <- dataLongCompRisks(dataShort=trainSet, timeColumn="spell", eventColumns=c("censor1", "censor4"), timeAsFactor=TRUE) tmax <- max(trainSet$spell) testSet_long <- dataLongCompRisks(dataShort=testSet, timeColumn="spell", eventColumns=c("censor1", "censor4"), aggTimeFormat = TRUE, lastTheoInt=tmax, timeAsFactor=TRUE) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula=cbind(e0, e1, e2) ~ timeInt + age + logwage, data=trainSet_long, family=VGAM::multinomial(refLevel="e0")) # Calculate predicted hazards predHazards <- VGAM::predictvglm(vglmFit, newdata=testSet_long, type="response") # Compute integrated prediction error intPredErrCompRisks(testPreds=predHazards[,-1], testSet, trainSet, "spell", c("censor1", "censor4"), tmax)
########################### # Example unemployment data library(Ecdat) data(UnempDur) # Select subsample selectInd1 <- 1:200 selectInd2 <- 201:400 trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ] testSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ] # Convert to long format trainSet_long <- dataLongCompRisks(dataShort=trainSet, timeColumn="spell", eventColumns=c("censor1", "censor4"), timeAsFactor=TRUE) tmax <- max(trainSet$spell) testSet_long <- dataLongCompRisks(dataShort=testSet, timeColumn="spell", eventColumns=c("censor1", "censor4"), aggTimeFormat = TRUE, lastTheoInt=tmax, timeAsFactor=TRUE) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula=cbind(e0, e1, e2) ~ timeInt + age + logwage, data=trainSet_long, family=VGAM::multinomial(refLevel="e0")) # Calculate predicted hazards predHazards <- VGAM::predictvglm(vglmFit, newdata=testSet_long, type="response") # Compute integrated prediction error intPredErrCompRisks(testPreds=predHazards[,-1], testSet, trainSet, "spell", c("censor1", "censor4"), tmax)
Constructs a life table and estimates discrete hazards, survival functions, discrete cumulative hazards and their standard errors without covariates.
lifeTable(dataShort, timeColumn, eventColumn, intervalLimits = NULL) ## S3 method for class 'discSurvLifeTable' print(x, ...)
lifeTable(dataShort, timeColumn, eventColumn, intervalLimits = NULL) ## S3 method for class 'discSurvLifeTable' print(x, ...)
dataShort |
Original data in short format ("class data.frame"). |
timeColumn |
Name of the column with discrete survival times ("character vector"). |
eventColumn |
Gives the column name of the event indicator (1=observed, 0=censored) ("character vector"). |
intervalLimits |
Optional names of the intervals for each row, e. g. [a_0, a_1), [a_1, a_2), ..., [a_q-1, a_q) ("character vector") |
x |
Object of class "discSurvLifeTable"("class discSurvLifeTable") |
... |
Additional arguments to the print function |
List containing an object of class "data.frame" with following columns
n Number of individuals at risk in a given time interval (integer)
events Observed number of events in a given time interval (integer)
dropouts Observed number of dropouts in a given time interval (integer)
atRisk Estimated number of individuals at risk, corrected by dropouts (numeric)
hazard Estimated risk of death (without covariates) in a given time interval
seHazard Estimated standard deviation of estimated hazard
S Estimated survival curve
seS Estimated standard deviation of estimated survival function
cumHazard Estimated cumulative hazard function
seCumHazard Estimated standard deviation of the estimated cumulative hazard function
margProb Estimated marginal probability of event in time interval
Thomas Welchowski [email protected]
Matthias Schmid [email protected]
Tutz G, Schmid M (2016).
Modeling discrete time-to-event data.
Springer Series in Statistics.
Lawless JF (2002).
Statistical Models and Methods for Lifetime Data, 2nd edition.
Wiley series in probability and statistics.
# Example with unemployment data library(Ecdat) data(UnempDur) # Extract subset of all persons smaller or equal the median of age UnempDurSubset <- subset(UnempDur, age <= median(UnempDur$age)) LifeTabUnempDur <- lifeTable(dataShort = UnempDurSubset, timeColumn = "spell", eventColumn = "censor1") LifeTabUnempDur # Example with monoclonal gammapothy data library(survival) head(mgus) # Extract subset of mgus subMgus <- mgus [mgus$futime<=median(mgus$futime), ] # Transform time in days to intervals [0, 1), [1, 2), [2, 3), ... , [12460, 12461) mgusInt <- subMgus mgusInt$futime <- mgusInt$futime + 1 LifeTabGamma <- lifeTable(dataShort = mgusInt, timeColumn= "futime", eventColumn = "death") head(LifeTabGamma$Output, 25) plot(x = 1:dim(LifeTabGamma$Output)[1], y = LifeTabGamma$Output$hazard, type = "l", xlab = "Time interval", ylab = "Hazard", las = 1, main = "Life table estimated marginal discrete hazards")
# Example with unemployment data library(Ecdat) data(UnempDur) # Extract subset of all persons smaller or equal the median of age UnempDurSubset <- subset(UnempDur, age <= median(UnempDur$age)) LifeTabUnempDur <- lifeTable(dataShort = UnempDurSubset, timeColumn = "spell", eventColumn = "censor1") LifeTabUnempDur # Example with monoclonal gammapothy data library(survival) head(mgus) # Extract subset of mgus subMgus <- mgus [mgus$futime<=median(mgus$futime), ] # Transform time in days to intervals [0, 1), [1, 2), [2, 3), ... , [12460, 12461) mgusInt <- subMgus mgusInt$futime <- mgusInt$futime + 1 LifeTabGamma <- lifeTable(dataShort = mgusInt, timeColumn= "futime", eventColumn = "death") head(LifeTabGamma$Output, 25) plot(x = 1:dim(LifeTabGamma$Output)[1], y = LifeTabGamma$Output$hazard, type = "l", xlab = "Time interval", ylab = "Hazard", las = 1, main = "Life table estimated marginal discrete hazards")
Computes optimal minimal node size of a discrete survival tree from a given vector of possible node sizes by cross-validation. Laplace-smoothing can be applied to the estimated hazards.
minNodePruning( formula, data, treetype = "rpart", splitruleranger = "hellinger", sizes, indexList, timeColumn, eventColumn, lambda = 1, logOut = FALSE )
minNodePruning( formula, data, treetype = "rpart", splitruleranger = "hellinger", sizes, indexList, timeColumn, eventColumn, lambda = 1, logOut = FALSE )
formula |
Model formula for tree fitting("class formula") |
data |
Discrete survival data in short format for which a survival tree is to be fitted ("class data.frame"). |
treetype |
Type of tree to be fitted ("character vector"). Possible values are "rpart" or "ranger". The default is to fit an rpart tree; when "ranger" is chosen, a ranger forest with a single tree is fitted. |
splitruleranger |
String specifying the splitting rule of the ranger tree("character vector"). Possible values are either "gini", "extratrees" or "hellinger". Default is "hellinger". |
sizes |
Vector of different node sizes to try ("integer vector"). Values should be non-negative. |
indexList |
List of data partitioning indices for cross-validation ("class list"). Each element represents the test indices of one fold ("integer vector"). |
timeColumn |
Character giving the column name of the observed times in the data argument ("character vector"). |
eventColumn |
Character giving the column name of the event indicator in the data argument ("character vector"). |
lambda |
Parameter for laplace-smoothing. A value of 0 corresponds to no laplace-smoothing ("numeric vector"). |
logOut |
Logical value ("logical vector"). If the argument is set to TRUE, then computation progress will be written to console. |
Computes the out-of-sample log likelihood for all data partitionings for each node size in sizes and returns the node size for which the log likelihood was minimal. Also returns an rpart tree with the optimal minimal node size using the entire data set.
A list containing the two items
Optimal minimal node size - Node size with lowest out-of-sample log-likelihood
tree - a tree object with type corresponding to treetype argument with the optimal minimal node size
library(pec) library(caret) data(cost) # Take subsample and convert time to years cost$time <- ceiling(cost$time / 365) costSub <- cost[1:50, ] # Specify column names for data augmentation timeColumn <- "time" eventColumn <- "status" # Create data partition for cross validation indexList <- createFolds(costSub$status * max(costSub$time) + costSub$time, k = 5) # specify function arguments and perform node size pruning formula <- y ~ timeInt + prevStroke + age + sex sizes <- 1:10 optiTree <- minNodePruning(formula, costSub, treetype = "rpart", sizes = sizes, indexList = indexList, timeColumn = timeColumn, eventColumn = eventColumn, lambda = 1, logOut = TRUE)
library(pec) library(caret) data(cost) # Take subsample and convert time to years cost$time <- ceiling(cost$time / 365) costSub <- cost[1:50, ] # Specify column names for data augmentation timeColumn <- "time" eventColumn <- "status" # Create data partition for cross validation indexList <- createFolds(costSub$status * max(costSub$time) + costSub$time, k = 5) # specify function arguments and perform node size pruning formula <- y ~ timeInt + prevStroke + age + sex sizes <- 1:10 optiTree <- minNodePruning(formula, costSub, treetype = "rpart", sizes = sizes, indexList = indexList, timeColumn = timeColumn, eventColumn = eventColumn, lambda = 1, logOut = TRUE)
Computes optimal minimal node size of a discrete survival tree from a given vector of possible node sizes by cross-validation. Laplace-smoothing can be applied to the estimated hazards.
minNodePruningCompRisks( formula, data, treetype = "rpart", splitruleranger = "gini", sizes, indexList, timeColumn, eventColumns, lambda = 1, logOut = FALSE )
minNodePruningCompRisks( formula, data, treetype = "rpart", splitruleranger = "gini", sizes, indexList, timeColumn, eventColumns, lambda = 1, logOut = FALSE )
formula |
Model formula for tree fitting("class formula") |
data |
Discrete survival data in short format for which a survival tree is to be fitted ("class data.frame"). |
treetype |
Type of tree to be fitted. Possible values are "rpart" or "ranger" ("character vector"). The default is to fit an rpart tree; when "ranger" is chosen, a ranger forest with a single tree is fitted. |
splitruleranger |
String specifying the splitting rule of the ranger tree ("character vector"). Possible values are either "gini" or "extratrees". Default is "gini". |
sizes |
Vector of different node sizes to try ("integer vector"). Values need to be non-negative. |
indexList |
List of data partitioning indices for cross-validation ("class list"). Each element represents the test indices of one fold ("integer vector"). |
timeColumn |
Character giving the column name of the observed times in the "data"-argument("character vector"). |
eventColumns |
Character vector giving the column names of the event indicators (excluding censoring column) in the "data"-argument("character vector"). |
lambda |
Parameter for laplace-smoothing. A value of 0 corresponds to no laplace-smoothing ("numeric vector"). |
logOut |
Logical value("logical vector"). If True, computation progress will be written to console. |
Computes the out-of-sample log likelihood for all data partitionings for each node size in sizes and returns the node size for which the log likelihood was minimal. Also returns an rpart tree with the optimal minimal node size using the entire data set.
A list containing the two items
Optimal minimal node size - Node size with lowest out-of-sample log-likelihood
tree - a tree object with type corresponding to treetype argument with the optimal minimal node size
# Example unemployment data library(Ecdat) library(caret) data(UnempDur) # Select training and testing subsample subUnempDur <- UnempDur[which(UnempDur$spell < 10),] subUnempDur <- subUnempDur[1:250,] #creating status variable for data partitioning subUnempDur$status <- ifelse(subUnempDur$censor1, 1, ifelse(subUnempDur$censor2, 2, ifelse( subUnempDur$censor3, 3, ifelse(subUnempDur$censor4, 4, 0)))) indexList <- createFolds(subUnempDur$status*max(subUnempDur$spell) + subUnempDur$spell, k = 5) # performing minimal node size pruning formula <- responses ~ timeInt + age + logwage sizes <- 1:10 timeColumn <- "spell" eventColumns <- c("censor1", "censor2", "censor3","censor4") optiTree <- minNodePruningCompRisks(formula, subUnempDur, treetype = "rpart", sizes = sizes, indexList = indexList, timeColumn = timeColumn, eventColumns = eventColumns, lambda = 1, logOut = TRUE)
# Example unemployment data library(Ecdat) library(caret) data(UnempDur) # Select training and testing subsample subUnempDur <- UnempDur[which(UnempDur$spell < 10),] subUnempDur <- subUnempDur[1:250,] #creating status variable for data partitioning subUnempDur$status <- ifelse(subUnempDur$censor1, 1, ifelse(subUnempDur$censor2, 2, ifelse( subUnempDur$censor3, 3, ifelse(subUnempDur$censor4, 4, 0)))) indexList <- createFolds(subUnempDur$status*max(subUnempDur$spell) + subUnempDur$spell, k = 5) # performing minimal node size pruning formula <- responses ~ timeInt + age + logwage sizes <- 1:10 timeColumn <- "spell" eventColumns <- c("censor1", "censor2", "censor3","censor4") optiTree <- minNodePruningCompRisks(formula, subUnempDur, treetype = "rpart", sizes = sizes, indexList = indexList, timeColumn = timeColumn, eventColumns = eventColumns, lambda = 1, logOut = TRUE)
Generates a plot of an estimated cumulative incidence function P(T <= t, event=k | x) based on estimated hazards of a discrete competing risks model or a discrete subdistribution hazard model.
plotCumInc(hazards, eventFocus = NULL, ...)
plotCumInc(hazards, eventFocus = NULL, ...)
hazards |
Numeric matrix (where each column represents one event) or vector of estimated hazards("numeric matrix"). |
eventFocus |
Column that represent the primary event ("integer vector"). Only applicable in the case of competing risks. |
... |
Further arguments passed to |
Moritz Berger [email protected]
https://www.imbie.uni-bonn.de/personen/dr-moritz-berger/
Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.
estSurv
, estCumInz
, compRisksGEE
# Example with unemployment data library(Ecdat) data(UnempDur) # Select subsample SubUnempDur <- UnempDur [1:100, ] ################################ # Competing risks model # Estimate GEE models for all events estGEE <- compRisksGEE(datShort = SubUnempDur, dataTransform = "dataLongCompRisks", corstr = "independence", formulaVariable =~ timeInt + age + ui + logwage * ui, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") # Estimate hazards of all events given the covariates of third person SubUnempDurLong <- dataLongCompRisks(dataShort = SubUnempDur, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") preds <- predict(estGEE, subset(SubUnempDurLong, obj == 3)) plotCumInc(preds, eventFocus = 3) ############################### # Subdistribution hazards model # Convert to long format SubUnempDurLong <- dataLongSubDist(dataShort = SubUnempDur, timeColumn = "spell", eventColumns = c("censor1", "censor2", "censor3", "censor4"), eventFocus = "censor1") # Estimate continuation ratio model with logit link glmFit <- glm(formula = y ~ timeInt + age + ui + logwage * ui, data = SubUnempDurLong, family = binomial(), weights = SubUnempDurLong$subDistWeights) # Estimated subdistribution hazard given the covariates of the third person preds <- predict(glmFit, type = "response", newdata = subset(SubUnempDurLong, obj == 3)) plotCumInc(preds)
# Example with unemployment data library(Ecdat) data(UnempDur) # Select subsample SubUnempDur <- UnempDur [1:100, ] ################################ # Competing risks model # Estimate GEE models for all events estGEE <- compRisksGEE(datShort = SubUnempDur, dataTransform = "dataLongCompRisks", corstr = "independence", formulaVariable =~ timeInt + age + ui + logwage * ui, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") # Estimate hazards of all events given the covariates of third person SubUnempDurLong <- dataLongCompRisks(dataShort = SubUnempDur, eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") preds <- predict(estGEE, subset(SubUnempDurLong, obj == 3)) plotCumInc(preds, eventFocus = 3) ############################### # Subdistribution hazards model # Convert to long format SubUnempDurLong <- dataLongSubDist(dataShort = SubUnempDur, timeColumn = "spell", eventColumns = c("censor1", "censor2", "censor3", "censor4"), eventFocus = "censor1") # Estimate continuation ratio model with logit link glmFit <- glm(formula = y ~ timeInt + age + ui + logwage * ui, data = SubUnempDurLong, family = binomial(), weights = SubUnempDurLong$subDistWeights) # Estimated subdistribution hazard given the covariates of the third person preds <- predict(glmFit, type = "response", newdata = subset(SubUnempDurLong, obj == 3)) plotCumInc(preds)
Generates a plot of an estimated survival function S(T>t|x) based on estimated discrete hazards.
plotSurv(hazards, ...)
plotSurv(hazards, ...)
hazards |
Estimated discrete hazards ("numeric vector") |
... |
Further arguments passed to |
Moritz Berger [email protected]
https://www.imbie.uni-bonn.de/personen/dr-moritz-berger/
Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.
# Example unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] # Convert to long format UnempLong <- dataLong(dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1") head(UnempLong) # Estimate binomial model with logit link Fit <- glm(formula = y ~ timeInt + age + logwage, data = UnempLong, family = binomial()) # Estimate discrete survival function given age, logwage of first person Tmax <- max(subUnempDur$spell) UnempEval <- dataLong(dataShort = UnempDur[1,], timeColumn = "spell", eventColumn = "censor1", aggTimeFormat = TRUE, lastTheoInt = Tmax) hazard <- predict(Fit, newdata = UnempEval, type = "response") plotSurv(hazard)
# Example unemployment data library(Ecdat) data(UnempDur) # Select subsample subUnempDur <- UnempDur [1:100, ] # Convert to long format UnempLong <- dataLong(dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1") head(UnempLong) # Estimate binomial model with logit link Fit <- glm(formula = y ~ timeInt + age + logwage, data = UnempLong, family = binomial()) # Estimate discrete survival function given age, logwage of first person Tmax <- max(subUnempDur$spell) UnempEval <- dataLong(dataShort = UnempDur[1,], timeColumn = "spell", eventColumn = "censor1", aggTimeFormat = TRUE, lastTheoInt = Tmax) hazard <- predict(Fit, newdata = UnempEval, type = "response") plotSurv(hazard)
Estimates prediction error curves for discrete survival competing risks models
predErrCompRisks( testPreds, testDataShort, trainDataShort, timeColumn, eventColumns, tmax = NULL )
predErrCompRisks( testPreds, testDataShort, trainDataShort, timeColumn, eventColumns, tmax = NULL )
testPreds |
Predictions on the test data with model fitted on training data ("numeric matrix"). Predictions are stored in the rows and the number of columns equal the number of events. |
testDataShort |
Test data in short format ("class data.frame"). |
trainDataShort |
Train data in short format ("class data.frame"). |
timeColumn |
Character giving the column name of the observed times("character vector"). |
eventColumns |
Character vector giving the column names of the event indicators (excluding censoring column) ("character vector"). |
tmax |
Gives the maximum time interval for which prediction errors are calculated ("integer vector"). It must not be higher than the maximum observed time in the training data. |
Calculated prediction errors for each competing event. Array with one matrix per competing event, with the predictions in the rows and the time points in the columns.
Moritz Berger [email protected]
https://www.imbie.uni-bonn.de/personen/dr-moritz-berger/
Heyard R, Timsit J, Held L, COMBACTE-MAGNET,consortium (2019). “Validation of discrete time-to-event prediction models in the presence of competing risks.” Biometrical Journal, 62, 643-657.
intPredErrCompRisks
, predErrCurve
########################### # Example unemployment data library(Ecdat) data(UnempDur) # Select subsample selectInd1 <- 1:200 selectInd2 <- 201:400 trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ] testSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ] # Convert to long format trainSet_long <- dataLongCompRisks(dataShort=trainSet, timeColumn="spell", eventColumns=c("censor1", "censor4"), timeAsFactor=TRUE) tmax <- max(trainSet$spell) testSet_long <- dataLongCompRisks(dataShort=testSet, timeColumn="spell", eventColumns=c("censor1", "censor4"), aggTimeFormat = TRUE, lastTheoInt=tmax, timeAsFactor=TRUE) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula=cbind(e0, e1, e2) ~ timeInt + age + logwage, data=trainSet_long, family=VGAM::multinomial(refLevel="e0")) # Calculate predicted hazards predHazards <- VGAM::predictvglm(vglmFit, newdata=testSet_long, type="response") # Compute prediction error predErrCompRisks(testPreds=predHazards[,-1], testSet, trainSet, "spell", c("censor1", "censor4"), tmax)
########################### # Example unemployment data library(Ecdat) data(UnempDur) # Select subsample selectInd1 <- 1:200 selectInd2 <- 201:400 trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ] testSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ] # Convert to long format trainSet_long <- dataLongCompRisks(dataShort=trainSet, timeColumn="spell", eventColumns=c("censor1", "censor4"), timeAsFactor=TRUE) tmax <- max(trainSet$spell) testSet_long <- dataLongCompRisks(dataShort=testSet, timeColumn="spell", eventColumns=c("censor1", "censor4"), aggTimeFormat = TRUE, lastTheoInt=tmax, timeAsFactor=TRUE) # Estimate continuation ratio model with logit link vglmFit <- VGAM::vglm(formula=cbind(e0, e1, e2) ~ timeInt + age + logwage, data=trainSet_long, family=VGAM::multinomial(refLevel="e0")) # Calculate predicted hazards predHazards <- VGAM::predictvglm(vglmFit, newdata=testSet_long, type="response") # Compute prediction error predErrCompRisks(testPreds=predHazards[,-1], testSet, trainSet, "spell", c("censor1", "censor4"), tmax)
Estimates prediction error curves of arbitrary discrete survival prediction models. In prediction error curves the estimated and observed survival functions are compared adjusted by weights at given timepoints.
## S3 method for class 'discSurvPredErrDisc' print(x, ...) ## S3 method for class 'discSurvPredErrDisc' plot(x, ...) predErrCurve( timepoints, estSurvList, testTime, testEvent, trainTime, trainEvent )
## S3 method for class 'discSurvPredErrDisc' print(x, ...) ## S3 method for class 'discSurvPredErrDisc' plot(x, ...) predErrCurve( timepoints, estSurvList, testTime, testEvent, trainTime, trainEvent )
x |
Object of class "discSurvPredErrDisc"("class discSurvPredErrDisc") |
... |
Additional arguments to S3 methods. |
timepoints |
Vector of the number of discrete time intervals ("integer vector"). |
estSurvList |
List of persons in the test data ("class list"). Each element contains a estimated survival functions of all given time points ("numeric vector"). |
testTime |
Discrete survival times in the test data ("numeric vector"). |
testEvent |
Univariate event indicator in the test data ("binary vector"). |
trainTime |
Numeric vector of discrete survival times in the training data ("numeric vector"). |
trainEvent |
Integer vector of univariate event indicator in the training data("integer vector"). |
The prediction error curves should be smaller than 0.25 for all time points, because this is equivalent to a random assignment error.
List List with objects:
Output List with two components
predErr Numeric vector with estimated prediction error values. Names give the evaluation time point.
weights List of weights used in the estimation. Each list component gives the weights of a person in the test data.
Input A list of given argument input values (saved for reference)
Thomas Welchowski [email protected]
Van der Laan MJ, Robins JM (2003).
Unified Methods for Censored Longitudinal Data and Causality.
Springer Series in Statistics.
Gerds TA, Schumacher M (2006).
“Consistent estimation of the expected Brier Score in general survival models with right-censored event times.”
Biometrical Journal, 48, 1029-1040.
# Example with cross validation and unemployment data library(Ecdat) library(mgcv) data(UnempDur) summary(UnempDur$spell) # Extract subset of data set.seed(635) IDsample <- sample(1:dim(UnempDur)[1], 100) UnempDurSubset <- UnempDur [IDsample, ] head(UnempDurSubset) range(UnempDurSubset$spell) # Generate training and test data set.seed(7550) TrainIndices <- sample (x = 1:dim(UnempDurSubset) [1], size = 75) TrainUnempDur <- UnempDurSubset [TrainIndices, ] TestUnempDur <- UnempDurSubset [-TrainIndices, ] # Convert to long format LongTrain <- dataLong(dataShort = TrainUnempDur, timeColumn = "spell", eventColumn = "censor1") LongTest <- dataLong(dataShort = TestUnempDur, timeColumn = "spell", eventColumn = "censor1") # Convert factor to numeric for smoothing LongTrain$timeInt <- as.numeric(as.character(LongTrain$timeInt)) LongTest$timeInt <- as.numeric(as.character(LongTest$timeInt)) ###################################################################### # Estimate a generalized, additive model in discrete survival analysis gamFit <- gam (formula = y ~ s(timeInt) + age + logwage, data = LongTrain, family = binomial()) # Estimate survival function of each person in the test data oneMinusPredHaz <- 1 - predict(gamFit, newdata = LongTest, type = "response") predSurv <- aggregate(oneMinusPredHaz ~ obj, data = LongTest, FUN = cumprod) # Prediction error in first interval tryPredErrDisc1 <- predErrCurve (timepoints = 1, estSurvList = predSurv [[2]], testTime = TestUnempDur$spell, testEvent=TestUnempDur$censor1, trainTime = TrainUnempDur$spell, trainEvent=TrainUnempDur$censor1) tryPredErrDisc1 # Prediction error of the 2. to 10. interval tryPredErrDisc2 <- predErrCurve (timepoints = 2:10, estSurvList = predSurv [[2]], testTime = TestUnempDur$spell, testEvent = TestUnempDur$censor1, trainTime = TrainUnempDur$spell, trainEvent = TrainUnempDur$censor1) tryPredErrDisc2 plot(tryPredErrDisc2) ######################################## # Fit a random discrete survival forest library(ranger) LongTrainRF <- LongTrain LongTrainRF$y <- factor(LongTrainRF$y) rfFit <- ranger(formula = y ~ timeInt + age + logwage, data = LongTrainRF, probability = TRUE) # Estimate survival function of each person in the test data oneMinusPredHaz <- 1 - predict(rfFit, data = LongTest)$predictions[, 2] predSurv <- aggregate(oneMinusPredHaz ~ obj, data = LongTest, FUN = cumprod) # Prediction error in first interval tryPredErrDisc1 <- predErrCurve (timepoints = 1, estSurvList = predSurv [[2]], testTime = TestUnempDur$spell, testEvent = TestUnempDur$censor1, trainTime = TrainUnempDur$spell, trainEvent = TrainUnempDur$censor1) tryPredErrDisc1 # Prediction error of the 2. to 10. interval tryPredErrDisc2 <- predErrCurve (timepoints = 2:10, estSurvList = predSurv [[2]], testTime = TestUnempDur$spell, testEvent = TestUnempDur$censor1, trainTime = TrainUnempDur$spell, trainEvent = TrainUnempDur$censor1) tryPredErrDisc2 plot(tryPredErrDisc2)
# Example with cross validation and unemployment data library(Ecdat) library(mgcv) data(UnempDur) summary(UnempDur$spell) # Extract subset of data set.seed(635) IDsample <- sample(1:dim(UnempDur)[1], 100) UnempDurSubset <- UnempDur [IDsample, ] head(UnempDurSubset) range(UnempDurSubset$spell) # Generate training and test data set.seed(7550) TrainIndices <- sample (x = 1:dim(UnempDurSubset) [1], size = 75) TrainUnempDur <- UnempDurSubset [TrainIndices, ] TestUnempDur <- UnempDurSubset [-TrainIndices, ] # Convert to long format LongTrain <- dataLong(dataShort = TrainUnempDur, timeColumn = "spell", eventColumn = "censor1") LongTest <- dataLong(dataShort = TestUnempDur, timeColumn = "spell", eventColumn = "censor1") # Convert factor to numeric for smoothing LongTrain$timeInt <- as.numeric(as.character(LongTrain$timeInt)) LongTest$timeInt <- as.numeric(as.character(LongTest$timeInt)) ###################################################################### # Estimate a generalized, additive model in discrete survival analysis gamFit <- gam (formula = y ~ s(timeInt) + age + logwage, data = LongTrain, family = binomial()) # Estimate survival function of each person in the test data oneMinusPredHaz <- 1 - predict(gamFit, newdata = LongTest, type = "response") predSurv <- aggregate(oneMinusPredHaz ~ obj, data = LongTest, FUN = cumprod) # Prediction error in first interval tryPredErrDisc1 <- predErrCurve (timepoints = 1, estSurvList = predSurv [[2]], testTime = TestUnempDur$spell, testEvent=TestUnempDur$censor1, trainTime = TrainUnempDur$spell, trainEvent=TrainUnempDur$censor1) tryPredErrDisc1 # Prediction error of the 2. to 10. interval tryPredErrDisc2 <- predErrCurve (timepoints = 2:10, estSurvList = predSurv [[2]], testTime = TestUnempDur$spell, testEvent = TestUnempDur$censor1, trainTime = TrainUnempDur$spell, trainEvent = TrainUnempDur$censor1) tryPredErrDisc2 plot(tryPredErrDisc2) ######################################## # Fit a random discrete survival forest library(ranger) LongTrainRF <- LongTrain LongTrainRF$y <- factor(LongTrainRF$y) rfFit <- ranger(formula = y ~ timeInt + age + logwage, data = LongTrainRF, probability = TRUE) # Estimate survival function of each person in the test data oneMinusPredHaz <- 1 - predict(rfFit, data = LongTest)$predictions[, 2] predSurv <- aggregate(oneMinusPredHaz ~ obj, data = LongTest, FUN = cumprod) # Prediction error in first interval tryPredErrDisc1 <- predErrCurve (timepoints = 1, estSurvList = predSurv [[2]], testTime = TestUnempDur$spell, testEvent = TestUnempDur$censor1, trainTime = TrainUnempDur$spell, trainEvent = TrainUnempDur$censor1) tryPredErrDisc1 # Prediction error of the 2. to 10. interval tryPredErrDisc2 <- predErrCurve (timepoints = 2:10, estSurvList = predSurv [[2]], testTime = TestUnempDur$spell, testEvent = TestUnempDur$censor1, trainTime = TrainUnempDur$spell, trainEvent = TrainUnempDur$censor1) tryPredErrDisc2 plot(tryPredErrDisc2)
Predicts the laplace-smoothed hazards of discrete survival tree. Can be used for single-risk or competing risk discrete survival data.
survTreeLaplaceHazard(treeModel, newdata, lambda)
survTreeLaplaceHazard(treeModel, newdata, lambda)
treeModel |
Fitted tree object as generated by "rpart" ("class rpart"). |
newdata |
Data in long format for which hazards are to be computed. Must contain the same columns that were used for tree fitting("class data.frame"). |
lambda |
Smoothing parameter for laplace-smoothing. Must be a non-negative number. A value of 0 corresponds to no smoothing ("numeric vector"). |
A m by k matrix with m being the length of newdata and k being the number of classes in treeModel. Each row corresponds to the smoothed hazard of the respective observation.
library(pec) library(caret) # Example data data(cost) # Convert time to years and select training and testing subsample cost$time <- ceiling(cost$time/365) costTrain <- cost[1:100, ] costTest <- cost[101:120, ] # Convert to long format timeColumn <- "time" eventColumn <- "status" costTrainLong <- dataLong(dataShort=costTrain, timeColumn = "time", eventColumn = "status") costTestLong <- dataLong(dataShort=costTest, timeColumn = "time", eventColumn = "status") head(costTrainLong) # Fit a survival tree costTree <- rpart(formula = y ~ timeInt + prevStroke + age + sex, data = costTrainLong, method = "class") # Compute smoothed hazards for test data predictedhazards <- survTreeLaplaceHazard(costTree, costTestLong, 1) predictedhazards
library(pec) library(caret) # Example data data(cost) # Convert time to years and select training and testing subsample cost$time <- ceiling(cost$time/365) costTrain <- cost[1:100, ] costTest <- cost[101:120, ] # Convert to long format timeColumn <- "time" eventColumn <- "status" costTrainLong <- dataLong(dataShort=costTrain, timeColumn = "time", eventColumn = "status") costTestLong <- dataLong(dataShort=costTest, timeColumn = "time", eventColumn = "status") head(costTrainLong) # Fit a survival tree costTree <- rpart(formula = y ~ timeInt + prevStroke + age + sex, data = costTrainLong, method = "class") # Compute smoothed hazards for test data predictedhazards <- survTreeLaplaceHazard(costTree, costTestLong, 1) predictedhazards
Predicts the laplace-smoothed hazards of discrete survival data based on a survival tree from class "ranger". Currently only single-risk data is supported.
survTreeLaplaceHazardRanger(treeModel, rangerdata, newdata, lambda)
survTreeLaplaceHazardRanger(treeModel, rangerdata, newdata, lambda)
treeModel |
Fitted tree object as generated by "ranger" ("class data.frame"). Must be a single ranger tree. |
rangerdata |
Original training data with which treeModel was fitted ("class data.frame"). Must be in long format. |
newdata |
Data in long format for which hazards are to be computed ("class data.frame"). Must contain the same columns that were used for tree fitting. |
lambda |
Smoothing parameter for laplace-smoothing ("class data.frame"). Must be a non-negative number. A value of zero corresponds to no smoothing. |
A m by k matrix with m being the length of newdata and k being the number of classes in treeModel. Each row corresponds to the smoothed hazard of the respective observation.
library(pec) library(caret) library(ranger) data(cost) # Take subsample and convert time to years cost$time <- ceiling(cost$time/365) costSubTrain <- cost[1:50,] costSubTest <- cost[51:70,] # Specify column names for data augmentation timeColumn<-"time" eventColumn<-"status" costSubTrainLong <- dataLong(costSubTrain, timeColumn, eventColumn) costSubTestLong <- dataLong(costSubTest, timeColumn, eventColumn) #create tree formula <- y ~ timeInt + diabetes + prevStroke + age + sex rangerTree <- ranger(formula, costSubTrainLong, num.trees = 1, mtry = 5, classification = TRUE, splitrule = "hellinger", replace = FALSE, sample.fraction = 1, max.depth = 5) #compute laplace-smoothed hazards laplHaz <- survTreeLaplaceHazardRanger(rangerTree, costSubTrainLong, costSubTestLong, lambda = 1) laplHaz
library(pec) library(caret) library(ranger) data(cost) # Take subsample and convert time to years cost$time <- ceiling(cost$time/365) costSubTrain <- cost[1:50,] costSubTest <- cost[51:70,] # Specify column names for data augmentation timeColumn<-"time" eventColumn<-"status" costSubTrainLong <- dataLong(costSubTrain, timeColumn, eventColumn) costSubTestLong <- dataLong(costSubTest, timeColumn, eventColumn) #create tree formula <- y ~ timeInt + diabetes + prevStroke + age + sex rangerTree <- ranger(formula, costSubTrainLong, num.trees = 1, mtry = 5, classification = TRUE, splitrule = "hellinger", replace = FALSE, sample.fraction = 1, max.depth = 5) #compute laplace-smoothed hazards laplHaz <- survTreeLaplaceHazardRanger(rangerTree, costSubTrainLong, costSubTestLong, lambda = 1) laplHaz
Subsample of 1000 persons from the national longitudinal survey of youth 1979 data. Included covariates are age, children, ethnicity, marital status and sex. The bivariate responses current state (spell) and discrete time interval (year) are the last two columns.
data(unempMultiSpell)
data(unempMultiSpell)
Column "id" is defined as identification number for each person.
Column "age" represents the time-varying age of each person in years.
Column "child" consists of values
0 - No children
1 - Individual has child/children
Column "ethnicity" consists of values
1 - Hispanic
2 - Black
3 - Other
Column "marriage" consists of values
1 - Never Married
2 - Currently married
3 - Other/Divorced
Column "sex" consists of values
1 - Male
2 - Female
Column "spell" represents the time-varying employment status of each person. Possible values are
1 - Employed
2 - Unemployed
3 - Out of labor force
4 - In active forces
0 - Censored
Column "year" represents the discrete time intervals in years.
David Koehler [email protected]
National Longitudinal Survey of Youth
Function to compute new subdistribution weights for a test data set based on the estimated censoring survival function from a learning data set
weightsLtoT( dataShortTrain, dataShortTest, timeColumn, eventColumns, eventFocus )
weightsLtoT( dataShortTrain, dataShortTest, timeColumn, eventColumns, eventFocus )
dataShortTrain |
Learning data in short format ("class data.frame"). |
dataShortTest |
Test data in short format ("class data.frame"). |
timeColumn |
Character specifying the column name of the observed event times ("character vector"). It is required that the observed times are discrete ("integer vector"). |
eventColumns |
Character vector specifying the column names of the event indicators ("logical vector")(excluding censoring events). It is required that a 0-1 coding is used for all events. The algorithm treats row sums of zero of all event columns as censored. |
eventFocus |
Column name of the event of interest, which corresponds to the type 1 event ("character vector"). |
Subdstribution weights for the test data in long format using the estimated censoring survival function from the learning data ("numeric vector"). The length of the vector is equal to the number of observations of the long test data.
Moritz Berger [email protected]
https://www.imbie.uni-bonn.de/personen/dr-moritz-berger/
Berger M, Schmid M, Welchowski T, Schmitz-Valckenberg S, Beyersmann J (2020). “Subdistribution Hazard Models for Competing Risks in Discrete Time.” Biostatistics, 21, 449-466.
dataLongSubDist
, calibrationPlot
#################### # Data preprocessing # Example unemployment data library(Ecdat) data(UnempDur) # Select subsample selectInd1 <- 1:100 selectInd2 <- 101:200 trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ] valSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ] # Convert to long format trainSet_long <- dataLongSubDist(dataShort = trainSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") valSet_long <- dataLongSubDist(dataShort = valSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") # Compute new weights of the validation data set valSet_long$subDistWeights <- weightsLtoT(trainSet, valSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") # Estimate continuation ratio model with logit link glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, family = binomial(), weights = trainSet_long$subDistWeights) # Calculate predicted discrete hazards predHazards <- predict(glmFit, newdata = valSet_long, type = "response") # Calibration plot calibrationPlot(predHazards, testDataLong = valSet_long, weights = valSet_long$subDistWeights)
#################### # Data preprocessing # Example unemployment data library(Ecdat) data(UnempDur) # Select subsample selectInd1 <- 1:100 selectInd2 <- 101:200 trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ] valSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ] # Convert to long format trainSet_long <- dataLongSubDist(dataShort = trainSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") valSet_long <- dataLongSubDist(dataShort = valSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") # Compute new weights of the validation data set valSet_long$subDistWeights <- weightsLtoT(trainSet, valSet, timeColumn = "spell", eventColumns = c("censor1", "censor4"), eventFocus = "censor1") # Estimate continuation ratio model with logit link glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, family = binomial(), weights = trainSet_long$subDistWeights) # Calculate predicted discrete hazards predHazards <- predict(glmFit, newdata = valSet_long, type = "response") # Calibration plot calibrationPlot(predHazards, testDataLong = valSet_long, weights = valSet_long$subDistWeights)