| Title: | Building Augmented Data to Run Multi-State Models with 'msm' Package |
|---|---|
| Description: | A fast and general method for restructuring classical longitudinal observational data into augmented transition data suitable for multi-state modeling with the 'msm' package. Works with any longitudinal data where subjects accumulate repeated observations with start and end times and an optional terminal outcome. Methods are described in Grossetti, Ieva and Paganoni (2018) <doi:10.1007/s10729-017-9400-z>. |
| Authors: | Francesco Grossetti [aut, cre] (ORCID: <https://orcid.org/0000-0002-5130-7745>) |
| Maintainer: | Francesco Grossetti <[email protected]> |
| License: | GPL-3 |
| Version: | 2.2.1 |
| Built: | 2026-06-08 20:35:00 UTC |
| Source: | https://github.com/cran/msmtools |
Reshape standard longitudinal data into augmented transition data suitable for multi-state models fitted with msm.
    augment(
      data, data_key, n_events, pattern,
      state = c("IN", "OUT", "DEAD"),
      t_start, t_end, t_cens, t_death, t_augmented,
      more_status = NULL, check_NA = FALSE, copy = FALSE,
      verbosity = getOption("msmtools.verbosity", "quiet")
    )
data |
A |
data_key |
A keying variable used to identify subjects and define a key
for |
n_events |
An integer variable indicating the progressive (monotonic)
event number for each subject. |
pattern |
Either an integer, a factor, or a character variable with 2 or 3 unique values that gives each subject's terminal outcome schema. When 2 values are detected, they must be in the format: 0 = "alive", 1 = "dead". When 3 values are detected, they must be: 0 = "alive", 1 = "dead during a transition", 2 = "dead after a transition has ended" (see Details). |
state |
A character vector of exactly three unique, non-missing,
non-empty labels used as the generated transition-state vocabulary.
Defaults to |
t_start |
The starting time of an observation. It can be passed as date, integer, or numeric format. |
t_end |
The ending time of an observation. It can be passed as date, integer, or numeric format. |
t_cens |
The censoring time of the study. This is the date until each ID is observed, if still active in the cohort. |
t_death |
The exact death time of a subject ID. If |
t_augmented |
A variable indicating the name of the new time variable
in the augmented format. If |
more_status |
A variable that marks further transitions beyond the
default ones given by |
check_NA |
If |
copy |
If |
verbosity |
Controls informational output. Use |
augment() requires a monotonic event sequence within each subject.
The data are ordered with data.table::setkey() using data_key as the
primary key and t_start as the secondary key. The function then checks the
monotonicity of n_events; if the check fails, it stops and reports the
subjects that violate the condition. If n_events is missing, augment()
first computes a progression number named n_events and then runs the same
check.
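The ordering and monotonicity check described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the toy data, the column names, and the choice of "strictly increasing" as the monotonicity rule are all assumptions made for the example.

```r
# Toy longitudinal data; names and values are hypothetical.
visits <- data.frame(
  subj    = c(1, 1, 1, 2, 2),
  t_start = as.Date(c("2020-01-10", "2020-03-01", "2020-06-15",
                      "2020-02-01", "2020-02-20")),
  adm     = c(1L, 2L, 3L, 2L, 1L)   # subject 2 is numbered out of order
)

# Order by subject, then by start time (the effect of a two-column key).
visits <- visits[order(visits$subj, visits$t_start), ]

# A subject passes when its event counter strictly increases over time.
is_monotonic <- tapply(visits$adm, visits$subj,
                       function(x) all(diff(x) > 0))
violators <- names(is_monotonic)[!is_monotonic]

# When no counter is supplied, number the ordered rows within each subject.
visits$n_events <- ave(seq_along(visits$subj), visits$subj, FUN = seq_along)
```

Here `violators` holds the subjects whose counter breaks the ordering, which is the kind of information a failed check would report.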
Argument pattern describes the terminal outcome schema and must follow the
expected ordering. With two statuses, values must correspond to
0 = "alive" and 1 = "dead". With three statuses, integer values must
correspond to 0 = "alive", 1 = "dead inside a transition", and
2 = "dead outside a transition". Character and factor values must follow
the same order. For example, 0 cannot be used to indicate death.
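One way to make sure a character outcome follows the expected order is to fix the factor levels explicitly rather than rely on alphabetical sorting. The sketch below uses the three labels found in `hosp$label_3`; the input vector itself is hypothetical.

```r
# Hypothetical terminal outcomes, one per subject.
label_3 <- c("alive", "dead_out", "dead_in", "alive")

# Explicit level order: alive, dead inside a transition, dead outside.
pattern_fct <- factor(label_3, levels = c("alive", "dead_in", "dead_out"))

# Zero-based integer coding that matches the documented schema.
pattern_int <- as.integer(pattern_fct) - 1L
```

With this coding, "alive" maps to 0, "dead_in" to 1, and "dead_out" to 2, regardless of how the labels would sort on their own.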
Argument state describes the generated transition-state vocabulary. Its
order also matters. The first element is the state at t_start (for example,
"IN"), the second element is the state at t_end (for example, "OUT"),
and the third element is the absorbing state (for example, "DEAD"). A
two-value pattern still requires three state labels because augment()
infers whether death maps to the absorbing state inside or outside the
transition window.
more_status lets augment() represent transitions beyond the defaults in
state. Standard observations that add no extra information should use
"df" for "default" (see Examples, or run ?hosp and inspect rehab_it).
More complex transitions should use concise, self-explanatory labels.
By default, augment() follows data.table by-reference semantics to avoid
unnecessary copies of large longitudinal datasets. This means the input may
have its key changed, and n_events may be added when the argument is
omitted. Set copy = TRUE when the original input object must remain
unchanged.
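The by-reference behaviour comes from data.table itself and can be seen without calling `augment()`. The following sketch uses toy tables (hypothetical values) to show that `setkey()` reorders its input in place, and that working on a `copy()` leaves the original untouched, which is the effect `copy = TRUE` is meant to provide.

```r
library(data.table)

# setkey() sorts and keys the table in place; no assignment is needed.
dt <- data.table(subj = c(2L, 1L, 1L), t_start = c(5, 3, 1))
setkey(dt, subj, t_start)

# Working on an explicit copy keeps the original object as it was.
original <- data.table(subj = c(2L, 1L, 1L), t_start = c(5, 3, 1))
working  <- copy(original)
setkey(working, subj, t_start)

# Convert to a plain data.frame when downstream code requires one.
plain <- as.data.frame(working)
```

After this runs, `dt` and `working` are keyed and sorted, while `original` has no key and keeps its row order.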
The function always returns a data.table. Use as.data.frame() on the
result if a plain data.frame is needed by downstream code.
An augmented dataset of class data.table. Each row represents a
specific transition for a given subject. augment() computes the following
key variables:
augmented: The transition time variable. If t_augmented is missing,
augment() creates augmented by default. The variable is built from
t_start and t_end and inherits their class. If t_start is a date,
augment() also creates an integer variable named augmented_int. If
t_start is a difftime, it creates a numeric variable named
augmented_num.
status: A status flag that contains the states as specified in state.
augment() automatically checks whether argument pattern has 2 or 3
unique values and computes the correct structure of a given subject as
reported in the vignette. The variable is cast as character.
status_num: The corresponding integer version of status.
n_status: A mix of status and n_events cast as character. This is
useful when modelling process progression.
If more_status is passed, augment() computes additional variables.
They mirror the meaning of status, status_num, and n_status but they
account for the more complex structure defined. They are: status_exp,
status_exp_num, and n_status_exp.
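The relationship between `status`, `status_num`, and `n_status` can be pictured with a few lines of base R. This is only a sketch under two assumptions that the text above does not confirm: that the integer code is the position of each label in `state`, and that `n_status` joins the event number and the label with a space.

```r
# Default state vocabulary, in the documented order.
state <- c("IN", "OUT", "DEAD")

# Hypothetical augmented rows for one subject with two events.
status   <- c("IN", "OUT", "IN", "OUT", "DEAD")
n_events <- c(1L, 1L, 2L, 2L, 2L)

# Assumed coding: position of the label within `state`.
status_num <- match(status, state)

# Assumed format: event number and state label separated by a space.
n_status <- paste(n_events, status)
```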
Francesco Grossetti [email protected].
Grossetti, F., Ieva, F., and Paganoni, A.M. (2018). A multi-state approach to patients affected by chronic heart failure. Health Care Management Science, 21, 281-291. doi:10.1007/s10729-017-9400-z.
Jackson, C.H. (2011). Multi-State Models for Panel Data: The msm Package for R. Journal of Statistical Software, 38(8), 1-29. https://www.jstatsoft.org/v38/i08/.
M. Dowle, A. Srinivasan, T. Short, S. Lianoglou with contributions from
R. Saporta and E. Antonyan (2016): data.table: Extension of data.frame.
R package version 1.9.6. https://github.com/Rdatatable/data.table/wiki
data.table::data.table(), data.table::setkey()
    # loading data
    data(hosp)

    # augmenting hosp
    hosp_augmented = augment(data = hosp, data_key = subj, n_events = adm_number,
                             pattern = label_3, t_start = dateIN, t_end = dateOUT,
                             t_cens = dateCENS)

    # augmenting hosp by passing more information regarding transitions
    # with argument more_status
    hosp_augmented_more = augment(data = hosp, data_key = subj, n_events = adm_number,
                                  pattern = label_3, t_start = dateIN, t_end = dateOUT,
                                  t_cens = dateCENS, more_status = rehab_it)

    # requesting progress output
    hosp_augmented = augment(data = hosp, data_key = subj, n_events = adm_number,
                             pattern = label_3, t_start = dateIN, t_end = dateOUT,
                             t_cens = dateCENS, verbosity = "summary")
A synthetic longitudinal dataset of hospital admissions for 10 subjects. It includes repeated admissions, admission-level clinical flags, demographic variables, and end-of-study status labels.
    data(hosp)
A data.table with 53 rows and 12 variables:
subj: Subject ID (integer).
adm_number: Hospital admissions counter (integer).
gender: Gender of patient (factor with 2 levels: "F" = females,
"M" = males).
age: Age of patient in years at the given observation (integer).
rehab: Rehabilitation flag. If the admission has been in rehabilitation,
then rehab = 1; otherwise rehab = 0 (integer).
it: Intensive Therapy flag. If the admission has been in intensive
therapy, then it = 1; otherwise it = 0 (integer).
rehab_it: String marking the admission type based on rehab
and it. The standard admission is coded as "df" (default). Admissions
in rehabilitation or intensive therapy are coded as "rehab" or "it"
(character).
label_2: Subject status at the end of the study. It takes 2 values:
"alive" and "dead" (character).
label_3: Subject status at the end of the study. It takes 3 values:
"alive", "dead_in", and "dead_out" (character).
dateIN: Exact admission date (date).
dateOUT: Exact discharge date (date).
dateCENS: Either censoring time or exact death time (date).
Remove subjects with transitions to different states occurring at the same
exact time in an augmented dataset produced by augment().
    polish(
      data, data_key, pattern,
      time = NULL, check_NA = FALSE, copy = FALSE,
      verbosity = getOption("msmtools.verbosity", "quiet")
    )
data |
A |
data_key |
A keying variable used to identify subjects and define a key
for |
pattern |
Either an integer, a factor, or a character variable with 2 or 3 unique values that gives each subject's terminal outcome schema. When 2 values are detected, they must be in the format: 0 = "alive", 1 = "dead". When 3 values are detected, they must be: 0 = "alive", 1 = "dead during a transition", 2 = "dead after a transition has ended" (see Details). |
time |
The time variable used to identify duplicate transition times.
If omitted or set to |
check_NA |
If |
copy |
If |
verbosity |
Controls informational output. Use |
The function searches for cases where two subsequent events for the
same subject land on different states but occur at the same time. When this
happens, the whole subject, as identified by data_key, is removed from the
data. The function reports how many subjects were removed.
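The removal rule can be illustrated in base R on a toy augmented table. This is a sketch of the logic as described above, not the code used by `polish()`; the column names and values are hypothetical.

```r
# Toy augmented data: subject 2 moves to OUT and DEAD at the same time.
aug <- data.frame(
  subj      = c(1, 1, 1, 2, 2, 2),
  augmented = c(0, 5, 9, 0, 4, 4),
  status    = c("IN", "OUT", "IN", "IN", "OUT", "DEAD")
)
aug <- aug[order(aug$subj, aug$augmented), ]

# Compare each row with the previous one.
same_subj  <- c(FALSE, diff(aug$subj) == 0)
same_time  <- c(FALSE, diff(aug$augmented) == 0)
diff_state <- c(FALSE, aug$status[-1] != aug$status[-nrow(aug)])

# Subjects with consecutive rows at the same time but in different states.
flagged <- unique(aug$subj[same_subj & same_time & diff_state])

# Drop every row belonging to a flagged subject.
cleaned <- aug[!aug$subj %in% flagged, ]
```

Only subject 2 is flagged here, so `cleaned` keeps the three rows of subject 1.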
By default, polish() follows data.table by-reference semantics to avoid
unnecessary copies of large augmented datasets. This means the input may have
its key changed while duplicate subjects are identified. Set copy = TRUE
when the original input object must remain unchanged.
The function always returns a data.table. Use as.data.frame() on the
result if a plain data.frame is needed by downstream code.
A data.table with the same columns as the input data. Subjects
whose pattern transitions occur at the same time on different states are
removed in full (every row sharing the same data_key); rows from
unaffected subjects are kept as-is. When no duplicated transitions are
found, the input data is returned unchanged.
Francesco Grossetti [email protected].
    # loading data
    data(hosp)

    # augmenting longitudinal data
    hosp_aug = augment(data = hosp, data_key = subj, n_events = adm_number,
                       pattern = label_3, t_start = dateIN, t_end = dateOUT,
                       t_cens = dateCENS)

    # cleaning targeted duplicate transitions
    hosp_aug_clean = polish(data = hosp_aug, data_key = subj, pattern = label_3)
Plot observed and expected state prevalences from a fitted multi-state model. The function can also compute a rough diagnostic for where the data depart from the estimated Markov model.
    prevplot(
      x, prev.obj,
      exacttimes = TRUE, M = FALSE, ci = FALSE, print_plot = TRUE,
      verbosity = getOption("msmtools.verbosity", "quiet")
    )
x |
A fitted msm model object. |
prev.obj |
A list computed by |
exacttimes |
If |
M |
If |
ci |
If |
print_plot |
If |
verbosity |
Controls informational output. Use |
When M = TRUE, a rough indicator of the deviance from the
Markov model is computed according to Titman and Sharples (2008).
A comparison at a given time t_i of a subject k in the state s between
observed counts O_is and expected counts E_is is built as
M_is = (O_is - E_is)^2 / E_is.
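As a worked example of the formula above, the statistic can be computed element-wise on matrices of observed and expected counts. The counts below are made up for illustration, and the expected counts are kept strictly positive so that the division is defined.

```r
states <- c("IN", "OUT", "DEAD")
times  <- c("t1", "t2", "t3")

# Hypothetical observed counts: rows are times, columns are states.
obs <- matrix(c(10, 0, 0,
                 6, 3, 1,
                 3, 4, 3),
              nrow = 3, byrow = TRUE, dimnames = list(times, states))

# Hypothetical expected counts from a fitted model (all positive).
expd <- matrix(c(8, 1, 1,
                 5, 3, 2,
                 4, 3, 3),
               nrow = 3, byrow = TRUE, dimnames = list(times, states))

# M_is = (O_is - E_is)^2 / E_is for every time i and state s.
M_stat <- (obs - expd)^2 / expd
```

For instance, at `t1` in state `IN` the value is (10 - 8)^2 / 8 = 0.5, while cells where observed and expected counts agree contribute 0.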
The deviance M plot is returned together with the standard prevalence plot
in the second row. This layout is fixed.
When M = TRUE, the combined layout is built with patchwork, which is
an optional dependency of msmtools. Install it with
install.packages("patchwork") if it is not already available; prevplot()
raises an informative error otherwise. The default M = FALSE path has no
such requirement.
When M = FALSE, a gg/ggplot object with observed and expected
prevalences is returned. When M = TRUE, a patchwork object is returned
with the prevalence plot and the deviance M plot.
The returned object also carries a $prevalence field with the
long-format data.table used to build the plot. It always includes
time, state, obs, and hat; it also includes lwr and upr
when ci = TRUE, and M when M = TRUE. Access it directly:
    p <- prevplot(model, prev_obj)
    p$prevalence
print_plot only controls whether the plot is printed as a side effect.
Returned objects are unchanged: use print_plot = FALSE to create the plot
silently.
Francesco Grossetti [email protected].
Titman, A. and Sharples, L.D. (2010). Model diagnostics for multi-state models, Statistical Methods in Medical Research, 19, 621-651.
Titman, A. and Sharples, L.D. (2008). A general goodness-of-fit test for Markov and hidden Markov models, Statistics in Medicine, 27, 2177-2195.
Gentleman RC, Lawless JF, Lindsey JC, Yan P. (1994). Multi-state Markov models for analysing incomplete disease data with illustrations for HIV disease. Statistics in Medicine, 13:805-821.
Jackson, C.H. (2011). Multi-State Models for Panel Data: The msm Package for R. Journal of Statistical Software, 38(8), 1-29. https://www.jstatsoft.org/v38/i08/.
msm::plot.prevalence.msm(), msm::msm(),
msm::prevalence.msm()
    data(hosp)

    # augmenting the data
    hosp_augmented = augment(data = hosp, data_key = subj, n_events = adm_number,
                             pattern = label_3, t_start = dateIN, t_end = dateOUT,
                             t_cens = dateCENS)

    # let's define the initial transition matrix for our model
    Qmat = matrix(data = 0, nrow = 3, ncol = 3, byrow = TRUE)
    Qmat[1, 1:3] = 1
    Qmat[2, 1:3] = 1
    colnames(Qmat) = c('IN', 'OUT', 'DEAD')
    rownames(Qmat) = c('IN', 'OUT', 'DEAD')

    # fitting the model using
    # gender and age as covariates
    library(msm)
    msm_model = msm(status_num ~ augmented_int, subject = subj,
                    data = hosp_augmented, covariates = ~ gender + age,
                    exacttimes = TRUE, gen.inits = TRUE, qmatrix = Qmat,
                    method = 'BFGS',
                    control = list(fnscale = 6e+05, trace = 0,
                                   REPORT = 1, maxit = 10000))

    # defining the times at which to compute the prevalences
    t_min = min(hosp_augmented$augmented_int)
    t_max = max(hosp_augmented$augmented_int)
    steps = 100L

    # computing prevalences
    prev = prevalence.msm(msm_model, covariates = 'mean', ci = 'normal',
                          times = seq(t_min, t_max, steps))

    # and plotting them using prevplot()
    gof = prevplot(x = msm_model, prev.obj = prev, ci = TRUE, M = TRUE)
Plot fitted survival probabilities from an msm::msm() model and compare
them with Kaplan-Meier estimates. The function can also return the data used
to build each curve.
    survplot(
      x, from = 1, to = NULL, range = NULL, covariates = "mean",
      exacttimes = TRUE, times, grid = 100L, km = FALSE,
      ci = c("none", "normal", "bootstrap"),
      interp = c("start", "midpoint"),
      B = 100L,
      ci_km = c("none", "plain", "log", "log-log", "logit", "arcsin"),
      print_plot = TRUE,
      verbosity = getOption("msmtools.verbosity", "quiet"),
      ...
    )
x |
A fitted msm model object. |
from |
State from which to compute the estimated survival. Defaults to state 1. |
to |
The absorbing state for which the estimated survival is computed.
Defaults to the highest state found by |
range |
A numeric vector of two elements giving the time range of the plot. |
covariates |
Covariate values for which to evaluate the expected
probabilities. These can be
The unnamed list must follow the order of the covariates in the original model formula. A named list is also accepted:
|
exacttimes |
If |
times |
An optional numeric vector giving the times at which to compute the fitted survival. |
grid |
An integer specifying the grid points at which to compute the
fitted survival curve (see Details). If |
km |
If |
ci |
A character vector with the type of confidence intervals to compute for the fitted
survival curve. Specify either |
interp |
If |
B |
Number of bootstrap or normal replicates for the confidence interval. The default is 100 rather than the usual 1000, since these plots are for rough diagnostic purposes. |
ci_km |
A character vector with the type of confidence intervals to compute for the
Kaplan-Meier curve. Specify either |
print_plot |
If |
verbosity |
Controls informational output. Use |
... |
Reserved for the migration trampoline. Passing the legacy
|
The function wraps msm::plot.survfit.msm() and adds support for
exact-time plots by resetting the time scale to follow-up time. It returns
a gg/ggplot object so the plot composes directly with ggplot2::ggsave(),
ggplot2::theme(), and other ggplot operations.
You can pass custom evaluation times through times, or let survplot()
define them from grid. Larger grid values produce a finer grid and
increase computation time.
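The trade-off between `grid` and `times` can be pictured with plain sequences. The sketch below assumes the grid is a set of equally spaced points over the plotting range; this is an illustration of the idea, not a statement of how `survplot()` builds its grid internally, and the time range is hypothetical.

```r
# Hypothetical follow-up window, in days.
time_range <- c(0, 365)

# A coarse and a fine equally spaced grid over the same window.
coarse <- seq(time_range[1], time_range[2], length.out = 10L)
fine   <- seq(time_range[1], time_range[2], length.out = 100L)

# Custom evaluation times chosen by hand instead of a grid.
custom_times <- c(0, 30, 90, 180, 365)

# The finer grid has a smaller step, hence more evaluations of the model.
step_coarse <- diff(coarse)[1]
step_fine   <- diff(fine)[1]
```

A vector such as `custom_times` is the kind of input the `times` argument accepts when specific time points matter more than even coverage.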
A gg/ggplot object. The fitted and (when km = TRUE)
Kaplan-Meier data tables are attached to the returned plot as named
fields:
$fitted — a data.table with columns time, surv, and (when
ci is not "none") lwr / upr. Always present.
$km — a data.table with the Kaplan-Meier curve, exposed only when
km = TRUE.
Access the data through the standard $ operator:
    p <- survplot(model, km = TRUE)
    p          # prints the plot
    p$fitted   # fitted survival data
    p$km       # Kaplan-Meier data
print_plot only controls whether the plot is printed as a side effect.
Returned objects are unchanged: use print_plot = FALSE to build the plot
and its attached data without printing.
Francesco Grossetti [email protected].
Titman, A. and Sharples, L.D. (2010). Model diagnostics for multi-state models, Statistical Methods in Medical Research, 19, 621-651.
Titman, A. and Sharples, L.D. (2008). A general goodness-of-fit test for Markov and hidden Markov models, Statistics in Medicine, 27, 2177-2195.
Jackson, C.H. (2011). Multi-State Models for Panel Data: The msm Package for R. Journal of Statistical Software, 38(8), 1-29. https://www.jstatsoft.org/v38/i08/.
msm::plot.survfit.msm(), msm::msm(),
msm::pmatrix.msm(), data.table::setDF()
    data(hosp)

    # augmenting the data
    hosp_augmented = augment(data = hosp, data_key = subj, n_events = adm_number,
                             pattern = label_3, t_start = dateIN, t_end = dateOUT,
                             t_cens = dateCENS)

    # let's define the initial transition matrix for our model
    Qmat = matrix(data = 0, nrow = 3, ncol = 3, byrow = TRUE)
    Qmat[1, 1:3] = 1
    Qmat[2, 1:3] = 1
    colnames(Qmat) = c('IN', 'OUT', 'DEAD')
    rownames(Qmat) = c('IN', 'OUT', 'DEAD')

    # fitting the model using
    # gender and age as covariates
    library(msm)
    msm_model = msm(status_num ~ augmented_int, subject = subj,
                    data = hosp_augmented, covariates = ~ gender + age,
                    exacttimes = TRUE, gen.inits = TRUE, qmatrix = Qmat,
                    method = 'BFGS',
                    control = list(fnscale = 6e+05, trace = 0,
                                   REPORT = 1, maxit = 10000))

    # plotting the fitted and empirical survival from state = 1
    theplot = survplot(x = msm_model, km = TRUE)

    # the fitted and Kaplan-Meier data tables are attached to the plot
    head(theplot$fitted)
    head(theplot$km)