The Stochastic Process Model (SPM) was developed several decades ago Yashin, Arbeev, Akushevich, et al. (2007), and applied for analyses of clinical, demographic, epidemiologic longitudinal data as well as in many other studies that relate stochastic dynamics of repeated measures to the probability of end-points (outcomes). SPM links the dynamic of stochastical variables with a hazard rate as a quadratic function of the state variables (Yashin, Arbeev, Akushevich, et al. 2007). The R-package, “stpm”, is a set of utilities to estimate parameters of stochastic process and modeling survival trajectories and time-to-event outcomes observed from longitudinal studies. It is a general framework for studying and modeling survival (censored) traits depending on random trajectories (stochastic paths) of variables.
Data represents a typical longitudinal data in form of two datasets: longitudinal dataset (follow-up studies), in which one record represents a single observation, and vital (survival) statistics, where one record represents all information about the subject. Longitudinal dataset cat contain a subject ID (identification number), status (event(1)/censored(0)), time and measurements across the variables.
Below there is an example of clinical data that can be used in
stpm
and we will discuss the fields later.
Longitudinal table:
## ID IndicatorDeath Age DBP BMI
## 1 1 0 30 80.00000 25.00000
## 2 1 0 32 80.51659 26.61245
## 3 1 0 34 77.78412 29.16790
## 4 1 0 36 77.86665 32.40359
## 5 1 0 38 96.55673 31.92014
## 6 1 0 40 94.48616 32.89139
The packate accepts longitudinal data in two formats: “short” and “long”.
## id xi t y1
## 1 1 0 30 36.133394
## 2 1 0 31 -5.539881
## 3 1 0 32 -45.129493
## 4 1 0 33 -82.739624
## 5 1 0 34 -118.469249
## 6 1 0 35 -152.412392
## id xi t1 t2 y1 y1.next
## 1 1 0 30 31 36.133394 -5.539881
## 2 1 0 31 32 -5.539881 -45.129493
## 3 1 0 32 33 -45.129493 -82.739624
## 4 1 0 33 34 -82.739624 -118.469249
## 5 1 0 34 35 -118.469249 -152.412392
## 6 1 0 35 36 -152.412392 -184.658379
There are two main SPM types in the package: discrete-time model (Akushevich, Kulminski, and Manton 2005) and continuous-time model (Yashin, Arbeev, Akushevich, et al. 2007). Discrete model assumes equal intervals between follow-up observations. The example of discrete dataset is given below.
library(stpm)
data <- simdata_discr(N=10) # simulate data for 10 individuals, "long" format (default)
head(data)
## id xi t1 t2 y1 y1.next
## 1 1 0 30 31 36.133394 -5.539881
## 2 1 0 31 32 -5.539881 -45.129493
## 3 1 0 32 33 -45.129493 -82.739624
## 4 1 0 33 34 -82.739624 -118.469249
## 5 1 0 34 35 -118.469249 -152.412392
## 6 1 0 35 36 -152.412392 -184.658379
In this case there are equal intervals between t1 and t2.
In the continuous-time SPM, in which intervals between observations are not equal (arbitrary or random). The example of such dataset is shown below:
library(stpm)
data <- simdata_cont(N=5, format="short") # simulate data for 5 individuals, "short" format
head(data)
## id xi t y1
## 1 0 0 30 71.22668
## 2 0 0 31 28.86212
## 3 0 0 32 -11.43627
## 4 0 0 33 -49.76927
## 5 0 0 34 -86.23273
## 6 0 0 35 -120.91783
The discrete model assumes fixed time intervals between consecutive observations. In this model, Y(t) (a k × 1 matrix of the values of covariates, where k is the number of considered covariates) and μ(t, Y(t)) (the hazard rate) have the following form:
Y(t + 1) = u + RY(t) + ϵ
μ(t, Y(t)) = [μ0 + bY(t) + Y(t)*QY(t)]eθt
Coefficients u (a
k × 1 matrix, where k is a number of covariates), R (a k × k matrix), μ0, b (a 1 × k matrix), Q (a k × k matrix) are assumed
to be constant in the particular implementation of this model in the
R-package stpm
. ϵ are normally-distributed
random residuals, k × 1
matrix. A symbol ’*’ denotes transpose operation. θ is a parameter to be estimated
along with other parameters (u, R, μ0,
b, Q).
library(stpm)
#Data simulation (200 individuals)
data <- simdata_discr(N=100)
#Estimation of parameters
pars <- spm_discrete(data)
## Warning in summary.lm(res): essentially perfect fit: summary may be unreliable
## $dmodel
## $dmodel$theta
## [1] 0.2
##
## $dmodel$mu0
## [1] 9.136222991e-06
##
## $dmodel$b
## [1] 2.477520974e-08
##
## $dmodel$Q
## [,1]
## [1,] 1.678638383e-11
##
## $dmodel$u
## [1] -39.86660585
##
## $dmodel$u.std.err
## (Intercept)
## 7.696060854e-15
##
## $dmodel$R
## [,1]
## [1,] 0.95
##
## $dmodel$R.std.err
## y1_1
## [1,] 1.313344068e-17
##
## $dmodel$Sigma
## [1] 2.291453596e-13
##
##
## $cmodel
## $cmodel$a
## [,1]
## [1,] -0.05
##
## $cmodel$f1
## [,1]
## [1,] -797.3321169
##
## $cmodel$Q
## [,1]
## [1,] 1.678638383e-11
##
## $cmodel$f
## [,1]
## [1,] -737.9555354
##
## $cmodel$b
## [,1]
## [1,] 2.291453596e-13
##
## $cmodel$mu0
## [,1]
## [1,] -5.278592871e-09
##
## $cmodel$theta
## [1] 0.2
##
##
## attr(,"class")
## [1] "spm.discrete"
In the specification of the SPM described in 2007 paper by Yashin and collegaues (Yashin, Arbeev, Akushevich, et al. 2007) the stochastic differential equation describing the age dynamics of a covariate is:
dY(t) = a(t)(Y(t) − f1(t))dt + b(t)dW(t), Y(t = t0)
In this equation, Y(t) (a k × 1 matrix) is the value of a particular covariate at a time (age) t. f1(t) (a k × 1 matrix) corresponds to the long-term mean value of the stochastic process Y(t), which describes a trajectory of individual covariate influenced by different factors represented by a random Wiener process W(t). Coefficient a(t) (a k × k matrix) is a negative feedback coefficient, which characterizes the rate at which the process reverts to its mean. In the area of research on aging, f1(t) represents the mean allostatic trajectory and a(t) represents the adaptive capacity of the organism. Coefficient b(t) (a k × 1 matrix) characterizes a strength of the random disturbances from Wiener process W(t).
The following function μ(t, Y(t)) represents a hazard rate:
μ(t, Y(t)) = μ0(t) + (Y(t) − f(t))*Q(t)(Y(t) − f(t))
here μ0(t) is the baseline hazard, which represents a risk when Y(t) follows its optimal trajectory; f(t) (a k × 1 matrix) represents the optimal trajectory that minimizes the risk and Q(t) (k × k matrix) represents a sensitivity of risk function to deviation from the norm.
## id xi t1 t2 y1 y1.next
## 1 0 0 30 31 71.22667883 28.86212066
## 2 0 0 31 32 28.86212066 -11.43627413
## 3 0 0 32 33 -11.43627413 -49.76927447
## 4 0 0 33 34 -49.76927447 -86.23273469
## 5 0 0 34 35 -86.23273469 -120.91783418
## 6 0 0 35 36 -120.91783418 -153.91130546
#Estimate parameters
# a=-0.05, f1=80, Q=2e-8, f=80, b=5, mu0=2e-5, theta=0.08 are starting values for estimation procedure
pars <- spm_continuous(dat=data,a=-0.05, f1=80, Q=2e-8, f=80, b=5, mu0=2e-5, theta=0.08)
## Parameter a achieved lower/upper bound.
## 0
## Parameter Q achieved lower/upper bound.
## 0
## Parameter b achieved lower/upper bound.
## 5.5
## Parameter mu0 achieved lower/upper bound.
## 1.8e-05
## $a
## [,1]
## [1,] 0
##
## $f1
## [,1]
## [1,] 87.20242295
##
## $Q
## [,1]
## [1,] 0
##
## $f
## [,1]
## [1,] 74.69160644
##
## $b
## [,1]
## [1,] 5.5
##
## $mu0
## [1] 1.8e-05
##
## $theta
## [1] 0.08763157956
##
## $status
## [1] 5
##
## $LogLik
## [1] -25284.7227
##
## $objective
## [1] 25284.7227
##
## $message
## [1] "NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached."
##
## $limit
## [1] TRUE
##
## attr(,"class")
## [1] "spm.continuous"
The coefficient conversion between continuous- and discrete-time models is as follows (‘c’ and ‘d’ denote continuous- and discrete-time models respectively; note: these equations can be used if intervals between consecutive observations of discrete- and continuous-time models are equal; it also required that matrices ac and Qc, d must be full-rank matrices):
Qc = Qd
ac = Rd − I(k)
bc = Σ
f1c = −ac−1 × ud
fc = −0.5bd × Qd−1
μ0c = μ0d − fc × Qc × fc*
θc = θd
where k is a number of
covariates, which is equal to model’s dimension and ’*’ denotes
transpose operation; Σ
is a k × 1 matrix which
contains s.d.
s of corresponding residuals (residuals of a
linear regression Y(t + 1) = u + RY(t) + ϵ;
s.d.
is a standard deviation), I(k) is an identity k × k matrix.
In previous models, we assumed that coefficients is sort of time-dependant: we multiplied them on to eθt. In general, this may not be the case (Yashin, Arbeev, Kulminski, et al. 2007). We extend this to a general case, i.e. (we consider one-dimensional case):
a(t) = par1t + par2 - linear function.
The corresponding equations will be equivalent to one-dimensional continuous case described above.
library(stpm)
#Data preparation:
n <- 10
data <- simdata_time_dep(N=n)
# Estimation:
opt.par <- spm_time_dep(data,
start = list(a = -0.05, f1 = 80, Q = 2e-08, f = 80, b = 5, mu0 = 0.001),
frm = list(at = "a", f1t = "f1", Qt = "Q", ft = "f", bt = "b", mu0t= "mu0"))
opt.par
## $a
## [1] -0.05
##
## $f1
## [1] 80
##
## $Q
## [1] 2e-08
##
## $f
## [1] 80
##
## $b
## [1] 5
##
## $mu0
## [1] 0.001
Lower and upper boundaries can be set up with parameters lb and ub, which represents simple numeric vectors. Note: lengths of lb and ub must be the same as the total length of the parameters. Lower and upper boundaries can be set for continuous-time and time-dependent models only.
Below we show the example of setting up lb and ub when we have a single covariate:
library(stpm)
data <- simdata_cont(N=10, ystart = 80, a = -0.1, Q = 1e-06, mu0 = 1e-5, theta = 0.08, f1 = 80, f=80, b=1, dt=1, sd0=5)
ans <- spm_continuous(dat=data,
a = -0.1,
f1 = 82,
Q = 1.4e-6,
f = 77,
b = 1,
mu0 = 1.6e-5,
theta = 0.1,
stopifbound = FALSE,
lb=c(-0.2, 60, 0.1e-6, 60, 0.1, 0.1e-5, 0.01),
ub=c(0, 140, 5e-06, 140, 3, 5e-5, 0.20))
ans
## $a
## [,1]
## [1,] -0.1081406414
##
## $f1
## [,1]
## [1,] 80.59082153
##
## $Q
## [,1]
## [1,] 2.874769014e-07
##
## $f
## [,1]
## [1,] 73.86860137
##
## $b
## [,1]
## [1,] 1.062949155
##
## $mu0
## [1] 1.297641055e-05
##
## $theta
## [1] 0.1209884327
##
## $status
## [1] 5
##
## $LogLik
## [1] -653.4812429
##
## $objective
## [1] 653.4486801
##
## $message
## [1] "NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached."
##
## $limit
## [1] FALSE
##
## attr(,"class")
## [1] "spm.continuous"
This is an example for two physiological variables (covariates).
library(stpm)
data <- simdata_cont(N=10,
a=matrix(c(-0.1, 0.001, 0.001, -0.1), nrow = 2, ncol = 2, byrow = T),
f1=t(matrix(c(100, 200), nrow = 2, ncol = 1, byrow = F)),
Q=matrix(c(1e-06, 1e-7, 1e-7, 1e-06), nrow = 2, ncol = 2, byrow = T),
f=t(matrix(c(100, 200), nrow = 2, ncol = 1, byrow = F)),
b=matrix(c(1, 2), nrow = 2, ncol = 1, byrow = F),
mu0=1e-4,
theta=0.08,
ystart = c(100,200), sd0=c(5, 10), dt=1)
a.d <- matrix(c(-0.15, 0.002, 0.002, -0.15), nrow = 2, ncol = 2, byrow = T)
f1.d <- t(matrix(c(95, 195), nrow = 2, ncol = 1, byrow = F))
Q.d <- matrix(c(1.2e-06, 1.2e-7, 1.2e-7, 1.2e-06), nrow = 2, ncol = 2, byrow = T)
f.d <- t(matrix(c(105, 205), nrow = 2, ncol = 1, byrow = F))
b.d <- matrix(c(1, 2), nrow = 2, ncol = 1, byrow = F)
mu0.d <- 1.1e-4
theta.d <- 0.07
ans <- spm_continuous(dat=data,
a = a.d,
f1 = f1.d,
Q = Q.d,
f = f.d,
b = b.d,
mu0 = mu0.d,
theta = theta.d,
lb=c(-0.5, ifelse(a.d[2,1] > 0, a.d[2,1]-0.5*a.d[2,1], a.d[2,1]+0.5*a.d[2,1]), ifelse(a.d[1,2] > 0, a.d[1,2]-0.5*a.d[1,2], a.d[1,2]+0.5*a.d[1,2]), -0.5,
80, 100,
Q.d[1,1]-0.5*Q.d[1,1], ifelse(Q.d[2,1] > 0, Q.d[2,1]-0.5*Q.d[2,1], Q.d[2,1]+0.5*Q.d[2,1]), ifelse(Q.d[1,2] > 0, Q.d[1,2]-0.5*Q.d[1,2], Q.d[1,2]+0.5*Q.d[1,2]), Q.d[2,2]-0.5*Q.d[2,2],
80, 100,
0.1, 0.5,
0.1e-4,
0.01),
ub=c(-0.08, 0.002, 0.002, -0.08,
110, 220,
Q.d[1,1]+0.1*Q.d[1,1], ifelse(Q.d[2,1] > 0, Q.d[2,1]+0.1*Q.d[2,1], Q.d[2,1]-0.1*Q.d[2,1]), ifelse(Q.d[1,2] > 0, Q.d[1,2]+0.1*Q.d[1,2], Q.d[1,2]-0.1*Q.d[1,2]), Q.d[2,2]+0.1*Q.d[2,2],
110, 220,
1.5, 2.5,
1.2e-4,
0.10))
ans
## $a
## [,1] [,2]
## [1,] -0.15048619407 0.001704476221
## [2,] 0.00198434074 -0.149192508119
##
## $f1
## [,1]
## [1,] 104.6520699
## [2,] 196.9914376
##
## $Q
## [,1] [,2]
## [1,] 1.206969065e-06 1.214163166e-07
## [2,] 1.209687455e-07 1.204501009e-06
##
## $f
## [,1]
## [1,] 105.0660281
## [2,] 204.8717954
##
## $b
## [,1]
## [1,] 1.129488682
## [2,] 1.940468299
##
## $mu0
## [1] 0.0001110763846
##
## $theta
## [1] 0.07381010072
##
## $status
## [1] 5
##
## $LogLik
## [1] 1426.265177
##
## $objective
## [1] -2113.244205
##
## $message
## [1] "NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached."
##
## $limit
## [1] FALSE
##
## attr(,"class")
## [1] "spm.continuous"
This model uses only one covariate, therefore setting-up model parameters is easy:
n <- 10
data <- simdata_time_dep(N=n)
# Estimation:
opt.par <- spm_time_dep(data, start=list(a=-0.05, f1=80, Q=2e-08, f=80, b=5, mu0=0.001),
lb=c(-1, 30, 1e-8, 30, 1, 1e-6), ub=c(0, 120, 5e-8, 130, 10, 1e-2))
opt.par
## $a
## [1] -0.05
##
## $f1
## [1] 80
##
## $Q
## [1] 2e-08
##
## $f
## [1] 80
##
## $b
## [1] 5
##
## $mu0
## [1] 0.001
Imagine a situation when one parameter function you want to be equal to zero: f = 0. Let’s emulate this case:
library(stpm)
n <- 10
data <- simdata_time_dep(N=n)
# Estimation:
opt.par <- spm_time_dep(data, frm = list(at="a", f1t="f1", Qt="Q", ft="0", bt="b", mu0t="mu0"))
opt.par
## $a
## [1] -0.05
##
## $f1
## [1] 80
##
## $Q
## [1] 2e-08
##
## $b
## [1] 80
##
## $mu0
## [1] 5
##
## $<NA>
## <NA>
## NA
As you can see, there is no parameter f in opt.par. This is because we set f = 0 in frm!
Then, is you want to set the constraints, you must not specify the starting value (parameter start) and lb/ub for the parameter f (otherwise, the function raises an error):
library(stpm)
set.seed(12345678)
n <- 10
data <- simdata_time_dep(N=n, format = "long")
# Estimation:
opt.par <- spm_time_dep(data, frm = list(at="a", f1t="f1", Qt="Q", ft="0", bt="b", mu0t="mu0"),
start=list(a=-0.05, f1=80, Q=2e-08, b=5, mu0=0.001),
lb=c(-1, 30, 1e-8, 1, 1e-6), ub=c(0, 120, 5e-8, 10, 1e-2),
opts=list(maxit=100),
verbose=F)
## Warning in nloptr.add.default.options(opts.user = opts, x0 = x0,
## num_constraints_ineq = num_constraints_ineq, : No termination criterion
## specified, using default(relative x-tolerance = 1e-04)
## $a
## [1] -0.04572136589
##
## $f1
## [1] 84.44709945
##
## $Q
## [1] 2.280898373e-08
##
## $b
## [1] 4.808318865
##
## $mu0
## [1] 0.0007809303061
##
## $status
## [1] 5
##
## $LogLik
## t2
## -1663.057876
##
## $objective
## [1] 1663.044784
##
## $message
## [1] "NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached."
You can do the same manner if you want two or more parameters to be equal to zero.
Function spm_con_1d(...)
allows for very fast parameter
estimating for one-dimensional model. This function implements a
analytical solution to estimate the parameters in the continuous SPM
model by assuming all the parameters are constants. Below there is an
example.
library(stpm)
dat <- simdata_cont(N=500)
colnames(dat) <- c("id", "xi", "t1", "t2", "y", "y.next")
res <- spm_con_1d(as.data.frame(dat), a=-0.05, b=2, q=1e-8, f=80, f1=90, mu0=1e-3, theta=0.08)
## [1] "Initial values:"
## [1] -5e-02 2e+00 1e-08 8e+01 9e+01 1e-03 8e-02
## [1] "Lower bounds:"
## [1] -2.5e-01 2.0e-01 1.0e-10 4.0e+01 4.5e+01 1.0e-04 8.0e-02
## [1] "Upper bounds:"
## [1] -2.5e-03 1.0e+01 1.0e-07 1.6e+02 1.8e+02 1.0e-02 8.0e-02
## $est
## Coeff. Std. Err. z p value
## a -0.04728521921 NaN NaN NaN
## b 4.95976759335 NaN NaN NaN
## q 0.00000000010 NaN NaN NaN
## f 80.00000021765 NaN NaN NaN
## f1 79.59924722657 NaN NaN NaN
## mu0 0.00010000000 NaN NaN NaN
## theta 0.08000000000 NaN NaN NaN
##
## $hessian
## a b q f f1
## a 2.842052064e+05 6.287726309e+03 NaN -6.605492298e-07 1.757358554e+01
## b 6.287726309e+03 1.734927910e+03 NaN -2.860081590e-13 9.859894411e-08
## q NaN NaN NaN NaN NaN
## f -6.605492298e-07 -2.860081590e-13 NaN 5.806735047e-06 3.863989778e-08
## f1 1.757358554e+01 9.859894411e-08 NaN 3.863989778e-08 2.904332100e+00
## mu0 1.016027068e+00 2.082976521e-02 NaN 4.246321921e-03 -2.305176046e-04
## theta 1.430083613e-02 2.906070796e-04 NaN 3.822526769e-05 -2.025747341e-06
## mu0 theta
## a 1.016027068e+00 1.430083613e-02
## b 2.082976521e-02 2.906070796e-04
## q NaN NaN
## f 4.246321921e-03 3.822526769e-05
## f1 -2.305176046e-04 -2.025747341e-06
## mu0 -1.898989914e+08 1.887572531e+09
## theta 1.887572531e+09 1.744326871e+07
##
## $lik
## [1] 50656.62289
##
## $con
## [1] 4
##
## $message
## [1] "NLOPT_XTOL_REACHED: Optimization stopped because xtol_rel or xtol_abs (above) was reached."
We added one- and multi- dimensional simulation to be able to generate test data for hyphotesis testing. Data, which can be simulated can be discrete (equal intervals between observations) and continuous (with arbitrary intervals).
The corresponding function is (k
- a number of
variables(covariates), equal to model’s dimension):
simdata_discr(N=100, a=-0.05, f1=80, Q=2e-8, f=80, b=5, mu0=1e-5, theta=0.08, ystart=80, tstart=30, tend=105, dt=1)
Here:
N
- Number of individuals
a
- A matrix of k
xk
, which
characterize the rate of the adaptive response
f1
- A particular state, which if a deviation from the
normal (or optimal). This is a vector with length of k
Q
- A matrix of k
by k
, which
is a non-negative-definite symmetric matrix
f
- A vector-function (with length k
) of
the normal (or optimal) state
b
- A diffusion coefficient, k
by
k
matrix
mu0
- mortality at start period of time (baseline
hazard)
theta
- A displacement coefficient of the Gompertz
function
ystart
- A vector with length equal to number of
dimensions used, defines starting values of covariates
tstart
- A number that defines a start time (30 by
default). Can be a number (30 by default) or a vector of two numbers:
c(a, b) - in this case, starting value of time is simulated via
uniform(a,b) distribution.
tend
- A number, defines a final time (105 by
default)
dt
- A time interval between observations.
This function returns a table with simulated data, as shown in example below:
## id xi t1 t2 y1 y1.next
## 1 1 0 30 31 76.72988949 90.09455399
## 2 1 0 31 32 90.09455399 91.33518318
## 3 1 0 32 33 91.33518318 94.66502736
## 4 1 0 33 34 94.66502736 90.56756062
## 5 1 0 34 35 90.56756062 90.94216518
## 6 1 0 35 36 90.94216518 96.92507293
The corresponding function is (k
- a number of
variables(covariates), equal to model’s dimension):
simdata_cont(N=100, a=-0.05, f1=80, Q=2e-07, f=80, b=5, mu0=2e-05, theta=0.08, ystart=80, tstart=c(30,50), tend=105)
Here:
N
- Number of individuals
a
- A matrix of k
xk
, which
characterize the rate of the adaptive response
f1
- A particular state, which if a deviation from the
normal (or optimal). This is a vector with length of k
Q
- A matrix of k
by k
, which
is a non-negative-definite symmetric matrix
f
- A vector-function (with length k
) of
the normal (or optimal) state
b
- A diffusion coefficient, k
by
k
matrix
mu0
- mortality at start period of time (baseline
hazard)
theta
- A displacement coefficient of the Gompertz
function
ystart
- A vector with length equal to number of
dimensions used, defines starting values of covariates
tstart
- A number that defines a start time (30 by
default). Can be a number (30 by default) or a vector of two numbers:
c(a, b) - in this case, starting value of time is simulated via
uniform(a,b) distribution.
tend
- A number, defines a final time (105 by
default)
This function returns a table with simulated data, as shown in example below:
## id xi t1 t2 y1 y1.next
## 1 0 0 33.55365299 34.72824423 78.74051630 82.73685625
## 2 0 0 34.72824423 36.04201013 82.73685625 86.14131099
## 3 0 0 36.04201013 37.05205751 86.14131099 92.98495900
## 4 0 0 37.05205751 38.74860222 92.98495900 93.42421795
## 5 0 0 38.74860222 40.36727874 93.42421795 97.79917509
## 6 0 0 40.36727874 41.78302970 97.79917509 107.44126009
Stochastic Process Model has many applications in analysis of longitudinal biodemographic data. Such data contain various physiological variables (known as covariates). Data can also potentially contain genetic information available for all or a part of participants. Taking advantage from both genetic and non-genetic information can provide future insights into a broad range of processes describing aging-related changes in the organism.
In this package, SPM with partially observed covariates is implemented in form of GenSPM (Genetic SPM), presented in (Arbeev et al. 2009) and further advanced in (Arbeev et al. 2014), further elaborates the basic stochastic process model conception by introducing a categorical variable, Z, which may be a specific value of a genetic marker or, in general, any categorical variable. Currently, Z has two gradations: 0 or 1 in a genetic group of interest, assuming that P(Z = 1) = p, p ∈ [0, 1], were p is the proportion of carriers and non-carriers of an allele in a population. Example of longitudinal data with genetic component Z is provided below.
## id xi t1 t2 Z y1 y1.next
## 1 0 0 42.60856507 43.54631015 0 79.46890355 80.13789316
## 2 0 0 43.54631015 44.53389011 0 80.13789316 77.49058422
## 3 0 0 44.53389011 45.58949121 0 77.49058422 75.51496225
## 4 0 0 45.58949121 46.66869639 0 75.51496225 68.82093006
## 5 0 0 46.66869639 47.58379891 0 68.82093006 66.42902739
## 6 0 0 47.58379891 48.55247453 0 66.42902739 69.48593288
In the specification of the SPM described in 2007 paper by Yashin and colleagues (Yashin, Arbeev, Akushevich, et al. 2007) the stochastic differential equation describing the age dynamics of a physiological variable (a dynamic component of the model) is:
dY(t) = a(Z, t)(Y(t) − f1(Z, t))dt + b(Z, t)dW(t), Y(t = t0)
Here in this equation, Y(t) is a k × 1 matrix, where k is a number of covariates, which is a model dimension) describing the value of a physiological variable at a time (e.g. age) t. f1(Z, t) is a k × 1 matrix that corresponds to the long-term average value of the stochastic process Y(t), which describes a trajectory of individual variable influenced by different factors represented by a random Wiener process W(t). The negative feedback coefficient a(Z, t) (k × k matrix) characterizes the rate at which the stochastic process goes to its mean. In research on aging and well-being, f1(Z, t) represents the average allostatic trajectory and a(t) in this case represents the adaptive capacity of the organism. Coefficient b(Z, t) (k × 1 matrix) characterizes a strength of the random disturbances from Wiener process W(t). All of these parameters depend on Z (a genetic marker having values 1 or 0). The following function μ(t, Y(t)) represents a hazard rate:
μ(t, Y(t)) = μ0(t) + (Y(t) − f(Z, t))*Q(Z, t)(Y(t) − f(Z, t))
In this equation: μ0(t) is the baseline hazard, which represents a risk when Y(t) follows its optimal trajectory; f(t) (k × 1 matrix) represents the optimal trajectory that minimizes the risk and Q(Z, t) (k × k matrix) represents a sensitivity of risk function to deviation from the norm. In general, model coefficients a(Z, t), f1(Z, t), Q(Z, t), f(Z, t), b(Z, t) and μ0(t) are time(age)-dependent. Once we have data, we then can run analysis, i.e. estimate coefficients (they are assumed to be time-independent and data here is simulated):
## id xi t1 t2 Z y1 y1.next
## 1 0 0 60.43522445 61.35490666 0 79.92975032 76.33666649
## 2 0 0 61.35490666 62.42486745 0 76.33666649 76.55228226
## 3 0 0 62.42486745 63.32906272 0 76.55228226 86.59959121
## 4 0 0 63.32906272 64.40859637 0 86.59959121 91.24608750
## 5 0 0 64.40859637 65.37821039 0 91.24608750 86.17621957
## 6 0 0 65.37821039 66.43843309 0 86.17621957 80.94748698
## $aH
## [,1]
## [1,] -0.05221059994
##
## $aL
## [,1]
## [1,] -0.01086988917
##
## $f1H
## [,1]
## [1,] 59.29492659
##
## $f1L
## [,1]
## [1,] 76.82412031
##
## $QH
## [,1]
## [1,] 2.173260802e-08
##
## $QL
## [,1]
## [1,] 2.404768776e-08
##
## $fH
## [,1]
## [1,] 61.91026501
##
## $fL
## [,1]
## [1,] 81.51327953
##
## $bH
## [,1]
## [1,] 3.628298374
##
## $bL
## [,1]
## [1,] 5.131722152
##
## $mu0H
## [1] 8.723121794e-06
##
## $mu0L
## [1] 9.008470832e-06
##
## $thetaH
## [1] 0.07398114409
##
## $thetaL
## [1] 0.09007986209
##
## $p
## [1] 0.2275224063
##
## $limit
## [1] FALSE
##
## attr(,"class")
## [1] "pobs.spm"
Here and represents parameters when Z = 1 (H) and 0 (L).
###Joint analysis of two datasets: first dataset with genetic and second dataset with non-genetic component
## id xi t1 t2 Z y1 y1.next
## 1 0 0 48.23854222 49.15756343 0 80.53200283 79.41837738
## 2 0 0 49.15756343 50.13602914 0 79.41837738 87.01018905
## 3 0 0 50.13602914 51.13998235 0 87.01018905 92.19099604
## 4 0 0 51.13998235 52.04271033 0 92.19099604 88.25306909
## 5 0 0 52.04271033 53.03533207 0 88.25306909 84.70535708
## 6 0 0 53.03533207 54.03207848 0 84.70535708 74.83127361
## id xi t1 t2 y1 y1.next
## 1 0 0 65.79418095 66.78746433 81.35063444 75.86832421
## 2 0 0 66.78746433 67.84482489 75.86832421 74.75148863
## 3 0 0 67.84482489 68.92627769 74.75148863 69.03274169
## 4 0 0 68.92627769 69.83835934 69.03274169 70.07873222
## 5 0 0 69.83835934 70.82032656 70.07873222 66.51917477
## 6 0 0 70.82032656 71.79690020 66.51917477 71.03487617
## $aH
## [,1]
## [1,] -0.04899015454
##
## $aL
## [,1]
## [1,] -0.01098688742
##
## $f1H
## [,1]
## [1,] 54.84681361
##
## $f1L
## [,1]
## [1,] 82.2075927
##
## $QH
## [,1]
## [1,] 2.195962134e-08
##
## $QL
## [,1]
## [1,] 2.578243881e-08
##
## $fH
## [,1]
## [1,] 55.13856896
##
## $fL
## [,1]
## [1,] 87.53707199
##
## $bH
## [,1]
## [1,] 3.722342404
##
## $bL
## [,1]
## [1,] 5.023189245
##
## $mu0H
## [1] 8.784418358e-06
##
## $mu0L
## [1] 9.000479634e-06
##
## $thetaH
## [1] 0.07200528353
##
## $thetaL
## [1] 0.0900005526
##
## $p
## [1] 0.2714572645
##
## $limit
## [1] FALSE
##
## attr(,"class")
## [1] "pobs.spm"
Here mode ‘observed’ is used for simlation of data with genetic component Z and ‘unobserved’ - without genetic component.
This type of SPM also uses genetic component by analogy from the previous chapters but uses explicit gradient function which speeds up computations significantly. See (He et al. 2017) for details. Below we provide examples of usage:
## id xi t1 t2 y y.next
## 1 1 0 30 31 2.000000000 2.024328135
## 2 1 0 31 32 2.024328135 1.927486318
## 3 1 0 32 33 1.927486318 1.899083801
## 4 1 0 33 34 1.899083801 2.061574385
## 5 1 0 34 35 2.061574385 2.034558435
## 6 1 0 35 36 2.034558435 2.114382051
## id geno
## 1 1 1
## 2 2 1
## 3 3 0
## 4 4 0
## 5 5 1
## 6 6 0
res <- spm_con_1d_g(spm_data=ex_data$spm_data,
gene_data=ex_data$gene_data,
a = -0.02, b=0.2, q=0.01, f=3, f1=3, mu0=0.01, theta=1e-05,
upper=c(-0.01,3,0.1,10,10,0.1,1e-05), lower=c(-1,0.01,0.00001,1,1,0.001,1e-07),
effect=c('q'), method = "tnewton")
## [1] "Initial values:"
## [1] -2e-02 -2e-02 2e-01 2e-01 1e-02 1e-02 3e+00 3e+00 3e+00 3e+00
## [11] 1e-02 1e-02 1e-05
## [1] "Lower bounds:"
## [1] -1e+00 -1e+00 1e-02 1e-02 1e-05 1e-05 1e+00 1e+00 1e+00 1e+00
## [11] 1e-03 1e-03 1e-07
## [1] "Upper bounds:"
## [1] -1e-02 -1e-02 3e+00 3e+00 1e-01 1e-01 1e+01 1e+01 1e+01 1e+01
## [11] 1e-01 1e-01 1e-05
## $est
## Coeff. Std. Err. z p value
## a -0.031238526396 9.063948538e-04 -3.446458932e+01 0.0000000000
## b 0.101329776535 2.785757812e-04 3.637422323e+02 0.0000000000
## q_0 0.000010000000 3.124726125e-03 3.200280473e-03 0.9974465500
## q_2 0.003608789876 5.278943598e-03 6.836197071e-01 0.4942153371
## f 2.999731527110 3.052248406e-02 9.827940351e+01 0.0000000000
## f1 3.000056947322 1.716252597e-02 1.748027622e+02 0.0000000000
## mu0 0.001000000000 1.552501390e+04 6.441218067e-08 0.9999999486
## theta 0.000000100000 8.987117036e+01 1.112703880e-09 0.9999999991
##
## $lik
## [1] -121546.402
##
## $con
## [1] -1
##
## $message
## [1] "NLOPT_FAILURE: Generic failure code."
##
## $hessian
## a b q_0 q_2
## a 2.354652039e+06 6.786921794e+05 158.057666352 82.3853518571
## b 6.786921794e+05 1.325446527e+07 1517.123710219 775.1749316100
## q_0 1.580576664e+02 1.517123710e+03 -39.193663230 -13.8346703134
## q_2 8.238535186e+01 7.751749316e+02 -13.834670313 -13.8346664507
## f 3.489485710e-01 -4.196215027e-03 36528.531980501 19000.8175925304
## f1 -8.446522178e+04 5.065699722e+01 -2.840123131 -1.4581382824
## mu0 2.228043741e-03 8.307251846e-04 1.053761684 0.6444883911
## theta 4.230830818e-01 1.744572073e-01 194.433867000 117.7827687934
## f f1 mu0 theta
## a 3.489485710e-01 -8.446522178e+04 2.228043741e-03 0.423083081841
## b -4.196215027e-03 5.065699722e+01 8.307251846e-04 0.174457207322
## q_0 3.652853198e+04 -2.840123131e+00 1.053761684e+00 194.433867000043
## q_2 1.900081759e+04 -1.458138282e+00 6.444883911e-01 117.782768793404
## f 1.670661873e+02 -1.202719447e-02 3.587745596e-03 0.465787947178
## f1 -1.202719447e-02 6.470562259e+03 -5.237961886e-05 -0.006925314665
## mu0 3.587745596e-03 -5.237961886e-05 0.000000000e+00 0.000000000000
## theta 4.657879472e-01 -6.925314665e-03 0.000000000e+00 0.000000000000
##
## $beta
## Coeff. Std. Err. Chi. Sq p value
## beta_a NA NA NA NA
## beta_b NA NA NA NA
## beta_q 0.001799394938 0.00420183465 0.0007705734312 0.9778542036
## beta_f NA NA NA NA
## beta_f1 NA NA NA NA
## beta_mu0 NA NA NA NA
Here: spm_data
- A dataset for the SPM model. See the
STPM package for more details about the format.
gene_data
- A two column dataset containing the
genotypes for the individuals in spm_data. The first column
id
is the ID of the individuals in dataset
spm_data
, and the second column geno
is the
genotype.
a
- The initial value for the paramter . The initial
value will be predicted if not specified.
b
- The initial value for the paramter . The initial
value will be predicted if not specified.
q
- The initial value for the paramter . The initial
value will be predicted if not specified.
f
- The initial value for the paramter . The initial
value will be predicted if not specified.
f1
- The initial value for the paramter . The initial
value will be predicted if not specified.
mu0
- The initial value for the paramter in the baseline
hazard. The initial value will be predicted if not specified.
theta
- The initial value for the paramter in the
baseline hazard. The initial value will be predicted if not
specified.
lower
- A vector of the lower bound of the
parameters.
upper
- A vector of the upper bound of the
parameters.
effect
- A character vector of the parameters that are
linked to genotypes. The vector can contain any combination of , , , ,
.
control
- A list of the control parameters for the
optimization paramters.
global
- A logical variable indicating whether the MLSL
(TRUE) or the L-BFGS (FALSE) algorithm is used for the optimization.
verbose
- A logical variable indicating whether initial
information is printed.
ahessian
- A logical variable indicating whether the
approximate (FALSE) or analytical (TRUE) Hessian is returned.
est
- The estimates of the parameters.
hessian
- The Hessian matrix of the estimates.
lik
- The minus log-likelihood.
con
- A number indicating the convergence. See the
‘nloptr’ package for more details.
message
- Extra message about the convergence. See the
‘nloptr’ package for more details.
beta
- The coefficients of the genetic effect on the
parameters to be linked to genotypes.
The SPM offers longitudinal data imputation with results that are better than from other imputation tools since it preserves data structure, i.e. relation between Y(t) and mu(Y(t),t). Below there are two examples of multiple data imputation with function spm.impute(…).
library(stpm)
#######################################################################
############## One dimensional case (one covariate) ###################
#######################################################################
## Data preparation (short format)#
data <- simdata_discr(N=1000, dt = 2, format="short")
miss.id <- sample(x=dim(data)[1], size=round(dim(data)[1]/4)) # ~25% missing data
incomplete.data <- data
incomplete.data[miss.id,4] <- NA
# End of data preparation #
##### Multiple imputation with SPM #####
imp.data <- spm.impute(x=incomplete.data, id=1, case="xi", t1=3, covariates="y1", minp=1, theta_range=seq(0.075, 0.09, by=0.001))$imputed
##### Look at the incomplete data with missings #####
head(incomplete.data)
## id xi t y1
## 1 1 0 30 82.07656260
## 2 1 0 32 78.21157461
## 3 1 0 34 NA
## 4 1 0 36 68.51273186
## 5 1 0 38 72.56533923
## 6 1 0 40 NA
## id xi t y1
## 1 1 0 30 82.07656260
## 2 1 0 32 78.21157461
## 3 1 0 34 78.38950154
## 4 1 0 36 68.51273186
## 5 1 0 38 72.56533923
## 6 1 0 40 73.01533970
#########################################################
################ Two-dimensional case ###################
#########################################################
## Parameters for data simulation #
a <- matrix(c(-0.05, 0.01, 0.01, -0.05), nrow=2)
f1 <- matrix(c(90, 30), nrow=1, byrow=FALSE)
Q <- matrix(c(1e-7, 1e-8, 1e-8, 1e-7), nrow=2)
f0 <- matrix(c(80, 25), nrow=1, byrow=FALSE)
b <- matrix(c(5, 3), nrow=2, byrow=TRUE)
mu0 <- 1e-04
theta <- 0.07
ystart <- matrix(c(80, 25), nrow=2, byrow=TRUE)
## Data preparation #
data <- simdata_discr(N=1000, a=a, f1=f1, Q=Q, f=f0, b=b, ystart=ystart, mu0 = mu0, theta=theta, dt=2, format="short")
## Delete some observations in order to have approx. 25% missing data
incomplete.data <- data
miss.id <- sample(x=dim(data)[1], size=round(dim(data)[1]/4))
incomplete.data <- data
incomplete.data[miss.id,4] <- NA
miss.id <- sample(x=dim(data)[1], size=round(dim(data)[1]/4))
incomplete.data[miss.id,5] <- NA
## End of data preparation #
###### Multiple imputation with SPM #####
imp.data <- spm.impute(x=incomplete.data, id=1, case="xi", t1=3, covariates=c("y1", "y2"), minp=1, theta_range=seq(0.060, 0.07, by=0.001))$imputed
###### Look at the incomplete data with missings #####
head(incomplete.data)
## id xi t y1 y2
## 1 1 0 30 81.33522205 22.82389494
## 2 1 0 32 90.25707847 NA
## 3 1 0 34 NA NA
## 4 1 0 36 87.94061016 23.22128375
## 5 1 0 38 87.19685070 29.43091772
## 6 1 0 40 95.10697984 NA
## id xi t y1 y2
## 1 1 0 30 81.33522205 22.82389494
## 2 1 0 32 90.25707847 22.89880634
## 3 1 0 34 90.21374425 23.04351001
## 4 1 0 36 87.94061016 23.22128375
## 5 1 0 38 87.19685070 29.43091772
## 6 1 0 40 95.10697984 29.22900829
We provide a simple function to predict the next value of . Refer to the example below:
#library(stpm)
#data <- simdata_discr(N=100, format="long")
#res <- spm_discrete(data)
#splitted <- split(data, data$id)
#df <- data.frame()
#lapply(1:100, function(i) {df<<-rbind(df,splitted[[i]][dim(splitted[[i]])[1],c("id", "xi", "t1", "y1")])})
#names(df) <- c("id", "xi", "t", "y")
#predicted <- predict(object=res, data=df, dt=3)
#head(predicted)
The package offers following five hypotheses to test for function (Arbeev et al. 2016):
H01
: Q(t) = 0 (i.e., aQ = 0 and bQ = 0,so that
there is no quadratic term in the hazard rate and mortality is described
by the baseline Gompertz rate μ0(t)).
H02
: Q(t) = aQ
(i.e., bQ = 0).
H03
: f1(t) = 0 (i.e.,
af1 = 0
and bf1 = 0).
H04
: f1(t) = af1
(i.e., bf1 = 0).
H05
: a(t) = aY
(i.e., bY = 0).
To perform hypothesis testing you should put the variable
lrtest
to TRUE
(this is "H01"
by
default) or to any of the following: "H01"
,
"H02"
, "H03"
, "H04"
,
"H05"
.
library(stpm)
n <- 1000
# Data simulation:
data <- simdata_time_dep(N=n, format="long")
head(data)
# Hypotheses testing
## H01
res <- spm_time_dep(data, verbose=F,
frm = list(at="a", f1t="f1", Qt="Q", ft="f", bt="b", mu0t="mu0"),
start=list(a=-0.05, f1=80, Q=1e-8, f=90, b=5, mu0=0.001),
lb=c(a=-1, f1=30, Q=1e-9, f=10, b=1, mu0=1e-6),
ub=c(a=0, f1=120, Q=1e-7, f=150, b=10, mu0=1e-2),
opts = list(algorithm = "NLOPT_LN_NELDERMEAD",
maxeval = 200, ftol_rel = 1e-12), lrtest="H01")
res$alternative$lr.test.pval
## H02
res <- spm_time_dep(data, verbose=F,
frm = list(at="a", f1t="f1", Qt="1e-6", ft="f", bt="b", mu0t="mu0"),
start=list(a=-0.05, f1=80, Q=1e-8, f=90, b=5, mu0=0.001),
lb=c(a=-1, f1=30, Q=1e-9, f=10, b=1, mu0=1e-6),
ub=c(a=0, f1=120, Q=1e-7, f=150, b=10, mu0=1e-2),
opts = list(algorithm = "NLOPT_LN_NELDERMEAD",
maxeval = 200, ftol_rel = 1e-12), lrtest="H02")
res$alternative$lr.test.pval
## H03
res <- spm_time_dep(data, verbose=F,
frm = list(at="a", f1t="f1", Qt="Q", ft="f", bt="b", mu0t="mu0"),
start=list(a=-0.05, f1=80, Q=1e-8, f=90, b=5, mu0=0.001),
ub=c(a=0, f1=120, Q=1e-7, f=150, b=10, mu0=1e-2),
opts = list(algorithm = "NLOPT_LN_NELDERMEAD",
maxeval = 200, ftol_rel = 1e-12), lrtest="H03")
res$alternative$lr.test.pval
## H04
res <- spm_time_dep(data, verbose=F,
frm = list(at="a", f1t="120", Qt="Q", ft="f", bt="b", mu0t="mu0"),
start=list(a=-0.05, f1=80, Q=1e-8, f=90, b=5, mu0=0.001),
lb=list(a=-1, f1=30, Q=1e-9, f=10, b=1, mu0=1e-6),
opts = list(algorithm = "NLOPT_LN_NELDERMEAD",
maxeval = 200, ftol_rel = 1e-12), lrtest="H04")
res$alternative$lr.test.pval
## H05
res <- spm_time_dep(data, verbose=F,
frm = list(at="-0.1", f1t="f1", Qt="Q", ft="f", bt="b", mu0t="mu0"),
start=list(a=-0.05, f1=80, Q=1e-8, f=90, b=5, mu0=0.001),
opts = list(algorithm = "NLOPT_LN_NELDERMEAD",
maxeval = 200, ftol_rel = 1e-12), lrtest="H05")
res$alternative$lr.test.pval