--- title: "Colossus Description" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Colossus Description} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(Colossus) library(data.table) library(parallel) ``` ## Colossus Description ## Model Structure At its full potential, Colossus is able to analyze a vast number of possible risk/rate models. Specifically Colossus is designed to allow for a combination of linear and non-linear models to estimate cox proportional hazard ratios, Poisson model rates, and Fine-Grey Competing risks ratios. The simplest of which is an exponential model. The simplest model generally used for hazard ratios in survival analysis is the exponential of a linear function of covariates.

$$ \begin{aligned} R(\vec{\beta},\vec{x})=\exp(\vec{\beta} \cdot \vec{x}) \end{aligned} $$

In Colossus, this general model is extended by abstracting to the sum and product of terms and subterms. In this vignette, the value calculated to quantify the risk of an event will be denoted as the risk, R, however the exact meaning differs. In Cox proportional hazard modeling, the risk calculated is a hazard ratio. In Poisson modeling the risk calculated is an estimated number of events per person-year. The risk ($R$) has a set formula dependent on terms ($T$). Each term has a formula dependent on the product of subterms ($S$). Each subterm is a function of a single covariate ($x$) and several parameters ($\alpha,\beta$). There are currently five types of risk models available. The risk can be expressed as an additive model ($R_A$), product additive model ($R_{PA}$), Product additive excess model ($R_{PAE}$), multiplicative excess model ($R_M$), or the geometric mixture model with relative risks or excess risks ($R_{GMIX}$).

$$ \begin{aligned} R_{A}= \sum_{i=0}^n T_i\\ R_{PA}= T_0 \times \sum_{i=1}^n T_i\\ R_{PAE}= T_0 \times (1 + \sum_{i=1}^n T_i)\\ R_{M}= T_0 \times \prod_{i=1}^n( 1 + T_i)\\ R_{GMIX} = T_0 \times \left(\prod_{i=1}^n(T^{*}_i) \right) ^ \theta \times \left( 1 + \sum_{i=1}^n (T^{*}_i - 1) \right)^{1-\theta}\\ T^{*}_i = \begin{cases} T_i &\text{Relative Risk} \\ T_i+1 &\text{Excess Risk} \end{cases}\\ \end{aligned} $$

Each term is composed of a combination of 4 types of subterms. Every covariate is part of a log-linear subterm, a linear subterm, a product-linear subterm, or a general non-linear term. The log-linear subterm ($S_{LL}$) is the exponential of a linear combination of covariates. The linear subterm ($S_L$) is a linear combination of covariates. The product-linear subterm ($S_{PL}$) is one plus a linear combination of covariates. The general non-linear term ($S_{NL}$) is the sum of exponential terms, quadratic terms, linear-threshold terms ($F_{LT}$), step function terms ($F_{STP}$), linear-quadratic terms ($F_{LQ}$), and linear-exponential terms ($F_{LEXP}$). Each term is the product of the non-empty subterms.

$$ \begin{aligned} S_{LL}=\prod_{i} (\exp{(x_i \cdot \beta_i)})\\ S_{L}=\sum_i (x_i \cdot \beta_i)\\ S_{PL}=1+ \sum (x_i \cdot \beta_i)\\ S_{NL}=\sum_i (\alpha_i \times \exp(x_i \cdot \beta_i)) + \sum_i (\beta_i \cdot (x_i)^2) + \sum_i F_{LT} + \sum_i F_{STP} + \sum_i F_{LQ} + \sum_i F_{LEXP}\\ F_{LT} = \begin{cases} \alpha_i \cdot (x-\beta_i) & (x>\beta_i) \\ 0 &\text{else} \end{cases}\\ F_{STP} = \begin{cases} \alpha_i & (x>\beta_i) \\ 0 &\text{else} \end{cases}\\ F_{LQ} = \begin{cases} \beta_i \cdot x & (x>\alpha_i) \\ \lambda_i \cdot x^2 + \nu_i &\text{else} \end{cases}\\ F_{LEXP} = \begin{cases} \beta_i \cdot x & (x>\alpha_i) \\ \lambda_i - \exp{(\nu_i + \mu \cdot x)} &\text{else} \end{cases}\\ T_j=S_{LL,j} \times S_{L,j} \times S_{PL,j} \times S_{NL,j} \end{aligned} $$

In short, every element of the risk model has a subterm type, term number, covariate, and parameter value. ## Using The Standard Model In Colossus every equation is defined in a very similar fashion. On the user side of the code, elements of the risk equation can be viewed as rows of a table. The columns store covariate names, term numbers, subterms types, and a starting point. For more complex regression most parameters can be also be set to remain constant over the regression. Doing so forces the parameter value to remain constant but not remove the element from the risk calculations. Constant elements are used for risk calculation, but are not used for calculating steps or standard deviations. For a multiplicative excess risk model the following table and equation are equivalent. | Term number | subterm type | covariate| | :---------: | :----------: | :-------: | | 0 | LogLinear | a | | 1 | Linear | b | | 1 | Linear | c |

$$ \begin{aligned} R = \exp{(\beta_{a} \cdot x_{a})} \times (1+\beta_{b} \cdot x_{b} + \beta_{c} \cdot x_{c}) \end{aligned} $$

If the user wanted to update the model to include a new term with a product-linear subterm, then the table would only need to be updated with a new row. | Term number | subterm type | covariate | | :---------: | :----------: | :-------: | | 0 | LogLinear | a | | 1 | Linear | b | | 1 | Linear | c | | 2 | Product-Linear | d |

$$ \begin{aligned} R = \exp{(\beta_{a} \cdot x_{a})} \times (1 + (\beta_{b} \cdot x_{b} + \beta_{c} \cdot x_{c})) \times (2 + \beta_{d} \cdot x_{d}) \end{aligned} $$

Note that the multiplicative model is taking the product of terms, each of which is the product of subterms. The same equation may be written with subterms distributed over different terms in the multiplicative model without changing the final value. The only exception is subterms moved from or into the "first" term in the multiplicative excess model. The same equation could be expressed in ways with different computational complexity. The specific model options are passed to the code as separate variables. The model identifier (A, M, PA, PAE, GMIX) is passed as a string. Every model except the additive model has a distinction between the first and remaining terms. Colossus defaults to term 0 being the unique term, however it can be set to any used term number without producing an error. The final table is also equivalent to the following code ```{r} names <- c("a", "b", "c", "d") term_n <- c(0, 1, 1, 2) tform <- c("loglin", "lin", "lin", "plin") modelform <- "M" fir <- 0 a_n <- c(0.1, 0.1, 0.1, 0.1) ``` *names* denotes the column names used, *term_n* denotes the term numbers, *tform* denotes the subterm formula, *modelform* denotes the term formula, *fir* denotes the "first" term indexed from zero which defaults to 0, and *a_n* denotes the initial guesses for the parameter values. All of which are assumed to be in the same order. A function is called prior to regression that reorders the inputs in order of terms, subterm types, etc. ## Survival Time and Event Data Colossus performs survival analysis via either a Cox Proportional Hazards or a Poisson model regression. In both cases the user specifies which columns contain time duration and events of interest. For the Poisson model regression these would be which column contains the person-years and number of events for each row of data. For the Cox Proportional Hazards regression the user identifies which columns provide starting and ending times, and what column gives the event status. The event status is assumed to be a binary covariate which is 1 for intervals containing an event. Colossus supports left censored data, right censored data, and interval censored data. The data is interval censored by default, so left and right censoring are handled by defining interval endpoints outside the minimum or maximum event times. For Poisson model regression the user only provides columns for the person-years per row and number of events. Poisson model regression supports any non-negative number of events. On the user side, the names of columns containing the time and events need to be given. Lets assume we have a dataframe organized as follows: | UserID | Starting_Age | Ending_Age | Cancer_Status | | :---------: | :----------: | :-------: | :-------: | | 112 | 18 | 30 | 0 | | 114 | 20 | 45 | 0 | | 213 | 18 | 57 | 1 | | ... | ... | ... | ... | For a Cox Proportional Hazard we need to provide three column names: the starting age, the ending age, and if an event happened during the interval. In this case that would be "Starting_Age", "Ending_Age", and "Cancer_Status". In this example we have interval data, however that may not always be the case. Colossus is designed to allow for the user to input "%trunc%" in-place of a missing side of the interval. Colossus assumes that this means that the person has the missing endpoint outside of the available data range and creates a dummy column to reference as the interval endpoint. So in code these variables would look like this: ```{r} df <- data.table( "UserID" = c(112, 114, 213, 214, 115, 116, 117), "Starting_Age" = c(18, 20, 18, 19, 21, 20, 18), "Ending_Age" = c(30, 45, 57, 47, 36, 60, 55), "Cancer_Status" = c(0, 0, 1, 0, 1, 0, 0), "a" = c(0, 1, 1, 0, 1, 0, 1), "b" = c(1, 1.1, 2.1, 2, 0.1, 1, 0.2), "c" = c(10, 11, 10, 11, 12, 9, 11), "d" = c(0, 0, 0, 1, 1, 1, 1) ) # For the interval case time1 <- "Starting_Age" time2 <- "Ending_Age" event <- "Cancer_Status" # Supposing we had left truncated data the following would change time1 <- "Starting_Age" time2 <- "%trunc%" # and with right truncated data the following is used time1 <- "%trunc%" time2 <- "Ending_Age" # setting back to normal time1 <- "Starting_Age" time2 <- "Ending_Age" ``` Suppose instead we were interested in a Poisson model regression, the only difference is instead of providing interval endpoints we give a column of interval lengths. In this case we need to define a new column, but that may not be the case in general. Suppose we are now working with the following table: | UserID | Starting_Age | Ending_Age | Person_Years | Cancer_Status | | :---------: | :----------: | :-------: | :-------: | :-------: | | 112 | 18 | 30 | 12 | 0 | | 114 | 20 | 45 | 25 | 0 | | 213 | 18 | 57 | 39 | 1 | | ... | ... | ... | ... | ... | we would just provide the duration column and event column, or "Person_Years" and "Cancer_Status". Which in code looks like the following: ```{r} df$Person_Years <- df$Ending_Age - df$Starting_Age pyr <- "Person_Years" event <- "Cancer_Status" ``` ## Control and Verbosity Colossus offers several options that control how the input data is used and returned. This can be split into two general categories. The first category is convergence parameters. These cover the number of iterations used, limits on parameter changes, and stopping criteria. The second category is parameters used for debugging and additional information. A more common option is the verbosity. The verbose mode prints time-steps and intermediate values for the sum of terms, sum of risks, and log-likelihood each iteration. A less common option is the *der_iden* parameter, which is used change only one parameter at a time by a set amount. This was used in the debugging process for recording intermediate values on a plane. This option could be used by the user in conjunction with the verbose option to plot intermediate values by parameter value. The convergence parameters currently supported in general are as follows: | Parameter | Description | | :---------: | :----------: | | lr | learning rate applied to steps taken, smaller values avoid overshooting at the risk of slower convergence. Applied to the solution for setting first derivative of Log-Likelihood to zero | | maxiter | maximum number of iterations run | | maxiters | list of maximum iterations if a list of initial values is given | | halfmax | Maximum number of half-steps taken each iteration | | epsilon | Minimum parameter change above convergence. If change is below this, the regression exits | | dbeta_max | learning rate applied to limit on steps taken. Applied to solution to zero Log-Likelihood, for double_step=0| | deriv_epsilon | Minimum Log-Likelihood first derivative above convergence. If the largest first derivative is below this, the regression exits | | abs_max | Largest starting step size. Code reduces it every time halfmax is hit without finding an improvement. When abs_max is below epsilon, the code exits | | dose_abs_max | Largest starting step size for threshold parameters. Assumes the threshold parameters may be orders of magnitude different than other parameters | | ties | Allows for efron or breslow tie methods | | double_step | Changes what step calculation method is used. double_step=0 changes each parameter by the ratio of first to second derivative independently. double_step=1 solves the system of equations for all first and second derivatives | The additional information parameters currently supported are as follows: | Parameter | Description | | :---------: | :----------: | | verbose | integer valued 0-4 controlling what information is printed to the terminal. Each level includes the lower levels. 0: silent, 1: errors printed, 2: warnings printed, 3: notes printed, 4: debug information printed. Errors are situations that stop the regression, warnings are situations that assume default values that the user might not have intended, notes provide information on regression progress, and debug prints out C++ progress and intermediate results. The default level is 2 and True/False is converted to 3/0.| | change_all | TRUE/FALSE flag to choose if the user wants a regression or to calculate log-likelihoods on a grid changing only one parameter at a time | | ncores | Number of physical cores assigned, cannot be greater than the number of cores available | The previous options are all contained within a list of parameters, however there are additional standalone parameters which control the convergence and information. They are as follows: | Parameter | Description | | :---------: | :----------: | | der_iden | If change_all is used, selects which parameter number indexed from 0 to change. The parameter is changed by abs_max or dose_abs_max every iteration over the maximum number of iterations| | keep_constant | a vector of 0/1 to denote if any parameters should be held constant over the calculations. 0 denoting false and 1 denoting true | Going back to our example, the code may look like the following: ```{r} keep_constant <- c(0, 0, 0, 0) der_iden <- 0 control <- list( "ncores" = 1, "lr" = 0.75, "maxiter" = 100, "halfmax" = 5, "epsilon" = 1e-9, "dbeta_max" = 0.5, "deriv_epsilon" = 1e-9, "abs_max" = 1.0, "change_all" = TRUE, "dose_abs_max" = 100.0, "verbose" = FALSE, "ties" = "breslow", "double_step" = 1 ) ``` In this example, no parameters are held constant, 100 iterations with 5 half-steps are used, a regression is performed with matrix step solutions, and breslow's method for tied event times is used. There are no errors for unused control parameters being used, they will simply be ignored. Every commonly required control parameter has a default value which is automatically used if not declared. This includes an optional list of model control options. These are used for nonstandard model options not covered in this vignette. ## Running the Regression and Output Finally the user calls the regression function they need. The names used throughout this vignette are the defaults assumed. Both the Cox PH regression and Poisson model regression are called directly and return a list of results. Colossus contains a suite of additional checks it runs on the inputs prior to starting and calculations, which are designed to output explicit details on any issues. Printing error details may require the verbose option to be TRUE. In code the functions are called as follows: ```{r} # assuming the table of covariates is stored in a data.table "df" e <- RunCoxRegression( df, time1, time2, event, names, term_n, tform, keep_constant, a_n, modelform, fir, der_iden, control ) print(e) # or a Poisson model regression a_n <- c(0.1, 0.1, 0.1, 0.1) # a_n is updated when either regression is called e <- RunPoissonRegression( df, pyr, event, names, term_n, tform, keep_constant, a_n, modelform, fir, der_iden, control ) print(e) ``` The following are output: | Item | Description | | :---------: | :----------: | | LogLik | Final Log-Likelihood | | First_Der | First derivative of Log-Likelihood at the final point | | Second_Der | Matrix of second derivatives of Log-Likelihood at the final point | | beta_0 | The final parameter list | | Standard_Deviation | Inverse-Hessian estimate of the standard deviation | | AIC | Akaike information criterion for the final solution | | Deviation | Deviance calculated using the difference in log-likelihood contributions between the saturated model and current model | | Parameter_Lists | the term numbers and term formulas used, to compare against the expected | | Control_List | a list containing the number of iterations, the maximum step at the final iteration, and the highest first derivative magnitude | |Converged | TRUE/FALSE if the regression converged | For both cases, the regressions did not converge. Given that the examples were arbitrary this is not unexpected. The Cox PH ran out of iterations and did not decrease the step limit, so if it were restarted at the new optimal guess it would be expected to continue converging. The Poisson model regression exited due to the step limit being below the threshold, which means that it hit some local optimum. To find a better solution the user would likely have to either change the model equation or provide a new starting guess. This is an example of how to interpret the results of a Colossus run. There are additional variants of these functions which will be described outside of this section. They return a similar list of results.