psc-vignette

Introduction

The psc.R package implements the methods for applying Personalised Synthetic Controls, which allows for patients receiving some experimental treatment to be compared against a model which predicts their reponse to some control. This is a form of causal inference which differes from othere approaches in that

Data are only required on a single treatment - all counterfactual evidence is supplied by a parametric model

Causal inference, in theory at least, is estimated at a patient level - as opposed to estimating average effects over a population

In its basic form, this method creates a likelihood to compare a cohort of data to a parametric model. See (X) for disucssion on it’s use as a causal inference tool. To use this package, two basic peices of information are required, a dataset and a model against which they can be compared.

In this vignette, we will detail how the psc.r package is constructed and give some examples for it’s application in practice.

To load the library use the code

Methodology

The pscfit function compares a dataset (‘DC’) against a parametric model. This is done by selecting a likelihood which is identified by the type of CFM that is supplied. At present, two types of model are supported, a flexible parmaeteric survival model of type ‘flexsurvreg’ and a geleneralised linear model of type ‘glm’.

Where the CFM is of type ‘flexsurvreg’ the likeihood supplied is of the form:

$$L(D∣\Lambda,\Gamma_i)=\prod_{i=1}^{n} f(t_i∣\Lambda,\Gamma_i)^{c_i} S(t_i∣\Gamma,\Lambda_i)^{(1−c_i)}$$ Where Γ defines the cumulative baseline hazard function, Λ is the linear predictor and t and c are the event time and indicator variables.

Where the CFM is of the type ‘glm’ the likelihood supplied is of the form:

$$L(x∣\Gamma_i) = \prod_{i=1}^{n} b (x∣ \Gamma_i )\exp\{\Gamma_i t(x)−c(\Gamma_i)\}$$

Where b(.), t(.) and c(.) represent the functions of the exponential family. In both cases, Γ is defiend as:

Γi = γxi + β

Where γ are the model coefficients supplied by the CFM and β is the parameter set to measure the difference between the CFM and the DC.

Estimation is performed using a Bayesian MCMC procedure. Prior distributions for Γ (& Λ) are derived directly from the model coefficients (mean and variance covariance matrix) or the CFM. A bespoke MCMC routine is performed to estimate β. Please see ‘?mcmc’ for more detials.

For the standard example where the DC contains information from only a single treatment, trt need not be specified. Where comparisons between the CFM and multiple treatments are require, a covariate of treamtne allocations must be specified sperately (using the ‘trt’ option).

Package Structure

The main function for using applying Personal Synthetic Controls is the pscfit() function which has two inputs, a Counter-Factual Model (CFM) and a data cohort (DC). Further arguments include

  • nsim which sets the number of MCMC iterations (defaults to 5000)
  • ‘id’ if the user wishes to restrict estimation to a sub-set (or individual) within the DC
  • ‘trt’ to be used as an initial identifier if mulitple treatment comparisons are to be made (please see the Mulitple Treatment Comparison below)

psc object

The output of the “pscfit()” function is an object of class ‘psc’. This class contains the following attributes

  • A ‘cleaned’ dataset including extracted components of the CFM and the cleaned DC included in the procedure
  • An object defingin the class of model (and therefore the procedure applied - see above)
  • A matrix containing the draws of the posterior distributions

Postestimation functions

basic post estimation functions have been developed to work with the psc object, namely “print()”, “coef()”, “summary()” and “plot()”. For the first three of these these provided basic summaries of the efficacy parameter obtained from the posterior distribution.

Data

The psc.r package includes as example, a dataset which is derived from patients with advanced Hepatocellular Carcinoma (aHCC) who have all received some experimental treatment. The dataset is simply named ‘data’ and is loaded into the enviroment using the “data()” function

data <- psc::data

Included is a list of prognostic covariates:

  • vi - Vascular Invasion
  • age60 - Age - centered at 60
  • ecog - ECOG performance Status
  • logafp - AFP on the natural log scale
  • alb - Albumin
  • logcreat - Creatinine on the log scale
  • logast - AST on the natuarl log scale
  • allmets -Presence of Metastesis
  • aet - Ateiology; HBV, HCV or Other

Also included are the following structures

  • time - survival time
  • cen - censoring indictor
  • os - time to be used as a continuous outcome
  • event - binary event for use as a binary outcome
  • count - count data to be used for a count outcome

Lastly the dataset also inlclude a ‘trt’ variable to be used in the estimation of multiple treatment comparisons.

Examples

Survival

For an example with a survival outcome a model must be supplied which is contructed ont he basis of flexible parametric splines. This is contructed using the “flexsurvreg” function within the “flexsurv” package. An example is included within the ‘psc.r’ package names ‘surv.mod’ and is loaded using the ’data()” function:

surv.mod <- psc::surv.mod
surv.mod
#> Call:
#> flexsurvspline(formula = Surv(time, cen) ~ vi/age60 + ecog + 
#>     allmets + logafp + alb + logcreat + logast + aet, data = pros, 
#>     k = 3)
#> 
#> Estimates: 
#>              data mean  est       L95%      U95%      se        exp(est)
#> gamma0             NA   -7.57573  -9.90310  -5.24836   1.18746        NA
#> gamma1             NA    2.23506   1.37951   3.09060   0.43651        NA
#> gamma2             NA   -0.13758  -0.74322   0.46807   0.30901        NA
#> gamma3             NA    0.26981  -0.72101   1.26063   0.50553        NA
#> gamma4             NA   -0.03031  -0.60738   0.54676   0.29443        NA
#> viyes         0.28400    0.32856   0.08489   0.57224   0.12433   1.38897
#> ecog1         0.40400    0.45602   0.23429   0.67776   0.11313   1.57779
#> allmetsyes    0.69400    0.29767   0.03666   0.55869   0.13317   1.34672
#> logafp        5.42436    0.08320   0.05038   0.11602   0.01674   1.08676
#> alb          38.55600   -0.05539  -0.07809  -0.03269   0.01158   0.94612
#> logcreat      4.30929    0.71084   0.26457   1.15712   0.22770   2.03571
#> logast        4.08274    0.35000   0.15915   0.54086   0.09738   1.41907
#> aetHCV        0.21200   -0.52754  -0.86299  -0.19210   0.17115   0.59005
#> aetOther      0.37000   -0.01847  -0.26319   0.22624   0.12486   0.98170
#> vino:age60    0.06400   -0.02312  -0.03479  -0.01144   0.00596   0.97715
#> viyes:age60  -0.37000    0.00716  -0.00790   0.02221   0.00768   1.00718
#>              L95%      U95%    
#> gamma0             NA        NA
#> gamma1             NA        NA
#> gamma2             NA        NA
#> gamma3             NA        NA
#> gamma4             NA        NA
#> viyes         1.08859   1.77223
#> ecog1         1.26401   1.96946
#> allmetsyes    1.03734   1.74838
#> logafp        1.05167   1.12301
#> alb           0.92488   0.96784
#> logcreat      1.30286   3.18076
#> logast        1.17251   1.71748
#> aetHCV        0.42190   0.82523
#> aetOther      0.76860   1.25388
#> vino:age60    0.96581   0.98862
#> viyes:age60   0.99214   1.02246
#> 
#> N = 500,  Events: 360,  Censored: 140
#> Total time at risk: 5357.664
#> Log-likelihood = -1221.857, df = 16
#> AIC = 2475.714

In this example you can see that this is a model constructed with 3 internal knots and hence 5 parameters to describe the baseline cumulative hazard function. There are also prognostic covariates which match with the prognostic covaraites in the data cohort.

To begin it is worth looking at the performance of the model and looking at how the survival of patietns in the data cohort compare

sfit <- survfit(Surv(data$time,data$cen)~1)
plot(sfit)

Comparing the dataset to the model is then performed using

and we can view the attributes of the psc object that is created

attributes(surv.psc)
#> $names
#> [1] "model.type" "DC_clean"   "posterior" 
#> 
#> $class
#> [1] "psc"

For example to view the matrix contianing the draws of the posterior distribution we use

surv.post <- surv.psc$posterior
head(surv.post)
#>      gamma0    gamma1        gamma2     gamma3      gamma4     viyes     ecog1
#> 1 -7.575727 2.2350572 -1.375763e-01  0.2698109 -0.03030718 0.3285628 0.4560236
#> 2 -6.177702 1.7131056 -4.586078e-01  0.8446338 -0.39464194 0.2492159 0.3448721
#> 3 -7.402628 2.4795860  3.186636e-01 -0.5318789  0.40622842 0.4342349 0.5524287
#> 4 -6.617513 1.9195851 -5.327126e-05 -0.1116571  0.25039831 0.4704164 0.4643693
#> 5 -7.620996 2.6976160 -2.249089e-03  0.3454113 -0.30410438 0.2206927 0.4994067
#> 6 -8.437812 0.6630725 -1.228197e+00  1.7734664 -0.61134172 0.3292307 0.5002040
#>   allmetsyes     logafp         alb  logcreat    logast     aetHCV    aetOther
#> 1  0.2976742 0.08319787 -0.05538982 0.7108430 0.3500019 -0.5275438 -0.01847276
#> 2  0.4634186 0.07809508 -0.07068141 0.5898201 0.4036147 -0.5273226 -0.25076036
#> 3  0.2497853 0.07633782 -0.07194850 0.8548329 0.3468265 -0.6925098 -0.24006816
#> 4  0.1705090 0.08127646 -0.05179644 0.5226085 0.3599905 -0.5909674  0.04883408
#> 5  0.2298813 0.11168570 -0.07046628 0.6643340 0.5081121 -0.5829386 -0.16984486
#> 6  0.2903587 0.09968702 -0.04063800 0.7996067 0.4447975 -0.6538008  0.06704888
#>    vino:age60  viyes:age60      beta      DIC
#> 1 -0.02311823 0.0071560470 0.3784481       NA
#> 2 -0.01954169 0.0091924994 0.3784481 283.8779
#> 3 -0.01908248 0.0068133893 0.3784481 279.2638
#> 4 -0.02378719 0.0186442595 0.3784481 275.3187
#> 5 -0.01868482 0.0002189175 0.3784481 287.9250
#> 6 -0.01061563 0.0111369628 0.3784481 289.0717

Inspection will show that there is a column for each parameter in the original model as well as ‘beta’ and ‘DIC’ vcolumns which give teh posterior estiamtes for β and the Deviance Informaiton Criterion respectively.

We can inspect the poterior distribution using the autocorrelation function, trace and stardard summary statistics:

Autocorrelation

acf(surv.post$beta)

Trace

plot(surv.post$beta,typ="s")

Summary

summary(surv.post$beta)
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -0.09211  0.27602  0.34769  0.35246  0.43044  0.78634

Alternatively the standard post estimation functions can be used

print(surv.psc)
#> Counterfactual Model (CFM): 
#> A model of class 'flexsurvreg' 
#>  Fit with 3 internal knots
#> 
#> Formula: 
#> Surv(time, cen) ~ vi/age60 + ecog + allmets + logafp + alb + 
#>     logcreat + logast + aet
#> <environment: 0x564ceebc6c30>
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median    2.5%      97.5%     Pr(x<0)   Pr(x>0) 
#> beta    0.3477    0.1100    0.5986    0.0022    0.9978
#> DIC   280.8061  273.6206  292.5939        NA        NA
coef(surv.psc)
#>           median        2.5%       97.5% Pr(x<0) Pr(x>0)
#> beta   0.3476866   0.1100292   0.5986241  0.0022  0.9978
#> DIC  280.8061186 273.6206400 292.5938952      NA      NA
summary(surv.psc)
#> Summary: 
#>  
#> 100 observations selected from the data cohort for comparison 
#> CFM of type flexsurvreg identified  
#> linear predictor succesfully obtained with a median of  3.15 
#> Average expected response: 9.1 
#> Average observed response: 6.366 
#> 
#> Counterfactual Model (CFM): 
#> A model of class 'flexsurvreg' 
#>  Fit with 3 internal knots
#> 
#> Formula: 
#> Surv(time, cen) ~ vi/age60 + ecog + allmets + logafp + alb + 
#>     logcreat + logast + aet
#> <environment: 0x564ceebc6c30>
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median    2.5%      97.5%     Pr(x<0)   Pr(x>0) 
#> beta    0.3477    0.1100    0.5986    0.0022    0.9978
#> DIC   280.8061  273.6206  292.5939        NA        NA

To visualise the original model and the fit of the data, the plot function has been developed

plot(surv.psc)

GLM

The “pscfit()” object uses class of the model supplied to derive the likelihoiod and estimation procedure that is required. In this example, the “enrichwith” package is utilised to extract from the model the parameters of the exponential family. Important from the attributes of the GLM are the “family” statements which dictates the form of the likelihood. For each of the binary, continuous and Count data outcomes then, the syntax and the DC remains the same and it is the form of the CFM that dictates the analysis

Binary

Step 1: Load Model

bin.mod <- psc::bin.mod
bin.mod
#> 
#> Call:  glm(formula = event ~ vi/age60 + ecog + allmets + logafp + alb + 
#>     logcreat, family = "binomial", data = pros)
#> 
#> Coefficients:
#> (Intercept)        viyes        ecog1   allmetsyes       logafp          alb  
#>    -1.65694      0.80007      0.38279      0.40733      0.12941     -0.04993  
#>    logcreat   vino:age60  viyes:age60  
#>     0.76346     -0.01901      0.01380  
#> 
#> Degrees of Freedom: 570 Total (i.e. Null);  562 Residual
#>   (17 observations deleted due to missingness)
#> Null Deviance:       683 
#> Residual Deviance: 616   AIC: 634

Step 2: Fit psc object

Step 3: Review summary statistics

summary(psc.bin)
#> Summary: 
#>  
#> 100 observations selected from the data cohort for comparison 
#> CFM of type glm lm identified  
#> linear predictor succesfully obtained with a median of  1.2 
#> Average expected response: 0.755 
#> Average observed response: 0.48 
#> 
#> Counterfactual Model (CFM): 
#> A model of class 'GLM' 
#>  Family: binomial 
#>  Link: logit 
#> 
#> Formula: 
#> event ~ vi/age60 + ecog + allmets + logafp + alb + logcreat
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median   2.5%     97.5%    Pr(x<0)  Pr(x>0)
#> beta  -1.2372  -1.6748  -0.7533   1.0000   0.0000
#> DIC   56.7757  52.7096  64.5727       NA       NA

Step 4: Plot Output

plot(psc.bin)

Count

Step 1: Load Model

count.mod <- psc::count.mod
count.mod
#> 
#> Call:  glm(formula = count ~ ecog + logafp + alb + logcreat, family = "poisson", 
#>     data = data)
#> 
#> Coefficients:
#> (Intercept)         ecog       logafp          alb     logcreat  
#>    1.414244    -0.170884     0.017803    -0.007255    -0.056183  
#> 
#> Degrees of Freedom: 99 Total (i.e. Null);  95 Residual
#> Null Deviance:       127.2 
#> Residual Deviance: 125.1     AIC: 382.1
#data("count.mod")

Step 2: Fit psc object

Step 3: Review summary statistics

summary(psc.count)
#> Summary: 
#>  
#> 100 observations selected from the data cohort for comparison 
#> CFM of type glm lm identified  
#> linear predictor succesfully obtained with a median of  0.919 
#> Average expected response: 2.49 
#> Average observed response: 2.49 
#> 
#> Counterfactual Model (CFM): 
#> A model of class 'GLM' 
#>  Family: poisson 
#>  Link: log 
#> 
#> Formula: 
#> count ~ ecog + logafp + alb + logcreat
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median    2.5%      97.5%     Pr(x<0)   Pr(x>0) 
#> beta  -0.01569  -0.19945   0.18444   0.56500   0.43500
#> DIC   14.24813  10.24669  24.58284        NA        NA

Step 4: Plot Output

plot(psc.count)

Continuous

Step 1: Load Model

cont.mod <- psc::cont.mod
cont.mod
#> 
#> Call:  glm(formula = os ~ vi/age60 + ecog + allmets + logafp + alb, 
#>     data = pros)
#> 
#> Coefficients:
#> (Intercept)        viyes        ecog1   allmetsyes       logafp          alb  
#>     2.16581     -2.57037     -2.97107     -2.13909     -0.37175      0.36754  
#>  vino:age60  viyes:age60  
#>     0.09174      0.02106  
#> 
#> Degrees of Freedom: 570 Total (i.e. Null);  563 Residual
#>   (17 observations deleted due to missingness)
#> Null Deviance:       33570 
#> Residual Deviance: 25300     AIC: 3803
#data("cont.mod")

Step 2: Fit psc object

Step 3: Review summary statistics

summary(psc.con)
#> Summary: 
#>  
#> 100 observations selected from the data cohort for comparison 
#> CFM of type glm lm identified  
#> linear predictor succesfully obtained with a median of  10.8 
#> Average expected response: 10.623 
#> Average observed response: 10.743 
#> 
#> Counterfactual Model (CFM): 
#> A model of class 'GLM' 
#>  Family: gaussian 
#>  Link: identity 
#> 
#> Formula: 
#> os ~ vi/age60 + ecog + allmets + logafp + alb
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median      2.5%        97.5%       Pr(x<0)     Pr(x>0)   
#> beta      0.1267     -0.7112      0.8440      0.3648      0.6352
#> DIC   -6797.2767  -6893.0243  -6663.0317          NA          NA

Step 4: Plot Output

plot(psc.con)

Sub group Effects

An attractive feature of Personalised Synthetic Controls is its use in fitting sub-group effects. Whereas other casual inference tools typically make assumptions about population levels of balance and then further assume that this balance holds at sub-group levels, Personalised Synthetic Controls differ in that they estimate treatment effects at a patient level and then average across populations. To estimate sub-group effects then we need only to restrict estimation over some sub-group of the population. This can be achived in 2 ways;

  • By directly slecting the subgroup you wish to evaluate

Using an example where we wnat to see if the treatment effect is consistent by patients with ECOG=0 and ECOG =1

Sub group effects by restricting the population

summary(sub1)
#> Summary: 
#>  
#> 53 observations selected from the data cohort for comparison 
#> CFM of type flexsurvreg identified  
#> linear predictor succesfully obtained with a median of  2.84 
#> Average expected response: 12.605 
#> Average observed response: 8.135 
#> 
#> Counterfactual Model (CFM): 
#> A model of class 'flexsurvreg' 
#>  Fit with 3 internal knots
#> 
#> Formula: 
#> Surv(time, cen) ~ vi/age60 + ecog + allmets + logafp + alb + 
#>     logcreat + logast + aet
#> <environment: 0x564cecdcd180>
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median    2.5%      97.5%     Pr(x<0)   Pr(x>0) 
#> beta    0.2961   -0.0625    0.6094    0.0566    0.9434
#> DIC   150.6569  145.0069  159.9764        NA        NA
summary(sub2)
#> Summary: 
#>  
#> 47 observations selected from the data cohort for comparison 
#> CFM of type flexsurvreg identified  
#> linear predictor succesfully obtained with a median of  3.423 
#> Average expected response: 8.099 
#> Average observed response: 5.856 
#> 
#> Counterfactual Model (CFM): 
#> A model of class 'flexsurvreg' 
#>  Fit with 3 internal knots
#> 
#> Formula: 
#> Surv(time, cen) ~ vi/age60 + ecog + allmets + logafp + alb + 
#>     logcreat + logast + aet
#> <environment: 0x564cecfdddb0>
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median     2.5%       97.5%      Pr(x<0)    Pr(x>0)  
#> beta    0.42798    0.01542    0.77574    0.02100    0.97900
#> DIC   100.92748   95.14362  110.03049         NA         NA