The psc.R package implements the methods for applying Personalised Synthetic Controls, which allows for patients receiving some experimental treatment to be compared against a model which predicts their reponse to some control. This is a form of causal inference which differes from othere approaches in that
Data are only required on a single treatment - all counterfactual evidence is supplied by a parametric modelCausal inference, in theory at least, is estimated at a patient level - as opposed to estimating average effects over a population
In its basic form, this method creates a likelihood to compare a cohort of data to a parametric model. See (X) for disucssion on it’s use as a causal inference tool. To use this package, two basic peices of information are required, a dataset and a model against which they can be compared.
In this vignette, we will detail how the psc.r package is constructed and give some examples for it’s application in practice.
To load the library use the code
The pscfit
function compares a dataset (‘DC’) against a
parametric model. This is done by selecting a likelihood which is
identified by the type of CFM that is supplied. At present, two types of
model are supported, a flexible parmaeteric survival model of type
‘flexsurvreg’ and a geleneralised linear model of type ‘glm’.
Where the CFM is of type ‘flexsurvreg’ the likeihood supplied is of the form:
$$L(D∣\Lambda,\Gamma_i)=\prod_{i=1}^{n} f(t_i∣\Lambda,\Gamma_i)^{c_i} S(t_i∣\Gamma,\Lambda_i)^{(1−c_i)}$$ Where Γ defines the cumulative baseline hazard function, Λ is the linear predictor and t and c are the event time and indicator variables.
Where the CFM is of the type ‘glm’ the likelihood supplied is of the form:
$$L(x∣\Gamma_i) = \prod_{i=1}^{n} b (x∣ \Gamma_i )\exp\{\Gamma_i t(x)−c(\Gamma_i)\}$$
Where b(.), t(.) and c(.) represent the functions of the exponential family. In both cases, Γ is defiend as:
Γi = γxi + β
Where γ are the model coefficients supplied by the CFM and β is the parameter set to measure the difference between the CFM and the DC.
Estimation is performed using a Bayesian MCMC procedure. Prior distributions for Γ (& Λ) are derived directly from the model coefficients (mean and variance covariance matrix) or the CFM. A bespoke MCMC routine is performed to estimate β. Please see ‘?mcmc’ for more detials.
For the standard example where the DC contains information from only a single treatment, trt need not be specified. Where comparisons between the CFM and multiple treatments are require, a covariate of treamtne allocations must be specified sperately (using the ‘trt’ option).
The main function for using applying Personal Synthetic Controls is the pscfit() function which has two inputs, a Counter-Factual Model (CFM) and a data cohort (DC). Further arguments include
The output of the “pscfit()” function is an object of class ‘psc’. This class contains the following attributes
basic post estimation functions have been developed to work with the psc object, namely “print()”, “coef()”, “summary()” and “plot()”. For the first three of these these provided basic summaries of the efficacy parameter obtained from the posterior distribution.
The psc.r package includes as example, a dataset which is derived from patients with advanced Hepatocellular Carcinoma (aHCC) who have all received some experimental treatment. The dataset is simply named ‘data’ and is loaded into the enviroment using the “data()” function
Included is a list of prognostic covariates:
Also included are the following structures
Lastly the dataset also inlclude a ‘trt’ variable to be used in the estimation of multiple treatment comparisons.
For an example with a survival outcome a model must be supplied which is contructed ont he basis of flexible parametric splines. This is contructed using the “flexsurvreg” function within the “flexsurv” package. An example is included within the ‘psc.r’ package names ‘surv.mod’ and is loaded using the ’data()” function:
surv.mod <- psc::surv.mod
surv.mod
#> Call:
#> flexsurvspline(formula = Surv(time, cen) ~ vi/age60 + ecog +
#> allmets + logafp + alb + logcreat + logast + aet, data = pros,
#> k = 3)
#>
#> Estimates:
#> data mean est L95% U95% se exp(est)
#> gamma0 NA -7.57573 -9.90310 -5.24836 1.18746 NA
#> gamma1 NA 2.23506 1.37951 3.09060 0.43651 NA
#> gamma2 NA -0.13758 -0.74322 0.46807 0.30901 NA
#> gamma3 NA 0.26981 -0.72101 1.26063 0.50553 NA
#> gamma4 NA -0.03031 -0.60738 0.54676 0.29443 NA
#> viyes 0.28400 0.32856 0.08489 0.57224 0.12433 1.38897
#> ecog1 0.40400 0.45602 0.23429 0.67776 0.11313 1.57779
#> allmetsyes 0.69400 0.29767 0.03666 0.55869 0.13317 1.34672
#> logafp 5.42436 0.08320 0.05038 0.11602 0.01674 1.08676
#> alb 38.55600 -0.05539 -0.07809 -0.03269 0.01158 0.94612
#> logcreat 4.30929 0.71084 0.26457 1.15712 0.22770 2.03571
#> logast 4.08274 0.35000 0.15915 0.54086 0.09738 1.41907
#> aetHCV 0.21200 -0.52754 -0.86299 -0.19210 0.17115 0.59005
#> aetOther 0.37000 -0.01847 -0.26319 0.22624 0.12486 0.98170
#> vino:age60 0.06400 -0.02312 -0.03479 -0.01144 0.00596 0.97715
#> viyes:age60 -0.37000 0.00716 -0.00790 0.02221 0.00768 1.00718
#> L95% U95%
#> gamma0 NA NA
#> gamma1 NA NA
#> gamma2 NA NA
#> gamma3 NA NA
#> gamma4 NA NA
#> viyes 1.08859 1.77223
#> ecog1 1.26401 1.96946
#> allmetsyes 1.03734 1.74838
#> logafp 1.05167 1.12301
#> alb 0.92488 0.96784
#> logcreat 1.30286 3.18076
#> logast 1.17251 1.71748
#> aetHCV 0.42190 0.82523
#> aetOther 0.76860 1.25388
#> vino:age60 0.96581 0.98862
#> viyes:age60 0.99214 1.02246
#>
#> N = 500, Events: 360, Censored: 140
#> Total time at risk: 5357.664
#> Log-likelihood = -1221.857, df = 16
#> AIC = 2475.714
In this example you can see that this is a model constructed with 3 internal knots and hence 5 parameters to describe the baseline cumulative hazard function. There are also prognostic covariates which match with the prognostic covaraites in the data cohort.
To begin it is worth looking at the performance of the model and looking at how the survival of patietns in the data cohort compare
Comparing the dataset to the model is then performed using
and we can view the attributes of the psc object that is created
For example to view the matrix contianing the draws of the posterior distribution we use
surv.post <- surv.psc$posterior
head(surv.post)
#> gamma0 gamma1 gamma2 gamma3 gamma4 viyes ecog1
#> 1 -7.575727 2.2350572 -1.375763e-01 0.2698109 -0.03030718 0.3285628 0.4560236
#> 2 -6.177702 1.7131056 -4.586078e-01 0.8446338 -0.39464194 0.2492159 0.3448721
#> 3 -7.402628 2.4795860 3.186636e-01 -0.5318789 0.40622842 0.4342349 0.5524287
#> 4 -6.617513 1.9195851 -5.327126e-05 -0.1116571 0.25039831 0.4704164 0.4643693
#> 5 -7.620996 2.6976160 -2.249089e-03 0.3454113 -0.30410438 0.2206927 0.4994067
#> 6 -8.437812 0.6630725 -1.228197e+00 1.7734664 -0.61134172 0.3292307 0.5002040
#> allmetsyes logafp alb logcreat logast aetHCV aetOther
#> 1 0.2976742 0.08319787 -0.05538982 0.7108430 0.3500019 -0.5275438 -0.01847276
#> 2 0.4634186 0.07809508 -0.07068141 0.5898201 0.4036147 -0.5273226 -0.25076036
#> 3 0.2497853 0.07633782 -0.07194850 0.8548329 0.3468265 -0.6925098 -0.24006816
#> 4 0.1705090 0.08127646 -0.05179644 0.5226085 0.3599905 -0.5909674 0.04883408
#> 5 0.2298813 0.11168570 -0.07046628 0.6643340 0.5081121 -0.5829386 -0.16984486
#> 6 0.2903587 0.09968702 -0.04063800 0.7996067 0.4447975 -0.6538008 0.06704888
#> vino:age60 viyes:age60 beta DIC
#> 1 -0.02311823 0.0071560470 0.3784481 NA
#> 2 -0.01954169 0.0091924994 0.3784481 283.8779
#> 3 -0.01908248 0.0068133893 0.3784481 279.2638
#> 4 -0.02378719 0.0186442595 0.3784481 275.3187
#> 5 -0.01868482 0.0002189175 0.3784481 287.9250
#> 6 -0.01061563 0.0111369628 0.3784481 289.0717
Inspection will show that there is a column for each parameter in the original model as well as ‘beta’ and ‘DIC’ vcolumns which give teh posterior estiamtes for β and the Deviance Informaiton Criterion respectively.
We can inspect the poterior distribution using the autocorrelation function, trace and stardard summary statistics:
summary(surv.post$beta)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -0.09211 0.27602 0.34769 0.35246 0.43044 0.78634
Alternatively the standard post estimation functions can be used
print(surv.psc)
#> Counterfactual Model (CFM):
#> A model of class 'flexsurvreg'
#> Fit with 3 internal knots
#>
#> Formula:
#> Surv(time, cen) ~ vi/age60 + ecog + allmets + logafp + alb +
#> logcreat + logast + aet
#> <environment: 0x564ceebc6c30>
#>
#> Call:
#> CFM model + beta
#>
#> Coefficients:
#> median 2.5% 97.5% Pr(x<0) Pr(x>0)
#> beta 0.3477 0.1100 0.5986 0.0022 0.9978
#> DIC 280.8061 273.6206 292.5939 NA NA
coef(surv.psc)
#> median 2.5% 97.5% Pr(x<0) Pr(x>0)
#> beta 0.3476866 0.1100292 0.5986241 0.0022 0.9978
#> DIC 280.8061186 273.6206400 292.5938952 NA NA
summary(surv.psc)
#> Summary:
#>
#> 100 observations selected from the data cohort for comparison
#> CFM of type flexsurvreg identified
#> linear predictor succesfully obtained with a median of 3.15
#> Average expected response: 9.1
#> Average observed response: 6.366
#>
#> Counterfactual Model (CFM):
#> A model of class 'flexsurvreg'
#> Fit with 3 internal knots
#>
#> Formula:
#> Surv(time, cen) ~ vi/age60 + ecog + allmets + logafp + alb +
#> logcreat + logast + aet
#> <environment: 0x564ceebc6c30>
#>
#> Call:
#> CFM model + beta
#>
#> Coefficients:
#> median 2.5% 97.5% Pr(x<0) Pr(x>0)
#> beta 0.3477 0.1100 0.5986 0.0022 0.9978
#> DIC 280.8061 273.6206 292.5939 NA NA
To visualise the original model and the fit of the data, the plot function has been developed
The “pscfit()” object uses class of the model supplied to derive the likelihoiod and estimation procedure that is required. In this example, the “enrichwith” package is utilised to extract from the model the parameters of the exponential family. Important from the attributes of the GLM are the “family” statements which dictates the form of the likelihood. For each of the binary, continuous and Count data outcomes then, the syntax and the DC remains the same and it is the form of the CFM that dictates the analysis
Step 1: Load Model
bin.mod <- psc::bin.mod
bin.mod
#>
#> Call: glm(formula = event ~ vi/age60 + ecog + allmets + logafp + alb +
#> logcreat, family = "binomial", data = pros)
#>
#> Coefficients:
#> (Intercept) viyes ecog1 allmetsyes logafp alb
#> -1.65694 0.80007 0.38279 0.40733 0.12941 -0.04993
#> logcreat vino:age60 viyes:age60
#> 0.76346 -0.01901 0.01380
#>
#> Degrees of Freedom: 570 Total (i.e. Null); 562 Residual
#> (17 observations deleted due to missingness)
#> Null Deviance: 683
#> Residual Deviance: 616 AIC: 634
Step 2: Fit psc object
Step 3: Review summary statistics
summary(psc.bin)
#> Summary:
#>
#> 100 observations selected from the data cohort for comparison
#> CFM of type glm lm identified
#> linear predictor succesfully obtained with a median of 1.2
#> Average expected response: 0.755
#> Average observed response: 0.48
#>
#> Counterfactual Model (CFM):
#> A model of class 'GLM'
#> Family: binomial
#> Link: logit
#>
#> Formula:
#> event ~ vi/age60 + ecog + allmets + logafp + alb + logcreat
#>
#> Call:
#> CFM model + beta
#>
#> Coefficients:
#> median 2.5% 97.5% Pr(x<0) Pr(x>0)
#> beta -1.2372 -1.6748 -0.7533 1.0000 0.0000
#> DIC 56.7757 52.7096 64.5727 NA NA
Step 4: Plot Output
Step 1: Load Model
count.mod <- psc::count.mod
count.mod
#>
#> Call: glm(formula = count ~ ecog + logafp + alb + logcreat, family = "poisson",
#> data = data)
#>
#> Coefficients:
#> (Intercept) ecog logafp alb logcreat
#> 1.414244 -0.170884 0.017803 -0.007255 -0.056183
#>
#> Degrees of Freedom: 99 Total (i.e. Null); 95 Residual
#> Null Deviance: 127.2
#> Residual Deviance: 125.1 AIC: 382.1
#data("count.mod")
Step 2: Fit psc object
Step 3: Review summary statistics
summary(psc.count)
#> Summary:
#>
#> 100 observations selected from the data cohort for comparison
#> CFM of type glm lm identified
#> linear predictor succesfully obtained with a median of 0.919
#> Average expected response: 2.49
#> Average observed response: 2.49
#>
#> Counterfactual Model (CFM):
#> A model of class 'GLM'
#> Family: poisson
#> Link: log
#>
#> Formula:
#> count ~ ecog + logafp + alb + logcreat
#>
#> Call:
#> CFM model + beta
#>
#> Coefficients:
#> median 2.5% 97.5% Pr(x<0) Pr(x>0)
#> beta -0.01569 -0.19945 0.18444 0.56500 0.43500
#> DIC 14.24813 10.24669 24.58284 NA NA
Step 4: Plot Output
Step 1: Load Model
cont.mod <- psc::cont.mod
cont.mod
#>
#> Call: glm(formula = os ~ vi/age60 + ecog + allmets + logafp + alb,
#> data = pros)
#>
#> Coefficients:
#> (Intercept) viyes ecog1 allmetsyes logafp alb
#> 2.16581 -2.57037 -2.97107 -2.13909 -0.37175 0.36754
#> vino:age60 viyes:age60
#> 0.09174 0.02106
#>
#> Degrees of Freedom: 570 Total (i.e. Null); 563 Residual
#> (17 observations deleted due to missingness)
#> Null Deviance: 33570
#> Residual Deviance: 25300 AIC: 3803
#data("cont.mod")
Step 2: Fit psc object
Step 3: Review summary statistics
summary(psc.con)
#> Summary:
#>
#> 100 observations selected from the data cohort for comparison
#> CFM of type glm lm identified
#> linear predictor succesfully obtained with a median of 10.8
#> Average expected response: 10.623
#> Average observed response: 10.743
#>
#> Counterfactual Model (CFM):
#> A model of class 'GLM'
#> Family: gaussian
#> Link: identity
#>
#> Formula:
#> os ~ vi/age60 + ecog + allmets + logafp + alb
#>
#> Call:
#> CFM model + beta
#>
#> Coefficients:
#> median 2.5% 97.5% Pr(x<0) Pr(x>0)
#> beta 0.1267 -0.7112 0.8440 0.3648 0.6352
#> DIC -6797.2767 -6893.0243 -6663.0317 NA NA
Step 4: Plot Output
An attractive feature of Personalised Synthetic Controls is its use in fitting sub-group effects. Whereas other casual inference tools typically make assumptions about population levels of balance and then further assume that this balance holds at sub-group levels, Personalised Synthetic Controls differ in that they estimate treatment effects at a patient level and then average across populations. To estimate sub-group effects then we need only to restrict estimation over some sub-group of the population. This can be achived in 2 ways;
Using an example where we wnat to see if the treatment effect is consistent by patients with ECOG=0 and ECOG =1
summary(sub1)
#> Summary:
#>
#> 53 observations selected from the data cohort for comparison
#> CFM of type flexsurvreg identified
#> linear predictor succesfully obtained with a median of 2.84
#> Average expected response: 12.605
#> Average observed response: 8.135
#>
#> Counterfactual Model (CFM):
#> A model of class 'flexsurvreg'
#> Fit with 3 internal knots
#>
#> Formula:
#> Surv(time, cen) ~ vi/age60 + ecog + allmets + logafp + alb +
#> logcreat + logast + aet
#> <environment: 0x564cecdcd180>
#>
#> Call:
#> CFM model + beta
#>
#> Coefficients:
#> median 2.5% 97.5% Pr(x<0) Pr(x>0)
#> beta 0.2961 -0.0625 0.6094 0.0566 0.9434
#> DIC 150.6569 145.0069 159.9764 NA NA
summary(sub2)
#> Summary:
#>
#> 47 observations selected from the data cohort for comparison
#> CFM of type flexsurvreg identified
#> linear predictor succesfully obtained with a median of 3.423
#> Average expected response: 8.099
#> Average observed response: 5.856
#>
#> Counterfactual Model (CFM):
#> A model of class 'flexsurvreg'
#> Fit with 3 internal knots
#>
#> Formula:
#> Surv(time, cen) ~ vi/age60 + ecog + allmets + logafp + alb +
#> logcreat + logast + aet
#> <environment: 0x564cecfdddb0>
#>
#> Call:
#> CFM model + beta
#>
#> Coefficients:
#> median 2.5% 97.5% Pr(x<0) Pr(x>0)
#> beta 0.42798 0.01542 0.77574 0.02100 0.97900
#> DIC 100.92748 95.14362 110.03049 NA NA