---
title: "Simulate Individual Data"
author: "Gabriele Pittarello"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Simulate Individual Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: '`r system.file("references.bib", package="ReSurv")`'
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ReSurv)
library(ggplot2)
```

# Introduction

In this vignette we show how to simulate the individual data used in the simulation study of @hiabu23. The simulations are based on the `SynthETIC` package and can be used to replicate our results. In the manuscript, we named the $5$ scenarios Alpha, Beta, Gamma, Delta and Epsilon. All $5$ scenarios share the data features described in the following table. The characteristics specific to each scenario are described in the coming sections.

| Covariates                              | Description        |
|-----------------------------------------|--------------------|
| `claim_number`                          | Policy identifier. |
| `claim_type` $\in \left\{0, 1 \right\}$ | Type of claim.     |
| `AP`                                    | Accident month.    |
| `RP`                                    | Reporting month.   |

For each scenario we discuss whether it satisfies the chain ladder assumptions (CL) and the proportionality assumption in @cox72 (PROP), and whether interactions are present (INT). Details on the simulation mechanism and the simulation parameters can be found in the manuscript.

# Scenario Alpha

This scenario is a mix of `claim_type 0` and `claim_type 1` with the same number of claims in each accident month (i.e. a constant claims volume).

```{r eval=FALSE, include=TRUE}
# Input data
input_data_0 <- data_generator(
  random_seed = 1964,
  scenario = "alpha",
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)
```

```{r eval=FALSE, include=TRUE}
# dplyr supplies %>% and mutate(), which the plotting code in this vignette uses
library(dplyr)

# Empirical distribution of the notification delay (RT - AT) by claim type
input_data_0 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays",
       x = "Notification delay (in days)",
       y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type",
                         override.aes = list(
                           color = c("royalblue", "#a71429"),
                           size = 2
                         )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
```

# Scenario Beta

This scenario is similar to scenario Alpha, but the volume of `claim_type 1` decreases in the most recent accident dates. When the longer-tailed bodily injuries have a decreasing claim volume, aggregated chain ladder methods overestimate reserves, see @ajne94.

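The overestimation mechanism can be illustrated on expected (noise-free) claim counts. The following is a minimal sketch and not the data generating process used by `data_generator`: the reporting patterns `cdf_type0` and `cdf_type1`, the claim volumes and the helper `chain_ladder_total()` are hypothetical choices made only for this illustration.

```r
# Toy setting: 4 accident periods and 4 development periods (hypothetical numbers).
n <- 4
cdf_type0 <- c(0.7, 0.9, 1.0, 1.0) # share reported by development period, short tail
cdf_type1 <- c(0.2, 0.5, 0.8, 1.0) # share reported by development period, long tail

# Total ultimate claim count projected by chain ladder on the aggregated triangle.
# vol0 and vol1 are the expected claim volumes per accident period for each type.
chain_ladder_total <- function(vol0, vol1) {
  # Expected cumulative reported counts, rows = accident, columns = development.
  cum_counts <- outer(vol0, cdf_type0) + outer(vol1, cdf_type1)

  # Development factors estimated from the observed upper-left triangle only.
  dev_factors <- numeric(n - 1)
  for (k in seq_len(n - 1)) {
    rows <- seq_len(n - k)
    dev_factors[k] <- sum(cum_counts[rows, k + 1]) / sum(cum_counts[rows, k])
  }

  # Project the latest observed diagonal to ultimate.
  ultimate <- numeric(n)
  for (i in seq_len(n)) {
    last <- n + 1 - i
    ultimate[i] <- cum_counts[i, last] * prod(dev_factors[seq_len(n - 1) >= last])
  }
  sum(ultimate)
}

# Constant volume for both types (Alpha-like): true total is 800.
chain_ladder_total(rep(100, n), rep(100, n))

# Decreasing volume of the long-tailed type (Beta-like): true total is 680.
chain_ladder_total(rep(100, n), c(100, 80, 60, 40))
```

With a constant mix, the aggregated development factors are the same for every accident period and the projection recovers the true total. With a decreasing volume of the longer-tailed type, the factors are estimated from older accident periods that contain relatively more slow claims, so the projection for the recent periods exceeds the true total of 680.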
```{r eval=FALSE, include=TRUE}
# Input data
input_data_1 <- data_generator(
  random_seed = 1964,
  scenario = 1,
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)
```

```{r eval=FALSE, include=TRUE}
# Empirical distribution of the notification delay (RT - AT) by claim type
input_data_1 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays",
       x = "Notification delay (in days)",
       y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type",
                         override.aes = list(
                           color = c("royalblue", "#a71429"),
                           size = 2
                         )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
```

# Scenario Gamma

An interaction between `claim_type 1` and accident period affects the claims occurrence. One could imagine a scenario where a change in consumer behavior or company policies results in different reporting patterns over time. For the last simulated accident month, the two reporting delay distributions are identical.

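To make the idea of an interaction that vanishes in the last accident month concrete, here is a minimal sketch with exponential reporting delays. It is an assumption for illustration only and not the parametrization used in the manuscript: `n_months`, `base_rate`, `beta_int` and the functions `delay_rate()` and `mean_delay()` are all hypothetical.

```r
# Hypothetical log-linear reporting rate with a claim_type x accident-month
# interaction. The interaction term shrinks to zero at the last accident month.
n_months <- 48        # toy number of accident months
base_rate <- 1 / 100  # assumed baseline reporting rate (per day)
beta_int <- -0.8      # assumed interaction strength

delay_rate <- function(claim_type, AP) {
  base_rate * exp(beta_int * claim_type * (1 - AP / n_months))
}

# Mean of an exponential delay with the rate above.
mean_delay <- function(claim_type, AP) {
  1 / delay_rate(claim_type, AP)
}

# Mean delays in the first, middle and last accident month for each claim type.
mean_delay(0, c(1, 24, 48))
mean_delay(1, c(1, 24, 48))
```

Under these assumptions `claim_type 1` reports more slowly than `claim_type 0` in early accident months, and the two mean delays coincide in the last month.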
```{r}
# Input data
input_data_2 <- data_generator(
  random_seed = 1964,
  scenario = 2,
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)
```

```{r eval=FALSE, include=TRUE}
# Empirical distribution of the notification delay (RT - AT) by claim type
input_data_2 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays",
       x = "Notification delay (in days)",
       y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type",
                         override.aes = list(
                           color = c("royalblue", "#a71429"),
                           size = 2
                         )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
```

# Scenario Delta

A seasonality effect that depends on the accident month is present for both `claim_type 0` and `claim_type 1`. This could occur in a real-world setting with an increased workload during winter for certain claim types, or a reduced workforce during the summer holidays.

```{r}
# Input data
input_data_3 <- data_generator(
  random_seed = 1964,
  scenario = 3,
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)
```

```{r eval=FALSE, include=TRUE}
# Empirical distribution of the notification delay (RT - AT) by claim type
input_data_3 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays",
       x = "Notification delay (in days)",
       y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type",
                         override.aes = list(
                           color = c("royalblue", "#a71429"),
                           size = 2
                         )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
```

# Scenario Epsilon

The data generating process violates the proportionality assumption in @cox72. We generate the data assuming that a) the covariates have an effect on the baseline and b) the proportionality assumption does not hold.

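As a reminder of what a violation of proportionality looks like, the sketch below compares hazard ratios for Weibull-distributed delays. It is a generic illustration under assumed parameters, not the Epsilon data generating process: the shapes, scales, the time grid and the helper `weibull_hazard()` are hypothetical.

```r
# Hazard of a Weibull distribution: density divided by survival function.
weibull_hazard <- function(t, shape, scale) {
  dweibull(t, shape = shape, scale = scale) /
    pweibull(t, shape = shape, scale = scale, lower.tail = FALSE)
}

t_grid <- c(30, 180, 720) # delays in days (toy grid)

# Same shape in both groups: the hazard ratio does not depend on time.
ratio_prop <- weibull_hazard(t_grid, shape = 1.5, scale = 200) /
  weibull_hazard(t_grid, shape = 1.5, scale = 400)

# Different shapes: the hazard ratio changes over time, so the
# proportionality assumption does not hold.
ratio_nonprop <- weibull_hazard(t_grid, shape = 0.8, scale = 200) /
  weibull_hazard(t_grid, shape = 1.5, scale = 400)

ratio_prop
ratio_nonprop
```

When the two groups share the shape parameter, the ratio is constant over the grid. When the shapes differ, the ratio drifts with the delay, which is the kind of behavior a proportional hazards model cannot capture with a single time-constant coefficient.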
```{r}
# Input data
input_data_4 <- data_generator(
  random_seed = 1964,
  scenario = 4,
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)
```

```{r eval=FALSE, include=TRUE}
# Empirical distribution of the notification delay (RT - AT) by claim type
input_data_4 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays",
       x = "Notification delay (in days)",
       y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type",
                         override.aes = list(
                           color = c("royalblue", "#a71429"),
                           size = 2
                         )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
```

# Bibliography