--- title: "Introduction to Smoothy" author: "Dan Ouchi" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{introduction-to-smoothy} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(rmarkdown.html_vignette.check_title = FALSE) ``` # Overview The smoothy package is an R package that provides a collection of functions to smooth the drug exposition using electronic health records. Also, the package helps to prepare the electronic drug records to model the drug exposition and perform longitudinal treatment pattern analysis. This vignette serves as a guide to understanding the main functionalities of the package and provides examples of how to use the functions. ## Installation You can install the `smoothy` package from CRAN using the following command: ```{r echo=TRUE,eval=F} install.packages('smoothy') ``` ### Load packages: ```{r setup, warning=FALSE,echo=T} library(smoothy) library(tidyr) library(knitr) ``` ## Functions The `smoothy` package includes the following main functions: - `smooth_algorithm()`: This function performs the smooth algorithm as descrubibed in Ouchi et.al.[[ref](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9709675/)] - `smooth_parse()`: This function transforms the data to obtain the daily treatment. - `smooth_deparse()`: This function returns to the original format but collapsing all the drugs exposed on the same day. - `smooth_diff()`: This function computes the differences between the original treatment and the smoothed treatment. ## The smooth algorithm The algorithm described in the published paper [[ref](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9709675/)] is based on moving averages commonly used in time series analysis. Its purpose is to facilitate modeling of drug exposure using electronic drug records. The algorithm calculates the average or most probable treatment exposure for each day of the follow-up period. The package contains all the necessary functions to execute the algorithm, which can be divided into three main steps: - Transform the raw data (parse) with the `smooth_parse()` function - Running the smoothing algorithm `smooth_algorithm()` function - Collapsing the treatment periodsfrom the smoothing algorithm with `smooth_deparse()` function Additionally, the package include one function to validate and better assess the impact of the algorithm on the original records: - `smooth_diff()`: Calculates the differences between the original data and the smoothed data ## Input dataset The raw data (input data) should be structured as a data frame with columns representing unique identifiers for each drug administration event (id), the start date of drug administration (start_date), the end date of drug administration (end_date), and the name of the drug administered (drug). ## Example dataset The smoothy package comes with an example dataset called *drugstreatment*. This dataset contains information of 387 patients and the prescribed drugs. It has the following variables: - idp: Identifier for each individual. - start_date: Date of drug initiation. - end_date: Date of drug end. - drug: Type of drug exposed, coded as A, B, or C. You can load the example dataset using the following command: ```{r data} data(drugstreatment) ``` ```{r data_tab, echo=FALSE, results='asis'} kable(head(drugstreatment),format='markdown') ``` # The workflow For this example, we are going to use a single patient: ```{r single} my_data <- dplyr::filter(drugstreatment, id == "62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956") ``` ## Step 1. Transform the Original Data (parse) We will expand the original drug records to represent daily active treatment during the defined study period. ```{r parse} structured_df <- smooth_parse( id = my_data$id, start_date = my_data$start_date, end_date = my_data$end_date, drug = my_data$drug, study_from = "1970-01-01", study_to = "1975-01-01" ) ``` The `smooth_parse()` function will return a new data frame with four columns: ```{r parse_tab, echo=FALSE, results='asis'} kable(head(structured_df),format='markdown') ``` - id: unique identifier - day: date vector with the day of follow-up - N: Number of drugs exposed to in that day - treatment: Character vector with the treatment. None: no exposed to any drug of study. ## Step 2. Apply the Algorithm Next, we will apply the smooth algorithm to the parsed data. ```{r smooth, } smoothed <- smooth_algorithm( id = structured_df$id, treatment = structured_df$treatment, day = structured_df$day, N = structured_df$N, width = 61 ) ``` This function performs the algorithm on each patient and it returns the same data frame but with a new column **smoothed_treatment**: ```{r smooth_tab, echo=FALSE, results='asis'} kable(head(smoothed),format='markdown') ``` ## Step 3. Untransform the Smoothed Data (deparse) To perform analysis on the treatment pattern, is better to collapse the treatments according to the exposition periods (end date and start date). ```{r deparse} deparsed.sm <- smooth_deparse(smoothed$id, smoothed$day, smoothed$smoothed_treatment) ``` The `smooth_parse()` function use the smoothed data frame as input and returns a new one very similar to the original dataset that we are more commonly using in analysis. It has four columns: ```{r deparsed_tab, echo=FALSE, results='asis'} knitr::kable(deparsed.sm,format='markdown') ``` - id: unique identifier. - start_date: date vector with the day of treatment initiation. - end_date: date vector with the day of treatment end. - treatment: Character vector with the treatment (smoothed or not). None: no exposed to any drug of study. Also, the function can be used on the non-smoothed treatment pattern (column *treatment*) which is the same as the initial data but including the untreated periods. ```{r deparse0} deparsed <- smooth_deparse(smoothed$id, smoothed$day, smoothed$treatment) ``` ## Step 4. Validation The validation includes assessing the differences before and after applying the smooth algorithm, measuring the number of days changed and the proportion of total days, and visualizing the longitudinal treatment pattern to see the impact of the algorithm in the overall pattern. ### 4.1. Count Differences Before and After the Smooth Algorithm We will calculate the number of days changed and the proportion of total days in overall, exposition period and at each treatment. ```{r count diff, eval=FALSE} smooth_diff(treatment = smoothed$treatment, smoothed_treatment = smoothed$smoothed_treatment) ``` ```{r count diff_tab, echo=FALSE, results='asis'} kable(smooth_diff(treatment = smoothed$treatment, smoothed_treatment = smoothed$smoothed_treatment),format='markdown') ``` ### 4.2 Visualization Using the following code, in ggplot2, we can visualize the longitudinal treatment pattern of each individual before and after using the smooth algorithm. ```{r viz, fig.show='hold',eval=T,echo=T,fig.align='center',fig.height=4,fig.width=8} require(ggplot2) require(gridExtra) tts = unique(deparsed$treatment) tts = tts[tts!='None'] yorder = c('None',tts[order(nchar(tts), tts)]) p = ggplot(deparsed, aes(x = start_date, y = treatment)) + geom_segment(aes(x = start_date, xend = end_date, y = treatment, yend = treatment), size = 2, alpha = 0.85, col = 'grey20') + theme_bw() + scale_x_date(date_breaks = '6 months') + scale_y_discrete(limits = yorder) + theme(axis.text.x = element_text(angle = 60, hjust = 1), legend.position = "none") + ylab("") + xlab("") + ggtitle(paste0('Original treatment')) tts = unique(deparsed.sm$treatment) tts = tts[tts!='None'] yorder = c('None',tts[order(nchar(tts), tts)]) p.sm = ggplot(deparsed.sm, aes(x = start_date, y = treatment)) + geom_segment(aes(x = start_date, xend = end_date, y = treatment, yend = treatment), size = 2, alpha = 0.85, col = 'grey20') + theme_bw() + scale_x_date(date_breaks = '6 months') + scale_y_discrete(limits = yorder) + theme(axis.text.x = element_text(angle = 60, hjust = 1), legend.position = "none") + ylab("") + xlab("") + ggtitle(paste0('Smoothed treatment')) grid.arrange(p,p.sm,ncol=1) ``` ## Choosing the windows size The default window size is set to 61 days. This value is based on various simulation studies and real-world analyses, which have shown that 61 days effectively captures treatment patterns without omitting critical treatment periods. However, it is essential to note that this window size may not be suitable for all patients. For cases where there are significant changes in the treatment pattern, it may be beneficial to manually review and assess whether the algorithm oversimplifies the treatment pattern. In the following example, we have selected a patient with frequent treatment changes to illustrate the effects of different window sizes: ```{r window_size} my_data <- dplyr::filter(drugstreatment, id == '25094328-3819-4061-941d-191c4e0bc939') structured_df <- smooth_parse( id = my_data$id, start_date = my_data$start_date, end_date = my_data$end_date, drug = my_data$drug, study_from = "1970-01-01", study_to = "1975-01-01" ) # smooth smoothed <- smooth_algorithm(id = structured_df$id, treatment = structured_df$treatment, day = structured_df$day, N = structured_df$N, width = 31) smoothed45 <- smooth_algorithm(id = structured_df$id, treatment = structured_df$treatment, day = structured_df$day, N = structured_df$N, width = 45) smoothed61 <- smooth_algorithm(id = structured_df$id, treatment = structured_df$treatment, day = structured_df$day, N = structured_df$N, width = 61) deparsed <- smooth_deparse(smoothed$id, smoothed$day, smoothed$treatment) deparsed.sm <- smooth_deparse(smoothed$id, smoothed$day, smoothed$smoothed_treatment) deparsed.sm45 <- smooth_deparse(smoothed45$id, smoothed45$day, smoothed45$smoothed_treatment) deparsed.sm61 <- smooth_deparse(smoothed61$id, smoothed61$day, smoothed61$smoothed_treatment) ``` ```{r viz_window, fig.show='hold',eval=T,echo=F,fig.align='center',fig.height=8,fig.width=8} ## plot: tts = unique(deparsed$treatment) tts = tts[tts!='None'] yorder = c('None',tts[order(nchar(tts), tts)]) p = ggplot(deparsed, aes(x = start_date, y = treatment)) + geom_segment(aes(x = start_date, xend = end_date, y = treatment, yend = treatment), linewidth = 2, alpha = 0.85, col = 'grey20') + theme_bw() + scale_x_date(date_breaks = '6 months') + scale_y_discrete(limits = yorder) + theme(axis.text.x = element_text(angle = 60, hjust = 1), legend.position = "none") + ylab("") + xlab("") + ggtitle(paste0('Original raw data treatment')) #> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0. #> ℹ Please use `linewidth` instead. #> This warning is displayed once every 8 hours. #> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was #> generated. tts = unique(deparsed.sm$treatment) tts = tts[tts!='None'] yorder = c('None',tts[order(nchar(tts), tts)]) p.sm = ggplot(deparsed.sm, aes(x = start_date, y = treatment)) + geom_segment(aes(x = start_date, xend = end_date, y = treatment, yend = treatment), linewidth = 2, alpha = 0.85, col = 'grey20') + theme_bw() + scale_x_date(date_breaks = '6 months') + scale_y_discrete(limits = yorder) + theme(axis.text.x = element_text(angle = 60, hjust = 1), legend.position = "none") + ylab("") + xlab("") + ggtitle(paste0('Smoothed treatment (31 days window)')) tts = unique(deparsed.sm45$treatment) tts = tts[tts!='None'] yorder = c('None',tts[order(nchar(tts), tts)]) p.sm45 = ggplot(deparsed.sm45, aes(x = start_date, y = treatment)) + geom_segment(aes(x = start_date, xend = end_date, y = treatment, yend = treatment), linewidth = 2, alpha = 0.85, col = 'grey20') + theme_bw() + scale_x_date(date_breaks = '6 months') + scale_y_discrete(limits = yorder) + theme(axis.text.x = element_text(angle = 60, hjust = 1), legend.position = "none") + ylab("") + xlab("") + ggtitle(paste0('Smoothed treatment (45 days window)')) tts = unique(deparsed.sm61$treatment) tts = tts[tts!='None'] yorder = c('None',tts[order(nchar(tts), tts)]) p.sm61 = ggplot(deparsed.sm61, aes(x = start_date, y = treatment)) + geom_segment(aes(x = start_date, xend = end_date, y = treatment, yend = treatment), linewidth = 2, alpha = 0.85, col = 'grey20') + theme_bw() + scale_x_date(date_breaks = '6 months') + scale_y_discrete(limits = yorder) + theme(axis.text.x = element_text(angle = 60, hjust = 1), legend.position = "none") + ylab("") + xlab("") + ggtitle(paste0('Smoothed treatment (61 days window)')) gridExtra::grid.arrange(p,p.sm,p.sm45,p.sm61,ncol=1) ```

# Utilizing the Algorithm in a Large Dataset Executing the entire workflow in a real study can be computationally intensive. Without a workstation equipped with ample RAM and high-speed processors, applying the smooth algorithm without parallel computing may prove to be unfeasible. To overcome this challenge, we demonstrate how to run the algorithm on the entire example dataset by dividing it into manageable chunks and parallelizing the calculations in R. To achieve this, make sure to have the following R packages installed: ```{r echo=TRUE,eval=FALSE} library(doParallel) library(foreach) library(snow) library(doSNOW) ``` Below is a script demonstrating how to run the entire workflow in parallel in R: ## 1) Split Data in Chunks In this example, the chunk size is 50, but in a big data set we recommend to use chunks of size 1000. ```{r parallel1, echo=TRUE,eval=FALSE} data = drugstreatment # starting date: s0 = Sys.time() # create chunks: chunksize = 50 niter <- n_distinct(drugs$id) chunks <- ceiling(niter/chunksize) inds <- split(seq_len(niter), sort(rep_len(seq_len(chunks), niter))) ``` ## 2) Sockets, Cores and Progress bar Parameters to pass into the `foreach()` function. Additionally, the progress bar feature provides a helpful visual representation of the ongoing computation, allowing you to monitor the progress and estimate the remaining time until completion ```{r parallel2, echo=TRUE,eval=FALSE} # parallel - cores and socket: n.cores <- 3 cl <- snow::makeSOCKcluster(n.cores) doSNOW::registerDoSNOW(cl) # progress bar: pb <- utils::txtProgressBar(min = 1, max = chunks, style = 3) progress <- function(n) utils::setTxtProgressBar(pb,n) opts <- list(progress = progress) ``` ## 3) Parallel Execution using foreach() and %dopar% The entire workflow is encompassed within the foreach() function, executed in a loop across all the data chunks using %dopar%. During each iteration, the computed output is saved in the tempdir() location in the RDS format. This allows for parallel processing, optimizing the performance and enhancing computational efficiency. ```{r parallel3, echo=TRUE,eval=FALSE} # path to temporal directory: tmp.path = tempdir() diff=FALSE l <- foreach(c = 1:chunks, .packages = c("Kendall", "smoothy", "data.table", "anytime", "dplyr"), .options.snow = opts, .multicombine = F) %dopar% { chunk.id <- unique(data$id)[inds[[c]]] # run the workflow in each individual from the chunk: df <- filter(data, id %in% chunk.id) # 1) parse data: structured_df <- smooth_parse( id = df$id, start_date = df$start_date, end_date = df$end_date, drug = df$drug, study_from = "1970-01-01", study_to = "1975-01-01" ) # 2) smooth algorithm: width <- 61 smoothed <- smooth_algorithm( id = structured_df$id, treatment = structured_df$treatment, day = structured_df$day, N = structured_df$N, width = width ) # 3) deparse data (original format): deparsed_smoothed <- smooth_deparse( smoothed$id, smoothed$day, smoothed$smoothed_treatment ) # 4) Per patient changes due to smooth algorithm: if(diff){ # Calculate differences by patient mapping with the group_map function: df <- smoothed %>% group_by(id) %>% group_map(~ smooth_diff(.$treatment,.$smoothed_treatment)) %>% bind_rows(.id = "group_id") %>% data.frame # Format output and filter global, exposure period: df <- df %>% mutate(percentage_of_change = round(proportion_of_change*100,2)) %>% filter(type%in%c('Global','Exposure period')) %>% mutate(type = factor(type,levels=c('Global','Exposure period'), labels=c('total_change','exposure_change'))) # add 'id' and reshape: df <- df %>% left_join(data.frame(id=unique(smoothed$id),group_id = as.character(seq(1,n_distinct(smoothed$id))))) %>% reshape2::dcast(id~type,value.var='percentage_of_change') # attach to deparsed_smoothed dataframe: deparsed_smoothed <- left_join( deparsed_smoothed, df ) rm(df) } # Save chunk output to a temporary folder: saveRDS(deparsed_smoothed,paste0(tmp.path,"/chunk_",c,".rds")) rm(df,structured_df,smoothed,deparsed_smoothed);gc() } ``` ### 3.1) Patient-Level Differences Within the loop, there is an optional code segment used to calculate the differences after applying the smooth algorithm at a patient level. This step is entirely optional, but it can be useful for validation, especially in cases where patients exhibit significant changes (approximately 10% of change). ```{r parallel31, echo=TRUE,eval=FALSE} # Calculate differences by patient mapping with the group_map function: aux <- smoothed %>% group_by(id) %>% group_map(~ smooth_diff(.$treatment,.$smoothed_treatment)) %>% bind_rows(.id = "group_id") %>% data.frame # Format output and filter global, exposure period: aux <- aux %>% mutate(percentage_of_change = round(proportion_of_change*100,2)) %>% filter(type%in%c('Global','Exposure period')) %>% mutate(type = factor(type,levels=c('Global','Exposure period'), labels=c('total_change','exposure_change'))) # add 'id' and reshape: aux <- aux %>% left_join(data.frame(id=unique(smoothed$id),group_id = as.character(seq(1,n_distinct(smoothed$id))))) %>% reshape2::dcast(id~type,value.var='percentage_of_change') # join to deparsed_smoothed dataframe: deparsed_smoothed <- left_join( deparsed_smoothed, aux ) ``` After completing the process, it is essential to close the clusters to release the computing resources. Additionally, you may want to calculate the total computation time for the entire workflow. ```{r parallel32, echo=TRUE,eval=FALSE} # close sockets: close(pb) snow::stopCluster(cl) # Time to finish the process: t0 = Sys.time() - s0 cat("\n The process finished in", round(t0), units(t0)) ``` ## 4) Read and Combine All Chunks To complete the process, we read the .rds files stored in the temporary folder and combine them into a single data.frame using the bind_rows function: ```{r parallel4, echo=TRUE,eval=FALSE} # Import and combine all chunks into a single data.frame: rds_files <- list.files(tmp.path, pattern = "chunk_", full.names = TRUE) all_chunks <- bind_rows(lapply(rds_files, readRDS)) ``` The computation time may vary depending on the number of samples and the study period. We conducted several tests on both Mac and Windows platforms, and the performance was similar. As a reference: - PC/Mac with 8 cores and 16GB of RAM - A dataset with 500k patients and 2.8M electronic prescriptions - Follow-up period of 6 years The entire process took 16 hours to finish.