Introduction to Smoothy

Overview

The smoothy package is an R package that provides a collection of functions to smooth drug exposure derived from electronic health records. The package also helps prepare electronic drug records for modeling drug exposure and performing longitudinal treatment pattern analyses. This vignette serves as a guide to the main functionalities of the package and provides examples of how to use its functions.

Installation

You can install the smoothy package from CRAN using the following command:

install.packages('smoothy')

Load packages:

library(smoothy)
library(tidyr)
library(knitr)

Functions

The smoothy package includes the following main functions:

  • smooth_algorithm(): This function performs the smoothing algorithm described in Ouchi et al. [ref].
  • smooth_parse(): This function transforms the data to obtain the daily treatment.
  • smooth_deparse(): This function returns the data to the original format, collapsing all the drugs a patient is exposed to on the same day.
  • smooth_diff(): This function computes the differences between the original treatment and the smoothed treatment.

The smooth algorithm

The algorithm described in the published paper [ref] is based on moving averages commonly used in time series analysis. Its purpose is to facilitate modeling of drug exposure using electronic drug records. The algorithm calculates the average or most probable treatment exposure for each day of the follow-up period.
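
To build intuition, here is a minimal sketch of the rolling-mode idea (illustrative only; it is not the package implementation, which is provided by smooth_algorithm()): for each day, the treatment is replaced by the most frequent treatment observed within a centered window.

# Illustrative sketch only -- not the smoothy implementation.
# For each day, take the most frequent treatment in a centered window:
rolling_mode <- function(x, width = 61) {
  half <- width %/% 2
  vapply(seq_along(x), function(i) {
    window <- x[max(1, i - half):min(length(x), i + half)]
    names(which.max(table(window)))
  }, character(1))
}

daily <- c(rep("A", 10), "None", rep("A", 10))  # a 1-day gap in drug A
rolling_mode(daily, width = 7)                  # the gap is smoothed to "A"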

The package contains all the necessary functions to execute the algorithm, which can be divided into three main steps:

  • Transform the raw data (parse) with the smooth_parse() function
  • Run the smoothing algorithm with the smooth_algorithm() function
  • Collapse the treatment periods from the smoothing algorithm with the smooth_deparse() function

Additionally, the package includes one function to validate and better assess the impact of the algorithm on the original records:

  • smooth_diff(): Calculates the differences between the original data and the smoothed data

Input dataset

The raw data (input data) should be structured as a data frame with columns for the patient identifier (id), the start date of drug administration (start_date), the end date of drug administration (end_date), and the name of the drug administered (drug).
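
For illustration, a hypothetical input data frame in this format could look like:

# Hypothetical example of the expected input format:
input <- data.frame(
  id         = c("p1", "p1", "p2"),
  start_date = as.Date(c("1970-03-01", "1970-06-15", "1971-01-10")),
  end_date   = as.Date(c("1970-05-30", "1970-09-01", "1971-02-20")),
  drug       = c("A", "B", "A")
)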

Example dataset

The smoothy package comes with an example dataset called drugstreatment. This dataset contains information on 387 patients and their prescribed drugs. It has the following variables:

  • id: Identifier for each individual.
  • start_date: Date of drug initiation.
  • end_date: Date of drug end.
  • drug: Drug the patient is exposed to, coded as A, B, or C.

You can load the example dataset using the following command:

data(drugstreatment)
id start_date end_date drug
867e99f2-6cc1-492c-a1c0-e36f48c7aa95 1971-01-02 1971-01-04 C
867e99f2-6cc1-492c-a1c0-e36f48c7aa95 1971-01-05 1971-03-09 C
867e99f2-6cc1-492c-a1c0-e36f48c7aa95 1971-03-10 1971-04-07 C
867e99f2-6cc1-492c-a1c0-e36f48c7aa95 1971-04-08 1971-04-28 C
867e99f2-6cc1-492c-a1c0-e36f48c7aa95 1971-04-29 1971-06-05 C
867e99f2-6cc1-492c-a1c0-e36f48c7aa95 1971-06-06 1971-08-24 C
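
You can verify the number of patients with a quick check (the dataset description above states 387 individuals):

dplyr::n_distinct(drugstreatment$id)
#> Expected: 387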

The workflow

For this example, we are going to use a single patient:

my_data <- dplyr::filter(drugstreatment, id == "62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956")

Step 1. Transform the Original Data (parse)

We will expand the original drug records to represent daily active treatment during the defined study period.

structured_df <- smooth_parse(
  id = my_data$id,
  start_date = my_data$start_date,
  end_date = my_data$end_date,
  drug = my_data$drug,
  study_from = "1970-01-01",
  study_to = "1975-01-01"
)

The smooth_parse() function will return a new data frame with four columns:

id day N treatment
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 1970-01-01 0 None
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 1970-01-02 0 None
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 1970-01-03 0 None
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 1970-01-04 0 None
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 1970-01-05 0 None
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 1970-01-06 0 None
  • id: unique identifier.
  • day: date vector with the day of follow-up.
  • N: number of drugs the patient is exposed to on that day.
  • treatment: character vector with the treatment; None means not exposed to any study drug.
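
As a quick sanity check (assuming smooth_parse() returns one row per patient per day, with both study boundaries included), the number of rows should equal the number of days in the study period times the number of patients:

# One row per patient and day of the study period:
n_days <- as.integer(as.Date("1975-01-01") - as.Date("1970-01-01")) + 1  # 1827
nrow(structured_df) == n_days * dplyr::n_distinct(structured_df$id)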

Step 2. Apply the Algorithm

Next, we will apply the smooth algorithm to the parsed data.

smoothed <- smooth_algorithm(
  id = structured_df$id, 
  treatment = structured_df$treatment, 
  day = structured_df$day, 
  N = structured_df$N, 
  width = 61
)

This function runs the algorithm on each patient and returns the same data frame with a new column, smoothed_treatment:

id treatment day N smoothed_treatment
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 None 1970-01-01 0 None
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 None 1970-01-02 0 None
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 None 1970-01-03 0 None
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 None 1970-01-04 0 None
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 None 1970-01-05 0 None
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 None 1970-01-06 0 None

Step 3. Untransform the Smoothed Data (deparse)

To analyze the treatment pattern, it is better to collapse the treatments according to the exposure periods (start date and end date).

deparsed.sm <- smooth_deparse(smoothed$id, smoothed$day, smoothed$smoothed_treatment)

The smooth_deparse() function uses the smoothed data frame as input and returns a new one very similar to the original dataset, which is the format most commonly used in analysis. It has four columns:

id start_date end_date treatment
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 1970-01-01 1972-03-01 None
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 1972-03-02 1972-04-13 A
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 1972-04-14 1972-10-31 A+C
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 1972-11-01 1973-06-21 A+B+C
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 1973-06-22 1973-12-21 A+B
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 1973-12-22 1975-01-01 None
  • id: unique identifier.
  • start_date: date vector with the day of treatment initiation.
  • end_date: date vector with the day of treatment end.
  • treatment: character vector with the treatment (smoothed or not); None means not exposed to any study drug.

The function can also be used on the non-smoothed treatment pattern (column treatment), which reproduces the initial data but includes the untreated periods.

deparsed <- smooth_deparse(smoothed$id, smoothed$day, smoothed$treatment)
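
Because smoothing removes short, spurious periods, the smoothed pattern typically collapses into fewer, longer periods; comparing the number of rows of the two deparsed data frames is a quick way to see this:

# The smoothed pattern usually has fewer, longer periods:
nrow(deparsed)     # periods in the original pattern (untreated periods included)
nrow(deparsed.sm)  # periods after smoothing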

Step 4. Validation

The validation includes assessing the differences before and after applying the smooth algorithm: measuring the number of days changed and the proportion of total days, and visualizing the longitudinal treatment pattern to see the impact of the algorithm on the overall pattern.

4.1. Count Differences Before and After the Smooth Algorithm

We will calculate the number of days changed and the proportion of total days overall, within the exposure period, and for each treatment.

smooth_diff(treatment = smoothed$treatment, smoothed_treatment = smoothed$smoothed_treatment)
type days_changed proportion_of_change
Global 54 0.0295567
Exposure period 53 0.0804249
drug change A 1 0.0333333
drug change A+B 22 0.1073171
drug change A+B+C 0 0.0000000
drug change A+C 0 0.0000000
drug change C 31 1.0000000
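
For interpretation, the proportion in each row is the number of days changed divided by the number of days in that row's scope. For example, the Global row divides by the full follow-up period (1970-01-01 to 1975-01-01, i.e. 1,827 days):

total_days <- as.integer(as.Date("1975-01-01") - as.Date("1970-01-01")) + 1
54 / total_days  # ~0.0296, matching the Global row above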

4.2. Visualization

Using the following ggplot2 code, we can visualize the longitudinal treatment pattern of each individual before and after applying the smooth algorithm.

require(ggplot2)
#> Loading required package: ggplot2
require(gridExtra)
#> Loading required package: gridExtra

tts <- unique(deparsed$treatment)
tts <- tts[tts != 'None']
yorder <- c('None', tts[order(nchar(tts), tts)])
p <- ggplot(deparsed, aes(x = start_date, y = treatment)) +
  geom_segment(aes(x = start_date, xend = end_date, y = treatment, yend = treatment),
               linewidth = 2, alpha = 0.85, col = 'grey20') +
  theme_bw() +
  scale_x_date(date_breaks = '6 months') +
  scale_y_discrete(limits = yorder) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1), legend.position = "none") +
  ylab("") + xlab("") +
  ggtitle('Original treatment')

tts <- unique(deparsed.sm$treatment)
tts <- tts[tts != 'None']
yorder <- c('None', tts[order(nchar(tts), tts)])
p.sm <- ggplot(deparsed.sm, aes(x = start_date, y = treatment)) +
  geom_segment(aes(x = start_date, xend = end_date, y = treatment, yend = treatment),
               linewidth = 2, alpha = 0.85, col = 'grey20') +
  theme_bw() +
  scale_x_date(date_breaks = '6 months') +
  scale_y_discrete(limits = yorder) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1), legend.position = "none") +
  ylab("") + xlab("") +
  ggtitle('Smoothed treatment')

grid.arrange(p, p.sm, ncol = 1)

Choosing the window size

The default window size is set to 61 days. This value is based on various simulation studies and real-world analyses, which have shown that 61 days effectively captures treatment patterns without omitting critical treatment periods.

However, it is important to note that this window size may not be suitable for all patients. In cases with significant changes in the treatment pattern, it may be beneficial to manually review and assess whether the algorithm oversimplifies the treatment pattern.

In the following example, we have selected a patient with frequent treatment changes to illustrate the effects of different window sizes:

my_data <- dplyr::filter(drugstreatment, id == '25094328-3819-4061-941d-191c4e0bc939')

structured_df <- smooth_parse(
  id = my_data$id,
  start_date = my_data$start_date,
  end_date = my_data$end_date,
  drug = my_data$drug,
  study_from = "1970-01-01",
  study_to = "1975-01-01"
)

# smooth
smoothed <- smooth_algorithm(id = structured_df$id, 
                             treatment = structured_df$treatment, 
                             day = structured_df$day, 
                             N = structured_df$N, 
                             width = 31)
smoothed45 <- smooth_algorithm(id = structured_df$id, 
                             treatment = structured_df$treatment, 
                             day = structured_df$day, 
                             N = structured_df$N, 
                             width = 45)
smoothed61 <- smooth_algorithm(id = structured_df$id, 
                             treatment = structured_df$treatment, 
                             day = structured_df$day, 
                             N = structured_df$N, 
                             width = 61)


deparsed <- smooth_deparse(smoothed$id, smoothed$day, smoothed$treatment)
deparsed.sm <- smooth_deparse(smoothed$id, smoothed$day, smoothed$smoothed_treatment)
deparsed.sm45 <- smooth_deparse(smoothed45$id, smoothed45$day, smoothed45$smoothed_treatment)
deparsed.sm61 <- smooth_deparse(smoothed61$id, smoothed61$day, smoothed61$smoothed_treatment)
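
To quantify how the window size affects the result, one option (sketched below) is to compare the number of days changed by each width with smooth_diff(); larger windows generally change more days and produce a simpler pattern.

# Compare the impact of each window size (a quick sketch):
smooth_diff(smoothed$treatment,   smoothed$smoothed_treatment)    # width = 31
smooth_diff(smoothed45$treatment, smoothed45$smoothed_treatment)  # width = 45
smooth_diff(smoothed61$treatment, smoothed61$smoothed_treatment)  # width = 61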



Utilizing the Algorithm in a Large Dataset

Executing the entire workflow in a real study can be computationally intensive. Unless you have a workstation equipped with ample RAM and high-speed processors, applying the smooth algorithm without parallel computing may prove infeasible.

To overcome this challenge, we demonstrate how to run the algorithm on the entire example dataset by dividing it into manageable chunks and parallelizing the calculations in R.

To achieve this, make sure to have the following R packages installed:

library(doParallel)
library(foreach)
library(snow)
library(doSNOW)

Below is a script demonstrating how to run the entire workflow in parallel in R:

1) Split Data into Chunks

In this example, the chunk size is 50, but for a big dataset we recommend chunks of size 1000.

data <- drugstreatment

# starting time:
s0 <- Sys.time()

# create chunks:
chunksize <- 50
niter <- dplyr::n_distinct(data$id)
chunks <- ceiling(niter / chunksize)
inds <- split(seq_len(niter), sort(rep_len(seq_len(chunks), niter)))
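
A quick check that the split assigns every patient index to exactly one chunk:

length(inds)                  # number of chunks
sum(lengths(inds)) == niter   # TRUE: every patient index assigned once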

2) Sockets, Cores and Progress bar

These parameters are passed into the foreach() function. Additionally, the progress bar provides a helpful visual representation of the ongoing computation, allowing you to monitor progress and estimate the remaining time until completion.


# parallel - cores and socket:
n.cores <- 3
cl <- snow::makeSOCKcluster(n.cores)
doSNOW::registerDoSNOW(cl)

# progress bar:
pb <- utils::txtProgressBar(min = 1, max = chunks, style = 3)
progress <- function(n) utils::setTxtProgressBar(pb,n)
opts <- list(progress = progress)

3) Parallel Execution using foreach() and %dopar%

The entire workflow is wrapped in the foreach() function and executed across all the data chunks using %dopar%. During each iteration, the computed output is saved to the tempdir() location in RDS format. This enables parallel processing, improving performance and computational efficiency.


# path to temporary directory:
tmp.path <- tempdir()
diff <- FALSE

l <- foreach(c = 1:chunks,
             .packages = c("Kendall", "smoothy", "data.table", "anytime", "dplyr"),
             .options.snow = opts,
             .multicombine = F) %dopar% {
               
               chunk.id <- unique(data$id)[inds[[c]]]
               
               # run the workflow in each individual from the chunk:
               df <- filter(data, id %in% chunk.id)
               
               # 1) parse data:
               structured_df <- smooth_parse(
                 id = df$id,
                 start_date = df$start_date,
                 end_date = df$end_date,
                 drug = df$drug,
                 study_from = "1970-01-01",
                 study_to = "1975-01-01"
               )
               
               # 2) smooth algorithm:
               width <- 61
               
               smoothed <- smooth_algorithm(
                 id = structured_df$id,
                 treatment = structured_df$treatment,
                 day = structured_df$day,
                 N =  structured_df$N,
                 width = width
                 )
               
               # 3) deparse data (original format):
               deparsed_smoothed <- smooth_deparse(
                 smoothed$id, 
                 smoothed$day, 
                 smoothed$smoothed_treatment
                 )
               
               
               # 4) Per patient changes due to smooth algorithm:
               if(diff){
                 
                 # Calculate differences by patient mapping with the group_map function.
                 # Note: 'aux' is used (as in section 3.1) so the chunk data in 'df'
                 # is not overwritten before the cleanup at the end of the loop.
                 aux <- smoothed %>%
                   group_by(id) %>%
                   group_map(~ smooth_diff(.$treatment,.$smoothed_treatment)) %>%
                   bind_rows(.id = "group_id") %>%
                   data.frame
                 
                 # Format output and filter global, exposure period:
                 aux <- aux %>%
                   mutate(percentage_of_change = round(proportion_of_change*100,2)) %>%
                   filter(type%in%c('Global','Exposure period')) %>%
                   mutate(type = factor(type,levels=c('Global','Exposure period'),
                                        labels=c('total_change','exposure_change')))
                 # add 'id' and reshape:
                 aux <- aux %>%
                   left_join(data.frame(id=unique(smoothed$id),group_id = as.character(seq(1,n_distinct(smoothed$id))))) %>%
                   reshape2::dcast(id~type,value.var='percentage_of_change')
                 
                 # attach to deparsed_smoothed dataframe:
                 deparsed_smoothed <- left_join(
                   deparsed_smoothed,
                   aux
                 )
                 
                 rm(aux)
               }
               
               # Save chunk output to a temporary folder:
               saveRDS(deparsed_smoothed,paste0(tmp.path,"/chunk_",c,".rds"))
               
               rm(df,structured_df,smoothed,deparsed_smoothed);gc()
               
             }

3.1) Patient-Level Differences

Within the loop, there is an optional code segment used to calculate the differences after applying the smooth algorithm at the patient level. This step is entirely optional, but it can be useful for validation, especially for flagging patients who exhibit significant changes (approximately 10% of days changed).

               
# Calculate differences by patient mapping with the group_map function:
aux <- smoothed %>%
  group_by(id) %>%
  group_map(~ smooth_diff(.$treatment,.$smoothed_treatment)) %>%
  bind_rows(.id = "group_id") %>%
  data.frame

# Format output and filter global, exposure period:
aux <- aux %>%
  mutate(percentage_of_change = round(proportion_of_change*100,2)) %>%
  filter(type%in%c('Global','Exposure period')) %>%
  mutate(type = factor(type,levels=c('Global','Exposure period'),
                       labels=c('total_change','exposure_change')))
# add 'id' and reshape:
aux <- aux %>%
  left_join(data.frame(id=unique(smoothed$id),group_id = as.character(seq(1,n_distinct(smoothed$id))))) %>%
  reshape2::dcast(id~type,value.var='percentage_of_change')

# join to deparsed_smoothed dataframe:
deparsed_smoothed <- left_join(
  deparsed_smoothed,
  aux
)

After completing the process, it is essential to close the progress bar and stop the cluster to release the computing resources. Additionally, you may want to calculate the total computation time for the entire workflow.

# close sockets:
close(pb)
snow::stopCluster(cl)

# Time to finish the process:
t0 <- Sys.time() - s0
cat("\nThe process finished in", round(t0), units(t0))

4) Read and Combine All Chunks

To complete the process, we read the .rds files stored in the temporary folder and combine them into a single data.frame using the bind_rows function:

# Import and combine all chunks into a single data.frame:
rds_files <- list.files(tmp.path, pattern = "chunk_", full.names = TRUE)
all_chunks <- bind_rows(lapply(rds_files, readRDS))
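
As an optional sanity check, the combined output should contain every patient from the input data:

# Optional check: all patients are present in the combined output
dplyr::n_distinct(all_chunks$id) == dplyr::n_distinct(data$id)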

The computation time may vary depending on the number of samples and the study period. We conducted several tests on both Mac and Windows platforms, and the performance was similar.

As a reference:

  • PC/Mac with 8 cores and 16GB of RAM
  • A dataset with 500k patients and 2.8M electronic prescriptions
  • Follow-up period of 6 years

The entire process took 16 hours to finish.