Introduction to tvtools

Introduction

Longitudinal data collected over a period of time can show how a population changes. Gathering and structuring longitudinal information may require a more flexible design for the data. When a single subject’s records are updated at irregular intervals, and when the length of follow-up varies across subjects, a more complex data structure may be required. Panel data provides one option for storing longitudinal records: it structures information over intervals of time, with a variable number of records per subject. While useful for storing data, this structure complicates analysis, because calculations and models must account for the length of time that each record covers.

The tvtools package for R was published to simplify the process of exploring and analyzing panel data. This vignette will provide an overview of panel data and introduce a range of methods.

Sample Data

The tvtools package includes a sample data set called simulated.chd. These fictional records were simulated based on a scenario of medical follow-up for patients with coronary heart disease (CHD).

library(tvtools)
library(data.table)
library(DTwrappers)

file_path <- system.file("extdata", "simulated_data.csv", package="tvtools")
simulated.chd <- fread(input = file_path)

orig.data <- copy(simulated.chd)

We can begin exploring the data by noting its dimensionality:

dim(simulated.chd)
#> [1] 33572    13

The first ten rows of the simulated.chd data are:

simulated.chd[1:10,]
#>                   id    t1    t2   age    sex region
#>               <char> <int> <int> <int> <char> <char>
#>  1: 01KTl0KSK88EFV8N     0     8    69   Male   West
#>  2: 01KTl0KSK88EFV8N     8    30    69   Male   West
#>  3: 01KTl0KSK88EFV8N    30    38    69   Male   West
#>  4: 01KTl0KSK88EFV8N    38    46    69   Male   West
#>  5: 01KTl0KSK88EFV8N    46    66    69   Male   West
#>  6: 01KTl0KSK88EFV8N    66    90    69   Male   West
#>  7: 01KTl0KSK88EFV8N    90    94    69   Male   West
#>  8: 01KTl0KSK88EFV8N    94   110    69   Male   West
#>  9: 01KTl0KSK88EFV8N   110   124    69   Male   West
#> 10: 01KTl0KSK88EFV8N   124   133    69   Male   West
#>                       baseline.condition diabetes   ace    bb statin hospital
#>                                   <char>    <int> <int> <int>  <int>    <int>
#>  1: moderate symptoms or light procedure        0     1     0      1        0
#>  2: moderate symptoms or light procedure        0     1     1      1        0
#>  3: moderate symptoms or light procedure        0     1     1      0        0
#>  4: moderate symptoms or light procedure        0     1     0      0        0
#>  5: moderate symptoms or light procedure        0     1     1      0        0
#>  6: moderate symptoms or light procedure        0     1     1      1        0
#>  7: moderate symptoms or light procedure        0     0     1      1        0
#>  8: moderate symptoms or light procedure        0     1     1      1        0
#>  9: moderate symptoms or light procedure        0     1     0      1        0
#> 10: moderate symptoms or light procedure        0     0     0      1        0
#>     death
#>     <int>
#>  1:     0
#>  2:     0
#>  3:     0
#>  4:     0
#>  5:     0
#>  6:     0
#>  7:     0
#>  8:     0
#>  9:     0
#> 10:     0

Here we see a partial view of the records for a single patient with id 01KTl0KSK88EFV8N. The variables t1 and t2 represent the time interval for the record, which is discussed further in the section on the structure of panel data. The patient’s age at diagnosis, sex, geographic region, baseline condition, and diabetes status are provided. The variables ace (ACE inhibitor), bb (beta blocker), and statin record the patient’s possession of common prescription medications for CHD. The hospital variable records admissions to the hospital, and the death variable identifies cases of mortality. The medications, hospital status, and mortality of the patient can change over time. These will also be discussed further.

The simulated data include records on many patients. For instance, a portion of the records for several patients is shown below:

simulated.chd[58:70,]
#>                   id    t1    t2   age    sex    region
#>               <char> <int> <int> <int> <char>    <char>
#>  1: 01ZbYuUoYJeIyiVH     0     1    61   Male Northeast
#>  2: 01ZbYuUoYJeIyiVH     1    10    61   Male Northeast
#>  3: 01ZbYuUoYJeIyiVH    10    30    61   Male Northeast
#>  4: 01ZbYuUoYJeIyiVH    30    31    61   Male Northeast
#>  5: 01ZbYuUoYJeIyiVH    31    43    61   Male Northeast
#>  6: 01ZbYuUoYJeIyiVH    43    70    61   Male Northeast
#>  7: 01ZbYuUoYJeIyiVH    70    70    61   Male Northeast
#>  8: 01oLxxu87rRDCIvo     0    34    63   Male   Midwest
#>  9: 01oLxxu87rRDCIvo    34    61    63   Male   Midwest
#> 10: 01oLxxu87rRDCIvo    61    64    63   Male   Midwest
#> 11: 01oLxxu87rRDCIvo    64    64    63   Male   Midwest
#> 12: 01rOm5qEH4GLCiL5     0     1    62   Male   Midwest
#> 13: 01rOm5qEH4GLCiL5     1    90    62   Male   Midwest
#>                       baseline.condition diabetes   ace    bb statin hospital
#>                                   <char>    <int> <int> <int>  <int>    <int>
#>  1:      Major heart attack or operation        0     0     0      0        0
#>  2:      Major heart attack or operation        0     1     0      0        0
#>  3:      Major heart attack or operation        0     1     1      0        0
#>  4:      Major heart attack or operation        0     1     1      1        0
#>  5:      Major heart attack or operation        0     0     1      1        0
#>  6:      Major heart attack or operation        0     0     1      1        1
#>  7:      Major heart attack or operation        0     0     1      1        1
#>  8: moderate symptoms or light procedure        0     1     1      1        0
#>  9: moderate symptoms or light procedure        0     1     1      1        0
#> 10: moderate symptoms or light procedure        0     1     1      1        0
#> 11: moderate symptoms or light procedure        0     0     0      0        0
#> 12:      Major heart attack or operation        0     1     1      1        0
#> 13:      Major heart attack or operation        0     1     1      1        0
#>     death
#>     <int>
#>  1:     0
#>  2:     0
#>  3:     0
#>  4:     0
#>  5:     0
#>  6:     0
#>  7:     1
#>  8:     0
#>  9:     0
#> 10:     0
#> 11:     0
#> 12:     0
#> 13:     0

Structure of Panel Data

We can now more carefully define the elements of panel data. Some necessary variables include the:

  • subject identifier: This uniquely identifies a subject so that records across multiple rows can be linked.

  • time interval: This records a period of time [t1, t2) during which the record is observed. In particular, panel data assumes that a) the values of the record take effect at time t1, and b) the values remain constant from time t1 to time t2. It is important for the time intervals in a subject’s different rows to be mutually exclusive, so that no two rows for the same subject overlap.

  • constant variables: These values appear in a patient’s records but cannot change. For instance, a patient’s age at the time of the first diagnosis of CHD would not vary across the records in follow-up. Important baseline factors, such as a history of comorbid medical conditions, might also be included as constant variables.

  • time-varying variables: These values can change over time. Records of a patient’s weight, laboratory tests, medication usage, and hospitalization status are all examples of time-varying variables. Outcomes such as medication adherence or the costs of hospitalization are often studied in their own right. An interval works well for states that persist, such as a period during which a patient is adherent to a medication or the duration of a hospital admission. Acute events such as a heart attack, however, do not last long. Ideally a panel would be structured to update the record shortly after such an event. In some cases, interpretation of the panel is required. For instance, a lengthy interval that begins with a heart attack should not be read as an event that lasted for the entire interval. Likewise, if an event such as mortality is observed during a lengthy interval, the panel should be restructured with a new record marking the death after that previous interval. A practitioner should therefore take care to interpret the events recorded in panel data properly and to check their quality.
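
The elements listed above can be illustrated on a small made-up panel. The sketch below uses only data.table; the table, its values, and the helper columns prev.t2, overlap, and gap are hypothetical and are not part of tvtools. It flags rows whose interval overlaps the previous row for the same subject, rows that leave a gap in follow-up, and confirms that a constant variable takes a single value per subject.

```r
library(data.table)

# Toy panel (made-up values): two subjects with half-open intervals [t1, t2).
panel <- data.table(
  id     = c("A", "A", "A", "B", "B"),
  t1     = c(0L, 8L, 30L, 0L, 20L),
  t2     = c(8L, 30L, 38L, 15L, 40L),
  age    = c(69L, 69L, 69L, 61L, 61L),  # constant within each subject
  statin = c(1L, 1L, 0L, 0L, 1L)        # time-varying
)

setorderv(panel, c("id", "t1"))

# End time of the previous row within each subject (NA on a subject's first row).
panel[, prev.t2 := shift(t2), by = id]

# Overlap: a row starts before the previous row ended.
# Gap: a row starts after the previous row ended.
panel[, overlap := !is.na(prev.t2) & t1 < prev.t2]
panel[, gap     := !is.na(prev.t2) & t1 > prev.t2]

# A constant variable should take exactly one value per subject.
constant.check <- panel[, .(n.age.values = uniqueN(age)), by = id]

print(panel)
print(constant.check)
```

In this toy table, subject B is unobserved between day 15 and day 20, so the second row for B is flagged as a gap, and no row is flagged as an overlap.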

To better examine these issues, we will consider the first few rows of the simulated.chd data:

simulated.chd[1:3,]
#>                  id    t1    t2   age    sex region
#>              <char> <int> <int> <int> <char> <char>
#> 1: 01KTl0KSK88EFV8N     0     8    69   Male   West
#> 2: 01KTl0KSK88EFV8N     8    30    69   Male   West
#> 3: 01KTl0KSK88EFV8N    30    38    69   Male   West
#>                      baseline.condition diabetes   ace    bb statin hospital
#>                                  <char>    <int> <int> <int>  <int>    <int>
#> 1: moderate symptoms or light procedure        0     1     0      1        0
#> 2: moderate symptoms or light procedure        0     1     1      1        0
#> 3: moderate symptoms or light procedure        0     1     1      0        0
#>    death
#>    <int>
#> 1:     0
#> 2:     0
#> 3:     0

This illustrates a short period of follow-up for a patient. The first row begins with t1 = 0, the moment of the patient’s initial diagnosis of CHD. The patient is 69 years old, male, and living in the West. CHD was diagnosed from a baseline condition that included moderate symptoms or a light procedure. The patient did not have diabetes. At the beginning of the interval (t1 = 0), the patient possessed an ACE inhibitor (ace = 1) and a statin (statin = 1) but not a beta blocker (bb = 0). The patient was also not admitted to the hospital (hospital = 0) and was alive (death = 0). This state of affairs was assumed to persist for 8 days, until the end of the first interval (t2 = 8).

Then a new record (the second row) was entered. It is certainly possible for an update to include no changes to the time-varying records. However, an efficient panel structure would only generate new records when updates occur. In this case, the patient filled a prescription for a beta blocker (bb = 1) at time t1 = 8 days. The patient was then on all three medications (ace = 1, bb = 1, statin = 1), with no hospitalizations (hospital = 0), and remained alive (death = 0). A third row was generated at day t1 = 30. In this case, the patient no longer possessed a statin medication (statin = 0), while the other factors from the previous row remained fixed. This record was maintained for 8 days (until t2 = 38).

Generalizing to all of the records for a single patient, the panel presents a historical record of the patient’s condition. The full set of panel data then presents the recorded histories for all of the patients. Each patient is followed from the moment of diagnosis until death or loss to follow-up.
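
To make this walk-through concrete, the sketch below transcribes the interval and medication columns of these three rows into a small table and then computes the length of each interval and the medications that changed relative to the previous row. It relies only on data.table and base R; the names duration, changed, and changed.meds are introduced here for illustration.

```r
library(data.table)

# The three rows discussed above (interval and medication columns only).
rows <- data.table(
  t1     = c(0L, 8L, 30L),
  t2     = c(8L, 30L, 38L),
  ace    = c(1L, 1L, 1L),
  bb     = c(0L, 1L, 1L),
  statin = c(1L, 1L, 0L)
)

# Length of each interval in days.
rows[, duration := t2 - t1]

# Compare each row's medication values with those of the previous row.
meds <- c("ace", "bb", "statin")
m <- as.matrix(rows[, meds, with = FALSE])

changed.meds <- c(
  "(first record)",
  vapply(2:nrow(m), function(i) {
    paste(meds[m[i, ] != m[i - 1L, ]], collapse = ", ")
  }, character(1))
)
rows[, changed := changed.meds]

print(rows)
```

The durations are 8, 22, and 8 days, and the changed column reports bb for the second row and statin for the third, matching the narrative.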

Sorting

For most applications, keeping panel data in sorted order simplifies the subsequent analyses. The structure.panel method sorts the data by the subject’s identifier and the beginning of each time interval.

simulated.chd <- structure.panel(dat = simulated.chd, id.name = "id", t1.name = "t1")

simulated.chd[1:3,]
#>                  id    t1    t2   age    sex region
#>              <char> <int> <int> <int> <char> <char>
#> 1: 01KTl0KSK88EFV8N     0     8    69   Male   West
#> 2: 01KTl0KSK88EFV8N     8    30    69   Male   West
#> 3: 01KTl0KSK88EFV8N    30    38    69   Male   West
#>                      baseline.condition diabetes   ace    bb statin hospital
#>                                  <char>    <int> <int> <int>  <int>    <int>
#> 1: moderate symptoms or light procedure        0     1     0      1        0
#> 2: moderate symptoms or light procedure        0     1     1      1        0
#> 3: moderate symptoms or light procedure        0     1     1      0        0
#>    death
#>    <int>
#> 1:     0
#> 2:     0
#> 3:     0

As an additional example, we’ll show how an unsorted panel data set can be reordered:

structure.panel(dat = simulated.chd[c(2,4,3,1),])
#>                  id    t1    t2   age    sex region
#>              <char> <int> <int> <int> <char> <char>
#> 1: 01KTl0KSK88EFV8N     0     8    69   Male   West
#> 2: 01KTl0KSK88EFV8N     8    30    69   Male   West
#> 3: 01KTl0KSK88EFV8N    30    38    69   Male   West
#> 4: 01KTl0KSK88EFV8N    38    46    69   Male   West
#>                      baseline.condition diabetes   ace    bb statin hospital
#>                                  <char>    <int> <int> <int>  <int>    <int>
#> 1: moderate symptoms or light procedure        0     1     0      1        0
#> 2: moderate symptoms or light procedure        0     1     1      1        0
#> 3: moderate symptoms or light procedure        0     1     1      0        0
#> 4: moderate symptoms or light procedure        0     1     0      0        0
#>    death
#>    <int>
#> 1:     0
#> 2:     0
#> 3:     0
#> 4:     0
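
For comparison, a similar ordering can be obtained directly with data.table. The sketch below is an assumption about the intended result (rows ordered by identifier, then by interval start) and not a description of how structure.panel is implemented; the shuffled table is made up.

```r
library(data.table)

# Made-up panel rows in arbitrary order.
shuffled <- data.table(
  id = c("B", "A", "B", "A"),
  t1 = c(10L, 8L, 0L, 0L),
  t2 = c(25L, 30L, 10L, 8L)
)

# setorderv() reorders by reference, so work on a copy to keep the original.
sorted <- copy(shuffled)
setorderv(sorted, cols = c("id", "t1"))

print(sorted)
```

Because setorderv modifies its argument in place, copying first keeps the unsorted table available for comparison.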

Methods

The tvtools package is designed to facilitate a range of methods to explore and analyze panel data. These include summarization techniques, methods of calculation, and quality checks.

Summarization

The summarize.panel function is designed to provide a simple summary of a panel data structure. The column name for the subject’s unique identifier is specified to calculate the number of subjects and the mean records per subject. The column names for the time intervals help to gain a sense of the amount of follow-up time observed in the data.

summarize.panel(dat = simulated.chd, id.name = "id", t1.name = "t1", t2.name = "t2")
#>    total.records unique.ids mean.records.per.id total.followup max.followup
#>            <int>      <int>               <num>          <int>        <int>
#> 1:         33572       1000              33.572         722772         2606

This summary can also be produced in subgroups by specifying one or more categorical grouping variables:

summarize.panel(dat = simulated.chd, id.name = "id", t1.name = "t1", t2.name = "t2", grouping.variables = "sex")
#> Key: <sex>
#>       sex total.records unique.ids mean.records.per.id total.followup
#>    <char>         <int>      <int>               <num>          <int>
#> 1: Female         16592        487            34.06982         358387
#> 2:   Male         16980        513            33.09942         364385
#>    max.followup
#>           <int>
#> 1:         2606
#> 2:         2602

summarize.panel(dat = simulated.chd, id.name = "id", t1.name = "t1", t2.name = "t2", grouping.variables = c("sex", "region"))
#> Key: <sex, region>
#>       sex    region total.records unique.ids mean.records.per.id total.followup
#>    <char>    <char>         <int>      <int>               <num>          <int>
#> 1: Female   Midwest          3063         93            32.93548          68463
#> 2: Female Northeast          5237        134            39.08209         108336
#> 3: Female     South          2796         93            30.06452          61332
#> 4: Female      West          5496        167            32.91018         120256
#> 5:   Male   Midwest          3108         98            31.71429          68222
#> 6:   Male Northeast          5074        160            31.71250         112672
#> 7:   Male     South          3699        103            35.91262          76110
#> 8:   Male      West          5099        152            33.54605         107381
#>    max.followup
#>           <int>
#> 1:         2606
#> 2:         2601
#> 3:         2567
#> 4:         2496
#> 5:         2564
#> 6:         2498
#> 7:         2569
#> 8:         2602
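
The columns of this summary can be read as simple aggregates of the panel. The sketch below reproduces them on a made-up table with data.table, assuming that total.followup is the sum of the interval lengths (t2 - t1) and that max.followup is the largest t2; the helper summary.row is hypothetical, and the actual implementation of summarize.panel may differ.

```r
library(data.table)

# Made-up panel with one grouping variable.
panel <- data.table(
  id  = c("A", "A", "A", "B", "B"),
  sex = c("Male", "Male", "Male", "Female", "Female"),
  t1  = c(0L, 8L, 30L, 0L, 15L),
  t2  = c(8L, 30L, 38L, 15L, 40L)
)

# Hypothetical helper: one summary row for a set of panel records.
summary.row <- function(d) {
  data.table(
    total.records       = nrow(d),
    unique.ids          = uniqueN(d$id),
    mean.records.per.id = nrow(d) / uniqueN(d$id),
    total.followup      = sum(d$t2 - d$t1),  # assumed definition
    max.followup        = max(d$t2)          # assumed definition
  )
}

overall <- summary.row(panel)
by.sex  <- panel[, summary.row(.SD), by = sex, .SDcols = c("id", "t1", "t2")]

print(overall)
print(by.sex)
```

The same relationship holds in the output above for the full data, where 33572 records over 1000 subjects give a mean of 33.572 records per subject.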

Follow-Up Time Calculations

The length of follow-up time can be a critical factor in the study’s analytical judgments and selected methods. In some applications, one might choose to only include patients who completed at least 1 year of observation or to ensure that the median length of follow-up is sufficient for the goals of the study.

The followup.time function calculates the length of observation for each patient. This may be performed in two separate ways:

  • Max Follow-Up: Calculate the last observed time for each subject.

  • Total Follow-Up: Calculate the overall amount of observed time for each subject. This excludes any missing time intervals and accommodates records that begin at a reference point other than time zero.

On the simulated.chd data, we can calculate the maximum follow-up time for each subject:

followup.time(dat = simulated.chd, id.name = "id", t2.name = "t2", calculate.as = "max")
#>                     id followup.time
#>                 <char>         <int>
#>    1: 01KTl0KSK88EFV8N          1075
#>    2: 01ZbYuUoYJeIyiVH            70
#>    3: 01oLxxu87rRDCIvo            64
#>    4: 01rOm5qEH4GLCiL5          2389
#>    5: 021eg6OjCoGotbXK            95
#>   ---                               
#>  996: zm1gtU2uw866RDGy           127
#>  997: zqMNWR16s2XrWiYJ          1463
#>  998: zs7NTtHeWTecHvxS           813
#>  999: ztSPQ3OMBA2CzgSp           255
#> 1000: zxOx9moOQBqSiKq2            63

Switching to the total follow-up leads to the same results:

followup.time(dat = simulated.chd, id.name = "id", t1.name = "t1", t2.name = "t2", calculate.as = "total")
#>                     id followup.time
#>                 <char>         <int>
#>    1: 01KTl0KSK88EFV8N          1075
#>    2: 01ZbYuUoYJeIyiVH            70
#>    3: 01oLxxu87rRDCIvo            64
#>    4: 01rOm5qEH4GLCiL5          2389
#>    5: 021eg6OjCoGotbXK            95
#>   ---                               
#>  996: zm1gtU2uw866RDGy           127
#>  997: zqMNWR16s2XrWiYJ          1463
#>  998: zs7NTtHeWTecHvxS           813
#>  999: ztSPQ3OMBA2CzgSp           255
#> 1000: zxOx9moOQBqSiKq2            63

This is true because the simulated.chd data begins at a baseline of t1 = 0 for each patient and does not include any missing time intervals over the length of any patient’s observation.
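
The two calculations diverge as soon as either condition fails. The sketch below uses a made-up subject whose records begin at day 10 and skip days 30 to 50, and computes both quantities with data.table, taking the maximum as the largest t2 and the total as the sum of interval lengths (assumed definitions consistent with the descriptions above).

```r
library(data.table)

# Hypothetical subject: observation starts at day 10, with no records for days 30-50.
gappy <- data.table(
  id = c("A", "A", "A"),
  t1 = c(10L, 20L, 50L),
  t2 = c(20L, 30L, 60L)
)

followup <- gappy[, .(
  max.followup   = max(t2),      # last observed time
  total.followup = sum(t2 - t1)  # time actually under observation
), by = id]

print(followup)
```

Here the last observed time is day 60, but only 30 days were actually observed.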

The followup.time method can be applied to all or a subset of a single subject’s records. Let’s consider the case of one patient from the simulated.chd data:

followup.time(dat = simulated.chd[id == id[1],], id.name = "id", t1.name = "t1", t2.name = "t2", calculate.as = "total")
#>                  id followup.time
#>              <char>         <int>
#> 1: 01KTl0KSK88EFV8N          1075
followup.time(dat = simulated.chd[id == id[1],][5:20,], id.name = "id", t1.name = "t1", t2.name = "t2", calculate.as = "total")
#>                  id followup.time
#>              <char>         <int>
#> 1: 01KTl0KSK88EFV8N           195
followup.time(dat = simulated.chd[id == id[1],][5:20,], id.name = "id", t1.name = "t1", t2.name = "t2", calculate.as = "max")
#>                  id followup.time
#>              <char>         <int>
#> 1: 01KTl0KSK88EFV8N           241

These calculations show that the patient was followed for a total of 1075 days. The subject’s 5th through 20th records encompass a total of 195 days, with day 241 as the latest time in this period.

The followup times can also be appended to the original data set with a user-selected name for the new column:

followup.time(dat = simulated.chd, id.name = "id", t2.name = "t2", calculate.as = "max", append.to.data = TRUE, followup.name = "followup.time")
print(simulated.chd[1:5,])
#>                  id    t1    t2   age    sex region
#>              <char> <int> <int> <int> <char> <char>
#> 1: 01KTl0KSK88EFV8N     0     8    69   Male   West
#> 2: 01KTl0KSK88EFV8N     8    30    69   Male   West
#> 3: 01KTl0KSK88EFV8N    30    38    69   Male   West
#> 4: 01KTl0KSK88EFV8N    38    46    69   Male   West
#> 5: 01KTl0KSK88EFV8N    46    66    69   Male   West
#>                      baseline.condition diabetes   ace    bb statin hospital
#>                                  <char>    <int> <int> <int>  <int>    <int>
#> 1: moderate symptoms or light procedure        0     1     0      1        0
#> 2: moderate symptoms or light procedure        0     1     1      1        0
#> 3: moderate symptoms or light procedure        0     1     1      0        0
#> 4: moderate symptoms or light procedure        0     1     0      0        0
#> 5: moderate symptoms or light procedure        0     1     1      0        0
#>    death followup.time
#>    <int>         <int>
#> 1:     0          1075
#> 2:     0          1075
#> 3:     0          1075
#> 4:     0          1075
#> 5:     0          1075
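
Note that the appended value is repeated on every row for a subject. A grouped assignment in data.table produces the same shape of result; the sketch below, on a made-up table, is only an illustration of that idea and not necessarily how followup.time is implemented.

```r
library(data.table)

# Made-up panel: two rows for subject A, one for subject B.
panel <- data.table(
  id = c("A", "A", "B"),
  t1 = c(0L, 8L, 0L),
  t2 = c(8L, 30L, 15L)
)

# := adds the column by reference; by = id repeats each subject's
# maximum t2 on all of that subject's rows.
panel[, followup.time := max(t2), by = id]

print(panel)
```

Because := modifies the table in place, no reassignment is needed to see the new column.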

Measuring the Time to Events

Outcome variables such as survival times may be calculated from panel data by identifying the time of an event. The first.event method is designed to facilitate these calculations on a collection of outcome variables. By specifying the identifier, we can perform the calculation separately on each subject in the data set. In the first example, we calculate the initiation times of the three medicines: the times at which a prescription for each medicine was first filled by the patient.

first.event(dat = simulated.chd, id.name = "id", outcome.names = c("ace", "bb", "statin"), t1.name = "t1")
#>                     id ace.first.event bb.first.event statin.first.event
#>                 <char>           <int>          <int>              <int>
#>    1: 01KTl0KSK88EFV8N               0              8                  0
#>    2: 01ZbYuUoYJeIyiVH               1             10                 30
#>    3: 01oLxxu87rRDCIvo               0              0                  0
#>    4: 01rOm5qEH4GLCiL5               0              0                  0
#>    5: 021eg6OjCoGotbXK               1             33                  0
#>   ---                                                                   
#>  996: zm1gtU2uw866RDGy               0              0                  1
#>  997: zqMNWR16s2XrWiYJ             220              0                  0
#>  998: zs7NTtHeWTecHvxS               0              7                  0
#>  999: ztSPQ3OMBA2CzgSp              73             35                  4
#> 1000: zxOx9moOQBqSiKq2               0              0                  1

Likewise, we can calculate the time to a first hospitalization or mortality. Note that NA values are displayed for patients who were not hospitalized and also for those who survived for the period of follow-up.

first.event(dat = simulated.chd, id.name = "id", outcome.names = c("hospital", "death"), t1.name = "t1")
#>                     id hospital.first.event death.first.event
#>                 <char>                <int>             <int>
#>    1: 01KTl0KSK88EFV8N                   NA                NA
#>    2: 01ZbYuUoYJeIyiVH                   43                70
#>    3: 01oLxxu87rRDCIvo                   NA                NA
#>    4: 01rOm5qEH4GLCiL5                  100                NA
#>    5: 021eg6OjCoGotbXK                   NA                NA
#>   ---                                                        
#>  996: zm1gtU2uw866RDGy                   NA                NA
#>  997: zqMNWR16s2XrWiYJ                    0                NA
#>  998: zs7NTtHeWTecHvxS                   NA                NA
#>  999: ztSPQ3OMBA2CzgSp                  255                NA
#> 1000: zxOx9moOQBqSiKq2                   NA                NA
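
The logic behind these values can be sketched with data.table on a made-up panel: for each subject, take the smallest t1 among the rows where the outcome indicator equals 1, and return NA when no such row exists. The helper first.time is hypothetical and is shown only to illustrate the calculation, not the package's internals.

```r
library(data.table)

# Made-up panel: subject A is hospitalized at day 8 and dies at day 30;
# subject B experiences neither event.
panel <- data.table(
  id       = c("A", "A", "A", "B", "B"),
  t1       = c(0L, 8L, 30L, 0L, 15L),
  t2       = c(8L, 30L, 38L, 15L, 40L),
  hospital = c(0L, 1L, 1L, 0L, 0L),
  death    = c(0L, 0L, 1L, 0L, 0L)
)

# Hypothetical helper: earliest start time among rows where the
# indicator equals 1, or NA if the event never occurs.
first.time <- function(time, indicator) {
  hits <- time[indicator == 1L]
  if (length(hits) == 0L) NA_integer_ else min(hits)
}

events <- panel[, .(
  hospital.first.event = first.time(t1, hospital),
  death.first.event    = first.time(t1, death)
), by = id]

print(events)
```

Subject B receives NA for both outcomes, which mirrors the NA values shown above for patients with no hospitalization or death during follow-up.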

These times to a first event can also be calculated on the entire population by setting id.name = NULL:

##first.event(dat = simulated.chd, id.name = NULL, outcome.names = c("hospital", "death"), t1.name = "t1")

These calculated quantities can also be appended to the data set:

one.patient <- first.event(dat = simulated.chd[id == "01ZbYuUoYJeIyiVH",], id.name = "id", outcome.names = c("hospital", "death"), t1.name = "t1", append.to.table = TRUE, event.name = "time")
setorderv(x = one.patient, cols = c("id", "t1"))
print(one.patient)
#> Key: <id>
#>                  id    t1    t2   age    sex    region
#>              <char> <int> <int> <int> <char>    <char>
#> 1: 01ZbYuUoYJeIyiVH     0     1    61   Male Northeast
#> 2: 01ZbYuUoYJeIyiVH     1    10    61   Male Northeast
#> 3: 01ZbYuUoYJeIyiVH    10    30    61   Male Northeast
#> 4: 01ZbYuUoYJeIyiVH    30    31    61   Male Northeast
#> 5: 01ZbYuUoYJeIyiVH    31    43    61   Male Northeast
#> 6: 01ZbYuUoYJeIyiVH    43    70    61   Male Northeast
#> 7: 01ZbYuUoYJeIyiVH    70    70    61   Male Northeast
#>                 baseline.condition diabetes   ace    bb statin hospital death
#>                             <char>    <int> <int> <int>  <int>    <int> <int>
#> 1: Major heart attack or operation        0     0     0      0        0     0
#> 2: Major heart attack or operation        0     1     0      0        0     0
#> 3: Major heart attack or operation        0     1     1      0        0     0
#> 4: Major heart attack or operation        0     1     1      1        0     0
#> 5: Major heart attack or operation        0     0     1      1        0     0
#> 6: Major heart attack or operation        0     0     1      1        1     0
#> 7: Major heart attack or operation        0     0     1      1        1     1
#>    followup.time hospital.time death.time
#>            <int>         <int>      <int>
#> 1:            70            43         70
#> 2:            70            43         70
#> 3:            70            43         70
#> 4:            70            43         70
#> 5:            70            43         70
#> 6:            70            43         70
#> 7:            70            43         70

The last.event method takes similar inputs and finds the last time at which an event occurs:

last.event(dat = simulated.chd, id.name = "id", outcome.names = c("hospital", "death"), t1.name = "t1")[1:5,]
#>                  id hospital.last.event death.last.event
#>              <char>               <int>            <int>
#> 1: 01KTl0KSK88EFV8N                  NA               NA
#> 2: 01ZbYuUoYJeIyiVH                  70               70
#> 3: 01oLxxu87rRDCIvo                  NA               NA
#> 4: 01rOm5qEH4GLCiL5                2039               NA
#> 5: 021eg6OjCoGotbXK                  NA               NA

If the end of an interval is preferred, the t2 column may be substituted:

last.event(dat = simulated.chd, id.name = NULL, outcome.names = c("hospital", "death"), t1.name = "t2")[1:5,]
#>    hospital.last.event death.last.event
#>                  <int>            <int>
#> 1:                2606             2564
#> 2:                  NA               NA
#> 3:                  NA               NA
#> 4:                  NA               NA
#> 5:                  NA               NA

The last.event method is especially helpful when looking across the sample for the latest events:

last.event(dat = simulated.chd, id.name = NULL, outcome.names = c("hospital", "death"), t1.name = "t1")
#>    hospital.last.event death.last.event
#>                  <int>            <int>
#> 1:                2606             2564

Cross-Sectional Data

Panel data differs from more standard data in terms of its structure and variability of longitudinal observation. Being able to convert the panel to a more traditional form can facilitate a range of analyses. In order to do so, we must consider the:

  • Baseline Factors: These would be measurements recorded at the time of the study’s baseline.

  • Outcomes: These would measure the time to a subject’s first event relative to the baseline.

The cross.sectional.data method converts panel data into this standard form, with one row per subject. Baseline measurements are recorded as of the specified time, while outcomes are measured as the time to the first occurrence of the event (or NA if not observed). The subject’s overall length of follow-up is also calculated to enable survival analyses of censored data. We can specify the time.point as 0 to conduct the study from the beginning of the period of observation:

simulated.chd[, followup.time := NULL]
baseline <- cross.sectional.data(dat = simulated.chd, time.point = 0, id.name = "id", t1.name = "t1", t2.name = "t2", outcome.names = c("hospital", "death"))
baseline[1,]
#> Key: <id>
#>                  id   age    sex region                   baseline.condition
#>              <char> <int> <char> <char>                               <char>
#> 1: 01KTl0KSK88EFV8N    69   Male   West moderate symptoms or light procedure
#>    diabetes   ace    bb statin hospital.first.event death.first.event
#>       <int> <int> <int>  <int>                <int>             <int>
#> 1:        0     1     0      1                   NA                NA
#>    followup.time cross.sectional.time
#>            <int>                <num>
#> 1:          1075                    0
baseline[hospital.first.event > 0,][1:2,]
#> Key: <id>
#>                  id   age    sex    region              baseline.condition
#>              <char> <int> <char>    <char>                          <char>
#> 1: 01ZbYuUoYJeIyiVH    61   Male Northeast Major heart attack or operation
#> 2: 01rOm5qEH4GLCiL5    62   Male   Midwest Major heart attack or operation
#>    diabetes   ace    bb statin hospital.first.event death.first.event
#>       <int> <int> <int>  <int>                <int>             <int>
#> 1:        0     0     0      0                   43                70
#> 2:        0     1     1      1                  100                NA
#>    followup.time cross.sectional.time
#>            <int>                <num>
#> 1:            70                    0
#> 2:          2389                    0
baseline[death.first.event > 0,][1:2,]
#> Key: <id>
#>                  id   age    sex    region              baseline.condition
#>              <char> <int> <char>    <char>                          <char>
#> 1: 01ZbYuUoYJeIyiVH    61   Male Northeast Major heart attack or operation
#> 2: 0bt4Duak3aWCPO7E    71   Male   Midwest Major heart attack or operation
#>    diabetes   ace    bb statin hospital.first.event death.first.event
#>       <int> <int> <int>  <int>                <int>             <int>
#> 1:        0     0     0      0                   43                70
#> 2:        0     0     0      1                    0              2564
#>    followup.time cross.sectional.time
#>            <int>                <num>
#> 1:            70                    0
#> 2:          2564                    0

The create.baseline method is a thin wrapper around cross.sectional.data that fixes the time point at 0:

baseline.2 <- create.baseline(dat = simulated.chd, id.name = "id", t1.name = "t1", t2.name = "t2", outcome.names = c("hospital", "death"))
baseline.2[hospital.first.event > 0,][1:2,]
#> Key: <id>
#>                  id   age    sex    region              baseline.condition
#>              <char> <int> <char>    <char>                          <char>
#> 1: 01ZbYuUoYJeIyiVH    61   Male Northeast Major heart attack or operation
#> 2: 01rOm5qEH4GLCiL5    62   Male   Midwest Major heart attack or operation
#>    diabetes   ace    bb statin hospital.first.event death.first.event
#>       <int> <int> <int>  <int>                <int>             <int>
#> 1:        0     0     0      0                   43                70
#> 2:        0     1     1      1                  100                NA
#>    followup.time cross.sectional.time
#>            <int>                <num>
#> 1:            70                    0
#> 2:          2389                    0
baseline.2[death.first.event > 0,][1:2,]
#> Key: <id>
#>                  id   age    sex    region              baseline.condition
#>              <char> <int> <char>    <char>                          <char>
#> 1: 01ZbYuUoYJeIyiVH    61   Male Northeast Major heart attack or operation
#> 2: 0bt4Duak3aWCPO7E    71   Male   Midwest Major heart attack or operation
#>    diabetes   ace    bb statin hospital.first.event death.first.event
#>       <int> <int> <int>  <int>                <int>             <int>
#> 1:        0     0     0      0                   43                70
#> 2:        0     0     0      1                    0              2564
#>    followup.time cross.sectional.time
#>            <int>                <num>
#> 1:            70                    0
#> 2:          2564                    0

A cross-sectional data set can also be produced at later times. In these settings, the time to the first event only includes events that occur at or after the cross-sectional time point. By specifying relative.followup = FALSE, the event times and length of follow-up are recorded in absolute terms (relative to time zero rather than the cross-sectional baseline).

cs.365 <- cross.sectional.data(dat = simulated.chd, time.point = 365, id.name = "id", t1.name = "t1", t2.name = "t2", outcome.names = c("hospital", "death"), relative.followup = FALSE)
cs.365[1:2,]
#> Key: <id>
#>                  id   age    sex  region                   baseline.condition
#>              <char> <int> <char>  <char>                               <char>
#> 1: 01KTl0KSK88EFV8N    69   Male    West moderate symptoms or light procedure
#> 2: 01rOm5qEH4GLCiL5    62   Male Midwest      Major heart attack or operation
#>    diabetes   ace    bb statin hospital.first.event death.first.event
#>       <int> <int> <int>  <int>                <int>             <int>
#> 1:        0     1     1      1                   NA                NA
#> 2:        0     1     1      1                  561                NA
#>    followup.time cross.sectional.time
#>            <int>                <num>
#> 1:          1075                  365
#> 2:          2389                  365

Alternatively, we can specify relative.followup = TRUE to measure the event and follow-up times from the cross-sectional baseline rather than from time zero.

cs.365.relative <- cross.sectional.data(dat = simulated.chd, time.point = 365, id.name = "id", t1.name = "t1", t2.name = "t2", outcome.names = c("hospital", "death"), relative.followup = TRUE)
cs.365.relative[1:2,]
#> Key: <id>
#>                  id   age    sex  region                   baseline.condition
#>              <char> <int> <char>  <char>                               <char>
#> 1: 01KTl0KSK88EFV8N    69   Male    West moderate symptoms or light procedure
#> 2: 01rOm5qEH4GLCiL5    62   Male Midwest      Major heart attack or operation
#>    diabetes   ace    bb statin hospital.first.event death.first.event
#>       <int> <int> <int>  <int>                <num>             <num>
#> 1:        0     1     1      1                   NA                NA
#> 2:        0     1     1      1                  196                NA
#>    followup.time cross.sectional.time
#>            <num>                <num>
#> 1:           710                  365
#> 2:          2024                  365
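Comparing the two outputs suggests that the relative times are the absolute times shifted by the cross-sectional time point. The sketch below reproduces that shift by hand for the two rows shown; the values are copied from the printed output, and the data frame is a stand-in for illustration rather than a call into tvtools.

```r
# Values copied from the cs.365 output (relative.followup = FALSE).
time.point <- 365
absolute <- data.frame(
  id = c("01KTl0KSK88EFV8N", "01rOm5qEH4GLCiL5"),
  hospital.first.event = c(NA, 561),
  followup.time = c(1075, 2389)
)

# Shift each time so that the cross-sectional baseline becomes time zero.
# Missing event times (no event observed) stay missing.
relative <- absolute
relative$hospital.first.event <- absolute$hospital.first.event - time.point
relative$followup.time <- absolute$followup.time - time.point
relative
```

The shifted values (196, 710, and 2024) agree with the cs.365.relative output.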

It is also possible to create a purely cross-sectional data set with no outcome measurements by setting outcome.names = NULL. In that case, every variable, including hospital and death, is reported as its value at the cross-sectional baseline.

cs.no.outcomes <- create.baseline(dat = simulated.chd, id.name = "id", t1.name = "t1", t2.name = "t2", outcome.names = NULL)
cs.no.outcomes[1:2,]
#>                  id   age    sex    region                   baseline.condition
#>              <char> <int> <char>    <char>                               <char>
#> 1: 01KTl0KSK88EFV8N    69   Male      West moderate symptoms or light procedure
#> 2: 01ZbYuUoYJeIyiVH    61   Male Northeast      Major heart attack or operation
#>    diabetes   ace    bb statin hospital death cross.sectional.time
#>       <int> <int> <int>  <int>    <int> <int>                <num>
#> 1:        0     1     0      1        0     0                    0
#> 2:        0     0     0      0        0     0                    0

As a reminder, summary statistics like the mean age at diagnosis should be calculated based on one row per patient. Otherwise, each patient’s contribution to the mean would be weighted by that patient’s number of rows in the panel data. As a point of comparison, consider the mean age in the baseline data versus the mean age in the panel data:

baseline[, mean(age)]
#> [1] 64.982
simulated.chd[, mean(age)]
#> [1] 64.81908
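The weighting effect is easier to see on a tiny example. The sketch below uses a made-up two-patient panel (the toy data frame and its values are assumptions for illustration only) and base R rather than tvtools.

```r
# Hypothetical panel: patient A has four rows, patient B has one.
toy <- data.frame(
  id  = c("A", "A", "A", "A", "B"),
  age = c(50, 50, 50, 50, 80)
)

# Mean over all rows: patient A counts four times.
row.weighted.mean <- mean(toy$age)

# Mean over one row per patient: each patient counts once.
one.row.per.id <- toy[!duplicated(toy$id), ]
patient.mean <- mean(one.row.per.id$age)

c(row.weighted = row.weighted.mean, per.patient = patient.mean)
```

Here the row-weighted mean is 56 while the per-patient mean is 65, because the younger patient contributes more rows.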

Calculating Utilization

Time-varying factors with binary measurements can include intermittent periods of utilization. Calculating how much or how often a medication is used – or the amount of time spent in the hospital – can be the basis for studying the effect or cost of an intervention. The calculate.utilization method facilitates calculations of the total amount or proportion of time that a binary outcome variable is in effect. This calculation can be performed for a specified interval of time. As an initial example, we will calculate the number of days that each patient possessed each medication or was hospitalized during the first year (365 days) of follow-up:

calculate.utilization(dat = simulated.chd, outcome.names = c("ace", "bb", "statin", "hospital"), begin = 0, end = 365, id.name = "id", t1.name = "t1", t2.name = "t2", type = "total", full.followup = FALSE)
#>                     id   ace    bb statin hospital
#>                 <char> <num> <num>  <num>    <num>
#>    1: 01KTl0KSK88EFV8N   270   245    263        0
#>    2: 01ZbYuUoYJeIyiVH    30    60     40       27
#>    3: 01oLxxu87rRDCIvo    64    64     64        0
#>    4: 01rOm5qEH4GLCiL5   365   365    365       87
#>    5: 021eg6OjCoGotbXK    61    30     94        0
#>   ---                                             
#>  996: zm1gtU2uw866RDGy    60    97    126        0
#>  997: zqMNWR16s2XrWiYJ    90   128    307       57
#>  998: zs7NTtHeWTecHvxS   246   239    289        0
#>  999: ztSPQ3OMBA2CzgSp   180   157    218        0
#> 1000: zxOx9moOQBqSiKq2    62    62     60        0
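One way to think about the total is as the summed overlap between each row in which the variable equals 1 and the window from begin to end. The sketch below illustrates that idea in base R on a made-up panel; it is an assumption about the logic, not the internals of calculate.utilization, and the toy values are hypothetical.

```r
# Hypothetical panel for one patient: on ace for days 0-100 and 300-500.
toy <- data.frame(
  id  = c("A", "A", "A"),
  t1  = c(0, 100, 300),
  t2  = c(100, 300, 500),
  ace = c(1, 0, 1)
)
begin <- 0
end <- 365

# Days of each row that fall inside the window, floored at zero
# for rows that lie entirely outside it.
overlap <- pmax(0, pmin(toy$t2, end) - pmax(toy$t1, begin))

# Total days on ace inside the window, per patient.
days.on.ace <- tapply(overlap * toy$ace, toy$id, sum)
days.on.ace
```

The third row runs to day 500 but only its first 65 days fall inside the window, so the total is 100 + 65 = 165 days.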

Setting the full.followup parameter to TRUE will restrict attention to subjects who are fully observed during the period. Any patient with fewer than 365 days of follow-up would be removed from consideration:

calculate.utilization(dat = simulated.chd, outcome.names = c("ace", "bb", "statin", "hospital"), begin = 0, end = 365, id.name = "id", t1.name = "t1", t2.name = "t2", type = "total", full.followup = TRUE)
#>                    id   ace    bb statin hospital
#>                <char> <num> <num>  <num>    <num>
#>   1: 01KTl0KSK88EFV8N   270   245    263        0
#>   2: 01rOm5qEH4GLCiL5   365   365    365       87
#>   3: 09AgoPRwaNTV9bqg    90   182    222        0
#>   4: 0Ej1m7QODV3uGh2N   265   363    365        0
#>   5: 0Iog4hzdp33JXcyv   266    90    358        0
#>  ---                                             
#> 557: zX3s9WnLFsUxvhjE   180   270    270        0
#> 558: zXYVDQrr2zh4bBb3   279   307    350        0
#> 559: zaJV99JUjXSXsS7g   194   233    295       39
#> 560: zqMNWR16s2XrWiYJ    90   128    307       57
#> 561: zs7NTtHeWTecHvxS   246   239    289        0

Utilization can also be calculated as a proportion of the period of observation, dividing the total days of utilization by the total days of follow-up:

med.utilization.rates <- calculate.utilization(dat = simulated.chd, outcome.names = c("ace", "bb", "statin", "hospital"), begin = 0, end = 365, id.name = "id", t1.name = "t1", t2.name = "t2", type = "rate", full.followup = TRUE)

med.utilization.rates
#>                    id       ace        bb    statin  hospital
#>                <char>     <num>     <num>     <num>     <num>
#>   1: 01KTl0KSK88EFV8N 0.7397260 0.6712329 0.7205479 0.0000000
#>   2: 01rOm5qEH4GLCiL5 1.0000000 1.0000000 1.0000000 0.2383562
#>   3: 09AgoPRwaNTV9bqg 0.2465753 0.4986301 0.6082192 0.0000000
#>   4: 0Ej1m7QODV3uGh2N 0.7260274 0.9945205 1.0000000 0.0000000
#>   5: 0Iog4hzdp33JXcyv 0.7287671 0.2465753 0.9808219 0.0000000
#>  ---                                                         
#> 557: zX3s9WnLFsUxvhjE 0.4931507 0.7397260 0.7397260 0.0000000
#> 558: zXYVDQrr2zh4bBb3 0.7643836 0.8410959 0.9589041 0.0000000
#> 559: zaJV99JUjXSXsS7g 0.5315068 0.6383562 0.8082192 0.1068493
#> 560: zqMNWR16s2XrWiYJ 0.2465753 0.3506849 0.8410959 0.1561644
#> 561: zs7NTtHeWTecHvxS 0.6739726 0.6547945 0.7917808 0.0000000

These rates can then be used in subsequent calculations. For instance, we could calculate the proportion of the patients with at least 365 days of follow-up who possessed each medication more than 80% of the time:

med.utilization.rates[, lapply(X = .SD, FUN = function(x){return(mean(x > 0.8))}), .SDcols = c("ace", "bb", "statin")]
#>          ace        bb    statin
#>        <num>     <num>     <num>
#> 1: 0.2174688 0.3368984 0.5579323

Then, based upon these calculations, we would be able to compare the medications in terms of the proportion of patients with a sufficient degree of utilization.

Counting Events

Outcome variables can also be analyzed in terms of their overall counts, such as the number of deaths or hospitalizations in the sample. The count.events method is used to calculate the number of rows in which a binary variable is equal to 1:

count.events(dat = simulated.chd, outcome.names = c("hospital", "death"), type = "overall")
#>    hospital death
#>       <int> <int>
#> 1:     1678   148

This count can also be framed in terms of the distinct occurrences of an event. When type = "distinct", the count.events method only adds to the count when an event is preceded by a gap in utilization, so an event that continues across consecutive rows is counted once.

count.events(dat = simulated.chd, outcome.names = c("hospital", "death"), type = "distinct")
#>    hospital death
#>       <int> <int>
#> 1:     1007   148

In particular, distinct counting reduces the count of hospitalizations substantially, from 1678 to 1007. Some hospitalizations extend over a period encompassing multiple rows of observation. (For instance, if the patient’s medications are changed during the hospitalization, a new row is added to the panel without discharging the patient from the hospital.) The distinction matters because hospitalizations have costs associated with admission and separate costs based on the length of stay. As an example, two separate admissions lasting 3 and 5 days could differ substantially in cost from a single admission that lasts for 8 days.
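The difference between the two counting rules can be sketched in base R. The example below assumes the rows are sorted by time within each subject and that consecutive rows are contiguous; the toy data and the onset rule are illustrative assumptions, not the code used by count.events.

```r
# Hypothetical panel: patient A has one two-row stay and one single-row stay;
# patient B is hospitalized in the first row only.
toy <- data.frame(
  id       = c("A", "A", "A", "A", "A", "B", "B"),
  hospital = c(0, 1, 1, 0, 1, 1, 0)
)

# Overall count: every row with hospital == 1.
overall.count <- sum(toy$hospital == 1)

# Value in the previous row of the same subject (0 before the first row).
previous <- ave(toy$hospital, toy$id, FUN = function(x) c(0, head(x, -1)))

# Distinct count: only rows where a hospitalization begins.
distinct.count <- sum(toy$hospital == 1 & previous == 0)

c(overall = overall.count, distinct = distinct.count)
```

The two-row stay contributes two rows to the overall count of 4 but only one event to the distinct count of 3.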

The count.events method also allows for grouped calculations based on at least one categorical variable. For instance, we could count the number of distinct hospitalizations and deaths in each geographic region:

count.events(dat = simulated.chd, outcome.names = c("hospital", "death"), grouping.variables = "region", type = "distinct")
#> Key: <region>
#>       region hospital death
#>       <char>    <int> <int>
#> 1:   Midwest      222    40
#> 2: Northeast      310    41
#> 3:     South      163    28
#> 4:      West      313    39

The count.events method can also produce counts for individual subjects when the identifier is used as a grouping variable:

count.events(dat = simulated.chd, outcome.names = c("hospital", "death"), grouping.variables = "id", type = "distinct")
#> Key: <id>
#>                     id hospital death
#>                 <char>    <int> <int>
#>    1: 01KTl0KSK88EFV8N        0     0
#>    2: 01ZbYuUoYJeIyiVH        1     1
#>    3: 01oLxxu87rRDCIvo        0     0
#>    4: 01rOm5qEH4GLCiL5        8     0
#>    5: 021eg6OjCoGotbXK        0     0
#>   ---                                
#>  996: zm1gtU2uw866RDGy        0     0
#>  997: zqMNWR16s2XrWiYJ        2     0
#>  998: zs7NTtHeWTecHvxS        0     0
#>  999: ztSPQ3OMBA2CzgSp        1     0
#> 1000: zxOx9moOQBqSiKq2        0     0

Likewise, we could also group the patients by their treatment status, such as examining the combinations of utilization of ace inhibitors and beta blockers at the time of an event:

count.events(dat = simulated.chd, outcome.names = c("hospital", "death"), grouping.variables = c("ace", "bb"), type = "distinct")
#> Key: <ace, bb>
#>      ace    bb hospital death
#>    <int> <int>    <int> <int>
#> 1:     0     0      320    86
#> 2:     0     1      228     9
#> 3:     1     0      182    24
#> 4:     1     1      282    29

Crude Rates of Events

Comparing groups in terms of their total events does not incorporate the degree of follow-up time. If one patient is hospitalized once in 6 months of observation, and a second patient is hospitalized once over the course of a year, then the first patient’s rate of events per year could be estimated as double the second patient’s rate. The crude.rates method is designed to calculate the number of events divided by the amount of person-time follow-up (the total length of follow-up summed over the relevant patients). Looking at the full simulated.chd data, the rates of distinct hospitalizations and deaths are:

crude.rates(dat = simulated.chd, outcome.names = c("hospital", "death"), type = "distinct")
#> Key: <period>
#>           period observation.time hospital death hospital.rate   death.rate
#>           <char>            <num>    <int> <int>         <num>        <num>
#> 1: All Follow-Up           722772     1007   148   0.001393247 0.0002047672

These rates translate to roughly 0.0014 hospitalizations and 0.0002 deaths per person-day of follow-up.

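When interpreting these results, it can be helpful to rescale the period of time. For instance, using 100 person-years of follow-up can place the rates onto a scale that is more similar to a human life span. The crude.rates method supports this through the time.multiplier parameter. Note that the call below omits type = "distinct", so its hospital count (1678) is the overall number of rows rather than the 1007 distinct events: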

crude.rates(dat = simulated.chd, outcome.names = c("hospital", "death"), time.multiplier = 100 * 365.25)
#> Key: <period>
#>           period observation.time hospital death hospital.rate death.rate
#>           <char>            <num>    <int> <int>         <num>      <num>
#> 1: All Follow-Up           722772     1678   148      84.79707   7.479122
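The printed rates can be checked by hand: each is an event count divided by the total observation time, optionally scaled by the multiplier. The sketch below is plain arithmetic on the counts shown in the two outputs, not a call into tvtools.

```r
# Counts and person-days copied from the crude.rates output.
observation.time  <- 722772
hospital.distinct <- 1007   # from the call with type = "distinct"
hospital.overall  <- 1678   # from the call with time.multiplier
deaths            <- 148

# Events per person-day.
hospital.rate.per.day <- hospital.distinct / observation.time
death.rate.per.day    <- deaths / observation.time

# Events per 100 person-years (100 * 365.25 person-days).
time.multiplier <- 100 * 365.25
hospital.rate.scaled <- hospital.overall / observation.time * time.multiplier
death.rate.scaled    <- deaths / observation.time * time.multiplier

c(hospital.rate.per.day, death.rate.per.day,
  hospital.rate.scaled, death.rate.scaled)
```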

These crude rates can then be grouped by categorical variables. Here we compare the event rates for patients on and off of each medication:

crude.rates(dat = simulated.chd, outcome.names = c("hospital", "death"), time.multiplier = 100 * 365.25, grouping.variables = "ace")
#> Key: <ace, period>
#>      ace        period observation.time hospital death hospital.rate death.rate
#>    <int>        <char>            <num>    <int> <int>         <num>      <num>
#> 1:     0 All Follow-Up           340745      887    95      95.07894  10.183202
#> 2:     1 All Follow-Up           382027      791    53      75.62626   5.067247

crude.rates(dat = simulated.chd, outcome.names = c("hospital", "death"), time.multiplier = 100 * 365.25, grouping.variables = "bb")
#> Key: <bb, period>
#>       bb        period observation.time hospital death hospital.rate death.rate
#>    <int>        <char>            <num>    <int> <int>         <num>      <num>
#> 1:     0 All Follow-Up           274077      819   110     109.14442  14.659202
#> 2:     1 All Follow-Up           448695      859    38      69.92495   3.093304

crude.rates(dat = simulated.chd, outcome.names = c("hospital", "death"), time.multiplier = 100 * 365.25, grouping.variables = "statin")
#> Key: <statin, period>
#>    statin        period observation.time hospital death hospital.rate
#>     <int>        <char>            <num>    <int> <int>         <num>
#> 1:      0 All Follow-Up           187581      645   107      125.5917
#> 2:      1 All Follow-Up           535191     1033    41       70.4988
#>    death.rate
#>         <num>
#> 1:  20.834599
#> 2:   2.798113

The ratio of these crude rates could be one estimate of the treatment effect, showing that patients who take these medications have lower rates of mortality and hospitalization. However, some caveats apply: in observational studies, these crude rates may reflect confounding from other variables (measured or unmeasured). Additionally, complex factors can be at play. For instance, a patient with clear warning signs of an adverse event may be placed on these medications. If the event occurs shortly thereafter, the data would misleadingly suggest a harmful association between the medication and the event. Without care in interpretation, we might falsely conclude that going to the hospital is the factor that creates the most hazard for mortality.
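As a worked illustration of such a ratio, the sketch below recomputes the rates for the two ace groups from the printed counts and divides them. It is plain arithmetic on values copied from the output, and the caveats about confounding apply to the result.

```r
# Counts and person-days copied from the crude.rates output grouped by ace.
ace.rates <- data.frame(
  ace = c(0, 1),
  observation.time = c(340745, 382027),
  hospital = c(887, 791),
  death = c(95, 53)
)

# Events per person-day in each group.
hospital.rate <- ace.rates$hospital / ace.rates$observation.time
death.rate    <- ace.rates$death / ace.rates$observation.time

# Ratio of the rate on ace (second row) to the rate off ace (first row).
# The time multiplier cancels, so it is omitted here.
hospital.ratio <- hospital.rate[2] / hospital.rate[1]   # roughly 0.80
death.ratio    <- death.rate[2] / death.rate[1]         # roughly 0.50

c(hospital = hospital.ratio, death = death.ratio)
```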

The crude rates can also be calculated in different eras of time by specifying numeric cut.points:

crude.rates(dat = simulated.chd, outcome.names = c("hospital", "death"), time.multiplier = 100 * 365.25, cut.points = c(90, 365/2))
#> Key: <period>
#>               period observation.time hospital death hospital.rate death.rate
#>               <char>            <num>    <int> <int>         <num>      <num>
#> 1:         Before 90          87670.0      214    22       89.1565   9.165621
#> 2: On or After 182.5         571052.5     1305    91       83.4689   5.820437
#> 3:       [90, 182.5)          64049.5      218    35      124.3171  19.959172

These calculations can also be performed in groups, such as comparing patients on and off beta blockers in each era in terms of their rates of hospitalization and mortality:

crude.rates(dat = simulated.chd, outcome.names = c("hospital", "death"), time.multiplier = 100 * 365.25, grouping.variables = "bb", cut.points = c(90, 365/2))
#> Key: <bb, period>
#>       bb            period observation.time hospital death hospital.rate
#>    <int>            <char>            <num>    <int> <int>         <num>
#> 1:     0         Before 90          29169.0      104    13     130.22730
#> 2:     0 On or After 182.5         225173.0      633    70     102.67805
#> 3:     0       [90, 182.5)          17996.0       91    27     184.69521
#> 4:     1         Before 90          58501.0      110     9      68.67831
#> 5:     1 On or After 182.5         345565.5      671    21      70.92223
#> 6:     1       [90, 182.5)          38920.5       95     8      89.15289
#>    death.rate
#>         <num>
#> 1:  16.278412
#> 2:  11.354603
#> 3:  54.799678
#> 4:   5.619135
#> 5:   2.219623
#> 6:   7.507612

Panel data presents challenges for separating the data into eras of time. Many rows of data may include intervals of time that overlap multiple eras. The crude.rates method relies upon the era.splits method to restructure the data. For rows that overlap the eras specified by the cut.points, the method adds new rows to the data set and modifies the time points of the existing rows. This ensures that each row belongs to a single era and that no information is lost.

As an example, let’s consider the first two rows of the simulated.chd data:

simulated.chd[1:2, .SD, .SDcols = c("id", "t1", "t2")]
#>                  id    t1    t2
#>              <char> <num> <num>
#> 1: 01KTl0KSK88EFV8N     0     8
#> 2: 01KTl0KSK88EFV8N     8    30

Suppose an analysis wants to consider the experience of patients in several periods: a) before 3 days, b) at least 3 and less than 5 days, and c) all subsequent follow-up starting at 5 days. Applying the era.splits method to these two rows of data splits the first row into 3 rows of data that are mutually exclusive, collectively exhaustive, and aligned with the specified eras:

era.splits(dat = simulated.chd[1:2, .SD, .SDcols = c("id", "t1", "t2")], cut.points = c(3,5))
#>                  id    t1    t2
#>              <char> <num> <num>
#> 1: 01KTl0KSK88EFV8N     0     3
#> 2: 01KTl0KSK88EFV8N     3     5
#> 3: 01KTl0KSK88EFV8N     5     8
#> 4: 01KTl0KSK88EFV8N     8    30
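The splitting rule for a single row can be sketched in a few lines of base R. The helper below, split_one_row, is a hypothetical name introduced for illustration; it is not part of tvtools and only handles the time columns.

```r
# Split one interval [t1, t2) at any cut points that fall strictly inside it.
split_one_row <- function(t1, t2, cut.points) {
  inside <- cut.points[cut.points > t1 & cut.points < t2]
  breaks <- c(t1, sort(inside), t2)
  data.frame(t1 = head(breaks, -1), t2 = tail(breaks, -1))
}

# The first row (0 to 8) contains both cut points and becomes three rows.
first.row <- split_one_row(t1 = 0, t2 = 8, cut.points = c(3, 5))

# The second row (8 to 30) contains neither cut point and is unchanged.
second.row <- split_one_row(t1 = 8, t2 = 30, cut.points = c(3, 5))

rbind(first.row, second.row)
```

The combined result has the same four intervals as the era.splits output above.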

Quality Checks of Panel Data

The complexity and unfamiliarity of the panel structure can present challenges in basic investigations of the data. The tvtools package includes a number of methods for identifying potential issues with panel data.

Measurement Rates

Longitudinal data may be subject to censoring and loss to follow-up. The measurement.rate function calculates the proportion of subjects who have records at the specified point in follow-up:

measurement.rate(dat = simulated.chd, id.name = "id", t1.name = "t1", t2.name = "t2", time.point = 365)
#>     time observed total.subjects rate.observed rate.not.observed
#>    <num>    <int>          <int>         <num>             <num>
#> 1:   365      556           1000         0.556             0.444

Note that the rate not observed incorporates a) patients censored or lost to follow-up at that time, and also b) patients who did not survive to that time.
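The idea behind the rate can be sketched in base R: a subject counts as observed if one of their rows covers the time point. The boundary convention used below (t1 <= time point < t2) is an assumption made for this illustration and may differ from the rule in measurement.rate; the toy data is hypothetical.

```r
# Hypothetical panel with three subjects.
toy <- data.frame(
  id = c("A", "A", "B", "C", "C"),
  t1 = c(0, 200, 0, 0, 150),
  t2 = c(200, 400, 120, 150, 365)
)
time.point <- 365

# Rows whose interval covers the time point (assumed half-open intervals).
covers <- toy$t1 <= time.point & time.point < toy$t2

observed <- length(unique(toy$id[covers]))
total.subjects <- length(unique(toy$id))
rate.observed <- observed / total.subjects

c(observed = observed, total = total.subjects, rate = rate.observed)
```

Under this convention only subject A is observed at day 365. Subject C, whose follow-up ends exactly at day 365, is not counted, which shows how much the choice of boundary rule can matter.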

The rate of measurement can also be calculated in groups:

measurement.rate(dat = simulated.chd, id.name = "id", t1.name = "t1", t2.name = "t2", time.point = 365, grouping.variables = "region")
#> Key: <region>
#>       region  time observed total.subjects rate.observed rate.not.observed
#>       <char> <num>    <int>          <int>         <num>             <num>
#> 1:   Midwest   365      103            191     0.5392670         0.4607330
#> 2: Northeast   365      167            294     0.5680272         0.4319728
#> 3:     South   365      113            196     0.5765306         0.4234694
#> 4:      West   365      173            319     0.5423197         0.4576803

Panel Gaps

Longitudinal records can include periods of censoring. In panel data, this would show up through the absence of a record. The panel.gaps method is designed to identify gaps between earlier and later observed times, with an assumed starting time of t1 = 0. Here we can verify that there are no gaps in the simulated.chd data:

pg.check <- panel.gaps(dat = orig.data, id.name = "id", t1.name = "t1", t2.name = "t2")
pg.check[, .N, gap_before]
#>    gap_before     N
#>        <lgcl> <int>
#> 1:      FALSE 33572

We could also artificially construct gaps in the panel data by only selecting a subset of rows. This will verify that the panel gaps are correctly identified:

gap.dat <- simulated.chd[c(1,3,5, 7, 58, 60, 64),]
pg.check.2 <- panel.gaps(dat = gap.dat, id.name = "id", t1.name = "t1", t2.name = "t2")
pg.check.2[, .SD, .SDcols = c("id", "t1", "t2", "gap_before")]
#>                  id    t1    t2 gap_before
#>              <char> <num> <num>     <lgcl>
#> 1: 01KTl0KSK88EFV8N     0     8      FALSE
#> 2: 01KTl0KSK88EFV8N    30    38       TRUE
#> 3: 01KTl0KSK88EFV8N    46    66       TRUE
#> 4: 01KTl0KSK88EFV8N    90    94       TRUE
#> 5: 01ZbYuUoYJeIyiVH     0     1      FALSE
#> 6: 01ZbYuUoYJeIyiVH    10    30       TRUE
#> 7: 01ZbYuUoYJeIyiVH    70    70       TRUE

We can also identify the earliest gap for each subject using first.panel.gap:

first.panel.gap(dat = gap.dat, id.name = "id", t1.name = "t1", t2.name = "t2")
#>                  id gap_before.first.event
#>              <char>                  <num>
#> 1: 01KTl0KSK88EFV8N                     30
#> 2: 01ZbYuUoYJeIyiVH                     10

Likewise, a subject’s latest gap can be found with last.panel.gap:

last.panel.gap(dat = gap.dat, id.name = "id", t1.name = "t1", t2.name = "t2")
#>                  id gap_before.last.event
#>              <char>                 <num>
#> 1: 01KTl0KSK88EFV8N                    90
#> 2: 01ZbYuUoYJeIyiVH                    70
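The gap rule can be reproduced by hand on the seven rows of gap.dat. The sketch below assumes that a gap exists whenever a row starts later than the previous row of the same subject ended, with 0 as the assumed end time before the first row. It is written in base R as an illustration and is not the code used by panel.gaps.

```r
# Time columns copied from the gap.dat output.
gap.rows <- data.frame(
  id = c(rep("01KTl0KSK88EFV8N", 4), rep("01ZbYuUoYJeIyiVH", 3)),
  t1 = c(0, 30, 46, 90, 0, 10, 70),
  t2 = c(8, 38, 66, 94, 1, 30, 70)
)

# End of the previous row within each subject (0 before the first row).
previous.t2 <- ave(gap.rows$t2, gap.rows$id,
                   FUN = function(x) c(0, head(x, -1)))

# A gap precedes a row that starts after the previous row ended.
gap.rows$gap_before <- gap.rows$t1 > previous.t2

# Start times of the earliest and latest gap for each subject.
with.gap <- gap.rows[gap.rows$gap_before, ]
first.gap <- tapply(with.gap$t1, with.gap$id, min)
last.gap  <- tapply(with.gap$t1, with.gap$id, max)

gap.rows
```

The flags and the first and last gap times agree with the panel.gaps, first.panel.gap, and last.panel.gap outputs.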

Panel Overlaps

For a single subject, we assume that the panel data is structured so that the rows and time intervals will be mutually exclusive. That assumption can be validated only through investigation of the data. The panel.overlaps function identifies whether each subject has any period of potentially overlapping time intervals:

possible.overlaps <- panel.overlaps(dat = simulated.chd, id.name = "id", t1.name = "t1", t2.name = "t2")
#print(possible.overlaps)
possible.overlaps[, mean(overlapping_panels == FALSE)]
#> [1] 1

This verifies that the simulated.chd data meets the assumption of mutually exclusive periods of observation for each subject.

We can then construct a panel with overlapping observations:

overlap.dat <- data.table(id = "ABC", t1 = c(0, 7, 14, 21), t2 = c(8, 15, 21, 30), ace = c(1,0,1,0))
panel.overlaps(dat = overlap.dat, id.name = "id", t1.name = "t1", t2.name = "t2")
#>        id overlapping_panels
#>    <char>             <lgcl>
#> 1:    ABC               TRUE

It should be noted that panel.overlaps requires pairwise comparisons of the time intervals within each subject. As a result, larger panels can require some computational time to complete the investigation of overlaps.
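A pairwise check of this kind can be sketched in base R for the overlap.dat example. The test used here (two intervals overlap when each starts before the other ends) is a standard interval rule assumed for illustration; the package may implement the comparison differently.

```r
# Time columns of the single subject in overlap.dat.
overlap.rows <- data.frame(t1 = c(0, 7, 14, 21), t2 = c(8, 15, 21, 30))

# All unordered pairs of rows: n * (n - 1) / 2 comparisons.
pairs <- combn(nrow(overlap.rows), 2)
i <- pairs[1, ]
j <- pairs[2, ]

# Two intervals overlap when each one starts before the other ends.
# Intervals that only touch (such as 14-21 and 21-30) do not overlap.
overlaps <- overlap.rows$t1[i] < overlap.rows$t2[j] &
  overlap.rows$t1[j] < overlap.rows$t2[i]

any(overlaps)
```

With four rows there are six comparisons, two of which overlap. The number of pairs grows quadratically with the number of rows per subject, which is why larger panels take longer to check.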

Events of Unusual Duration

Panel data is structured on the notion that new events will take effect at the beginning of the time interval for the record. This assumption should be carefully verified in data analyses. For instance, a death that actually occurred at the end of the last interval of observation might be erroneously coded on that interval’s row. If the last interval is especially lengthy, an analysis of the data would systematically record such deaths at significantly earlier times than they occurred.

The unusual.duration method is designed to identify cases in which an event occurs and the duration of the interval is long enough to be considered unusual. For instance, we might identify the hospitalizations that last longer than 100 days (in a single row):

long.hospitalizations <- unusual.duration(dat = simulated.chd, outcome.name = "hospital", max.length = 100, t1.name = "t1", t2.name = "t2")
long.hospitalizations[, .SD, .SDcols = c("id", "t1", "t2", "hospital")]
#>                   id    t1    t2 hospital
#>               <char> <num> <num>    <int>
#>  1: 01rOm5qEH4GLCiL5   562   823        1
#>  2: 01rOm5qEH4GLCiL5  1206  1372        1
#>  3: 378Ax3nk7KUuz9CV   507   612        1
#>  4: 5bsGQKWkbeqqCDun  1487  1929        1
#>  5: 5bsGQKWkbeqqCDun  2035  2314        1
#>  6: AnDxV4tHY0ceJg8R  1744  1854        1
#>  7: D8GJSiAFkcv29KYV  1688  1829        1
#>  8: FhkQxeEx9dvO2iZb  1387  1545        1
#>  9: H8aM5GuMndOLTrF8  1670  1832        1
#> 10: IJQFwVwBBc23vpmK  2001  2206        1
#> 11: QUUws4eexQZxWIl9   645   822        1
#> 12: REaoGgY18CgbLneN  1854  2031        1
#> 13: RtwWZQiF192PDFR9   387   554        1
#> 14: V5xu6shLtlYaIVWB   848  1002        1
#> 15: VUjW6PZAYA6OQQM3  2134  2400        1
#> 16: ZIa8W5pvnPoETLye   288   452        1
#> 17: ZSWU3tOI1LqYYd1N   656   832        1
#> 18: d8CIC18WilrexiD1   373   549        1
#> 19: d8CIC18WilrexiD1  1752  1914        1
#> 20: dCUgM2bLCEzlMvqt  1669  1849        1
#> 21: dQ4PHmnuZel9Aicq   684   849        1
#> 22: jGjvJOMwhwyQtipu  1208  1561        1
#> 23: vIpAgiKixKObEXfh   814   948        1
#>                   id    t1    t2 hospital

These cases might be further investigated to ensure the accuracy of the data. Likewise, we could also verify that deaths are not recorded for a period greater than 1 day:

unusual.duration(dat = simulated.chd, outcome.name = "death", max.length = 1, t1.name = "t1", t2.name = "t2")
#> Empty data.table (0 rows and 15 cols): id,t1,t2,age,sex,region...

This verifies that the time of mortality will not be misinterpreted based upon differences between the beginning and end of the interval of observation. If large differences are noted, then some restructuring of the data may be necessary.
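The filter behind both checks can be sketched in base R as a comparison of each event row’s length against the threshold. The toy data and the strict inequality below are assumptions for illustration, not the internals of unusual.duration.

```r
# Hypothetical panel: one 150-day hospitalization row and one 30-day row.
toy <- data.frame(
  id = c("A", "A", "B"),
  t1 = c(0, 50, 0),
  t2 = c(50, 200, 30),
  hospital = c(0, 1, 1)
)
max.length <- 100

# Length of each row in days.
duration <- toy$t2 - toy$t1

# Keep the rows where the event is present and the row is longer than allowed.
flagged <- toy[toy$hospital == 1 & duration > max.length, ]
flagged
```

Only the 150-day row for subject A is flagged; the 30-day hospitalization and the row without an event are left out.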