Why we need standardized processes for customer analytic data preparation. Data quality stats from Forrester 1
CADF stands for Customer Analytics Data Format. Here is how it works.
1.) Prepare your transaction data to account for purchases, returns and the correct transaction data inclusion. This may require working with source system teams like SAP or your accounting teams. It may require implementing or working with a customer data platform.
2.) Define segments of customers that you wish to process analytic datasets for.
3.) Run CADF for each segment. This packages is still in beta. Once finished it will automatically create an series of analytic datasets for each segment.
CADF works by utilizing a split-apply-combine approach to data mining.^ https://www.jstatsoft.org/article/download/v040i01/468. I do not follow Hadley’s approach exactly but more out of principle.
1.) Split data by customer id into thousands of R lists.
2.) Use R6 class to process data for each customer id. (This is called putting the data in CADF format.)
3.) Recombine the data into analytic datasets.
Prepare transactional data -> Split data by customer id -> Run CADF on each customer -> Reassemble customer level data into cohort level data for modeling.
Here are the steps to translate transactional data to CADF.
data(transactions)
transactions.filtered <- transactions[, c("ID", "PURCHASE_DATE")]
str(transactions.filtered)
#> 'data.frame': 69659 obs. of 2 variables:
#> $ ID : int 1 2 2 3 3 3 3 3 3 4 ...
#> $ PURCHASE_DATE: Date, format: "1997-01-01" "1997-01-12" ...
Run the CADF process. Split.transaction.file_to_CADF processes each customer. The second arguement can be used if your data contains training and test instances. This example assumes 0 testing data.
Here is the CADF profile (R6 class) for customer 400. The R6 class contains all the different data points that can be used for customer analytic modeling.
cadf["400"]
#> $`400`
#> <Customer>
#> Public:
#> Freq:
#> T: 4
#> T_custom: TRUE
#> T_ss: 15
#> clone: function (deep = FALSE)
#> data: data.frame
#> first_purchase_date: 1997-01-02
#> id: 400
#> initialize: function (df_customer = NA, today = NA)
#> last_purchase_date: 1998-03-07
#> logistic_modeling_matrix: 1 2 3 4 0 0 0 1
#> logistic_modeling_matrix_custom: 1 1
#> logistic_modeling_matrix_ss: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0 0 0 ...
#> output: list
#> payload: [{}]
#> purchase_count: 8
#> purchase_string: 111010100001101
#> purchase_string_as_matrix: 1 1 1 0 1 0 1 0 0 0 0 1 1 0 1
#> recency_string_as_matrix: 0 0 0 1 0 1 0 1 2 3 4 0 0 1 0
#> repeat_customer: Y
#> repeat_customer_by_day: Y
#> study_begin_date: NULL
#> study_name: Customer Study
#> survival_modeling_matrix: 1 2 3 4 5 0 0 0 0 1
#> survival_modeling_matrix_custom: 1 2 0 1
#> survival_modeling_matrix_ss: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 0 0 0 0 0 0 0 0 0 ...
#> timing: NULL
#> today: 1998-07-01
#> transaction_dates: 1997_01 1997_02 1997_03 1997_05 1997_07 1997_12 1997_12 ...
#> transaction_months: 1997-01 1997-02 1997-03 1997-05 1997-07 1997-12 1998-01 ...
#> transaction_range_complete: 15
Returns a numeric vector. Each entry is the time, in weeks, for each customers nth purchase. The function is ran for 3rd purchase.
A more comprehensive dataset may be returned. The following command returns the Customer ID, purchase_date (of xth purchase) and various calculations. Note that only one row per customer is returned. Each customer row returned represents the nth purchase.
nth.purchase.more <- CADF_to_nth_purchase_allrows(cadf,3)
head(nth.purchase.more)
#> ID PURCHASE_DATE Tdays month_yr Tmonths
#> 3 3 1997-05-25 84 1997-05 3
#> 4 4 1997-08-18 229 1997-08 8
#> 5 5 1997-02-14 44 1997-02 2
#> 7 7 1998-01-22 317 1998-01 11
#> 8 8 1997-06-13 132 1997-06 5
#> 9 9 1998-01-08 252 1998-01 9
#> yr_week
#> 3 1997_091997_091997_091997_091997_091998_
#> 4 1997_011997_011997_011997_
#> 5 1997_011997_011997_011997_011997_011997_011997_011997_011997_011997_011998_
#> 7 1997_401997_401998_
#> 8 1997_051997_051997_051997_051997_051997_051997_051998_
#> 9 1997_181997_181998_
#> Tweeks today diff.years target.buy Frequency.baseperiod
#> 3 12.000000 1998-07-01 1.1006160 days 0 6
#> 4 32.714286 1998-07-01 0.8678987 days 0 4
#> 5 6.285714 1998-07-01 1.3744011 days 0 11
#> 7 45.285714 1998-07-01 0.4380561 days 0 3
#> 8 18.857143 1998-07-01 1.0485969 days 0 8
#> 9 36.000000 1998-07-01 0.4763860 days 0 3
#> Frequency.holdout x t.x T.cal recency.studyperiod.years
#> 3 0 6 33.50000 days 69.42857 0.4216290 days
#> 4 0 4 26.41667 days 78.00000 0.5776865 days
#> 5 0 11 41.83333 days 78.00000 0.4900753 days
#> 7 0 3 13.33333 days 68.14286 0.4380561 days
#> 8 0 8 31.91667 days 73.57143 0.4188912 days
#> 9 0 3 14.50000 days 60.85714 0.4763860 days
#> purchase.num
#> 3 3
#> 4 3
#> 5 3
#> 7 3
#> 8 3
#> 9 3
View purchase string for one consumer as a string of 1’s and 0’s. 1 is purchase. 0 is no purchase.
View purchase string for one consumer as a matrix.
cadf$`400`$purchase_string_as_matrix
#> [,1]
#> [1,] 1
#> [2,] 1
#> [3,] 1
#> [4,] 0
#> [5,] 1
#> [6,] 0
#> [7,] 1
#> [8,] 0
#> [9,] 0
#> [10,] 0
#> [11,] 0
#> [12,] 1
#> [13,] 1
#> [14,] 0
#> [15,] 1
View purchase string for one consumer as R data frame. Time always starts at 1 and is relative to first purchase date.
t <- 1: length(cadf$`400`$purchase_string_as_matrix)
cadf$`400`$purchase_string_as_matrix
#> [,1]
#> [1,] 1
#> [2,] 1
#> [3,] 1
#> [4,] 0
#> [5,] 1
#> [6,] 0
#> [7,] 1
#> [8,] 0
#> [9,] 0
#> [10,] 0
#> [11,] 0
#> [12,] 1
#> [13,] 1
#> [14,] 0
#> [15,] 1
data.frame(t,cadf$`400`$purchase_string_as_matrix )
#> t cadf..400..purchase_string_as_matrix
#> 1 1 1
#> 2 2 1
#> 3 3 1
#> 4 4 0
#> 5 5 1
#> 6 6 0
#> 7 7 1
#> 8 8 0
#> 9 9 0
#> 10 10 0
#> 11 11 0
#> 12 12 1
#> 13 13 1
#> 14 14 0
#> 15 15 1
Create a R list of all purchase strings.
Each row contains weighted data for each combination of T and “cancel” T | cancel | count
head(srm_summaries)
#> bigT cancel count
#> 1 2 1 4
#> 2 10 1 13
#> 3 7 0 49
#> 4 3 1 16
#> 5 11 1 10
#> 6 8 0 63
Expanded datasets work best with CADF because that is how the data will most likely be returned from your reporting teams.
Sum of cancel flag = number of cancellations Row count = opportunities to cancel
Divide the two to get the simple retention rate
surv.obj <- Surv(srm_data$bigT, srm_data$cancel)
summary(survfit(surv.obj ~ 1))
#> Call: survfit(formula = surv.obj ~ 1)
#>
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 2 5157 4 0.999 0.000388 0.998 1.000
#> 3 4489 16 0.996 0.000969 0.994 0.998
#> 4 3825 20 0.990 0.001509 0.988 0.993
#> 5 3179 37 0.979 0.002403 0.974 0.984
#> 6 2554 28 0.968 0.003117 0.962 0.974
#> 7 1973 61 0.938 0.004833 0.929 0.948
#> 8 1453 24 0.923 0.005695 0.912 0.934
#> 9 1043 19 0.906 0.006773 0.893 0.919
#> 10 720 13 0.890 0.008027 0.874 0.905
#> 11 446 10 0.870 0.010024 0.850 0.890
#> 12 201 13 0.813 0.017763 0.779 0.849
CADF - Customer Data Preparation in R Copyright (C) 2023 Steve Ludwig
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.