Title: | Sequence and Latent Process Detector |
---|---|
Description: | The Sequence Detector in this package contains a specific automaton model that can be used to learn and detect data and process sequences. The automaton model is capable of learning and tracing sequences, and is described in Krleža, Vrdoljak, Brčić (2019) <doi:10.1109/ACCESS.2019.2955245>. This research has been partly supported under Competitiveness and Cohesion Operational Programme from the European Regional and Development Fund, as part of the Integrated Anti-Fraud System project no. KK.01.2.1.01.0041. This research has also been partly supported by the European Regional Development Fund under the grant KK.01.1.1.01.0009. |
Authors: | Dalibor Krleža |
Maintainer: | Dalibor Krleža <[email protected]> |
License: | LGPL-3 |
Version: | 1.0.7 |
Built: | 2024-11-20 06:30:34 UTC |
Source: | CRAN |
A single sales process flow from the BPI 2019 challenge event log was taken to perform the Sequence Detector testing. The results are available in [1].
bpi_challenge_2019_test1()
None
[1] D. Krleža, B. Vrdoljak, and M. Brčić, Latent Process Discovery using Evolving Tokenized Transducer, IEEE Access, vol. 7, pp. 169657 - 169676, Dec. 2019
A method that formats an input list made of strings into a single output string. The output string is formatted as [e1,e2,...,en].
c_to_string(var)
var | (list) - A string list |
(character) - An output string made of the input list elements, formatted as [e1,e2,...,en].
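The output format described above can be illustrated with a few lines of base R. This is only a sketch of the documented [e1,e2,...,en] format and not the package's implementation; the helper name format_as_bracket_list is hypothetical.

```r
# Hypothetical helper (not part of SeqDetect): mimics the documented
# output format [e1,e2,...,en] using base R only.
format_as_bracket_list <- function(var) {
  # unlist() flattens a list of strings into a character vector
  elements <- as.character(unlist(var))
  # join the elements with commas and wrap them in square brackets
  paste0("[", paste(elements, collapse = ","), "]")
}

formatted <- format_as_bracket_list(list("A", "B", "C"))
print(formatted)
```

An empty input list yields "[]" in this sketch; whether c_to_string behaves the same way is not stated in this manual.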
An abstract method that needs to be implemented by classes that derive HSC_PC. It performs classification on the input event stream. See the SeqDetect vignette for details on how to implement a HSC_PC derived class.
classify(x, stream, ...)
x |
( |
stream | (data.frame) - An input event stream |
... | An additional list of parameters needed for the used pre-classifier. |
(data.frame) - An output consolidated stream. Each row in the output data.frame must have a .clazz field containing the row classification value.
Sequence Detector method for removing tokens.
## S4 method for signature 'HybridSequenceClassifier'
cleanKeys(machine_id=NULL)
machine_id | (character) - An identifier of the machine (ETT) whose tokens need to be removed. If NULL, the tokens of all machines are removed. |
st <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"),
                 sales=c(2,12,18,16,18,24,8),
                 alert=c(NA,NA,NA,NA,NA,"Alert P45","Alert P134"))
input_streams <- list(stream=st)
pp <- HSC_PP(c("product","sales"),"sequence_id",auto_id=TRUE)
pc <- HSC_PC_Binning(0,100,40,"sales")
hsc <- HybridSequenceClassifier(c("product","sales","sequence_id"),
                                "sequence_id","sequence_id","product",pc,pp)
hsc$process(input_streams)
hsc$cleanKeys()
Sequence Detector method for cloning. Clones the Sequence Detector object and all its ETTs.
## S4 method for signature 'HybridSequenceClassifier'
clone()
st <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"),
                 sales=c(2,12,18,16,18,24,8),
                 alert=c(NA,NA,NA,NA,NA,"Alert P45","Alert P134"))
input_streams <- list(stream=st)
pp <- HSC_PP(c("product","sales"),"sequence_id",auto_id=TRUE)
pc <- HSC_PC_Binning(0,100,40,"sales")
hsc <- HybridSequenceClassifier(c("product","sales","sequence_id"),
                                "sequence_id","sequence_id","product",pc,pp)
hsc$process(input_streams)
tt <- data.frame(product=c("P672","P113","P983","P23872","P5","P672","P2982","P983","P672",
                           "P991","P983","P113","P2982","P344"),
                 sales=c(2,11,12,98,8,18,298,16,24,25,18,16,43,101),alert=NA)
test_streams <- list(stream=tt)
hsc2 <- hsc$clone()
hsc2$process(test_streams,learn=FALSE)
Sequence Detector method for compressing machines by isolating common isomorphic substructures into child ETTs. See the SeqDetect vignette for details and examples.
## S4 method for signature 'HybridSequenceClassifier'
compressMachines(ratio=0.5)
ratio | (numeric) - A minimal isomorphic overlap between ETTs to be eligible for compression. Setting this parameter too low (e.g. <0.5) might lead to overfragmentation of ETTs. |
st <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"),
                 sales=c(2,12,18,16,18,24,8),
                 alert=c(NA,NA,NA,NA,NA,"Alert P45","Alert P134"))
input_streams <- list(stream=st)
pp <- HSC_PP(c("product","sales"),"sequence_id",auto_id=TRUE)
pc <- HSC_PC_Binning(0,100,40,"sales")
hsc <- HybridSequenceClassifier(c("product","sales","sequence_id"),
                                "sequence_id","sequence_id","product",pc,pp)
hsc$process(input_streams)
hsc$compressMachines()
Sequence Detector method for deserializing from a list.
deserializeFromList(l)
l | (list) - A list containing the Sequence Detector details. |
(HybridSequenceClassifier) - Returns a deserialized Sequence Detector object.
HybridSequenceClassifier-class, serializeToList,HybridSequenceClassifier-method
st <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"),
                 sales=c(2,12,18,16,18,24,8),
                 alert=c(NA,NA,NA,NA,NA,"Alert P45","Alert P134"))
input_streams <- list(stream=st)
pp <- HSC_PP(c("product","sales"),"sequence_id",auto_id=TRUE)
pc <- HSC_PC_Binning(0,100,40,"sales")
hsc <- HybridSequenceClassifier(c("product","sales","sequence_id"),
                                "sequence_id","sequence_id","product",pc,pp)
hsc$process(input_streams)
hsc_list <- hsc$serializeToList()
saveRDS(hsc_list,"test_list.RDS")
new_hsc_list <- readRDS("test_list.RDS")
file.remove("test_list.RDS")
hsc2 <- deserializeFromList(new_hsc_list)
Sequence Detector method for retrieving list of machine identifiers.
## S4 method for signature 'HybridSequenceClassifier'
getMachineIdentifiers()
(list) - A list of strings representing machine identifiers.
st <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"),
                 sales=c(2,12,18,16,18,24,8),
                 alert=c(NA,NA,NA,NA,NA,"Alert P45","Alert P134"))
input_streams <- list(stream=st)
pp <- HSC_PP(c("product","sales"),"sequence_id",auto_id=TRUE)
pc <- HSC_PC_Binning(0,100,40,"sales")
hsc <- HybridSequenceClassifier(c("product","sales","sequence_id"),
                                "sequence_id","sequence_id","product",pc,pp)
res <- hsc$process(input_streams)
message(hsc$getMachineIdentifiers())
All pre-classifiers must inherit this class. A pre-classifier instance cannot be directly created by this abstract class.
HSC_PC_None,HSC_PC_Attribute,HSC_PC_Binning
Extends the HSC_PC abstract class.
HSC_PC_Attribute(field)
field | (character) - Field taken as the classification value from the input event stream. |
This pre-classifier takes the classification from a predefined field in the input event stream and copies its values to the .clazz field. The rest of the input event stream remains unmodified.
event_stream <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"),
                           sales=c(2,12,18,16,18,24,8),
                           alert=c(NA,NA,NA,NA,NA,"Alert P45","Alert P134"))
pc <- HSC_PC_Attribute("sales")
cons_stream <- classify(pc,event_stream)
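The documented effect of this pre-classifier (copying one field into .clazz and leaving everything else untouched) can be mimicked in base R. The sketch below is an illustration under that reading of the description; copy_field_to_clazz is a hypothetical helper, not a SeqDetect function.

```r
# Sketch only: reproduces the documented effect of an attribute-based
# pre-classifier with base R. copy_field_to_clazz is a hypothetical name.
copy_field_to_clazz <- function(stream, field) {
  stopifnot(is.data.frame(stream), field %in% names(stream))
  # copy the chosen field into the .clazz column; other columns are unchanged
  stream$.clazz <- stream[[field]]
  stream
}

event_stream <- data.frame(product = c("P45", "P134", "P45"),
                           sales = c(2, 12, 18))
cons_stream <- copy_field_to_clazz(event_stream, "sales")
print(cons_stream)
```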
Extends the HSC_PC abstract class.
HSC_PC_Binning(min_value, max_value, bins, value_field)
min_value | (numeric) - Minimal value. |
max_value | (numeric) - Maximal value. |
bins | (integer) - The number of bins that need to be created. |
value_field | (character) - The name of the value field in the input event stream. |
A pre-classifier that performs binning on a value field of the input event stream.
event_stream <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"),
                           sales=c(2,12,18,16,18,24,8),
                           alert=c(NA,NA,NA,NA,NA,"Alert P45","Alert P134"))
pc <- HSC_PC_Binning(0,100,40,"sales")
cons_stream <- classify(pc,event_stream)
# Minimal value = 0, Maximal value = 100, 40 bins, values taken from the field named *sales*
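To give a feel for what binning with these parameters means, the following base R sketch assigns each value to one of 40 equal-width bins over [0, 100]. Equal-width bins numbered from 1 are an assumption made for illustration; this manual does not state how HSC_PC_Binning bounds or labels its bins, and equal_width_bin is a hypothetical helper.

```r
# Assumption: equal-width bins over [min_value, max_value], numbered from 1.
# SeqDetect may bound or label its bins differently.
equal_width_bin <- function(values, min_value, max_value, bins) {
  # bins + 1 break points delimit the bins
  breaks <- seq(min_value, max_value, length.out = bins + 1)
  # all.inside = TRUE clamps out-of-range values into the first or last bin
  findInterval(values, breaks, all.inside = TRUE)
}

sales <- c(2, 12, 18, 16, 18, 24, 8)
bin_index <- equal_width_bin(sales, 0, 100, 40)
print(bin_index)
```

With 40 bins over [0, 100] each bin is 2.5 units wide, so two sales values that differ by less than 2.5 can end up with the same classification value.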
Extends the HSC_PC abstract class.
HSC_PC_None()
A pre-classifier class that does not contain any classifier. It passes an input event stream straight through without any modifications. It only checks whether the input event stream contains the .clazz field, which should carry the classification and input symbols for the Sequence Detector ETTs.
event_stream <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"),
                           sales=c(2,12,18,16,18,24,8),
                           alert=c(NA,NA,NA,NA,NA,"Alert P45","Alert P134"),
                           .clazz=c(2,12,18,16,18,24,8))
pc <- HSC_PC_None()
cons_stream <- classify(pc,event_stream)
Class that needs to be derived to create new pre-processors. A pre-processor can be directly instantiated from the HSC_PP class.
HSC_PP(fields, timestamp_field, create_unique_key = FALSE, auto_id = FALSE)
fields | (vector) - The complete list of fields in the input data streams that need to be present in the output event stream |
timestamp_field | (character) - The name of the sequencing field. Could be autogenerated by the pre-processor, or already present in the input data streams. Used for ordering of the output event stream. |
create_unique_key | (logical) - If TRUE, the pre-processor adds a field named .key to the output event stream comprising a unique key (1) for all data items. |
auto_id | (logical) - If TRUE, the pre-processor generates autoincremented values and assigns them to the timestamp_field. Can be used when input data streams do not comprise any timing information. |
Example 1
pp <- HSC_PP(c("product","time","sales"),"time")
- Creates a new HSC_PP pre-processor that uses the time field for ordering of the output event stream.
Example 2
pp <- HSC_PP(c("product","sales"),"sequence_id",auto_id=TRUE)
- Creates a new HSC_PP pre-processor that has no time field. Instead, the pre-processor adds the sequence_id field and generates autoincremented values for it.
Example 3
pp <- HSC_PP(c("sequence_val"),"sequence_id",create_unique_key=TRUE,auto_id=TRUE)
- Creates a new HSC_PP pre-processor that has no time and no key field. The pre-processor adds the sequence_id field and generates autoincremented values for it. Also, the .key=1 column is added to all output events.
The Sequence Detector class.
Instantiates a Sequence Detector object. The constructor takes a number of parameters that define the pre-processing and pre-classification stages, as well as the structure of the input consolidated data stream. These stages can be redefined later using the setInputDefinitions,HybridSequenceClassifier-method method. See the SeqDetect vignette for examples.
(vector, character) - A vector of all relevant consolidated data stream fields.
(character) - A name of the field having starting time point values.
(character) - A name of the field having finishing time point values.
(character) - A name of the context identifier field (key field). If NULL, then .key field is used for retrieving context identifier values.
(HSC_PC) - A pre-classifier object. If NULL, the Sequence Detector creates a new HSC_PC_None pre-classifier, which means that the input consolidated data stream must have a .clazz field for retrieving classification values (input symbols in the underlying ETTs).
(HSC_PP) - A pre-processing object. If NULL, the Sequence Detector creates a new HSC_PP pre-processor having the same fields as defined in the fields parameter, and the ordering timestamp field as defined in timestamp_start_field.
(list) - A list of decay descriptors. If NULL, the token decay mechanism is not used. The descriptor structure can be seen in the vignettes.
(character) - A name of the field having output symbol values, i.e., relational ETT classification output.
(logical) - If TRUE, ETTs are instructed to create sequence statistics. This is used when the input data streams are time series. If FALSE, the sequence statistics are not created.
(logical) - The parameter defined in [1]. ETTs are created so that each ETT has a state that represents each input symbol.
(logical) - Force parallel execution of ETTs in the Sequence Detector object. Useful when a higher number of ETTs is expected in the same Sequence Detector.
cleanKeys(machine_id=NULL)
Sequence Detector method for removing tokens and keys (see cleanKeys,HybridSequenceClassifier-method)
clone()
Sequence Detector method for cloning (see clone,HybridSequenceClassifier-method)
compressMachines(ratio=0.5)
Sequence Detector method for compressing the underlying set of ETTs (see compressMachines,HybridSequenceClassifier-method)
getMachineIdentifiers()
Sequence Detector method for retrieving identifiers for the underlying set of ETTs (see getMachineIdentifiers,HybridSequenceClassifier-method)
induceSubmachine(threshold, isolate=FALSE)
Sequence Detector method for performing statistical projections on the underlying set of ETTs (see induceSubmachine,HybridSequenceClassifier-method)
mergeMachines()
Sequence Detector method for merging the underlying set of ETTs (see mergeMachines,HybridSequenceClassifier-method)
plotMachines(machine_id=NULL)
Sequence Detector method for plotting the underlying set of ETTs (see plotMachines,HybridSequenceClassifier-method)
printMachines(machine_id=NULL, state=NULL, print_cache=TRUE, print_keys=TRUE)
Sequence Detector method for printing the underlying set of ETTs to the R console (see printMachines,HybridSequenceClassifier-method)
process(streams, learn=TRUE, give_explain=TRUE, threshold=NULL, debug=FALSE, out_filename=NULL, ...)
Sequence Detector method for processing a slice of the input streams (see process,HybridSequenceClassifier-method)
serialize()
Sequence Detector method for serializing the definitions of the underlying set of ETTs (see serialize,HybridSequenceClassifier-method)
serializeToList()
Sequence Detector method for serializing the definitions of the underlying set of ETTs to a list (see serializeToList,HybridSequenceClassifier-method)
setOutputPattern(states=c(), transitions=c(), pattern, machine_id=NULL)
Sequence Detector method for setting the output alphabet to the underlying set of ETTs (see setOutputPattern,HybridSequenceClassifier-method)
setPreprocessor(preprocessor)
Sequence Detector method for setting the pre-processor (see setPreprocessor,HybridSequenceClassifier-method)
setPreclassifier(preclassifier)
Sequence Detector method for setting the pre-classifier (see setPreclassifier,HybridSequenceClassifier-method)
setInputDefinitions(fields, timestamp_start_field, timestamp_finish_field, context_field=NULL, preclassifier=NULL, preprocessor=NULL, pattern_field=NULL)
Sequence Detector method for redefining the input definitions (see setInputDefinitions,HybridSequenceClassifier-method)
[1] D. Krleža, B. Vrdoljak, and M. Brčić, Latent Process Discovery using Evolving Tokenized Transducer, IEEE Access, vol. 7, pp. 169657 - 169676, Dec. 2019
Sequence Detector method for ETT projections. See the SeqDetect vignette for proper usage and cases. All projection changes are performed on the same Sequence Detector object.
## S4 method for signature 'HybridSequenceClassifier'
induceSubmachine(threshold,isolate=FALSE)
threshold | (integer) - A threshold for the ETT projection. All transitions that have an invocation statistic above the threshold are moved to a submachine. |
isolate | (logical) - After the regular sequences are moved to the submachine, the original parent can be removed, leaving only the most regular sequences. If TRUE, the parent ETT is removed and only the most regular sequences are left. |
Returns:
TRUE - projection was performed successfully
FALSE - no projection was performed.
st <- data.frame(product=c("P1","P2"),sales=c(5,76),alert=c(NA,NA))
for(i in 1:400) {
  st <- rbind(st,data.frame(product=c("P1","P2"),sales=c(10,58),alert=c(NA,NA)))
  st <- rbind(st,data.frame(product=c("P1","P2"),sales=c(20,31),alert=c(NA,NA)))
}
st <- rbind(st,data.frame(product=c("P1","P2"),sales=c(30,11),
                          alert=c("Sequence 1","Sequence 2")))
input_streams <- list(stream=st)
pp <- HSC_PP(c("product","sales","alert"),"sequence_id",auto_id=TRUE)
pc <- HSC_PC_Attribute("sales")
hsc <- HybridSequenceClassifier(c("sequence_id","product","sales","alert"),"sequence_id",
                                "sequence_id",context_field="product",preclassifier=pc,
                                preprocessor=pp,reuse_states=TRUE,pattern_field="alert")
hsc$process(input_streams,learn=TRUE)
hsc$cleanKeys()
hsc$induceSubmachine(200,isolate=TRUE)
hsc$printMachines()
Sequence Detector method for merging machines. See the SeqDetect vignette for details and examples.
## S4 method for signature 'HybridSequenceClassifier'
mergeMachines()
ldf1 <- data.frame(product=c("P1","P1","P1","P1"),sequence_id=c(1,3,5,7),
                   sales=c(5,76,123,1),alert=c(NA,NA,NA,"Alert P1"))
ldf2 <- data.frame(product=c("P2","P2","P2","P2"),sequence_id=c(2,4,6,8),
                   sales=c(21,76,123,42),alert=c(NA,NA,NA,"Alert P2"))
input_streams <- list(stream1=ldf1,stream2=ldf2)
pp <- HSC_PP(c("product","sales","alert","sequence_id"),"sequence_id")
pc <- HSC_PC_Attribute("sales")
hsc <- HybridSequenceClassifier(c("sequence_id","product","sales","alert"),
                                "sequence_id","sequence_id",context_field="product",
                                preclassifier=pc,preprocessor=pp,reuse_states=TRUE,
                                pattern_field="alert")
hsc$process(input_streams,learn=TRUE)
hsc$cleanKeys()
hsc$mergeMachines()
hsc$printMachines()
Sequence Detector method for plotting the machines in the Sequence Detector object. Plotting follows the output symbols of the states and transitions. Machines that do not have a small output alphabet might not be plotted fully and correctly.
## S4 method for signature 'HybridSequenceClassifier'
plotMachines(machine_id=NULL)
machine_id | (character) - A machine identifier that needs to be plotted. If NULL, all machines are plotted. |
ldf1 <- data.frame(product=c("P1","P1","P1","P1"),sequence_id=c(1,3,5,7),
                   sales=c(5,76,123,1),alert=c(NA,NA,NA,"Alert P1"))
ldf2 <- data.frame(product=c("P2","P2","P2","P2"),sequence_id=c(2,4,6,8),
                   sales=c(21,76,123,42),alert=c(NA,NA,NA,"Alert P2"))
input_streams <- list(stream1=ldf1,stream2=ldf2)
pp <- HSC_PP(c("product","sales","alert","sequence_id"),"sequence_id")
pc <- HSC_PC_Attribute("sales")
hsc <- HybridSequenceClassifier(c("sequence_id","product","sales","alert"),
                                "sequence_id","sequence_id",context_field="product",
                                preclassifier=pc,preprocessor=pp,reuse_states=TRUE,
                                pattern_field="alert")
hsc$process(input_streams,learn=TRUE)
hsc$cleanKeys()
hsc$mergeMachines()
hsc$plotMachines()
A method that all pre-processor classes need to implement. It is the code that aggregates and consolidates input data streams into one output event stream.
preprocess(x, streams, ...)
x | The pre-processor object. |
streams | A named list that comprises input data streams. Each input data stream is a data frame comprising fields declared while creating the HSC_PP object. |
... | An additional list of parameters that can be used by the pre-processor. |
Input streams can be created as
streams <- list(stream1=x1,stream2=x2,...)
where x1 is a data frame and stream1 is the name of the stream. All examples can be seen in the SeqDetect vignette.
Returns a list that comprises:
obj - A returning pre-processor object. Passed in the subsequent invocation as x.
res - An output event stream. A resulting data frame representing the output event stream that is ordered according to the timestamp / sequence field and comprises all declared fields.
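The consolidation step described above (select the declared fields, combine the named streams, order by the sequencing field) can be sketched in base R. This is an illustration of the idea only, not the HSC_PP implementation; consolidate_streams and the sample data frames are hypothetical.

```r
# Sketch only, base R: consolidate named input streams into one event stream
# ordered by a sequencing field. consolidate_streams is a hypothetical helper,
# not the HSC_PP implementation.
consolidate_streams <- function(streams, fields, timestamp_field) {
  # keep only the declared fields of every input stream
  selected <- lapply(streams, function(s) s[, fields, drop = FALSE])
  # stack all streams into a single data frame
  res <- do.call(rbind, selected)
  # order the events by the timestamp / sequence field
  res <- res[order(res[[timestamp_field]]), , drop = FALSE]
  rownames(res) <- NULL
  res
}

x1 <- data.frame(product = c("P1", "P1"), sequence_id = c(1, 3), sales = c(5, 76))
x2 <- data.frame(product = c("P2", "P2"), sequence_id = c(2, 4), sales = c(21, 76))
streams <- list(stream1 = x1, stream2 = x2)
res <- consolidate_streams(streams, c("product", "sales", "sequence_id"), "sequence_id")
print(res)
```

In this sketch the events of the two streams end up interleaved by sequence_id, which is the ordering property the res element is documented to have.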
Sequence Detector method for printing out the machines (ETTs) in the Sequence Detector object. See the SeqDetect vignette for proper usage and cases.
## S4 method for signature 'HybridSequenceClassifier'
printMachines(machine_id=NULL,state=NULL,print_cache=TRUE,print_keys=TRUE)
machine_id | (character) - If defined, print out only the machine that has the supplied identifier. If NULL, print out all machines. |
state | (character) - If defined, print out only the states that have the supplied identifier. If NULL, print out all states. |
print_cache | (logical) - Switch for printout of the cache. If FALSE, the cache printout is omitted. The cache can be quite big for each machine and state, and could potentially clutter the printout. |
print_keys | (logical) - Switch for printout of the current token set. If FALSE, the token set printout is omitted. The number of tokens can be considerable, and could potentially clutter the printout. |
ldf1 <- data.frame(product=c("P1","P1","P1","P1"),sequence_id=c(1,3,5,7),
                   sales=c(5,76,123,1),alert=c(NA,NA,NA,"Alert P1"))
ldf2 <- data.frame(product=c("P2","P2","P2","P2"),sequence_id=c(2,4,6,8),
                   sales=c(21,76,123,42),alert=c(NA,NA,NA,"Alert P2"))
input_streams <- list(stream1=ldf1,stream2=ldf2)
pp <- HSC_PP(c("product","sales","alert","sequence_id"),"sequence_id")
pc <- HSC_PC_Attribute("sales")
hsc <- HybridSequenceClassifier(c("sequence_id","product","sales","alert"),
                                "sequence_id","sequence_id",context_field="product",
                                preclassifier=pc,preprocessor=pp,reuse_states=TRUE,
                                pattern_field="alert")
hsc$process(input_streams,learn=TRUE)
hsc$cleanKeys()
hsc$mergeMachines()
hsc$printMachines()
Sequence Detector method for processing of input data streams. See the SeqDetect vignette for proper usage and cases.
## S4 method for signature 'HybridSequenceClassifier'
process(streams,learn=TRUE,give_explain=TRUE,threshold=NULL,debug=FALSE,
        out_filename=NULL, ...)
streams | (list, data.frame) - A named list that comprises input data streams. Each list element is a data frame that represents one input data stream. |
learn | (logical) - Are ETTs in the Sequence Detector extendable? If TRUE, the Sequence Detector learns new sequences from the supplied input data streams. |
give_explain | (logical) - Determines elements that will be returned by the method. If TRUE, output explanation and sequence statistical analysis will be returned as well. |
threshold | (integer) - The threshold for the pushing mechanism. Pushing will work only for transitions that are above the supplied threshold. If NULL, all transitions are taken into consideration. |
debug | (logical) - A switch for debug printout. |
out_filename | (character) - A filename where the consolidated data stream should be written. The written file is in the CSV format. If NULL, file writing is skipped. |
... | An additional list of parameters passed into the pre-processor and pre-classifier. |
A list that comprises the following elements:
stream - The consolidated stream.
If give_explain is TRUE then an additional element is:
explanation - Actual and potential output symbols for each data item of the consolidated data stream.
If give_explain is TRUE and time_series_sequence_stats is TRUE then an additional element is:
sequences - The complete sequence statistics for the input time-series data.
st <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"),
                 sales=c(2,12,18,16,18,24,8),
                 alert=c(NA,NA,NA,NA,NA,"Alert P45","Alert P134"))
input_streams <- list(stream=st)
pp <- HSC_PP(c("product","sales"),"sequence_id",auto_id=TRUE)
pc <- HSC_PC_Binning(0,100,40,"sales")
hsc <- HybridSequenceClassifier(c("product","sales","sequence_id"),
                                "sequence_id","sequence_id","product",pc,pp)
res <- hsc$process(input_streams)
message(res)
The sales dataset is taken from [2] and comprises one year of sales quantities for 811 products. We applied this dataset to test the Sequence Detector. The results are available in [1]. The results of the test are various statistics on detected sequences. The testing set of products is re-tested while the projection threshold is raised step by step, until no more sequences can be detected or the max_th parameter is reached.
sales_dataset_test(learning_set = 1:20, testing_set = 21:40, th_increment = 1, max_th = NULL)
learning_set | (vector) - A set of products to learn ETTs in the Sequence Detector. |
testing_set | (vector) - A set of products to test previously learned sales numbers. |
th_increment | (integer) - A threshold increment between two tests. |
max_th | (integer) - Maximal threshold for testing. When reached, no further tests and no further threshold increments are done. If NULL, re-testing is done while there are some sequences detected. |
A list that comprises sequence statistics for all tests and thresholds.
[1] D. Krleža, B. Vrdoljak, and M. Brčić, Latent Process Discovery using Evolving Tokenized Transducer, IEEE Access, vol. 7, pp. 169657 - 169676, Dec. 2019
[2] S. C. Tan and J. P. San Lau, Time series clustering: A superior alternative for market basket analysis, in Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013), Singapore, 2014, pp. 241–248.
The sepsis dataset is taken from the package eventdataR and used to test the Sequence Detector. The results are available in [1].
sepsis_dataset_test(induce_biomarker_decision_tree = TRUE, threshold = 75, debug = FALSE, hsc = NULL)
induce_biomarker_decision_tree |
(logical) - If FALSE, "Biomarker assessment" is one activity ignoring biomarker values. If TRUE, based on the biomarker values, several distinct "Biomarker assessment" activities are inferred. |
threshold |
(numeric) - Projection threshold. |
debug |
(logical) - Switch for debug printout. |
hsc |
(HybridSequenceClassifier) - An existing Sequence Detector that should be used instead of creating a new one. |
None
[1] D. Krleža, B. Vrdoljak, and M. Brčić, Latent Process Discovery using Evolving Tokenized Transducer, IEEE Access, vol. 7, pp. 169657–169676, Dec. 2019
Sequence Detector method for serializing. The Sequence Detector object must be serialized before it is saved. Otherwise, the C++ part of the object is not saved properly and cannot be restored later.
## S4 method for signature 'HybridSequenceClassifier'
serialize()
st <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"),
                 sales=c(2,12,18,16,18,24,8),
                 alert=c(NA,NA,NA,NA,NA,"Alert P45","Alert P134"))
input_streams <- list(stream=st)
pp <- HSC_PP(c("product","sales"),"sequence_id",auto_id=TRUE)
pc <- HSC_PC_Binning(0,100,40,"sales")
hsc <- HybridSequenceClassifier(c("product","sales","sequence_id"),
                                "sequence_id","sequence_id","product",pc,pp)
res <- hsc$process(input_streams)
hsc$serialize()
#saveRDS(hsc,"test.RDS")
# Previous line is commented due to the CRAN checking policies
Sequence Detector method for serializing to a list. The list can be saved, loaded and deserialized into a Sequence Detector object again using the deserializeFromList function.
## S4 method for signature 'HybridSequenceClassifier'
serializeToList()
Returns a list that comprises all Sequence Detector details.
HybridSequenceClassifier, deserializeFromList
st <- data.frame(product=c("P45","P134","P45","P134","P134","P45","P134"),
                 sales=c(2,12,18,16,18,24,8),
                 alert=c(NA,NA,NA,NA,NA,"Alert P45","Alert P134"))
input_streams <- list(stream=st)
pp <- HSC_PP(c("product","sales"),"sequence_id",auto_id=TRUE)
pc <- HSC_PC_Binning(0,100,40,"sales")
hsc <- HybridSequenceClassifier(c("product","sales","sequence_id"),
                                "sequence_id","sequence_id","product",pc,pp)
res <- hsc$process(input_streams)
hsc_list <- hsc$serializeToList()
#saveRDS(hsc_list,"test_list.RDS")
# Previous line is commented due to the CRAN checking policies
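Because the serialized form is an ordinary R list, saving and loading it is a plain saveRDS/readRDS round trip. The sketch below uses a placeholder list in place of the real output of serializeToList(), whose internal structure is not described here; the field names are purely illustrative.

```r
# Placeholder for the list returned by hsc$serializeToList(); the fields
# shown here are assumptions made for illustration only.
hsc_list <- list(machines = list(), note = "illustrative placeholder")

# Write to a temporary file so that nothing is left in the working directory.
path <- tempfile(fileext = ".RDS")
saveRDS(hsc_list, path)
restored <- readRDS(path)
unlink(path)

# The list read back is identical to the one that was saved.
same <- identical(hsc_list, restored)
```

The restored list would then be handed to deserializeFromList to rebuild the Sequence Detector object.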
A method for redefining the Sequence Detector input parameters. This method is useful when we want to reuse an existing Sequence Detector for a different set of input data streams. Based on the ETT definition, after the pre-processing and pre-classification stages we need to have a consolidated data frame that comprises a context identifier, sequence fields (timestamps or an incremental value) and classification values (input symbols). Not everything can be redefined: some settings, such as decay descriptors, remain as defined when the Sequence Detector was instantiated.
## S4 method for signature 'HybridSequenceClassifier'
setInputDefinitions(fields, timestamp_start_field, timestamp_finish_field, context_field=NULL, preclassifier=NULL, preprocessor=NULL, pattern_field=NULL)
fields |
(vector, character) - A vector of all relevant consolidated data stream fields. |
timestamp_start_field |
(character) - A name of the field having starting time point values. |
timestamp_finish_field |
(character) - A name of the field having finishing time point values. |
context_field |
(character) - A name of the context identifier field (key field). If NULL, then .key field is used for retrieving context identifier values. |
preclassifier |
(HSC_PC) - A pre-classifier object. If NULL, the Sequence Detector creates a new HSC_PC_None pre-classifier, which means that the input consolidated data stream must have a .clazz field for retrieving classification values (input symbols in the underlying ETTs). |
preprocessor |
(HSC_PP) - A pre-processing object. If NULL, the Sequence Detector creates a new HSC_PP pre-processor having the same fields as defined in the fields parameter, and the ordering timestamp field as defined in timestamp_start_field. |
pattern_field |
(character) - A name of the field having output symbol values, i.e., relational ETT classification output. |
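To make the expected shape of the consolidated data frame concrete, the base R sketch below builds a small example and checks it against the field names that would be passed to setInputDefinitions. The column names start and finish, the class labels, and the missing_fields helper are assumptions made for this illustration. The .clazz requirement applies to the case where no pre-classifier is supplied.

```r
# Illustrative consolidated stream. Only .key and .clazz are names taken
# from the argument descriptions; the other names are hypothetical.
consolidated <- data.frame(
  .key   = c("P45", "P134", "P45"),   # context identifier
  start  = c(1, 2, 3),                # starting time points
  finish = c(1, 2, 3),                # finishing time points
  .clazz = c("low", "low", "high"),   # classification values (input symbols)
  stringsAsFactors = FALSE
)

# Hypothetical helper: lists the fields that the consolidated stream lacks.
# When context_field is NULL, the .key field is expected instead.
missing_fields <- function(df, timestamp_start_field, timestamp_finish_field,
                           context_field = NULL) {
  key <- if (is.null(context_field)) ".key" else context_field
  required <- c(key, timestamp_start_field, timestamp_finish_field, ".clazz")
  setdiff(required, names(df))
}

ok <- missing_fields(consolidated, "start", "finish")
bad <- missing_fields(consolidated, "start", "finish", context_field = "product")
```

Here ok is empty, while bad reports the absent product column.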
Sequence Detector method for assigning output symbols to states and transitions. See the SeqDetect vignette for proper usage and cases.
## S4 method for signature 'HybridSequenceClassifier'
setOutputPattern(states=c(),transitions=c(),pattern,machine_id=NULL)
states |
(vector,character) - A character vector that comprises state identifiers. The supplied symbol (output alphabet, pattern parameter) is assigned to these states. |
transitions |
(vector,character) - A character vector that comprises transition identifiers. The supplied symbol (output alphabet, pattern parameter) is assigned to these transitions. |
pattern |
(character) - An output symbol, an element of the output alphabet, that needs to be assigned to supplied states and transitions. |
machine_id |
(character) - If defined, the output symbol assignment applies only to the machine having this identifier. If NULL, the output symbol assignment applies to all machines (ETTs) in this Sequence Detector object. |
Sequence Detector method for re-setting the pre-classifier object. This is useful when we want to reuse an existing Sequence Detector for new input data streams that have a different structure.
## S4 method for signature 'HybridSequenceClassifier'
setPreclassifier(preclassifier)
preclassifier |
( |
Sequence Detector method for re-setting the pre-processor object. This is useful when we want to reuse an existing Sequence Detector for new input data streams that have a different structure.
## S4 method for signature 'HybridSequenceClassifier'
setPreprocessor(preprocessor)
preprocessor |
( |
A synthetic process that was introduced in the process mining agenda [1]. The original event log introduced in [1] did not comprise any timestamps, and a process discovery algorithm was expected to infer the event ordering from the event position in the log. ETT and newer process discovery algorithms require events to have at least some sort of timing, so timing was added for this test. It is worth noting that the given event log has some parallel activities, which should be detected by the process discovery algorithm. The final results of this test are described in [2].
synthetic_test_agenda(label_aspect=1)
label_aspect |
(numeric) - A vector of all relevant consolidated data stream fields. |
[1] W. M. P. van der Aalst and A. J. M. M. Weijters, Process mining: a research agenda, Computers in Industry, vol. 53, no. 2, pp. 231–244, Apr. 2004
[2] D. Krleža, B. Vrdoljak, and M. Brčić, Latent Process Discovery using Evolving Tokenized Transducer, IEEE Access, vol. 7, pp. 169657–169676, Dec. 2019
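One simple way to give a timestamp-free event log "some sort of timing" is to derive it from each event's position within its case. The base R sketch below shows that idea on a made-up log. It is an assumption about how such timing could be added, not a description of how the test data was actually prepared, and the column names are hypothetical.

```r
# Made-up event log without timestamps; rows are in log order.
log <- data.frame(
  case     = c("c1", "c1", "c2", "c1", "c2"),
  activity = c("A", "B", "A", "C", "B"),
  stringsAsFactors = FALSE
)

# Position-based timing: the n-th event of each case gets timestamp n.
log$timestamp <- ave(seq_along(log$case), log$case, FUN = seq_along)
```

Timing of this kind imposes a strict order within each case, so expressing parallel activities would need a richer scheme, such as separate start and finish values.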