Title: | Detecting Anomalies in Data |
---|---|
Description: | Implements Collective And Point Anomaly (CAPA) Fisch, Eckley, and Fearnhead (2022) <doi:10.1002/sam.11586>, Multi-Variate Collective And Point Anomaly (MVCAPA) Fisch, Eckley, and Fearnhead (2021) <doi:10.1080/10618600.2021.1987257>, Proportion Adaptive Segment Selection (PASS) Jeng, Cai, and Li (2012) <doi:10.1093/biomet/ass059>, and Bayesian Abnormal Region Detector (BARD) Bardwell and Fearnhead (2017) <doi:10.1214/16-BA998>. These methods are for the detection of anomalies in time series data. Further information regarding the use of this package along with detailed examples can be found in Fisch, Grose, Eckley, Fearnhead, and Bardwell (2024) <doi:10.18637/jss.v110.i01>. |
Authors: | Alex Fisch [aut], Daniel Grose [aut, cre], Lawrence Bardwell [aut, ctb], Idris Eckley [aut, ths], Paul Fearnhead [aut, ths] |
Maintainer: | Daniel Grose <[email protected]> |
License: | GPL |
Version: | 4.3.3 |
Built: | 2024-10-31 21:26:57 UTC |
Source: | CRAN |
The anomaly package provides methods for detecting collective and point anomalies in both univariate and multivariate settings.
The anomaly package implements a number of recently proposed methods for anomaly detection. For univariate data there is the Collective And Point Anomaly (CAPA) method of Fisch, Eckley and Fearnhead (2022a), which can detect both collective and point anomalies. For multivariate data there are three methods: the multivariate extension of CAPA of Fisch, Eckley and Fearnhead (2022b), the Proportion Adaptive Segment Selection (PASS) method of Jeng, Cai and Li (2012), and a Bayesian approach, the Bayesian Abnormal Region Detector (BARD) of Bardwell and Fearnhead (2017).
The multivariate CAPA method and PASS are similar in that, for a given segment, they use a likelihood-based approach to measure the evidence that it is anomalous for each component of the multivariate data stream, and then merge this evidence across components. They differ in how they merge this evidence, with PASS using higher criticism (Donoho and Jin, 2004) and CAPA using a penalised likelihood approach. One disadvantage of the higher criticism approach for merging evidence is that it can lose power when only one or a very small number of components are anomalous. Furthermore, CAPA also allows for point anomalies in otherwise normal segments of data, and can detect collective anomalies more robustly when there are point anomalies in the data. CAPA can also allow for the anomalous segments to be slightly misaligned across different components.
The BARD method considers a similar model to that of CAPA or PASS, but is Bayesian, so its basic output is a set of samples from the posterior distribution for where the collective anomalies are and which components are anomalous. It does not allow for point anomalies. As with any Bayesian method, it requires the user to specify suitable priors, but the output is more flexible and allows uncertainty about the anomalies to be quantified more directly.
Fisch ATM, Grose D, Eckley IA, Fearnhead P, Bardwell L (2024). “anomaly: Detection of Anomalous Structure in Time Series Data.” Journal of Statistical Software, 110(1), 1–24. doi:10.18637/jss.v110.i01.
Fisch ATM, Eckley IA, Fearnhead P (2022a). “A linear time method for the detection of collective and point anomalies.” Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(4), 494-508. doi:10.1002/sam.11586.
Fisch ATM, Eckley IA, Fearnhead P (2022b). “Subset Multivariate Collective and Point Anomaly Detection.” Journal of Computational and Graphical Statistics, 31(2), 574-585. doi:10.1080/10618600.2021.1987257.
Jeng XJ, Cai TT, Li H (2012). “Simultaneous discovery of rare and common segment variants.” Biometrika, 100(1), 157-172. ISSN 0006-3444, doi:10.1093/biomet/ass059, https://academic.oup.com/biomet/article/100/1/157/193108.
Bardwell L, Fearnhead P (2017). “Bayesian Detection of Abnormal Segments in Multiple Time Series.” Bayesian Anal., 12(1), 193–218.
Donoho D, Jin J (2004). “Higher Criticism for Detecting Sparse Heterogeneous Mixtures.” The Annals of Statistics, 32(3), 962–994. doi:10.1214/009053604000000265.
# Use univariate CAPA to analyse simulated data
library("anomaly")
set.seed(0)
x <- rnorm(5000)
x[401:500] <- rnorm(100, 4, 1)
x[1601:1800] <- rnorm(200, 0, 0.01)
x[3201:3500] <- rnorm(300, 0, 10)
x[c(1000, 2000, 3000, 4000)] <- rnorm(4, 0, 100)
# standardise robustly using the median and the mad
x <- (x - median(x)) / mad(x)
res <- capa(x)
# view results
summary(res)
# visualise results
plot(res)

# Use multivariate CAPA to analyse simulated data
library("anomaly")
data("simulated")
# set penalties
beta <- 2 * log(ncol(sim.data):1)
beta[1] <- beta[1] + 3 * log(nrow(sim.data))
res <- capa(sim.data, type = "mean", min_seg_len = 2, beta = beta)
# view results
summary(res)
# visualise results
plot(res, subset = 1:20)

# Use PASS to analyse simulated multivariate data
library("anomaly")
data("simulated")
res <- pass(sim.data, max_seg_len = 20, alpha = 3)
# view results
collective_anomalies(res)
# visualise results
plot(res)

# Use BARD to analyse simulated multivariate data
library("anomaly")
data("simulated")
bard.res <- bard(sim.data)
# sample from the BARD result
sampler.res <- sampler(bard.res, gamma = 1/3, num_draws = 1000)
# view results
show(sampler.res)
# visualise results
plot(sampler.res, marginals = TRUE)
Implements the BARD (Bayesian Abnormal Region Detector) procedure of Bardwell and Fearnhead (2017). BARD is a fully Bayesian inference procedure which is able to give measures of uncertainty about the number and location of anomalous regions. It uses negative binomial prior distributions on the lengths of anomalous and non-anomalous regions as well as a uniform prior for the means of anomalous regions. Inference is conducted by solving a set of recursions. To reduce computational and storage costs a resampling step is included.
bard(
  x,
  p_N = 1/(nrow(x) + 1),
  p_A = 5/nrow(x),
  k_N = 1,
  k_A = (5 * p_A)/(1 - p_A),
  pi_N = 0.9,
  paffected = 0.05,
  lower = 2 * sqrt(log(nrow(x))/nrow(x)),
  upper = max(x),
  alpha = 1e-04,
  h = 0.25
)
x |
A numeric matrix with n rows and p columns containing the data which is to be inspected. The time series data classes ts, xts, and zoo are also supported. |
p_N |
Hyper-parameter of the negative binomial distribution for the length of non-anomalous segments (probability of success). Defaults to 1/(nrow(x) + 1). |
p_A |
Hyper-parameter of the negative binomial distribution for the length of anomalous segments (probability of success). Defaults to 5/nrow(x). |
k_N |
Hyper-parameter of the negative binomial distribution for the length of non-anomalous segments (size). Defaults to 1. |
k_A |
Hyper-parameter of the negative binomial distribution for the length of anomalous segments (size). Defaults to (5 * p_A)/(1 - p_A). |
pi_N |
Probability that an anomalous segment is followed by a non-anomalous segment. Defaults to 0.9. |
paffected |
Proportion of the variates believed to be affected by any given anomalous segment. Defaults to 5%. This parameter is relatively robust to being mis-specified and is studied empirically in Section 5.1 of Bardwell and Fearnhead (2017). |
lower |
The lower limit of the prior uniform distribution for the mean of an anomalous segment. Defaults to 2 * sqrt(log(nrow(x))/nrow(x)). |
upper |
The upper limit of the prior uniform distribution for the mean of an anomalous segment. Defaults to max(x). |
alpha |
Threshold used to control the resampling in the approximation of the posterior distribution at each time step. A sensible default is 1e-4. Decreasing alpha increases the accuracy of the posterior distribution but also increases the computational complexity of the algorithm. |
h |
The step size in the numerical integration used to find the marginal likelihood. The quadrature points are located from lower to upper in steps of h. Defaults to 0.25. |
An instance of the S4 class bard.class containing the data x, procedure parameter values, and the results.
This function provides default hyper-parameters for the two segment length distributions. We chose these to be quite flexible for a range of problems. For non-anomalous segments a geometric distribution was selected having an average segment length of n, with the standard deviation being of the same order. For anomalous segments we chose parameters that give an average length of 5 and a variance of n. These may not be suitable for all problems and the user is encouraged to tune these parameters.
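The default hyper-parameters can be checked against the stated means and variances with a few lines of base R. This is only a sketch: it assumes the negative binomial is parameterised as in stats::dnbinom (size k, success probability p, counting failures), and the helper name nb_moments and the value n = 500 are illustrative rather than part of the package.

```r
# Mean and variance of a negative binomial with size k and success
# probability p, in the parameterisation used by stats::dnbinom.
nb_moments <- function(k, p) {
  c(mean = k * (1 - p) / p, variance = k * (1 - p) / p^2)
}

n <- 500                        # hypothetical number of observations

# Defaults for non-anomalous segments (k_N = 1 gives a geometric law)
p_N <- 1 / (n + 1)
k_N <- 1

# Defaults for anomalous segments
p_A <- 5 / n
k_A <- (5 * p_A) / (1 - p_A)

normal_moments <- nb_moments(k_N, p_N)      # mean n, sd of order n
anomalous_moments <- nb_moments(k_A, p_A)   # mean 5, variance n

print(normal_moments)
print(anomalous_moments)
```

Changing n shows how the priors scale with the length of the series, which is a useful starting point when tuning p_N, p_A, k_N and k_A by hand.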
Bardwell L, Fearnhead P (2017). “Bayesian Detection of Abnormal Segments in Multiple Time Series.” Bayesian Anal., 12(1), 193–218.
Fisch ATM, Grose D, Eckley IA, Fearnhead P, Bardwell L (2024). “anomaly: Detection of Anomalous Structure in Time Series Data.” Journal of Statistical Software, 110(1), 1–24. doi:10.18637/jss.v110.i01.
library(anomaly)
data(simulated)
# run bard
bard.res <- bard(sim.data, alpha = 1e-3, h = 0.5)
# sample from the posterior and inspect the anomalies
sampler.res <- sampler(bard.res)
collective_anomalies(sampler.res)
plot(sampler.res, marginals = TRUE)
A technique for detecting anomalous segments and points based on CAPA (Collective And Point Anomalies) by Fisch et al. (2022).
This is a generic method that can be used for both univariate and multivariate data. The specific method that is used for the analysis is deduced by capa from the dimensions of the data.
The input data is either a vector (in the case of a univariate time series) or an array with p columns (if the time series is p-dimensional). The CAPA procedure assumes that each component of the time series is standardised so that the non-anomalous segments of each component have mean 0 and variance 1. This may require pre-processing/standardising, for example using the median of each component as a robust estimate of its mean, and the mad (median absolute deviation from the median) estimator as a robust estimate of its standard deviation.
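The robust standardisation described above can be sketched in base R for data with several components. The helper name robust_standardise and the toy data are assumptions made for illustration; the package does not export a function of this name.

```r
# Centre each column by its median and scale it by its mad, so that the
# typical (non-anomalous) behaviour has roughly mean 0 and variance 1.
robust_standardise <- function(x) {
  x <- as.matrix(x)                       # a vector becomes one column
  centres <- apply(x, 2, median)
  scales <- apply(x, 2, mad)
  centred <- sweep(x, 2, centres, "-")
  sweep(centred, 2, scales, "/")
}

set.seed(1)
# Toy data: two components with different locations and scales
raw <- cbind(rnorm(200, mean = 10, sd = 3), rnorm(200, mean = -5, sd = 0.5))
z <- robust_standardise(raw)

print(apply(z, 2, median))
print(apply(z, 2, mad))
```

The median and the mad are used rather than the mean and standard deviation because they are much less affected by the anomalies that the analysis is trying to find.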
capa(
  x,
  beta,
  beta_tilde,
  type = "meanvar",
  min_seg_len = 10,
  max_seg_len = Inf,
  max_lag = 0
)
x |
A numeric matrix with n rows and p columns containing the data which is to be inspected. The time series data classes ts, xts, and zoo are also supported. |
beta |
A numeric vector of length p giving the marginal penalties. If beta is missing and p == 1 then beta = 3log(n) when the type is "mean" or "robustmean", and beta = 4log(n) otherwise. If beta is missing and p > 1, type ="meanvar" or type = "mean" and max_lag > 0 then it defaults to the penalty regime 2' described in Fisch, Eckley and Fearnhead (2022). If beta is missing and p > 1, type = "mean"/"meanvar" and max_lag = 0 it defaults to the pointwise minimum of the penalty regimes 1, 2, and 3 in Fisch, Eckley and Fearnhead (2022). |
beta_tilde |
A numeric constant indicating the penalty for adding an additional point anomaly. If beta_tilde is missing it defaults to 3log(np), where n and p are the data dimensions. |
type |
A string indicating which type of deviations from the baseline are considered. Can be "meanvar" for collective anomalies characterised by joint changes in mean and variance (the default), "mean" for collective anomalies characterised by changes in mean only, or "robustmean" (only allowed when p = 1) for collective anomalies characterised by changes in mean only which can be polluted by outliers. |
min_seg_len |
An integer indicating the minimum length of epidemic changes. It must be at least 2 and defaults to 10. |
max_seg_len |
An integer indicating the maximum length of epidemic changes. It must be at least min_seg_len and defaults to Inf. |
max_lag |
A non-negative integer indicating the maximum start or end lag. Only useful for multivariate data. Default value is 0. |
An instance of an S4 class of type capa.class.
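For univariate data the default penalties described in the arguments reduce to simple formulas. The sketch below reproduces them in base R; the helper name capa_default_penalties is hypothetical, and only the p = 1 rules stated above are covered, not the multivariate penalty regimes.

```r
# Default CAPA penalties for a univariate series of length n, following
# the rules stated for the beta and beta_tilde arguments (with p = 1).
capa_default_penalties <- function(n, type = "meanvar") {
  stopifnot(type %in% c("meanvar", "mean", "robustmean"))
  beta <- if (type %in% c("mean", "robustmean")) 3 * log(n) else 4 * log(n)
  beta_tilde <- 3 * log(n * 1)            # 3 log(np) with p = 1
  list(beta = beta, beta_tilde = beta_tilde)
}

pen_mean <- capa_default_penalties(n = 5000, type = "mean")
pen_meanvar <- capa_default_penalties(n = 5000)

print(pen_mean)
print(pen_meanvar)
```

Values computed this way can be passed explicitly through the beta and beta_tilde arguments of capa, and then scaled up or down to make the procedure more or less conservative.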
Fisch ATM, Eckley IA, Fearnhead P (2022). “A linear time method for the detection of collective and point anomalies.” Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(4), 494-508. doi:10.1002/sam.11586.
Fisch ATM, Grose D, Eckley IA, Fearnhead P, Bardwell L (2024). “anomaly: Detection of Anomalous Structure in Time Series Data.” Journal of Statistical Software, 110(1), 1–24. doi:10.18637/jss.v110.i01.
library(anomaly)
# load the simulated multivariate data
data(simulated)
res <- capa(sim.data, type = "mean", min_seg_len = 2, max_lag = 5)
collective_anomalies(res)
plot(res)
Creates a data frame containing collective anomaly locations, lags and changes in mean and variance as detected by capa, pass, and sampler.
For an object created by capa, returns a data frame with columns containing the start and end position of the anomaly and the change in mean due to the anomaly. For multivariate data, returns a data frame with columns containing the start and end position of the anomaly, the variates affected by the anomaly, and their start and end lags. When type="mean"/"robustmean" only the change in mean is reported. When type="meanvar" both the change in mean and the change in variance are included. If merged=FALSE (the default), then all the collective anomalies are processed individually even if they are common across multiple variates. If merged=TRUE, then the collective anomalies are grouped together across all variates that they appear in.
For an object produced by pass or sampler, returns a data frame containing the start, end and strength of the collective anomalies.
collective_anomalies(object, ...)

## S4 method for signature 'bard.sampler.class'
collective_anomalies(object)

## S4 method for signature 'capa.class'
collective_anomalies(object, epoch = nrow(object@data), merged = FALSE)

## S4 method for signature 'pass.class'
collective_anomalies(object)
object |
An instance of an S4 class produced by capa, pass, or sampler. |
... |
TODO |
epoch |
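Positive integer. CAPA methods are sequential and, as such, can generate results up to, and including, any epoch within the data series. This can be controlled by the value of epoch, which defaults to nrow(object@data), the last epoch in the data. |
merged |
Boolean value. If merged=TRUE then the collective anomalies are grouped together across all variates that they appear in. Defaults to FALSE. |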
A data frame.
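As an illustration of the merged and epoch arguments, the following sketch runs multivariate CAPA on the simulated data shipped with the package and extracts the collective anomalies in three ways. It assumes the anomaly package is installed; the value epoch = 250 is an arbitrary choice for the example.

```r
library(anomaly)
data(simulated)   # loads the matrix sim.data

res <- capa(sim.data, type = "mean", min_seg_len = 2)

# One row per affected variate (the default, merged = FALSE)
per_variate <- collective_anomalies(res)

# Anomalies grouped across the variates they appear in
grouped <- collective_anomalies(res, merged = TRUE)

# Only what had been detected by epoch 250 of the series
early <- collective_anomalies(res, epoch = 250)

print(head(per_variate))
print(head(grouped))
print(head(early))
```

Comparing early with per_variate shows how the sequential nature of CAPA lets the same fitted object be queried as if the data had only been observed up to a given time.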
Temperature sensor data of an internal component of a large, industrial machine. The data contains three known anomalies. The first anomaly is a planned shutdown of the machine. The second anomaly is difficult to detect and directly led to the third anomaly, a catastrophic failure of the machine. The data consists of 22695 observations of machine temperature recorded at 5 minute intervals along with the date and time of the measurement. The data was obtained from the Numenta Anomaly Benchmark (Ahmad et al. 2017), which can be found at https://github.com/numenta/NAB.
data(machinetemp)
A data frame with 22695 rows and 2 columns. The first column contains the date and time of the temperature measurement. The second column contains the machine temperature.
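A possible way to analyse this data set with univariate CAPA is sketched below. It assumes that the anomaly package is installed and that data(machinetemp) creates a data frame called machinetemp; the temperature is taken by position (second column) rather than by name, since the column names are not stated here. The choice of type = "robustmean" is illustrative.

```r
library(anomaly)
data(machinetemp)

# Assumption: the loaded object is a data frame named `machinetemp`
temperature <- machinetemp[[2]]   # second column holds the temperature

# Robust standardisation, as CAPA expects mean 0 and variance 1
x <- (temperature - median(temperature)) / mad(temperature)

# Changes in mean that tolerate outliers (allowed because p = 1)
res <- capa(x, type = "robustmean")

segments <- collective_anomalies(res)
points <- point_anomalies(res)

print(segments)
print(head(points))
```

The detected segments can then be compared with the three known anomalies described above.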
Ahmad S, Lavin A, Purdy S, Agha Z (2017). “Unsupervised real-time anomaly detection for streaming data.” Neurocomputing, 262, 134 - 147. ISSN 0925-2312, doi:10.1016/j.neucom.2017.04.070, Online Real-Time Learning Strategies for Data Streams, https://www.sciencedirect.com/science/article/pii/S0925231217309864/.
Implements the PASS (Proportion Adaptive Segment Selection) procedure of Jeng et al. (2012). PASS uses a higher criticism statistic to pool the information about the presence or absence of a collective anomaly across the components. It uses Circular Binary Segmentation to detect multiple collective anomalies.
pass(x, alpha = 2, lambda = NULL, max_seg_len = 10, min_seg_len = 1)
x |
A numeric matrix with n rows and p columns containing the data which is to be inspected. The time series data classes ts, xts, and zoo are also supported. |
alpha |
A positive integer. This value is used to stabilise the higher criticism based test statistic used by PASS, leading to a better finite sample familywise error rate. However, anomalies affecting fewer than alpha components will in all likelihood escape detection. The default is 2. |
lambda |
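A positive real value setting the threshold value for the familywise Type 1 error. Defaults to NULL, in which case pass chooses the threshold itself. |
max_seg_len |
A positive integer giving the maximum segment length. Defaults to 10. |
min_seg_len |
A positive integer giving the minimum segment length. It must not exceed max_seg_len. Defaults to 1. |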
An instance of the S4 class pass.class containing the data x, procedure parameter values, and the results.
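To give a feel for how a higher criticism statistic pools evidence across components, here is a minimal base R sketch of the textbook statistic of Donoho and Jin (2004) applied to a vector of p-values. It is not the implementation used by pass: the function name higher_criticism is hypothetical, and treating alpha as the smallest rank considered is an assumption made to mirror the role of the alpha argument.

```r
# Textbook higher criticism statistic for a vector of p-values.
# Ranks below `alpha` are ignored, which stabilises the statistic at the
# cost of missing signals that affect fewer than `alpha` components.
higher_criticism <- function(pvals, alpha = 2) {
  p <- length(pvals)
  stopifnot(alpha >= 1, alpha <= p, all(pvals > 0), all(pvals < 1))
  sorted <- sort(pvals)
  i <- seq_len(p)
  hc <- sqrt(p) * (i / p - sorted) / sqrt(sorted * (1 - sorted))
  max(hc[i >= alpha])
}

# Evenly spread p-values: no evidence of an anomaly
hc_null <- higher_criticism((1:10) / 11)

# Three of ten components have very small p-values
hc_signal <- higher_criticism(c(rep(1e-6, 3), rep(0.5, 7)))

print(c(null = hc_null, signal = hc_signal))
```

The statistic stays small when the p-values look uniform and becomes large when a subset of components shows strong evidence, which is the behaviour PASS exploits when deciding whether a segment is anomalous.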
Jeng XJ, Cai TT, Li H (2012). “Simultaneous discovery of rare and common segment variants.” Biometrika, 100(1), 157-172. ISSN 0006-3444, doi:10.1093/biomet/ass059, https://academic.oup.com/biomet/article/100/1/157/193108.
Fisch ATM, Grose D, Eckley IA, Fearnhead P, Bardwell L (2024). “anomaly: Detection of Anomalous Structure in Time Series Data.” Journal of Statistical Software, 110(1), 1–24. doi:10.18637/jss.v110.i01.
library(anomaly)
# load the simulated multivariate data
data(simulated)
res <- pass(sim.data)
summary(res)
plot(res, variate_names = TRUE)
Plot methods for S4 objects returned by capa, pass, and sampler.
The plot can either be a line plot or a tile plot, the type produced depending on the options provided to the plot function and/or the dimensions of the data associated with the S4 object.
## S4 method for signature 'bard.sampler.class'
plot(x, subset, variate_names, tile_plot, marginals = FALSE)

## S4 method for signature 'capa.class'
plot(x, subset, variate_names = FALSE, tile_plot, epoch = nrow(x@data))

## S4 method for signature 'pass.class'
plot(x, subset, variate_names = FALSE, tile_plot)
x |
An instance of an S4 class produced by capa, pass, or sampler. |
subset |
A numeric vector specifying a subset of the variates to be displayed. Default value is all of the variates present in the data. |
variate_names |
Logical value indicating if variate names should be displayed on the plot. This is useful when a large number of variates are being displayed as it makes the visualisation easier to interpret. Default value is FALSE. |
tile_plot |
Logical value. If TRUE then a tile plot of the data is produced. The data displayed in the tile plot is normalised to values in [0,1] for each variate. This type of plot is useful when the data contains a large number of variates. The default value is TRUE if the number of variates is greater than 20. |
marginals |
Logical value. If marginals=TRUE the plot also shows the marginal probability of each time point being anomalous. Defaults to FALSE. |
epoch |
Positive integer. CAPA methods are sequential and, as such, can generate results up to, and including, any epoch within the data series. This can be controlled by the value of epoch, which defaults to nrow(x@data), the last epoch in the data. |
A ggplot object.
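The per-variate normalisation to [0,1] mentioned for tile plots can be imitated in base R. The sketch assumes a simple min-max rescaling of each column, which may differ from what the plot methods do internally; rescale_columns is a hypothetical helper, not a package function.

```r
# Rescale every column of a matrix to the interval [0, 1] so that
# variates on different scales can share one colour scale.
# Note: a constant column would produce NaN (division by zero).
rescale_columns <- function(x) {
  x <- as.matrix(x)
  apply(x, 2, function(v) (v - min(v)) / (max(v) - min(v)))
}

set.seed(2)
# Toy data: three variates on very different scales
raw <- cbind(rnorm(100, 0, 1), rnorm(100, 50, 10), rnorm(100, -3, 0.1))
scaled <- rescale_columns(raw)

print(apply(scaled, 2, range))
```

Rescaling each variate separately keeps a single high-variance variate from washing out the colour contrast of the others.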
Creates a data frame containing point anomaly locations and strengths as detected by capa.
Returns a data frame with columns containing the position, strength, and (for multivariate data) the variate number.
point_anomalies(object, ...)

## S4 method for signature 'capa.class'
point_anomalies(object, epoch = nrow(object@data))
object |
An instance of an S4 class produced by capa. |
... |
TODO |
epoch |
Positive integer. CAPA methods are sequential and, as such, can generate results up to, and including, any epoch within the data series. This can be controlled by the value of epoch, which defaults to nrow(object@data), the last epoch in the data. |
A data frame.
See also: capa.
Draw samples from the posterior distribution to give the locations of anomalous segments.
sampler(bard_result, gamma = 1/3, num_draws = 1000)
bard_result |
An instance of the S4 class bard.class, as returned by bard. |
gamma |
Parameter of the loss function giving the cost of a false negative, i.e. incorrectly classifying an anomalous point as non-anomalous. For more details see Section 3.5 of Bardwell and Fearnhead (2017). Defaults to 1/3. |
num_draws |
Number of samples to draw from the posterior distribution. Defaults to 1000. |
Returns an instance of the S4 class bard.sampler.class.
Bardwell L, Fearnhead P (2017). “Bayesian Detection of Abnormal Segments in Multiple Time Series.” Bayesian Anal., 12(1), 193–218.
Fisch ATM, Grose D, Eckley IA, Fearnhead P, Bardwell L (2024). “anomaly: Detection of Anomalous Structure in Time Series Data.” Journal of Statistical Software, 110(1), 1–24. doi:10.18637/jss.v110.i01.
library(anomaly)
data(simulated)
# run bard
res <- bard(sim.data, alpha = 1e-3, h = 0.5)
# sample
sampler(res)
Displays S4 objects produced by capa, pass, bard, and sampler.
The output displayed depends on the type of S4 object passed to the method. For all types, the output indicates whether the data is univariate or multivariate, the number of observations in the data, and the type of change being detected.
## S4 method for signature 'bard.class'
show(object)

## S4 method for signature 'bard.sampler.class'
show(object)

## S4 method for signature 'capa.class'
show(object)

## S4 method for signature 'pass.class'
show(object)
object |
An instance of an S4 class produced by capa, pass, bard, or sampler. |
A simulated data set for use in the examples and vignettes. The data consists of 500 observations on 20 variates drawn from the standard normal distribution. Within the data there are three multivariate anomalies of length 15 located at t=100, t=200, and t=300 for which the mean changes from 0 to 2. The anomalies affect variates 1 to 8, 1 to 12 and 1 to 16 respectively.
data(simulated)
A matrix with 500 rows and 40 columns.
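Data with the structure given in the description can be generated in base R, which is handy for experimenting with different anomaly sizes. This sketch is not the code that produced sim.data: it uses the 20 variates stated in the description, assumes each anomaly starts at the quoted time and lasts 15 observations, and picks an arbitrary seed.

```r
set.seed(42)          # arbitrary seed, for reproducibility only

n <- 500              # observations
p <- 20               # variates, as stated in the description
sim <- matrix(rnorm(n * p), nrow = n, ncol = p)

starts <- c(100, 200, 300)    # assumed start times of the anomalies
affected <- c(8, 12, 16)      # variates 1 to 8, 1 to 12, 1 to 16
seg_len <- 15

for (j in seq_along(starts)) {
  rows <- starts[j] + seq_len(seg_len) - 1
  cols <- seq_len(affected[j])
  sim[rows, cols] <- sim[rows, cols] + 2    # mean changes from 0 to 2
}

print(dim(sim))
print(mean(sim[100:114, 1:8]))     # close to 2 inside the first anomaly
print(mean(sim[100:114, 17:20]))   # close to 0 in unaffected variates
```

A matrix built this way can be passed to capa, pass, or bard in place of sim.data.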
Summary methods for S4 objects returned by capa, pass, and sampler. The output displayed depends on the type of object passed to summary. For all types, the output indicates whether the data is univariate or multivariate, the number of observations in the data, and the type of change being detected.
## S4 method for signature 'bard.class'
summary(object, ...)

## S4 method for signature 'bard.sampler.class'
summary(object, ...)

## S4 method for signature 'capa.class'
summary(object, epoch = nrow(object@data))

## S4 method for signature 'pass.class'
summary(object, ...)
object |
An instance of an S4 class produced by capa, pass, bard, or sampler. |
... |
Ignored. |
epoch |
Positive integer. CAPA methods are sequential and, as such, can generate results up to, and including, any epoch within the data series. This can be controlled by the value of epoch, which defaults to nrow(object@data), the last epoch in the data. |