This is an accompanying vignette for the package isotree presenting a short introduction to the Isolation Forest family of algorithms as implemented in said package. For more information about it, see the GitHub page.
Isolation Forest is an unsupervised decision-tree-based algorithm originally developed for outlier detection in tabular data, which consists in splitting sub-samples of the data according to some attribute/feature/column at random. The idea is that, the rarer the observation, the more likely it is that a random split on some feature would put outliers alone in one branch, and the fewer splits it will take to isolate (create a partition in which only one point is present) an outlier observation like this.
The intuition behind it is very simple: if there is an outlier in the data and we pick a column at random in which the value for the outlier point is different from the rest of the observations, and then we select an arbitrary threshold uniformly at random within the range of that column and divide all the points into two groups according to whether they are higher or lower than the randomly-chosen threshold for that column, then there is a higher chance that the outlier point would end up in the smaller partition than in the larger partition.
Of course, outliers are typically not defined by just having one extreme value in one column, and a good outlier detection method needs to look at the relationships between different variables and their combinations. One potential way to do this is by building a so-called “isolation tree”, which consists of repeating the randomized splitting process described above recursively (that is, we divide the points into two groups, then repeat the process in each of the two groups that are obtained, and continue repeating it on the new groups until no further split is possible or until meeting some other criteria).
Under this scheme, one can deduce that, the more common a point is, the more splits it will take to leave the point alone or in a smaller group compared to uncommon points - as such, one can think of the “isolation depth” (number of partitions that it takes to isolate a point, hence the name of the algorithm) in the isolation trees as a metric by which to measure the inlierness or outlierness of a point.
A single isolation tree has a lot of expected variability in the isolation depths that it will give to each observation, thus an ensemble of many such trees - an “isolation forest” - may be used instead for better results, with the final score obtained by averaging the results (the isolation depths) from many such trees.
There are many potential ways of improving upon the logic behind the procedure (for example, extrapolating an isolation depth after reaching a certain limit) and the resulting score can be standardized for easier usage, among many others - see the references for more details about the methodology.
Compared to other outlier/anomaly detection methods such as “local outlier factor” or “one-class support vector machines”, isolation forests have advantages in that they are:
Additionally, since they produce a standardized outlier metric for every point, such models can be used for example to generate additional features for regression or classification models or as a proxy for distribution density, for which not all outlier detection methods are equally suitable (see the rest of the vignette for other potential uses of isolation forests).
As a simple proof-of-concept test, one can think of producing random numbers from a normal distribution and seeing what kinds of isolation depths would isolation trees assigning to them:
set.seed(123)
random_numbers <- matrix(rnorm(1000))
par(oma = c(0,0,0,0), mar = c(4,4,3,2))
hist(random_numbers, breaks=50, col="navy",
main="Randomly-generated numbers\nfrom normal distribution",
xlab="value")
library(isotree)
model <- isolation.forest(random_numbers, ndim=1, ntrees=10, nthreads=1)
scores <- predict(model, random_numbers, type="avg_depth")
par(mar = c(4,5,3,2))
plot(random_numbers, scores, type="p", col="darkred",
main="Average isolation depth\nfor normally-distributed numbers",
xlab="value", ylab="Average isolation depth")
As expected, the isolation depth in this sample of randomly-generated numbers following a normal distribution looks very similar to the probability distribution function, achieving the goal of determining which kind of points are more common or less common compared to the rest.
For simple 1D data, a kernel density estimate or similar would do a better job, but once we start adding more dimensions to the data, something like a kernel density estimate starts sounding less and less suitable, and more susceptible to the presence of outliers (which is what we want to detect in the first place).
The next example will generate standardized outlier scores (see references for details about this) for randomly-generated two-dimensional data.
In order to illustrate the advantages of isolation forests, this next example will generate two clusters of normally-distributed numbers, and will add an outlier along the way which could not be determined to be so if looking at the variables individually.
For many other outlier/anomaly detection methods, having multi-modal distributions like this is a big problem, and their performance suffers from contamination of the training data with outliers. For isolation forests, these are not problematic:
### Randomly-generated data from different distributions
set.seed(1)
cluster1 <- data.frame(
x = rnorm(1000, -1, .4),
y = rnorm(1000, -1, .2)
)
cluster2 <- data.frame(
x = rnorm(1000, +1, .2),
y = rnorm(1000, +1, .4)
)
outlier <- data.frame(
x = -1,
y = 1
)
### Putting them together
X <- rbind(cluster1, cluster2, outlier)
### Function to produce a heatmap of the scores
pts = seq(-3, 3, .1)
space_d <- expand.grid(x = pts, y = pts)
plot.space <- function(Z, ttl, cex.main = 1.4) {
image(pts, pts, matrix(Z, nrow = length(pts)),
col = rev(heat.colors(50)),
main = ttl, cex.main = cex.main,
xlim = c(-3, 3), ylim = c(-3, 3),
xlab = "", ylab = "")
par(new = TRUE)
plot(X, type = "p", xlim = c(-3, 3), ylim = c(-3, 3),
col = "#0000801A",
axes = FALSE, main = "",
xlab = "", ylab = "")
}
model <- isolation.forest(X, ndim=1, ntrees=100, nthreads=1)
scores <- predict(model, space_d)
par(mar = c(2.5,2.2,2,2.5))
plot.space(scores, "Outlier Scores\n(clustered data with an outlier on top)", 1.0)
The isolation forest algorithm, as originally introduced in the paper “Isolation forest”, tends to provide reasonably good performance out-of-the-box across a variety of datasets and problem domains, but it is far from universal or perfect: it suffers from many biases which affect its performance, and there have been many sub-sequent papers trying to address some of its issues by introducing changes in the splitting logic or in other aspects.
For example, in the plot above, it can be seen that it generated “ghost” regions of high inlierness in areas parallel to the clusters along each axis, despite there not being any data in those regions.
Enhanced variations of the algorithm have been proposed, for example by introducing changes in the logic as follows:
Here are some example variations on the same data as before:
par(mfrow = c(3, 2), mar = c(2.5,2.2,2,2.5))
iforest <- isolation.forest(
X, ndim=1, ntrees=100,
missing_action="fail"
)
plot.space(
predict(iforest, space_d),
"Isolation Forest"
)
ext_iforest <- isolation.forest(
X, ndim=2, ntrees=100,
missing_action="fail"
)
plot.space(
predict(ext_iforest, space_d),
"Extended Isolation Forest"
)
sciforest <- isolation.forest(
X, ndim=2, ntrees=100,
missing_action="fail",
coefs="normal",
prob_pick_avg_gain=1
)
plot.space(
predict(sciforest, space_d),
"SCiForest"
)
fcf <- isolation.forest(
X, ndim=2, ntrees=100,
missing_action="fail",
prob_pick_pooled_gain=1
)
plot.space(
predict(fcf, space_d),
"Fair-Cut Forest"
)
dens_iforest <- isolation.forest(
X, ndim=2, ntrees=100,
missing_action="fail",
scoring_metric="density"
)
plot.space(
predict(dens_iforest, space_d),
"Density Isolation Forest"
)
bdens_iforest <- isolation.forest(
X, ndim=1, ntrees=100,
missing_action="fail",
scoring_metric="boxed_ratio"
)
plot.space(
predict(bdens_iforest, space_d),
"Boxed Isolation Forest"
)
Although the plots above can be illustrative for understanding what isolation forest and its variants are doing, outlier detection in 1D and 2D data is an area in which many other more successful methods have been tried and proved before. Where isolation forests does shine though, is when the data has a larger number of dimensions, clustered outliers, or multi-modal distributions.
One of the most common datasets for benchmarking outlier detection algorithms in more than 2 dimensions is the “Satellite” dataset, which was originally designed for multi-class classification but is typically adapted for anomaly detection by merging the least common classes together and labelling them as “anomalies” or “outliers”.
From ODDS:
The original Statlog (Landsat Satellite) dataset from UCI machine learning repository is a multi-class classification dataset. Here, the training and test data are combined. The smallest three classes, i.e. 2, 4, 5 are combined to form the outliers class, while all the other classes are combined to form an inlier class.
This makes it one of the hardest datasets for outlier detection, but the challenges it introduces (clustered outliers, multi-modal distributions) are not an obstacle for isolation forests.
The next example will load the dataset from the mlbench
package and apply the conversion as outlined in ODDS and as done in
typical benchmarks for outlier detection:
library(mlbench)
data("Satellite")
is_outlier <- Satellite$classes %in% c("damp grey soil", "cotton crop", "vegetation stubble")
sat_without_class <- Satellite[, names(Satellite)[names(Satellite) != "classes"]]
dim(sat_without_class)
#> [1] 6435 36
Now the next chunk will fit different variants of isolation forests and assess how well they do at discriminating outliers by calculating the AUROC (area under the ROC curve) that they produce. Note that the models do not see any information about which observations are outliers and which are not.
The original paper “Isolation Forest” suggested that models would converge with 100 trees, but here it is possible to get slightly better performance by simply increasing the number of trees and by bigger changes such as choosing a different scoring metric.
library(MLmetrics)
library(kableExtra)
model_orig <- isolation.forest(
sat_without_class,
ndim=1, sample_size=256,
ntrees=100,
missing_action="fail"
)
pred_orig <- predict(model_orig, sat_without_class)
model_dens <- isolation.forest(
sat_without_class,
ndim=1, sample_size=256,
ntrees=100,
missing_action="fail",
scoring_metric="density"
)
pred_dens <- predict(model_dens, sat_without_class)
model_fcf <- isolation.forest(
sat_without_class,
ndim=1, sample_size=32,
prob_pick_pooled_gain=1,
ntrees=100,
missing_action="fail"
)
pred_fcf <- predict(model_fcf, sat_without_class)
results_df <- data.frame(
Model = c(
"Isolation Forest",
"Density Isolation Forest",
"Fair-Cut Forest"
),
AUROC = c(
AUC(pred_orig, is_outlier),
AUC(pred_dens, is_outlier),
AUC(pred_fcf, is_outlier)
)
)
results_df %>%
kable() %>%
kable_styling()
Model | AUROC |
---|---|
Isolation Forest | 0.6910222 |
Density Isolation Forest | 0.8265929 |
Fair-Cut Forest | 0.8969312 |
For comparison purposes, what follows next (one-class SVM) is one of the most widely-used methods for outlier detection other than isolation forests - as will be seen, “one-class support vector machines” are not very suitable to this kind of data, and even if it were to be tuned, its performance would not match that of isolation forest variants:
library(kernlab)
model_svm <- ksvm(
as.matrix(sat_without_class),
type="one-svc",
nu = 0.5
)
pred_svm <- predict(model_svm, as.matrix(sat_without_class), type="decision")
results_svm <- data.frame(
Model = "One-Class SVM",
AUROC = AUC(-pred_svm, is_outlier)
)
results_svm %>%
kable() %>%
kable_styling()
Model | AUROC |
---|---|
One-Class SVM | 0.7014547 |
(Note that kernlab’s objects output scores in the opposite
direction: for them, higher scores means more inlinerness, while for
isotree
(and for the one-class SVM in e1071
),
higher scores means more outlierness)
The scores obtained from isolation forest models can also be used as a proxy for density or as an additional feature in regression and classification tasks, which can be particularly helpful for domains like fraud detection, and can be used as a hint for determining whether there has been any covariate drift in a dataset (as part of other methods). When used for purposes other than anomaly detection, it is recommended to fit trees beyond balanced-tree height limit (see documentation for details).
Apart from the outlier scores, the package isotree
can
also do the following: