Title: | Find the Outlier by Quantile Random Forests |
---|---|
Description: | Provides a method for finding outliers in custom data using quantile random forests, introduced by Nicolai Meinshausen (2006) <https://dl.acm.org/doi/10.5555/1248547.1248582>. It directly calls the ranger() function of the 'ranger' package to fit the data and make predictions. The package also implements evaluation of the outlier prediction results. Compared with outlier detection based on ordinary random forests, this method has higher accuracy and stability on large datasets. |
Authors: | Tengfei Xu [aut, cre] |
Maintainer: | Tengfei Xu <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.0 |
Built: | 2025-01-09 07:03:31 UTC |
Source: | CRAN |
This function evaluates the performance of the outlier detection algorithm.
evaluateOutliers(original_data, anomaly_data, anomaly_result)
original_data |
A data frame containing the original data. |
anomaly_data |
A data frame containing the anomaly data. |
anomaly_result |
A data frame containing the predicted anomalies. |
A data frame containing the evaluation metrics.
anomaly_data <- generateOutliers(iris, p = 0.05, sd_factor = 5, seed = 123)
qrf <- outqrf(anomaly_data)
evaluateOutliers(iris, anomaly_data, qrf$outliers)
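The metrics returned by evaluateOutliers() are not listed on this page. As a rough illustration only, the base-R sketch below shows one way such an evaluation could work: cells that differ between the original and the contaminated data are treated as the true outliers, and the detections are scored by precision and recall. The name `score_detection_sketch` and the choice of metrics are assumptions made for this example, not the package's implementation.

```r
# Hypothetical helper (not part of the package): score detected outliers
# against the cells that actually differ between two data frames.
score_detection_sketch <- function(original, contaminated, detected) {
  num_cols <- names(original)[vapply(original, is.numeric, logical(1))]
  # Cells whose value changed are treated as the true outliers.
  changed <- which(
    as.matrix(original[num_cols]) != as.matrix(contaminated[num_cols]),
    arr.ind = TRUE
  )
  truth <- paste(changed[, "row"], num_cols[changed[, "col"]])
  found <- paste(detected$row, detected$col)
  tp <- length(intersect(truth, found))
  data.frame(
    precision = if (length(found) > 0) tp / length(found) else NA_real_,
    recall    = if (length(truth) > 0) tp / length(truth) else NA_real_
  )
}

# Toy data: two cells are contaminated; the detector finds one of them
# and raises one false alarm.
orig <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
bad <- orig
bad$a[2] <- 99
bad$b[4] <- -50
detected <- data.frame(row = c(2, 1), col = c("a", "b"))
res <- score_detection_sketch(orig, bad, detected)
res
```

The `detected` data frame mirrors the `row` and `col` columns documented for the `outliers` element of an "outqrf" object, so a real result could be passed in the same shape.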
This function finds the closest index to a given value in a vector.
find_index(x, y)
x |
a vector |
y |
a value |
the index of the closest value in the vector
find_index(c(1, 2, 3, 4, 5), 3.5)
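A closest-index lookup of this kind can be sketched in one line of base R. The function below is a hypothetical re-implementation for illustration; in particular, how the package resolves ties (such as 3.5 lying midway between 3 and 4) is not stated here, and this sketch simply returns the first match.

```r
# Hypothetical re-implementation for illustration: index of the element of x
# closest to y. Ties resolve to the first match because which.min() returns
# the first minimum.
closest_index_sketch <- function(x, y) {
  which.min(abs(x - y))
}

closest_index_sketch(c(1, 2, 3, 4, 5), 3.2)   # 3 is nearest, so index 3
closest_index_sketch(c(0.1, 0.5, 0.9), 0.8)   # 0.9 is nearest, so index 3
```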
This function adds outliers to a data set.
generateOutliers(data, p = 0.05, sd_factor = 5, seed = NULL)
data |
data.frame. |
p |
Proportion of outliers to add to data. |
sd_factor |
Each outlier is generated by shifting the original value by a realization of a normal random variable with sd_factor times the original sample standard deviation. |
seed |
An integer seed. |
data with some outliers.
generateOutliers(iris, p = 0.05, sd_factor = 5)
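The contamination step described by the p and sd_factor arguments can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package source: the name `add_outliers_sketch` is hypothetical, only numeric columns are touched, and the affected rows are drawn independently for each column.

```r
# Hypothetical sketch, not the package source: contaminate numeric columns.
add_outliers_sketch <- function(data, p = 0.05, sd_factor = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n_out <- round(p * nrow(data))
  for (v in names(data)[vapply(data, is.numeric, logical(1))]) {
    # Pick the rows to contaminate in this column (assumption: per column).
    idx <- sample(nrow(data), n_out)
    # Shift by normal noise whose sd is sd_factor times the column's sd.
    shift <- rnorm(n_out, mean = 0, sd = sd_factor * sd(data[[v]], na.rm = TRUE))
    data[[v]][idx] <- data[[v]][idx] + shift
  }
  data
}

contaminated <- add_outliers_sketch(iris, p = 0.05, sd_factor = 5, seed = 123)
# Number of changed cells in each numeric column.
colSums(contaminated[1:4] != iris[1:4])
```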
This function extracts the numeric value from a string.
get_quantily_value(name)
name |
a string |
a numeric value
get_quantily_value("quantiles = 0.001")
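Extracting a number from a label such as the one above can be done with a regular expression. The sketch below is a hypothetical base-R equivalent (the name `extract_number_sketch` is an assumption); it returns the first unsigned decimal number found in the string.

```r
# Hypothetical sketch: pull the first decimal number out of a label.
extract_number_sketch <- function(name) {
  match <- regmatches(name, regexpr("[0-9]*\\.?[0-9]+", name))
  as.numeric(match)
}

extract_number_sketch("quantiles = 0.001")  # 0.001
extract_number_sketch("quantiles = 0.5")    # 0.5
```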
This function finds the right rank of a response value in a quantile random forest.
get_right_rank(response, outMatrix, median_outMatrix, rmse_)
response |
a vector of response values |
outMatrix |
a matrix of out values |
median_outMatrix |
a vector of median out values |
rmse_ |
a vector of rmse values |
a vector of ranks
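How median_outMatrix and rmse_ enter the computation is not described here. As a simplified illustration of the underlying idea only, the sketch below ranks a response against one observation's predicted quantiles: the rank is the number of predicted quantiles that the response reaches or exceeds. The function name and the normal-distribution stand-in for forest output are assumptions.

```r
# Hypothetical sketch: 'quantile_preds' holds one observation's predicted
# quantiles in increasing order; the rank counts how many of them are less
# than or equal to the response.
rank_in_quantiles_sketch <- function(response, quantile_preds) {
  findInterval(response, quantile_preds)
}

probs <- seq(0.001, 0.999, by = 0.001)
quantile_preds <- qnorm(probs, mean = 10, sd = 2)  # stand-in for forest output

mid_rank  <- rank_in_quantiles_sketch(10.1, quantile_preds)  # near the middle
high_rank <- rank_in_quantiles_sketch(25, quantile_preds)    # above every quantile
low_rank  <- rank_in_quantiles_sketch(-5, quantile_preds)    # below every quantile
```

A rank near 0 or near the number of quantiles means the observed value sits in a tail of its predicted distribution.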
This function finds outliers in a dataset using quantile random forests.
outqrf(data, quantiles_type = 1000, threshold = 0.025, impute = TRUE, verbose = 1, weight = FALSE, ...)
data |
a data frame |
quantiles_type |
The quantile grid to predict: 1000 uses seq(from = 0.001, to = 0.999, by = 0.001); 400 uses seq(0.0025, 0.9975, 0.0025). |
threshold |
a threshold for outlier detection |
impute |
a boolean value indicating whether to impute missing values |
verbose |
a value indicating whether to print verbose output (default 1) |
weight |
a boolean value indicating whether to weight the threshold. If TRUE, the actual threshold will be threshold*r2. |
... |
additional arguments passed to the ranger function |
An object of class "outqrf", which is a list with the following elements.
Data
: Original data set in unchanged row order
outliers
: Compact representation of outliers. Each row corresponds to an outlier and contains the following columns:
row
: Row number of the outlier
col
: Variable name of the outlier
observed
: Observed value of the outlier
predicted
: Predicted value of the outlier
rank
: Rank of the outlier
outMatrix
: Predicted value at different quantiles for each observation
r.squared
: R-squared value of the quantile random forest model
oob.error
: Out-of-bag error of the quantile random forest model
rmse
: RMSE of the quantile random forest model
threshold
: Threshold for outlier detection
iris_with_outliers <- generateOutliers(iris, p = 0.05)
qrf <- outqrf(iris_with_outliers)
qrf$outliers
evaluateOutliers(iris, iris_with_outliers, qrf$outliers)
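The threshold and weight arguments suggest a tail-based decision rule. The sketch below is one plausible reading, offered as an assumption rather than the package's actual logic: a value is flagged when its rank, expressed as a fraction of the quantile grid, falls within `threshold` of either end, and with weight = TRUE the threshold is multiplied by r2 (taken here to be the model's R-squared). The function name and the two-sided rule are hypothetical.

```r
# Hypothetical sketch of tail-based flagging; not the package source.
flag_by_rank_sketch <- function(rank, n_quantiles, threshold = 0.025,
                                weight = FALSE, r2 = 1) {
  # With weight = TRUE the threshold is scaled by r2, following the
  # description of the 'weight' argument (r2 assumed to be R-squared).
  effective <- if (weight) threshold * r2 else threshold
  position <- rank / n_quantiles
  # Assumption: flag values in either tail of the predicted distribution.
  position < effective | position > 1 - effective
}

ranks <- c(3, 500, 990, 980)
plain    <- flag_by_rank_sketch(ranks, n_quantiles = 1000)
weighted <- flag_by_rank_sketch(ranks, n_quantiles = 1000, weight = TRUE, r2 = 0.5)
```

Under this reading, a poorly fitting model (small r2) narrows the tails, so fewer values are flagged for variables the forest cannot predict well.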
This function plots paired boxplots of an "outqrf" object, which makes it easier to compare the original values with the predicted values.
## S3 method for class 'outqrf'
plot(x, ...)
x |
An object of class "outqrf". |
... |
Other parameters that may be used. |
A ggplot2 object
irisWithOutliers <- generateOutliers(iris, seed = 2024)
qrf <- outqrf(irisWithOutliers)
plot(qrf)