Title: | Methods for Identification of Outliers in Environmental Data |
---|---|
Description: | Three semi-parametric methods for detection of outliers in environmental data based on kernel regression and subsequent analysis of smoothing residuals. The first method (Campulova, Michalek, Mikuska and Bokal (2018) <DOI: 10.1002/cem.2997>) analyzes the residuals using changepoint analysis, the second method (Campulova, Veselik and Michalek (2017) <DOI: 10.1016/j.apr.2017.01.004>) is based on control charts, and the third method (Holesovsky, Campulova and Michalek (2018) <DOI: 10.1016/j.apr.2017.06.005>) analyzes the residuals using extreme value theory. |
Authors: | Martina Campulova [cre], Martina Campulova [aut], Roman Campula [ctb] |
Maintainer: | Martina Campulova <[email protected]> |
License: | GPL-2 |
Version: | 1.1.0 |
Built: | 2024-12-04 07:22:39 UTC |
Source: | CRAN |
Performs Box-Cox power transformation of the data. The optimal value of power parameter is selected based on profile log-likelihoods.
The function is called by KRDetect.outliers.changepoint
and is not intended for use by regular users of the package.
boxcoxTransform(x)
x |
a numeric vector of data values. |
This function computes the Box-Cox power transformation of the data.
The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for a transformation of data to normality.
The optimal value of the power parameter is estimated from profile log-likelihoods calculated using the boxcox
function implemented in the MASS package.
A list is returned with elements:
lambda |
a numeric value giving power parameter |
x |
a numeric vector of data values |
x.transformed |
a numeric vector of transformed data |
Box G, Cox D (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B, 26, 211–234.
Venables WN, Ripley BD (2002). Modern Applied Statistics with S. New York, fourth edition. ISBN 0-387-95457-0, URL http://www.stats.ox.ac.uk/pub/MASS4.
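The selection of the power parameter from profile log-likelihoods can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper name boxcox_sketch, the lambda grid and the intercept-only model are hypothetical choices, and the data are assumed to be strictly positive.

```r
# Hypothetical helper (not the package's boxcoxTransform()): pick the Box-Cox
# power parameter that maximises the profile log-likelihood on a grid.
boxcox_sketch <- function(x, lambda.grid = seq(-2, 2, by = 0.01)) {
  stopifnot(is.numeric(x), all(x > 0))  # assumption: strictly positive data
  n <- length(x)
  sum.log <- sum(log(x))
  transform <- function(lambda) {
    # lambda = 0 is the limiting case of the power transformation
    if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
  }
  # Profile log-likelihood (up to an additive constant), intercept-only model
  loglik <- vapply(lambda.grid, function(lambda) {
    z <- transform(lambda)
    -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum.log
  }, numeric(1))
  lambda <- lambda.grid[which.max(loglik)]
  list(lambda = lambda, x = x, x.transformed = transform(lambda))
}

set.seed(1)
res <- boxcox_sketch(exp(rnorm(200)))  # log-normal data, so lambda should be near 0
res$lambda
```

The returned list mirrors the documented elements lambda, x and x.transformed.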
Performs changepoint analysis using the PELT algorithm or a nonparametric approach for multiple changepoints.
The function is called by KRDetect.outliers.changepoint
and is not intended for use by regular users of the package.
changepoint(x, cp.analysis.type, pen.value, alpha.edivisive)
x |
a numeric vector of data values. |
cp.analysis.type |
a character string specifying the type of changepoint analysis. Possible options are
|
pen.value |
A character string giving the formula for manual penalty used in PELT algorithm.
Only required for |
alpha.edivisive |
a numeric value giving the moment index used for determining the distance between and within segments in the nonparametric changepoint model. |
This function performs changepoint analysis using a parametric or nonparametric approach. The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for partitioning smoothing residuals into homogeneous segments.
A list is returned with elements:
x |
a numeric vector of data values |
cp.segmet |
an estimated integer membership vector for individual segments |
Killick R, Fearnhead P, Eckley IA (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500), 1590–1598.
Matteson D, James N (2014). A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data. Journal of the American Statistical Association, 109(505), 334–345.
Nicholas A. James, David S. Matteson (2014). ecp: An R Package for Nonparametric Multiple Change Point Analysis of Multivariate Data. Journal of Statistical Software, 62(7), 1-25, URL "http://www.jstatsoft.org/v62/i07/".
Killick R, Haynes K, Eckley IA (2016). changepoint: An R package for changepoint analysis. R package version 2.2.2, <URL: https://CRAN.R-project.org/package=changepoint>.
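The integer membership vector returned for individual segments can be illustrated with a small base R helper. The name segment_membership and the convention that each changepoint marks the last observation of a segment are assumptions made for this sketch; the package's own code may differ.

```r
# Hypothetical helper: convert changepoint locations into a segment
# membership vector. Each entry of `cpts` is assumed to be the index of the
# last observation of a segment (the final segment ends at n).
segment_membership <- function(n, cpts) {
  cpts <- sort(unique(cpts[cpts > 0 & cpts < n]))
  rep(seq_len(length(cpts) + 1L), times = diff(c(0, cpts, n)))
}

membership <- segment_membership(n = 10, cpts = c(3, 7))
membership
```

Observations 1 to 3 fall into segment 1, observations 4 to 7 into segment 2, and the rest into segment 3.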
Plot of results obtained using function KRDetect.outliers.changepoint
for identification of outliers using changepoint analysis.
The function is called by plot.KRDetect
and is not intended for use by regular users of the package.
changepoint.plot(x, show.segments, ...)
x |
a list obtained as an output of function |
show.segments |
a logical variable specifying if vertical lines representing individual segments are plotted. |
... |
further arguments to be passed to the |
This function plots the results obtained using function KRDetect.outliers.changepoint
for identifying outliers with the changepoint analysis based method.
The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for plotting results obtained using functions implemented in package envoutliers.
Identification of outlier data values on individual homogeneous segments using Chebyshev inequality.
The function is called by KRDetect.outliers.changepoint
and is not intended for use by regular users of the package.
chebyshev.inequality.detect(x, cp.segment, L.default)
x |
a numeric vector of data. |
cp.segment |
an integer membership vector for individual segments. |
L.default |
a numeric value of L parameter determining the criterion for outlier detection:
the limits for outlier observations on individual segments are set as |
This function detects outlier observations on individual segments using Chebyshev inequality. The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for identification of outlier residuals.
A list is returned with elements:
L |
a numeric vector of L parameters used for outlier identification on individual segments |
outlier |
a logical vector specifying the identified outliers, |
Campulova M, Michalek J, Mikuska P, Bokal D (2018). Nonparametric algorithm for identification of outliers in environmental data. Journal of Chemometrics, 32, 453-463.
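A segment-wise Chebyshev rule can be sketched as follows. The exact limits used by the package are not reproduced here; the sketch assumes limits of the form segment mean plus or minus L times the segment standard deviation, and the helper name chebyshev_sketch is hypothetical. By Chebyshev inequality, at most a fraction 1/L^2 of any distribution lies outside such limits.

```r
# Hypothetical helper: flag values lying more than L standard deviations from
# the mean of their own segment (assumed form of the Chebyshev limits).
chebyshev_sketch <- function(x, cp.segment, L = 3) {
  outlier <- logical(length(x))
  for (seg in unique(cp.segment)) {
    idx <- which(cp.segment == seg)
    if (length(idx) < 2) next          # sd is undefined for a single value
    m <- mean(x[idx])
    s <- sd(x[idx])
    outlier[idx] <- abs(x[idx] - m) > L * s
  }
  outlier
}

# Two segments with different levels; one gross error planted in segment 1
x <- c(qnorm(ppoints(50)), qnorm(ppoints(50)) + 5)
x[10] <- 12
cp.segment <- rep(1:2, each = 50)
out <- chebyshev_sketch(x, cp.segment, L = 3)
which(out)
```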
Estimation of limits of control chart R. The function is called by KRDetect.outliers.controlchart
and is not intended for use by regular users of the package.
control.limits.R(x, group.size, L)
x |
a numeric vector of data values. |
group.size |
a positive integer giving the number of observations in individual segments used for computation of control chart limits. If the data can not be equidistantly divided, the first extra values will be excluded from the analysis. |
L |
a positive numeric value giving parameter |
This function computes parameters based on which control chart R can be constructed. The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for identification limits based on control chart R.
A list is returned with elements:
x |
a numeric vector of data |
range.est |
a numeric value giving an estimate of range parameter |
groups.count |
a numeric value giving a number of segments used for estimating parameters of control chart |
groups.range |
a numeric vector giving sample ranges in individual segments used for estimating parameters of control chart |
LCL |
a numeric value giving lower control limit of control chart R |
UCL |
a numeric value giving upper control limit of control chart R |
Shewhart W (1931). Quality control chart. Bell System Technical Journal, 5, 593–603.
SAS/QC User's Guide, Version 8, 1999. SAS Institute, Cary, N.C.
Wild C, Seber G (2000). Chance encounters: A first course in data analysis and inference. New York: John Wiley.
Estimation of limits of control chart s. The function is called by KRDetect.outliers.controlchart
and is not intended for use by regular users of the package.
control.limits.s(x, group.size, L)
x |
a numeric vector of data values. |
group.size |
a positive integer giving the number of observations in individual segments used for computation of control chart limits. If the data can not be equidistantly divided, the first extra values will be excluded from the analysis. |
L |
a positive numeric value giving parameter |
This function computes parameters based on which control chart s can be constructed. The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for identification limits based on control chart s.
A list is returned with elements:
x |
a numeric vector of data |
sd.est |
a numeric value giving an estimate of standard deviation parameter |
groups.count |
a numeric value giving a number of segments used for estimating parameters of control chart |
groups.sd |
a numeric vector giving sample standard deviations in individual segments used for estimating parameters of control chart |
LCL |
a numeric value giving lower control limit of control chart s |
UCL |
a numeric value giving upper control limit of control chart s |
Shewhart W (1931). Quality control chart. Bell System Technical Journal, 5, 593–603.
SAS/QC User's Guide, Version 8, 1999. SAS Institute, Cary, N.C.
Wild C, Seber G (2000). Chance encounters: A first course in data analysis and inference. New York: John Wiley.
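The computation of s chart limits can be sketched in base R. This is an assumption-laden illustration rather than the package's code: the helper name s_chart_sketch is hypothetical, L is assumed to be the multiple of the standard deviation of the group standard deviations that sets the limit width, and the usual unbiasing constant c4 is computed from its closed form.

```r
# Hypothetical helper: limits of an s chart from consecutive groups.
s_chart_sketch <- function(x, group.size = 3, L = 3) {
  stopifnot(group.size >= 2)
  extra <- length(x) %% group.size
  if (extra > 0) x <- x[-seq_len(extra)]   # exclude the first extra values
  groups <- matrix(x, nrow = group.size)    # one column per group
  groups.sd <- apply(groups, 2, sd)
  # c4: E(s) = c4 * sigma for normal samples of size group.size
  c4 <- sqrt(2 / (group.size - 1)) *
    gamma(group.size / 2) / gamma((group.size - 1) / 2)
  s.bar <- mean(groups.sd)
  sd.est <- s.bar / c4
  half.width <- L * sd.est * sqrt(1 - c4^2)
  list(sd.est = sd.est, groups.count = ncol(groups), groups.sd = groups.sd,
       LCL = max(0, s.bar - half.width), UCL = s.bar + half.width)
}

limits <- s_chart_sketch(1:10, group.size = 3, L = 3)
limits$groups.count
```

With ten observations and groups of three, the first observation is dropped and three groups remain.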
Estimation of limits of control chart x. The function is called by KRDetect.outliers.controlchart
and is not intended for use by regular users of the package.
control.limits.x(x, method = "range", group.size, L)
x |
a numeric vector of data values. |
method |
a character string specifying the preferred estimate of standard deviation parameter. Possible options are
|
group.size |
a positive integer giving the number of observations in individual segments used for computation of control chart limits. If the data can not be equidistantly divided, the first extra values will be excluded from the analysis. |
L |
a positive numeric value giving parameter |
This function computes parameters based on which control chart x can be constructed. The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for identification limits based on control chart x.
A list is returned with elements:
x |
a numeric vector of data |
mean |
a numeric value giving sample mean of vector |
groups.count |
a numeric value giving a number of segments used for estimating parameters of control chart |
groups.mean |
a numeric vector giving sample means in individual segments used for estimating parameters of control chart |
LCL |
a numeric value giving lower control limit of control chart x |
UCL |
a numeric value giving upper control limit of control chart x |
Shewhart W (1931). Quality control chart. Bell System Technical Journal, 5, 593–603.
SAS/QC User's Guide, Version 8, 1999. SAS Institute, Cary, N.C.
Wild C, Seber G (2000). Chance encounters: A first course in data analysis and inference. New York: John Wiley.
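For the range-based estimate of the standard deviation, the x chart limits can be sketched as below. The helper name x_chart_sketch is hypothetical, L is assumed to scale the standard error of the group means, and only the standard d2 constants for group sizes 2 to 5 are tabulated here.

```r
# Hypothetical helper: x chart limits with sigma estimated from group ranges.
x_chart_sketch <- function(x, group.size = 3, L = 3) {
  # Standard d2 constants: E(range) = d2 * sigma for normal samples
  d2 <- c(`2` = 1.128, `3` = 1.693, `4` = 2.059, `5` = 2.326)
  stopifnot(as.character(group.size) %in% names(d2))
  extra <- length(x) %% group.size
  if (extra > 0) x <- x[-seq_len(extra)]   # exclude the first extra values
  groups <- matrix(x, nrow = group.size)    # one column per group
  groups.mean <- colMeans(groups)
  groups.range <- apply(groups, 2, function(g) diff(range(g)))
  sigma.est <- mean(groups.range) / d2[[as.character(group.size)]]
  center <- mean(groups.mean)
  half.width <- L * sigma.est / sqrt(group.size)
  list(mean = center, groups.count = ncol(groups), groups.mean = groups.mean,
       LCL = center - half.width, UCL = center + half.width)
}

limits <- x_chart_sketch(1:10, group.size = 3, L = 3)
limits$groups.mean
```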
Plot of results obtained using function KRDetect.outliers.controlchart
for identification of outliers using control charts.
The function is called by plot.KRDetect
and is not intended for use by regular users of the package.
controlchart.plot(x, plot.type = "all", ...)
x |
a list obtained as an output of function |
plot.type |
a type of plot with outliers displayed. Possible options are
|
... |
further arguments to be passed to the |
This function plots the results obtained using function KRDetect.outliers.controlchart
for identifying outliers with control charts.
The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for plotting results obtained using functions implemented in package envoutliers.
Plot of results obtained using function KRDetect.outliers.EV
for identification of outliers using extreme value theory.
The function is called by plot.KRDetect
and is not intended for use by regular users of the package.
EV.plot(x, plot.type = "all", ...)
x |
a list obtained as an output of function |
plot.type |
a type of plot with outliers displayed. Possible options are
|
... |
further arguments to be passed to the |
This function plots the results obtained using function KRDetect.outliers.EV
for identifying outliers with the extreme value theory based method.
The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for plotting results obtained using functions implemented in package envoutliers.
Estimation of an extremal index using the censored estimator suggested in (Holesovsky and Fusek, 2020).
The function is called by KRDetect.outliers.EV
and is not intended for use by regular users of the package.
extremal.index.censored(x, u, D)
x |
a numeric vector of observations. |
u |
a numeric value giving threshold. |
D |
a nonnegative integer giving the value of D parameter (Holesovsky and Fusek, 2020) |
This function computes the censored estimate of extremal index suggested in (Holesovsky and Fusek, 2020).
The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function used within KRDetect.outliers.EV.
a numeric value of an extremal index estimate
Holesovsky, J, Fusek, M (2020). Estimation of the Extremal Index Using Censored Distributions. Extremes, DOI: 10.1007/s10687-020-00374-3.
Estimation of an extremal index using the block maxima approach suggested by (Gomes, 1993).
The function is called by KRDetect.outliers.EV
and is not intended for use by regular users of the package.
extremal.index.gomes(x, block.length)
x |
a numeric vector of observations. |
block.length |
a numeric value giving the length of blocks. |
This function computes the estimate of extremal index suggested by (Gomes, 1993).
The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function used within KRDetect.outliers.EV.
A numeric value of an extremal index estimate
Gomes M (1993). On the estimation of parameter of rare events in environmental time series. In Statistics for the Environment, volume 2 of Water Related Issues, pp. 225-241. Wiley.
Heffernan JE, Stephenson AG (2016). ismev: An Introduction to Statistical Modeling of Extreme Values. R package version 1.41, URL http://CRAN.R-project.org/package=ismev.
Estimation of an extremal index using the Intervals estimator suggested in (Ferro and Segers, 2003).
The function is called by KRDetect.outliers.EV
and is not intended for use by regular users of the package.
extremal.index.intervals(x, u)
x |
a numeric vector of observations. |
u |
a numeric value giving threshold |
This function computes the estimate of extremal index suggested in (Ferro and Segers, 2003).
The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function used within KRDetect.outliers.EV.
a numeric value of an extremal index estimate
Ferro, CAT, Segers, J (2003). Inference for Cluster of Extreme Values. Journal of Royal Statistical Society, Series B, 65(2), 545-556.
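The intervals estimator works with the times between successive exceedances of the threshold. A minimal base R sketch, with the hypothetical helper name intervals_sketch and no claim to match the package's implementation, could look like this:

```r
# Hypothetical helper: intervals estimator of the extremal index computed
# from inter-exceedance times (Ferro and Segers, 2003).
intervals_sketch <- function(x, u) {
  exceed <- which(x > u)
  stopifnot(length(exceed) >= 2)
  gaps <- as.numeric(diff(exceed))     # inter-exceedance times
  n.gaps <- length(gaps)
  if (max(gaps) <= 2) {
    theta <- 2 * sum(gaps)^2 / (n.gaps * sum(gaps^2))
  } else {
    theta <- 2 * sum(gaps - 1)^2 / (n.gaps * sum((gaps - 1) * (gaps - 2)))
  }
  min(1, theta)                        # the extremal index cannot exceed 1
}

# Artificial series: exceedances arrive in runs of three, ten steps apart
x <- rep(c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0), times = 5)
theta.int <- intervals_sketch(x, u = 0.5)
theta.int
```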
Estimation of an extremal index using the K-gaps estimator suggested in (Suveges and Davison, 2010).
The function is called by KRDetect.outliers.EV
and is not intended for use by regular users of the package.
extremal.index.Kgaps(x, u, K)
x |
a numeric vector of observations. |
u |
a numeric value giving threshold. |
K |
a nonnegative integer giving the value of K parameter (Suveges and Davison, 2010). |
This function computes the K-gaps estimate of extremal index suggested in (Suveges and Davison, 2010).
The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function used within KRDetect.outliers.EV.
a numeric value of an extremal index estimate
Suveges, M, Davison, AC (2010). Model Misspecification in Peaks Over Threshold Analysis. The Annals of Applied Statistics, 4(1), 203-221.
Estimation of an extremal index using the runs estimator suggested in (Smith and Weissman, 1994).
The function is called by KRDetect.outliers.EV
and is not intended for use by regular users of the package.
extremal.index.runs(x, u, r)
x |
a numeric vector of observations. |
u |
a numeric value giving threshold. |
r |
a positive integer giving the value of runs parameter (Smith and Weissman, 1994). |
This function computes the runs estimate of extremal index suggested in (Smith and Weissman, 1994).
The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function used within KRDetect.outliers.EV.
a numeric value of an extremal index estimate
Smith, RL, Weissman, I (1994). Estimating the Extremal Index. Journal of the Royal Statistical Society, Series B, 56, 515-529.
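The runs estimator divides the number of clusters of exceedances by the number of exceedances, where a cluster is taken to end once r consecutive observations stay at or below the threshold. The following base R sketch uses the hypothetical helper name runs_sketch and is not taken from the package source.

```r
# Hypothetical helper: runs estimator of the extremal index.
runs_sketch <- function(x, u, r) {
  exceed <- x > u
  n <- length(x)
  n.exceed <- sum(exceed)
  stopifnot(n.exceed > 0, r >= 1, r < n)
  # An exceedance at position i closes a cluster when the next r
  # observations all stay at or below the threshold
  ends.cluster <- vapply(seq_len(n - r), function(i) {
    exceed[i] && !any(exceed[(i + 1):(i + r)])
  }, logical(1))
  sum(ends.cluster) / n.exceed
}

# Artificial series: five clusters, each containing three exceedances
x <- rep(c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0), times = 5)
theta.runs <- runs_sketch(x, u = 0.5, r = 2)
theta.runs
```

Five clusters among fifteen exceedances give an estimate of one third, the reciprocal of the cluster size in this artificial series.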
Estimation of an extremal index using the sliding blocks estimator suggested in (Northrop, 2015).
The function is called by KRDetect.outliers.EV
and is not intended for use by regular users of the package.
extremal.index.sliding.blocks(x, b = round(sqrt(length(x))))
x |
a numeric vector of observations. |
b |
a numeric value giving the length of blocks. Default is |
This function computes the sliding blocks estimate of extremal index suggested in (Northrop, 2015).
The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function used within KRDetect.outliers.EV.
a numeric value of an extremal index estimate
Northrop, PJ (2015). An Efficient Semiparametric Maxima Estimator of the Extremal Index. Extremes, 18, 585-603.
Finds the value of parameter alpha defining the criterion for outlier identification using quantiles of normal distribution. The parameter is found using Modified Algorithm A1 (Campulova et al., 2018).
The function is called by KRDetect.outliers.changepoint
and is not intended for use by regular users of the package.
find.alpha(x, alpha.start = 0.05, eps = NULL)
x |
a numeric vector of data values. |
alpha.start |
a numeric value giving the largest reasonable value of parameter alpha. |
eps |
a numeric value of the epsilon parameter. If |
This function finds the value of parameter alpha defining the criterion for outlier identification using quantiles of normal distribution. The algorithm is based on Modified Algorithm A1 described in (Campulova et al., 2018). Nonoutliers are characterised as a homogeneous set of data randomly distributed around zero. The differences between the data correspond to random fluctuations in the measurements. The algorithm finds the value of the parameter by scanning possible values of alpha and investigating differences of the corresponding nonoutliers. The idea is to choose alpha corresponding to the maximum change of the maximum difference found among the ordered nonoutlier data. The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for finding outlier residuals based on quantiles of normal distribution.
A numeric value giving the parameter alpha
Campulova M, Michalek J, Mikuska P, Bokal D (2018). Nonparametric algorithm for identification of outliers in environmental data. Journal of Chemometrics, 32, 453-463.
Finds the value of parameter L defining the criterion for outlier identification using Chebyshev inequality. The parameter is found using Algorithm A1 (Campulova et al., 2018).
The function is called by KRDetect.outliers.changepoint
and is not intended for use by regular users of the package.
find.L(x, L.start = 2.5, eps = NULL)
x |
a numeric vector of data values. |
L.start |
a numeric value giving the smallest reasonable value of parameter |
eps |
A numeric value of the epsilon parameter. If |
This function finds the value of parameter L defining the criterion for outlier identification using Chebyshev inequality. The algorithm is based on Algorithm A1 described in (Campulova et al., 2018). Nonoutliers are characterised as a homogeneous set of data randomly distributed around zero value. The differences between the data correspond to random fluctuations in the measurements. The algorithm finds the value of the parameter by scanning possible values of L and investigating differences of the corresponding nonoutliers. The idea is to choose L corresponding to the maximum change of the maximum difference found among the ordered nonoutlier data. The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for finding outlier residuals based on Chebyshev inequality.
A numeric value giving the parameter L
Campulova M, Michalek J, Mikuska P, Bokal D (2018). Nonparametric algorithm for identification of outliers in environmental data. Journal of Chemometrics, 32, 453-463.
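The scanning idea described for find.L can be illustrated with a rough base R sketch. This is one possible reading of the description and not Algorithm A1 itself: the helper name find_L_sketch, the grid of L values, the upper bound L.max and the limits of the form mean plus or minus L times the standard deviation are all assumptions.

```r
# Hypothetical helper: scan a grid of L values, track the largest gap between
# neighbouring ordered nonoutliers, and return the L just before the largest
# jump of that gap.
find_L_sketch <- function(x, L.start = 2.5, L.max = 10, step = 0.1) {
  L.grid <- seq(L.start, L.max, by = step)
  m <- mean(x)
  s <- sd(x)
  max.gap <- vapply(L.grid, function(L) {
    keep <- sort(x[abs(x - m) <= L * s])   # nonoutliers for this L
    if (length(keep) < 2) 0 else max(diff(keep))
  }, numeric(1))
  jump <- diff(max.gap)
  if (all(jump <= 0)) return(max(L.grid))  # no jump: nothing looks outlying
  L.grid[which.max(jump)]
}

# Normal scores plus one gross error at position 201
x <- c(qnorm(ppoints(200)), 8)
L <- find_L_sketch(x)
which(abs(x - mean(x)) > L * sd(x))
```

The largest jump of the maximum gap occurs when the gross error would be admitted among the nonoutliers, so the chosen L keeps it outside the limits.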
Creation of Table of Control Chart Constants.
The function is called by control.limits.x, control.limits.R and control.limits.s. This function is not intended for use by regular users of the package.
get.norm()
This function creates a table with columns giving constants for computation limits of control charts. The function is exported for developer use only. It does not have any input parameters and does not perform any checks on inputs since it is only a convenience function for computing control chart limits.
data.frame whose columns are numeric vectors giving constants for control charts limits computation
JOGLEKAR, Anand M. Statistical methods for six sigma: in R&D and manufacturing. Hoboken, NJ: Wiley-Interscience. ISBN 0-471-20342-4.
Identification of outlier data values on individual homogeneous segments using Grubbs test.
The function is called by KRDetect.outliers.changepoint
and is not intended for use by regular users of the package.
grubbs.detect(x, cp.segment)
x |
a numeric vector of data. |
cp.segment |
an integer membership vector for individual segments. |
This function detects outlier observations on individual segments using Grubbs test. The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for identification of outlier residuals.
A logical vector specifying the identified outliers, TRUE means that corresponding data value from vector x is detected as outlier.
Grubbs F (1950). Sample criteria for testing outlying observations. The Annals of Mathematical Statistics, 21(1), 27-58.
Campulova M, Michalek J, Mikuska P, Bokal D (2018). Nonparametric algorithm for identification of outliers in environmental data. Journal of Chemometrics, 32, 453-463.
Sequential identification of outliers using Grubbs' test.
The algorithm first considers the data value with the highest absolute value. If the null hypothesis that such a value is not an outlier is rejected,
the considered value is detected as an outlier and excluded from further analysis. Subsequently, a value with the second-highest absolute value is considered, and its quality is again evaluated using the Grubbs test. This procedure is repeated until no outlier is detected.
The function is called by KRDetect.outliers.changepoint and grubbs.detect. The function is not intended for use by regular users of the package.
grubbs.test(x, alpha = 0.05)
x |
a numeric vector of data values. |
alpha |
a numeric value giving test significance level. |
This function sequentially identifies outlier data using Grubbs test. The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for identification of outlier residuals using Grubbs test.
A list is returned with elements:
result |
A table containing information about identified outliers. The number of rows of the table corresponds to the number of identified outliers. The table has following columns: |
outliers.exists |
A logical value. |
Grubbs F (1950). Sample criteria for testing outlying observations. The Annals of Mathematical Statistics, 21(1), 27-58.
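The sequential procedure described above can be sketched in base R. The helper name grubbs_sequential_sketch is hypothetical, and the sketch uses the textbook two-sided Grubbs statistic (largest absolute deviation from the sample mean divided by the sample standard deviation), which may differ in detail from the package's implementation.

```r
# Hypothetical helper: repeat the two-sided Grubbs test, removing the most
# extreme value each time, until the test no longer rejects.
grubbs_sequential_sketch <- function(x, alpha = 0.05) {
  idx <- seq_along(x)        # original positions of values still under test
  outliers <- integer(0)
  while (length(idx) > 2) {
    values <- x[idx]
    n <- length(values)
    s <- sd(values)
    if (s == 0) break        # all remaining values are identical
    dev <- abs(values - mean(values))
    G <- max(dev) / s
    # Critical value of the two-sided Grubbs test at level alpha
    t.crit <- qt(alpha / (2 * n), df = n - 2, lower.tail = FALSE)
    G.crit <- (n - 1) / sqrt(n) * sqrt(t.crit^2 / (n - 2 + t.crit^2))
    if (G <= G.crit) break   # no further outlier detected
    worst <- which.max(dev)
    outliers <- c(outliers, idx[worst])
    idx <- idx[-worst]
  }
  outliers
}

# Normal scores plus one gross error at position 51
x <- c(qnorm(ppoints(50)), 10)
detected <- grubbs_sequential_sketch(x)
detected
```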
Identification of outliers in environmental data using a method based on kernel smoothing, changepoint analysis of smoothing residuals and subsequent analysis of residuals on homogeneous segments (Campulova et al., 2018).
KRDetect.outliers.changepoint(x, perform.smoothing = TRUE, perform.cp.analysis = TRUE, bandwidth.type = "local", bandwidth.value = NULL, kernel.order = 2, cp.analysis.type = "parametric", pen.value = "5*log(n)", alpha.edivisive = 0.3, min.segment.length = 30, segment.length.for.merge = 15, method = "auto", prefer.grubbs = TRUE, alpha.default = NULL, L.default = NULL)
x |
data values. Supported data types
|
perform.smoothing |
a logical value specifying if data smoothing is performed. If |
perform.cp.analysis |
a logical value specifying if changepoint analysis is performed. If |
bandwidth.type |
a character string specifying the type of bandwidth. Possible options are
|
bandwidth.value |
a local bandwidth array (for |
kernel.order |
a nonnegative integer giving the order of the optimal kernel (Gasser et al., 1985) used for smoothing. Possible options are
|
cp.analysis.type |
a character string specifying the type of changepoint analysis. Possible options are
|
pen.value |
a character string giving the formula for manual penalty used in PELT algorithm.
Only required for |
alpha.edivisive |
a numeric value giving the moment index used for determining the distance between and within segments in nonparametric changepoint model. Default is |
min.segment.length |
a numeric value giving minimal required number of observations on segments from changepoint analysis.
If a segment contains less than |
segment.length.for.merge |
a numeric value giving minimal required number of observations on segments for performing the homogeneity test within changepoint split control.
A segment with less data than |
method |
a character string specifying the method for identification of outlier residuals. Possible options are
|
prefer.grubbs |
a logical variable specifying if Grubbs test for identification of outlier residuals is preferred to quantiles of normal distribution.
|
alpha.default |
a numeric value from interval (0,1) of alpha parameter determining the criterion for (residual) outlier detection:
the limits for outlier residuals on individual segments are set as |
L.default |
a numeric value of L parameter determining the criterion for outlier (residual) detection:
the limits for outlier residuals on individual segments are set as |
This function identifies outliers in time series using a procedure based on kernel smoothing, changepoint analysis of smoothing residuals and subsequent analysis of residuals on homogeneous segments (Campulova et al., 2018). Outlier residuals can be detected using three different approaches (Grubbs test, quantiles of normal distribution, Chebyshev inequality), which can be selected automatically based on the data structure or specified by the user. Crucial for the quantiles of normal distribution and Chebyshev inequality approaches is the choice of the parameters alpha and L, which define the criterion for outlier detection. These values can be specified by the user or estimated automatically using data-driven algorithms (Campulova et al., 2018).
A "KRDetect"
object which contains a list with elements:
method.type |
a character string giving the type of method used for outlier identification |
x |
a numeric vector of observations |
index |
a numeric vector of index design points assigned to individual observations |
smoothed |
a numeric vector of estimates of the kernel regression function (smoothed data) |
changepoints |
an integer membership vector for individual segments |
normality.results |
a data.frame of normality results of residuals on individual segments |
detection.method |
a character string giving the type of method used for identification of outlier residuals |
alpha |
a numeric vector of alpha parameters used for outlier identification on individual segments |
L |
a numeric vector of L parameters used for outlier identification on individual segments |
outlier |
a logical vector specifying the identified outliers, |
Campulova M, Michalek J, Mikuska P, Bokal D (2018). Nonparametric algorithm for identification of outliers in environmental data. Journal of Chemometrics, 32, 453-463.
Gasser T, Kneip A, Kohler W (1991). A flexible and fast method for automatic smoothing. Journal of the American Statistical Association, 86, 643–652.
Herrmann E (1997). Local bandwidth choice in kernel regression estimation. Journal of Computational and Graphical Statistics, 6(1), 35–54.
Eva Herrmann; Packaged for R and enhanced by Martin Maechler (2016). lokern: Kernel Regression Smoothing with Local or Global Plug-in Bandwidth. R package version 1.1-8. https://CRAN.R-project.org/package=lokern.
Killick R, Fearnhead P, Eckley IA (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500), 1590–1598.
Killick R, Haynes K, Eckley IA (2016). changepoint: An R package for changepoint analysis. R package version 2.2.2, <URL: https://CRAN.R-project.org/package=changepoint>.
Matteson D, James N (2014). A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data. Journal of the American Statistical Association, 109(505), 334–345.
Nicholas A. James, David S. Matteson (2014). ecp: An R Package for Nonparametric Multiple Change Point Analysis of Multivariate Data. Journal of Statistical Software, 62(7), 1-25, URL "http://www.jstatsoft.org/v62/i07/".
Brys G, Hubert M, Struyf A (2008). Goodness-of-fit tests based on a robust measure of skewness. Computational Statistics, 23(3), 429–442.
Todorov V, Filzmoser P (2009). An Object-Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1-47. URL http://www.jstatsoft.org/v32/i03/.
Box G, Cox D (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B, 26, 211–234.
Venables WN, Ripley BD (2002). Modern Applied Statistics with S. New York, fourth edition. ISBN 0-387-95457-0, URL http://www.stats.ox.ac.uk/pub/MASS4.
Grubbs F (1950). Sample criteria for testing outlying observations. The Annals of Mathematical Statistics, 21(1), 27-58.
Fox J (2016). Applied regression analysis and generalized linear models. 3 edition. Los Angeles: SAGE. ISBN 9781452205663.
data("mydata", package = "openair")
x = mydata$o3[format(mydata$date, "%m %Y") == "12 2002"]
result = KRDetect.outliers.changepoint(x)
summary(result)
plot(result)
plot(result, show.segments = FALSE)
Identification of outliers in environmental data using a two-step method based on kernel smoothing and control charts (Campulova et al., 2017). The outliers are identified as observations corresponding to segments of smoothing residuals exceeding control chart limits.
KRDetect.outliers.controlchart(x, perform.smoothing = TRUE, bandwidth.type = "local", bandwidth.value = NULL, kernel.order = 2, method = "range", group.size.x = 3, group.size.R = 3, group.size.s = 3, L.x = 3, L.R = 3, L.s = 3)
x |
data values. Supported data types
|
perform.smoothing |
a logical value specifying if data smoothing is performed. If |
bandwidth.type |
a character string specifying the type of bandwidth. Possible options are
|
bandwidth.value |
a local bandwidth array (for |
kernel.order |
a nonnegative integer giving the order of the optimal kernel (Gasser et al., 1985) used for smoothing. Possible options are
|
method |
a character string specifying the preferred estimate of standard deviation parameter. Possible options are
|
group.size.x |
a positive integer giving the number of observations in individual segments used for computation of x chart control limits.
If the data can not be equidistantly divided, the first extra values will be excluded from the analysis. Default is |
group.size.R |
a positive integer giving the number of observations in individual segments used for computation of R chart control limits.
If the data can not be equidistantly divided, the first extra values will be excluded from the analysis. Default is |
group.size.s |
a positive integer giving the number of observations in individual segments used for computation of s chart control limits.
If the data can not be equidistantly divided, the first extra values will be excluded from the analysis. Default is |
L.x |
a positive numeric value giving parameter |
L.R |
a positive numeric value giving parameter |
L.s |
a positive numeric value giving parameter |
This function identifies outliers in environmental data using a two-step procedure (Campulova et al., 2017).
The procedure consists of kernel smoothing and subsequent identification of observations corresponding to segments of smoothing residuals that exceed control chart limits.
The method therefore does not identify individual outliers but segments of observations in which the outliers occur.
The output of the method consists of three logical vectors specifying the outliers identified based on each of the three control charts.
In addition, a logical vector specifying the outliers identified based on at least one type of control limits is returned.
The choice of parameters L.x, L.R and L.s, which specify the width of the control limits, is crucial for the method.
Different values of the parameters determine different criteria for outlier detection. For more information see Campulova et al. (2017).
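As a rough illustration of how control limits of width L can flag whole segments, the following base R sketch builds x and R chart limits from residual groups of size 3 using a range-based estimate of the standard deviation. It is not the package's implementation: the simulated residuals, the helper `flag` and the textbook Shewhart constants `d2` and `d3` are assumptions made for this example, and the exact formulas used by KRDetect.outliers.controlchart may differ.

```r
set.seed(1)
res <- rnorm(100)               # stand-in for smoothing residuals (assumption)
res[53:55] <- res[53:55] + 8    # shift one whole group of three observations

n <- 3                          # group size, as in group.size.x = 3
L <- 3                          # width of the control limits, as in L.x = 3
d2 <- 1.693                     # textbook constants for subgroups of size 3
d3 <- 0.888

# If the data cannot be divided evenly, drop the first extra values.
extra <- length(res) %% n
kept <- seq_along(res) > extra
groups <- matrix(res[kept], nrow = n)   # one column per segment

xbar <- colMeans(groups)
R <- apply(groups, 2, function(g) diff(range(g)))
sigma.hat <- mean(R) / d2               # range-based estimate of sigma

LCL.x <- mean(xbar) - L * sigma.hat / sqrt(n)
UCL.x <- mean(xbar) + L * sigma.hat / sqrt(n)
LCL.R <- max(0, mean(R) - L * d3 * sigma.hat)
UCL.R <- mean(R) + L * d3 * sigma.hat

# Expand per-group decisions back to one logical value per observation;
# the dropped leading values are simply left as FALSE here.
flag <- function(per.group) {
  out <- rep(FALSE, length(res))
  out[kept] <- rep(per.group, each = n)
  out
}
outlier.x <- flag(xbar < LCL.x | xbar > UCL.x)
outlier.R <- flag(R < LCL.R | R > UCL.R)
outlier <- outlier.x | outlier.R
which(outlier.x)
```

Larger values of `L` widen the limits and flag fewer segments, which is why the choice of L.x, L.R and L.s determines the detection criterion.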
A "KRDetect" object which contains a list with elements:
method.type |
a character string giving the type of method used for outlier identification |
x |
a numeric vector of observations |
index |
a numeric vector of index design points assigned to individual observations |
smoothed |
a numeric vector of estimates of the kernel regression function (smoothed data) |
outlier.x |
a logical vector specifying the identified outliers based on limits of control chart x |
outlier.R |
a logical vector specifying the identified outliers based on limits of control chart R |
outlier.s |
a logical vector specifying the identified outliers based on limits of control chart s |
outlier |
a logical vector specifying the identified outliers based on at least one type of control limits |
LCL.x |
a numeric value giving lower control limit of control chart x |
UCL.x |
a numeric value giving upper control limit of control chart x |
LCL.s |
a numeric value giving lower control limit of control chart s |
UCL.s |
a numeric value giving upper control limit of control chart s |
LCL.R |
a numeric value giving lower control limit of control chart R |
UCL.R |
a numeric value giving upper control limit of control chart R |
Campulova M, Veselik P, Michalek J (2017). Control chart and Six sigma based algorithms for identification of outliers in experimental data, with an application to particulate matter PM10. Atmospheric Pollution Research. DOI: 10.1016/j.apr.2017.01.004.
Shewhart W (1931). Quality control chart. Bell System Technical Journal, 5, 593–603.
SAS/QC User's Guide, Version 8, 1999. SAS Institute, Cary, N.C.
Wild C, Seber G (2000). Chance encounters: A first course in data analysis and inference. New York: John Wiley.
Joglekar, Anand M. Statistical methods for six sigma: in R&D and manufacturing. Hoboken, NJ: Wiley-Interscience. ISBN 0-471-20342-4.
Gasser T, Kneip A, Kohler W (1991). A flexible and fast method for automatic smoothing. Journal of the American Statistical Association, 86, 643–652.
Herrmann E (1997). Local bandwidth choice in kernel regression estimation. Journal of Computational and Graphical Statistics, 6(1), 35–54.
Eva Herrmann; Packaged for R and enhanced by Martin Maechler (2016). lokern: Kernel Regression Smoothing with Local or Global Plug-in Bandwidth. R package version 1.1-8. https://CRAN.R-project.org/package=lokern
data("mydata", package = "openair")
x = mydata$o3[format(mydata$date, "%m %Y") == "12 2002"]
result = KRDetect.outliers.controlchart(x)
summary(result)
plot(result)
plot(result, plot.type = "x")
plot(result, plot.type = "R")
plot(result, plot.type = "s")
Identification of outliers in environmental data using a semiparametric method based on kernel smoothing and extreme value theory (Holesovsky et al., 2018). The outliers are identified as observations whose values are exceeded on average once per period of a length specified by the user.
KRDetect.outliers.EV(x, perform.smoothing = TRUE, bandwidth.type = "local", bandwidth.value = NULL, kernel.order = 2, gpd.fit.method = "mle", threshold.min = NULL, threshold.max = NULL, k.min = round(length(na.omit(x)) * 0.1), k.max = round(length(na.omit(x)) * 0.1), extremal.index.min = NULL, extremal.index.max = NULL, extremal.index.type = "block.maxima", block.length.min = round(sqrt(length(na.omit(x)))), block.length.max = round(sqrt(length(na.omit(x)))), D.min = NULL, D.max = NULL, K.min = NULL, K.max = NULL, r.min = NULL, r.max = NULL, return.period = 120)
x |
data values. Supported data types
|
perform.smoothing |
a logical value specifying if data smoothing is performed. If |
bandwidth.type |
a character string specifying the type of bandwidth. Possible options are
|
bandwidth.value |
a local bandwidth array (for |
kernel.order |
a nonnegative integer giving the order of the optimal kernel (Gasser et al., 1985) used for smoothing. Possible options are
|
gpd.fit.method |
a character string specifying the method used for the estimate of the scale and shape parameters of GP distribution. Possible options are
|
threshold.min |
a threshold value for residuals with low values that is used to find the maximum likelihood estimates of the shape and scale parameters of the GP distribution and selected types of extremal index estimates (specifically: intervals estimator (Ferro and Segers, 2003), censored estimator (Holesovsky and Fusek, 2020), K-gaps estimator (Suveges and Davison, 2010), runs estimator (Smith and Weissman, 1994)). If |
threshold.max |
a threshold value for residuals with high values that is used to find the maximum likelihood estimates of the shape and scale parameters of the GP distribution and selected types of extremal index estimates (specifically: intervals estimator (Ferro and Segers, 2003), censored estimator (Holesovsky and Fusek, 2020), K-gaps estimator (Suveges and Davison, 2010), runs estimator (Smith and Weissman, 1994)). If |
k.min |
a positive integer for residuals with low values giving the number of largest order statistics used to find the moment estimates (de Haan and Ferreira, 2006) of shape and scale parameters of GP distribution. Default is |
k.max |
a positive integer for residuals with high values giving the number of largest order statistics used to find the moment estimates (de Haan and Ferreira, 2006) of shape and scale parameters of GP distribution. Default is |
extremal.index.min |
a numeric value giving the extremal index for identification of outliers with extremely low value. If |
extremal.index.max |
a numeric value giving the extremal index for identification of outliers with extremely high value. If |
extremal.index.type |
a character string specifying the type of extremal index estimate. Possible options are
|
block.length.min |
a numeric value for residuals with low values giving the length of blocks for estimation of extremal index. Only required for |
block.length.max |
a numeric value for residuals with high values giving the length of blocks for estimation of extremal index. Only required for |
D.min |
a nonnegative integer for residuals with low values giving the value of D parameter used for censored extremal index estimate (Holesovsky and Fusek, 2020). Only required for |
D.max |
a nonnegative integer for residuals with high values giving the value of D parameter used for censored extremal index estimate (Holesovsky and Fusek, 2020). Only required for |
K.min |
a nonnegative integer for residuals with low values giving the value of K parameter used for K-gaps extremal index estimate (Suveges and Davison, 2010). Only required for |
K.max |
a nonnegative integer for residuals with high values giving the value of K parameter used for K-gaps extremal index estimate (Suveges and Davison, 2010). Only required for |
r.min |
a positive integer for residuals with low values giving the value of runs parameter of runs extremal index estimate (Smith and Weissman, 1994). Only required for |
r.max |
a positive integer for residuals with high values giving the value of runs parameter of runs extremal index estimate (Smith and Weissman, 1994). Only required for |
return.period |
a positive numeric value giving return period. Default is |
This function identifies outliers in time series using a two-step procedure (Holesovsky et al., 2018). The procedure consists of kernel smoothing and extreme value estimation of high threshold exceedances for the smoothing residuals. Outliers with both extremely high and extremely low values are identified. The choice of the return period, the parameter defining the criterion for outlier detection, is crucial for the method. Outliers with extremely high values are detected as observations whose values are exceeded on average once per return.period observations. Outliers with extremely low values are identified analogously.
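To make the role of the return period concrete, here is a deliberately simplified base R sketch. It replaces the GP model and the extremal index by plain empirical quantiles of simulated, independent residuals, so it only illustrates the interpretation "exceeded on average once per return.period observations" and is not what KRDetect.outliers.EV computes. All object names and the simulated data are assumptions.

```r
set.seed(42)
res <- rnorm(1200)        # stand-in for smoothing residuals (assumption)
return.period <- 120      # same value as the documented default

# For independent data, a level exceeded once per 120 observations on average
# is roughly the (1 - 1/120) quantile; the low side is treated symmetrically.
level.max <- quantile(res, 1 - 1 / return.period, names = FALSE)
level.min <- quantile(res, 1 / return.period, names = FALSE)

outlier.max <- res > level.max
outlier.min <- res < level.min
outlier <- outlier.min | outlier.max

# About length(res) / return.period observations are flagged on each side.
c(high = sum(outlier.max), low = sum(outlier.min))
```

A longer return period moves both levels further into the tails, so fewer observations are reported as outliers.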
A "KRDetect" object which contains a list with elements:
method.type |
a character string giving the type of method used for outlier identification |
x |
a numeric vector of observations |
index |
a numeric vector of index design points assigned to individual observations |
smoothed |
a numeric vector of estimates of the kernel regression function (smoothed data) |
GPD.fit.method |
the method used for the estimate of the scale and shape parameters of GP distribution |
extremal.index.type |
the type of extremal index estimate used for the identification of outliers |
sigma.min |
a numeric value giving scale parameter of Generalised Pareto distribution used for identification of outliers with extremely low value |
sigma.max |
a numeric value giving scale parameter of Generalised Pareto distribution used for identification of outliers with extremely high value |
xi.min |
a numeric value giving shape parameter of Generalised Pareto distribution used for identification of outliers with extremely low value |
xi.max |
a numeric value giving shape parameter of Generalised Pareto distribution used for identification of outliers with extremely high value |
lambda_u.min |
a numeric value giving the relative frequency of threshold exceedances used for identification of outliers with extremely low value. The value of the parameter is returned only for |
lambda_u.max |
a numeric value giving the relative frequency of threshold exceedances used for identification of outliers with extremely high value. The value of the parameter is returned only for |
extremal.index.min |
a numeric value giving extremal index used for identification of outliers with extremely low value |
extremal.index.max |
a numeric value giving extremal index used for identification of outliers with extremely high value |
threshold.min |
a numeric value giving threshold value used for identification of outliers with extremely low value. |
threshold.max |
a numeric value giving threshold value used for identification of outliers with extremely high value. |
return.level.min |
a numeric value giving return level used for identification of outliers with extremely low value |
return.level.max |
a numeric value giving return level used for identification of outliers with extremely high value |
outlier.min |
a logical vector specifying the identified outliers with extremely low value |
outlier.max |
a logical vector specifying the identified outliers with extremely high value |
outlier |
a logical vector specifying the identified outliers with either extremely low or extremely high value |
Holesovsky J, Campulova M, Michalek J (2018). Semiparametric Outlier Detection in Nonstationary Times Series: Case Study for Atmospheric Pollution in Brno, Czech Republic. Atmospheric Pollution Research, 9(1).
Theo Gasser, Alois Kneip & Walter Koehler (1991) A flexible and fast method for automatic smoothing. Journal of the American Statistical Association 86, 643-652. https://doi.org/10.2307/2290393
E. Herrmann (1997) Local bandwidth choice in kernel regression estimation. Journal of Computational and Graphical Statistics 6, 35-54.
Herrmann E, Maechler M (2013). lokern: Kernel Regression Smoothing with Local or Global Plug-in Bandwidth. R package version 1.1-5, URL http://CRAN.R-project.org/package=lokern.
Gasser, T, Muller, H-G, Mammitzsch, V (1985). Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, B Met., 47(2), 238-252.
Gomes M (1993). On the estimation of parameter of rare events in environmental time series. In Statistics for the Environment, volume 2 of Water Related Issues, pp. 225-241. Wiley.
Ferro, CAT, Segers, J (2003). Inference for Cluster of Extreme Values. Journal of Royal Statistical Society, Series B, 65(2), 545-556.
Holesovsky, J, Fusek, M (2020). Estimation of the Extremal Index Using Censored Distributions. Extremes, In Press.
Suveges, M, Davison, AC (2010). Model Misspecification in Peaks Over Threshold Analysis. The Annals of Applied Statistics, 4(1), 203-221.
Northrop, PJ (2015). An Efficient Semiparametric Maxima Estimator of the Extremal Index. Extremes, 18, 585-603.
Smith, RL, Weissman, I (1994). Estimating the Extremal Index. Journal of the Royal Statistical Society, Series B, 56, 515-529.
Heffernan JE, Stephenson AG (2016). ismev: An Introduction to Statistical Modeling of Extreme Values. R package version 1.41, URL http://CRAN.R-project.org/package=ismev.
Coles S (2001). An Introduction to Statistical Modeling of Extreme Values. 3 edition. London: Springer. ISBN 1-85233-459-2.
de Haan, L, Ferreira, A (2006). Extreme Value Theory: An Introduction. Springer.
Pickands J (1975). Statistical inference using extreme order statistics. The Annals of Statistics, 3(1), 119-131.
data("mydata", package = "openair")
x = mydata$o3[format(mydata$date, "%m %Y") == "12 2002"]
result = KRDetect.outliers.EV(x)
summary(result)
plot(result)
plot(result, plot.type = "min")
plot(result, plot.type = "max")
This function is deprecated. Use plot.KRDetect instead.
KRDetect.outliers.plot(x, all, segments)
x |
a list obtained as the output of function KRDetect.outliers.changepoint, KRDetect.outliers.controlchart or KRDetect.outliers.EV for identification of outliers. |
all |
a logical variable. For results obtained using function KRDetect.outliers.controlchart, it specifies whether individual graphs for outliers detected using control charts x, R and s are plotted together with the graph visualising outliers detected based on at least one control chart. If all = FALSE, only the graph visualising outliers detected based on at least one control chart is plotted. For results obtained using function KRDetect.outliers.EV, it specifies whether individual graphs for outliers with extremely low and extremely high value are plotted together with the graph visualising outliers with both extremely low and extremely high value. If all = FALSE, only the graph visualising outliers with both extremely low and extremely high value is plotted. Only required for results obtained using functions KRDetect.outliers.controlchart and KRDetect.outliers.EV. Default is all = TRUE. |
segments |
a logical variable specifying if vertical lines representing individual segments are plotted. Only required for results obtained using KRDetect.outliers.changepoint function. Default is segments = TRUE. |
Plot of results obtained using functions KRDetect.outliers.changepoint, KRDetect.outliers.controlchart and KRDetect.outliers.EV for identification of outliers. The function graphically visualizes results obtained using functions for outlier detection implemented in package envoutliers.
This function plots the outliers identified by function KRDetect.outliers.changepoint, KRDetect.outliers.controlchart or KRDetect.outliers.EV implemented in package envoutliers.
Calculates the left medcouple (LMC).
The function is called by KRDetect.outliers.changepoint and is not intended for use by regular users of the package.
mc.left(x)
x |
a numeric vector of data values. |
This function computes left medcouple (LMC). The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function.
A numeric value giving left medcouple
Brys G, Hubert M, Struyf A (2008). Goodness-of-fit tests based on a robust measure of skewness. Computational Statistics, 23(3), 429–442.
Todorov V, Filzmoser P (2009). An Object-Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1-47. URL http://www.jstatsoft.org/v32/i03/.
Calculates the right medcouple (RMC).
The function is called by KRDetect.outliers.changepoint and is not intended for use by regular users of the package.
mc.right(x)
x |
a numeric vector of data values. |
This function computes right medcouple (RMC). The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function.
A numeric value giving right medcouple
Brys G, Hubert M, Struyf A (2008). Goodness-of-fit tests based on a robust measure of skewness. Computational Statistics, 23(3), 429–442.
Todorov V, Filzmoser P (2009). An Object-Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1-47. URL http://www.jstatsoft.org/v32/i03/.
Performs a robust medcouple test to evaluate the fit of the data to a normal distribution.
The function is called by KRDetect.outliers.changepoint and is not intended for use by regular users of the package.
mc.test(x, alpha = 0.05)
x |
a numeric vector of data values. |
alpha |
numeric value giving test significance level. |
This function performs a robust medcouple test based on the left and right medcouple (LMC and RMC). The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for robust testing of normality.
A list is returned with elements:
test.stat |
a numeric value giving the value of the test statistic |
crit |
a numeric vector of critical values defining the rejection region of the test |
Brys G, Hubert M, Struyf A (2008). Goodness-of-fit tests based on a robust measure of skewness. Computational Statistics, 23(3), 429–442.
Todorov V, Filzmoser P (2009). An Object-Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1-47. URL http://www.jstatsoft.org/v32/i03/.
Moment estimates of shape and scale parameters of GP distribution using the approach presented in (de Haan and Ferreira, 2006).
The function is called by KRDetect.outliers.EV and is not intended for use by regular users of the package.
Moment.gpd.fit(x, k = round(length(x) * 0.1))
x |
a numeric vector of observations. |
k |
a positive integer giving the number of top rank statistics (de Haan and Ferreira, 2006). Default is |
This function computes the moment estimates of shape and scale parameters of GP distribution (de Haan and Ferreira, 2006).
The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function used within KRDetect.outliers.EV.
A numeric vector giving the moment estimates of the scale and shape parameters, respectively, and a numeric vector giving the standard deviations of the scale and shape parameter estimates, respectively.
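The moment estimator of de Haan and Ferreira (2006) can be sketched in a few lines of base R. The function `moment_gpd` below is a hypothetical helper written for this illustration, not the package's Moment.gpd.fit: it assumes strictly positive data (logarithms are taken), returns only point estimates, and omits the standard deviations.

```r
# Hypothetical helper: moment estimates of the GP scale and shape parameters
# based on the k largest order statistics of positive data.
moment_gpd <- function(x, k = round(length(x) * 0.1)) {
  xs <- sort(x, decreasing = TRUE)
  thr <- xs[k + 1]                        # (k + 1)-th largest value, X_{n-k,n}
  le <- log(xs[1:k]) - log(thr)           # log-excesses of the k largest values
  M1 <- mean(le)
  M2 <- mean(le^2)
  gamma.minus <- 1 - 0.5 / (1 - M1^2 / M2)
  shape <- M1 + gamma.minus               # moment estimate of the shape
  scale <- thr * M1 * (1 - gamma.minus)   # moment estimate of the scale
  c(scale = scale, shape = shape)
}

set.seed(3)
x <- runif(5000)^(-0.5)   # Pareto-type sample with true shape 0.5 (assumption)
est <- moment_gpd(x, k = 500)
est
```

The choice of `k` trades bias against variance: a small `k` uses only the most extreme values and gives noisy estimates, while a large `k` reaches into the bulk of the data where the GP approximation may no longer hold.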
de Haan, L, Ferreira, A (2006). Extreme Value Theory: An Introduction. Springer.
An empirical mean residual life plot (Coles, 2001), including confidence intervals, is produced based on maximum likelihood or moment estimates.
MRL.plot(x, umin = quantile(na.omit(x), probs = 0.8), umax = quantile(na.omit(x), probs = 0.95), kmin = round(length(na.omit(x)) * 0.05), kmax = round(length(na.omit(x)) * 0.2), nint = 100, conf = 0.95, est.method = "mle", u0 = NULL, k0 = NULL)
x |
data values. Supported data types
|
umin |
the minimum threshold at which the mean residual life function is calculated based on maximum likelihood estimates. Default is |
umax |
the maximum threshold at which the mean residual life function is calculated based on maximum likelihood estimates. Default is |
kmin |
the minimum number of largest order statistics for which the mean residual life function is calculated based on moment estimates. Default is |
kmax |
the maximum number of largest order statistics for which the mean residual life function is calculated based on moment estimates. Default is |
nint |
the number of points at which the mean residual life function is calculated. Default is |
conf |
the confidence coefficient for the confidence intervals depicted in the plot. Default is |
est.method |
a character string specifying the type of estimates for the scale and shape parameters of GP distribution. Possible options are
|
u0 |
a numeric value giving the threshold meant for a GP approximation of the threshold exceedances. Default is |
k0 |
a numeric value giving the number |
The function constructs an MRL plot (Coles, 2001) based on maximum likelihood or moment estimates of the parameters of the GP distribution.
The MRL, i.e. the estimates of the mean excess, is expected to change linearly with the threshold at levels for which the GP model is appropriate.
If u0 (or k0, respectively) is given, a GP mean-threshold dependency line is plotted in addition to the MRL plot (Coles, 2001; Eq. 4.9).
Each of the lines gives the user an option to assess the suitability of u0 or k0 as a lower bound for the threshold exceedances (for u0) or the number of upper order statistics (for k0) used to fit the GP distribution.
If est.method = "mle" and u0 is given, the theoretical GP mean is estimated using the maximum likelihood estimates of the GP parameters. If est.method = "moment" and k0 is given, the theoretical GP mean is estimated using the moment estimates.
If est.method = "moment", the value x(n-k) on the x-axis of the MRL plot denotes the (k + 1)-th largest observation of the total number of n observations.
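The quantity behind the plot, the empirical mean excess over a threshold, can be computed directly. The base R sketch below evaluates it on a grid of thresholds with normal-approximation confidence intervals; the simulated data, the grid of 20 points and all object names are assumptions, and MRL.plot itself may construct its intervals differently.

```r
set.seed(7)
x <- rexp(2000)           # stand-in for smoothing residuals (assumption)

# Thresholds between the 80% and 95% sample quantiles, mirroring the
# documented defaults of umin and umax.
u.grid <- seq(quantile(x, 0.8, names = FALSE),
              quantile(x, 0.95, names = FALSE),
              length.out = 20)

conf <- 0.95
z <- qnorm(1 - (1 - conf) / 2)

mrl <- t(sapply(u.grid, function(u) {
  exc <- x[x > u] - u                     # excesses over the threshold u
  se <- sd(exc) / sqrt(length(exc))
  c(u = u,
    mean.excess = mean(exc),
    lower = mean(exc) - z * se,
    upper = mean(exc) + z * se)
}))
head(mrl)
```

For exponential data the mean excess is constant in the threshold, which is the linear behaviour (with zero slope) expected of a GP tail with zero shape parameter.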
Theo Gasser, Alois Kneip & Walter Koehler (1991) A flexible and fast method for automatic smoothing. Journal of the American Statistical Association 86, 643-652. https://doi.org/10.2307/2290393
E. Herrmann (1997) Local bandwidth choice in kernel regression estimation. Journal of Computational and Graphical Statistics 6, 35-54.
Herrmann E, Maechler M (2013). lokern: Kernel Regression Smoothing with Local or Global Plug-in Bandwidth. R package version 1.1-5, URL http://CRAN.R-project.org/package=lokern.
Gasser, T, Muller, H-G, Mammitzsch, V (1985). Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, B Met., 47(2), 238-252.
Coles, S (2001). An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag, London, U.K., 208pp.
de Haan, L, Ferreira, A (2006). Extreme Value Theory: An Introduction. Springer.
data("mydata", package = "openair")
x = mydata$o3[format(mydata$date, "%m %Y") == "12 2002"]
res = smoothing(y = x)$residuals
MRL.plot(res)
Identification of outlier data values on individual homogeneous segments using quantiles of normal distribution.
The function is called by KRDetect.outliers.changepoint and is not intended for use by regular users of the package.
normal.distr.quantiles.detect(x, cp.segment, alpha.default)
x |
a numeric vector of data. |
cp.segment |
an integer membership vector for individual segments. |
alpha.default |
a numeric value from interval (0,1) of alpha parameter determining the criterion for outlier detection:
the limits for outlier observations on individual segments are set as +/- (alpha/2-quantile of normal distribution with parameters corresponding to data on studied segment) * (sample standard deviation of data on corresponding segment)
If |
This function detects outlier observations on individual segments using quantiles of the normal distribution. The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function for identification of outlier residuals.
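A minimal base R sketch of segment-wise detection with normal quantiles is given below. It assumes that the residuals are centred around zero on every segment and uses the standard normal quantile multiplied by the sample standard deviation of the segment as the limit. The helper `detect_by_segment`, the fixed `alpha` and the simulated data are assumptions, and the package function may set the limits and choose alpha differently.

```r
# Hypothetical helper: flag values outside +/- z_(1 - alpha/2) * sd(segment).
detect_by_segment <- function(x, cp.segment, alpha = 0.05) {
  outlier <- logical(length(x))
  for (s in unique(cp.segment)) {
    idx <- cp.segment == s
    limit <- qnorm(1 - alpha / 2) * sd(x[idx])   # segment-specific limit
    outlier[idx] <- abs(x[idx]) > limit
  }
  outlier
}

set.seed(9)
x <- c(rnorm(100, sd = 1), rnorm(100, sd = 5))   # two segments, different spread
x[50] <- 10                                      # gross error in the first segment
cp.segment <- rep(1:2, each = 100)

out <- detect_by_segment(x, cp.segment, alpha = 0.05)
which(out)
```

Because the limit is computed separately on each segment, a value of 10 is extreme in the first segment but would be unremarkable in the second one.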
A list is returned with elements:
alpha |
a numeric vector of alpha parameters used for outlier identification on individual segments |
outlier |
a logical vector specifying the identified outliers |
Campulova M, Michalek J, Mikuska P, Bokal D (2018). Nonparametric algorithm for identification of outliers in environmental data. Journal of Chemometrics, 32, 453-463.
Plot of results obtained using functions KRDetect.outliers.changepoint, KRDetect.outliers.controlchart and KRDetect.outliers.EV for identification of outliers.
The function graphically visualizes results obtained using functions for outlier detection implemented in package envoutliers.
## S3 method for class 'KRDetect'
plot(x, show.segments = TRUE, plot.type = "all", xlab = "index", ylab = "data values", ...)
x |
a KRDetect object obtained as an output of function |
show.segments |
a logical variable specifying if vertical lines representing individual segments are plotted. Only required for results obtained using |
plot.type |
a type of plot with outliers displayed. Possible options for
Possible options for
|
xlab |
a title for the x axis |
ylab |
a title for the y axis |
... |
further arguments to be passed to the |
This function plots the outliers identified by function KRDetect.outliers.changepoint, KRDetect.outliers.controlchart or KRDetect.outliers.EV implemented in package envoutliers.
data("mydata", package = "openair")
x = mydata$o3[format(mydata$date, "%m %Y") == "12 2002"]
result = KRDetect.outliers.EV(x)
plot(result)
Estimation of return level for a given threshold value using Peaks Over Threshold model.
The function is called by KRDetect.outliers.EV and is not intended for use by regular users of the package.
return.level.est(r, u, sigma_u, xi, lambda_u, theta)
r |
a return period. |
u |
a threshold value. |
sigma_u |
a scale parameter of Generalized Pareto distribution. |
xi |
a shape parameter of Generalized Pareto distribution. |
lambda_u |
a relative frequency of the number of threshold value exceedances. |
theta |
an extremal index. |
This function computes the estimate of return level for a given threshold value using Peaks Over Threshold model.
The function is exported for developer use only. It does not perform any checks on inputs since it is only a convenience function used within KRDetect.outliers.EV.
A numeric value giving the return level corresponding to return period r
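For orientation, the textbook Peaks Over Threshold return level (Coles, 2001) with an extremal index can be written as a short function. The sketch below is an assumption about the form of the computation, not the source of return.level.est. The helper name `return_level` is hypothetical, and its arguments simply mirror the documented ones.

```r
# Hypothetical helper: level exceeded on average once every r observations
# under a GP model for exceedances of the threshold u.
return_level <- function(r, u, sigma_u, xi, lambda_u, theta) {
  m <- r * lambda_u * theta
  if (abs(xi) < 1e-8) {
    u + sigma_u * log(m)                  # limiting case of zero shape
  } else {
    u + sigma_u / xi * (m^xi - 1)
  }
}

x.r <- return_level(r = 120, u = 1, sigma_u = 2, xi = 0.5,
                    lambda_u = 0.1, theta = 1)

# Consistency check for theta = 1: the GP tail probability of exceeding the
# return level should equal 1 / r.
p.exceed <- 0.1 * (1 + 0.5 * (x.r - 1) / 2)^(-1 / 0.5)
all.equal(p.exceed, 1 / 120)
```

An extremal index `theta` below 1 reflects clustering of extremes and lowers the return level for a given return period.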
Coles S (2001). An Introduction to Statistical Modeling of Extreme Values. 3 edition. London: Springer. ISBN 1-85233-459-2.
Pickands J (1975). Statistical inference using extreme order statistics. The Annals of Statistics, 3(1), 119-131.
Control of the number of data values on individual segments. If the number of data values on a segment is too small, the segment is (provided that certain conditions are met) merged with the previous one. The first segment can be merged with the second one.
segment.length.control(index, x, cp.segment, min.segment.length, segment.length.for.merge)
index |
a numeric vector of design points. |
x |
a numeric vector of data. |
cp.segment |
an integer membership vector for individual segments. |
min.segment.length |
a numeric value giving minimal required number of observations on segments from changepoint analysis.
If a segment contains less than |
segment.length.for.merge |
a numeric value giving the minimal required number of observations on segments for performing the homogeneity test within changepoint split control.
A segment with fewer data than |
Control of data splitting into segments.
If a segment contains fewer observations than a number specified by the user and the variances of the data on the segment and on the previous one are equal according to the robust version of Levene's test, the segment is merged with the previous one.
Analogously, the first segment can be merged with the second one.
The user can also specify a minimum length of a segment for performing the homogeneity test. A segment with fewer data than this minimal length is merged with the previous one without testing the homogeneity of variances.
The function is called by KRDetect.outliers.changepoint and is not intended for use by regular users of the package.
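The merge rule described above can be sketched in base R as follows. The robust Levene test is implemented here in its median-based (Brown-Forsythe) form as a one-way ANOVA on absolute deviations from the group medians. The helpers `bf_pvalue` and `merge_with_previous` and the significance level 0.05 are assumptions for this illustration, and the package may implement the test and the decision differently.

```r
# Hypothetical helper: p-value of the median-based Levene test for two samples.
bf_pvalue <- function(a, b) {
  dev <- c(abs(a - median(a)), abs(b - median(b)))
  grp <- factor(rep(c("a", "b"), c(length(a), length(b))))
  anova(lm(dev ~ grp))[["Pr(>F)"]][1]
}

# Hypothetical helper: should segment `seg` be merged with `prev`?
merge_with_previous <- function(prev, seg, min.segment.length,
                                segment.length.for.merge, alpha = 0.05) {
  if (length(seg) >= min.segment.length) return(FALSE)       # long enough
  if (length(seg) < segment.length.for.merge) return(TRUE)   # merged untested
  bf_pvalue(prev, seg) > alpha     # merge only if variances look homogeneous
}

set.seed(5)
prev <- rnorm(50)
merge_with_previous(prev, rnorm(3), 10, 5)            # very short segment
merge_with_previous(prev, rnorm(20), 10, 5)           # long enough, kept
merge_with_previous(prev, rnorm(8, sd = 50), 10, 5)   # short, variance differs
```

The two length thresholds separate segments that are too short to test at all from those that are short but still large enough for a meaningful comparison of variances.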
An integer membership vector for individual segments
Fox J (2016). Applied regression analysis and generalized linear models. 3 edition. Los Angeles: SAGE. ISBN 9781452205663.
Nonparametric estimation of regression function using kernel regression with local or global data-adaptive plug-in bandwidth and optimal kernels.
smoothing(x = c(1:length(y)), y, bandwidth.type = "local", bandwidth.value = NULL, kernel.order = 2)
x |
data values. Supported data types
|
y |
a numeric vector of data values. |
bandwidth.type |
a character string specifying the type of bandwidth. Possible options are
|
bandwidth.value |
a local bandwidth array (for |
kernel.order |
a nonnegative integer giving the order of the optimal kernel (Gasser et al., 1985) used for smoothing. Possible options are
|
This function computes an estimate of the kernel regression function using a local or global data-adaptive plug-in algorithm and optimal kernels (Gasser et al., 1985).
A list is returned with elements:
data.smoothed |
a numeric vector of estimates of the kernel regression function (smoothed data). |
residuals |
a numeric vector of smoothing residuals |
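The relation between the two returned elements can be illustrated with base R alone. The sketch below is not the plug-in smoother used by the package: it swaps in stats::ksmooth with a fixed, hand-picked bandwidth (an assumption) on simulated data, only to show that the residuals are the data minus the smoothed values.

```r
# Stand-in for the plug-in smoother: Nadaraya-Watson estimate with a
# fixed bandwidth chosen by hand (the package selects it adaptively).
set.seed(42)
index <- 1:200
y <- sin(index / 20) + rnorm(200, sd = 0.2)

fit <- ksmooth(index, y, kernel = "normal", bandwidth = 15,
               x.points = index)
data.smoothed <- fit$y            # estimate of the regression function
residuals <- y - data.smoothed    # smoothing residuals

head(cbind(y, data.smoothed, residuals))
```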
Gasser T, Kneip A, Kohler W (1991). A flexible and fast method for automatic smoothing. Journal of the American Statistical Association, 86, 643-652.
Herrmann E (1997). Local bandwidth choice in kernel regression estimation. Journal of Computational and Graphical Statistics, 6(1), 35-54.
Gasser, T, Müller, H-G, Mammitzsch, V (1985). Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, B Met., 47(2), 238-252.
Eva Herrmann; Packaged for R and enhanced by Martin Maechler (2016). lokern: Kernel Regression Smoothing with Local or Global Plug-in Bandwidth. R package version 1.1-8. https://CRAN.R-project.org/package=lokern
data("mydata", package = "openair")
x = mydata$o3[format(mydata$date, "%m %Y") == "12 2002"]
smoothed = smoothing(y = x)
smoothed$data.smoothed
smoothed$residuals
A stability plot for maximum likelihood or moment estimates of the GP parameters (Coles, 2001), including confidence intervals, at a range of thresholds or number of the largest observations.
stability.plot(x, umin = quantile(na.omit(x), probs = 0.8), umax = quantile(na.omit(x), probs = 0.95), kmin = round(length(na.omit(x)) * 0.05), kmax = round(length(na.omit(x)) * 0.2), nint = 100, conf = 0.95, est.method = "mle", u0 = NULL, k0 = NULL)
x |
data values. Supported data types
|
umin |
the minimum threshold at which the mean residual life function is calculated. Default is |
umax |
the maximum threshold at which the mean residual life function is calculated. Default is |
kmin |
the minimum number of largest order statistics for which the mean residual life function is calculated based on moment estimates. Default is |
kmax |
the maximum number of largest order statistics for which the mean residual life function is calculated based on moment estimates. Default is |
nint |
the number of points at which the mean residual life function is calculated. Default is |
conf |
the confidence coefficient for the confidence intervals depicted in the plot. Default is |
est.method |
a character string specifying the type of estimates for the scale and shape parameters of GP distribution. Possible options are
|
u0 |
a numeric value giving the threshold meant for a GP approximation of the threshold exceedances. Default is |
k0 |
a numeric value giving the number |
The function estimates the GP parameters at a range of thresholds (if est.method = "mle") or a range of upper order statistics (if est.method = "moment"), and shows the sample paths of the estimates.
At threshold levels where the GP model is appropriate, the estimates of the shape parameter are expected to be constant and the estimates of the scale parameter are expected to change linearly with the threshold.
If u0 (or k0, respectively) is given, threshold-dependency lines for the particular parameters are plotted as well. The lines let the user assess the suitability of u0 or k0 as a lower bound for the threshold exceedances (for u0) or the number of upper order statistics (for k0) to fit the GP distribution.
If est.method = "mle" and u0 is given, the theoretical dependency lines for the parameters are evaluated on the basis of maximum likelihood estimates. If est.method = "moment" and k0 is given, the dependency lines are estimated using the moment estimators.
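For the threshold case, the dependency described above follows from the threshold-stability property of the GP distribution (Coles, 2001): if exceedances over u0 follow a GP distribution with scale sigma0 and shape xi, exceedances over u > u0 follow a GP distribution with the same shape and scale sigma0 + xi * (u - u0). A minimal sketch, with a hypothetical helper name and made-up parameter values; how exactly the package draws its lines is not stated here.

```r
# Theoretical parameter values above a reference threshold u0, given the
# GP scale and shape at u0 (threshold-stability property).
gp_dependency_lines <- function(u, u0, scale0, shape0) {
  data.frame(
    threshold = u,
    shape = rep(shape0, length(u)),       # constant in u
    scale = scale0 + shape0 * (u - u0)    # linear in u
  )
}

# Illustrative values only.
lines.df <- gp_dependency_lines(u = seq(10, 20, by = 2.5), u0 = 10,
                                scale0 = 4, shape0 = 0.2)
lines.df
```

Estimates that follow these lines for thresholds above u0 support u0 as a threshold for the GP fit.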
If est.method = "moment", the value x(n-k) on the x-axis of the MRL plot denotes the (k + 1)-th largest observation of the total number of n observations.
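The order-statistic notation can be checked on a small made-up vector: with the data sorted in ascending order, x(n-k) is the (k + 1)-th largest value, and exactly k observations lie above it when there are no ties.

```r
x <- c(3.1, 7.4, 1.2, 9.8, 5.5, 4.0)   # illustrative data
n <- length(x)
k <- 2

x.sorted <- sort(x)                     # x(1) <= ... <= x(n)
x.nk <- x.sorted[n - k]                 # x(n-k)
kth.largest <- sort(x, decreasing = TRUE)[k + 1]

x.nk == kth.largest                     # both pick the same observation
sum(x > x.nk)                           # number of observations above it
```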
Theo Gasser, Alois Kneip & Walter Koehler (1991) A flexible and fast method for automatic smoothing. Journal of the American Statistical Association 86, 643-652. https://doi.org/10.2307/2290393
E. Herrmann (1997) Local bandwidth choice in kernel regression estimation. Journal of Computational and Graphical Statistics 6, 35-54.
Herrmann E, Maechler M (2013). lokern: Kernel Regression Smoothing with Local or Global Plug-in Bandwidth. R package version 1.1-5, URL http://CRAN.R-project.org/package=lokern.
Gasser, T, Müller, H-G, Mammitzsch, V (1985). Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, B Met., 47(2), 238-252.
Coles, S (2001). An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag, London, U.K., 208pp.
de Haan, L, Ferreira, A (2006). Extreme Value Theory: An Introduction. Springer.
data("mydata", package = "openair")
x = mydata$o3[format(mydata$date, "%m %Y") == "12 2002"]
res = smoothing(y = x)$residuals
stability.plot(res)
Summary of results obtained using functions KRDetect.outliers.changepoint, KRDetect.outliers.controlchart and KRDetect.outliers.EV for identification of outliers.
## S3 method for class 'KRDetect'
summary(object, ...)
object |
a KRDetect object obtained as an output of function |
... |
further arguments to be passed to the |
The function summarizes the results obtained using functions KRDetect.outliers.changepoint, KRDetect.outliers.controlchart and KRDetect.outliers.EV for identification of outliers.
data("mydata", package = "openair")
x = mydata$o3[format(mydata$date, "%m %Y") == "12 2002"]
result = KRDetect.outliers.changepoint(x)
summary(result)
result = KRDetect.outliers.controlchart(x)
summary(result)
result = KRDetect.outliers.EV(x)
summary(result)