Title: | Efficient and Flexible Data Preprocessing Tools |
---|---|
Description: | Efficiently and flexibly preprocess data using a set of tools for data filtering, deletion, and interpolation. These preprocessing methods are built on the principles of completeness, accuracy, threshold setting, and linear interpolation, and on constraint conditions, time completion and recovery, and fast, efficient calculation and grouping. Key preprocessing steps include deletion of variables and observations, outlier removal, and interpolation of missing values (NA), depending on how incomplete and dispersed the raw data are. Compared with ordinary methods, they clean data more accurately, keep more samples, and introduce no new outliers after interpolation. Automatic identification of consecutive NAs via run-length based grouping is used in observation deletion, outlier removal, and NA interpolation; thus, no new outliers are generated in interpolation. Conditional extremum is proposed to realize point-by-point weighted outlier removal that saves non-outliers from being removed. In addition, time-series interpolation with values to refer to within short periods further ensures reliable interpolation. These methods are based on and improved from the reference: Liang, C.-S., Wu, H., Li, H.-Y., Zhang, Q., Li, Z. & He, K.-B. (2020) <doi:10.1016/j.scitotenv.2020.140923>. |
Authors: | Chun-Sheng Liang <[email protected]>, Hao Wu, Hai-Yan Li, Qiang Zhang, Zhanqing Li, Ke-Bin He, Lanzhou University, Tsinghua University |
Maintainer: | Chun-Sheng Liang <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.5 |
Built: | 2025-02-15 06:48:57 UTC |
Source: | CRAN |
Care is needed when dealing with outliers, which, besides missing values, are common real-life phenomena in data. Unfortunately, a one-for-all threshold method may remove many non-outliers; this is largely avoided if a one-by-one consideration is developed and applied. The condextr function proposed here considers every value (point) that may potentially be removed, combining constraint conditions and extremum (maximum and minimum). It is therefore a function for point-by-point weighted outlier removal by conditional extremum. Observation deletion is combined into the outlier-removal process, since large gaps consisting of excessive missing values may form in the time series after certain outliers are removed.
condextr(data, start = NULL, end = NULL, group = NULL, top = 0.995, top.error = 0.1, top.magnitude = 0.2, bottom = 0.0025, bottom.error = 0.2, bottom.magnitude = 0.4, interval = 10, by = "min", half = 30, times = 10, cores = NULL)
data |
A data frame containing outliers (and missing values). Its columns from start to end will be processed. |
start |
The column number of the first selected variable. |
end |
The column number of the last selected variable. |
group |
The column number of the grouping variable. It can be selected according to whether the data needs to be processed in groups. If grouping is not required, leave it default (NULL); if grouping is required, set it to the corresponding column number. |
top |
The top percentile is 0.995 by default. |
top.error |
The top allowable error coefficient is 0.1 by default. |
top.magnitude |
The order of magnitude coefficient of the top error is 0.2 by default. |
bottom |
The bottom percentile is 0.0025 by default. |
bottom.error |
The bottom allowable error coefficient is 0.2 by default. |
bottom.magnitude |
The order of magnitude coefficient of the bottom error is 0.4 by default. |
interval |
The interval of observation deletion, i.e. the number of outlier deletions before each observation deletion, is 10 by default. |
by |
The time extension unit, by, is one minute ("min") by default. The user can specify other time units; for example, "5 min" means that the time extension unit is 5 minutes. |
half |
Half window size of the hourly moving average. It is 30 (minutes) by default, which is determined by the time extension unit, minute ("min"). |
times |
The number of observation deletions in outlier removal is 10 by default. |
cores |
The number of CPU cores. |
A point-by-point (per-value) outlier removal method based on conditional extremum is proposed, which is more advantageous than the traditional "one size fits all" percentile deletion method. Moreover, outlier removal should be done in groups when groups such as month exist, because values differ among groups.
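For contrast, here is a minimal sketch (toy data; not the package's code) of the traditional "one size fits all" percentile deletion that condextr improves on:

x <- c(rnorm(1000), 100)                 # toy data with one extreme value
hi <- quantile(x, 0.995, na.rm = TRUE)   # top threshold
lo <- quantile(x, 0.0025, na.rm = TRUE)  # bottom threshold
x[x > hi | x < lo] <- NA                 # everything beyond the thresholds is removed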
A data frame after removing outliers.
Chun-Sheng Liang <[email protected]>
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
2. Wickham, H., Francois, R., Henry, L. & Muller, K. 2017. dplyr: A Grammar of Data Manipulation. 0.7.4 ed. http://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr.
3. Wickham, H., Francois, R., Henry, L. & Muller, K. 2019. dplyr: A Grammar of Data Manipulation. R package version 0.8.3. https://CRAN.R-project.org/package=dplyr.
4. Dowle, M., Srinivasan, A., Gorecki, J., Short, T., Lianoglou, S., Antonyan, E., 2017. data.table: Extension of 'data.frame', 1.10.4-3 ed, http://r-datatable.com.
5. Dowle, M., Srinivasan, A., 2021. data.table: Extension of 'data.frame'. R package version 1.14.0. https://CRAN.R-project.org/package=data.table.
6. Wallig, M., Microsoft & Weston, S. 2020. foreach: Provides Foreach Looping Construct. R package version 1.5.0. https://CRAN.R-project.org/package=foreach.
7. Ooi, H., Corporation, M. & Weston, S. 2019. doParallel: Foreach Parallel Adaptor for the 'parallel' Package. R package version 1.0.15. https://CRAN.R-project.org/package=doParallel.
# Remove outliers by condextr after deleting observations by obsedele
# 337 observations will be deleted in obsedele(data[,c(1:4,27:61)],5,39,4).
# Further, 362 observations will be deleted by obsedele within condextr.
# Here, a smaller example is used for execution-time reasons.
# Besides, only 2 cores are used for the submission test.
condextr(obsedele(data[1:500,c(1,4,17:19)],3,5,2,cores=2),3,5,2,cores=2)
The raw data is downloaded from https://smear.avaa.csc.fi/download.
data
A data frame with 7640 observations on the following 65 variables.
date
a POSIXct
tconc
a numeric vector
TPNC
a numeric vector
monthyear
a character vector
(plus 61 further variables, each a numeric vector)
https://smear.avaa.csc.fi/download
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
data
## maybe str(data)
Calculated from the raw data that is downloaded from https://smear.avaa.csc.fi/download.
data1
A data frame with 7640 observations on the following 7 variables.
date
a POSIXct
monthyear
a character vector
Nucleation
a numeric vector
Aitken
a numeric vector
Accumulation
a numeric vector
tconc
a numeric vector
TPNC
a numeric vector
https://smear.avaa.csc.fi/download
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
data1
## maybe str(data1)
The four steps, i.e., variable deletion by varidele, observation deletion by obsedele, outlier removal by condextr, and missing value interpolation by shorvalu, can be finished in dataprep.
dataprep(data, start = NULL, end = NULL, group = NULL, optimal = FALSE, interval = 10, times = 10, fraction = 0.25, top = 0.995, top.error = 0.1, top.magnitude = 0.2, bottom = 0.0025, bottom.error = 0.2, bottom.magnitude = 0.4, by = "min", half = 30, intervals = 30, cores = NULL)
data |
A data frame containing outliers (and missing values). Its columns from start to end will be processed. |
start |
The column number of the first selected variable. |
end |
The column number of the last selected variable. |
group |
The column number of the grouping variable. It can be selected according to whether the data needs to be processed in groups. If grouping is not required, leave it default (NULL); if grouping is required, set it to the corresponding column number. |
optimal |
A Boolean to decide whether the optimal interval and times (as found by optisolu) are used. Default is FALSE. |
interval |
The interval of observation deletion, i.e. the number of outlier deletions before each observation deletion, is 10 by default. |
times |
The number of observation deletions in outlier removal is 10 by default. |
fraction |
The proportion of missing values of variables. Default is 0.25. |
top |
The top percentile is 0.995 by default. |
top.error |
The top allowable error coefficient is 0.1 by default. |
top.magnitude |
The order of magnitude coefficient of the top error is 0.2 by default. |
bottom |
The bottom percentile is 0.0025 by default. |
bottom.error |
The bottom allowable error coefficient is 0.2 by default. |
bottom.magnitude |
The order of magnitude coefficient of the bottom error is 0.4 by default. |
by |
The time extension unit, by, is one minute ("min") by default. The user can specify other time units; for example, "5 min" means that the time extension unit is 5 minutes. |
half |
Half window size of the hourly moving average. It is 30 (minutes) by default, which is determined by the time extension unit, minute ("min"). |
intervals |
The time gap for dividing periods into groups is 30 (minutes) by default. This confines the interpolation to short periods so that each interpolation has observed value(s) to refer to within every half hour. |
cores |
The number of CPU cores. |
If optimal = TRUE, considerably more time is needed to finish the data preprocessing, but it gives a better final result.
A preprocessed data frame after variable deletion, observation deletion, outlier removal, and missing value interpolation.
Chun-Sheng Liang <[email protected]>
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
2. Wickham, H., Francois, R., Henry, L. & Muller, K. 2017. dplyr: A Grammar of Data Manipulation. 0.7.4 ed. http://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr.
3. Wickham, H., Francois, R., Henry, L. & Muller, K. 2019. dplyr: A Grammar of Data Manipulation. R package version 0.8.3. https://CRAN.R-project.org/package=dplyr.
4. Dowle, M., Srinivasan, A., Gorecki, J., Short, T., Lianoglou, S., Antonyan, E., 2017. data.table: Extension of 'data.frame', 1.10.4-3 ed, http://r-datatable.com.
5. Dowle, M., Srinivasan, A., 2021. data.table: Extension of 'data.frame'. R package version 1.14.0. https://CRAN.R-project.org/package=data.table.
6. Wallig, M., Microsoft & Weston, S. 2020. foreach: Provides Foreach Looping Construct. R package version 1.5.0. https://CRAN.R-project.org/package=foreach.
7. Ooi, H., Corporation, M. & Weston, S. 2019. doParallel: Foreach Parallel Adaptor for the 'parallel' Package. R package version 1.0.15. https://CRAN.R-project.org/package=doParallel.
8. Zeileis, A. & Grothendieck, G. 2005. zoo: S3 infrastructure for regular and irregular time series. Journal of Statistical Software, 14(6):1-27.
9. Zeileis, A., Grothendieck, G. & Ryan, J.A. 2019. zoo: S3 Infrastructure for Regular and Irregular Time Series (Z's Ordered Observations). R package version 1.8-6. https://cran.r-project.org/web/packages/zoo/.
dataprep::varidele, dataprep::obsedele, dataprep::condextr, dataprep::shorvalu, and dataprep::optisolu
# Combine 4 steps in one function
# In dataprep(data,5,65,4), 26 variables, 1097 outliers, and 699 observations will be deleted
# Besides, 6012 missing values will be replaced
# Setting optimal=TRUE will give an optimized result
# Here, a smaller example is used for execution-time reasons
dataprep(data[1:60,c(1,4,18:19)],3,4,2,
  interval=2,times=1,cores=2)
# Check that the results are the same
identical(shorvalu(condextr(obsedele(varidele(
  data[1:60,c(1,4,18:19)],3,4),3,4,2,cores=2),3,4,2,
  interval=2,times=1,cores=2),3,4),
  dataprep(data[1:60,c(1,4,18:19)],3,4,2,
  interval=2,times=1,cores=2))
It describes data using base R functions, without calling other packages, which avoids redundant calculations and makes it faster.
descdata(data, start = NULL, end = NULL, stats= 1:9, first = "variables")
data |
A data frame to describe, from the column start to the column end. |
start |
The column number of the first variable to describe. |
end |
The column number of the last variable to describe. |
stats |
Selecting or rearranging the items from the 9 statistics, i.e., n, na, mean, sd, median, trimmed, min, max, and IQR. It can be a vector or a single value, in 'character' or 'numeric' class. |
first |
The name of the first column of the output. It is the general name of the items (variables). |
This function can be used for different types of data, as long as the variables are numeric. Because it describes the data frame from the column start to the column end, the variables need to be arranged contiguously rather than scattered.
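As a hedged base-R sketch (not the package's implementation), the nine statistics can be reproduced roughly like this:

desc_one <- function(v) c(
  n = sum(!is.na(v)), na = sum(is.na(v)),
  mean = mean(v, na.rm = TRUE), sd = sd(v, na.rm = TRUE),
  median = median(v, na.rm = TRUE),
  trimmed = mean(v, trim = 0.1, na.rm = TRUE),
  min = min(v, na.rm = TRUE), max = max(v, na.rm = TRUE),
  IQR = IQR(v, na.rm = TRUE))
t(sapply(mtcars[, 1:3], desc_one))  # built-in mtcars as stand-in data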
A data frame of descriptive statistics:
size |
default general name of items (variables). Users can define it via the parameter first. |
n |
number of valid cases |
na |
number of invalid cases |
mean |
mean of each item |
sd |
standard deviation |
median |
median of each item |
trimmed |
trimmed mean (with trim defaulting to .1) |
min |
minimum of each item |
max |
maximum of each item |
IQR |
interquartile range of each item |
Chun-Sheng Liang <[email protected]>
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
dataprep::descplot
# Variable names are essentially numeric
descdata(data,5,65)
# Use numbers to select statistics
descdata(data,5,65,c(2,7:9))
# Use characters to select statistics
descdata(data,5,65,c('na','min','max','IQR'))
# When variable names are of character type
descdata(data1,3,7)
# Use numbers to select statistics
descdata(data1,3,7,c(2,7:9))
# Use characters to select statistics
descdata(data1,3,7,c('na','min','max','IQR'))
It applies to original (raw) data and produces a plot describing the data with 9 statistics: n, na, mean, sd, median, trimmed, min, max, and IQR.
descplot(data, start = NULL, end = NULL, stats= 1:9, first = "variables")
data |
A data frame to describe, from the column start to the column end. |
start |
The column number of the first variable to describe. |
end |
The column number of the last variable to describe. |
stats |
Selecting or rearranging the items from the 9 statistics, i.e., n, na, mean, sd, median, trimmed, min, max, and IQR. It can be a vector or a single value, in 'character' or 'numeric' class. |
first |
The name of the first column of the output. It is the general name of the items (variables). |
This function first describes the data using descdata. Then a plot of the result is produced with the package ggplot2 (coupled with the self-defined melt or reshape2::melt to melt the intermediate data). The variables from start to end need to be arranged contiguously rather than scattered.
A plot to show the descriptive result of the data, including:
size |
default general name of items (variables). Users can define it via the parameter first. |
n |
number of valid cases |
na |
number of invalid cases |
mean |
mean of each item |
sd |
standard deviation |
median |
median of each item |
trimmed |
trimmed mean (with trim defaulting to .1) |
min |
minimum of each item |
max |
maximum of each item |
IQR |
interquartile range of each item |
Chun-Sheng Liang <[email protected]>
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
2. Wickham, H. 2007. Reshaping data with the reshape package. Journal of Statistical Software, 21(12):1-20.
3. Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. http://ggplot2.org: Springer-Verlag New York.
4. Wickham, H. 2016. ggplot2: elegant graphics for data analysis. Springer-Verlag New York.
dataprep::descdata and dataprep::melt
# Line plots for variable names that are essentially numeric
descplot(data,5,65)
# Use numbers to select statistics
descplot(data,5,65,c(2,7:9))
# Use characters to select statistics
descplot(data,5,65,c('na','min','max','IQR'))
# Bar charts for variable names of character type
descplot(data1,3,7)
# Use numbers to select statistics
descplot(data1,3,7,7:9)
# Use characters to select statistics
descplot(data1,3,7,c('min','max','IQR'))
Turn the names and values of all pending variables into two columns. The columns given in cols are excluded (inverse selection inside the function); all remaining columns are melted. After melting, the data format changes from wide to long.
melt(data, cols = NULL)
data |
A data frame to melt; all columns except those in cols will be melted. |
cols |
Inversely selected columns inside function, except which are columns waiting to be melted. |
This function (dataprep::melt) will be used when reshape2 is not installed.
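A rough base-R illustration (simplified sketch, not the package's code) of the wide-to-long reshaping that melt performs:

wide <- data.frame(id = 1:3, a = rnorm(3), b = rnorm(3))
long <- data.frame(
  id       = rep(wide$id, times = 2),
  variable = rep(c("a", "b"), each = nrow(wide)),
  value    = c(wide$a, wide$b))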
A long-format data frame from its original wide format.
Chun-Sheng Liang <[email protected]>
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
2. Wickham, H. 2007. Reshaping data with the reshape package. Journal of Statistical Software, 21(12):1-20.
# The first pending variable contains only NA
melt(data,1:4)
# Number concentrations of modes and total particles are not NA
melt(data1,1:2)
The description of varidele mentions that missing values are common in real data, but excessive missing values would lead to inaccurate information and conclusions about the data. To control and improve the quality of original data, besides deleting the variables with too many missing values, the observations in which one or more variables contain excessive consecutive missing values in the time series should also be deleted.
obsedele(data, start = NULL, end = NULL, group = NULL, by = "min", half = 30, cores = NULL)
data |
A data frame containing variables with too many consecutive missing values (NA) in time series. Its columns from start to end will be processed. |
start |
The column number of the first selected variable. |
end |
The column number of the last selected variable. |
group |
The column number of the grouping variable. It can be selected according to whether the data needs to be processed in groups. If grouping is not required, leave it default (NULL); if grouping is required, set it to the corresponding column number. |
by |
The time extension unit, by, is one minute ("min") by default. The user can specify other time units; for example, "5 min" means that the time extension unit is 5 minutes. |
half |
Half window size of the hourly moving average. It is 30 (minutes) by default, which is determined by the time extension unit, minute ("min"). Users can set its value as required. |
cores |
The number of CPU cores. |
How are observations deleted based on consecutive missing values? The idea here is to remove the observations with incomplete half-hour averages, that is, observations in which at least one variable is missing for more than half an hour. Besides the design of flexible constraints, a fast and efficient algorithm is used, which saves considerable time. Basic functions such as merge and seq (in a loop) temporarily extend the series to the full time period, and the fast and efficient data.table::rleid, data.table::rowid, and data.table::setorder realize run-length based grouping (even faster than calculating moving averages with a C++-optimized algorithm); together these quickly detect consecutive missing values. The loop can be run in parallel using the packages parallel, doParallel, and foreach. The same method is also used to delete outliers. In this way, observations with excessive consecutive missing values are deleted completely and interpolation in the time series remains reasonable.
A data frame after deleting observations with too many consecutive missing values in time series.
Chun-Sheng Liang <[email protected]>
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
2. Dowle, M., Srinivasan, A., Gorecki, J., Short, T., Lianoglou, S., Antonyan, E., 2017. data.table: Extension of 'data.frame', 1.10.4-3 ed, http://r-datatable.com.
3. Dowle, M., Srinivasan, A., 2021. data.table: Extension of 'data.frame'. R package version 1.14.0. https://CRAN.R-project.org/package=data.table.
4. Wallig, M., Microsoft & Weston, S. 2020. foreach: Provides Foreach Looping Construct. R package version 1.5.0. https://CRAN.R-project.org/package=foreach.
5. Ooi, H., Corporation, M. & Weston, S. 2019. doParallel: Foreach Parallel Adaptor for the 'parallel' Package. R package version 1.0.15. https://CRAN.R-project.org/package=doParallel.
# Select start as 27 and end as 61 according to varidele
# This selection ignores the first 22 and the last 4 variables
# (the first 22 variables dropped by varidele are not shown)
# A total of 39 variables are left (65 - 22 - 4)
# Here, a smaller example is used to save time.
# Besides, only 2 cores are used for the submission test.
obsedele(data[c(1:200,3255:3454),c(1:4,27:61)],5,39,4,cores=2)
Optimal values of interval and times for condextr, the proposed conditional-extremum-based outlier removal method, can be searched out by comparison with the traditional "one size fits all" percentile deletion method. Three parameters are used for this comparison: sample deletion ratio (SDR), outlier removal ratio (ORR), and signal-to-noise ratio (SNR).
optisolu(data, start = NULL, end = NULL, group = NULL, interval = 35, times = 10, top = 0.995, top.error = 0.1, top.magnitude = 0.2, bottom = 0.0025, bottom.error = 0.2, bottom.magnitude = 0.4, by = "min", half = 30, cores = NULL)
data |
A data frame containing outliers (and missing values). Its columns from start to end will be processed. |
start |
The column number of the first selected variable. |
end |
The column number of the last selected variable. |
group |
The column number of the grouping variable. It can be selected according to whether the data needs to be processed in groups. If grouping is not required, leave it default (NULL); if grouping is required, set it to the corresponding column number. |
interval |
The interval of observation deletion, i.e., the number of outlier deletions before each observation deletion, is 35 by default. Values from 1 to interval will be tried in the search for the optimal solution. |
times |
The number of observation deletions in outlier removal is 10 by default. Values from 1 to times will be tried in the search. |
top |
The top percentile is 0.995 by default. |
top.error |
The top allowable error coefficient is 0.1 by default. |
top.magnitude |
The order of magnitude coefficient of the top error is 0.2 by default. |
bottom |
The bottom percentile is 0.0025 by default. |
bottom.error |
The bottom allowable error coefficient is 0.2 by default. |
bottom.magnitude |
The order of magnitude coefficient of the bottom error is 0.4 by default. |
by |
The time extension unit, by, is one minute ("min") by default. The user can specify other time units; for example, "5 min" means that the time extension unit is 5 minutes. |
half |
Half window size of the hourly moving average. It is 30 (minutes) by default, which is determined by the time extension unit, minute ("min"). |
cores |
The number of CPU cores. |
The three ratios offer indices to show the quality of outlier removal methods. Besides, other parameters such as new outlier production (NOP) are also important. Since the preprocessing roadmap is based on the ideas of grouping, both flexible and strict constraints for outliers, and interpolation within short period and with effective observed values, the new outlier production is greatly restricted.
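As a loose illustration only (these formulas are assumptions for exposition, not taken from the package), the first two ratios might be computed as:

sdr <- function(n_before, n_after) 1 - n_after / n_before  # assumed sample deletion ratio
orr <- function(n_removed, n_total) n_removed / n_total    # assumed outlier removal ratio
sdr(7640, 7640 - 699)  # e.g., 699 of 7640 observations deleted (see the dataprep example)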
A data frame indicating the quality of outlier removal by condextr with different values of interval and times. A total of 9 columns are listed in it.
case |
Order of the combination of interval and times |
interval |
The interval of observation deletion, i.e., the number of outlier deletions before each observation deletion |
times |
The number of observation deletions in outlier removal |
sdr |
Sample deletion ratio (SDR) |
orr |
Outlier removal ratio (ORR) |
snr |
Signal-to-noise ratio (SNR) |
index |
Quality level of outlier removal based on the three parameters |
relaindex |
A relative form of the index |
optimal |
A Boolean variable to show if the result of conditional extremum is better, in terms of all the three parameters, than the traditional "one size fits all" percentile deletion method in deleting outliers. |
Chun-Sheng Liang <[email protected]>
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
2. Wickham, H., Francois, R., Henry, L. & Muller, K. 2017. dplyr: A Grammar of Data Manipulation. 0.7.4 ed. http://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr.
3. Wickham, H., Francois, R., Henry, L. & Muller, K. 2019. dplyr: A Grammar of Data Manipulation. R package version 0.8.3. https://CRAN.R-project.org/package=dplyr.
4. Dowle, M., Srinivasan, A., Gorecki, J., Short, T., Lianoglou, S., Antonyan, E., 2017. data.table: Extension of 'data.frame', 1.10.4-3 ed, http://r-datatable.com.
5. Dowle, M., Srinivasan, A., 2021. data.table: Extension of 'data.frame'. R package version 1.14.0. https://CRAN.R-project.org/package=data.table.
6. Wallig, M., Microsoft & Weston, S. 2020. foreach: Provides Foreach Looping Construct. R package version 1.5.0. https://CRAN.R-project.org/package=foreach.
7. Ooi, H., Corporation, M. & Weston, S. 2019. doParallel: Foreach Parallel Adaptor for the 'parallel' Package. R package version 1.0.15. https://CRAN.R-project.org/package=doParallel.
# Setting interval as 35 and times as 10 can find optimal solutions:
# optisolu(obsedele(data[c(1:4,27:61)],5,39,4),5,39,4,35,10)
# Here, a smaller example is used for execution-time reasons,
# but too small an interval and times will not find optimal solutions
optisolu(data[1:50,c(1,4,18:19)],3,4,2,2,1,cores=2)
Outliers can be preliminarily checked via the calculated top and bottom percentiles. Base R functions from the system library are used to obtain these percentiles of selected variables in data frames, instead of calling other packages, which saves time.
percdata(data, start = NULL, end = NULL, group = NULL, diff = 0.1, part = 'both')
data |
A data frame to calculate percentiles for, from the column start to the column end. |
start |
The column number of the first variable to calculate percentiles for. |
end |
The column number of the last variable to calculate percentiles for. |
group |
The column number of the grouping variable. It can be selected according to whether the data needs to be processed in groups. If grouping is not required, leave it default (NULL); if grouping is required, set it to the corresponding column number. |
diff |
The common difference between adjacent percentiles, 0.1 by default. |
part |
The option of calculating bottom and/or top percentiles (parts). Default is 'both', or 2 for both bottom and top parts. Setting it as 'bottom' or 0 for bottom part and 'top' or 1 for top part. |
The data to be processed ranges from the column start to the column end, whose column numbers are passed as arguments. This requires that the variables to be processed are arranged contiguously in the database or table; otherwise, move the columns in advance to make a contiguous arrangement.
Top (highest or greatest) and bottom (lowest or smallest) percentiles are calculated. With the default diff (= 0.1), the calculated values are as follows.
0th |
Quantile with probs = 0 |
0.1th |
Quantile with probs = 0.001 |
0.2th |
Quantile with probs = 0.002 |
0.3th |
Quantile with probs = 0.003 |
0.4th |
Quantile with probs = 0.004 |
0.5th |
Quantile with probs = 0.005 |
99.5th |
Quantile with probs = 0.995 |
99.6th |
Quantile with probs = 0.996 |
99.7th |
Quantile with probs = 0.997 |
99.8th |
Quantile with probs = 0.998 |
99.9th |
Quantile with probs = 0.999 |
100th |
Quantile with probs = 1 |
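For reference, the probabilities behind these percentiles (with the default diff = 0.1) can be reproduced on toy data with base R's quantile():

probs <- c(seq(0, 0.005, by = 0.001), seq(0.995, 1, by = 0.001))
quantile(rnorm(1000), probs = probs, na.rm = TRUE)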
Chun-Sheng Liang <[email protected]>
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
dataprep::percplot
# Select the grouping variable and the variables remaining after deletion by varidele.
# Column 4 ('monthyear') is the group and the fraction for varidele is 0.25.
# After extracting according to the result of varidele, the group is in the first column.
percdata(data[,c(4,27:61)],2,36,1)
The percentile-based outlier removal methods usually take a quantile as a threshold and values above it will be deleted. Here, two quantiles are used for both the top and the bottom. For the bottom, accordingly, values below the quantile threshold will be removed.
percoutl(data, start = NULL, end = NULL, group = NULL, top = 0.995, bottom = 0.0025, by = "min", half = 30, cores = NULL)
data |
A data frame containing outliers (and missing values). Its columns from start to end will be processed. |
start |
The column number of the first selected variable. |
end |
The column number of the last selected variable. |
group |
The column number of the grouping variable. It can be selected according to whether the data needs to be processed in groups. If grouping is not required, leave it default (NULL); if grouping is required, set it to the corresponding column number. |
top |
The top percentile is 0.995 by default. |
bottom |
The bottom percentile is 0.0025 by default. |
by |
The time extension unit, by, is one minute ("min") by default. The user can specify other time units; for example, "5 min" means that the time extension unit is 5 minutes. |
half |
Half window size of the hourly moving average. It is 30 (minutes) by default, which is determined by the time extension unit, minute ("min"). |
cores |
The number of CPU cores. |
Unlike condextr, which considers outliers point by point, the traditional percentile-based percoutl is a "one size fits all" deletion method. It may delete too many values that are not outliers, or too few values that are.
A data frame after deleting outliers.
Chun-Sheng Liang <[email protected]>
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
2. Wickham, H., Francois, R., Henry, L. & Muller, K. 2017. dplyr: A Grammar of Data Manipulation. 0.7.4 ed. http://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr.
3. Wickham, H., Francois, R., Henry, L. & Muller, K. 2019. dplyr: A Grammar of Data Manipulation. R package version 0.8.3. https://CRAN.R-project.org/package=dplyr.
4. Dowle, M., Srinivasan, A., Gorecki, J., Short, T., Lianoglou, S., Antonyan, E., 2017. data.table: Extension of 'data.frame', 1.10.4-3 ed, http://r-datatable.com.
5. Dowle, M., Srinivasan, A., 2021. data.table: Extension of 'data.frame'. R package version 1.14.0. https://CRAN.R-project.org/package=data.table.
6. Wallig, M., Microsoft & Weston, S. 2020. foreach: Provides Foreach Looping Construct. R package version 1.5.0. https://CRAN.R-project.org/package=foreach.
7. Ooi, H., Corporation, M. & Weston, S. 2019. doParallel: Foreach Parallel Adaptor for the 'parallel' Package. R package version 1.0.15. https://CRAN.R-project.org/package=doParallel.
dataprep::condextr
percoutl(obsedele(data[c(1:200,3255:3454),c(1:4,27:61)],5,39,4,cores=2),5,39,4,cores=2) # Result
The top and bottom percentiles of selected variables calculated by percdata can be plotted by percplot, which offers a vivid check of possible outliers. It uses reshape2::melt or dataprep::melt to melt the data, and ggplot2 and scales to plot it.
percplot(data, start = NULL, end = NULL, group = NULL, ncol = NULL, diff = 0.1, part = 'both')
data |
A data frame to calculate percentiles for, from the column start to the column end. |
start |
The column number of the first variable to calculate percentiles for. |
end |
The column number of the last variable to calculate percentiles for. |
group |
The column number of the grouping variable. It can be selected according to whether the data needs to be processed in groups. If grouping is not required, leave it default (NULL); if grouping is required, set it to the corresponding column number. |
ncol |
The total number of columns in the plot. |
diff |
The common difference between adjacent percentiles, 0.1 by default. |
part |
The option of plotting bottom and/or top percentiles (parts). Default is 'both', or 2 for both bottom and top parts. Setting it as 'bottom' or 0 for bottom part and 'top' or 1 for top part. |
Four scenes are considered according to the scales of the x and y axes, namely the ranges of x and y values. For example, the condition sd(diff(log(as.numeric(as.character(names(data[, start:end])))))) / mean(diff(log(as.numeric(as.character(names(data[, start:end])))))) < 0.1 & max(data[, start:end], na.rm = TRUE) / min(data[, start:end], na.rm = TRUE) >= 10^3 means that the coefficient of variation of the lagged differences of log(x) is below 0.1 and, at the same time, the maximum y is at least 1000 times the minimum y.
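A toy check of the log-spacing part of that condition (the bin names below are hypothetical, for illustration only):

bins <- c(10, 18, 32, 56, 100)  # assumed numeric variable names (size bins)
d <- diff(log(bins))
sd(d) / mean(d) < 0.1           # TRUE: names are nearly evenly spaced on a log scale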
Top (highest or greatest) and bottom (lowest or smallest) percentiles are plotted.
0th |
Quantile with probs = 0 |
0.1th |
Quantile with probs = 0.001 |
0.2th |
Quantile with probs = 0.002 |
0.3th |
Quantile with probs = 0.003 |
0.4th |
Quantile with probs = 0.004 |
0.5th |
Quantile with probs = 0.005 |
99.5th |
Quantile with probs = 0.995 |
99.6th |
Quantile with probs = 0.996 |
99.7th |
Quantile with probs = 0.997 |
99.8th |
Quantile with probs = 0.998 |
99.9th |
Quantile with probs = 0.999 |
100th |
Quantile with probs = 1 |
Chun-Sheng Liang <[email protected]>
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
2. Wickham, H. 2007. Reshaping data with the reshape package. Journal of Statistical Software, 21(12):1-20.
3. Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. http://ggplot2.org: Springer-Verlag New York.
4. Wickham, H. 2016. ggplot2: elegant graphics for data analysis. Springer-Verlag New York.
5. Wickham, H. 2017. scales: Scale Functions for Visualization. 0.5.0 ed. https://github.com/hadley/scales.
6. Wickham, H. & Seidel, D. 2019. scales: Scale Functions for Visualization. R package version 1.1.0. https://CRAN.R-project.org/package=scales.
dataprep::percdata and dataprep::melt
# Plot
percplot(data,5,65,4)
# Plot
percplot(data1,3,7,2)
Time gaps and available values are considered in NA interpolation by shorvalu. More reliable interpolation is thus realized through these constraints, together with the prior use of obsedele during outlier removal.
shorvalu(data, start, end, intervals = 30, units = 'mins')
data |
A data frame containing outliers. Its columns from start to end will be processed. |
start |
The column number of the first selected variable. |
end |
The column number of the last selected variable. |
intervals |
The time gap for dividing periods into groups is 30 (minutes) by default. This confines the interpolation to short periods so that each interpolation has observed value(s) to refer to within every half hour. |
units |
Units in time intervals/differences. It can be one of "secs", "mins", "hours", "days", or "weeks". The default is 'mins'. |
It offers a robust interpolation method that takes both time gaps and available values into account.
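A hedged sketch of the underlying idea, using zoo::na.approx (zoo is cited in the references below); the confinement to 30-minute periods is omitted here for brevity:

library(zoo)
x <- c(1, NA, 3, NA, NA, 6)
na.approx(x, na.rm = FALSE)  # 1 2 3 4 5 6: gaps filled linearly from neighbours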
A data frame with missing values being replaced linearly within short periods and with values to refer to.
Chun-Sheng Liang <[email protected]>
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
2. Wickham, H., Francois, R., Henry, L. & Muller, K. 2017. dplyr: A Grammar of Data Manipulation. 0.7.4 ed. http://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr.
3. Wickham, H., Francois, R., Henry, L. & Muller, K. 2019. dplyr: A Grammar of Data Manipulation. R package version 0.8.3. https://CRAN.R-project.org/package=dplyr.
4. Zeileis, A. & Grothendieck, G. 2005. zoo: S3 infrastructure for regular and irregular time series. Journal of Statistical Software, 14(6):1-27.
5. Zeileis, A., Grothendieck, G. & Ryan, J.A. 2019. zoo: S3 Infrastructure for Regular and Irregular Time Series (Z's Ordered Observations). R package version 1.8-6. https://cran.r-project.org/web/packages/zoo/.
shorvalu(condextr(obsedele(data[1:250,c(1,4,17:19)],3,5,2,cores=2), 3,5,2,cores=2),3,5)
Missing values often exist in real data. However, excessive missing values lead to information distortion, and inappropriate handling of them may result in inaccurate conclusions about the data. Therefore, to control and improve the quality of original data right after it has been produced by instruments, the data preprocessing method in this package introduces variable deletion to filter out and delete the variables with too many missing values.
varidele(data, start = NULL, end = NULL, fraction = 0.25)
data |
A data frame containing variables with excessive missing values. Its columns from start to end will be processed. |
start |
The column number of the first selected variable. |
end |
The column number of the last selected variable. |
fraction |
The proportion of missing values of variables. Default is 0.25. |
It operates only on the variables at the beginning and the end of the selected range, so that the remaining variables are contiguous without breaks after deletion. The deletion of variables with excessive missing values is based mainly on the proportion of missing values per variable, excluding blank observations. The default proportion is 0.25, which can be adjusted according to practical needs.
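A minimal sketch of the NA-proportion criterion (simplified; not the package's code), using the built-in airquality data:

na_frac <- sapply(airquality, function(v) mean(is.na(v)))
na_frac > 0.25  # variables whose NA proportion exceeds the default fraction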
A data frame after deleting variables with too many missing values.
Chun-Sheng Liang <[email protected]>
1. Example data is from https://smear.avaa.csc.fi/download. It includes particle number concentrations in SMEAR I Varrio forest.
# Show the first 5 and last 5 rows and columns besides the date column.
varidele(data,5,65)[c(1:5,(nrow(data)-4):nrow(data)),c(1,5:9,35:39)]
# The first 22 and the last 4 variables are deleted with the NA proportion of 0.25.
# Increasing the NA proportion keeps more variables.
# Proportions 0.4 and 0.5 give the same result for this example data:
# 2 more variables at the beginning and 1 more at the end are kept.
varidele(data,5,65,.5)[c(1:5,(nrow(data)-4):nrow(data)),c(1,5:9,38:42)]
# Setting the proportion to 0.6 keeps 3 more variables at the beginning
# than proportions 0.4 and 0.5 do.
varidele(data,5,65,.6)[c(1:5,(nrow(data)-4):nrow(data)),c(1,5:9,41:45)]
Zeros are unsuitable on a logarithmic scale and should be removed before plotting.
zerona(x)
x |
A data frame, matrix, or vector containing zeros. |
A data frame, matrix, or vector with zeros turned into missing values.
Chun-Sheng Liang <[email protected]>
zerona(0:5)
zerona(cbind(a=0:5,b=c(6:10,0)))