The noisemodel package contains the first extensive
implementation of noise models for classification datasets. It provides
72 noise models found in the specialized literature that allow errors to
be introduced in different ways in class labels, attributes or both in
combination. Each of them is properly documented and referenced, and
their results are unified through a specific S3 class, which benefits from
customized print, summary and plot methods.
The noisemodel package can be installed in R from CRAN using the command:
install.packages("noisemodel")
This command also installs all the dependencies that the noise models need to operate. To access the functions of the package, load it with the R command:
library(noisemodel)
All the information corresponding to each noise model can be consulted
from the CRAN website. Additionally, the help() command can be used.
For example, to check the documentation of the Symmetric uniform label
noise model, we can use the command:
help(sym_uni_ln)
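To introduce noise in a dataset, each noise model in the noisemodel package provides two standard ways of use:
Default method: the input attributes are passed in x and the output class in y.
Method for class formula: the variables are described with a formula and the dataset is passed in data.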
An example of how to use these two methods for introducing noise in the
iris2D dataset with the sym_uni_ln model is shown below:
# load the dataset
data(iris2D)
# usage of the default method
set.seed(9)
outdef <- sym_uni_ln(x = iris2D[,-ncol(iris2D)], y = iris2D[,ncol(iris2D)], level = 0.1)
# show results
summary(outdef, showid = TRUE)
#>
#> ########################################################
#> Noise introduction process: Summary
#> ########################################################
#>
#> ## Original call:
#> sym_uni_ln(x = iris2D[, -ncol(iris2D)], y = iris2D[, ncol(iris2D)], level = 0.1)
#>
#> ## Noise model:
#> Symmetric uniform label noise
#>
#> ## Parameters:
#> - level = 0.1
#> - sortid = TRUE
#>
#> ## Number of noisy and clean samples:
#> - Noisy samples: 10/101 (9.90%)
#> - Clean samples: 91/101 (90.10%)
#>
#> ## Number of noisy samples per class label:
#> - Class setosa: 3/22 (13.64%)
#> - Class versicolor: 4/35 (11.43%)
#> - Class virginica: 3/44 (6.82%)
#>
#> ## Number of clean samples per class label:
#> - Class setosa: 19/22 (86.36%)
#> - Class versicolor: 31/35 (88.57%)
#> - Class virginica: 41/44 (93.18%)
#>
#> ## Indices of noisy samples:
#> - Output class: 3, 6, 12, 24, 30, 48, 53, 59, 83, 101
plot(outdef)
# usage of the method for class formula
set.seed(9)
outfrm <- sym_uni_ln(formula = Species ~ ., data = iris2D, level = 0.1)
# check the match of noisy indices
identical(outdef$idnoise, outfrm$idnoise)
#> [1] TRUE
Note that the $ operator is used to access the elements
returned by the noise model in the objects outdef and outfrm.
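As a brief illustration of this kind of access, the sketch below repeats the default call from the example above and then uses $ to compare the original and noisy labels. It relies only on the elements idnoise, numnoise and ynoise that appear in the output shown in this document. The variable names noisy_ids and changes are chosen here for illustration.

```r
library(noisemodel)

# Repeat the default call from the example above.
data(iris2D)
set.seed(9)
outdef <- sym_uni_ln(x = iris2D[, -ncol(iris2D)],
                     y = iris2D[, ncol(iris2D)],
                     level = 0.1)

# idnoise is a list; its first element holds the indices of the noisy samples.
noisy_ids <- outdef$idnoise[[1]]

# Side-by-side view of the original and noisy labels for those samples.
changes <- data.frame(id = noisy_ids,
                      original = iris2D[noisy_ids, ncol(iris2D)],
                      noisy = outdef$ynoise[noisy_ids])
print(changes)

# Cross-tabulation of original against noisy labels for the whole dataset.
print(table(original = iris2D[, ncol(iris2D)], noisy = outdef$ynoise))
```

Off-diagonal cells of the cross-tabulation count the samples whose label differs from the original one.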
All noise models return an object of class ndmodel. It is
designed to unify the output value of the methods included in the
noisemodel package. The class ndmodel is a
list of elements with the most relevant information of the noise
introduction process:
xnoise: a data frame with the noisy input attributes.
ynoise: a factor vector with the noisy output class.
numnoise: an integer vector with the number of noisy samples per class.
idnoise: a list of integer vectors with the indices of noisy samples.
numclean: an integer vector with the number of clean samples per class.
idclean: a list of integer vectors with the indices of clean samples.
distr: an integer vector with the samples per class in the original data.
model: the full name of the noise introduction model used.
param: a list of the argument values.
call: the function call.
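To make the relationship between these elements concrete, the following base R sketch builds a list with the same element names for a toy dataset. It is not the package's implementation: toy_label_noise, the class name toy_ndmodel and the simplified label-flipping rule are assumptions made here for illustration only.

```r
# Hypothetical helper: flips the labels of a fraction `level` of the samples
# and records the same kind of bookkeeping as the elements listed above.
toy_label_noise <- function(x, y, level) {
  n <- length(y)
  # Choose the samples to corrupt.
  idx <- sort(sample(seq_len(n), size = round(level * n)))
  ynoise <- y
  for (i in idx) {
    # Replace the label with a different class chosen at random.
    others <- setdiff(levels(y), as.character(y[i]))
    ynoise[i] <- sample(others, 1)
  }
  # Per-class counts, stored as named integer vectors.
  as_named_int <- function(tab) setNames(as.integer(tab), names(tab))
  distr <- as_named_int(table(y))
  numnoise <- as_named_int(table(y[idx]))
  structure(
    list(xnoise = x, ynoise = ynoise,
         numnoise = numnoise, idnoise = list(idx),
         numclean = distr - numnoise,
         idclean = list(setdiff(seq_len(n), idx)),
         distr = distr, model = "Toy label flipping",
         param = list(level = level), call = match.call()),
    class = "toy_ndmodel")
}

# Toy data: two numeric attributes and a three-class label.
set.seed(1)
x <- data.frame(a = rnorm(12), b = rnorm(12))
y <- factor(rep(c("c1", "c2", "c3"), each = 4))
out <- toy_label_noise(x, y, level = 0.25)
str(unclass(out), max.level = 1)
```

In this sketch numnoise and numclean always add up to distr, and idnoise and idclean together cover every row index once.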
As an example, the structure of the ndmodel object
returned using the sym_uni_ln model is shown below:
str(outdef)
#> List of 10
#> $ xnoise :'data.frame': 101 obs. of 2 variables:
#> ..$ Petal.Length: num [1:101] 1.4 1.3 1.5 1.7 1.4 1.5 1.6 1.4 1.1 1.2 ...
#> ..$ Petal.Width : num [1:101] 0.2 0.2 0.2 0.4 0.3 0.1 0.2 0.1 0.1 0.2 ...
#> $ ynoise : Factor w/ 3 levels "setosa","versicolor",..: 1 1 2 1 1 2 1 1 1 1 ...
#> $ numnoise: Named int [1:3] 3 4 3
#> ..- attr(*, "names")= chr [1:3] "setosa" "versicolor" "virginica"
#> $ idnoise :List of 1
#> ..$ : int [1:10] 3 6 12 24 30 48 53 59 83 101
#> $ numclean: Named int [1:3] 19 31 41
#> ..- attr(*, "names")= chr [1:3] "setosa" "versicolor" "virginica"
#> $ idclean :List of 1
#> ..$ : int [1:91] 1 2 4 5 7 8 9 10 11 13 ...
#> $ distr : Named int [1:3] 22 35 44
#> ..- attr(*, "names")= chr [1:3] "setosa" "versicolor" "virginica"
#> $ model : chr "Symmetric uniform label noise"
#> $ param :List of 2
#> ..$ level : num 0.1
#> ..$ sortid: logi TRUE
#> $ call : language sym_uni_ln(x = iris2D[, -ncol(iris2D)], y = iris2D[, ncol(iris2D)], level = 0.1)
#> - attr(*, "class")= chr "ndmodel"
In order to display the results of the class ndmodel in a
friendly way in the R console, specific print, summary and plot
functions are implemented.
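For readers unfamiliar with S3 dispatch, the sketch below shows how a class-specific print method can be attached to a list-based object in base R. The class toy_ndmodel, the method print.toy_ndmodel and the values stored in obj are hypothetical; the output layout only loosely imitates the one shown in this document and is not taken from the package's source.

```r
# A hand-built object with a few of the elements described above.
obj <- structure(
  list(model = "Toy label flipping",
       param = list(level = 0.25),
       numnoise = c(c1 = 1L, c2 = 2L),
       distr = c(c1 = 4L, c2 = 4L)),
  class = "toy_ndmodel")

# S3 method: print() dispatches here for objects of class "toy_ndmodel".
print.toy_ndmodel <- function(x, ...) {
  total <- sum(x$distr)
  noisy <- sum(x$numnoise)
  cat("\n## Noise model:\n", x$model, "\n", sep = "")
  cat("\n## Parameters:\n")
  for (p in names(x$param)) {
    cat("- ", p, " = ", format(x$param[[p]]), "\n", sep = "")
  }
  cat("\n## Number of noisy and clean samples:\n")
  cat(sprintf("- Noisy samples: %d/%d (%.2f%%)\n",
              noisy, total, 100 * noisy / total))
  cat(sprintf("- Clean samples: %d/%d (%.2f%%)\n",
              total - noisy, total, 100 * (total - noisy) / total))
  # Print methods conventionally return their argument invisibly.
  invisible(x)
}

print(obj)
```

Because the method is selected from the class attribute, the same print(obj) call works for any object that carries that class.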
The print function presents the basic information about the
noise introduction process contained in an object of class ndmodel:
print(outdef)
#>
#> ## Noise model:
#> Symmetric uniform label noise
#>
#> ## Parameters:
#> - level = 0.1
#> - sortid = TRUE
#>
#> ## Number of noisy and clean samples:
#> - Noisy samples: 10/101 (9.90%)
#> - Clean samples: 91/101 (90.10%)
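The information offered by print is as follows: the name of the noise
introduction model, the parameters associated with the noise model, and
the number of noisy and clean samples in the dataset.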
On the other hand, the summary function displays the information about
the noise introduction process contained in an object of class ndmodel
together with additional details.
This function can be called by typing the following R command:
summary(outdef, showid = TRUE)
#>
#> ########################################################
#> Noise introduction process: Summary
#> ########################################################
#>
#> ## Original call:
#> sym_uni_ln(x = iris2D[, -ncol(iris2D)], y = iris2D[, ncol(iris2D)], level = 0.1)
#>
#> ## Noise model:
#> Symmetric uniform label noise
#>
#> ## Parameters:
#> - level = 0.1
#> - sortid = TRUE
#>
#> ## Number of noisy and clean samples:
#> - Noisy samples: 10/101 (9.90%)
#> - Clean samples: 91/101 (90.10%)
#>
#> ## Number of noisy samples per class label:
#> - Class setosa: 3/22 (13.64%)
#> - Class versicolor: 4/35 (11.43%)
#> - Class virginica: 3/44 (6.82%)
#>
#> ## Number of clean samples per class label:
#> - Class setosa: 19/22 (86.36%)
#> - Class versicolor: 31/35 (88.57%)
#> - Class virginica: 41/44 (93.18%)
#>
#> ## Indices of noisy samples:
#> - Output class: 3, 6, 12, 24, 30, 48, 53, 59, 83, 101
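The information offered by this function is as follows: the original
call to the noise model, the name of the noise introduction model, the
parameters associated with the noise model, the number of noisy and clean
samples in the dataset, the number of noisy samples per class label, the
number of clean samples per class label, and the indices of the noisy
samples (if showid = TRUE).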
Finally, the plot function displays a representation of the
dataset contained in an object of class ndmodel after the
application of a noise introduction model.
This function performs a two-dimensional representation using the
ggplot2 package of the dataset contained in the object
x of class ndmodel. Each of the classes in the
dataset (available in x$ynoise) is represented by a
different color.
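A comparable figure can be drawn by hand from the xnoise and ynoise elements. The sketch below is not the package's plot method; it is a minimal ggplot2 scatter plot under the assumption that xnoise has the two columns Petal.Length and Petal.Width shown in the str() output above. The names plotdata and p, and the use of point shape to mark noisy samples, are choices made here for illustration.

```r
library(noisemodel)
library(ggplot2)

# Repeat the default call from the example above.
data(iris2D)
set.seed(9)
outdef <- sym_uni_ln(x = iris2D[, -ncol(iris2D)],
                     y = iris2D[, ncol(iris2D)],
                     level = 0.1)

# Combine the noisy attributes and labels, and flag the noisy samples.
plotdata <- data.frame(outdef$xnoise, class = outdef$ynoise)
plotdata$noisy <- seq_len(nrow(plotdata)) %in% outdef$idnoise[[1]]

# One colour per class in ynoise; noisy samples get a different shape.
p <- ggplot(plotdata,
            aes(x = Petal.Length, y = Petal.Width,
                colour = class, shape = noisy)) +
  geom_point(size = 2) +
  labs(title = outdef$model, colour = "Class", shape = "Noisy sample")
print(p)
```

Building the plot from the list elements makes it easy to add layers or change the aesthetics when the default figure is not enough.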