Title: | Data Preprocessing, Discretization for Classification |
---|---|
Description: | A collection of supervised discretization algorithms. These algorithms can also be grouped as top-down or bottom-up approaches to discretization. |
Authors: | HyunJi Kim |
Maintainer: | HyunJi Kim <[email protected]> |
License: | GPL |
Version: | 1.0-1.1 |
Built: | 2024-11-16 06:26:12 UTC |
Source: | CRAN |
This package is a collection of supervised discretization algorithms. These algorithms can also be grouped as top-down or bottom-up approaches to discretization.
Package: | discretization |
Type: | Package |
Version: | 1.0-1 |
Date: | 2010-12-02 |
License: | GPL |
LazyLoad: | yes |
Maintainer: | HyunJi Kim <[email protected]> |
Choi, B. S., Kim, H. J., Cha, W. O. (2011). A Comparative Study on Discretization Algorithms for Data Mining, Communications of the Korean Statistical Society, to be published.
Chmielewski, M. R. and Grzymala-Busse, J. W. (1996). Global Discretization of Continuous Attributes as Preprocessing for Machine Learning, International journal of approximate reasoning, Vol. 15, No. 4, 319–331.
Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning, Artificial intelligence, 13, 1022–1027.
Gonzalez-Abril, L., Cuberos, F. J., Velasco, F. and Ortega, J. A. (2009). Ameva: An autonomous discretization algorithm, Expert Systems with Applications, 36, 5327–5332.
Kerber, R. (1992). ChiMerge: Discretization of numeric attributes, In Proceedings of the Tenth National Conference on Artificial Intelligence, 123–128.
Kurgan, L. A. and Cios, K. J. (2004). CAIM Discretization Algorithm, IEEE Transactions on knowledge and data engineering, 16, 145–153.
Liu, H. and Setiono, R. (1995). Chi2: Feature selection and discretization of numeric attributes, Tools with Artificial Intelligence, 388–391.
Liu, H. and Setiono, R. (1997). Feature selection and discretization, IEEE transactions on knowledge and data engineering, 9, 642–645.
Pawlak, Z. (1982). Rough Sets, International Journal of Computer and Information Sciences, vol.11, No.5, 341–356.
Su, C. T. and Hsu, J. H. (2005). An Extended Chi2 Algorithm for Discretization of Real Value Attributes, IEEE transactions on knowledge and data engineering, 17, 437–441.
Tay, F. E. H. and Shen, L. (2002). Modified Chi2 Algorithm for Discretization, IEEE Transactions on knowledge and data engineering, 14, 666–670.
Tsai, C. J., Lee, C. I. and Yang, W. P. (2008). A discretization algorithm based on Class-Attribute Contingency Coefficient, Information Sciences, 178, 714–731.
Ziarko, W. (1993). Variable Precision Rough Set Model, Journal of computer and system sciences, Vol. 46, No. 1, 39–59.
This function is required to compute the Ameva value for the Ameva discretization algorithm.
ameva(tb)
tb |
a vector of observed frequencies |
This function implements the Ameva criterion proposed in Gonzalez-Abril, Cuberos, Velasco and Ortega (2009) for discretization. The autonomous discretization algorithm (Ameva) is implemented in disc.Topdown(data, method=3).
It uses a measure based on χ² as the criterion for the optimal discretization, which has the minimum number of discrete intervals and the minimum loss of class-attribute interdependence. The algorithm uses the local maximum of the Ameva criterion as its stopping criterion.
The Ameva coefficient is defined as
Ameva(k) = χ²(k) / (k(l − 1))
for k, l ≥ 2, where k is the number of discrete intervals and l is the number of classes.
This value is calculated from the contingency table between the class variable and the discretized intervals, where each row represents a class and each column a discrete interval.
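As a rough sanity check of this definition (an illustrative sketch, not part of the package's documented examples; it assumes the reconstruction of the formula above), the Ameva value of a small quanta matrix can be compared with the package's chiSq() helper:
library(discretization)
#-- 2 classes (rows) x 3 intervals (columns) quanta matrix
m <- matrix(c(2, 5, 1,
              1, 3, 3), ncol = 3, byrow = TRUE)
k <- ncol(m)   # number of discrete intervals
l <- nrow(m)   # number of classes
ameva(m)                   # Ameva value as computed by the package
chiSq(m) / (k * (l - 1))   # should agree if Ameva = chi^2 / (k (l - 1))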
val |
numeric value of Ameva coefficient |
HyunJi Kim [email protected]
Gonzalez-Abril, L., Cuberos, F. J., Velasco, F. and Ortega, J. A. (2009) Ameva: An autonomous discretization algorithm, Expert Systems with Applications, 36, 5327–5332.
See also disc.Topdown, topdown, insert, findBest and chiSq.
#--Ameva criterion value
a=c(2,5,1,1,3,3)
m=matrix(a,ncol=3,byrow=TRUE)
ameva(m)
This function is required to compute the cacc value for the CACC discretization algorithm.
cacc(tb)
tb |
a vector of observed frequencies |
The Class-Attribute Contingency Coefficient (CACC) discretization algorithm is implemented in disc.Topdown(data, method=2).
The cacc value is defined as
cacc = sqrt( y / (y + M) ),  with  y = χ² / log(n),
where M is the total number of samples and n is the number of discretized intervals. This value is calculated from the contingency table between the class variable and the discretized intervals, where each row represents a class and each column a discrete interval.
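The relation to the χ² statistic can be made concrete with a short sketch (illustrative only, and assuming the reconstruction of the formula above):
library(discretization)
#-- 3 classes (rows) x 3 intervals (columns) quanta matrix from the example below
m <- matrix(c(3, 0, 3,
              0, 6, 0,
              0, 3, 0), ncol = 3, byrow = TRUE)
n <- ncol(m)            # number of discretized intervals
M <- sum(m)             # total number of samples
y <- chiSq(m) / log(n)
cacc(m)                 # cacc value as computed by the package
sqrt(y / (y + M))       # should agree if cacc = sqrt(y / (y + M))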
val |
numeric value of the cacc coefficient |
HyunJi Kim [email protected]
Tsai, C. J., Lee, C. I. and Yang, W. P. (2008). A discretization algorithm based on Class-Attribute Contingency Coefficient, Information Sciences, 178, 714–731.
See also disc.Topdown, topdown, insert, findBest and chiSq.
#----Calculating cacc value (Tsai, Lee, and Yang (2008))
a=c(3,0,3,0,6,0,0,3,0)
m=matrix(a,ncol=3,byrow=TRUE)
cacc(m)
This function is required to compute the CAIM value for the CAIM discretization algorithm.
caim(tb)
tb |
a vector of observed frequencies |
The Class-Attribute Interdependence Maximization (CAIM) discretization algorithm is implemented in disc.Topdown(data, method=1). The CAIM criterion measures the dependency between the class variable and the discretization scheme of an attribute, and is defined as
CAIM = (1/n) Σ_{r=1}^{n} max_r² / M_{+r}
for r = 1, …, n, where n is the number of intervals, max_r is the maximum value within the r-th column of the quanta matrix, and M_{+r} is the total number of continuous values of the attribute that are within the r-th interval (Kurgan and Cios (2004)).
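A direct computation of this criterion from a quanta matrix can be compared with caim() (an illustrative sketch of the definition above; the column maxima and column sums are taken over the interval columns):
library(discretization)
#-- 3 classes (rows) x 3 intervals (columns) quanta matrix from the example below
m <- matrix(c(3, 0, 3,
              0, 6, 0,
              0, 3, 0), ncol = 3, byrow = TRUE)
max_r <- apply(m, 2, max)   # maximum value within each interval column
M_r   <- colSums(m)         # number of values falling into each interval
n     <- ncol(m)            # number of intervals
caim(m)                     # CAIM value as computed by the package
sum(max_r^2 / M_r) / n      # should agree with the CAIM definition above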
HyunJi Kim [email protected]
Kurgan, L. A. and Cios, K. J. (2004). CAIM Discretization Algorithm, IEEE Transactions on knowledge and data engineering, 16, 145–153.
See also disc.Topdown, topdown, insert and findBest.
#----Calculating caim value
a=c(3,0,3,0,6,0,0,3,0)
m=matrix(a,ncol=3,byrow=TRUE)
caim(m)
This function performs the Chi2 discretization algorithm. The Chi2 algorithm automatically determines a proper χ² threshold that keeps the fidelity of the original numeric dataset.
chi2(data, alp = 0.5, del = 0.05)
data |
the dataset to be discretized |
alp |
significance level (α); default 0.5 |
del |
inconsistency threshold (δ); the process stops when the inconsistency rate of the discretized data exceeds this value (default 0.05) |
The Chi2 algorithm is based on the χ² statistic and consists of two phases. In the first phase, it begins with a high significance level (sigLevel) for all numeric attributes to be discretized. Each attribute is sorted according to its values, and then the following two steps are performed:
Step 1. Calculate the χ² value for every pair of adjacent intervals (at the beginning, each pattern is put into its own interval that contains only one value of an attribute).
Step 2. Merge the pair of adjacent intervals with the lowest χ² value. Merging continues until all pairs of intervals have χ² values exceeding the threshold determined by sigLevel.
The above process is repeated with a decreased sigLevel until the inconsistency rate (δ), computed by incon(), is exceeded in the discretized data (Liu and Setiono (1995)).
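Because the stopping rule is driven by the inconsistency rate, the result can be inspected with incon() (a small illustration using only the documented functions; del is left at its 0.05 default):
library(discretization)
data(iris)
disc <- chi2(iris, alp = 0.5, del = 0.05)
#-- inconsistency rate of the discretized data, to be checked against del
incon(disc$Disc.data)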
cutp |
list of cut-points for each variable |
Disc.data |
discretized data matrix |
HyunJi Kim [email protected]
Liu, H. and Setiono, R. (1995). Chi2: Feature selection and discretization of numeric attributes, Tools with Artificial Intelligence, 388–391.
Liu, H. and Setiono, R. (1997). Feature selection and discretization, IEEE transactions on knowledge and data engineering, Vol.9, no.4, 642–645.
data(iris)
#---cut-points
chi2(iris,0.5,0.05)$cutp
#--discretized dataset using Chi2 algorithm
chi2(iris,0.5,0.05)$Disc.data
This function implements the ChiMerge discretization algorithm.
chiM(data, alpha = 0.05)
data |
numeric data matrix to be discretized |
alpha |
significance level (α); default 0.05 |
The ChiMerge algorithm is a bottom-up method. It uses the χ² statistic to determine whether the relative class frequencies of adjacent intervals are distinctly different or whether they are similar enough to justify merging them into a single interval (Kerber (1992)).
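The χ² threshold implied by alpha can be made explicit with qchisq() (an illustrative sketch; it assumes, as stated under chiSq(), that the degrees of freedom are one less than the number of classes):
library(discretization)
data(iris)
alpha <- 0.05
k <- length(unique(iris$Species))   # number of classes
qchisq(1 - alpha, df = k - 1)       # chi^2 value a pair of adjacent intervals must exceed to stay split
disc <- chiM(iris, alpha = alpha)
disc$cutp[[1]]                      # cut-points kept for the first attribute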
cutp |
list of cut-points for each variable |
Disc.data |
discretized data matrix |
HyunJi Kim [email protected]
Kerber, R. (1992). ChiMerge : Discretization of numeric attributes, In Proceedings of the Tenth National Conference on Artificial Intelligence, 123–128.
#--Discretization using the ChiMerge method
data(iris)
disc=chiM(iris,alpha=0.05)
#--cut-points
disc$cutp
#--discretized data matrix
disc$Disc.data
This function is required to perform the discretizations based on the χ² statistic (CACC, Ameva, ChiMerge, Chi2, Modified Chi2 and Extended Chi2).
chiSq(tb)
tb |
a vector of observed frequencies |
The formula for computing the χ² value is
χ² = Σ_{i=1}^{m} Σ_{j=1}^{k} (A_ij − E_ij)² / E_ij
where m is the number of intervals (m = 2 when comparing a pair of adjacent intervals), k is the number of classes, A_ij is the number of patterns in the i-th interval and j-th class, R_i is the number of patterns in the i-th interval (R_i = Σ_{j=1}^{k} A_ij), C_j is the number of patterns in the j-th class (C_j = Σ_{i=1}^{m} A_ij), N is the total number of patterns (N = Σ_{i=1}^{m} R_i), and E_ij is the expected frequency of A_ij (E_ij = R_i × C_j / N).
If either R_i or C_j is 0, E_ij is set to 0.1. The degrees of freedom of the χ² statistic is one less than the number of classes.
val |
the χ² statistic value |
HyunJi Kim [email protected]
Kerber, R. (1992). ChiMerge : Discretization of numeric attributes, In Proceedings of the Tenth National Conference on Artificial Intelligence, 123–128.
See also cacc, ameva, chiM, chi2, modChi2 and extendChi2.
#----Calculate Chi-Square
b=c(2,4,1,2,5,3)
m=matrix(b,ncol=3)
chiSq(m)
chisq.test(m)$statistic
This function is required to perform the Minimum Description Length Principle discretization, mdlp().
cutIndex(x, y)
x |
a vector of numeric values |
y |
class variable vector |
This function computes the best cut index using the entropy criterion.
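An illustrative direct call on one attribute (a sketch only; it assumes cutIndex() is exported, as its help page suggests, and sorts the attribute first since the cut index refers to a position in the ordered values):
library(discretization)
data(iris)
x <- iris[, 1]   # a continuous attribute (Sepal.Length)
y <- iris[, 5]   # the class variable (Species)
od <- order(x)
cutIndex(x[od], y[od])   # best entropy-based cut index for this attribute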
HyunJi Kim [email protected]
See also cutPoints, ent, mergeCols, mdlStop, mylog and mdlp.
This function is required to perform the Minimum Description Length Principle discretization, mdlp().
cutPoints(x, y)
x |
a vector of numeric values |
y |
class variable vector |
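An illustrative call on a single attribute (a sketch only, assuming cutPoints() is exported as its help page suggests):
library(discretization)
data(iris)
#-- entropy/MDLP cut-points for the first attribute of iris
cutPoints(iris[, 1], iris[, 5])
#-- for comparison: the cut-points mdlp() reports for the same attribute
mdlp(iris)$cutp[[1]]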
HyunJi Kim [email protected]
See also cutIndex, ent, mergeCols, mdlStop, mylog and mdlp.
This function implements three top-down discretization algorithms (CAIM, CACC and Ameva).
disc.Topdown(data, method = 1)
data |
numeric data matrix to be discretized |
method |
discretization method: 1 for CAIM, 2 for CACC, 3 for Ameva |
cutp |
list of cut-points for each variable (minimum value, cut-points and maximum value) |
Disc.data |
discretized data matrix |
HyunJi Kim [email protected]
Gonzalez-Abril, L., Cuberos, F. J., Velasco, F. and Ortega, J. A. (2009) Ameva: An autonomous discretization algorithm, Expert Systems with Applications, 36, 5327–5332.
Kurgan, L. A. and Cios, K. J. (2004). CAIM Discretization Algorithm, IEEE Transactions on knowledge and data engineering, 16, 145–153.
Tsai, C. J., Lee, C. I. and Yang, W. P. (2008). A discretization algorithm based on Class-Attribute Contingency Coefficient, Information Sciences, 178, 714–731.
See also topdown, insert, findBest, findInterval, caim, cacc and ameva.
##---- CAIM discretization ----
data(iris)
##----cut-points
cm=disc.Topdown(iris, method=1)
cm$cutp
##----discretized data matrix
cm$Disc.data
##---- CACC discretization ----
disc.Topdown(iris, method=2)
##---- Ameva discretization ----
disc.Topdown(iris, method=3)
This function is required to perform the Minimum Description Length Principle discretization, mdlp().
ent(y)
y |
class variable vector |
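A minimal illustrative call (a sketch only, assuming ent() is exported as its help page suggests; mdlp() uses it to compute class entropies):
library(discretization)
data(iris)
ent(iris$Species)   # entropy of the class variable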
HyunJi Kim [email protected]
See also cutIndex, cutPoints, mergeCols, mdlStop, mylog and mdlp.
This function implements the Extended Chi2 discretization algorithm.
extendChi2(data, alp = 0.5)
data |
data matrix to be discretized |
alp |
significance level (α); default 0.5 |
In the Extended Chi2 algorithm, the inconsistency check (δ) of the Chi2 algorithm is replaced by the least upper bound ξ, computed by Xi(), after each step of discretization. This least upper bound is used as the stopping criterion.
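The quantity driving the stopping rule can be inspected with Xi() (an illustrative sketch using only the documented functions, applied before and after discretization):
library(discretization)
data(iris)
ext <- extendChi2(iris, alp = 0.5)
Xi(iris)            # least upper bound of the original data set
Xi(ext$Disc.data)   # least upper bound after discretization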
cutp |
list of cut-points for each variable |
Disc.data |
discretized data matrix |
HyunJi Kim [email protected]
Su, C. T. and Hsu, J. H. (2005). An Extended Chi2 Algorithm for Discretization of Real Value Attributes, IEEE transactions on knowledge and data engineering, 17, 437–441.
data(iris)
ext=extendChi2(iris,0.5)
ext$cutp
ext$Disc.data
This function is required to perform the top-down discretization, disc.Topdown().
findBest(x, y, bd, di, method)
x |
a vector of numeric values |
y |
class variable vector |
bd |
current cut points |
di |
candidate cut-points |
method |
discretization criterion: 1 for CAIM, 2 for CACC, 3 for Ameva |
HyunJi Kim [email protected]
See also topdown, insert and disc.Topdown.
This function computes the inconsistency rate of a dataset.
incon(data)
data |
dataset matrix |
The inconsistency rate of a dataset is calculated as follows: (1) two instances are considered inconsistent if they match except for their class labels; (2) for each group of matching instances (ignoring class labels), the inconsistency count is the number of instances in the group minus the largest number of instances with the same class label; (3) the inconsistency rate is the sum of all the inconsistency counts divided by the total number of instances.
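A tiny worked example of this rule (an illustrative sketch; it assumes, as in the package's other functions, that the class label occupies the last column):
library(discretization)
#-- the two instances with x = 1 match except for their class labels, so that
#-- group contributes an inconsistency count of 2 - 1 = 1; the x = 2 group is
#-- consistent.  Expected rate: 1 / 4 = 0.25.
toy <- data.frame(x = c(1, 1, 2, 2),
                  class = c("A", "B", "A", "A"))
incon(toy)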
inConRate |
the inconsistency rate of the dataset |
HyunJi Kim [email protected]
Liu, H. and Setiono, R. (1995), Chi2: Feature selection and discretization of numeric attributes , Tools with Artificial Intelligence, 388–391.
Liu, H. and Setiono, R. (1997), Feature selection and discretization, IEEE transactions on knowledge and data engineering, Vol.9, no.4, 642–645.
##---- Calculating Inconsistency ----
data(iris)
disiris=chiM(iris,alpha=0.05)$Disc.data
incon(disiris)
This function is required to perform the top-down discretization, disc.Topdown().
insert(x, a)
x |
cut-point |
a |
a vector of the minimum and maximum values |
HyunJi Kim [email protected]
See also topdown, findBest and disc.Topdown.
This function computes the level of consistency and is required to perform the Modified Chi2 discretization algorithm.
LevCon(data)
data |
discretized data matrix |
LevelConsis |
Level of Consistency value |
HyunJi Kim [email protected]
Tay, F. E. H. and Shen, L. (2002). Modified Chi2 Algorithm for Discretization, IEEE Transactions on knowledge and data engineering, Vol. 14, No. 3, 666–670.
Pawlak, Z. (1982). Rough Sets, International Journal of Computer and Information Sciences, vol.11, No.5, 341–356.
Chmielewski, M. R. and Grzymala-Busse, J. W. (1996). Global Discretization of Continuous Attributes as Preprocessing for Machine Learning, International journal of approximate reasoning, Vol. 15, No. 4, 319–331.
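An illustrative call on a discretized data set (a sketch using only the documented functions; the class label is assumed to be the last column, as elsewhere in the package):
library(discretization)
data(iris)
disc <- chiM(iris, alpha = 0.05)$Disc.data
LevCon(disc)   # level of consistency of the discretized data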
This function discretizes the continuous attributes of a data matrix using the entropy criterion, with the Minimum Description Length Principle as the stopping rule.
mdlp(data)
data |
data matrix to be discretized |
Minimum Description Length Principle.
cutp |
list of cut-points for each variable |
Disc.data |
discretized data matrix |
HyunJi Kim [email protected]
Fayyad, U. M. and Irani, K. B.(1993). Multi-interval discretization of continuous-valued attributes for classification learning, Artificial intelligence, 13, 1022–1027.
See also cutIndex, cutPoints, ent, mergeCols, mdlStop and mylog.
data(iris)
mdlp(iris)$Disc.data
This function determines the cut criterion based on the Fayyad and Irani criterion and is required to perform the Minimum Description Length Principle.
mdlStop(ci, y, entropy)
ci |
cut index |
y |
class variable |
entropy |
the entropy value of the candidate cut |
Minimum Description Length Principle criterion.
gain |
numeric value |
HyunJi Kim [email protected]
Fayyad, U. M. and Irani, K. B.(1993). Multi-interval discretization of continuous-valued attributes for classification learning, Artificial intelligence, 13, 1022–1027.
See also cutPoints, ent, mergeCols, cutIndex, mylog and mdlp.
This function merges columns having observation counts equal to 0 and is required to perform the Minimum Description Length Principle.
mergeCols(n, minimum = 2)
n |
a table whose columns are intervals and whose rows are variables |
minimum |
minimum number of observations in a column or row to merge (default 2) |
HyunJi Kim [email protected]
See also cutPoints, ent, cutIndex, mdlStop, mylog and mdlp.
This function implements the Modified Chi2 discretization algorithm.
modChi2(data, alp = 0.5)
data |
numeric data matrix to be discretized |
alp |
significance level (α); default 0.5 |
In the Modified Chi2 algorithm, the inconsistency check (δ) of the Chi2 algorithm is replaced by maintaining the level of consistency after each step of discretization. This level of consistency, rather than an inconsistency rate, is used as the stopping criterion.
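The level of consistency being maintained can be inspected with LevCon() (an illustrative sketch using only the documented functions):
library(discretization)
data(iris)
mod <- modChi2(iris, alp = 0.5)
LevCon(mod$Disc.data)   # level of consistency of the discretized result, used as the stopping criterion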
cutp |
list of cut-points for each variable |
Disc.data |
discretized data matrix |
HyunJi Kim [email protected]
Tay, F. E. H. and Shen, L. (2002). Modified Chi2 Algorithm for Discretization, IEEE Transactions on knowledge and data engineering, 14, 666–670.
data(iris)
modChi2(iris, alp=0.5)$Disc.data
This function is required to perform the Minimum Description Length Principle, mdlp().
mylog(x)
x |
a vector of numeric values |
HyunJi Kim [email protected]
Fayyad, U. M. and Irani, K. B.(1993). Multi-interval discretization of continuous-valued attributes for classification learning, Artificial intelligence, Vol. 13, 1022–1027.
See also mergeCols, ent, cutIndex, cutPoints, mdlStop and mdlp.
This function is required to perform the top-down discretization, disc.Topdown().
topdown(data, method = 1)
data |
numeric data matrix to be discretized |
method |
discretization method: 1 for CAIM, 2 for CACC, 3 for Ameva |
HyunJi Kim [email protected]
Gonzalez-Abril, L., Cuberos, F. J., Velasco, F. and Ortega, J. A. (2009) Ameva: An autonomous discretization algorithm, Expert Systems with Applications, 36, 5327–5332.
Kurgan, L. A. and Cios, K. J. (2004). CAIM Discretization Algorithm, IEEE Transactions on knowledge and data engineering, 16, 145–153.
Tsai, C. J., Lee, C. I. and Yang, W. P. (2008). A discretization algorithm based on Class-Attribute Contingency Coefficient, Information Sciences, 178, 714–731.
See also insert, findBest and disc.Topdown.
This function is called by the ChiMerge discretization function, chiM().
value(i, data, alpha)
i |
index of the attribute (column) to be discretized |
data |
numeric data matrix |
alpha |
significance level (α) |
cuts |
cut-points for the i-th variable |
disc |
discretized values of the i-th variable |
HyunJi Kim [email protected]
Kerber, R. (1992). ChiMerge : Discretization of numeric attributes, In Proceedings of the Tenth National Conference on Artificial Intelligence, 123–128.
See also chiM.
data(iris)
value(1,iris,0.05)
This function computes the least upper bound (ξ) and is required to perform the Extended Chi2 discretization algorithm.
Xi(data)
data |
data matrix |
The least upper bound ξ of the data set is calculated via an equality from the variable precision rough set model (Su and Hsu (2005)). In that equality, R* is the equivalence relation set, D is the decision set, E_i are the equivalence classes, and |·| denotes set cardinality.
Xi |
numeric value, the least upper bound (ξ) of the data set |
HyunJi Kim [email protected]
Su, C. T. and Hsu, J. H. (2005). An Extended Chi2 Algorithm for Discretization of Real Value Attributes, IEEE transactions on knowledge and data engineering, Vol. 17, No. 3, 437–441.
Ziarko, W. (1993). Variable Precision Rough Set Model, Journal of computer and system sciences, Vol. 46, No. 1, 39–59.
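A minimal illustrative call (a sketch; the class label is assumed to be in the last column of the data matrix, as elsewhere in the package):
library(discretization)
data(iris)
Xi(iris)   # least upper bound of the iris data set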