Title: | Determination of K Using Peak Counts of Features for Clustering |
---|---|
Description: | The number of clusters (k) is needed to start all the partitioning clustering algorithms. An optimal value of this input argument is widely determined by using some internal validity indices. Since most of the existing internal indices suggest a k value which is computed from the clustering results after several runs of a clustering algorithm they are computationally expensive. On the contrary, the package 'kpeaks' enables to estimate k before running any clustering algorithm. It is based on a simple novel technique using the descriptive statistics of peak counts of the features in a data set. |
Authors: | Zeynel Cebeci [aut, cre], Cagatay Cebeci [aut] |
Maintainer: | Zeynel Cebeci <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1.0 |
Built: | 2024-12-06 06:38:46 UTC |
Source: | CRAN |
The input argument k, represents the number of clusters is needed to start all the partitioning clustering algorithms. In unsupervised learning applications, an optimal value of this argument is widely determined by using the internal validity indexes. Since these indexes suggest a k value which is computed on the clustering results obtained with several runs of a clustering algorithm, they are computationally expensive. On the contrary, the package 'kpeaks' enables to estimate k before running any clustering algorithm. It is based on a simple novel technique using the descriptive statistics of peak counts of the features in a dataset.
The package 'kpeaks' contains five functions and one synthetically created dataset for testing purposes. In order to suggest an estimate of k, the function findk
internally calls the functions genpolygon
and findpolypeaks
, respectively. The frequency polygons can be visually inspected by using the function plotpolygon
. Using the function rmshoulders
is recommended to flatten or remove the the shoulder peaks around the main peaks of a frequency polygon, if any.
Zeynel Cebeci, Cagatay Cebeci
Cebeci, Z. & Cebeci, C. (2018). "A novel technique for fast determination of K in partitioning cluster analysis", Journal of Agricultural Informatics, 9(2), 1-11. doi: 10.17700/jai.2018.9.2.442.
Cebeci, Z. & Cebeci, C. (2018). "kpeaks: An R Package for Quick Selection of K for Cluster Analysis", In 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), IEEE. doi: 10.1109/IDAP.2018.8620896.
findk
,
findpolypeaks
,
genpolygon
,
plotpolygon
,
rmshoulders
Based on some of descriptive statistics of the peak counts in the frequency polygon of a feature, this function proposes a list of estimates of the number of clusters in a data set.
findk(x, binrule, nbins, tcmethod, tc, trmethod, tv, rms=FALSE, rcs=FALSE, tpc=1)
findk(x, binrule, nbins, tcmethod, tc, trmethod, tv, rms=FALSE, rcs=FALSE, tpc=1)
x |
a numeric data frame or matrix. |
binrule |
a string specifying the binning rule to compute the number of classes of a frequency polygon. |
nbins |
an integer specifying the number of classes (bins). It is internally computed according to the selected binning rule except usr. See all available options in |
tcmethod |
a string representing a threshold method to compute a threshold distance value to discard the small or empty bins of a frequency polygon. See all available options in |
tc |
an integer for threshold frequency value assigned by |
trmethod |
a string used to specify a removal method to discard the shoulders around the main peaks in a frequency polygon. See all available options in |
tv |
a numeric threshold distance value assigned by |
rms |
a logical value whether the shoulders removal is applied or not. Default value is FALSE. |
rcs |
a logical value whether the estimates of k computed on the reduced counts set instead of the full set. Default value is FALSE, and set to |
tpc |
an integer threshold value for creating the reduced set of the peak counts. Default value is 1. |
The function findk
returns a list of k values which are proposed as the estimates of the number of clusters in a given data set. The estimation is based on various descriptive statistics of the peak counts in the frequency polygon of the features. Firstly, the classes of frequency polygons of the features are generated by using the function genpolygon
. Then, the main peaks in frequency polygons are determined by using the function findpolypeaks
. If desired, with the function rmshoulders
the shoulder peaks are removed from the peaks matrix returned by the function findpolypeaks
. In the returned peaks matrix, the peaks are counted for each feature, and a list of estimates of k is produced by using various descriptive statistics of the peak counts.
a list of the estimates of k consists of the following items which are computed from the peak counts of the features in a given data set:
am |
arithmetic mean of peak counts. |
med |
median of peak counts. |
mod |
mode of peak counts. |
cr |
center of the range of peak counts. |
ciqr |
center of the interquartile range (IQR) of peak counts. |
mppc |
overall mean of the pairwise means of peak counts. |
mq3m |
mean of the third quartile (Q3) and maximum of peak counts. |
mtl |
mean of two largest value of peak counts. |
avgk |
proposed k as the mean of all the estimates. |
modk |
proposed k as the mode of all the estimates. |
mtlk |
proposed k as the mean of two largest estimates. |
dst |
a string representing the type of counts set which is used in computations. |
pcounts |
an integer vector containing the peak counts of the features. |
The input arguments of the function findk
usually are the outputs from the functions findpolypeaks
and rmshoulders
.
Zeynel Cebeci, Cagatay Cebeci
Cebeci, Z. & Cebeci, C. (2018). "A novel technique for fast determination of K in partitioning cluster analysis", Journal of Agricultural Informatics, 9(2), 1-11. doi: 10.17700/jai.2018.9.2.442.
Cebeci, Z. & Cebeci, C. (2018). "kpeaks: An R Package for Quick Selection of K for Cluster Analysis", In 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), IEEE. doi: 10.1109/IDAP.2018.8620896.
# Estimate the number of clusters in x5p4c data set data(x5p4c) estk <- findk(x5p4c, binrule="sturges") print(estk) summary(estk$pcounts) cat("Estimated the number of clusters as the mean of Q3 and max peak count:", estk$mq3m, fill=TRUE) cat("Proposed number of clusters based on the mean of two largest estimates:", estk$mtlk, fill=TRUE) # Estimate the number of clusters in x5p4c data set by using threshold frequency method 'avg' # and shoulders removal method 'q1' estk <- findk(x5p4c, binrule="usr", nbins=15, tcmethod="usr", tc=1, trmethod="avg", rms=TRUE) print(estk) summary(estk$pcounts) cat("Proposed number of clusters based on the mean of two largest estimates:", estk$mtlk, fill=TRUE) # Estimate the number of clusters in iris data set data(iris) estk <- findk(iris[,1:4], binrule="bc", rcs=FALSE) print(estk) summary(estk$pcounts) cat("Proposed number of clusters based on the mean of estimates:", estk$avgk, fill=TRUE) cat("Proposed number of clusters based on the mode of estimates:", estk$modk, fill=TRUE) cat("Proposed number of clusters based on the mean of two largest estimates:", estk$mtlk, fill=TRUE)
# Estimate the number of clusters in x5p4c data set data(x5p4c) estk <- findk(x5p4c, binrule="sturges") print(estk) summary(estk$pcounts) cat("Estimated the number of clusters as the mean of Q3 and max peak count:", estk$mq3m, fill=TRUE) cat("Proposed number of clusters based on the mean of two largest estimates:", estk$mtlk, fill=TRUE) # Estimate the number of clusters in x5p4c data set by using threshold frequency method 'avg' # and shoulders removal method 'q1' estk <- findk(x5p4c, binrule="usr", nbins=15, tcmethod="usr", tc=1, trmethod="avg", rms=TRUE) print(estk) summary(estk$pcounts) cat("Proposed number of clusters based on the mean of two largest estimates:", estk$mtlk, fill=TRUE) # Estimate the number of clusters in iris data set data(iris) estk <- findk(iris[,1:4], binrule="bc", rcs=FALSE) print(estk) summary(estk$pcounts) cat("Proposed number of clusters based on the mean of estimates:", estk$avgk, fill=TRUE) cat("Proposed number of clusters based on the mode of estimates:", estk$modk, fill=TRUE) cat("Proposed number of clusters based on the mean of two largest estimates:", estk$mtlk, fill=TRUE)
Frequency polygons are graphics to reveal the shapes of data distributions as histograms do. The peaks of frequency polygons are required in several data mining applications. findpolypeaks
finds the peaks in a frequency polygon by using the frequencies and middles values of the classes of it.
findpolypeaks(xm, xc, tcmethod, tc)
findpolypeaks(xm, xc, tcmethod, tc)
xm |
a numeric vector contains the middle values of the classes of the frequency polygon (or the bins of a histogram). |
xc |
an integer vector contains the frequencies of the classes of the frequency polygon. |
tcmethod |
a string represents the threshold method to discard the empty and the small bins whose frequencies are smaller than a threshold frequency value. Default method is usr. Alternatively, the methods given below can be used to compute a threshold frequency value using the descriptive statistics of the frequencies in
|
tc |
an integer which is used as the threshold frequency value for discarding the empty and small height classes in the frequency polygon. Default value is 1 if the threshold option usr is chosen. Depending on the selected methods, the value of
|
The peaks are determined after removing the empty and small height classes whose frequencies are below the chosen threshold frequency. Default threshold value is 1 that means that all the classes which have frequencies of 0 and 1 are removed in the input vectors xm
and xc
.
pm |
a data frame with two columns which are pvalues and pfreqs containing the middle values and frequencies of the peaks which determined in the frequency polygon, respectively. |
np |
an integer representing the number of peaks in the frequency polygon. |
Zeynel Cebeci, Cagatay Cebeci
Cebeci, Z. & Cebeci, C. (2018). "A novel technique for fast determination of K in partitioning cluster analysis", Journal of Agricultural Informatics, 9(2), 1-11. doi: 10.17700/jai.2018.9.2.442.
Cebeci, Z. & Cebeci, C. (2018). "kpeaks: An R Package for Quick Selection of K for Cluster Analysis", In 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), IEEE. doi: 10.1109/IDAP.2018.8620896.
findk
,
genpolygon
,
rmshoulders
data(x5p4c) # Using a user-specified number of bins, build the frequency polygon of p2 in the data set x5p4c hvals <- genpolygon(x5p4c$p2, binrule="usr", nbins=20) plotpolygon(x5p4c$p2, nbins=hvals$nbins, ptype="ph") # Find the peaks in the frequency polygon by using the threshold method min resfpp1 <- findpolypeaks(hvals$mids, hvals$freqs, tcmethod="min") print(resfpp1) # Find the peaks in the frequency polygon by using the threshold equals to 5 resfpp2 <- findpolypeaks(hvals$mids, hvals$freqs, tcmethod="usr", tc=5) print(resfpp2) data(iris) # By using Doane rule, build the frequency polygon of the 4th feature in the data set iris hvals <- genpolygon(iris[,4], binrule="doane") plotpolygon(iris[,4], nbins=hvals$nbins, ptype="p") #Find the peaks in the frequency polygon by using the threshold method avg resfpp3 <- findpolypeaks(hvals$mids, hvals$freqs, tcmethod="avg") print(resfpp3)
data(x5p4c) # Using a user-specified number of bins, build the frequency polygon of p2 in the data set x5p4c hvals <- genpolygon(x5p4c$p2, binrule="usr", nbins=20) plotpolygon(x5p4c$p2, nbins=hvals$nbins, ptype="ph") # Find the peaks in the frequency polygon by using the threshold method min resfpp1 <- findpolypeaks(hvals$mids, hvals$freqs, tcmethod="min") print(resfpp1) # Find the peaks in the frequency polygon by using the threshold equals to 5 resfpp2 <- findpolypeaks(hvals$mids, hvals$freqs, tcmethod="usr", tc=5) print(resfpp2) data(iris) # By using Doane rule, build the frequency polygon of the 4th feature in the data set iris hvals <- genpolygon(iris[,4], binrule="doane") plotpolygon(iris[,4], nbins=hvals$nbins, ptype="p") #Find the peaks in the frequency polygon by using the threshold method avg resfpp3 <- findpolypeaks(hvals$mids, hvals$freqs, tcmethod="avg") print(resfpp3)
Constructs the histogram of a feature by using a selected binning rule, returns the middle values and frequencies of classes for further works on the frequency polygon.
genpolygon(x, binrule, nbins, disp = FALSE)
genpolygon(x, binrule, nbins, disp = FALSE)
x |
a numeric vector containing the observations for a feature. |
binrule |
name of the rule in order to compute the number of bins to build the histogram. |
nbins |
an integer representing the number of bins which is computed by using the selected binning rule. Default rule is sturges. Depending on the selected rule
|
disp |
a logical value should be set to |
According to Hyndman (1995), Sturges's rule was the first rule to calculate k, the number of classes to build a histogram. Most of the statistical packages use this simple rule for determining the number of classes in constructing histograms. Brooks & Carruthers (1953) proposed a rule using instead of
giving always larger k when compared to Sturges's rule. The rule by Huntsberger (1962) yields nearly equal result to those of Sturges's rule. These two rules work well if n is less than 200. Scott (1992) argued that Sturges's rule leads to generate oversmoothed histograms in case of large number of n. In his rule, Cencov (1962) used the cube root of n simply. This rule was followed by its extensions, i.e., Rice rule and Terrell & Scott (1985) rule. When compared to the others, the square root rule produces larger k (Davies & Goldsmith, 1980).
Most of the rules simply include only n as the input argument. On the other hand, the rules using variation and shape of data distributions can provide more optimal k values. For instance, Doane (1976) extended the Sturges's rule by adding the standardized skewness in order to overcome the problem with non-normal distributions need more classes. In order to estimate optimal k values, Scott (1979) added the standard deviation to his formula. Freedman and Diaconis (1981) proposed to use the interquartile range (IQR) statistic which is less sensitive to outliers than the standard deviation. In a study on unsupervised discretization methods, Cebeci & Yildiz (2017) tested a binning rule formula based on the ten-base logarithm of n divided by 2*pi
. They also argued that the rules Freedman-Diaconis and Doane were slightly performed better than the other rules based on the training model accuracies on a chicken egg quality traits dataset. Therefore, using the above mentioned rules may be more effective in determining the peaks of a frequency polygon.
xm |
a numeric vector containing the middle values of bins. |
xc |
an integer vector containing the frequencies of the bins. |
nbins |
an integer containing the number of bins to build the histogram. |
Zeynel Cebeci, Cagatay Cebeci
Brooks C E P & Carruthers N (1953). Handbook of statistical methods in meteorology. H M Stationary Office, London.
Cebeci Z & Yildiz F (2017). Unsupervised discretization of continuous variables in a chicken egg quality traits dataset. Turk. J Agriculture-Food Sci. & Tech. 5(4): 315-320. doi: 10.24925/turjaf.v5i4.315-320.1056.
Cebeci Z & Cebeci C (2018). "A novel technique for fast determination of K in partitioning cluster analysis", Journal of Agricultural Informatics, 9(2), 1-11. doi: 10.17700/jai.2018.9.2.442.
Cebeci Z & Cebeci C (2018). "kpeaks: An R package for quick selection of k for cluster analysis", In 2018 Int. Conf. on Artificial Intelligence and Data Processing (IDAP), IEEE. doi: 10.1109/IDAP.2018.8620896.
Cencov N N (1962). Evaluation of an unknown distribution density from observations. Soviet Mathematics 3: 1559-1562.
Davies O L & Goldsmith P L (1980). Statistical methods in research and production. 4th edn, Longman: London.
Doane D P (1976). Aesthetic frequency classification. American Statistician 30(4):181-183.
Freedman D & Diaconis P (1981). On the histogram as a density estimator: L2 Theory. Zeit. Wahr. ver. Geb. 57(4):453-476.
Hyndman R J (1995). The problem with Sturges rule for constructing histograms. url:http://robjhyndman.com/papers/sturges.pdf.
Huntsberger D V (1962). Elements of statistical inference. London: Prentice-Hall.
Scott D W (1992). Multivariate density estimation: Theory, Practice and Visualization. John Wiley & Sons: New York.
Sturges H (1926). The choice of a class-interval. J Amer. Statist. Assoc. 21(153):65-66.
Terrell G R & Scott D W (1985). Oversmoothed nonparametric density estimates. J Amer. Statist. Assoc. 80(389):209-214.
findk
,
findpolypeaks
,
plotpolygon
x <- rnorm(n=100, mean=5, sd=0.5) # Construct the histogram of x according to the Sturges rule with no display hvals <- genpolygon(x, binrule = "sturges") print(hvals) # Plot the histogram of x by using the user-specified number of classes hvals <- genpolygon(x, binrule = "usr", nbins = 20, disp = TRUE) print(hvals) # Plot the histogram of the second feature in iris dataset # by using the Freedman-Diaconis (fd) rule data(iris) hvals <- genpolygon(iris[,2], binrule = "fd", disp = TRUE) print(hvals)
x <- rnorm(n=100, mean=5, sd=0.5) # Construct the histogram of x according to the Sturges rule with no display hvals <- genpolygon(x, binrule = "sturges") print(hvals) # Plot the histogram of x by using the user-specified number of classes hvals <- genpolygon(x, binrule = "usr", nbins = 20, disp = TRUE) print(hvals) # Plot the histogram of the second feature in iris dataset # by using the Freedman-Diaconis (fd) rule data(iris) hvals <- genpolygon(iris[,2], binrule = "fd", disp = TRUE) print(hvals)
Plots the frequency polygon and histogram of a feature with some options.
plotpolygon(x, nbins, ptype, bcol = "gray", pcol = "blue")
plotpolygon(x, nbins, ptype, bcol = "gray", pcol = "blue")
x |
a numeric vector containing the observations of a feature, or a numeric matrix when |
nbins |
an integer for the number of classes in the frequency polygon. |
bcol |
a string for the color of bins. Default is gray. |
pcol |
a string for the color of polygon lines. Default is blue. |
ptype |
a string specifying the type of plot. Use p for plotting the polygon only or ph for plotting the polygon with the histogram. Default is sp for the scatterplots between the pairs of features and the polygons on the diagonal panel. |
Zeynel Cebeci, Cagatay Cebeci
Cebeci, Z. & Cebeci, C. (2018). "A novel technique for fast determination of K in partitioning cluster analysis", Journal of Agricultural Informatics, 9(2), 1-11. doi: 10.17700/jai.2018.9.2.442.
Cebeci, Z. & Cebeci, C. (2018). "kpeaks: An R Package for Quick Selection of K for Cluster Analysis", In 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), IEEE. doi: 10.1109/IDAP.2018.8620896.
# plot the frequency polygon of the 2nd feature in x5p4c data set data(x5p4c) hvals <- genpolygon(x5p4c[,2], binrule="usr", nbins=20) # plot the frequency polygon of the 2nd feature in x5p4c data set plotpolygon(x5p4c[,2], nbins=hvals$nbins, ptype="p") # plot the histogram and frequency polygon of the 2nd feature in x5p4c data set plotpolygon(x5p4c[,2], nbins=hvals$nbins, ptype="ph", bcol="orange", pcol="blue") # plot the pairwise scatter plots of the features in x5p4c data set pairs(x5p4c, diag.panel=plotpolygon, upper.panel=NULL, cex.labels=1.5) # plot the histogram and frequency polygon of Petal.Width in iris data set data(iris) hvals <- genpolygon(iris$Petal.Width, binrule="doane") plotpolygon(iris$Petal.Width, nbins=hvals$nbins, ptype="ph")
# plot the frequency polygon of the 2nd feature in x5p4c data set data(x5p4c) hvals <- genpolygon(x5p4c[,2], binrule="usr", nbins=20) # plot the frequency polygon of the 2nd feature in x5p4c data set plotpolygon(x5p4c[,2], nbins=hvals$nbins, ptype="p") # plot the histogram and frequency polygon of the 2nd feature in x5p4c data set plotpolygon(x5p4c[,2], nbins=hvals$nbins, ptype="ph", bcol="orange", pcol="blue") # plot the pairwise scatter plots of the features in x5p4c data set pairs(x5p4c, diag.panel=plotpolygon, upper.panel=NULL, cex.labels=1.5) # plot the histogram and frequency polygon of Petal.Width in iris data set data(iris) hvals <- genpolygon(iris$Petal.Width, binrule="doane") plotpolygon(iris$Petal.Width, nbins=hvals$nbins, ptype="ph")
Removes the shoulders around the main peaks in a frequency polygon.
rmshoulders(xm, xc, trmethod, tv)
rmshoulders(xm, xc, trmethod, tv)
xm |
a numeric vector containing the middle values of peaks of a frequency polygon. |
xc |
an integer vector containing the frequencies of peaks of a frequency polygon. |
trmethod |
a string representing the type of shoulders removal option for computing a threshold value. Default method is usr. The alternatives are sd, q1, iqr, avg and med. These methods compute the threshold distance value using some statistics of the distances between the middle values of two successive peaks in the vector
|
tv |
a numeric value to be used as the threshold distance for deciding the shoulders. Default threshold is 1 if the removal method usr is chosen. Depending on the selected removal method
|
Literally speaking, a shoulder peak or shortly shoulder is a secondary peak in a close location before or after the main peak of a mountain. In a frequency polygon, a shoulder is a smaller peak that is quite close to a higher peak resulting a non-obvious valley between them. Shoulders may occur randomly due to some reasons such as random noises or selecting higher number of classes in histogram building etc. Usually, it is desired to remove them from the peaks vector of a frequency polygon. In 'kpeaks', a peak considered as a shoulder when its height is smaller than the height of its neighbor peak and its distance to its neighbor is also lower than a threshold distance value. In order to compute a threshold distance value, here, we propose to use seven options as listed in the section ‘arguments’. The options q1
and iqr
can be applied to remove the minor shoulders that are very near to the main peaks while q3
is recommended to eliminate the substantial shoulders in the processed frequency polygon. The remaining options may be more efficient for removing the moderate shoulders.
pm |
a data frame with two columns whose names are pvalues and pfreqs for the middle values and the frequencies of the peaks after removal process, respectively. |
np |
an integer representing the number of peaks after removal of the shoulders. |
The function rmshoulders
normally should be called with the input values that are returned by the function findpolypeaks
.
Zeynel Cebeci, Cagatay Cebeci
Cebeci, Z. & Cebeci, C. (2018). "A novel technique for fast determination of K in partitioning cluster analysis", Journal of Agricultural Informatics, 9(2), 1-11. doi: 10.17700/jai.2018.9.2.442.
Cebeci, Z. & Cebeci, C. (2018). "kpeaks: An R Package for Quick Selection of K for Cluster Analysis", In 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), IEEE. doi: 10.1109/IDAP.2018.8620896.
findpolypeaks
,
plotpolygon
,
genpolygon
# Build a data vector with three peaks x1 <-rnorm(100, mean=20, sd=5) x2 <-rnorm(50, mean=50, sd=5) x3 <-rnorm(150, mean=90, sd=10) x <- c(x1,x3,x2) # generate the frequency polygon and histogram of x by using Doane rule hvals <- genpolygon(x, binrule="doane") plotpolygon(x, nbins=hvals$nbins, ptype="p") # find the peaks in frequency polygon of x by using the default threshold frequency resfpp <- findpolypeaks(xm=hvals$mids, xc=hvals$freqs) print(resfpp) # remove the shoulders with the threshold distance option 'avg' resrs <- rmshoulders(resfpp$pm[,1], resfpp$pm[,2], trmethod = "avg") print(resrs) # remove the shoulders with the threshold distance option 'iqr' resrs <- rmshoulders(resfpp$pm[,1], resfpp$pm[,2], trmethod = "iqr") print(resrs) data(x5p4c) # plot the frequnecy polygon and histogram of p2 in x5p4c data set hvals <- genpolygon(x5p4c$p2, binrule="usr", nbins=30) plotpolygon(x5p4c$p2, nbins=hvals$nbins, ptype="ph") # find the peaks in frequency polygon of p2 resfpp <- findpolypeaks(xm=hvals$mids, xc=hvals$freqs, tcmethod = "min") print(resfpp) # remove the shoulders with threshold distance option 'q1' resrs <- rmshoulders(resfpp$pm[,1], resfpp$pm[,2], trmethod = "q1") print(resrs) ## Not run: data(iris) # plot the frequency polygon and histogram of Petal.Length in iris data set # by using a user-defined class number hvals <- genpolygon(iris$Petal.Length, binrule="usr", nbins=30) plotpolygon(iris$Petal.Length, nbins=hvals$nbins, ptype="p") # find the peaks in frequency polygon of Petal.Length with default # threshold frequency value resfpp <- findpolypeaks(xm=hvals$mids, xc=hvals$freqs) print(resfpp) # remove the shoulders with threshold option 'med' resrs <- rmshoulders(resfpp$pm[,1], resfpp$pm[,2], trmethod = "med") print(resrs) ## End(Not run)
# Build a data vector with three peaks x1 <-rnorm(100, mean=20, sd=5) x2 <-rnorm(50, mean=50, sd=5) x3 <-rnorm(150, mean=90, sd=10) x <- c(x1,x3,x2) # generate the frequency polygon and histogram of x by using Doane rule hvals <- genpolygon(x, binrule="doane") plotpolygon(x, nbins=hvals$nbins, ptype="p") # find the peaks in frequency polygon of x by using the default threshold frequency resfpp <- findpolypeaks(xm=hvals$mids, xc=hvals$freqs) print(resfpp) # remove the shoulders with the threshold distance option 'avg' resrs <- rmshoulders(resfpp$pm[,1], resfpp$pm[,2], trmethod = "avg") print(resrs) # remove the shoulders with the threshold distance option 'iqr' resrs <- rmshoulders(resfpp$pm[,1], resfpp$pm[,2], trmethod = "iqr") print(resrs) data(x5p4c) # plot the frequnecy polygon and histogram of p2 in x5p4c data set hvals <- genpolygon(x5p4c$p2, binrule="usr", nbins=30) plotpolygon(x5p4c$p2, nbins=hvals$nbins, ptype="ph") # find the peaks in frequency polygon of p2 resfpp <- findpolypeaks(xm=hvals$mids, xc=hvals$freqs, tcmethod = "min") print(resfpp) # remove the shoulders with threshold distance option 'q1' resrs <- rmshoulders(resfpp$pm[,1], resfpp$pm[,2], trmethod = "q1") print(resrs) ## Not run: data(iris) # plot the frequency polygon and histogram of Petal.Length in iris data set # by using a user-defined class number hvals <- genpolygon(iris$Petal.Length, binrule="usr", nbins=30) plotpolygon(iris$Petal.Length, nbins=hvals$nbins, ptype="p") # find the peaks in frequency polygon of Petal.Length with default # threshold frequency value resfpp <- findpolypeaks(xm=hvals$mids, xc=hvals$freqs) print(resfpp) # remove the shoulders with threshold option 'med' resrs <- rmshoulders(resfpp$pm[,1], resfpp$pm[,2], trmethod = "med") print(resrs) ## End(Not run)
A synthetically created data frame consists of five continous variables forming four clusters.
data(x5p4c)
data(x5p4c)
A data frame with 400 rows and 5 numeric variables:
a continous variable with one mode
a continous variable with four modes
a continous variable with two modes
a continous variable with three modes
a continous variable with two modes
The data set x5p4c
is recommended to use in comparing the performances of the internal validity indexes in cluster analysis.
data(x5p4c) # descriptive statistics of the variables summary(x5p4c) # plot the histogram of the variable p2 hist(x5p4c$p2, breaks=15) # scatter plots of the variable pairs pairs(x5p4c)
data(x5p4c) # descriptive statistics of the variables summary(x5p4c) # plot the histogram of the variable p2 hist(x5p4c$p2, breaks=15) # scatter plots of the variable pairs pairs(x5p4c)