Title: | Machine Learning |
---|---|
Description: | Machine learning, containing several algorithms for supervised and unsupervised classification, in addition to a function that plots the Receiver Operating Characteristic (ROC) and Precision-Recall (PRC) curve graphs, and also a function that returns several metrics used for model evaluation; the latter can be used to rank results from other packages. |
Authors: | Paulo Cesar Ossani [aut, cre] |
Maintainer: | Paulo Cesar Ossani <[email protected]> |
License: | GPL-3 |
Version: | 1.0.6 |
Built: | 2024-12-05 06:43:54 UTC |
Source: | CRAN |
Machine learning, containing several algorithms, in addition to functions that plot the graphs of the Receiver Operating Characteristic (ROC) and Precision-Recall (PRC) curves, and also a function that returns several metrics used to evaluate the models; the latter can be used on the classification results of other packages.
Package: | Kira |
Type: | Package |
Version: | 1.0.6 |
Date: | 2024-09-04 |
License: | GPL(>= 3) |
LazyLoad: | yes |
This package contains:
Algorithms for supervised classification: knn, linear (lda) and quadratic (qda) discriminant analysis, linear regression, etc.
Algorithms for unsupervised classification: hierarchical, kmeans, etc.
A function that plots the ROC and PRC curves.
A function that returns a series of metrics from models.
Functions that determine the ideal number of clusters: elbow and silhouette.
Paulo Cesar Ossani <[email protected]>
Aha, D. W.; Kibler, D. and Albert, M. K. Instance-based learning algorithms. Machine learning. v.6, n.1, p.37-66. 1991.
Anitha, S.; Metilda, M. A. R. Y. An extensive investigation of outlier detection by cluster validation indices. Ciencia e Tecnica Vitivinicola - A Science and Technology Journal, v. 34, n. 2, p. 22-32, 2019. doi: 10.13140/RG.2.2.26801.63848
Charnet, R. et al. Analise de modelos de regressao linear. 2a ed. Campinas: Editora da Unicamp, 2008. 357 p.
Chicco, D.; Warrens, M. J. and Jurman, G. The matthews correlation coefficient (mcc) is more informative than cohen's kappa and brier score in binary classification assessment. IEEE Access, IEEE, v. 9, p. 78368-78381, 2021.
Schubert, E. Stop using the Elbow criterion for k-means and how to choose the number of clusters instead. ACM SIGKDD Explorations Newsletter, 25(1), 36-42. arXiv:2212.12189. 2023. doi: 10.1145/3606274.3606278
Ferreira, D. F. Estatistica Multivariada. 2a ed. revisada e ampliada. Lavras: Editora UFLA, 2011. 676 p.
Kaufman, L. and Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons. 1990.
Martinez, W. L.; Martinez, A. R.; Solka, J. Exploratory data analysis with MATLAB. 2nd ed. New York: Chapman & Hall/CRC, 2010. 499 p.
Mingoti, S. A. Analise de dados atraves de metodos de estatistica multivariada: uma abordagem aplicada. Belo Horizonte: UFMG, 2005. 297 p.
Nicoletti, M. do C. O modelo de aprendizado de maquina baseado em exemplares: principais caracteristicas e algoritmos. Sao Carlos: EdUFSCar, 2005. 61 p.
Onumanyi, A. J.; Molokomme, D. N.; Isaac, S. J. and Abu-Mahfouz, A. M. Autoelbow: An automatic elbow detection method for estimating the number of clusters in a dataset. Applied Sciences 12, 15. 2022. doi: 10.3390/app12157515
Rencher, A. C. Methods of multivariate analysis. 2nd ed. New York: J. Wiley, 2002. 708 p.
Rencher, A. C. and Schaalje, G. B. Linear models in statistics. 2nd ed. New Jersey: John Wiley & Sons, 2008. 672 p.
Rousseeuw P. J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics, 20:53-65. 1987. doi: 10.1016/0377-0427(87)90125-7
Sugar, C. A. and James, G. M. Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association, 98, 463, 750-763. 2003. doi: 10.1198/016214503000000666
Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S. 4th ed. Springer, 2002.
Zhang, Y.; Mandziuk, J.; Quek, H. C. and Goh, W. Curvature-based method for determining the number of clusters. Inf. Sci. 415, 414-428, 2017. doi: 10.1016/j.ins.2017.05.024
Generates the Elbow graph and returns the ideal number of clusters.
elbow(data, k.max = 10, method = "AutoElbow", plot = TRUE, cut = TRUE,
      title = NA, xlabel = NA, ylabel = NA, size = 1.1, grid = TRUE,
      color = TRUE, savptc = FALSE, width = 3236, height = 2000,
      res = 300, casc = TRUE)
data |
Data with x and y coordinates. |
k.max |
Maximum number of clusters for comparison (default = 10). |
method |
Method used to find the ideal number k of clusters: "jump", "curvature", "Exp", "AutoElbow" (default). |
plot |
Indicates whether to plot the elbow graph (default = TRUE). |
cut |
Indicates whether to plot the best cluster indicative line (default = TRUE). |
title |
Title of the graphic, if not set, assumes the default text. |
xlabel |
Names the X axis, if not set, assumes the default text. |
ylabel |
Names the Y axis, if not set, assumes the default text. |
size |
Size of points on the graph and line thickness (default = 1.1). |
grid |
Put grid on graph (default = TRUE). |
color |
Colored graphic (default = TRUE). |
savptc |
Saves the graph image to a file (default = FALSE). |
width |
Graphic image width when savptc = TRUE (default = 3236). |
height |
Graphic image height when savptc = TRUE (default = 2000). |
res |
Nominal resolution in ppi of the graphic image when savptc = TRUE (default = 300). |
casc |
Cascade effect in the presentation of the graphic (default = TRUE). |
k.ideal |
Ideal number of clusters. |
Paulo Cesar Ossani
Schubert, E. Stop using the Elbow criterion for k-means and how to choose the number of clusters instead. ACM SIGKDD Explorations Newsletter, 25(1), 36-42. arXiv:2212.12189. 2023. doi: 10.1145/3606274.3606278
Sugar, C. A. and James, G. M. Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association, 98, 463, 750-763. 2003. doi: 10.1198/016214503000000666
Zhang, Y.; Mandziuk, J.; Quek, H. C. and Goh, W. Curvature-based method for determining the number of clusters. Inf. Sci. 415, 414-428, 2017. doi: 10.1016/j.ins.2017.05.024
Onumanyi, A. J.; Molokomme, D. N.; Isaac, S. J. and Abu-Mahfouz, A. M. Autoelbow: An automatic elbow detection method for estimating the number of clusters in a dataset. Applied Sciences 12, 15. 2022. doi: 10.3390/app12157515
data(iris) # data set

res <- elbow(data = iris[,1:4], k.max = 20, method = "AutoElbow",
             cut = TRUE, plot = TRUE, title = NA, xlabel = NA,
             ylabel = NA, size = 1.1, grid = TRUE, savptc = FALSE,
             width = 3236, color = TRUE, height = 2000, res = 300,
             casc = FALSE)

res$k.ideal # number of clusters
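For intuition, the quantity the elbow criterion inspects, the total within-cluster sum of squares (WSS) for each candidate k, can be sketched in base R with stats::kmeans. This is an illustration of the idea only, not the Kira internals; the variable names are illustrative:

```r
# Minimal base-R sketch of the elbow criterion's input: WSS per k.
data(iris)
x <- iris[, 1:4]
k.max <- 10
set.seed(42)                                 # kmeans starts are random
wss <- sapply(1:k.max, function(k) {
  stats::kmeans(x, centers = k, nstart = 10)$tot.withinss
})
# WSS shrinks as k grows; the "elbow" is where the decrease levels off.
plot(1:k.max, wss, type = "b", xlab = "k", ylab = "Within-cluster SS")
```

Methods such as "AutoElbow" automate locating the bend in this curve instead of judging it by eye.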
Performs hierarchical unsupervised classification analysis on a data set.
hierarchical(data, titles = NA, analysis = "Obs", cor.abs = FALSE,
             normalize = FALSE, distance = "euclidean",
             method = "complete", horizontal = FALSE, num.groups = 0,
             lambda = 2, savptc = FALSE, width = 3236, height = 2000,
             res = 300, casc = TRUE)
data |
Data to be analyzed. |
titles |
Titles of the graphics, if not set, assumes the default text. |
analysis |
"Obs" for analysis on observations (default), "Var" for analysis on variables. |
cor.abs |
Use the absolute correlation matrix when 'analysis' = "Var" (default = FALSE). |
normalize |
Normalize the data, only when 'analysis' = "Obs" (default = FALSE). |
distance |
Distance metric for the hierarchical groupings: "euclidean" (default), "maximum", "manhattan", "canberra", "binary" or "minkowski". When 'analysis' = "Var", the metric is the correlation matrix, according to 'cor.abs'. |
method |
Method for analyzing hierarchical groupings: "complete" (default), "ward.D", "ward.D2", "single", "average", "mcquitty", "median" or "centroid". |
horizontal |
Horizontal dendrogram (default = FALSE). |
num.groups |
Number of groups to be formed. |
lambda |
Value used in the minkowski distance. |
savptc |
Saves graphics images to files (default = FALSE). |
width |
Graphics images width when savptc = TRUE (default = 3236). |
height |
Graphics images height when savptc = TRUE (default = 2000). |
res |
Nominal resolution in ppi of the graphics images when savptc = TRUE (default = 300). |
casc |
Cascade effect in the presentation of the graphics (default = TRUE). |
Several graphics.
tab.res |
Table with similarities and distances of the groups formed. |
groups |
Original data with groups formed. |
res.groups |
Results of the groups formed. |
R.sqt |
Result of the R squared. |
sum.sqt |
Total sum of squares. |
mtx.dist |
Matrix of the distances. |
Paulo Cesar Ossani
Rencher, A. C. Methods of multivariate analysis. 2nd ed. New York: J. Wiley, 2002. 708 p.
Mingoti, S. A. Analise de dados atraves de metodos de estatistica multivariada: uma abordagem aplicada. Belo Horizonte: UFMG, 2005. 297 p.
Ferreira, D. F. Estatistica Multivariada. 2a ed. revisada e ampliada. Lavras: Editora UFLA, 2011. 676 p.
data(iris) # data set

data <- iris

res <- hierarchical(data[,1:4], titles = NA, analysis = "Obs",
                    cor.abs = FALSE, normalize = FALSE,
                    distance = "euclidean", method = "ward.D",
                    horizontal = FALSE, num.groups = 3,
                    savptc = FALSE, width = 3236, height = 2000,
                    res = 300, casc = FALSE)

message("R squared: ", res$R.sqt)
# message("Total sum of squares: ", res$sum.sqt)
message("Groups formed: "); res$groups
# message("Table with similarities and distances:"); res$tab.res
# message("Table with the results of the groups:"); res$res.groups
# message("Distance Matrix:"); res$mtx.dist

#write.table(file=file.path(tempdir(),"GroupData.csv"), res$groups, sep=";",
#            dec=",", row.names = TRUE)
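The same pipeline can be sketched with base R alone (stats::dist, stats::hclust, stats::cutree), which is useful for checking results; this mirrors the idea of the call above but is not the Kira implementation:

```r
# Base-R sketch of hierarchical clustering with Ward linkage.
data(iris)
d   <- dist(iris[, 1:4], method = "euclidean")  # pairwise distance matrix
hc  <- hclust(d, method = "ward.D")             # agglomerative clustering
grp <- cutree(hc, k = 3)                        # cut the tree into 3 groups
table(grp, iris$Species)                        # compare with the true classes
plot(hc, labels = FALSE)                        # dendrogram
```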
Performs k-means unsupervised classification analysis on a data set.
kmeans(data, normalize = FALSE, num.groups = 2)
data |
Data to be analyzed. |
normalize |
Normalize the data (default = FALSE). |
num.groups |
Number of groups to be formed (default = 2). |
groups |
Original data with groups formed. |
res.groups |
Results of the groups formed. |
R.sqt |
Result of the R squared. |
sum.sqt |
Total sum of squares. |
Paulo Cesar Ossani
Rencher, A. C. Methods of multivariate analysis. 2nd ed. New York: J. Wiley, 2002. 708 p.
Mingoti, S. A. Analise de dados atraves de metodos de estatistica multivariada: uma abordagem aplicada. Belo Horizonte: UFMG, 2005. 297 p.
Ferreira, D. F. Estatistica Multivariada. 2a ed. revisada e ampliada. Lavras: Editora UFLA, 2011. 676 p.
data(iris) # data set

data <- iris

res <- kmeans(data[,1:4], normalize = FALSE, num.groups = 3)

message("R squared: ", res$R.sqt)
# message("Total sum of squares: ", res$sum.sqt)
message("Groups formed:"); res$groups
# message("Table with the results of the groups:"); res$res.groups

#write.table(file=file.path(tempdir(),"GroupData.csv"), res$groups, sep=";",
#            dec=",", row.names = TRUE)
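The R-squared returned as R.sqt is the usual clustering quality ratio, one minus the within-cluster over the total sum of squares. A sketch of that computation with stats::kmeans (an assumption about the definition, shown here for illustration rather than taken from the Kira source):

```r
# R-squared of a k-means partition: 1 - WSS/TSS.
data(iris)
set.seed(42)
km  <- stats::kmeans(iris[, 1:4], centers = 3, nstart = 10)
tss <- km$totss          # total sum of squares
wss <- km$tot.withinss   # within-cluster sum of squares
r2  <- 1 - wss / tss     # proportion of variance explained by the grouping
r2                       # about 0.88 for iris with k = 3
```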
Performs the k-nearest neighbor (kNN) supervised classification method.
knn(train, test, class, k = 1, dist = "euclidean", lambda = 3)
train |
Training data set, without classes. |
test |
Test data set. |
class |
Vector with data classes names. |
k |
Number of nearest neighbors (default = 1). |
dist |
Distances used in the method: "euclidean" (default), "manhattan", "minkowski", "canberra", "maximum" or "chebyshev". |
lambda |
Value used in the minkowski distance (default = 3). |
predict |
The classified factors of the test set. |
Paulo Cesar Ossani
Aha, D. W.; Kibler, D. and Albert, M. K. Instance-based learning algorithms. Machine learning. v.6, n.1, p.37-66. 1991.
Nicoletti, M. do C. O modelo de aprendizado de maquina baseado em exemplares: principais caracteristicas e algoritmos. Sao Carlos: EdUFSCar, 2005. 61 p.
See also: plot_curve and results
data(iris) # data set

data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4], "class")

#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
data.train = data[dat.sample == 1,] # training data set
data.test  = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test  = as.factor(data.test$class)  # class names of the test data set
#### End - hold out validation method ####

dist = "euclidean"
# dist = "manhattan"
# dist = "minkowski"
# dist = "canberra"
# dist = "maximum"
# dist = "chebyshev"

k = 1
lambda = 5

r <- (ncol(data) - 1)

res <- knn(train = data.train[,1:r], test = data.test[,1:r],
           class = class.train, k = 1, dist = dist, lambda = lambda)

resp <- results(orig.class = class.test, predict = res$predict)

message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
message("General results of the classes:"); resp$res.class
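For k = 1 and the euclidean distance, the whole method fits in a few lines of base R. The helper name nn1 below is hypothetical, a sketch of the idea rather than the package's implementation:

```r
# 1-nearest-neighbour classifier, euclidean distance, base R only.
nn1 <- function(train, test, class) {
  apply(test, 1, function(p) {
    d <- sqrt(colSums((t(train) - p)^2))  # distance to every training row
    class[which.min(d)]                   # label of the closest one
  })
}

data(iris)
set.seed(1)
idx  <- sample(nrow(iris), 100)           # hold-out split
pred <- nn1(as.matrix(iris[idx, 1:4]), as.matrix(iris[-idx, 1:4]),
            as.character(iris$Species[idx]))
mean(pred == iris$Species[-idx])          # hit rate
```

Larger k would replace which.min with a majority vote over the k smallest distances.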
Perform linear discriminant analysis.
lda(data, test = NA, class = NA, type = "train", method = "moment", prior = NA)
data |
Data to be classified. |
test |
Vector with the indices of 'data' to be used as the test set. For type = "train", set test = NA. |
class |
Vector with data classes names. |
type |
Type of analysis: "train" (default) or "test". |
method |
Classification method: "moment" (default) or "mle". |
prior |
Probabilities of occurrence of classes. If not specified, it will take the proportions of the classes. If specified, probabilities must follow the order of factor levels. |
predict |
The classified factors of the set. |
Paulo Cesar Ossani
Rencher, A. C. Methods of multivariate analysis. 2nd ed. New York: J. Wiley, 2002. 708 p.
Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S. 4th ed. Springer, 2002.
Mingoti, S. A. Analise de dados atraves de metodos de estatistica multivariada: uma abordagem aplicada. Belo Horizonte: UFMG, 2005. 297 p.
Ferreira, D. F. Estatistica Multivariada. 2a ed. revisada e ampliada. Lavras: Editora UFLA, 2011. 676 p.
See also: plot_curve and results
data(iris) # data set

data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4], "class")

#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
data.train = data[dat.sample == 1,] # training data set
data.test  = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test  = as.factor(data.test$class)  # class names of the test data set
#### End - hold out validation method ####

r <- (ncol(data) - 1)
class <- data[,c(r+1)] # classes names

## Data training example
res <- lda(data = data[,1:r], test = NA, class = class,
           type = "train", method = "moment", prior = NA)

resp <- results(orig.class = class, predict = res$predict)

message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
message("General results of the classes:"); resp$res.class

## Data test example
class.table <- table(class) # table with the number of elements per class
prior <- as.double(class.table/sum(class.table))
test  <- as.integer(rownames(data.test)) # test data index

res <- lda(data = data[,1:r], test = test, class = class,
           type = "test", method = "mle", prior = prior)

resp <- results(orig.class = class.test, predict = res$predict)

message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
message("General results of the classes:"); resp$res.class
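Since the cited Venables & Ripley text underlies MASS, a conceptually equivalent fit can be sketched with MASS::lda (shipped with R); this shows the same method/prior ideas but is not the Kira implementation:

```r
# Linear discriminant analysis on iris via MASS, resubstitution prediction.
library(MASS)

data(iris)
fit  <- MASS::lda(Species ~ ., data = iris, method = "moment")
pred <- predict(fit, iris)$class  # predicted class for every row
mean(pred == iris$Species)        # hit rate (about 0.98 on iris)
```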
Returns graphs of the results of the classification process.
plot_curve(data, type = "ROC", title = NA, xlabel = NA, ylabel = NA,
           posleg = 3, boxleg = FALSE, axis = TRUE, size = 1.1,
           grid = TRUE, color = TRUE, classcolor = NA, savptc = FALSE,
           width = 3236, height = 2000, res = 300, casc = TRUE)
data |
Data with x and y coordinates. |
type |
Type of graph: "ROC" (default) or "PRC". |
title |
Title of the graphic, if not set, assumes the default text. |
xlabel |
Names the X axis, if not set, assumes the default text. |
ylabel |
Names the Y axis, if not set, assumes the default text. |
posleg |
Position of the legend: 0 for no legend; other values set its placement on the graph (default = 3). |
boxleg |
Puts a frame around the legend (default = FALSE). |
axis |
Put the diagonal axis on the graph (default = TRUE). |
size |
Size of the points in the graphs (default = 1.1). |
grid |
Put grid on graphs (default = TRUE). |
color |
Colored graphics (default = TRUE). |
classcolor |
Vector with the colors of the classes. |
savptc |
Saves graphics images to files (default = FALSE). |
width |
Graphics images width when savptc = TRUE (default = 3236). |
height |
Graphics images height when savptc = TRUE (default = 2000). |
res |
Nominal resolution in ppi of the graphics images when savptc = TRUE (default = 300). |
casc |
Cascade effect in the presentation of the graphic (default = TRUE). |
ROC or PRC curve.
Paulo Cesar Ossani
data(iris) # data set

data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4], "class")

#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
data.train = data[dat.sample == 1,] # training data set
data.test  = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test  = as.factor(data.test$class)  # class names of the test data set
#### End - hold out validation method ####

dist = "euclidean"
# dist = "manhattan"
# dist = "minkowski"
# dist = "canberra"
# dist = "maximum"
# dist = "chebyshev"

k = 1
lambda = 5

r <- (ncol(data) - 1)

res <- knn(train = data.train[,1:r], test = data.test[,1:r],
           class = class.train, k = 1, dist = dist, lambda = lambda)

resp <- results(orig.class = class.test, predict = res$predict)

message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
# message("Data for the ROC curve in classes:"); resp$roc.curve
# message("Data for the PRC curve in classes:"); resp$prc.curve
message("General results of the classes:"); resp$res.class

dat <- resp$roc.curve; tp = "roc"; ps = 3
# dat <- resp$prc.curve; tp = "prc"; ps = 4

plot_curve(data = dat, type = tp, title = NA, xlabel = NA,
           ylabel = NA, posleg = ps, boxleg = FALSE, axis = TRUE,
           size = 1.1, grid = TRUE, color = TRUE, classcolor = NA,
           savptc = FALSE, width = 3236, height = 2000, res = 300,
           casc = FALSE)
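The points plot_curve draws come from sweeping a decision threshold over classifier scores and recording (FPR, TPR) pairs. A self-contained base-R sketch with toy scores (the data here are invented for illustration; plot_curve consumes the analogous per-class data produced by results()):

```r
# Build ROC points by thresholding scores; base R only, toy data.
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)  # classifier scores
truth <- c(1,   1,   0,   1,   0,    1,   0,   0)    # true binary labels
ths   <- sort(unique(score), decreasing = TRUE)

roc <- t(sapply(ths, function(th) {
  pred <- score >= th                                # positive above threshold
  c(fpr = sum(pred & truth == 0) / sum(truth == 0),  # false positive rate
    tpr = sum(pred & truth == 1) / sum(truth == 1))  # true positive rate
}))

plot(roc[, "fpr"], roc[, "tpr"], type = "b", xlab = "FPR", ylab = "TPR")
abline(0, 1, lty = 2)  # chance diagonal (the 'axis' option above)
```

A PRC curve is built the same way, recording (recall, precision) instead of (FPR, TPR).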
Perform quadratic discriminant analysis.
qda(data, test = NA, class = NA, type = "train", method = "moment", prior = NA)
data |
Data to be classified. |
test |
Vector with the indices of 'data' to be used as the test set. For type = "train", set test = NA. |
class |
Vector with data classes names. |
type |
Type of analysis: "train" (default) or "test". |
method |
Classification method: "moment" (default) or "mle". |
prior |
Probabilities of occurrence of classes. If not specified, it will take the proportions of the classes. If specified, probabilities must follow the order of factor levels. |
predict |
The classified factors of the set. |
Paulo Cesar Ossani
Rencher, A. C. Methods of multivariate analysis. 2nd ed. New York: J. Wiley, 2002. 708 p.
Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S. 4th ed. Springer, 2002.
Mingoti, S. A. Analise de dados atraves de metodos de estatistica multivariada: uma abordagem aplicada. Belo Horizonte: UFMG, 2005. 297 p.
Ferreira, D. F. Estatistica Multivariada. 2a ed. revisada e ampliada. Lavras: Editora UFLA, 2011. 676 p.
See also: plot_curve and results
data(iris) # data set

data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4], "class")

#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
data.train = data[dat.sample == 1,] # training data set
data.test  = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test  = as.factor(data.test$class)  # class names of the test data set
#### End - hold out validation method ####

r <- (ncol(data) - 1)
class <- data[,c(r+1)] # classes names

## Data training example
res <- qda(data = data[,1:r], test = NA, class = class,
           type = "train", method = "moment", prior = NA)

resp <- results(orig.class = class, predict = res$predict)

message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
message("General results of the classes:"); resp$res.class

## Data test example
class.table <- table(class) # table with the number of elements per class
prior <- as.double(class.table/sum(class.table))
test  <- as.integer(rownames(data.test)) # test data index

res <- qda(data = data[,1:r], test = test, class = class,
           type = "test", method = "mle", prior = prior)

resp <- results(orig.class = class.test, predict = res$predict)

message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
message("General results of the classes:"); resp$res.class
Performs supervised classification using the linear regression method.
regression(train, test, class, intercept = TRUE)
train |
Training data set, without the classes. |
test |
Test data set. |
class |
Vector with data classes names. |
intercept |
Consider the intercept in the regression (default = TRUE). |
predict |
The classified factors of the test set. |
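The idea behind classification by linear regression can be sketched in a few lines of base R: fit a least-squares model to an indicator response (one column per class) and assign each observation to the class with the largest fitted score. This is a minimal illustration of the technique only, not Kira's internal code, and the variable names are hypothetical.

```r
# Minimal sketch: classification via least-squares regression on an
# indicator matrix (illustrative only; Kira's internals may differ).
data(iris)
x   <- as.matrix(iris[, 1:4])
cls <- iris$Species

Y <- model.matrix(~ cls - 1)           # one indicator column per class
X <- cbind(1, x)                       # design matrix with intercept
fit    <- lm.fit(X, Y)                 # least-squares fit
scores <- X %*% fit$coefficients       # fitted class scores
pred   <- levels(cls)[max.col(scores)] # class with the largest score

mean(pred == cls)                      # training hit rate
```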
Paulo Cesar Ossani
Charnet, R. et al. Análise de modelos de regressão linear. 2a ed. Campinas: Editora da Unicamp, 2008. 357 p.
Rencher, A. C. and Schaalje, G. B. Linear Models in Statistics. 2nd ed. New Jersey: John Wiley & Sons, 2008. 672 p.
Rencher, A. C. Methods of Multivariate Analysis. 2nd ed. New York: J. Wiley, 2002. 708 p.
plot_curve
and results
data(iris) # data set

data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4], "class")

#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
data.train = data[dat.sample == 1,] # training data set
data.test  = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test  = as.factor(data.test$class)  # class names of the test data set
#### End - hold out validation method ####

r <- (ncol(data) - 1)

res <- regression(train = data.train[,1:r], test = data.test[,1:r],
                  class = class.train, intercept = TRUE)

resp <- results(orig.class = class.test, predict = res$predict)

message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
message("General results of the classes:"); resp$res.class
Returns the results of the classification process.
results(orig.class, predict)
orig.class |
Data with the original classes. |
predict |
Data with the classes predicted by the classifier. |
mse |
Mean squared error. |
mae |
Mean absolute error. |
rae |
Relative absolute error. |
conf.mtx |
Confusion matrix. |
rate.hits |
Hit rate. |
rate.error |
Error rate. |
num.hits |
Number of correct instances. |
num.error |
Number of wrong instances. |
kappa |
Kappa coefficient. |
roc.curve |
Data for the ROC curve in classes. |
prc.curve |
Data for the PRC curve in classes. |
res.class |
General results of the classes: Sensitivity, Specificity, Precision, TP Rate, FP Rate, NP Rate, F-Score, MCC, ROC Area, PRC Area. |
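Several of these summary values follow directly from the confusion matrix. As a minimal base-R sketch (illustrative only; the results function computes these internally), the hit rate and Cohen's kappa can be obtained like this:

```r
# Minimal sketch: hit rate and Cohen's kappa from a confusion matrix
# (illustrative only; not Kira's internal code).
orig <- factor(c("a", "a", "b", "b", "b", "a")) # original classes
pred <- factor(c("a", "b", "b", "b", "a", "a")) # predicted classes

conf.mtx <- table(orig, pred)                  # confusion matrix
n        <- sum(conf.mtx)
p.obs    <- sum(diag(conf.mtx)) / n            # hit rate (observed agreement)
p.exp    <- sum(rowSums(conf.mtx) * colSums(conf.mtx)) / n^2 # chance agreement
kappa    <- (p.obs - p.exp) / (1 - p.exp)      # Cohen's kappa

round(c(rate.hits = p.obs, kappa = kappa), 4)
```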
Paulo Cesar Ossani
Chicco, D.; Warrens, M. J. and Jurman, G. The Matthews correlation coefficient (MCC) is more informative than Cohen's kappa and Brier score in binary classification assessment. IEEE Access, v. 9, p. 78368-78381, 2021.
data(iris) # data set

data <- iris
names <- colnames(data)
colnames(data) <- c(names[1:4], "class")

#### Start - hold out validation method ####
dat.sample = sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
data.train = data[dat.sample == 1,] # training data set
data.test  = data[dat.sample == 2,] # test data set
class.train = as.factor(data.train$class) # class names of the training data set
class.test  = as.factor(data.test$class)  # class names of the test data set
#### End - hold out validation method ####

dist = "euclidean"
# dist = "manhattan"
# dist = "minkowski"
# dist = "canberra"
# dist = "maximum"
# dist = "chebyshev"

k = 1
lambda = 5

r <- (ncol(data) - 1)

res <- knn(train = data.train[,1:r], test = data.test[,1:r],
           class = class.train, k = 1, dist = dist, lambda = lambda)

resp <- results(orig.class = class.test, predict = res$predict)

message("Mean squared error:"); resp$mse
message("Mean absolute error:"); resp$mae
message("Relative absolute error:"); resp$rae
message("Confusion matrix:"); resp$conf.mtx
message("Hit rate: ", resp$rate.hits)
message("Error rate: ", resp$rate.error)
message("Number of correct instances: ", resp$num.hits)
message("Number of wrong instances: ", resp$num.error)
message("Kappa coefficient: ", resp$kappa)
# message("Data for the ROC curve in classes:"); resp$roc.curve
# message("Data for the PRC curve in classes:"); resp$prc.curve
message("General results of the classes:"); resp$res.class

dat <- resp$roc.curve; tp = "roc"; ps = 3
# dat <- resp$prc.curve; tp = "prc"; ps = 4

plot_curve(data = dat, type = tp, title = NA, xlabel = NA,
           ylabel = NA, posleg = ps, boxleg = FALSE, axis = TRUE,
           size = 1.1, grid = TRUE, color = TRUE, classcolor = NA,
           savptc = FALSE, width = 3236, height = 2000, res = 300,
           casc = FALSE)
Generates the silhouette graph and returns the ideal number of clusters in the k-means method.
silhouette(data, k.cluster = 2:10, plot = TRUE, cut = TRUE,
           title = NA, xlabel = NA, ylabel = NA, size = 1.1,
           grid = TRUE, color = TRUE, savptc = FALSE, width = 3236,
           height = 2000, res = 300, casc = TRUE)
data |
Data with x and y coordinates. |
k.cluster |
Cluster numbers for comparison in the k-means method (default = 2:10). |
plot |
Indicates whether to plot the silhouette graph (default = TRUE). |
cut |
Indicates whether to plot the best cluster indicative line (default = TRUE). |
title |
Title of the graphic, if not set, assumes the default text. |
xlabel |
Names the X axis, if not set, assumes the default text. |
ylabel |
Names the Y axis, if not set, assumes the default text. |
size |
Size of points on the graph and line thickness (default = 1.1). |
grid |
Put grid on graph (default = TRUE). |
color |
Colored graphic (default = TRUE). |
savptc |
Saves the graph image to a file (default = FALSE). |
width |
Graphic image width when savptc = TRUE (default = 3236). |
height |
Graphic image height when savptc = TRUE (default = 2000). |
res |
Nominal resolution in ppi of the graphic image when savptc = TRUE (default = 300). |
casc |
Cascade effect in the presentation of the graphic (default = TRUE). |
k.ideal |
Ideal number of clusters. |
eve.si |
Vector with averages of silhouette indices of cluster groups (si). |
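The criterion behind this choice can be reproduced with the cluster package: for each candidate k, run k-means and take the mean silhouette width, then pick the k with the largest average. A minimal sketch, assuming the cluster package is installed (it is not part of Kira, and its silhouette function is distinct from this one):

```r
# Minimal sketch: average silhouette width per candidate k, using
# stats::kmeans and cluster::silhouette (illustrative only; not
# Kira's internal code).
library(cluster)

data(iris)
x <- iris[, 1:4]
d <- dist(x) # pairwise distances

set.seed(7)
avg.si <- sapply(2:6, function(k) {
  cl <- kmeans(x, centers = k, nstart = 10)$cluster
  mean(cluster::silhouette(cl, d)[, "sil_width"]) # mean si for this k
})
k.ideal <- (2:6)[which.max(avg.si)] # k with the largest average index
```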
Paulo Cesar Ossani
Anitha, S.; Metilda, M. A. R. Y. An extensive investigation of outlier detection by cluster validation indices. Ciencia e Tecnica Vitivinicola - A Science and Technology Journal, v. 34, n. 2, p. 22-32, 2019. doi: 10.13140/RG.2.2.26801.63848
Kaufman, L. and Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons. 1990.
Martinez, W. L.; Martinez, A. R.; Solka, J. Exploratory data analysis with MATLAB. 2nd ed. New York: Chapman & Hall/CRC, 2010. 499 p.
Rousseeuw P. J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics, 20:53-65. 1987. doi: 10.1016/0377-0427(87)90125-7
data(iris) # data set

res <- silhouette(data = iris[,1:4], k.cluster = 2:10, cut = TRUE,
                  plot = TRUE, title = NA, xlabel = NA, ylabel = NA,
                  size = 1.1, grid = TRUE, savptc = FALSE, width = 3236,
                  color = TRUE, height = 2000, res = 300, casc = TRUE)

res$k.ideal # ideal number of clusters
res$eve.si  # vector with averages of si indices

res <- silhouette(data = iris[,1:4], k.cluster = 3, cut = TRUE,
                  plot = TRUE, title = NA, xlabel = NA, ylabel = NA,
                  size = 1.1, grid = TRUE, savptc = FALSE, width = 3236,
                  color = TRUE, height = 2000, res = 300, casc = TRUE)

res$k.ideal # ideal number of clusters
res$eve.si  # vector with averages of si indices