Title: | R/Weka Interface |
---|---|
Description: | An R interface to Weka (Version 3.9.3). Weka is a collection of machine learning algorithms for data mining tasks written in Java, containing tools for data pre-processing, classification, regression, clustering, association rules, and visualization. Package 'RWeka' contains the interface code, the Weka jar is in a separate package 'RWekajars'. For more information on Weka see <https://www.cs.waikato.ac.nz/ml/weka/>. |
Authors: | Kurt Hornik [aut, cre], Christian Buchta [ctb], Torsten Hothorn [ctb], Alexandros Karatzoglou [ctb], David Meyer [ctb], Achim Zeileis [ctb] |
Maintainer: | Kurt Hornik <[email protected]> |
License: | GPL-2 |
Version: | 0.4-46 |
Built: | 2024-12-04 07:07:03 UTC |
Source: | CRAN |
Write a DOT language representation of an object for processing via Graphviz.
    write_to_dot(x, con = stdout(), ...)
    ## S3 method for class 'Weka_classifier'
    write_to_dot(x, con = stdout(), ...)
x | an R object. |
con | a connection for writing the representation to. |
... | additional arguments to be passed from or to methods. |
Graphviz (https://www.graphviz.org) is open source graph visualization software providing several main graph layout programs, of which dot makes “hierarchical” or layered drawings of directed graphs, and hence is typically most suitable for visualizing classification trees.
Using dot, the representation in file ‘foo.dot’ can be transformed to PostScript or other displayable graphical formats using (a variant of) dot -Tps foo.dot >foo.ps.
Some Weka classifiers (e.g., tree learners such as J48 and M5P) implement a “Drawable” interface providing DOT representations of the fitted models. For such classifiers, the write_to_dot method writes the representation to the specified connection.
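The workflow described above can be sketched as follows. This is a minimal illustration, assuming RWeka and a working Java installation are available; the temporary file names are arbitrary, and the rendering step only runs if a Graphviz dot executable is found on the search path.

```r
library(RWeka)

# Fit a tree learner; J48 is named above as implementing "Drawable".
m <- J48(Species ~ ., data = iris)

# Write the DOT representation to a temporary file (file name is arbitrary).
dot_file <- tempfile(fileext = ".dot")
write_to_dot(m, dot_file)

# Inspect the first lines of what was written.
dot_lines <- readLines(dot_file)
head(dot_lines)

# Render to PNG only if the Graphviz 'dot' executable is on the PATH.
if (nzchar(Sys.which("dot"))) {
  png_file <- sub("\\.dot$", ".png", dot_file)
  system2("dot", c("-Tpng", shQuote(dot_file), "-o", shQuote(png_file)))
}
```

Writing to a file rather than to stdout() keeps the DOT text available for later rendering in whichever output format dot is asked for.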
Compute model performance statistics for a fitted Weka classifier.
    evaluate_Weka_classifier(object, newdata = NULL, cost = NULL,
                             numFolds = 0, complexity = FALSE,
                             class = FALSE, seed = NULL, ...)
object | a Weka_classifier object. |
newdata | an optional data frame in which to look for variables with which to evaluate. If omitted or NULL, the training instances are used. |
cost | a square matrix of (mis)classification costs. |
numFolds | the number of folds to use in cross-validation. |
complexity | option to include entropy-based statistics. |
class | option to include class statistics. |
seed | optional seed for cross-validation. |
... | further arguments passed to other methods (see details). |
The function computes and extracts a non-redundant set of performance statistics that is suitable for model interpretation. By default the statistics are computed on the training data.
Currently argument ... only supports the logical variable normalize, which tells Weka to normalize the cost matrix so that the cost of a correct classification is zero.
Note that if the class variable is numeric only a subset of the statistics are available. Arguments complexity and class are then not applicable and therefore ignored.
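Since the default is to evaluate on the training data, it can help to see the newdata argument in use. The sketch below is illustrative only: the every-third-row split of iris is an arbitrary choice made for this example, not a recommended validation scheme.

```r
library(RWeka)

# Hold out every third row of 'iris' as a test set (arbitrary split).
test_idx <- seq(1, nrow(iris), by = 3)
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

m <- J48(Species ~ ., data = train)

# Without 'newdata' the statistics refer to the training data ...
e_train <- evaluate_Weka_classifier(m)
# ... with 'newdata' they refer to the held-out rows instead.
e_test <- evaluate_Weka_classifier(m, newdata = test, class = TRUE)

e_test$confusionMatrix
e_test$detailsClass
```

Comparing e_train with e_test gives a rough sense of how optimistic the training-data statistics are for this model.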
An object of class Weka_classifier_evaluation, a list of the following components:
string | character, concatenation of the string representations of the performance statistics. |
details | vector, base statistics, e.g., the percentage of instances correctly classified, etc. |
detailsComplexity | vector, entropy-based statistics (if selected). |
detailsClass | matrix, class statistics, e.g., the true positive rate, etc., for each level of the response variable (if selected). |
confusionMatrix | table, cross-classification of true and predicted classes. |
I. H. Witten and E. Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco.
    ## Use some example data.
    w <- read.arff(system.file("arff","weather.nominal.arff",
                               package = "RWeka"))
    ## Identify a decision tree.
    m <- J48(play~., data = w)
    m
    ## Use 10 fold cross-validation.
    e <- evaluate_Weka_classifier(m,
                                  cost = matrix(c(0,2,1,0), ncol = 2),
                                  numFolds = 10, complexity = TRUE,
                                  seed = 123, class = TRUE)
    e
    summary(e)
    e$details
Predicted values based on fitted Weka classifier models.
    ## S3 method for class 'Weka_classifier'
    predict(object, newdata = NULL, type = c("class", "probability"), ...)
object | an object of class inheriting from Weka_classifier. |
newdata | an optional data frame in which to look for variables with which to predict. If omitted or NULL, the training instances are used. |
type | character string determining whether classes should be predicted (numeric for regression, factor for classification) or class probabilities (only available for classification). May be abbreviated. |
... | further arguments passed to or from other methods. |
Either a vector with classes or a matrix with the posterior class probabilities, with rows corresponding to instances and columns to classes.
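The two return shapes can be illustrated with a short sketch. The three rows picked from iris merely stand in for new observations and are an assumption of this example.

```r
library(RWeka)

m <- J48(Species ~ ., data = iris)

# A few rows standing in for new observations.
new_obs <- iris[c(1, 51, 101), ]

# Predicted classes: a factor for a classification problem.
cls <- predict(m, newdata = new_obs)

# Class probabilities: one row per instance, one column per class.
prob <- predict(m, newdata = new_obs, type = "probability")

cls
round(prob, 3)

# 'type' may be abbreviated, as the argument description notes.
prob2 <- predict(m, newdata = new_obs, type = "prob")
```

For a regression learner such as M5P only type = "class" applies, and the result is then a numeric vector.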
Predict class ids or memberships based on fitted Weka clusterers.
    ## S3 method for class 'Weka_clusterer'
    predict(object, newdata = NULL, type = c("class_ids", "memberships"), ...)
object | an object of class inheriting from Weka_clusterer. |
newdata | an optional data set for which predictions are sought. This must be given for predicting class memberships. If omitted or NULL, the training instances are used. |
type | a character string indicating whether class ids or memberships should be returned. May be abbreviated. |
... | further arguments passed to or from other methods. |
It is only possible to predict class memberships if the Weka clusterer provides a distributionForInstance method.
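A minimal sketch of both prediction types follows. It assumes that SimpleKMeans in the installed Weka version provides distributionForInstance, which is what membership prediction depends on; N = 3 requests three clusters and is chosen only to match the three iris species.

```r
library(RWeka)

# Cluster the four numeric iris measurements into three clusters.
x <- iris[, -5]
cl <- SimpleKMeans(x, Weka_control(N = 3))

# Class ids for the training instances (newdata omitted).
ids <- predict(cl)
table(ids, iris$Species)

# Memberships require 'newdata' to be given explicitly.
mem <- predict(cl, newdata = x[1:5, ], type = "memberships")
mem
```

If a clusterer lacks distributionForInstance, only type = "class_ids" can be used.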
Reads data from Weka Attribute-Relation File Format (ARFF) files.
    read.arff(file)
file | a character string with the name of the ARFF file to read from, or a connection. |
A data frame containing the data from the ARFF file.
Attribute-Relation File Format https://waikato.github.io/weka-wiki/formats_and_processing/arff/
    read.arff(system.file("arff", "contact-lenses.arff",
                          package = "RWeka"))
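To see what an ARFF file looks like without leaving R, a data frame can be written out and read back. This sketch uses write.arff, the companion writer in RWeka; the temporary file name is arbitrary.

```r
library(RWeka)

# Write a built-in data frame to a temporary ARFF file.
arff_file <- tempfile(fileext = ".arff")
write.arff(iris, file = arff_file)

# Peek at the header: a @relation line followed by @attribute declarations.
head(readLines(arff_file), 8)

# Read the file back into a data frame.
x <- read.arff(arff_file)
str(x)
```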
R interfaces to Weka association rule learning algorithms.
    Apriori(x, control = NULL)
    Tertius(x, control = NULL)
x | an R object with the data to be associated. |
control | an object of class Weka_control, or NULL (default). |
Apriori implements an Apriori-type algorithm, which iteratively reduces the minimum support until it finds the required number of rules with the given minimum confidence.
Tertius implements a Tertius-type algorithm.
See the references for more information on these algorithms.
A list inheriting from class Weka_associators with components including
associator | a reference (of class jobjRef) to the Java object holding the associator built by Weka. |
Tertius requires Weka package tertius to be installed.
R. Agrawal and R. Srikant (1994). Fast algorithms for mining association rules in large databases. Proceedings of the International Conference on Very Large Databases, 478–499. Santiago, Chile: Morgan Kaufmann, Los Altos, CA.
P. A. Flach and N. Lachiche (1999). Confirmation-guided discovery of first-order rules with Tertius. Machine Learning, 42, 61–95. doi:10.1023/A:1007656703224.
I. H. Witten and E. Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco.
    x <- read.arff(system.file("arff", "contact-lenses.arff",
                               package = "RWeka"))
    ## Apriori with defaults.
    Apriori(x)
    ## Some options: set required number of rules to 20.
    Apriori(x, Weka_control(N = 20))
    ## Not run:
    ## Requires Weka package 'tertius' to be installed.
    ## Tertius with defaults.
    Tertius(x)
    ## Some options: only classification rules (single item in the RHS).
    Tertius(x, Weka_control(S = TRUE))
    ## End(Not run)
R interfaces to Weka attribute evaluators.
    GainRatioAttributeEval(formula, data, subset, na.action, control = NULL)
    InfoGainAttributeEval(formula, data, subset, na.action, control = NULL)
formula | a symbolic description of a model. Note that for unsupervised filters the response can be omitted. |
data | an optional data frame containing the variables in the model. |
subset | an optional vector specifying a subset of observations to be used in the fitting process. |
na.action | a function which indicates what should happen when the data contain NAs. See model.frame for details. |
control | an object of class Weka_control, or NULL (default). |
GainRatioAttributeEval evaluates the worth of an attribute by measuring the gain ratio with respect to the class.
InfoGainAttributeEval evaluates the worth of an attribute by measuring the information gain with respect to the class.
Currently, interfaces are only possible for classes that evaluate single attributes rather than attribute subsets (technically, classes that implement the Weka AttributeEvaluator interface).
A numeric vector with the figures of merit for the attributes specified by the right hand side of formula.
    InfoGainAttributeEval(Species ~ . , data = iris)
R interfaces to Weka regression and classification function learners.
    LinearRegression(formula, data, subset, na.action,
                     control = Weka_control(), options = NULL)
    Logistic(formula, data, subset, na.action,
             control = Weka_control(), options = NULL)
    SMO(formula, data, subset, na.action,
        control = Weka_control(), options = NULL)
formula | a symbolic description of the model to be fit. |
data | an optional data frame containing the variables in the model. |
subset | an optional vector specifying a subset of observations to be used in the fitting process. |
na.action | a function which indicates what should happen when the data contain NAs. See model.frame for details. |
control | an object of class Weka_control giving options to be passed to the Weka learner. |
options | a named list of further options, or NULL (default). |
There are a predict method for predicting from the fitted models, and a summary method based on evaluate_Weka_classifier.
LinearRegression builds suitable linear regression models, using the Akaike criterion for model selection.
Logistic builds multinomial logistic regression models based on ridge estimation (le Cessie and van Houwelingen, 1992).
SMO implements John C. Platt's sequential minimal optimization algorithm for training a support vector classifier using polynomial or RBF kernels. Multi-class problems are solved using pairwise classification.
The model formulae should only use the ‘+’ and ‘-’ operators to indicate the variables to be included or not used, respectively.
Argument options allows further customization. Currently, options model and instances (or partial matches for these) are used: if set to TRUE, the model frame or the corresponding Weka instances, respectively, are included in the fitted model object, possibly speeding up subsequent computations on the object. By default, neither is included.
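A list inheriting from classes Weka_functions and Weka_classifiers with components including
classifier | a reference (of class jobjRef) to a Java object obtained by applying the Weka buildClassifier method to build the specified model using the given control options. |
predictions | a numeric vector or factor with the model predictions for the training instances (the results of calling the Weka classifyInstance method for the built classifier and each instance). |
call | the matched call. |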
J. C. Platt (1998). Fast training of Support Vector Machines using Sequential Minimal Optimization. In B. Schoelkopf, C. Burges, and A. Smola (eds.), Advances in Kernel Methods — Support Vector Learning. MIT Press.
I. H. Witten and E. Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco.
    ## Linear regression:
    ## Using standard data set 'mtcars'.
    LinearRegression(mpg ~ ., data = mtcars)
    ## Compare to R:
    step(lm(mpg ~ ., data = mtcars), trace = 0)
    ## Using standard data set 'chickwts'.
    LinearRegression(weight ~ feed, data = chickwts)
    ## (Note the interactions!)
    ## Logistic regression:
    ## Using standard data set 'infert'.
    STATUS <- factor(infert$case, labels = c("control", "case"))
    Logistic(STATUS ~ spontaneous + induced, data = infert)
    ## Compare to R:
    glm(STATUS ~ spontaneous + induced, data = infert, family = binomial())
    ## Sequential minimal optimization algorithm for training a support
    ## vector classifier, using an RBF kernel with a non-default gamma
    ## parameter (argument '-G') instead of the default polynomial kernel
    ## (from a question on r-help):
    SMO(Species ~ ., data = iris,
        control = Weka_control(K =
        list("weka.classifiers.functions.supportVector.RBFKernel", G = 2)))
    ## In fact, by some hidden magic it also "works" to give the "base" name
    ## of the Weka kernel class:
    SMO(Species ~ ., data = iris,
        control = Weka_control(K = list("RBFKernel", G = 2)))
R interfaces to Weka lazy learners.
    IBk(formula, data, subset, na.action,
        control = Weka_control(), options = NULL)
    LBR(formula, data, subset, na.action,
        control = Weka_control(), options = NULL)
formula | a symbolic description of the model to be fit. |
data | an optional data frame containing the variables in the model. |
subset | an optional vector specifying a subset of observations to be used in the fitting process. |
na.action | a function which indicates what should happen when the data contain NAs. See model.frame for details. |
control | an object of class Weka_control giving options to be passed to the Weka learner. |
options | a named list of further options, or NULL (default). |
There are a predict method for predicting from the fitted models, and a summary method based on evaluate_Weka_classifier.
IBk provides a k-nearest neighbors classifier, see Aha & Kibler (1991).
LBR (“Lazy Bayesian Rules”) implements a lazy learning approach to lessening the attribute-independence assumption of naive Bayes as suggested by Zheng & Webb (2000).
The model formulae should only use the ‘+’ and ‘-’ operators to indicate the variables to be included or not used, respectively.
Argument options allows further customization. Currently, options model and instances (or partial matches for these) are used: if set to TRUE, the model frame or the corresponding Weka instances, respectively, are included in the fitted model object, possibly speeding up subsequent computations on the object. By default, neither is included.
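A list inheriting from classes Weka_lazy and Weka_classifiers with components including
classifier | a reference (of class jobjRef) to a Java object obtained by applying the Weka buildClassifier method to build the specified model using the given control options. |
predictions | a numeric vector or factor with the model predictions for the training instances (the results of calling the Weka classifyInstance method for the built classifier and each instance). |
call | the matched call. |
LBR requires Weka package lazyBayesianRules to be installed.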
D. Aha and D. Kibler (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66. doi:10.1007/BF00153759.
Z. Zheng and G. Webb (2000). Lazy learning of Bayesian rules. Machine Learning, 41/1, 53–84. doi:10.1023/A:1007613203719.
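A short usage sketch for IBk follows. The choice of three neighbors is arbitrary and made only for illustration; K is IBk's Weka option for the number of neighbors, passed here through Weka_control.

```r
library(RWeka)

# k-nearest neighbors with k = 3 (an arbitrary choice for this example).
m <- IBk(Species ~ ., data = iris, control = Weka_control(K = 3))

# Predictions for the training instances, compared with the true classes.
table(predicted = predict(m), actual = iris$Species)

# summary() is based on evaluate_Weka_classifier().
summary(m)
```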
R interfaces to Weka meta learners.
    AdaBoostM1(formula, data, subset, na.action,
               control = Weka_control(), options = NULL)
    Bagging(formula, data, subset, na.action,
            control = Weka_control(), options = NULL)
    LogitBoost(formula, data, subset, na.action,
               control = Weka_control(), options = NULL)
    MultiBoostAB(formula, data, subset, na.action,
                 control = Weka_control(), options = NULL)
    Stacking(formula, data, subset, na.action,
             control = Weka_control(), options = NULL)
    CostSensitiveClassifier(formula, data, subset, na.action,
                            control = Weka_control(), options = NULL)
formula | a symbolic description of the model to be fit. |
data | an optional data frame containing the variables in the model. |
subset | an optional vector specifying a subset of observations to be used in the fitting process. |
na.action | a function which indicates what should happen when the data contain NAs. See model.frame for details. |
control | an object of class Weka_control giving options to be passed to the Weka learner. |
options | a named list of further options, or NULL (default). |
There are a predict method for predicting from the fitted models, and a summary method based on evaluate_Weka_classifier.
AdaBoostM1 implements the AdaBoost M1 method of Freund and Schapire (1996).
Bagging provides bagging (Breiman, 1996).
LogitBoost performs boosting via additive logistic regression (Friedman, Hastie and Tibshirani, 2000).
MultiBoostAB implements MultiBoosting (Webb, 2000), an extension to the AdaBoost technique for forming decision committees which can be viewed as a combination of AdaBoost and “wagging”.
Stacking provides stacking (Wolpert, 1992).
CostSensitiveClassifier makes its base classifier cost-sensitive.
The model formulae should only use the ‘+’ and ‘-’ operators to indicate the variables to be included or not used, respectively.
Argument options allows further customization. Currently, options model and instances (or partial matches for these) are used: if set to TRUE, the model frame or the corresponding Weka instances, respectively, are included in the fitted model object, possibly speeding up subsequent computations on the object. By default, neither is included.
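A list inheriting from classes Weka_meta and Weka_classifiers with components including
classifier | a reference (of class jobjRef) to a Java object obtained by applying the Weka buildClassifier method to build the specified model using the given control options. |
predictions | a numeric vector or factor with the model predictions for the training instances (the results of calling the Weka classifyInstance method for the built classifier and each instance). |
call | the matched call. |
MultiBoostAB requires Weka package multiBoostAB to be installed.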
L. Breiman (1996). Bagging predictors. Machine Learning, 24/2, 123–140. doi:10.1023/A:1018054314350.
Y. Freund and R. E. Schapire (1996). Experiments with a new boosting algorithm. In Proceedings of the International Conference on Machine Learning, pages 148–156. Morgan Kaufmann: San Francisco.
J. H. Friedman, T. Hastie, and R. Tibshirani (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28/2, 337–374. doi:10.1214/aos/1016218223.
G. I. Webb (2000). MultiBoosting: A technique for combining boosting and wagging. Machine Learning, 40/2, 159–196. doi:10.1023/A:1007659514849.
I. H. Witten and E. Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco.
D. H. Wolpert (1992). Stacked generalization. Neural Networks, 5, 241–259. doi:10.1016/S0893-6080(05)80023-1.
    ## Use AdaBoostM1 with decision stumps.
    m1 <- AdaBoostM1(Species ~ ., data = iris,
                     control = Weka_control(W = "DecisionStump"))
    table(predict(m1), iris$Species)
    summary(m1) # uses evaluate_Weka_classifier()
    ## Control options for the base classifiers employed by the meta
    ## learners (apart from Stacking) can be given as follows:
    m2 <- AdaBoostM1(Species ~ ., data = iris,
                     control = Weka_control(W = list(J48, M = 30)))
R interfaces to Weka rule learners.
    JRip(formula, data, subset, na.action,
         control = Weka_control(), options = NULL)
    M5Rules(formula, data, subset, na.action,
            control = Weka_control(), options = NULL)
    OneR(formula, data, subset, na.action,
         control = Weka_control(), options = NULL)
    PART(formula, data, subset, na.action,
         control = Weka_control(), options = NULL)
formula | a symbolic description of the model to be fit. |
data | an optional data frame containing the variables in the model. |
subset | an optional vector specifying a subset of observations to be used in the fitting process. |
na.action | a function which indicates what should happen when the data contain NAs. See model.frame for details. |
control | an object of class Weka_control giving options to be passed to the Weka learner. |
options | a named list of further options, or NULL (default). |
There are a predict method for predicting from the fitted models, and a summary method based on evaluate_Weka_classifier.
JRip implements a propositional rule learner, “Repeated Incremental Pruning to Produce Error Reduction” (RIPPER), as proposed by Cohen (1995).
M5Rules generates a decision list for regression problems using separate-and-conquer. In each iteration it builds a model tree using M5 and makes the “best” leaf into a rule. See Hall, Holmes and Frank (1999) for more information.
OneR builds a simple 1-R classifier, see Holte (1993).
PART generates PART decision lists using the approach of Frank and Witten (1998).
The model formulae should only use the ‘+’ and ‘-’ operators to indicate the variables to be included or not used, respectively.
Argument options allows further customization. Currently, options model and instances (or partial matches for these) are used: if set to TRUE, the model frame or the corresponding Weka instances, respectively, are included in the fitted model object, possibly speeding up subsequent computations on the object. By default, neither is included.
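A list inheriting from classes Weka_rules and Weka_classifiers with components including
classifier | a reference (of class jobjRef) to a Java object obtained by applying the Weka buildClassifier method to build the specified model using the given control options. |
predictions | a numeric vector or factor with the model predictions for the training instances (the results of calling the Weka classifyInstance method for the built classifier and each instance). |
call | the matched call. |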
W. W. Cohen (1995). Fast effective rule induction. In A. Prieditis and S. Russell (eds.), Proceedings of the 12th International Conference on Machine Learning, pages 115–123. Morgan Kaufmann. ISBN 1-55860-377-8. doi:10.1016/B978-1-55860-377-6.50023-2.
E. Frank and I. H. Witten (1998). Generating accurate rule sets without global optimization. In J. Shavlik (ed.), Machine Learning: Proceedings of the Fifteenth International Conference. Morgan Kaufmann Publishers: San Francisco, CA. https://www.cs.waikato.ac.nz/~eibe/pubs/ML98-57.ps.gz
M. Hall, G. Holmes, and E. Frank (1999). Generating rule sets from model trees. Proceedings of the Twelfth Australian Joint Conference on Artificial Intelligence, Sydney, Australia, pages 1–12. Springer-Verlag. https://www.cs.waikato.ac.nz/~eibe/pubs/ajc.pdf
R. C. Holte (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 63–91. doi:10.1023/A:1022631118932.
I. H. Witten and E. Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco.
    M5Rules(mpg ~ ., data = mtcars)
    m <- PART(Species ~ ., data = iris)
    m
    summary(m)
R interfaces to Weka regression and classification tree learners.
    J48(formula, data, subset, na.action,
        control = Weka_control(), options = NULL)
    LMT(formula, data, subset, na.action,
        control = Weka_control(), options = NULL)
    M5P(formula, data, subset, na.action,
        control = Weka_control(), options = NULL)
    DecisionStump(formula, data, subset, na.action,
                  control = Weka_control(), options = NULL)
formula | a symbolic description of the model to be fit. |
data | an optional data frame containing the variables in the model. |
subset | an optional vector specifying a subset of observations to be used in the fitting process. |
na.action | a function which indicates what should happen when the data contain NAs. See model.frame for details. |
control | an object of class Weka_control giving options to be passed to the Weka learner. |
options | a named list of further options, or NULL (default). |
There are a predict method for predicting from the fitted models, and a summary method based on evaluate_Weka_classifier.
There is also a plot method for fitted binary Weka_trees via the facilities provided by package partykit. This converts the Weka_tree to a party object and then simply calls the plot method of this class (see plot.party).
Provided the Weka classification tree learner implements the “Drawable” interface (i.e., provides a graph method), write_to_dot can be used to create a DOT representation of the tree for visualization via Graphviz or the Rgraphviz package.
J48 generates unpruned or pruned C4.5 decision trees (Quinlan, 1993).
LMT implements “Logistic Model Trees” (Landwehr, 2003; Landwehr et al., 2005).
M5P (where the ‘P’ stands for ‘prime’) generates M5 model trees using the M5' algorithm, which was introduced in Wang & Witten (1997) and enhances the original M5 algorithm by Quinlan (1992).
DecisionStump implements decision stumps (trees with a single split only), which are frequently used as base learners for meta learners such as Boosting.
The model formulae should only use the ‘+’ and ‘-’ operators to indicate the variables to be included or not used, respectively.
Argument options allows further customization. Currently, options model and instances (or partial matches for these) are used: if set to TRUE, the model frame or the corresponding Weka instances, respectively, are included in the fitted model object, possibly speeding up subsequent computations on the object. By default, neither is included.
parse_Weka_digraph can parse the graph associated with a Weka tree classifier (and obtained by invoking its graph() method in Weka), returning a simple list with nodes and edges.
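A list inheriting from classes Weka_tree and Weka_classifiers with components including
classifier | a reference (of class jobjRef) to a Java object obtained by applying the Weka buildClassifier method to build the specified model using the given control options. |
predictions | a numeric vector or factor with the model predictions for the training instances (the results of calling the Weka classifyInstance method for the built classifier and each instance). |
call | the matched call. |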
N. Landwehr (2003). Logistic Model Trees. Master's thesis, Institute for Computer Science, University of Freiburg, Germany. https://www.cs.uni-potsdam.de/ml/landwehr/diploma_thesis.pdf
N. Landwehr, M. Hall, and E. Frank (2005). Logistic Model Trees. Machine Learning, 59, 161–205. doi:10.1007/s10994-005-0466-3.
R. Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
R. Quinlan (1992). Learning with continuous classes. Proceedings of the Australian Joint Conference on Artificial Intelligence, 343–348. World Scientific, Singapore.
Y. Wang and I. H. Witten (1997). Induction of model trees for predicting continuous classes. Proceedings of the European Conference on Machine Learning. University of Economics, Faculty of Informatics and Statistics, Prague.
I. H. Witten and E. Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco.
    m1 <- J48(Species ~ ., data = iris)
    ## print and summary
    m1
    summary(m1) # calls evaluate_Weka_classifier()
    table(iris$Species, predict(m1)) # by hand
    ## visualization
    ## use partykit package
    if(require("partykit", quietly = TRUE)) plot(m1)
    ## or Graphviz
    write_to_dot(m1)
    ## or Rgraphviz
    ## Not run:
    library("Rgraphviz")
    ff <- tempfile()
    write_to_dot(m1, ff)
    plot(agread(ff))
    ## End(Not run)
    ## Using some Weka data sets ...
    ## J48
    DF2 <- read.arff(system.file("arff", "contact-lenses.arff",
                                 package = "RWeka"))
    m2 <- J48(`contact-lenses` ~ ., data = DF2)
    m2
    table(DF2$`contact-lenses`, predict(m2))
    if(require("partykit", quietly = TRUE)) plot(m2)
    ## M5P
    DF3 <- read.arff(system.file("arff", "cpu.arff", package = "RWeka"))
    m3 <- M5P(class ~ ., data = DF3)
    m3
    if(require("partykit", quietly = TRUE)) plot(m3)
    ## Logistic Model Tree.
    DF4 <- read.arff(system.file("arff", "weather.arff", package = "RWeka"))
    m4 <- LMT(play ~ ., data = DF4)
    m4
    table(DF4$play, predict(m4))
    ## Larger scale example.
    if(require("mlbench", quietly = TRUE)
       && require("partykit", quietly = TRUE)) {
        ## Predict diabetes status for Pima Indian women
        data("PimaIndiansDiabetes", package = "mlbench")
        ## Fit J48 tree with reduced error pruning
        m5 <- J48(diabetes ~ ., data = PimaIndiansDiabetes,
                  control = Weka_control(R = TRUE))
        plot(m5)
        ## (Make sure that the plotting device is big enough for the tree.)
    }
R interfaces to Weka classifiers.
Supervised learners, i.e., algorithms for classification and regression, are termed “classifiers” by Weka. (Numeric prediction, i.e., regression, is interpreted as prediction of a continuous class.)
R interface functions to Weka classifiers are created by make_Weka_classifier, and have formals formula, data, subset, na.action, and control (default: none), where the first four have the “usual” meanings for statistical modeling functions in R, and the last again specifies the control options to be employed by the Weka learner.
By default, the model formulae should only use the ‘+’ and ‘-’ operators to indicate the variables to be included or not used, respectively.
See model.frame for details on how na.action is used.
Objects created by these interfaces always inherit from class Weka_classifier, and have at least suitable print, summary (via evaluate_Weka_classifier), and predict methods.
Available “standard” interface functions are documented in Weka_classifier_functions (regression and classification function learners), Weka_classifier_lazy (lazy learners), Weka_classifier_meta (meta learners), Weka_classifier_rules (rule learners), and Weka_classifier_trees (regression and classification tree learners).
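As a minimal sketch of the interface described above, assuming RWeka and a working Java installation are available, the following fits a J48 tree with a formula that only uses '+', then checks the class of the result and its predict method. The iris data and the two predictors are chosen purely for illustration.

```r
library(RWeka)

## Fit a J48 tree; the formula only uses '+' to select predictors.
m <- J48(Species ~ Petal.Length + Petal.Width, data = iris)

## Fitted models inherit from class "Weka_classifier".
inherits(m, "Weka_classifier")

## predict() without new data returns predictions for the training data.
p <- predict(m)
table(observed = iris$Species, predicted = p)
```

The same pattern should carry over to the other interface functions, since they share the formals listed above.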
R interfaces to Weka clustering algorithms.
Cobweb(x, control = NULL)
FarthestFirst(x, control = NULL)
SimpleKMeans(x, control = NULL)
XMeans(x, control = NULL)
DBScan(x, control = NULL)
x | an R object with the data to be clustered. |
control | an object of class Weka_control, or a character vector of control options, or NULL (default). |
There is a predict method for predicting class ids or memberships from the fitted clusterers.
Cobweb implements the Cobweb (Fisher, 1987) and Classit (Gennari et al., 1989) clustering algorithms.
FarthestFirst provides the “farthest first traversal algorithm” by Hochbaum and Shmoys, which works as a fast simple approximate clusterer modeled after simple k-means.
SimpleKMeans provides clustering with the k-means algorithm.
XMeans provides k-means extended by an “Improve-Structure part” and automatically determines the number of clusters.
DBScan provides the “density-based clustering algorithm” by Ester, Kriegel, Sander, and Xu. Note that noise points are assigned to NA.
A list inheriting from class Weka_clusterer with components including
clusterer |
a reference (of class
|
class_ids |
a vector of integers indicating the class to which
each training instance is allocated (the results of calling the Weka
|
XMeans requires Weka package XMeans to be installed.
DBScan requires Weka package optics_dbScan to be installed.
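The following sketch, assuming RWeka and Java are available, shows how the class_ids component and the predict method of a fitted clusterer might be inspected. It uses SimpleKMeans because that clusterer needs no additional Weka package. The object names cl, ids and mem are illustrative.

```r
library(RWeka)

## k-means with 3 clusters (-N 3) on the numeric iris columns.
cl <- SimpleKMeans(iris[, -5], Weka_control(N = 3))

## Cluster ids of the training instances, stored in the fitted object.
ids <- cl$class_ids
table(ids)

## With new data, predict() can return memberships instead of ids
## (one row per instance).
mem <- predict(cl, iris[, -5], type = "memberships")
dim(mem)
```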
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, OR, 226–231. AAAI Press.
D. H. Fisher (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2/2, 139–172. doi:10.1023/A:1022852608280.
J. Gennari, P. Langley, and D. H. Fisher (1989). Models of incremental concept formation. Artificial Intelligence, 40, 11–62.
D. S. Hochbaum and D. B. Shmoys (1985). A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2), 180–184. doi:10.1287/moor.10.2.180.
D. Pelleg and A. W. Moore (2000). X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In: Seventeenth International Conference on Machine Learning, 727–734. Morgan Kaufmann.
I. H. Witten and E. Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco.
cl1 <- SimpleKMeans(iris[, -5], Weka_control(N = 3))
cl1
table(predict(cl1), iris$Species)

## Not run: 
## Requires Weka package 'XMeans' to be installed.
## Use XMeans with a KDTree.
cl2 <- XMeans(iris[, -5],
              c("-L", 3, "-H", 7, "-use-kdtree",
                "-K", "weka.core.neighboursearch.KDTree -P"))
cl2
table(predict(cl2), iris$Species)

## End(Not run)
Set control options for Weka learners.
Weka_control(...)
... | named arguments of control options, see the details and examples. |
The available options for a Weka learner, foo() say, can be queried by WOW(foo) and then conveniently set by Weka_control(). See below for an example.
One can use lists for options taking multiple arguments, see the documentation for SMO for an example.
A list of class Weka_control which can be coerced to character for passing it to Weka.
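To illustrate the coercion mentioned above, here is a small sketch, assuming RWeka is installed, that builds a control object and converts it to the character form handed to Weka. The option names R and M are the J48 options used in the example that follows.

```r
library(RWeka)

## Control object: reduced error pruning (-R), at least 5 instances (-M 5).
ctrl <- Weka_control(R = TRUE, M = 5)
class(ctrl)

## Character form, in Weka's command-line flag style.
opts <- as.character(ctrl)
opts
```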
## Query J4.8 options:
WOW("J48")
## Learn J4.8 tree on iris data with default settings:
J48(Species ~ ., data = iris)
## Learn J4.8 tree with reduced error pruning (-R) and
## minimum number of instances set to 5 (-M 5):
J48(Species ~ ., data = iris, control = Weka_control(R = TRUE, M = 5))
R interfaces to Weka file loaders and savers.
C45Loader(file)
XRFFLoader(file)
C45Saver(x, file, control = NULL)
XRFFSaver(x, file, control = NULL)
file | a non-empty character string naming a file to read from or write to. |
x | the data to be written, preferably a matrix or data frame. If not, coercion to a data frame is attempted. |
control | an object of class Weka_control, or a character vector of control options, or NULL (default). |
C45Loader and C45Saver use the format employed by the C4.5 algorithm/software, where data is stored in two separate ‘.names’ and ‘.data’ files.
XRFFLoader and XRFFSaver handle XRFF (eXtensible attribute-Relation File Format, an XML-based extension of Weka's native Attribute-Relation File Format) files.
Invisibly NULL for the savers.
A data frame containing the data from the given file for the loaders.
R interfaces to Weka filters.
Normalize(formula, data, subset, na.action, control = NULL)
Discretize(formula, data, subset, na.action, control = NULL)
formula | a symbolic description of a model. Note that for unsupervised filters the response can be omitted. |
data | an optional data frame containing the variables in the model. |
subset | an optional vector specifying a subset of observations to be used in the fitting process. |
na.action | a function which indicates what should happen when the data contain NAs. |
control | an object of class Weka_control, or a character vector of control options, or NULL (default). |
Normalize implements an unsupervised filter that normalizes all instances of a dataset to have a given norm. Only numeric values are considered, and the class attribute is ignored.
Discretize implements a supervised instance filter that discretizes a range of numeric attributes in the dataset into nominal attributes. Discretization is by Fayyad & Irani's MDL method (the default).
Note that these methods ignore nominal attributes, i.e., variables of class factor.
A data frame.
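As a small sketch of what the returned data frame looks like, assuming RWeka and Java are available, the following discretizes the numeric iris predictors with Species as the class attribute. The choice of iris is illustrative only.

```r
library(RWeka)

## Supervised discretization: Species is the class attribute.
d <- Discretize(Species ~ ., data = iris)

## The result is a data frame in which the numeric predictors
## have been replaced by nominal attributes (factors).
str(d)
sapply(d, is.factor)
```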
U. M. Fayyad and K. B. Irani (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Thirteenth International Joint Conference on Artificial Intelligence, 1022–1027. Morgan Kaufmann.
I. H. Witten and E. Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco.
## Using a Weka data set ...
w <- read.arff(system.file("arff","weather.arff",
                           package = "RWeka"))

## Normalize (response irrelevant)
m1 <- Normalize(~., data = w)
m1

## Discretize
m2 <- Discretize(play ~., data = w)
m2
Create an R interface to an existing Weka learner, attribute evaluator or filter, or show the available interfaces.
make_Weka_associator(name, class = NULL, init = NULL, package = NULL)
make_Weka_attribute_evaluator(name, class = NULL, init = NULL, package = NULL)
make_Weka_classifier(name, class = NULL, handlers = list(),
                     init = NULL, package = NULL)
make_Weka_clusterer(name, class = NULL, init = NULL, package = NULL)
make_Weka_filter(name, class = NULL, init = NULL, package = NULL)
list_Weka_interfaces()
make_Weka_package_loader(p)
name | a character string giving the fully qualified name of a Weka learner/filter class in JNI notation. |
class | |
handlers | a named list of special handler functions, see Details. |
init | |
package | |
p | a character string naming a Weka package to be loaded via |
make_Weka_associator and make_Weka_clusterer create an R function providing an interface to a Weka association learner or a Weka clusterer, respectively. This interface function has formals x and control = NULL, representing the training instances and control options to be employed. Objects created by these interface functions always inherit from classes Weka_associator and Weka_clusterer, respectively, and have at least suitable print methods. Fitted clusterers also have a predict method.
make_Weka_classifier creates an interface function for a Weka classifier, with formals formula, data, subset, na.action, and control (default: none), where the first four have the “usual” meanings for statistical modeling functions in R, and the last again specifies the control options to be employed by the Weka learner. Objects created by these interfaces always inherit from class Weka_classifier, and have at least suitable print and predict methods.
make_Weka_filter creates an interface function for a Weka filter, with formals formula, data, subset, na.action, and control = NULL, where the first four have the “usual” meanings for statistical modeling functions in R, and the last again specifies the control options to be employed by the Weka filter. Note that the response variable can be omitted from formula if the filter is “unsupervised”. Objects created by these interface functions are (currently) always of class data.frame.
make_Weka_attribute_evaluator creates an interface function for a Weka attribute evaluation class which implements the AttributeEvaluator interface, with formals as for the classifier interface functions.
Certain aspects of the interface function can be customized by providing handlers. Currently, only control handlers (functions given as the control component of the list of handlers) are used for processing the given control arguments before passing them to the Weka classifier. This is used, e.g., by the meta learners to allow the specification of registered base learners by their “base names” (rather than their full Weka/Java class names).
In addition to creating interface functions, the interfaces are registered (under the name of the Weka class interfaced), which in particular allows the Weka Option Wizard (WOW) to conveniently give on-line information about available control options for the interfaces.
list_Weka_interfaces lists the available interfaces.
Finally, make_Weka_package_loader generates init hooks for loading required and already installed Weka packages.
It is straightforward to register new interfaces in addition to the ones package RWeka provides by default.
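As a sketch of registering an additional interface, the following creates an R function for Weka's Standardize filter (class weka.filters.unsupervised.attribute.Standardize, given in JNI notation) and applies it to the numeric iris columns. It assumes RWeka and Java are available. The R-side name Standardize and the object z are chosen here for illustration.

```r
library(RWeka)

## Interface to a Weka filter, with the class name in JNI notation
## (slashes instead of dots).
Standardize <-
    make_Weka_filter("weka/filters/unsupervised/attribute/Standardize")

## Unsupervised filter, so the response is omitted from the formula.
z <- Standardize(~ ., data = iris[, -5])
round(colMeans(z), 6)

## Inspect the registry of available interfaces.
str(list_Weka_interfaces(), max.level = 1)
```

Because the interface is registered on creation, WOW(Standardize) should then report the control options of the filter.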
K. Hornik, C. Buchta, and A. Zeileis (2009). Open-source machine learning: R meets Weka. Computational Statistics, 24/2, 225–232. doi:10.1007/s00180-008-0119-7.
## Create an interface to Weka's Naive Bayes classifier.
NB <- make_Weka_classifier("weka/classifiers/bayes/NaiveBayes")
## Note that this has a very useful print method:
NB
## And we can use the Weka Option Wizard for finding out more:
WOW(NB)
## And actually use the interface ...
if(require("e1071", quietly = TRUE) &&
   require("mlbench", quietly = TRUE)) {
    data("HouseVotes84", package = "mlbench")
    model <- NB(Class ~ ., data = HouseVotes84)
    predict(model, HouseVotes84[1:10, -1])
    predict(model, HouseVotes84[1:10, -1], type = "prob")
}
## (Compare this to David Meyer's naiveBayes() in package 'e1071'.)
R interfaces to Weka stemmers.
IteratedLovinsStemmer(x, control = NULL)
LovinsStemmer(x, control = NULL)
x | a character vector with words to be stemmed. |
control | an object of class Weka_control, or a character vector of control options, or NULL (default). |
A character vector with the stemmed words.
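A minimal usage sketch, assuming RWeka and Java are available, is given below. The input words are arbitrary examples. Each stemmer returns one stem per input word.

```r
library(RWeka)

words <- c("running", "connections", "organization")

## One pass of the Lovins stemmer.
s1 <- LovinsStemmer(words)
s1

## Iterated variant of the same stemmer.
s2 <- IteratedLovinsStemmer(words)
s2
```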
J. B. Lovins (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 22–31.
R interfaces to Weka tokenizers.
AlphabeticTokenizer(x, control = NULL)
NGramTokenizer(x, control = NULL)
WordTokenizer(x, control = NULL)
x | a character vector with strings to be tokenized. |
control | an object of class Weka_control, or a character vector of control options, or NULL (default). |
AlphabeticTokenizer is an alphabetic string tokenizer, where tokens are to be formed only from contiguous alphabetic sequences.
NGramTokenizer splits strings into n-grams with given minimal and maximal numbers of grams.
WordTokenizer is a simple word tokenizer.
A character vector with the tokenized strings.
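The sketch below, assuming RWeka and Java are available, applies the three tokenizers to short example strings. The min and max control options restrict NGramTokenizer to bigrams. The input strings are arbitrary.

```r
library(RWeka)

x <- "the quick brown fox"

## Simple word tokens.
words <- WordTokenizer(x)
words

## Bigrams only: minimal and maximal n-gram size both set to 2.
bigrams <- NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigrams

## Only contiguous alphabetic sequences form tokens.
AlphabeticTokenizer("R2D2 meets C3PO")
```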
Give on-line information about available control options for Weka learners or filters and their R interfaces.
WOW(x)
x | a character string giving either the fully qualified name of a Weka learner or filter class in JNI notation, or the name of an available R interface, or an object obtained from applying these interfaces to build an associator, classifier, clusterer, or filter. |
See list_Weka_interfaces for the available interface functions.
K. Hornik, C. Buchta, and A. Zeileis (2009). Open-source machine learning: R meets Weka. Computational Statistics, 24/2, 225–232. doi:10.1007/s00180-008-0119-7.
## The name of an "existing" (registered) interface.
WOW("J48")
## The name of some Weka class (not necessarily in the interface
## registry):
WOW("weka/classifiers/bayes/NaiveBayes")
Manage Weka packages.
WPM(cmd, ...)
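cmd | a character string specifying the action to be performed. Must be one of "refresh-cache", "list-packages", "package-info", "install-package", "remove-package", "toggle-load-status", or "load-packages". |
... | character strings giving further arguments required for the action to be performed. See Details. |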
Available actions and respective additional arguments are as follows.
"refresh-cache"
Refresh the cached copy of the package meta data from the central package repository.
"list-packages"
print information (version numbers and short descriptions) about packages as specified by an additional keyword which must be one of "all" (all packages the system knows about), "installed" (all packages installed locally), or "available" (all known packages not installed locally), or a unique abbreviation thereof.
"package-info"
print information (metadata) about a package. Requires two additional character string arguments: a keyword and the package name. The keyword must be one of "repository" (print info from the repository) or "installed" (print info on the installed version), or a unique abbreviation thereof.
"install-package"
install a package as specified by an additional character string giving its name. (In principle, one could also provide a file path or URL to a zip file.)
"remove-package"
remove a given (installed) package.
"toggle-load-status"
toggle the load status of the given (installed) packages.
"load-packages"
load all installed packages with active load status.
Weka stores packages and their information in the Weka home directory, as given by the value of the environment variable WEKA_HOME; if this is not set, the ‘wekafiles’ subdirectory of the user's home directory is used. If this Weka home directory was not created yet, WPM() will instead use a temporary directory in the R session directory: to achieve persistence, users need to create the Weka home directory before using WPM().
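One way to create the Weka home directory ahead of time is sketched below. The helper ensure_weka_home() is hypothetical and not part of RWeka. It assumes that the user's home directory is what path.expand("~") returns, which may differ from the location Weka itself resolves on some platforms.

```r
## Hypothetical helper (not part of RWeka): make sure the Weka home
## directory exists so that packages installed via WPM() persist.
ensure_weka_home <- function() {
    home <- Sys.getenv("WEKA_HOME", unset = NA)
    ## Fall back to the 'wekafiles' subdirectory of the home directory.
    if (is.na(home) || !nzchar(home))
        home <- file.path(path.expand("~"), "wekafiles")
    if (!dir.exists(home))
        dir.create(home, recursive = TRUE)
    invisible(home)
}

## Possible use at the start of a session, before any call to WPM():
## ensure_weka_home()
## library(RWeka)
## WPM("refresh-cache")
```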
## Not run: 
## Start by building/refreshing the cache.
WPM("refresh-cache")
## Show the packages installed locally.
WPM("list-packages", "installed")
## Show the packages available from the central Weka package
## repository and not installed locally.
WPM("list-packages", "available")
## Show repository information about package XMeans.
WPM("package-info", "repository", "XMeans")

## End(Not run)
Writes data into Weka Attribute-Relation File Format (ARFF) files.
write.arff(x, file, eol = "\n")
x | the data to be written, preferably a matrix or data frame. If not, coercion to a data frame is attempted. |
file | either a character string naming a file, or a connection. |
eol | the character(s) to print at the end of each line (row). |
Attribute-Relation File Format https://waikato.github.io/weka-wiki/formats_and_processing/arff/
write.arff(iris, file = "")
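As a further sketch, assuming RWeka and Java are available, the data can be written to a temporary file and read back with read.arff to check the round trip. The file name and the object d are illustrative.

```r
library(RWeka)

## Write iris to a temporary ARFF file.
f <- tempfile(fileext = ".arff")
write.arff(iris, file = f)

## Peek at the header, then read the file back into a data frame.
readLines(f, n = 3)
d <- read.arff(f)
str(d)

unlink(f)
```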