| Title: | Harvest the Classification Tree |
|---|---|
| Description: | Aimed at applying the Harvest classification tree algorithm, modified algorithm of classic classification tree.The harvested tree has advantage of deleting redundant rules in trees, leading to a simplify and more efficient tree model.It was firstly used in drug discovery field, but it also performs well in other kinds of data, especially when the region of a class is disconnected. This package also improves the basic harvest classification tree algorithm by extending the field of data of algorithm to both continuous and categorical variables. To learn more about the harvest classification tree algorithm, you can go to http://www.stat.ubc.ca/Research/TechReports/techreports/220.pdf for more information. |
| Authors: | Bingyuan Liu/Yan Yuan/Qian Shi |
| Maintainer: | Bingyuan Liu <[email protected]> |
| License: | GPL-2 |
| Version: | 1.1 |
| Built: | 2026-05-26 10:38:42 UTC |
| Source: | https://github.com/cran/Harvest.Tree |
This function takes in a ruleset and output the lower and upper bounds of each rule.
extrule(myrules, varname)extrule(myrules, varname)
myrules |
A 3 column matrix output of function "hughs.path.rpart" |
varname |
the names of x variables |
A p*2 matrix, p is the length of varname. The first column is the lower bound, the second column is the upper bound. The default lower bound is "-Inf",the default upper bound is "Inf". row corresponse to x variables ordered in the data matrix given to rpart.
Basic function to apply the harvest algorithm to the training data set, computing whether we can harvest any nodes based on the classic classification tree algorithm.
harfunc(rpart.object, data, varname, sig = 0.95)harfunc(rpart.object, data, varname, sig = 0.95)
rpart.object |
classification result of training data from traditional classification tree(rpart function). |
data |
original training data where 'y' stores classmembership |
varname |
the name of each explaanatory variables |
sig |
significance level (default 0.95) |
the list of orginial result of classification, likelihood improvment and harvested classification result.
The main function of the package, aiming at develop the harvest classification tree. Training data input and
harvest(training, num.var, numeric.info, sig = 0.95)harvest(training, num.var, numeric.info, sig = 0.95)
training |
original data where 'y' stores classmembership 0 and 1,in the first column, with explanatory variable stores in the second to the last column. |
num.var |
number of explanatory variables |
numeric.info |
the vector stores the number of which variable is continuous |
sig |
significance level (default 0.95) |
The function will return the harvested tree model. Missing values are allowed, and they will be treated accordingly. To use the trained tree model to predict, you can use predict function in this package.
An object of class "harvest", which is the result of algorithm with the following elements for each nodes(nodes are ordered in sequence of harvesting):
rule constriants of the node
total total number of data points in the node
'1' the number of data points belonging to class 1 in the node
'logchange' the improvement of log likelihood of deleting the redundent rules by the algorithm for the node
data(training) harvest(training,4,3)data(training) harvest(training,4,3)
Aimed at applying the Harvest classification tree algorithm, modified algorithm of classic classification tree.The harvested tree has advantage of deleting redundant rules in trees, leading to a simplify and more efficient tree model.It was firstly used in drug discovery field, but it also performs well in other kinds of data, especially when the region of a class is disconnected. This package also improves the basic harvest classification tree algorithm by extending the field of data of algorithm to both continuous and categorical variables.
To learn more about the harvest classification tree algorithm, you can go to http://www.stat.ubc.ca/Research/TechReports/techreports/220.pdf for more information.
| Package: | Harvest.Tree |
| Type: | Package |
| Version: | 1.1 |
| Date: | 2015-07-30 |
| License: | GPL-2 |
The main function of package called 'harvest', it can be used to analyze the data which is stored in a data frame, where first column stores the class of response data, and the second to last column stores explantory variables accordingly.The 'predict' funciton offers function to predict the unclassified data based on training model. The 'harfunc' function is the fundemental part of 'harvest', which can be used to analyze the data which has already been classified by rpart function(traditional classification tree). Please check the help file of these three functions for more information.
Bingyuan Liu \ Yan Yuan \ Qian Shi
Maintainer: Bingyuan Liu<[email protected]>
The function predict computes the prediction of membership from a new data set classified by harvested classification model of training data.
predict(harfunc.object, data, num.var)predict(harfunc.object, data, num.var)
harfunc.object |
the output of harfunc function. |
data |
test data |
num.var |
number of explaining varibles |
To run the predict function, a trained harvested classification tree formed by harvest function is required.
pred.mat is a data frame stored the information of result of prediction with the following columns:
belong the node that data point belongs to
possibility the probability of point being in class 1
predict the simple perdict based on whether probability is larger than 0.5.
Rank harvested node by lower p value
rank.nodes(harfunc.object)rank.nodes(harfunc.object)
harfunc.object |
an object of class "harfunc" |
the ranked harvest nodes
Return a logical matrix of the rule sets which define a terminal node
rulesets(noden, newsim, varn, nodenumb)rulesets(noden, newsim, varn, nodenumb)
noden |
a terminal node defined by a set of rules, from function "treemat" |
newsim |
data to be harvested |
varn |
x variable names |
nodenumb |
all the labels of terminal nodes |
A nxnn logical matrix, n=number of data points to be harvested, nn=number of rules defining a terminal node. Each column of the matrix corresponding to a node that is defined by one variable/rule, its name corresponds to that variable. Note the original terminal node is just the intersection of these nodes.
A simulated data set of symptoms of breast cancer patients
data(training)data(training)
A data frame with 300 observations on the following 5 variables.
ya numeric vector
x1a factor with levels 21 22 23 24 25 26 27 28 29
x2a factor with levels 39- 40-49 50-69 70-74 75+
x3a numeric vector
x4a factor with levels 2004 2005 2006 2007 2008 2009 2010
simulated data for breast cancer diagnosis