Title: | Optimal Trees Ensembles for Regression, Classification and Class Membership Probability Estimation |
---|---|
Description: | Functions for creating ensembles of optimal trees for regression, classification (Khan, Z., Gul, A., Perperoglou, A., Miftahuddin, M., Mahmoud, O., Adler, W., & Lausen, B. (2019) <doi:10.1007/s11634-019-00364-9>) and class membership probability estimation (Khan, Z., Gul, A., Mahmoud, O., Miftahuddin, M., Perperoglou, A., Adler, W., & Lausen, B. (2016) <doi:10.1007/978-3-319-25226-1_34>) are given. A few trees are selected for the ensemble from an initial set of trees grown by random forest, on the basis of their individual and collective performance. Three different methods of tree selection are given for classification. The prediction functions return estimates of the test responses and their class membership probabilities. Unexplained variations, error rates, confusion matrix, Brier scores, etc. are also returned for the test data. |
Authors: | Zardad Khan, Asma Gul, Aris Perperoglou, Osama Mahmoud, Werner Adler, Miftahuddin and Berthold Lausen |
Maintainer: | Zardad Khan <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.1 |
Built: | 2024-12-10 06:54:43 UTC |
Source: | CRAN |
Functions for creating ensembles of optimal trees for regression, classification and class membership probability estimation are given. A few trees are selected from an initial set of trees grown by random forest for the ensemble on the basis of their individual and collective performance. The prediction functions return estimates of the test responses/class labels and their class membership probabilities. Unexplained variations, error rates, confusion matrix, Brier scores, etc. for the test data are also returned. Three different methods for tree selection are given for the case of classification.
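The idea of selecting trees on individual and then collective performance can be illustrated with a minimal base R sketch on simulated per-tree predictions. It is not the package's implementation: the ranking by misclassification error, the Brier-score acceptance rule, and all object names (tree_prob, tree_err, top, selected) are assumptions made for illustration only.

```r
# Hypothetical two-stage tree selection on simulated data (not OTE code).
set.seed(1)
n_val   <- 200   # number of validation observations
n_trees <- 50    # size of the initial set (plays the role of t.initial)
p       <- 0.2   # share of best-ranked trees passed to the second stage

y <- rbinom(n_val, 1, 0.5)   # true 0/1 class labels

# Simulate class-1 probabilities from trees of decreasing quality.
noise_sd  <- seq(0.1, 0.6, length.out = n_trees)
tree_prob <- sapply(noise_sd, function(s) {
  pmin(pmax(y + rnorm(n_val, 0, s), 0), 1)
})

# Stage 1 (individual performance): rank trees by their own error rate.
tree_err <- colMeans((tree_prob > 0.5) != (y == 1))
top      <- order(tree_err)[seq_len(ceiling(p * n_trees))]

# Stage 2 (collective performance): add ranked trees one at a time and
# keep a tree only if it lowers the Brier score of the averaged ensemble.
brier    <- function(prob, y) mean((prob - y)^2)
selected <- top[1]
best     <- brier(tree_prob[, selected], y)
for (k in top[-1]) {
  candidate <- c(selected, k)
  score <- brier(rowMeans(tree_prob[, candidate, drop = FALSE]), y)
  if (score < best) {
    selected <- candidate
    best     <- score
  }
}

length(selected)   # number of trees retained in the final ensemble
best               # Brier score of the selected ensemble
```

In this sketch the final ensemble can never score worse than the single best-ranked tree, because a tree is only added when it improves the ensemble score.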
Package: | OTE |
Type: | Package |
Version: | 1.0.1 |
Date: | 2020-04-18 |
License: | GPL-3 |
Zardad Khan, Asma Gul, Aris Perperoglou, Osama Mahmoud, Werner Adler, Miftahuddin and Berthold Lausen
Maintainer: Zardad Khan <[email protected]>
Khan, Z., Gul, A., Perperoglou, A., Miftahuddin, M., Mahmoud, O., Adler, W., & Lausen, B. (2019). Ensemble of optimal trees, random forest and random projection ensemble classification. Advances in Data Analysis and Classification, 1-20.
The Body data set consists of 507 observations on 24 predictor variables: age, weight, height and 21 body dimensions. The observations are on 507 individuals, 247 men and 260 women, most of them in their twenties and thirties, with a small number of older people. The class variable is gender, with two categories, male and female.
data(Body)
A data frame with 507 observations recorded on the following 25 variables.
Biacrom
Biacromial diameter in centimeters.
Biiliac
"Pelvic breadth" measured in centimeters.
Bitro
Bitrochanteric diameter measured in centimeters.
ChestDp
Chest depth in centimeters, between sternum and spine at nipple level.
ChestD
Chest diameter in centimeters at nipple level.
ElbowD
Sum of the two elbow diameters in centimeters.
WristD
Sum of the two wrist diameters in centimeters.
KneeD
Sum of the two knee diameters in centimeters.
AnkleD
Sum of the two ankle diameters in centimeters.
ShoulderG
The wideness of the shoulders in centimeters.
ChestG
Chest circumference in centimeters, taken at the nipple line for males and just above breast tissue for females.
WaistG
Waist circumference in centimeters at the narrowest part, taken as the average of contracted and relaxed positions.
AbdG
Abdominal girth in centimeters at umbilicus and iliac crest, where the iliac crest is taken as a landmark.
HipG
Hip girth in centimeters at the level of the bitrochanteric diameter.
ThighG
Average of left and right thigh girths in centimeters, below the gluteal fold.
BicepG
Average of left and right bicep girths in centimeters.
ForearmG
Average of left and right forearm girths, extended, palm up.
KneeG
Average of left and right knee girths over the patella, in a slightly flexed position.
CalfG
Average of right and left calf maximum girths.
AnkleG
Average of right and left ankle minimum girths.
WristG
Average of left and right minimum wrist circumferences.
Age
Age in years.
Weight
Weight in kilograms.
Height
Height in centimeters.
Gender
Binary response with two categories: 1 = male, 0 = female.
Heinz, G., Peterson, L.J., Johnson, R.W. and Kerk, C.J. (2003), “Exploring Relationships in Body Dimensions”, Journal of Statistics Education, 11.
Hurley, C. (2012), “gclus: Clustering Graphics”, R package version 1.3.1, https://CRAN.R-project.org/package=gclus.
data(Body)
str(Body)
This data set records the radial velocity of a spiral galaxy measured at 323 points in the area of the sky it covers. The positions of the measurements, which lie within seven slots crossing at the origin, are described by 4 variables.
data(Galaxy)
A data frame with 324 observations recorded on the following 5 variables.
east.west
The east-west coordinate: east is taken as negative, west as positive, and the origin, (0,0), is close to the center of the galaxy.
north.south
The north-south coordinate: south is taken as negative, north as positive, and the origin, (0,0), is near the center of the galaxy.
angle
The angle in degrees of anti rotation (clockwise) from the horizon of the slot on which the observation lies.
radial.position
The signed distance from the center, (0,0); the sign is negative if the east-west coordinate is negative.
velocity
The response variable: the radial velocity (km/sec) of the galaxy.
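The sign convention described for radial.position can be sketched as follows. The coordinates are made up, and the use of the Euclidean distance from (0,0) is an assumption for illustration; the data set documentation only states that the distance is signed by the east-west coordinate.

```r
# Hypothetical coordinates, not taken from the Galaxy data.
east_west   <- c(-3, 4, 0)
north_south <- c(4, -3, 2)

# Assumed: Euclidean distance from the origin (0,0).
dist_from_center <- sqrt(east_west^2 + north_south^2)

# Negative when the east-west coordinate is negative, positive otherwise.
radial_position <- ifelse(east_west < 0, -dist_from_center, dist_from_center)
radial_position
```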
Buta, R. (1987), “The Structure and Dynamics of Ringed Galaxies, III: Surface Photometry and Kinematics of the Ringed Nonbarred Spiral NGC7531” The Astrophysical J. Supplement Ser. 64. 1–37.
data(Galaxy)
str(Galaxy)
This function selects optimal trees for classification from a total of t.initial trees grown by random forest. The number of trees in the initial set, t.initial, is specified by the user. If not specified, the default t.initial = 1000 is used.
OTClass(XTraining, YTraining, method = c("oob+independent", "oob", "sub-sampling"),
        p = 0.1, t.initial = NULL, nf = NULL, ns = NULL, info = TRUE)
XTraining |
An |
YTraining |
A vector of length |
method |
Method used in the selection of optimal trees. |
p |
Percent of the best |
t.initial |
Size of the initial set of classification trees. |
nf |
Number of features to be sampled for splitting the nodes of the trees. If equal to |
ns |
Node size: Minimal number of samples in the nodes. If equal to |
info |
If |
Large values of t.initial are recommended for better performance, as far as the available computational resources allow.
A trained object consisting of the selected trees.
Missing values must be dealt with beforehand, as the function cannot handle them in the current version.
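Since missing values have to be dealt with before training, two common options are sketched below in base R. The toy data frame and the object names are hypothetical, and neither option is prescribed by the package; which one is appropriate depends on the data.

```r
# Toy data with missing values (hypothetical stand-in for XTraining/YTraining).
X <- data.frame(a = c(1.2, NA, 3.1, 4.0), b = c(10, 11, NA, 13))
Y <- c(0, 1, 1, 0)

# Option 1: drop incomplete rows from predictors and response together.
keep       <- complete.cases(X) & !is.na(Y)
X_complete <- X[keep, , drop = FALSE]
Y_complete <- Y[keep]

# Option 2: replace missing entries in each numeric column by the column median.
X_imputed <- as.data.frame(lapply(X, function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
}))

anyNA(X_complete)   # FALSE
anyNA(X_imputed)    # FALSE
```

Either X_complete with Y_complete, or X_imputed with Y, could then be passed on as training data.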
Zardad Khan <[email protected]>
Khan, Z., Gul, A., Perperoglou, A., Miftahuddin, M., Mahmoud, O., Adler, W., & Lausen, B. (2019). Ensemble of optimal trees, random forest and random projection ensemble classification. Advances in Data Analysis and Classification, 1-20.
Liaw, A. and Wiener, M. (2002) “Classification and regression by random forest” R news. 2(3). 18–22.
Predict.OTClass, OTReg, OTProb
# Load the data
data(Body)
data <- Body

# Divide the data into training and test parts
set.seed(9123)
n <- nrow(data)
training <- sample(1:n, round(2 * n / 3))
testing <- (1:n)[-training]
X <- data[, 1:24]
Y <- data[, 25]

# Train OTClass on the training data
Opt.Trees <- OTClass(XTraining = X[training, ], YTraining = Y[training],
                     t.initial = 200, method = "oob+independent")

# Predict on test data
Prediction <- Predict.OTClass(Opt.Trees, X[testing, ], YTesting = Y[testing])

# Objects returned
names(Prediction)
Prediction$Confusion.Matrix
Prediction$Predicted.Class.Labels
This function selects optimal trees for class membership probability estimation from a total of t.initial trees grown by random forest. The number of trees in the initial set, t.initial, is specified by the user. If not specified, the default t.initial = 1000 is used.
OTProb(XTraining, YTraining, p = 0.2, t.initial = NULL, nf = NULL, ns = NULL, info = TRUE)
XTraining |
An |
YTraining |
A vector of length |
p |
Percent of the best |
t.initial |
Size of the initial set of probability estimation trees. |
nf |
Number of features to be sampled for splitting the nodes of the trees. If equal to |
ns |
Node size: Minimal number of samples in the nodes. If equal to |
info |
If |
Large values of t.initial are recommended for better performance, as far as the available computational resources allow.
A trained object consisting of the selected trees.
Missing values must be dealt with beforehand, as the function cannot handle them in the current version.
Zardad Khan <[email protected]>
Khan, Z., Gul, A., Perperoglou, A., Miftahuddin, M., Mahmoud, O., Adler, W., & Lausen, B. (2019). Ensemble of optimal trees, random forest and random projection ensemble classification. Advances in Data Analysis and Classification, 1-20.
Liaw, A. and Wiener, M. (2002) “Classification and regression by random forest” R news. 2(3). 18–22.
Predict.OTProb, OTReg, OTClass
# Load the data
data(Body)
data <- Body

# Divide the data into training and test parts
set.seed(9123)
n <- nrow(data)
training <- sample(1:n, round(2 * n / 3))
testing <- (1:n)[-training]
X <- data[, 1:24]
Y <- data[, 25]

# Train OTProb on the training data
Opt.Trees <- OTProb(XTraining = X[training, ], YTraining = Y[training], t.initial = 200)

# Predict on test data
Prediction <- Predict.OTProb(Opt.Trees, X[testing, ], YTesting = Y[testing])

# Objects returned
names(Prediction)
Prediction$Brier.Score
Prediction$Estimated.Probabilities
This function selects optimal trees for regression from a total of t.initial trees grown by random forest. The number of trees in the initial set, t.initial, is specified by the user. If not specified, the default t.initial = 1000 is used.
OTReg(XTraining, YTraining, p = 0.2, t.initial = NULL, nf = NULL, ns = NULL, info = TRUE)
XTraining |
An |
YTraining |
A vector of length |
p |
Percent of the best |
t.initial |
Size of the initial set of regression trees. |
nf |
Number of features to be sampled for splitting the nodes of the trees. If equal to |
ns |
Node size: Minimal number of samples in the nodes. If equal to |
info |
If |
Large values of t.initial are recommended for better performance, as far as the available computational resources allow.
A trained object consisting of the selected trees for regression.
Missing values must be dealt with beforehand, as the function cannot handle them in the current version.
Zardad Khan <[email protected]>
Khan, Z., Gul, A., Perperoglou, A., Miftahuddin, M., Mahmoud, O., Adler, W., & Lausen, B. (2019). Ensemble of optimal trees, random forest and random projection ensemble classification. Advances in Data Analysis and Classification, 1-20.
Liaw, A. and Wiener, M. (2002) “Classification and regression by random forest” R news. 2(3). 18–22.
Predict.OTReg, OTProb, OTClass
# Load the data
data(Galaxy)
data <- Galaxy

# Divide the data into training and test parts
set.seed(9123)
n <- nrow(data)
training <- sample(1:n, round(2 * n / 3))
testing <- (1:n)[-training]
X <- data[, 1:4]
Y <- data[, 5]

# Train OTReg on the training data
Opt.Trees <- OTReg(XTraining = X[training, ], YTraining = Y[training], t.initial = 200)

# Predict on test data
Prediction <- Predict.OTReg(Opt.Trees, X[testing, ], YTesting = Y[testing])

# Objects returned
names(Prediction)
Prediction$Unexp.Variations
Prediction$Pr.Values
Prediction$Trees.Used
OTClass
This function provides prediction for test data on the trained OTClass object for classification.
Predict.OTClass(Opt.Trees, XTesting, YTesting)
Opt.Trees |
An object of class |
XTesting |
An |
YTesting |
Optional. A vector of length |
A list with values
Error.Rate |
Error rate of the classifier for the observations in XTesting. |
Confusion.Matrix |
Confusion matrix based on the estimated class labels and the true class labels. |
Estimated.Class |
A vector of length |
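How an error rate and a confusion matrix relate to predicted and true labels can be seen in a small base R sketch. The label vectors are made up, and the orientation of the table (predictions in rows, truth in columns) is an assumption; the layout actually returned in Confusion.Matrix may differ.

```r
# Hypothetical true and predicted class labels (0 = female, 1 = male).
truth <- factor(c(1, 0, 1, 1, 0, 0, 1, 0), levels = c(0, 1))
pred  <- factor(c(1, 0, 0, 1, 0, 1, 1, 0), levels = c(0, 1))

# Cross-tabulate predictions against the truth.
conf_mat <- table(Predicted = pred, True = truth)

# Error rate: share of observations whose predicted label is wrong.
err_rate <- mean(pred != truth)

conf_mat
err_rate
```

The same error rate can be read off the table as one minus the sum of the diagonal divided by the total count.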
Zardad Khan <[email protected]>
Khan, Z., Gul, A., Perperoglou, A., Miftahuddin, M., Mahmoud, O., Adler, W., & Lausen, B. (2019). Ensemble of optimal trees, random forest and random projection ensemble classification. Advances in Data Analysis and Classification, 1-20.
Liaw, A. and Wiener, M. (2002) “Classification and regression by random forest” R news. 2(3). 18–22.
# Load the data
data(Body)
data <- Body

# Divide the data into training and test parts
set.seed(9123)
n <- nrow(data)
training <- sample(1:n, round(2 * n / 3))
testing <- (1:n)[-training]
X <- data[, 1:24]
Y <- data[, 25]

# Train OTClass on the training data
Opt.Trees <- OTClass(XTraining = X[training, ], YTraining = Y[training],
                     t.initial = 200, method = "oob+independent")

# Predict on test data
Prediction <- Predict.OTClass(Opt.Trees, X[testing, ], YTesting = Y[testing])

# Objects returned
names(Prediction)
Prediction$Confusion.Matrix
Prediction$Predicted.Class.Labels
OTProb
This function provides prediction for test data on the trained OTProb object for class membership probability estimation.
Predict.OTProb(Opt.Trees, XTesting, YTesting)
Opt.Trees |
An object of class |
XTesting |
An |
YTesting |
Optional. A vector of length |
A list with values
Brier.Score |
Brier Score based on the estimated probabilities and true class label in YTesting. |
Estimated.Probabilities |
A vector of length |
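For a binary response coded 0/1, the Brier score is commonly computed as the mean squared difference between the estimated class-1 probabilities and the true labels. The sketch below uses made-up values and this common definition; whether Predict.OTProb applies exactly the same scaling is an assumption.

```r
# Hypothetical true labels and estimated probabilities of class 1.
y     <- c(1, 0, 1, 0)
p_hat <- c(0.9, 0.2, 0.6, 0.4)

# Brier score: mean squared error of the probability estimates.
brier_score <- mean((p_hat - y)^2)
brier_score
```

Lower values indicate better probability estimates; a score of 0 would mean every probability matches its label exactly.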
Zardad Khan <[email protected]>
Khan, Z., Gul, A., Perperoglou, A., Miftahuddin, M., Mahmoud, O., Adler, W., & Lausen, B. (2019). Ensemble of optimal trees, random forest and random projection ensemble classification. Advances in Data Analysis and Classification, 1-20.
Liaw, A. and Wiener, M. (2002) “Classification and regression by random forest” R news. 2(3). 18–22.
# Load the data
data(Body)
data <- Body

# Divide the data into training and test parts
set.seed(9123)
n <- nrow(data)
training <- sample(1:n, round(2 * n / 3))
testing <- (1:n)[-training]
X <- data[, 1:24]
Y <- data[, 25]

# Train OTProb on the training data
Opt.Trees <- OTProb(XTraining = X[training, ], YTraining = Y[training], t.initial = 200)

# Predict on test data
Prediction <- Predict.OTProb(Opt.Trees, X[testing, ], YTesting = Y[testing])

# Objects returned
names(Prediction)
Prediction$Brier.Score
Prediction$Estimated.Probabilities
OTReg
This function provides prediction for test data on the trained OTReg object for the continuous response variable.
Predict.OTReg(Opt.Trees, XTesting, YTesting)
Opt.Trees |
An object of class |
XTesting |
An |
YTesting |
Optional. A vector of length |
A list with values
Unexp.Variations |
Unexplained variations based on estimated response and given response. |
Pr.Values |
A vector of length |
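One common way to quantify unexplained variation is the ratio of the residual sum of squares to the total sum of squares of the response. The sketch below uses made-up values and assumes this definition; the exact formula behind Unexp.Variations is not stated here and may differ.

```r
# Hypothetical observed and predicted responses.
y     <- c(2, 4, 6, 8)
y_hat <- c(2.5, 3.5, 6.5, 7.5)

# Assumed definition: residual sum of squares over total sum of squares.
sse <- sum((y - y_hat)^2)
sst <- sum((y - mean(y))^2)
unexplained <- sse / sst
unexplained
```

Under this definition, a value near 0 means the predictions account for almost all variation in the response, and a value near 1 means they do little better than predicting the mean.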
Zardad Khan <[email protected]>
Khan, Z., Gul, A., Perperoglou, A., Miftahuddin, M., Mahmoud, O., Adler, W., & Lausen, B. (2019). Ensemble of optimal trees, random forest and random projection ensemble classification. Advances in Data Analysis and Classification, 1-20.
Liaw, A. and Wiener, M. (2002) “Classification and regression by random forest” R news. 2(3). 18–22.
# Load the data
data(Galaxy)
data <- Galaxy

# Divide the data into training and test parts
set.seed(9123)
n <- nrow(data)
training <- sample(1:n, round(2 * n / 3))
testing <- (1:n)[-training]
X <- data[, 1:4]
Y <- data[, 5]

# Train OTReg on the training data
Opt.Trees <- OTReg(XTraining = X[training, ], YTraining = Y[training], t.initial = 200)

# Predict on test data
Prediction <- Predict.OTReg(Opt.Trees, X[testing, ], YTesting = Y[testing])

# Objects returned
names(Prediction)
Prediction$Unexp.Variations
Prediction$Pr.Values
Prediction$Trees.Used