The superml R package is designed to unify the model training process in R, much like Python's scikit-learn. Generally, people spend a lot of time searching for packages and figuring out the syntax for training machine learning models in R; this behaviour is especially apparent in users who frequently switch between R and Python. This package provides a scikit-learn-style interface (fit, predict) to train models faster.
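As a quick preview, here is a minimal sketch of that workflow on simulated data (the trainer and methods below are the same ones used throughout this tutorial):
# a minimal sketch of the fit/predict workflow on illustrative data
library(superml)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$target <- 2 * df$x1 - df$x2 + rnorm(100)
lf <- LMTrainer$new(family = "gaussian")  # linear model trainer
lf$fit(X = df, y = "target")              # y is the name of the target column
preds <- lf$predict(df = df)              # scikit-learn-style predict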
In addition to building machine learning models, the package provides handy functionality for feature engineering, such as label encoding and target encoding (both demonstrated later in this tutorial).
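For instance, here is a minimal sketch of LabelEncoder, whose fit/transform methods are used during data preparation below:
# a minimal sketch: integer-encode a character vector with LabelEncoder
library(superml)
lbl <- LabelEncoder$new()
lbl$fit(c("red", "green", "blue", "green"))  # learn the label mapping
codes <- lbl$transform(c("red", "blue"))     # returns the learned integer codes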
This ambitious package is my ongoing effort to help the R community build ML models easily and quickly.
You can install the latest CRAN version (recommended) using:
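install.packages("superml")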
You can install the development version directly from GitHub using:
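# requires devtools (or remotes); the GitHub path below is assumed,
# so verify it against the URL listed on the package's CRAN page
devtools::install_github("saraswatmk/superml")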
For machine learning, superml builds on existing R packages, so installing it does not pull in every dependency. Instead, when you train a model, superml automatically installs the required package if it is not found. Still, if you want to install all dependencies at once, you can simply do:
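install.packages("superml", dependencies = TRUE)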
This package uses existing R packages to build machine learning models. In this tutorial, we'll use the data.table package for all data-manipulation tasks. Let's quickly prepare the data set so it is ready to be served for model training.
load("../data/reg_train.rda")
# if the above doesn't work, you can try: load("reg_train.rda")
# superml::check_package("caret")
library(data.table)
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(superml)
library(Metrics)
#>
#> Attaching package: 'Metrics'
#> The following objects are masked from 'package:caret':
#>
#> precision, recall
head(reg_train)
#> Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
#> <int> <int> <char> <int> <int> <char> <char> <char>
#> 1: 1 60 RL 65 8450 Pave <NA> Reg
#> 2: 2 20 RL 80 9600 Pave <NA> Reg
#> 3: 3 60 RL 68 11250 Pave <NA> IR1
#> 4: 4 70 RL 60 9550 Pave <NA> IR1
#> 5: 5 60 RL 84 14260 Pave <NA> IR1
#> 6: 6 50 RL 85 14115 Pave <NA> IR1
#> LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
#> <char> <char> <char> <char> <char> <char> <char>
#> 1: Lvl AllPub Inside Gtl CollgCr Norm Norm
#> 2: Lvl AllPub FR2 Gtl Veenker Feedr Norm
#> 3: Lvl AllPub Inside Gtl CollgCr Norm Norm
#> 4: Lvl AllPub Corner Gtl Crawfor Norm Norm
#> 5: Lvl AllPub FR2 Gtl NoRidge Norm Norm
#> 6: Lvl AllPub Inside Gtl Mitchel Norm Norm
#> BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
#> <char> <char> <int> <int> <int> <int> <char>
#> 1: 1Fam 2Story 7 5 2003 2003 Gable
#> 2: 1Fam 1Story 6 8 1976 1976 Gable
#> 3: 1Fam 2Story 7 5 2001 2002 Gable
#> 4: 1Fam 2Story 7 5 1915 1970 Gable
#> 5: 1Fam 2Story 8 5 2000 2000 Gable
#> 6: 1Fam 1.5Fin 5 5 1993 1995 Gable
#> RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
#> <char> <char> <char> <char> <int> <char> <char>
#> 1: CompShg VinylSd VinylSd BrkFace 196 Gd TA
#> 2: CompShg MetalSd MetalSd None 0 TA TA
#> 3: CompShg VinylSd VinylSd BrkFace 162 Gd TA
#> 4: CompShg Wd Sdng Wd Shng None 0 TA TA
#> 5: CompShg VinylSd VinylSd BrkFace 350 Gd TA
#> 6: CompShg VinylSd VinylSd None 0 TA TA
#> Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
#> <char> <char> <char> <char> <char> <int>
#> 1: PConc Gd TA No GLQ 706
#> 2: CBlock Gd TA Gd ALQ 978
#> 3: PConc Gd TA Mn GLQ 486
#> 4: BrkTil TA Gd No ALQ 216
#> 5: PConc Gd TA Av GLQ 655
#> 6: Wood Gd TA No GLQ 732
#> BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
#> <char> <int> <int> <int> <char> <char> <char>
#> 1: Unf 0 150 856 GasA Ex Y
#> 2: Unf 0 284 1262 GasA Ex Y
#> 3: Unf 0 434 920 GasA Ex Y
#> 4: Unf 0 540 756 GasA Gd Y
#> 5: Unf 0 490 1145 GasA Ex Y
#> 6: Unf 0 64 796 GasA Ex Y
#> Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
#> <char> <int> <int> <int> <int> <int>
#> 1: SBrkr 856 854 0 1710 1
#> 2: SBrkr 1262 0 0 1262 0
#> 3: SBrkr 920 866 0 1786 1
#> 4: SBrkr 961 756 0 1717 1
#> 5: SBrkr 1145 1053 0 2198 1
#> 6: SBrkr 796 566 0 1362 1
#> BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
#> <int> <int> <int> <int> <int> <char>
#> 1: 0 2 1 3 1 Gd
#> 2: 1 2 0 3 1 TA
#> 3: 0 2 1 3 1 Gd
#> 4: 0 1 0 3 1 Gd
#> 5: 0 2 1 4 1 Gd
#> 6: 0 1 1 1 1 TA
#> TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
#> <int> <char> <int> <char> <char> <int>
#> 1: 8 Typ 0 <NA> Attchd 2003
#> 2: 6 Typ 1 TA Attchd 1976
#> 3: 6 Typ 1 TA Attchd 2001
#> 4: 7 Typ 1 Gd Detchd 1998
#> 5: 9 Typ 1 TA Attchd 2000
#> 6: 5 Typ 0 <NA> Attchd 1993
#> GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
#> <char> <int> <int> <char> <char> <char>
#> 1: RFn 2 548 TA TA Y
#> 2: RFn 2 460 TA TA Y
#> 3: RFn 2 608 TA TA Y
#> 4: Unf 3 642 TA TA Y
#> 5: RFn 3 836 TA TA Y
#> 6: Unf 2 480 TA TA Y
#> WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC
#> <int> <int> <int> <int> <int> <int> <char>
#> 1: 0 61 0 0 0 0 <NA>
#> 2: 298 0 0 0 0 0 <NA>
#> 3: 0 42 0 0 0 0 <NA>
#> 4: 0 35 272 0 0 0 <NA>
#> 5: 192 84 0 0 0 0 <NA>
#> 6: 40 30 0 320 0 0 <NA>
#> Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
#> <char> <char> <int> <int> <int> <char> <char> <int>
#> 1: <NA> <NA> 0 2 2008 WD Normal 208500
#> 2: <NA> <NA> 0 5 2007 WD Normal 181500
#> 3: <NA> <NA> 0 9 2008 WD Normal 223500
#> 4: <NA> <NA> 0 2 2006 WD Abnorml 140000
#> 5: <NA> <NA> 0 12 2008 WD Normal 250000
#> 6: MnPrv Shed 700 10 2009 WD Normal 143000
split <- createDataPartition(y = reg_train$SalePrice, p = 0.7)
xtrain <- reg_train[split$Resample1]
xtest <- reg_train[!split$Resample1]
# remove features with 90% or more missing values
# we will also remove the Id column because it doesn't contain
# any useful information
na_cols <- colSums(is.na(xtrain)) / nrow(xtrain)
na_cols <- names(na_cols[which(na_cols > 0.9)])
xtrain[, c(na_cols, "Id") := NULL]
xtest[, c(na_cols, "Id") := NULL]
# encode categorical variables
cat_cols <- names(xtrain)[sapply(xtrain, is.character)]
for(c in cat_cols){
lbl <- LabelEncoder$new()
lbl$fit(c(xtrain[[c]], xtest[[c]]))
xtrain[[c]] <- lbl$transform(xtrain[[c]])
xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains NA values. Imputing NA with 'NA'
#> (message repeated once for each categorical column)
# remove noisy columns
noise <- c('GrLivArea','TotalBsmtSF')
xtrain[, c(noise) := NULL]
xtest[, c(noise) := NULL]
# fill missing value with -1
xtrain[is.na(xtrain)] <- -1
xtest[is.na(xtest)] <- -1
KNN Regression
# note: unlike other trainers, KNNTrainer takes both train and test at fit
# time; predict() then returns the test-set predictions computed during fitting
knn <- KNNTrainer$new(k = 2, prob = TRUE, type = 'reg')
knn$fit(train = xtrain, test = xtest, y = 'SalePrice')
probs <- knn$predict(type = 'prob')
labels <- knn$predict(type='raw')
rmse(actual = xtest$SalePrice, predicted = labels)
#> [1] 41518.75
SVM Regression
svm <- SVMTrainer$new()
svm$fit(xtrain, 'SalePrice')
pred <- svm$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)
Simple Regression
lf <- LMTrainer$new(family="gaussian")
lf$fit(X = xtrain, y = "SalePrice")
summary(lf$model)
#>
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.264e+04 1.753e+06 0.024 0.980598
#> MSSubClass -1.615e+02 7.222e+01 -2.236 0.025593 *
#> MSZoning -1.660e+03 1.658e+03 -1.001 0.317073
#> LotFrontage -2.483e+01 3.654e+01 -0.680 0.496962
#> LotArea 3.787e-01 1.405e-01 2.696 0.007133 **
#> Street -3.836e+04 1.944e+04 -1.973 0.048762 *
#> LotShape -5.048e+02 2.357e+03 -0.214 0.830437
#> LandContour 2.744e+03 2.022e+03 1.357 0.175152
#> Utilities NA NA NA NA
#> LotConfig 2.086e+03 1.233e+03 1.691 0.091116 .
#> LandSlope 1.403e+04 5.647e+03 2.485 0.013138 *
#> Neighborhood -8.000e+02 2.109e+02 -3.794 0.000158 ***
#> Condition1 -3.006e+03 9.257e+02 -3.247 0.001207 **
#> Condition2 -7.903e+03 3.517e+03 -2.247 0.024881 *
#> BldgType 2.129e+03 3.070e+03 0.693 0.488252
#> HouseStyle 1.635e+03 1.126e+03 1.452 0.146864
#> OverallQual 1.417e+04 1.526e+03 9.288 < 2e-16 ***
#> OverallCond 6.369e+03 1.366e+03 4.661 3.60e-06 ***
#> YearBuilt 4.018e+02 9.057e+01 4.436 1.02e-05 ***
#> YearRemodAdd 1.032e+02 9.009e+01 1.146 0.252258
#> RoofStyle 9.404e+03 2.445e+03 3.847 0.000128 ***
#> RoofMatl -1.979e+04 3.501e+03 -5.653 2.08e-08 ***
#> Exterior1st -1.589e+03 8.372e+02 -1.898 0.057935 .
#> Exterior2nd 1.296e+03 8.017e+02 1.616 0.106422
#> MasVnrType 1.651e+03 1.796e+03 0.919 0.358285
#> MasVnrArea 1.919e+01 7.731e+00 2.482 0.013254 *
#> ExterQual 2.272e+03 2.649e+03 0.858 0.391314
#> ExterCond 1.344e+02 2.898e+03 0.046 0.963026
#> Foundation -3.370e+03 2.247e+03 -1.499 0.134094
#> BsmtQual 7.344e+03 1.676e+03 4.382 1.30e-05 ***
#> BsmtCond -1.544e+03 2.127e+03 -0.726 0.468140
#> BsmtExposure 1.668e+03 1.066e+03 1.564 0.118044
#> BsmtFinType1 -6.899e+02 8.748e+02 -0.789 0.430557
#> BsmtFinSF1 1.078e+01 6.249e+00 1.725 0.084877 .
#> BsmtFinType2 -7.487e+02 1.250e+03 -0.599 0.549264
#> BsmtFinSF2 1.809e+01 1.092e+01 1.657 0.097836 .
#> BsmtUnfSF 5.451e+00 6.011e+00 0.907 0.364749
#> Heating 5.199e+02 4.385e+03 0.119 0.905664
#> HeatingQC -2.765e+03 1.522e+03 -1.817 0.069471 .
#> CentralAir 5.299e+03 5.939e+03 0.892 0.372492
#> Electrical 2.488e+03 2.443e+03 1.018 0.308810
#> `1stFlrSF` 5.158e+01 7.640e+00 6.752 2.54e-11 ***
#> `2ndFlrSF` 5.410e+01 6.540e+00 8.272 4.43e-16 ***
#> LowQualFinSF 1.103e+01 2.389e+01 0.462 0.644318
#> BsmtFullBath 9.969e+03 3.336e+03 2.988 0.002878 **
#> BsmtHalfBath 3.822e+03 4.973e+03 0.769 0.442346
#> FullBath 7.344e+03 3.479e+03 2.111 0.035044 *
#> HalfBath 1.578e+03 3.385e+03 0.466 0.641205
#> BedroomAbvGr -6.908e+03 2.116e+03 -3.265 0.001132 **
#> KitchenAbvGr -2.150e+04 6.475e+03 -3.321 0.000931 ***
#> KitchenQual 7.831e+03 2.018e+03 3.881 0.000111 ***
#> TotRmsAbvGrd 2.694e+03 1.531e+03 1.759 0.078848 .
#> Functional -3.430e+03 1.656e+03 -2.071 0.038584 *
#> Fireplaces -1.416e+03 2.825e+03 -0.501 0.616466
#> FireplaceQu 4.918e+03 1.511e+03 3.255 0.001172 **
#> GarageType -9.004e+02 1.405e+03 -0.641 0.521874
#> GarageYrBlt -1.258e+01 6.452e+00 -1.949 0.051535 .
#> GarageFinish 2.847e+03 1.647e+03 1.729 0.084204 .
#> GarageCars 1.870e+04 3.727e+03 5.017 6.27e-07 ***
#> GarageArea -3.306e+00 1.251e+01 -0.264 0.791663
#> GarageQual 8.001e+03 3.369e+03 2.375 0.017743 *
#> GarageCond -6.558e+03 4.118e+03 -1.593 0.111586
#> PavedDrive -2.194e+03 3.425e+03 -0.641 0.521999
#> WoodDeckSF 3.364e+01 9.822e+00 3.425 0.000642 ***
#> OpenPorchSF -1.281e+01 1.801e+01 -0.711 0.477116
#> EnclosedPorch 3.012e+01 2.034e+01 1.481 0.138972
#> `3SsnPorch` 4.637e+01 3.932e+01 1.179 0.238613
#> ScreenPorch 8.738e+01 2.180e+01 4.009 6.58e-05 ***
#> PoolArea -9.149e+00 2.763e+01 -0.331 0.740651
#> Fence -6.358e+02 1.162e+03 -0.547 0.584277
#> MiscVal -2.150e+00 4.311e+00 -0.499 0.617999
#> MoSold -2.667e+02 4.264e+02 -0.625 0.531871
#> YrSold -5.371e+02 8.704e+02 -0.617 0.537383
#> SaleType 3.076e+03 1.318e+03 2.334 0.019803 *
#> SaleCondition 1.564e+03 1.515e+03 1.033 0.301920
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 1202453519)
#>
#> Null deviance: 7.0716e+12 on 1023 degrees of freedom
#> Residual deviance: 1.1423e+12 on 950 degrees of freedom
#> AIC: 24389
#>
#> Number of Fisher Scoring iterations: 2
predictions <- lf$predict(df = xtest)
#> Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
#> prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 28114.03
Lasso Regression
lf <- LMTrainer$new(family = "gaussian", alpha = 1, lambda = 1000)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 31485.61
Ridge Regression
lf <- LMTrainer$new(family = "gaussian", alpha=0)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 31996.39
Linear Regression with CV
lf <- LMTrainer$new(family = "gaussian")
lf$cv_model(X = xtrain, y = 'SalePrice', nfolds = 5, parallel = FALSE)
predictions <- lf$cv_predict(df = xtest)
coefs <- lf$get_importance()
rmse(actual = xtest$SalePrice, predicted = predictions)
Random Forest
rf <- RFTrainer$new(n_estimators = 500, classification = 0)
rf$fit(X = xtrain, y = "SalePrice")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> OverallQual 897355696833
#> GarageCars 573896531183
#> 1stFlrSF 532567081665
#> GarageArea 465916371596
#> YearBuilt 381916514943
#> BsmtFinSF1 351467511773
#> 2ndFlrSF 289467129408
#> GarageYrBlt 280309239661
#> FullBath 254291970649
#> TotRmsAbvGrd 247291654326
#> LotArea 208933865259
#> ExterQual 192180631991
#> YearRemodAdd 184228984449
#> KitchenQual 156990439124
#> Fireplaces 155507255578
#> FireplaceQu 150289993682
#> MasVnrArea 139964631422
#> BsmtQual 118916830843
#> LotFrontage 115673529565
#> Foundation 94524504058
#> OpenPorchSF 79661007611
#> Neighborhood 74014615272
#> BsmtUnfSF 73692664766
#> WoodDeckSF 70008125913
#> BsmtFinType1 56604155491
#> HeatingQC 53650301917
#> GarageType 52489934869
#> BedroomAbvGr 45089359704
#> MoSold 41277120140
#> Exterior2nd 41264457223
#> RoofStyle 38602071072
#> HalfBath 37595400337
#> MSSubClass 35027094479
#> Exterior1st 29325511399
#> OverallCond 28987509816
#> HouseStyle 28219994918
#> GarageFinish 27984923118
#> BsmtExposure 25462643357
#> YrSold 23018084261
#> Fence 22733826352
#> LotShape 22669878684
#> BsmtFullBath 22202738623
#> MasVnrType 20135742426
#> SaleCondition 20117674148
#> PoolArea 18793427348
#> ScreenPorch 18730088330
#> SaleType 16653935002
#> GarageCond 16301884801
#> LotConfig 15707134034
#> LandContour 15385421905
#> MSZoning 13839441813
#> EnclosedPorch 13232890803
#> CentralAir 12363124255
#> LandSlope 12261331975
#> Condition1 12092559067
#> BldgType 11914182134
#> GarageQual 11345066274
#> RoofMatl 11068060677
#> BsmtFinSF2 10958447459
#> ExterCond 9206220888
#> BsmtCond 8813778449
#> BsmtHalfBath 7934540923
#> Functional 7660614640
#> BsmtFinType2 7153362349
#> KitchenAbvGr 6860084785
#> PavedDrive 5917647539
#> LowQualFinSF 3880928771
#> Heating 2864326669
#> Condition2 2778147869
#> Electrical 2464940714
#> 3SsnPorch 2130009823
#> MiscVal 1574955136
#> Street 602498863
#> Utilities 0
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 23806.23
Xgboost
xgb <- XGBTrainer$new(objective = "reg:linear"
, n_estimators = 500
, eval_metric = "rmse"
, maximize = F
, learning_rate = 0.1
,max_depth = 6)
xgb$fit(X = xtrain, y = "SalePrice", valid = xtest)
pred <- xgb$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)
Grid Search
Note that accuracy and AUC are classification metrics; with a continuous target such as SalePrice, the scores reported by the searches below are not meaningful and serve only to illustrate the tuning API.
xgb <- XGBTrainer$new(objective = "reg:linear")
gst <- GridSearchCV$new(trainer = xgb,
parameters = list(n_estimators = c(10,50), max_depth = c(5,2)),
n_folds = 3,
scoring = c('accuracy','auc'))
gst$fit(xtrain, "SalePrice")
gst$best_iteration()
Random Search
rf <- RFTrainer$new()
rst <- RandomSearchCV$new(trainer = rf,
parameters = list(n_estimators = c(5,10),
max_depth = c(5,2)),
n_folds = 3,
scoring = c('accuracy','auc'),
n_iter = 3)
rst$fit(xtrain, "SalePrice")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 10
#>
#> $max_depth
#> [1] 5
#>
#> $accuracy_avg
#> [1] 0.009760947
#>
#> $accuracy_sd
#> [1] 0.007379562
#>
#> $auc_avg
#> [1] NaN
#>
#> $auc_sd
#> [1] NA
Here, we will solve a simple binary classification problem: predicting which passengers survived the sinking of the Titanic. The idea is to demonstrate how to use this package to solve classification problems.
Data Preparation
# load class
load('../data/cla_train.rda')
# if the above doesn't work, you can try: load("cla_train.rda")
head(cla_train)
#> PassengerId Survived Pclass
#> <int> <int> <int>
#> 1: 1 0 3
#> 2: 2 1 1
#> 3: 3 1 3
#> 4: 4 1 1
#> 5: 5 0 3
#> 6: 6 0 3
#> Name Sex Age SibSp Parch
#> <char> <char> <num> <int> <int>
#> 1: Braund, Mr. Owen Harris male 22 1 0
#> 2: Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
#> 3: Heikkinen, Miss. Laina female 26 0 0
#> 4: Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
#> 5: Allen, Mr. William Henry male 35 0 0
#> 6: Moran, Mr. James male NA 0 0
#> Ticket Fare Cabin Embarked
#> <char> <num> <char> <char>
#> 1: A/5 21171 7.2500 S
#> 2: PC 17599 71.2833 C85 C
#> 3: STON/O2. 3101282 7.9250 S
#> 4: 113803 53.1000 C123 S
#> 5: 373450 8.0500 S
#> 6: 330877 8.4583 Q
# split the data
split <- createDataPartition(y = cla_train$Survived, p = 0.7)
xtrain <- cla_train[split$Resample1]
xtest <- cla_train[!split$Resample1]
# encode categorical variables - shorter way
for(c in c('Embarked','Sex','Cabin')) {
lbl <- LabelEncoder$new()
lbl$fit(c(xtrain[[c]], xtest[[c]]))
xtrain[[c]] <- lbl$transform(xtrain[[c]])
xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains blank values. Imputing them with 'NA'
#> (message repeated for each affected column)
# impute missing values
xtrain[, Age := replace(Age, is.na(Age), median(Age, na.rm = TRUE))]
xtest[, Age := replace(Age, is.na(Age), median(Age, na.rm = TRUE))]
# drop these features
to_drop <- c('PassengerId','Ticket','Name')
xtrain <- xtrain[, -c(to_drop), with = FALSE]
xtest <- xtest[, -c(to_drop), with = FALSE]
Now, our data is ready to be served for model training. Let’s do it.
KNN Classification
knn <- KNNTrainer$new(k = 2, prob = TRUE, type = 'class')
knn$fit(train = xtrain, test = xtest, y = 'Survived')
probs <- knn$predict(type = 'prob')
labels <- knn$predict(type = 'raw')
auc(actual = xtest$Survived, predicted = labels)
#> [1] 0.6385027
Naive Bayes Classification
nb <- NBTrainer$new()
nb$fit(xtrain, 'Survived')
pred <- nb$predict(xtest)
#> Warning: predict.naive_bayes(): more features in the newdata are provided as
#> there are probability tables in the object. Calculation is performed based on
#> features to be found in the tables.
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7771836
SVM Classification
# predicts labels
svm <- SVMTrainer$new()
svm$fit(xtrain, 'Survived')
pred <- svm$predict(xtest)
auc(actual = xtest$Survived, predicted=pred)
Logistic Regression
lf <- LMTrainer$new(family = "binomial")
lf$fit(X = xtrain, y = "Survived")
summary(lf$model)
#>
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.830070 0.616894 2.967 0.00301 **
#> Pclass -0.980785 0.192493 -5.095 3.48e-07 ***
#> Sex 2.508241 0.230374 10.888 < 2e-16 ***
#> Age -0.041034 0.009309 -4.408 1.04e-05 ***
#> SibSp -0.235520 0.117715 -2.001 0.04542 *
#> Parch -0.098742 0.137791 -0.717 0.47361
#> Fare 0.001281 0.002842 0.451 0.65230
#> Cabin 0.008408 0.004786 1.757 0.07899 .
#> Embarked 0.248088 0.166616 1.489 0.13649
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 831.52 on 623 degrees of freedom
#> Residual deviance: 564.76 on 615 degrees of freedom
#> AIC: 582.76
#>
#> Number of Fisher Scoring iterations: 5
predictions <- lf$predict(df = xtest)
auc(actual = xtest$Survived, predicted = predictions)
#> [1] 0.8832145
Lasso Logistic Regression
lf <- LMTrainer$new(family="binomial", alpha=1)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
Ridge Logistic Regression
lf <- LMTrainer$new(family="binomial", alpha=0)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
Random Forest
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 3)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> Sex 69.10742
#> Fare 57.96084
#> Age 48.50156
#> Pclass 23.91175
#> Cabin 21.19329
#> SibSp 12.58503
#> Parch 10.55128
#> Embarked 10.07059
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7988414
Xgboost
xgb <- XGBTrainer$new(objective = "binary:logistic"
, n_estimators = 500
, eval_metric = "auc"
, maximize = T
, learning_rate = 0.1
,max_depth = 6)
xgb$fit(X = xtrain, y = "Survived", valid = xtest)
pred <- xgb$predict(xtest)
auc(actual = xtest$Survived, predicted = pred)
Grid Search
xgb <- XGBTrainer$new(objective="binary:logistic")
gst <- GridSearchCV$new(trainer = xgb,
parameters = list(n_estimators = c(10,50),
max_depth = c(5,2)),
n_folds = 3,
scoring = c('accuracy','auc'))
gst$fit(xtrain, "Survived")
gst$best_iteration()
Random Search
rf <- RFTrainer$new()
rst <- RandomSearchCV$new(trainer = rf,
parameters = list(n_estimators = c(10,50), max_depth = c(5,2)),
n_folds = 3,
scoring = c('accuracy','auc'),
n_iter = 3)
rst$fit(xtrain, "Survived")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 50
#>
#> $max_depth
#> [1] 5
#>
#> $accuracy_avg
#> [1] 0.8028846
#>
#> $accuracy_sd
#> [1] 0.01733438
#>
#> $auc_avg
#> [1] 0.7804264
#>
#> $auc_sd
#> [1] 0.02631447
Let's create a new feature from the target variable using target encoding, and test a model with it.
# add target encoding features
xtrain[, feat_01 := smoothMean(train_df = xtrain,
test_df = xtest,
colname = "Embarked",
target = "Survived")$train[[2]]]
xtest[, feat_01 := smoothMean(train_df = xtrain,
test_df = xtest,
colname = "Embarked",
target = "Survived")$test[[2]]]
# train a random forest with the new feature
rf <- RFTrainer$new(n_estimators = 500,classification = 1, max_features = 4)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> Sex 71.417138
#> Fare 61.039958
#> Age 51.787990
#> Pclass 24.257112
#> Cabin 21.549374
#> SibSp 12.374317
#> Parch 10.392826
#> feat_01 6.490151
#> Embarked 6.270997
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7988414