Introduction to SuperML

SuperML R package is designed to unify the model training process in R like Python. Generally, it’s seen that people spend lot of time in searching for packages, figuring out the syntax for training machine learning models in R. This behaviour is highly apparent in users who frequently switch between R and Python. This package provides a python´s scikit-learn interface (fit, predict) to train models faster.

In addition to building machine learning models, there are handy functionalities to do feature engineering

This ambitious package is my ongoing effort to help the r-community build ML models easily and faster in R.

Install

You can install latest cran version using (recommended):

install.packages("superml")

You can install the developmemt version directly from github using:

devtools::install_github("saraswatmks/superml")

Caveats on superml installation

For machine learning, superml is based on the existing R packages. Hence, while installing the package, we don’t install all the dependencies. However, while training any model, superml will automatically install the package if its not found. Still, if you want to install all dependencies at once, you can simply do:

install.packages("superml", dependencies=TRUE)

Examples - Machine Learning Models

This package uses existing r-packages to build machine learning model. In this tutorial, we’ll use data.table R package to do all tasks related to data manipulation.

Regression Data

We’ll quickly prepare the data set to be ready to served for model training.

load("../data/reg_train.rda")
# if the above doesn't work, you can try: load("reg_train.rda")
# superml::check_package("caret")
library(data.table)
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(superml)

library(Metrics)
#> 
#> Attaching package: 'Metrics'
#> The following objects are masked from 'package:caret':
#> 
#>     precision, recall

head(reg_train)
#>       Id MSSubClass MSZoning LotFrontage LotArea Street  Alley LotShape
#>    <int>      <int>   <char>       <int>   <int> <char> <char>   <char>
#> 1:     1         60       RL          65    8450   Pave   <NA>      Reg
#> 2:     2         20       RL          80    9600   Pave   <NA>      Reg
#> 3:     3         60       RL          68   11250   Pave   <NA>      IR1
#> 4:     4         70       RL          60    9550   Pave   <NA>      IR1
#> 5:     5         60       RL          84   14260   Pave   <NA>      IR1
#> 6:     6         50       RL          85   14115   Pave   <NA>      IR1
#>    LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
#>         <char>    <char>    <char>    <char>       <char>     <char>     <char>
#> 1:         Lvl    AllPub    Inside       Gtl      CollgCr       Norm       Norm
#> 2:         Lvl    AllPub       FR2       Gtl      Veenker      Feedr       Norm
#> 3:         Lvl    AllPub    Inside       Gtl      CollgCr       Norm       Norm
#> 4:         Lvl    AllPub    Corner       Gtl      Crawfor       Norm       Norm
#> 5:         Lvl    AllPub       FR2       Gtl      NoRidge       Norm       Norm
#> 6:         Lvl    AllPub    Inside       Gtl      Mitchel       Norm       Norm
#>    BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
#>      <char>     <char>       <int>       <int>     <int>        <int>    <char>
#> 1:     1Fam     2Story           7           5      2003         2003     Gable
#> 2:     1Fam     1Story           6           8      1976         1976     Gable
#> 3:     1Fam     2Story           7           5      2001         2002     Gable
#> 4:     1Fam     2Story           7           5      1915         1970     Gable
#> 5:     1Fam     2Story           8           5      2000         2000     Gable
#> 6:     1Fam     1.5Fin           5           5      1993         1995     Gable
#>    RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
#>      <char>      <char>      <char>     <char>      <int>    <char>    <char>
#> 1:  CompShg     VinylSd     VinylSd    BrkFace        196        Gd        TA
#> 2:  CompShg     MetalSd     MetalSd       None          0        TA        TA
#> 3:  CompShg     VinylSd     VinylSd    BrkFace        162        Gd        TA
#> 4:  CompShg     Wd Sdng     Wd Shng       None          0        TA        TA
#> 5:  CompShg     VinylSd     VinylSd    BrkFace        350        Gd        TA
#> 6:  CompShg     VinylSd     VinylSd       None          0        TA        TA
#>    Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
#>        <char>   <char>   <char>       <char>       <char>      <int>
#> 1:      PConc       Gd       TA           No          GLQ        706
#> 2:     CBlock       Gd       TA           Gd          ALQ        978
#> 3:      PConc       Gd       TA           Mn          GLQ        486
#> 4:     BrkTil       TA       Gd           No          ALQ        216
#> 5:      PConc       Gd       TA           Av          GLQ        655
#> 6:       Wood       Gd       TA           No          GLQ        732
#>    BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
#>          <char>      <int>     <int>       <int>  <char>    <char>     <char>
#> 1:          Unf          0       150         856    GasA        Ex          Y
#> 2:          Unf          0       284        1262    GasA        Ex          Y
#> 3:          Unf          0       434         920    GasA        Ex          Y
#> 4:          Unf          0       540         756    GasA        Gd          Y
#> 5:          Unf          0       490        1145    GasA        Ex          Y
#> 6:          Unf          0        64         796    GasA        Ex          Y
#>    Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
#>        <char>    <int>    <int>        <int>     <int>        <int>
#> 1:      SBrkr      856      854            0      1710            1
#> 2:      SBrkr     1262        0            0      1262            0
#> 3:      SBrkr      920      866            0      1786            1
#> 4:      SBrkr      961      756            0      1717            1
#> 5:      SBrkr     1145     1053            0      2198            1
#> 6:      SBrkr      796      566            0      1362            1
#>    BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
#>           <int>    <int>    <int>        <int>        <int>      <char>
#> 1:            0        2        1            3            1          Gd
#> 2:            1        2        0            3            1          TA
#> 3:            0        2        1            3            1          Gd
#> 4:            0        1        0            3            1          Gd
#> 5:            0        2        1            4            1          Gd
#> 6:            0        1        1            1            1          TA
#>    TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
#>           <int>     <char>      <int>      <char>     <char>       <int>
#> 1:            8        Typ          0        <NA>     Attchd        2003
#> 2:            6        Typ          1          TA     Attchd        1976
#> 3:            6        Typ          1          TA     Attchd        2001
#> 4:            7        Typ          1          Gd     Detchd        1998
#> 5:            9        Typ          1          TA     Attchd        2000
#> 6:            5        Typ          0        <NA>     Attchd        1993
#>    GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
#>          <char>      <int>      <int>     <char>     <char>     <char>
#> 1:          RFn          2        548         TA         TA          Y
#> 2:          RFn          2        460         TA         TA          Y
#> 3:          RFn          2        608         TA         TA          Y
#> 4:          Unf          3        642         TA         TA          Y
#> 5:          RFn          3        836         TA         TA          Y
#> 6:          Unf          2        480         TA         TA          Y
#>    WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC
#>         <int>       <int>         <int>     <int>       <int>    <int> <char>
#> 1:          0          61             0         0           0        0   <NA>
#> 2:        298           0             0         0           0        0   <NA>
#> 3:          0          42             0         0           0        0   <NA>
#> 4:          0          35           272         0           0        0   <NA>
#> 5:        192          84             0         0           0        0   <NA>
#> 6:         40          30             0       320           0        0   <NA>
#>     Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
#>    <char>      <char>   <int>  <int>  <int>   <char>        <char>     <int>
#> 1:   <NA>        <NA>       0      2   2008       WD        Normal    208500
#> 2:   <NA>        <NA>       0      5   2007       WD        Normal    181500
#> 3:   <NA>        <NA>       0      9   2008       WD        Normal    223500
#> 4:   <NA>        <NA>       0      2   2006       WD       Abnorml    140000
#> 5:   <NA>        <NA>       0     12   2008       WD        Normal    250000
#> 6:  MnPrv        Shed     700     10   2009       WD        Normal    143000

split <- createDataPartition(y = reg_train$SalePrice, p = 0.7)
xtrain <- reg_train[split$Resample1]
xtest <- reg_train[!split$Resample1]

# remove features with 90% or more missing values
# we will also remove the Id column because it doesn't contain
# any useful information
na_cols <- colSums(is.na(xtrain)) / nrow(xtrain)
na_cols <- names(na_cols[which(na_cols > 0.9)])

xtrain[, c(na_cols, "Id") := NULL]
xtest[, c(na_cols, "Id") := NULL]

# encode categorical variables
cat_cols <- names(xtrain)[sapply(xtrain, is.character)]

for(c in cat_cols){
    lbl <- LabelEncoder$new()
    lbl$fit(c(xtrain[[c]], xtest[[c]]))
    xtrain[[c]] <- lbl$transform(xtrain[[c]])
    xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA' 
#> The data contains NA values. Imputing NA with 'NA'

# removing noise column
noise <- c('GrLivArea','TotalBsmtSF')

xtrain[, c(noise) := NULL]
xtest[, c(noise) := NULL]

# fill missing value with  -1
xtrain[is.na(xtrain)] <- -1
xtest[is.na(xtest)] <- -1

KNN Regression

knn <- KNNTrainer$new(k = 2,prob = T,type = 'reg')
knn$fit(train = xtrain, test = xtest, y = 'SalePrice')
probs <- knn$predict(type = 'prob')
labels <- knn$predict(type='raw')
rmse(actual = xtest$SalePrice, predicted=labels)
#> [1] 41518.75

SVM Regression

svm <- SVMTrainer$new()
svm$fit(xtrain, 'SalePrice')
pred <- svm$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)

Simple Regresison

lf <- LMTrainer$new(family="gaussian")
lf$fit(X = xtrain, y = "SalePrice")
summary(lf$model)
#> 
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#> 
#> Coefficients: (1 not defined because of singularities)
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    4.264e+04  1.753e+06   0.024 0.980598    
#> MSSubClass    -1.615e+02  7.222e+01  -2.236 0.025593 *  
#> MSZoning      -1.660e+03  1.658e+03  -1.001 0.317073    
#> LotFrontage   -2.483e+01  3.654e+01  -0.680 0.496962    
#> LotArea        3.787e-01  1.405e-01   2.696 0.007133 ** 
#> Street        -3.836e+04  1.944e+04  -1.973 0.048762 *  
#> LotShape      -5.048e+02  2.357e+03  -0.214 0.830437    
#> LandContour    2.744e+03  2.022e+03   1.357 0.175152    
#> Utilities             NA         NA      NA       NA    
#> LotConfig      2.086e+03  1.233e+03   1.691 0.091116 .  
#> LandSlope      1.403e+04  5.647e+03   2.485 0.013138 *  
#> Neighborhood  -8.000e+02  2.109e+02  -3.794 0.000158 ***
#> Condition1    -3.006e+03  9.257e+02  -3.247 0.001207 ** 
#> Condition2    -7.903e+03  3.517e+03  -2.247 0.024881 *  
#> BldgType       2.129e+03  3.070e+03   0.693 0.488252    
#> HouseStyle     1.635e+03  1.126e+03   1.452 0.146864    
#> OverallQual    1.417e+04  1.526e+03   9.288  < 2e-16 ***
#> OverallCond    6.369e+03  1.366e+03   4.661 3.60e-06 ***
#> YearBuilt      4.018e+02  9.057e+01   4.436 1.02e-05 ***
#> YearRemodAdd   1.032e+02  9.009e+01   1.146 0.252258    
#> RoofStyle      9.404e+03  2.445e+03   3.847 0.000128 ***
#> RoofMatl      -1.979e+04  3.501e+03  -5.653 2.08e-08 ***
#> Exterior1st   -1.589e+03  8.372e+02  -1.898 0.057935 .  
#> Exterior2nd    1.296e+03  8.017e+02   1.616 0.106422    
#> MasVnrType     1.651e+03  1.796e+03   0.919 0.358285    
#> MasVnrArea     1.919e+01  7.731e+00   2.482 0.013254 *  
#> ExterQual      2.272e+03  2.649e+03   0.858 0.391314    
#> ExterCond      1.344e+02  2.898e+03   0.046 0.963026    
#> Foundation    -3.370e+03  2.247e+03  -1.499 0.134094    
#> BsmtQual       7.344e+03  1.676e+03   4.382 1.30e-05 ***
#> BsmtCond      -1.544e+03  2.127e+03  -0.726 0.468140    
#> BsmtExposure   1.668e+03  1.066e+03   1.564 0.118044    
#> BsmtFinType1  -6.899e+02  8.748e+02  -0.789 0.430557    
#> BsmtFinSF1     1.078e+01  6.249e+00   1.725 0.084877 .  
#> BsmtFinType2  -7.487e+02  1.250e+03  -0.599 0.549264    
#> BsmtFinSF2     1.809e+01  1.092e+01   1.657 0.097836 .  
#> BsmtUnfSF      5.451e+00  6.011e+00   0.907 0.364749    
#> Heating        5.199e+02  4.385e+03   0.119 0.905664    
#> HeatingQC     -2.765e+03  1.522e+03  -1.817 0.069471 .  
#> CentralAir     5.299e+03  5.939e+03   0.892 0.372492    
#> Electrical     2.488e+03  2.443e+03   1.018 0.308810    
#> `1stFlrSF`     5.158e+01  7.640e+00   6.752 2.54e-11 ***
#> `2ndFlrSF`     5.410e+01  6.540e+00   8.272 4.43e-16 ***
#> LowQualFinSF   1.103e+01  2.389e+01   0.462 0.644318    
#> BsmtFullBath   9.969e+03  3.336e+03   2.988 0.002878 ** 
#> BsmtHalfBath   3.822e+03  4.973e+03   0.769 0.442346    
#> FullBath       7.344e+03  3.479e+03   2.111 0.035044 *  
#> HalfBath       1.578e+03  3.385e+03   0.466 0.641205    
#> BedroomAbvGr  -6.908e+03  2.116e+03  -3.265 0.001132 ** 
#> KitchenAbvGr  -2.150e+04  6.475e+03  -3.321 0.000931 ***
#> KitchenQual    7.831e+03  2.018e+03   3.881 0.000111 ***
#> TotRmsAbvGrd   2.694e+03  1.531e+03   1.759 0.078848 .  
#> Functional    -3.430e+03  1.656e+03  -2.071 0.038584 *  
#> Fireplaces    -1.416e+03  2.825e+03  -0.501 0.616466    
#> FireplaceQu    4.918e+03  1.511e+03   3.255 0.001172 ** 
#> GarageType    -9.004e+02  1.405e+03  -0.641 0.521874    
#> GarageYrBlt   -1.258e+01  6.452e+00  -1.949 0.051535 .  
#> GarageFinish   2.847e+03  1.647e+03   1.729 0.084204 .  
#> GarageCars     1.870e+04  3.727e+03   5.017 6.27e-07 ***
#> GarageArea    -3.306e+00  1.251e+01  -0.264 0.791663    
#> GarageQual     8.001e+03  3.369e+03   2.375 0.017743 *  
#> GarageCond    -6.558e+03  4.118e+03  -1.593 0.111586    
#> PavedDrive    -2.194e+03  3.425e+03  -0.641 0.521999    
#> WoodDeckSF     3.364e+01  9.822e+00   3.425 0.000642 ***
#> OpenPorchSF   -1.281e+01  1.801e+01  -0.711 0.477116    
#> EnclosedPorch  3.012e+01  2.034e+01   1.481 0.138972    
#> `3SsnPorch`    4.637e+01  3.932e+01   1.179 0.238613    
#> ScreenPorch    8.738e+01  2.180e+01   4.009 6.58e-05 ***
#> PoolArea      -9.149e+00  2.763e+01  -0.331 0.740651    
#> Fence         -6.358e+02  1.162e+03  -0.547 0.584277    
#> MiscVal       -2.150e+00  4.311e+00  -0.499 0.617999    
#> MoSold        -2.667e+02  4.264e+02  -0.625 0.531871    
#> YrSold        -5.371e+02  8.704e+02  -0.617 0.537383    
#> SaleType       3.076e+03  1.318e+03   2.334 0.019803 *  
#> SaleCondition  1.564e+03  1.515e+03   1.033 0.301920    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 1202453519)
#> 
#>     Null deviance: 7.0716e+12  on 1023  degrees of freedom
#> Residual deviance: 1.1423e+12  on  950  degrees of freedom
#> AIC: 24389
#> 
#> Number of Fisher Scoring iterations: 2
predictions <- lf$predict(df = xtest)
#> Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
#> prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 28114.03

Lasso Regression

lf <- LMTrainer$new(family = "gaussian", alpha = 1, lambda = 1000)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 31485.61

Ridge Regression

lf <- LMTrainer$new(family = "gaussian", alpha=0)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 31996.39

Logistic Regression with CV

lf <- LMTrainer$new(family = "gaussian")
lf$cv_model(X = xtrain, y = 'SalePrice', nfolds = 5, parallel = FALSE)
predictions <- lf$cv_predict(df = xtest)
coefs <- lf$get_importance()
rmse(actual = xtest$SalePrice, predicted = predictions)

Random Forest

rf <- RFTrainer$new(n_estimators = 500,classification = 0)
rf$fit(X = xtrain, y = "SalePrice")
pred <- rf$predict(df = xtest)
rf$get_importance()
#>               tmp.order.tmp..decreasing...TRUE..
#> OverallQual                         897355696833
#> GarageCars                          573896531183
#> 1stFlrSF                            532567081665
#> GarageArea                          465916371596
#> YearBuilt                           381916514943
#> BsmtFinSF1                          351467511773
#> 2ndFlrSF                            289467129408
#> GarageYrBlt                         280309239661
#> FullBath                            254291970649
#> TotRmsAbvGrd                        247291654326
#> LotArea                             208933865259
#> ExterQual                           192180631991
#> YearRemodAdd                        184228984449
#> KitchenQual                         156990439124
#> Fireplaces                          155507255578
#> FireplaceQu                         150289993682
#> MasVnrArea                          139964631422
#> BsmtQual                            118916830843
#> LotFrontage                         115673529565
#> Foundation                           94524504058
#> OpenPorchSF                          79661007611
#> Neighborhood                         74014615272
#> BsmtUnfSF                            73692664766
#> WoodDeckSF                           70008125913
#> BsmtFinType1                         56604155491
#> HeatingQC                            53650301917
#> GarageType                           52489934869
#> BedroomAbvGr                         45089359704
#> MoSold                               41277120140
#> Exterior2nd                          41264457223
#> RoofStyle                            38602071072
#> HalfBath                             37595400337
#> MSSubClass                           35027094479
#> Exterior1st                          29325511399
#> OverallCond                          28987509816
#> HouseStyle                           28219994918
#> GarageFinish                         27984923118
#> BsmtExposure                         25462643357
#> YrSold                               23018084261
#> Fence                                22733826352
#> LotShape                             22669878684
#> BsmtFullBath                         22202738623
#> MasVnrType                           20135742426
#> SaleCondition                        20117674148
#> PoolArea                             18793427348
#> ScreenPorch                          18730088330
#> SaleType                             16653935002
#> GarageCond                           16301884801
#> LotConfig                            15707134034
#> LandContour                          15385421905
#> MSZoning                             13839441813
#> EnclosedPorch                        13232890803
#> CentralAir                           12363124255
#> LandSlope                            12261331975
#> Condition1                           12092559067
#> BldgType                             11914182134
#> GarageQual                           11345066274
#> RoofMatl                             11068060677
#> BsmtFinSF2                           10958447459
#> ExterCond                             9206220888
#> BsmtCond                              8813778449
#> BsmtHalfBath                          7934540923
#> Functional                            7660614640
#> BsmtFinType2                          7153362349
#> KitchenAbvGr                          6860084785
#> PavedDrive                            5917647539
#> LowQualFinSF                          3880928771
#> Heating                               2864326669
#> Condition2                            2778147869
#> Electrical                            2464940714
#> 3SsnPorch                             2130009823
#> MiscVal                               1574955136
#> Street                                 602498863
#> Utilities                                      0
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 23806.23

Xgboost

xgb <- XGBTrainer$new(objective = "reg:linear"
                      , n_estimators = 500
                      , eval_metric = "rmse"
                      , maximize = F
                      , learning_rate = 0.1
                      ,max_depth = 6)
xgb$fit(X = xtrain, y = "SalePrice", valid = xtest)
pred <- xgb$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)

Grid Search

xgb <- XGBTrainer$new(objective = "reg:linear")

gst <- GridSearchCV$new(trainer = xgb,
                             parameters = list(n_estimators = c(10,50), max_depth = c(5,2)),
                             n_folds = 3,
                             scoring = c('accuracy','auc'))
gst$fit(xtrain, "SalePrice")
gst$best_iteration()

Random Search

rf <- RFTrainer$new()
rst <- RandomSearchCV$new(trainer = rf,
                             parameters = list(n_estimators = c(5,10),
                             max_depth = c(5,2)),
                             n_folds = 3,
                             scoring = c('accuracy','auc'),
                             n_iter = 3)
rst$fit(xtrain, "SalePrice")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 10
#> 
#> $max_depth
#> [1] 5
#> 
#> $accuracy_avg
#> [1] 0.009760947
#> 
#> $accuracy_sd
#> [1] 0.007379562
#> 
#> $auc_avg
#> [1] NaN
#> 
#> $auc_sd
#> [1] NA

Binary Classification Data

Here, we will solve a simple binary classification problem (predict people who survived on titanic ship). The idea here is to demonstrate how to use this package to solve classification problems.

Data Preparation

# load class
load('../data/cla_train.rda')
# if the above doesn't work, you can try: load("cla_train.rda")

head(cla_train)
#>    PassengerId Survived Pclass
#>          <int>    <int>  <int>
#> 1:           1        0      3
#> 2:           2        1      1
#> 3:           3        1      3
#> 4:           4        1      1
#> 5:           5        0      3
#> 6:           6        0      3
#>                                                   Name    Sex   Age SibSp Parch
#>                                                 <char> <char> <num> <int> <int>
#> 1:                             Braund, Mr. Owen Harris   male    22     1     0
#> 2: Cumings, Mrs. John Bradley (Florence Briggs Thayer) female    38     1     0
#> 3:                              Heikkinen, Miss. Laina female    26     0     0
#> 4:        Futrelle, Mrs. Jacques Heath (Lily May Peel) female    35     1     0
#> 5:                            Allen, Mr. William Henry   male    35     0     0
#> 6:                                    Moran, Mr. James   male    NA     0     0
#>              Ticket    Fare  Cabin Embarked
#>              <char>   <num> <char>   <char>
#> 1:        A/5 21171  7.2500               S
#> 2:         PC 17599 71.2833    C85        C
#> 3: STON/O2. 3101282  7.9250               S
#> 4:           113803 53.1000   C123        S
#> 5:           373450  8.0500               S
#> 6:           330877  8.4583               Q

# split the data
split <- createDataPartition(y = cla_train$Survived,p = 0.7)
xtrain <- cla_train[split$Resample1]
xtest <- cla_train[!split$Resample1]

# encode categorical variables - shorter way
for(c in c('Embarked','Sex','Cabin')) {
    lbl <- LabelEncoder$new()
    lbl$fit(c(xtrain[[c]], xtest[[c]]))
    xtrain[[c]] <- lbl$transform(xtrain[[c]])
    xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains blank values. Imputing them with 'NA' 
#> The data contains blank values. Imputing them with 'NA' 
#> The data contains blank values. Imputing them with 'NA' 
#> The data contains blank values. Imputing them with 'NA' 
#> The data contains blank values. Imputing them with 'NA'

# impute missing values
xtrain[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]
xtest[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]

# drop these features
to_drop <- c('PassengerId','Ticket','Name')

xtrain <- xtrain[,-c(to_drop), with=F]
xtest <- xtest[,-c(to_drop), with=F]

Now, our data is ready to be served for model training. Let’s do it.

KNN Classification

knn <- KNNTrainer$new(k = 2,prob = T,type = 'class')
knn$fit(train = xtrain, test = xtest, y = 'Survived')
probs <- knn$predict(type = 'prob')
labels <- knn$predict(type = 'raw')
auc(actual = xtest$Survived, predicted = labels)
#> [1] 0.6385027

Naive Bayes Classification

nb <- NBTrainer$new()
nb$fit(xtrain, 'Survived')
pred <- nb$predict(xtest)
#> Warning: predict.naive_bayes(): more features in the newdata are provided as
#> there are probability tables in the object. Calculation is performed based on
#> features to be found in the tables.
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7771836

SVM Classification

#predicts labels
svm <- SVMTrainer$new()
svm$fit(xtrain, 'Survived')
pred <- svm$predict(xtest)
auc(actual = xtest$Survived, predicted=pred)

Logistic Regression

lf <- LMTrainer$new(family = "binomial")
lf$fit(X = xtrain, y = "Survived")
summary(lf$model)
#> 
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  1.830070   0.616894   2.967  0.00301 ** 
#> Pclass      -0.980785   0.192493  -5.095 3.48e-07 ***
#> Sex          2.508241   0.230374  10.888  < 2e-16 ***
#> Age         -0.041034   0.009309  -4.408 1.04e-05 ***
#> SibSp       -0.235520   0.117715  -2.001  0.04542 *  
#> Parch       -0.098742   0.137791  -0.717  0.47361    
#> Fare         0.001281   0.002842   0.451  0.65230    
#> Cabin        0.008408   0.004786   1.757  0.07899 .  
#> Embarked     0.248088   0.166616   1.489  0.13649    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 831.52  on 623  degrees of freedom
#> Residual deviance: 564.76  on 615  degrees of freedom
#> AIC: 582.76
#> 
#> Number of Fisher Scoring iterations: 5
predictions <- lf$predict(df = xtest)
auc(actual = xtest$Survived, predicted = predictions)
#> [1] 0.8832145

Lasso Logistic Regression

lf <- LMTrainer$new(family="binomial", alpha=1)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)

Ridge Logistic Regression

lf <- LMTrainer$new(family="binomial", alpha=0)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)

Random Forest

rf <- RFTrainer$new(n_estimators = 500,classification = 1, max_features = 3)
rf$fit(X = xtrain, y = "Survived")

pred <- rf$predict(df = xtest)
rf$get_importance()
#>          tmp.order.tmp..decreasing...TRUE..
#> Sex                                69.10742
#> Fare                               57.96084
#> Age                                48.50156
#> Pclass                             23.91175
#> Cabin                              21.19329
#> SibSp                              12.58503
#> Parch                              10.55128
#> Embarked                           10.07059

auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7988414

Xgboost

xgb <- XGBTrainer$new(objective = "binary:logistic"
                      , n_estimators = 500
                      , eval_metric = "auc"
                      , maximize = T
                      , learning_rate = 0.1
                      ,max_depth = 6)
xgb$fit(X = xtrain, y = "Survived", valid = xtest)

pred <- xgb$predict(xtest)
auc(actual = xtest$Survived, predicted = pred)

Grid Search

xgb <- XGBTrainer$new(objective="binary:logistic")
gst <-GridSearchCV$new(trainer = xgb,
                             parameters = list(n_estimators = c(10,50),
                             max_depth = c(5,2)),
                             n_folds = 3,
                             scoring = c('accuracy','auc'))
gst$fit(xtrain, "Survived")
gst$best_iteration()

Random Search

rf <- RFTrainer$new()
rst <- RandomSearchCV$new(trainer = rf,
                             parameters = list(n_estimators = c(10,50), max_depth = c(5,2)),
                             n_folds = 3,
                             scoring = c('accuracy','auc'),
                             n_iter = 3)
rst$fit(xtrain, "Survived")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 50
#> 
#> $max_depth
#> [1] 5
#> 
#> $accuracy_avg
#> [1] 0.8028846
#> 
#> $accuracy_sd
#> [1] 0.01733438
#> 
#> $auc_avg
#> [1] 0.7804264
#> 
#> $auc_sd
#> [1] 0.02631447

Let’s create some new feature based on target variable using target encoding and test a model.

# add target encoding features
xtrain[, feat_01 := smoothMean(train_df = xtrain,
                        test_df = xtest,
                        colname = "Embarked",
                        target = "Survived")$train[[2]]]
xtest[, feat_01 := smoothMean(train_df = xtrain,
                               test_df = xtest,
                               colname = "Embarked",
                               target = "Survived")$test[[2]]]

# train a random forest
# Random Forest
rf <- RFTrainer$new(n_estimators = 500,classification = 1, max_features = 4)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#>          tmp.order.tmp..decreasing...TRUE..
#> Sex                               71.417138
#> Fare                              61.039958
#> Age                               51.787990
#> Pclass                            24.257112
#> Cabin                             21.549374
#> SibSp                             12.374317
#> Parch                             10.392826
#> feat_01                            6.490151
#> Embarked                           6.270997

auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7988414