The following examples illustrate the functionality of the KernelKnn package for regression tasks. I’ll make use of the Housing data set,
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
When using an algorithm where the ouput depends on distance calculation (as is the case in k-nearest-neighbors) it is recommended to first scale the data,
X = scale(Boston[, -ncol(Boston)])
y = Boston[, ncol(Boston)]
# random split of data in train and test
spl_train = sample(1:length(y), round(length(y) * 0.75))
spl_test = setdiff(1:length(y), spl_train)
str(spl_train)
## int [1:380] 411 64 274 134 28 380 17 125 247 373 ...
## int [1:126] 4 12 15 18 20 24 35 37 43 47 ...
The KernelKnn function takes a number of arguments. To read details for each one of the arguments type ?KernelKnn::KernelKnn in the console.
A simple k-nearest-neighbors can be run with weights_function = NULL (the parameter ‘regression’ should be set to TRUE for regression),
library(KernelKnn)
preds_TEST = KernelKnn(X[spl_train, ], TEST_data = X[spl_test, ], y[spl_train], k = 5 ,
method = 'euclidean', weights_function = NULL, regression = T)
str(preds_TEST)
## num [1:126] 30.3 20.8 18 16.3 19 ...
Using transf_categ_cols = TRUE, categorical features can be
either encoded to dummy or to numeric features depending on the number
of the unique values (here I convert the ‘chas’ and ‘rad’ features to
factor to apply the transf_categ_cols parameter)
## crim zn indus chas nox rm age dis rad tax
## 504 26 76 2 81 446 356 412 9 66
## ptratio black lstat medv
## 46 357 455 229
tmp_bst = Boston
tmp_bst$chas = as.factor(tmp_bst$chas)
tmp_bst$rad = as.factor(tmp_bst$rad)
preds_TEST = KernelKnn(tmp_bst[spl_train, -ncol(tmp_bst)],
TEST_data = tmp_bst[spl_test, -ncol(tmp_bst)],
y[spl_train], k = 5 , method = 'euclidean',
regression = T, transf_categ_cols = T)
str(preds_TEST)
## num [1:126] 23.7 22.3 30.2 22.2 25.4 ...
There are two ways to use a kernel in the KernelKnn function.
The first option is to choose one of the existing
kernels (uniform, triangular, epanechnikov,
biweight, triweight, tricube,
gaussian, cosine, logistic,
silverman, inverse, gaussianSimple,
exponential). Here, I use the mahalanobis metric
(which takes advantage of the covariance matrix of the data, but it
somewhat slows down training in comparison to the other distance
metrics) and the biweight kernel, because they give optimal
results (according to my RandomSearchR package),
preds_TEST_biw = KernelKnn(X[spl_train, ], TEST_data = X[spl_test, ], y[spl_train], k = 5,
method = 'mahalanobis', weights_function = 'biweight',
regression = T, transf_categ_cols = F)
str(preds_TEST_biw)
## num [1:126] 33.1 21.9 17 17.5 20.1 ...
The second option is to give a self defined
kernel function. Here, I’ll pick the density function of the normal
distribution with mean = 0.0 and standard deviation = 1.0 (the data are
scaled to have mean zero and unit variance),
norm_kernel = function(W) {
W = dnorm(W, mean = 0, sd = 1.0)
W = W / rowSums(W)
return(W)
}
preds_TEST_norm = KernelKnn(X[spl_train, ], TEST_data = X[spl_test, ], y[spl_train], k = 5,
method = 'mahalanobis', weights_function = norm_kernel,
regression = T, transf_categ_cols = F)
str(preds_TEST_norm)
## num [1:126] 30.3 22.3 17 16.3 19.9 ...
The computations can be speed up by using the parameter
threads (multiple cores can be run in parallel). There
is also the option to exclude extrema (minimum and
maximum distances) during the calculation of the k-nearest-neighbor
distances using extrema = TRUE. The bandwidth of the existing
kernels can be tuned using the h parameter.
K-nearest-neigbor calculations in the KernelKnn function can be
accomplished using the following distance metrics : euclidean,
manhattan, chebyshev, canberra,
braycurtis, minkowski (by default the order ‘p’ of the
minkowski parameter equals k), hamming, mahalanobis,
pearson_correlation, simple_matching_coefficient,
jaccard_coefficient and Rao_coefficient. The last four
are similarity measures and are appropriate for binary data [0,1].
I employed my RandomSearchR package to find the optimal parameters
for the KernelKnn function and the following two pairs of parameters
give an optimal mean-squared-error,
k | method | kernel |
---|---|---|
9 | mahalanobis | triweight |
3 | canberra | cosine |
I’ll use the KernelKnnCV function to calculate the mean-squared-error using 3-fold cross-validation for the previous mentioned parameter pairs,
fit_cv_pair1 = KernelKnnCV(X, y, k = 9, folds = 3, method = 'mahalanobis',
weights_function = 'triweight', regression = T,
threads = 5, seed_num = 3)
## List of 2
## $ preds:List of 3
## ..$ : num [1:169] 26.3 26 28.3 21.8 21.4 ...
## ..$ : num [1:168] 19.6 17.4 15.9 17.4 14.3 ...
## ..$ : num [1:169] 23.2 22.7 29.8 22.7 22.9 ...
## $ folds:List of 3
## ..$ fold_1: int [1:169] 3 4 5 12 13 19 20 21 25 26 ...
## ..$ fold_2: int [1:168] 7 8 9 18 24 27 29 36 40 41 ...
## ..$ fold_3: int [1:169] 1 2 6 10 11 14 15 16 17 22 ...
fit_cv_pair2 = KernelKnnCV(X, y, k = 3, folds = 3, method = 'canberra',
weights_function = 'cosine', regression = T,
threads = 5, seed_num = 3)
Each cross-validated object returns a list of length 2 ( the first sublist includes the predictions for each fold whereas the second gives the indices of the folds)
mse_pair1 = unlist(lapply(1:length(fit_cv_pair1$preds),
function(x) mse(y[fit_cv_pair1$folds[[x]]],
fit_cv_pair1$preds[[x]])))
mse_pair1
## [1] 19.03225 19.47981 19.41959
## mse for params_pair1 is : 19.31055
mse_pair2 = unlist(lapply(1:length(fit_cv_pair2$preds),
function(x) mse(y[fit_cv_pair2$folds[[x]]],
fit_cv_pair2$preds[[x]])))
mse_pair2
## [1] 17.83355 29.86579 28.65615
## mse for params_pair2 is : 25.45183
In the KernelKnn package there is also the option to combine kernels (adding or multiplying) from the existing ones. For instance, if I want to multiply the tricube with the gaussian kernel, then I’ll give the following character string to the weights_function, “tricube_gaussian_MULT”. On the other hand, If I want to add the same kernels then the weights_function will be “tricube_gaussian_ADD”. I experimented with my RandomSearchR package combining the different kernels and the following two parameter settings gave optimal results,
k | method | kernel |
---|---|---|
19 | mahalanobis | triangular_triweight_MULT |
18 | mahalanobis | biweight_triweight_gaussian_MULT |
fit_cv_pair1 = KernelKnnCV(X, y, k = 19, folds = 3, method = 'mahalanobis',
weights_function = 'triangular_triweight_MULT',
regression = T, threads = 5, seed_num = 3)
## List of 2
## $ preds:List of 3
## ..$ : num [1:169] 26.4 26 27.8 21.5 21.8 ...
## ..$ : num [1:168] 19.7 17.6 16.4 16.7 14.6 ...
## ..$ : num [1:169] 23.1 22.7 29 22.3 22.3 ...
## $ folds:List of 3
## ..$ fold_1: int [1:169] 3 4 5 12 13 19 20 21 25 26 ...
## ..$ fold_2: int [1:168] 7 8 9 18 24 27 29 36 40 41 ...
## ..$ fold_3: int [1:169] 1 2 6 10 11 14 15 16 17 22 ...
fit_cv_pair2 = KernelKnnCV(X, y, k = 18, folds = 3, method = 'mahalanobis',
weights_function = 'biweight_triweight_gaussian_MULT',
regression = T, threads = 5, seed_num = 3)
## List of 2
## $ preds:List of 3
## ..$ : num [1:169] 26.2 26 28.1 21.7 21.7 ...
## ..$ : num [1:168] 19.7 17.6 16.4 16.7 14.6 ...
## ..$ : num [1:169] 23.1 22.8 29.2 22.4 22.2 ...
## $ folds:List of 3
## ..$ fold_1: int [1:169] 3 4 5 12 13 19 20 21 25 26 ...
## ..$ fold_2: int [1:168] 7 8 9 18 24 27 29 36 40 41 ...
## ..$ fold_3: int [1:169] 1 2 6 10 11 14 15 16 17 22 ...
mse_pair1 = unlist(lapply(1:length(fit_cv_pair1$preds),
function(x) mse(y[fit_cv_pair1$folds[[x]]],
fit_cv_pair1$preds[[x]])))
mse_pair1
## [1] 19.18593 19.42827 17.33437
## mse for params_pair1 is : 18.64952
mse_pair2 = unlist(lapply(1:length(fit_cv_pair2$preds),
function(x) mse(y[fit_cv_pair2$folds[[x]]],
fit_cv_pair2$preds[[x]])))
mse_pair2
## [1] 19.20290 19.31054 17.56264
## mse for params_pair2 is : 18.69202