In this vignette I’ll illustrate how to increase the
accuracy on the MNIST data (to approx. 98.4%) and on the CIFAR-10 data
(to approx. 58.3%) using the KernelKnn package and HOG (histogram of
oriented gradients) features.
The MNIST data set of handwritten digits used here contains 70,000 examples (the size of the standard 60,000-image training set and 10,000-image test set combined). Each row of the matrix corresponds to a 28 x 28 image flattened into 784 pixel values, and the response variable y takes the values 0 to 9. More information about the data can be found in the DataSets repository (the folder also includes an Rmarkdown file).
# using system('wget..') on a linux OS
system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/mnist.zip")
mnist <- read.table(unz("mnist.zip", "mnist.csv"), nrows = 70000, header = T,
quote = "\"", sep = ",")
X = mnist[, -ncol(mnist)]
dim(X)
## [1] 70000 784
# KernelKnn requires numeric labels that start from 1, so the digits 0-9 are shifted to 1-10
y = mnist[, ncol(mnist)] + 1
table(y)
## y
## 1 2 3 4 5 6 7 8 9 10
## 6903 7877 6990 7141 6824 6313 6876 7293 6825 6958
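Because of this shift, a predicted class index j corresponds to the digit j - 1. The toy example below is a minimal sketch of that mapping, assuming (as the evaluation code later in this vignette does) that predictions come back as a matrix with one column per class; the matrix `preds` here is made up for illustration.

```r
# Toy probability matrix: 3 observations (rows) x 10 classes (columns).
# Column j holds the probability of the shifted label j, i.e. of digit j - 1.
preds <- matrix(0, nrow = 3, ncol = 10)
preds[1, 1]  <- 1   # most likely label 1  -> digit 0
preds[2, 8]  <- 1   # most likely label 8  -> digit 7
preds[3, 10] <- 1   # most likely label 10 -> digit 9

pred_label <- max.col(preds, ties.method = "first")  # labels in 1..10
pred_digit <- pred_label - 1                         # back to digits 0..9
pred_digit
```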
K-nearest-neighbors methods do not perform well in high dimensions due to
the curse of dimensionality: the k observations that are nearest to a
given test observation x1 may be very far away from x1 in p-dimensional
space when p is large [ An Introduction to Statistical Learning,
James/Witten/Hastie/Tibshirani, pages 108-109 ], which leads to a very poor
k-nearest-neighbors fit. One option to overcome this problem is to
use truncated SVD (irlba package) to reduce the dimensions of the data.
A second option, which is appropriate in the case of images, is to use
image descriptors. In this vignette, I’ll compare these two approaches.
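The effect can be seen in a small simulation. The sketch below is not part of the original analysis: it draws uniformly distributed points (an assumption made purely for illustration) and compares the distance from a query point to its nearest and to its farthest neighbour. The helper `nearest_to_farthest` is a hypothetical name.

```r
set.seed(1)

# Ratio of the nearest to the farthest Euclidean distance from one query
# point to n uniformly distributed points in p dimensions.
nearest_to_farthest <- function(p, n = 500) {
  pts   <- matrix(runif(n * p), nrow = n, ncol = p)
  query <- runif(p)
  d     <- sqrt(rowSums(sweep(pts, 2, query)^2))  # subtract query from each row
  min(d) / max(d)
}

ratios <- sapply(c(2, 10, 100, 784), nearest_to_farthest)
names(ratios) <- c("p=2", "p=10", "p=100", "p=784")
round(ratios, 3)
```

As p grows the ratio moves towards 1, meaning that the nearest neighbour is hardly closer than the farthest one, so the neighbourhood carries little information.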
I experimented with different settings and the following parameters
gave the best results:
irlba_singular_vectors | k | method | kernel |
---|---|---|---|
40 | 8 | braycurtis | biweight_tricube_MULT |
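To give an intuition for the `method` and `kernel` settings, here is a minimal base-R sketch of a kernel-weighted vote. It rests on two assumptions that are not confirmed by this vignette: that the Bray-Curtis distance has the usual form sum|a - b| / sum|a + b|, and that 'biweight_tricube_MULT' stands for the product of the biweight and tricube kernels. The distance scaling is one simple choice and the package internals may differ.

```r
# Bray-Curtis dissimilarity between two numeric vectors (assumed form).
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(abs(a + b))

# Product of the biweight and tricube kernels for u in [0, 1]
# (assumption about what 'biweight_tricube_MULT' means).
biweight_tricube <- function(u) (1 - u^2)^2 * (1 - abs(u)^3)^3

# Toy data: one query point and four labelled neighbours.
query      <- c(1, 2, 3)
neighbours <- rbind(c(1, 2, 3), c(2, 2, 3), c(1, 4, 6), c(6, 5, 4))
labels     <- c(1, 1, 2, 2)

d <- apply(neighbours, 1, bray_curtis, b = query)  # distance of each row to query
u <- d / max(d)                                    # scale distances to [0, 1]
w <- biweight_tricube(u)                           # closer neighbours weigh more

# Weighted vote: share of the total weight held by each class.
votes <- tapply(w, labels, sum) / sum(w)
votes
```

Compared with a plain majority vote, the two neighbours of class 1 dominate here because they are much closer to the query point.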
library(irlba)
svd_irlb = irlba(as.matrix(X), nv = 40, nu = 40, verbose = F) # irlba truncated svd
new_x = as.matrix(X) %*% svd_irlb$v # new_x using the 40 right singular vectors
library(KernelKnn)
fit = KernelKnnCV(as.matrix(new_x), y, k = 8, folds = 4, method = 'braycurtis',
weights_function = 'biweight_tricube_MULT', regression = F,
threads = 6, Levels = sort(unique(y)))
# str(fit)
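# evaluation metric: share of observations whose most probable class
# (the column with the highest predicted probability) equals the true label
acc = function (y_true, preds) {
  pred_class = max.col(preds, ties.method = "random")
  mean(y_true == pred_class)
}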
acc_fit = unlist(lapply(1:length(fit$preds),
function(x) acc(y[fit$folds[[x]]],
fit$preds[[x]])))
acc_fit
## [1] 0.9742857 0.9749143 0.9761143 0.9741143
cat('mean accuracy using cross-validation :', mean(acc_fit), '\n')
## mean accuracy using cross-validation : 0.9748571
Using truncated SVD, a 4-fold cross-validated KernelKnn model
reaches an accuracy of 97.48%.
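The projection step (multiplying the data by the right singular vectors) can be illustrated on a small random matrix. The sketch below uses base R's `svd()` as a stand-in for irlba so that it runs without extra packages; the toy matrix and the choice of 5 components are assumptions for illustration only.

```r
set.seed(1)
X_toy <- matrix(rnorm(200 * 30), nrow = 200, ncol = 30)
k <- 5

s      <- svd(X_toy, nu = k, nv = k)  # keep the first k singular vectors
scores <- X_toy %*% s$v               # 200 x 5 matrix of reduced features

# Projecting onto V gives the left singular vectors scaled by the
# singular values, i.e. X V = U D.
all.equal(scores, s$u %*% diag(s$d[1:k]))

# Share of the total squared variation kept by the first k components.
sum(s$d[1:k]^2) / sum(s$d^2)
```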
In this chunk of code, besides KernelKnnCV I’ll also use HOG. The
histogram of oriented gradients (HOG) is a feature descriptor used in
computer vision and image processing for the purpose of object
detection. The technique counts occurrences of gradient orientation in
localized portions of an image. This method is similar to that of edge
orientation histograms, scale-invariant feature transform descriptors,
and shape contexts, but differs in that it is computed on a dense grid
of uniformly spaced cells and uses overlapping local contrast
normalization for improved accuracy (Wikipedia).
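To make the idea concrete, here is a heavily simplified sketch of the histogram for a single cell. It is not the OpenImageR implementation: it omits the grid of cells and the contrast normalization, and the function name `cell_histogram` is hypothetical. With a 6 x 6 grid of cells and 9 orientations per cell, a full descriptor would have 6 * 6 * 9 = 324 values, which is consistent with the number of columns of the `hog` matrix computed below.

```r
# Histogram of unsigned gradient orientations for a single image cell.
cell_histogram <- function(img, orientations = 9) {
  nr <- nrow(img); nc <- ncol(img)
  # Central differences on the interior pixels.
  gx <- img[2:(nr - 1), 3:nc] - img[2:(nr - 1), 1:(nc - 2)]
  gy <- img[3:nr, 2:(nc - 1)] - img[1:(nr - 2), 2:(nc - 1)]
  magnitude <- sqrt(gx^2 + gy^2)
  angle     <- (atan2(gy, gx) * 180 / pi) %% 180   # unsigned, in [0, 180)
  bin       <- floor(angle / (180 / orientations)) + 1
  bin       <- pmin(bin, orientations)             # guard against rounding
  # Sum the gradient magnitudes that fall into each orientation bin.
  hist <- numeric(orientations)
  for (b in seq_len(orientations)) hist[b] <- sum(magnitude[bin == b])
  hist
}

# A vertical edge: left half dark, right half bright.
img <- cbind(matrix(0, 8, 4), matrix(1, 8, 4))
h <- cell_histogram(img)
h
```

For this image all gradients point horizontally, so the whole gradient magnitude ends up in the first orientation bin.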
library(OpenImageR)
hog = HOG_apply(X, cells = 6, orientations = 9, rows = 28, columns = 28, threads = 6)
##
## time to complete : 1.802997 secs
dim(hog)
## [1] 70000 324
fit_hog = KernelKnnCV(hog, y, k = 20, folds = 4, method = 'braycurtis',
weights_function = 'biweight_tricube_MULT', regression = F,
threads = 6, Levels = sort(unique(y)))
#str(fit_hog)
acc_fit_hog = unlist(lapply(1:length(fit_hog$preds),
function(x) acc(y[fit_hog$folds[[x]]],
fit_hog$preds[[x]])))
acc_fit_hog
## [1] 0.9833714 0.9840571 0.9846857 0.9838857
cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog), '\n')
## mean accuracy for hog-features using cross-validation : 0.984
By changing from the simple SVD features to HOG features, the accuracy of a 4-fold cross-validation model increased from 97.48% to 98.4% (a difference of approx. 0.9 percentage points).
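Because both accuracies are already high, it can help to express the gain also as a relative reduction of the error rate. The short calculation below uses the fold accuracies printed above; the variable names are chosen for illustration.

```r
# Fold accuracies copied from the outputs above.
acc_svd <- c(0.9742857, 0.9749143, 0.9761143, 0.9741143)
acc_hog <- c(0.9833714, 0.9840571, 0.9846857, 0.9838857)

gain_pp <- 100 * (mean(acc_hog) - mean(acc_svd))       # percentage points

err_svd       <- 1 - mean(acc_svd)                     # error rate with SVD features
err_hog       <- 1 - mean(acc_hog)                     # error rate with HOG features
err_reduction <- 100 * (err_svd - err_hog) / err_svd   # relative reduction, in %

round(c(gain_pp = gain_pp, err_reduction = err_reduction), 2)
```

The gain of roughly 0.9 percentage points corresponds to removing about a third of the errors made with the SVD features.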
CIFAR-10 is an established computer-vision dataset used for object
recognition. The data I’ll use in this example is a subset of the 80
million tiny images dataset and consists of 60,000 32x32 color images,
each containing one of 10 object classes (6,000 images per class).
Furthermore, the data were converted from RGB to gray, normalized and
rounded to 2 decimal places (to reduce the storage size). More
information about the data can be found in my DataSets
repository (I included an Rmarkdown file).
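The preprocessing steps can be sketched as follows. The exact conversion is not stated here, so the luminance weights (ITU-R BT.601) and the scaling to [0, 1] are assumptions, and the tiny 2 x 2 image is made up for illustration.

```r
# Toy RGB image: a 2 x 2 x 3 array with channel values in 0..255.
rgb_img <- array(c(255, 0,   0, 128,    # red channel
                   255, 0, 255, 128,    # green channel
                   255, 0,   0, 128),   # blue channel
                 dim = c(2, 2, 3))

# Assumed luminance weights; the conversion actually used may differ.
gray <- 0.299 * rgb_img[, , 1] + 0.587 * rgb_img[, , 2] + 0.114 * rgb_img[, , 3]

gray_norm <- round(gray / 255, 2)   # scale to [0, 1] and keep 2 decimals
as.vector(gray_norm)                # flattened: one row of the feature matrix
```

Applied to a 32x32 image, the same flattening gives the 1024 columns of the data matrix.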
I’ll build the kernel k-nearest-neighbors models in the same way as for the MNIST data set and then I’ll compare the results.
# using system('wget..') on a linux OS
system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/cifar_10.zip")
cifar_10 <- read.table(unz("cifar_10.zip", "cifar_10.csv"), nrows = 60000, header = T,
quote = "\"", sep = ",")
X = cifar_10[, -ncol(cifar_10)]
dim(X)
## [1] 60000 1024
# KernelKnn requires numeric labels that start from 1; the CIFAR-10 labels already range from 1 to 10, so no shift is needed
y = cifar_10[, ncol(cifar_10)]
table(y)
## y
## 1 2 3 4 5 6 7 8 9 10
## 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000
The parameter settings are the same as those for the MNIST data:
irlba_singular_vectors | k | method | kernel |
---|---|---|---|
40 | 8 | braycurtis | biweight_tricube_MULT |
svd_irlb = irlba(as.matrix(X), nv = 40, nu = 40, verbose = F) # irlba truncated svd
new_x = as.matrix(X) %*% svd_irlb$v # new_x using the 40 right singular vectors
fit = KernelKnnCV(as.matrix(new_x), y, k = 8, folds = 4, method = 'braycurtis',
weights_function = 'biweight_tricube_MULT', regression = F,
threads = 6, Levels = sort(unique(y)))
# str(fit)
acc_fit = unlist(lapply(1:length(fit$preds),
function(x) acc(y[fit$folds[[x]]],
fit$preds[[x]])))
acc_fit
## [1] 0.4080667 0.4097333 0.4040000 0.4102667
cat('mean accuracy using cross-validation :', mean(acc_fit), '\n')
## mean accuracy using cross-validation : 0.4080167
The accuracy of a 4-fold cross-validation model using truncated svd is 40.8%.
Next, I’ll run the KernelKnnCV using the HOG-descriptors,
hog = HOG_apply(X, cells = 6, orientations = 9, rows = 32,
columns = 32, threads = 6)
##
## time to complete : 3.394621 secs
dim(hog)
## [1] 60000 324
fit_hog = KernelKnnCV(hog, y, k = 20, folds = 4, method = 'braycurtis',
weights_function = 'biweight_tricube_MULT', regression = F,
threads = 6, Levels = sort(unique(y)))
# str(fit_hog)
acc_fit_hog = unlist(lapply(1:length(fit_hog$preds),
function(x) acc(y[fit_hog$folds[[x]]],
fit_hog$preds[[x]])))
acc_fit_hog
## [1] 0.5807333 0.5884000 0.5777333 0.5861333
cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog), '\n')
## mean accuracy for hog-features using cross-validation : 0.58325
By using HOG descriptors in a 4-fold cross-validation model, the accuracy on the CIFAR-10 data increases from 40.8% to 58.3% (a difference of approx. 17.5 percentage points).
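To put both results into perspective, they can be compared with the accuracy of always predicting the most frequent class. The sketch below derives that baseline from the class counts printed above; `majority_baseline` is a hypothetical helper name.

```r
# Class counts copied from the table(y) outputs above.
counts_mnist <- c(6903, 7877, 6990, 7141, 6824, 6313, 6876, 7293, 6825, 6958)
counts_cifar <- rep(6000, 10)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline <- function(counts) max(counts) / sum(counts)

baseline <- c(mnist    = majority_baseline(counts_mnist),
              cifar_10 = majority_baseline(counts_cifar))
hog_acc  <- c(mnist = 0.984, cifar_10 = 0.58325)   # mean CV accuracies from above

round(rbind(baseline, hog_acc), 3)
```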