Title: | Fast Nearest Neighbor Search Algorithms and Applications |
---|---|
Description: | Cover-tree and kd-tree fast k-nearest neighbor search algorithms and related applications including KNN classification, regression and information measures are implemented. |
Authors: | Alina Beygelzimer [aut] (cover tree library), Sham Kakadet [aut] (cover tree library), John Langford [aut] (cover tree library), Sunil Arya [aut] (ANN library 1.1.2 for the kd-tree approach), David Mount [aut] (ANN library 1.1.2 for the kd-tree approach), Shengqiao Li [aut, cre] |
Maintainer: | Shengqiao Li <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1.4.1 |
Built: | 2025-01-21 06:27:46 UTC |
Source: | CRAN |
KNN Cross Entropy Estimators.
crossentropy(X, Y, k=10, algorithm=c("kd_tree", "cover_tree", "brute"))
crossentropy(X, Y, k=10, algorithm=c("kd_tree", "cover_tree", "brute"))
X |
an input data matrix. |
Y |
an input data matrix. |
k |
the maximum number of nearest neighbors to search. The default value is set to 10. |
algorithm |
nearest neighbor search algorithm. |
If p(x)
and q(x)
are two continuous probability density functions,
then the cross-entropy of p
and q
is defined as
.
a vector of length k
for crossentropy estimates using 1:k
nearest neighbors, respectively.
Shengqiao Li. To report any bugs or suggestions please email: [email protected]
S. Boltz, E. Debreuve and M. Barlaud (2007). “kNN-based high-dimensional Kullback-Leibler distance for tracking”. Image Analysis for Multimedia Interactive Services, 2007. WIAMIS '07. Eighth International Workshop on.
KNN Shannon Entropy Estimators.
entropy(X, k = 10, algorithm = c("kd_tree", "brute"))
entropy(X, k = 10, algorithm = c("kd_tree", "brute"))
X |
an input data matrix. |
k |
the maximum number of nearest neighbors to search. The default value is set to 10. |
algorithm |
nearest neighbor search algorithm. |
a vector of length k
for entropy estimates using 1:k
nearest neighbors, respectively.
Shengqiao Li. To report any bugs or suggestions please email: [email protected]
H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz and E. Demchuk (2003). “Nearest neighbor estimates of entropy”. American Journal of Mathematical and Management Sciences, 23, 301-321.
M.N. Goria, N.N.Leonenko, V.V. Mergel and P.L. Novi Inverardi (2005). “A new class of random vector entropy estimators and its applications in testing statistical hypotheses”. Journal of Nonparametric Statistics, 17:3, 277–297.
R.M. Mnatsakanov, N. Misra, S. Li and E.J. Harner (2008). “K_n-nearest neighbor estimators of entropy”. Mathematical Methods of Statistics, 17:3, 261-277.
Fast k-nearest neighbor searching algorithms including a kd-tree, cover-tree and the algorithm implemented in class package.
get.knn(data, k=10, algorithm=c("kd_tree", "cover_tree", "CR", "brute")) get.knnx(data, query, k=10, algorithm=c("kd_tree", "cover_tree", "CR", "brute"))
get.knn(data, k=10, algorithm=c("kd_tree", "cover_tree", "CR", "brute")) get.knnx(data, query, k=10, algorithm=c("kd_tree", "cover_tree", "CR", "brute"))
data |
an input data matrix. |
query |
a query data matrix. |
algorithm |
nearest neighbor searching algorithm. |
k |
the maximum number of nearest neighbors to search. The default value is set to 10. |
The cover tree is O(n) space data structure which allows us to answer queries in the same O(log(n)) time as kd tree given a fixed intrinsic dimensionality. Templated code from https://hunch.net/~jl/projects/cover_tree/cover_tree.html is used.
The kd tree algorithm is implemented in the Approximate Near Neighbor (ANN) C++ library (see http://www.cs.umd.edu/~mount/ANN/). The exact nearest neighbors are searched in this package.
The CR algorithm is the VR using distance 1-x'y assuming x
and y
are unit vectors.
The brute algorithm searches linearly. It is a naive method.
a list contains:
nn.index |
an n x k matrix for the nearest neighbor indice. |
nn.dist |
an n x k matrix for the nearest neighbor Euclidean distances. |
Shengqiao Li. To report any bugs or suggestions please email: [email protected]
Bentley J.L. (1975), “Multidimensional binary search trees used for associative search,” Communication ACM, 18, 309-517.
Arya S. and Mount D.M. (1993), “Approximate nearest neighbor searching,” Proc. 4th Ann. ACM-SIAM Symposium on Discrete Algorithms (SODA'93), 271-280.
Arya S., Mount D.M., Netanyahu N.S., Silverman R. and Wu A.Y. (1998), “An optimal algorithm for approximate nearest neighbor searching,” Journal of the ACM, 45, 891-923.
Beygelzimer A., Kakade S. and Langford J. (2006), “Cover trees for nearest neighbor,” ACM Proc. 23rd international conference on Machine learning, 148, 97-104.
nn2
in RANN, ann
in yaImpute and knn
in class.
data<- query<- cbind(1:10, 1:10) get.knn(data, k=5) get.knnx(data, query, k=5) get.knnx(data, query, k=5, algo="kd_tree") th<- runif(10, min=0, max=2*pi) data2<- cbind(cos(th), sin(th)) get.knn(data2, k=5, algo="CR")
data<- query<- cbind(1:10, 1:10) get.knn(data, k=5) get.knnx(data, query, k=5) get.knnx(data, query, k=5, algo="kd_tree") th<- runif(10, min=0, max=2*pi) data2<- cbind(cos(th), sin(th)) get.knn(data2, k=5, algo="CR")
Compute Kullback-Leibler symmetric distance.
KL.dist(X, Y, k = 10, algorithm=c("kd_tree", "cover_tree", "brute")) KLx.dist(X, Y, k = 10, algorithm="kd_tree")
KL.dist(X, Y, k = 10, algorithm=c("kd_tree", "cover_tree", "brute")) KLx.dist(X, Y, k = 10, algorithm="kd_tree")
X |
An input data matrix. |
Y |
An input data matrix. |
k |
The maximum number of nearest neighbors to search. The default value is set to 10. |
algorithm |
nearest neighbor search algorithm. |
Kullback-Leibler distance is the sum of divergence q(x)
from p(x)
and p(x)
from q(x)
.
KL.*
versions return distances from C
code to R
but KLx.*
do not.
Return the Kullback-Leibler distance between X
and Y
.
Shengqiao Li. To report any bugs or suggestions please email: [email protected]
S. Boltz, E. Debreuve and M. Barlaud (2007). “kNN-based high-dimensional Kullback-Leibler distance for tracking”. Image Analysis for Multimedia Interactive Services, 2007. WIAMIS '07. Eighth International Workshop on.
S. Boltz, E. Debreuve and M. Barlaud (2009). “High-dimensional statistical measure for region-of-interest tracking”. Trans. Img. Proc., 18:6, 1266–1283.
set.seed(1000) X<- rexp(10000, rate=0.2) Y<- rexp(10000, rate=0.4) KL.dist(X, Y, k=5) KLx.dist(X, Y, k=5) #thoretical distance = (0.2-0.4)^2/(0.2*0.4) = 0.5
set.seed(1000) X<- rexp(10000, rate=0.2) Y<- rexp(10000, rate=0.4) KL.dist(X, Y, k=5) KLx.dist(X, Y, k=5) #thoretical distance = (0.2-0.4)^2/(0.2*0.4) = 0.5
Compute Kullback-Leibler divergence.
KL.divergence(X, Y, k = 10, algorithm=c("kd_tree", "cover_tree", "brute")) KLx.divergence(X, Y, k = 10, algorithm="kd_tree")
KL.divergence(X, Y, k = 10, algorithm=c("kd_tree", "cover_tree", "brute")) KLx.divergence(X, Y, k = 10, algorithm="kd_tree")
X |
An input data matrix. |
Y |
An input data matrix. |
k |
The maximum number of nearest neighbors to search. The default value is set to 10. |
algorithm |
nearest neighbor search algorithm. |
If p(x)
and q(x)
are two continuous probability density functions,
then the Kullback-Leibler divergence of q
from p
is defined as
.
KL.*
versions return divergences from C
code to R
but KLx.*
do not.
Return the Kullback-Leibler divergence from X
to Y
.
Shengqiao Li. To report any bugs or suggestions please email: [email protected]
S. Boltz, E. Debreuve and M. Barlaud (2007). “kNN-based high-dimensional Kullback-Leibler distance for tracking”. Image Analysis for Multimedia Interactive Services, 2007. WIAMIS '07. Eighth International Workshop on.
S. Boltz, E. Debreuve and M. Barlaud (2009). “High-dimensional statistical measure for region-of-interest tracking”. Trans. Img. Proc., 18:6, 1266–1283.
set.seed(1000) X<- rexp(10000, rate=0.2) Y<- rexp(10000, rate=0.4) KL.divergence(X, Y, k=5) #theoretical divergence = log(0.2/0.4)+(0.4/0.2)-1 = 1-log(2) = 0.307
set.seed(1000) X<- rexp(10000, rate=0.2) Y<- rexp(10000, rate=0.4) KL.divergence(X, Y, k=5) #theoretical divergence = log(0.2/0.4)+(0.4/0.2)-1 = 1-log(2) = 0.307
k-nearest neighbour classification for test set from training set. For
each row of the test set, the k
nearest (in Euclidean distance)
training set vectors are found, and the classification is decided by
majority vote, with ties broken at random. If there are ties for the
k
th nearest vector, all candidates are included in the vote.
knn(train, test, cl, k = 1, prob = FALSE, algorithm=c("kd_tree", "cover_tree", "brute"))
knn(train, test, cl, k = 1, prob = FALSE, algorithm=c("kd_tree", "cover_tree", "brute"))
train |
matrix or data frame of training set cases. |
test |
matrix or data frame of test set cases. A vector will be interpreted as a row vector for a single case. |
cl |
factor of true classifications of training set. |
k |
number of neighbours considered. |
prob |
if this is true, the proportion of the votes for the winning class
are returned as attribute |
algorithm |
nearest neighbor search algorithm. |
factor of classifications of test set. doubt
will be returned as NA
.
Shengqiao Li. To report any bugs or suggestions please email: [email protected]
B.D. Ripley (1996). Pattern Recognition and Neural Networks. Cambridge.
M.N. Venables and B.D. Ripley (2002). Modern Applied Statistics with S. Fourth edition. Springer.
ownn
, knn.cv
and knn
in class.
data(iris3) train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3]) test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3]) cl <- factor(c(rep("s",25), rep("c",25), rep("v",25))) knn(train, test, cl, k = 3, prob=TRUE) attributes(.Last.value)
data(iris3) train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3]) test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3]) cl <- factor(c(rep("s",25), rep("c",25), rep("v",25))) knn(train, test, cl, k = 3, prob=TRUE) attributes(.Last.value)
k-nearest neighbour classification cross-validation from training set.
knn.cv(train, cl, k = 1, prob = FALSE, algorithm=c("kd_tree", "cover_tree", "brute"))
knn.cv(train, cl, k = 1, prob = FALSE, algorithm=c("kd_tree", "cover_tree", "brute"))
train |
matrix or data frame of training set cases. |
cl |
factor of true classifications of training set |
k |
number of neighbours considered. |
prob |
if this is true, the proportion of the votes for the winning class
are returned as attribute |
algorithm |
nearest neighbor search algorithm. |
This uses leave-one-out cross validation.
For each row of the training set train
, the k
nearest
(in Euclidean distance) other training set vectors are found, and the classification
is decided by majority vote, with ties broken at random. If there are ties for the
k
th nearest vector, all candidates are included in the vote.
factor of classifications of training set. doubt
will be returned as NA
.
distances and indice of k nearest neighbors are also returned as attributes.
Shengqiao Li. To report any bugs or suggestions please email: [email protected]
Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge.
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
data(iris3) train <- rbind(iris3[,,1], iris3[,,2], iris3[,,3]) cl <- factor(c(rep("s",50), rep("c",50), rep("v",50))) knn.cv(train, cl, k = 3, prob = TRUE) attributes(.Last.value)
data(iris3) train <- rbind(iris3[,,1], iris3[,,2], iris3[,,3]) cl <- factor(c(rep("s",50), rep("c",50), rep("v",50))) knn.cv(train, cl, k = 3, prob = TRUE) attributes(.Last.value)
Fast k-nearest neighbor distance searching algorithms.
knn.dist(data, k=10, algorithm=c("kd_tree", "cover_tree", "CR", "brute")) knnx.dist(data, query, k=10, algorithm=c("kd_tree", "cover_tree", "CR", "brute"))
knn.dist(data, k=10, algorithm=c("kd_tree", "cover_tree", "CR", "brute")) knnx.dist(data, query, k=10, algorithm=c("kd_tree", "cover_tree", "CR", "brute"))
data |
an input data matrix. |
query |
a query data matrix. |
algorithm |
nearest neighbor searching algorithm. |
k |
the maximum number of nearest neighbors to search. The default value is set to 10. |
return the Euclidiean distances of k nearest neighbors.
Shengqiao Li. To report any bugs or suggestions please email: [email protected]
Bentley J.L. (1975), “Multidimensional binary search trees used for associative search,” Communication ACM, 18, 309-517.
Arya S. and Mount D.M. (1993), “Approximate nearest neighbor searching,” Proc. 4th Ann. ACM-SIAM Symposium on Discrete Algorithms (SODA'93), 271-280.
Arya S., Mount D.M., Netanyahu N.S., Silverman R. and Wu A.Y. (1998), “An optimal algorithm for approximate nearest neighbor searching,” Journal of the ACM, 45, 891-923.
Beygelzimer A., Kakade S. and Langford J. (2006), “Cover trees for nearest neighbor,” ACM Proc. 23rd international conference on Machine learning, 148, 97-104.
if(require(mvtnorm)) { sigma<- function(v, r, p) { V<- matrix(r^2, ncol=p, nrow=p) diag(V)<- 1 V*v } X<- rmvnorm(1000, mean=rep(0, 20), sigma(1, .5, 20)) print(system.time(knn.dist(X)) ) print(system.time(knn.dist(X, algorithm = "kd_tree"))) }
if(require(mvtnorm)) { sigma<- function(v, r, p) { V<- matrix(r^2, ncol=p, nrow=p) diag(V)<- 1 V*v } X<- rmvnorm(1000, mean=rep(0, 20), sigma(1, .5, 20)) print(system.time(knn.dist(X)) ) print(system.time(knn.dist(X, algorithm = "kd_tree"))) }
Fast k-nearest neighbor searching algorithms including a kd-tree, cover-tree and the algorithm implemented in class package.
knn.index(data, k=10, algorithm=c("kd_tree", "cover_tree", "CR", "brute")) knnx.index(data, query, k=10, algorithm=c("kd_tree", "cover_tree", "CR", "brute"))
knn.index(data, k=10, algorithm=c("kd_tree", "cover_tree", "CR", "brute")) knnx.index(data, query, k=10, algorithm=c("kd_tree", "cover_tree", "CR", "brute"))
data |
an input data matrix. |
query |
a query data matrix. |
algorithm |
nearest neighbor searching algorithm. |
k |
the maximum number of nearest neighbors to search. The default value is set to 10. |
return the indice of k nearest neighbors.
Shengqiao Li. To report any bugs or suggestions please email: [email protected]
Bentley J.L. (1975), “Multidimensional binary search trees used for associative search,” Communication ACM, 18, 309-517.
Arya S. and Mount D.M. (1993), “Approximate nearest neighbor searching,” Proc. 4th Ann. ACM-SIAM Symposium on Discrete Algorithms (SODA'93), 271-280.
Arya S., Mount D.M., Netanyahu N.S., Silverman R. and Wu A.Y. (1998), “An optimal algorithm for approximate nearest neighbor searching,” Journal of the ACM, 45, 891-923.
Beygelzimer A., Kakade S. and Langford J. (2006), “Cover trees for nearest neighbor,” ACM Proc. 23rd international conference on Machine learning, 148, 97-104.
data<- query<- cbind(1:10, 1:10) knn.index(data, k=5) knnx.index(data, query, k=5) knnx.index(data, query, k=5, algo="kd_tree")
data<- query<- cbind(1:10, 1:10) knn.index(data, k=5) knnx.index(data, query, k=5) knnx.index(data, query, k=5, algo="kd_tree")
k-nearest neighbor regression
knn.reg(train, test = NULL, y, k = 3, algorithm=c("kd_tree", "cover_tree", "brute"))
knn.reg(train, test = NULL, y, k = 3, algorithm=c("kd_tree", "cover_tree", "brute"))
train |
matrix or data frame of training set cases. |
test |
matrix or data frame of test set cases. A vector will be interpreted as a row vector for a single case. If not supplied, cross-validataion will be done. |
y |
reponse of each observation in the training set. |
k |
number of neighbours considered. |
algorithm |
nearest neighbor search algorithm. |
If test is not supplied, Leave one out cross-validation is performed and R-square is the predicted R-square.
knn.reg
returns an object of class
"knnReg"
or "knnRegCV"
if test
data is not supplied.
The returnedobject is a list containing at least the following components:
call |
the match call. |
k |
number of neighbours considered. |
n |
number of predicted values, either equals test size or train size. |
pred |
a vector of predicted values. |
residuals |
predicted residuals. |
PRESS |
the sums of squares of the predicted residuals. |
R2Pred |
predicted R-square. |
The code for “VR” nearest neighbor searching is taken from class
source
Shengqiao Li. To report any bugs or suggestions please email: [email protected]
knn
.
if(require(chemometrics)){ data(PAC); pac.knn<- knn.reg(PAC$X, y=PAC$y, k=3); plot(PAC$y, pac.knn$pred, xlab="y", ylab=expression(hat(y))) }
if(require(chemometrics)){ data(PAC); pac.knn<- knn.reg(PAC$X, y=PAC$y, k=3); plot(PAC$y, pac.knn$pred, xlab="y", ylab=expression(hat(y))) }
KNN Mutual Information Estimators.
mutinfo(X, Y, k=10, direct=TRUE)
mutinfo(X, Y, k=10, direct=TRUE)
X |
an input data matrix. |
Y |
an input data matrix. |
k |
the maximum number of nearest neighbors to search. The default value is set to 10. |
direct |
Directly compute or via entropies. |
The direct computation is based on the first estimator of A. Kraskov, H. Stogbauer and P.Grassberger (2004) and the indirect computation is done via entropy estimates, i.e., I(X, Y) = H (X) + H(Y) - H(X, Y). The direct method has smaller bias and variance but the indirect method is faster, see Evans (2008).
For direct method, one mutual information estimate;
For indirect method,a vector of length k
for mutual information estimates using 1:k
nearest neighbors, respectively.
Shengqiao Li. To report any bugs or suggestions please email: [email protected]
A. Kraskov, H. Stogbauer and P.Grassberger (2004). “Estimating mutual information”. Physical Review E, 69:066138, 1–16.
D. Evans (2008). “A Computationally efficient estimator for mutual information”. Proc. R. Soc. A, 464, 1203–1215.
This function implements Samworth's optimal weighting scheme for k nearest neighbor classification. The performance improvement is greatest when the dimension is 4 as reported in the reference.
ownn(train, test, cl, testcl=NULL, k = NULL, prob = FALSE, algorithm=c("kd_tree", "cover_tree", "brute"))
ownn(train, test, cl, testcl=NULL, k = NULL, prob = FALSE, algorithm=c("kd_tree", "cover_tree", "brute"))
train |
matrix or data frame of training set cases. |
test |
matrix or data frame of test set cases. A vector will be interpreted as a row vector for a single case. |
cl |
factor of true classifications of training set. |
testcl |
factor of true classifications of testing set for error rate calculation. |
k |
number of neighbours considered, chosen by 5-fold cross-validation if not supplied. |
prob |
if this is true, the proportion of the weights for the winning class
are returned as attribute |
algorithm |
nearest neighbor search algorithm. |
a list includes k, predictions by ordinary knn, optimal weighted knn and bagged knn, and accuracies if class labels of test data set are given.
Shengqiao Li. To report any bugs or suggestions please email: [email protected]
Richard J. Samworth (2012), “Optimal Weighted Nearest Neighbor Classifiers,” Annals of Statistics, 40:5, 2733-2763.
data(iris3) train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3]) test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3]) cl <- factor(c(rep("s",25), rep("c",25), rep("v",25))) testcl <- factor(c(rep("s",25), rep("c",25), rep("v",25))) out <- ownn(train, test, cl, testcl) out
data(iris3) train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3]) test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3]) cl <- factor(c(rep("s",25), rep("c",25), rep("v",25))) testcl <- factor(c(rep("s",25), rep("c",25), rep("v",25))) out <- ownn(train, test, cl, testcl) out
Print method for KNN regression.
## S3 method for class 'knnReg' print(x, ...) ## S3 method for class 'knnRegCV' print(x, ...)
## S3 method for class 'knnReg' print(x, ...) ## S3 method for class 'knnRegCV' print(x, ...)
x |
a |
... |
Additonal |
Shengqiao Li. To report any bugs or suggestions please email: [email protected]