Title: | Gower's Distance |
---|---|
Description: | Compute Gower's distance (or similarity) coefficient between records. Compute the top-n matches between records. Core algorithms are executed in parallel on systems supporting OpenMP. |
Authors: | Mark van der Loo [aut, cre], David Turner [ctb] |
Maintainer: | Mark van der Loo <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1 |
Built: | 2024-10-31 22:24:40 UTC |
Source: | CRAN |
Compute Gower's distance, pairwise between records in two data sets x and y. Records from the smaller data set are recycled.
gower_dist( x, y, pair_x = NULL, pair_y = NULL, eps = 1e-08, weights = NULL, ignore_case = FALSE, nthread = getOption("gd_num_thread") )
x |
|
y |
|
pair_x |
|
pair_y |
|
eps |
|
weights |
|
ignore_case |
|
nthread |
Number of threads to use for parallelization. By default,
2 threads are used on a dual-core machine. On any other machine,
n-1 cores are used so that the machine does not freeze during a big computation.
The maximum number of threads is determined using |
A numeric vector of length max(nrow(x),nrow(y)).
When there are no columns to compare, a message is printed and numeric(0) is returned invisibly.
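As a minimal sketch of the recycling and the return value described above, the following compares one record against every row of the built-in iris data. It assumes the gower package is installed; the choice of iris and of `nthread = 1` is only for illustration.

```r
library(gower)

# One record compared against every row of iris: the single row of x is recycled.
x <- iris[1, ]
d <- gower_dist(x, iris, nthread = 1)  # nthread = 1 keeps the run single-threaded

length(d)  # max(nrow(x), nrow(iris))
range(d)   # smallest and largest distance found
```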
There are three ways to specify which columns of x should be compared with which columns of y. The first option is to give no specification. In that case columns with matching names will be used. The second option is to use only the pair_y argument, specifying for each column in x, in order, which column in y it must be paired with (use 0 to skip a column in x). The third option is to explicitly specify the columns to be matched using pair_x and pair_y.
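The three pairing options can be sketched as follows, assuming the gower package is installed. The column names given to `y` here are hypothetical and chosen only so that they do not match the names in `x`.

```r
library(gower)

x <- iris[1:5, 1:4]
y <- iris[6:10, 1:4]

# Option 1: no specification, columns are paired by matching names.
d_names <- gower_dist(x, y)

# Give y hypothetical names that no longer match those of x.
names(y) <- c("sl", "sw", "pl", "pw")

# Option 2: only pair_y, one entry per column of x, in order.
d_pair_y <- gower_dist(x, y, pair_y = 1:4)

# Option 3: explicit pairs through pair_x and pair_y.
d_explicit <- gower_dist(x, y, pair_x = 1:4, pair_y = 1:4)

# With option 2, a 0 skips the corresponding column of x
# (here only the first two columns are compared).
d_partial <- gower_dist(x, y, pair_y = c(1, 2, 0, 0))
```

Since the same columns end up paired in all three full specifications, `d_names`, `d_pair_y` and `d_explicit` should agree.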
Gower (1971) originally defined a similarity measure ($s$, say) with values ranging from 0 (completely dissimilar) to 1 (completely similar). The distance returned here equals $1 - s$.
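To make the relation between similarity and distance concrete, here is a hand-rolled base R sketch for a single pair of records. It is not the package's implementation: the helper `gower_pair`, the example records, and the supplied range are all hypothetical. It uses equal weights, scales numeric differences by a given range, scores categorical values as 0 (equal) or 1 (different), and ignores missing values and logical variables.

```r
# Illustrative only: Gower distance between two one-row records.
# `ranges` gives the range of each numeric column (supplied by the caller here).
gower_pair <- function(a, b, ranges) {
  d <- numeric(length(a))
  for (k in seq_along(a)) {
    if (is.numeric(a[[k]])) {
      # numeric variable: absolute difference scaled by the variable's range
      d[k] <- abs(a[[k]] - b[[k]]) / ranges[[k]]
    } else {
      # categorical variable: 0 when equal, 1 when different
      d[k] <- as.numeric(as.character(a[[k]]) != as.character(b[[k]]))
    }
  }
  mean(d)  # equal weights: average of the per-variable distances
}

a <- data.frame(height = 170, colour = "red")
b <- data.frame(height = 180, colour = "blue")
ranges <- list(height = 40, colour = NA)  # assumed range of height

d <- gower_pair(a, b, ranges)  # (10/40 + 1) / 2 = 0.625
s <- 1 - d                     # the corresponding similarity, 0.375
```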
Gower, John C. "A general coefficient of similarity and some of its properties." Biometrics (1971): 857-871.
Find the top-n matches in y for each record in x.
gower_topn( x, y, pair_x = NULL, pair_y = NULL, n = 5, eps = 1e-08, weights = NULL, ignore_case = FALSE, nthread = getOption("gd_num_thread") )
x |
|
y |
|
pair_x |
|
pair_y |
|
n |
The top-n indices and distances to return. |
eps |
|
weights |
|
ignore_case |
|
nthread |
Number of threads to use for parallelization. By default,
2 threads are used on a dual-core machine. On any other machine,
n-1 cores are used so that the machine does not freeze during a big computation.
The maximum number of threads is determined using |
A list with two array elements: index and distance. Both have size n x nrow(x). The ith column holds the top-n best matches for the ith record of x among the rows of y.
When there are no columns to compare, a message is printed and both distance and index will be empty matrices; the list is then returned invisibly.
# Find the top 4 best matches in the iris data set with itself.
x <- iris[1:3,]
lookup <- iris[1:10,]
gower_topn(x=x,y=lookup,n=4)
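Building on the example above, the following sketch shows one way to read the returned list, assuming the gower package is installed. The variable names `res`, `best_for_first` and `matched_rows` are chosen here for illustration.

```r
library(gower)

x <- iris[1:3, ]
lookup <- iris[1:10, ]
res <- gower_topn(x = x, y = lookup, n = 4)

dim(res$index)     # n rows, nrow(x) columns
dim(res$distance)  # same shape as index

# Column 1 describes the first record of x: row numbers in `lookup`
# of its best matches, and the corresponding distances.
best_for_first <- res$index[, 1]
res$distance[, 1]

# Retrieve the matching records themselves.
matched_rows <- lookup[best_for_first, ]
```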