Title: | Co-Operation: Fast Covariance, Correlation, and Cosine Similarity Operations |
---|---|
Description: | Fast implementations of the co-operations: covariance, correlation, and cosine similarity. The implementations are fast and memory-efficient and their use is resolved automatically based on the input data, handled by R's S3 methods. Full descriptions of the algorithms and benchmarks are available in the package vignettes. |
Authors: | Drew Schmidt [aut, cre], Christian Heckendorf [ctb] (Caught some memory errors.) |
Maintainer: | Drew Schmidt <[email protected]> |
License: | BSD 2-clause License + file LICENSE |
Version: | 0.6-3 |
Built: | 2024-11-14 06:36:25 UTC |
Source: | CRAN |
Fast implementations of the co-operations: covariance, correlation, and cosine similarity. The implementations are fast and memory-efficient and their use is resolved automatically based on the input data, handled by R's S3 methods. Full descriptions of the algorithms and benchmarks are available in the package vignettes.
Covariance and correlation should largely need no introduction. Cosine similarity is common in, for example, natural language processing, where the cosine similarity coefficients of all columns of a term-document or document-term matrix are needed.
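For readers less familiar with cosine similarity, the following is a minimal base R sketch of the quantity for all column pairs of a matrix. The helper name cosine_ref is hypothetical and is not part of the package; it only illustrates the definition and makes no attempt at the package's efficiency.

```r
# Reference (unoptimized) cosine similarity between all columns of x.
# cosine_ref is a hypothetical helper name, not a coop function.
cosine_ref <- function(x) {
  cp <- crossprod(x)          # t(x) %*% x: all pairwise column dot products
  norms <- sqrt(diag(cp))     # Euclidean norm of each column
  cp / outer(norms, norms)    # divide entry (i, j) by norms[i] * norms[j]
}

set.seed(1)
x <- matrix(rnorm(10 * 3), 10, 3)
sim <- cosine_ref(x)
sim
```

The result is symmetric with a unit diagonal, since every column has cosine similarity 1 with itself.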
inplace argument
When computing covariance and correlation with dense matrices, we must operate on the centered and/or scaled input data. When inplace=FALSE, a copy of the matrix is made. This allows for very wall-clock efficient processing at the cost of m*n additional double precision numbers allocated. On the other hand, if inplace=TRUE, then the wall-clock performance will drop considerably, but at the memory expense of only m+n additional doubles. For perspective, given a 30,000x30,000 matrix, a copy of the data requires an additional 6.7 GiB of data, while the inplace method requires only 469 KiB, a 15,000-fold difference.
Note that cosine is always computed in place.
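The memory figures quoted for the 30,000x30,000 case can be checked with a few lines of arithmetic. This sketch assumes 8 bytes per double and binary units (GiB = 2^30 bytes, KiB = 2^10 bytes); the variable names are illustrative only.

```r
m <- 30000
n <- 30000
bytes_per_double <- 8

copy_bytes    <- m * n * bytes_per_double     # inplace = FALSE: full copy
inplace_bytes <- (m + n) * bytes_per_double   # inplace = TRUE: m + n doubles

copy_gib    <- copy_bytes / 2^30
inplace_kib <- inplace_bytes / 2^10
ratio       <- copy_bytes / inplace_bytes

round(copy_gib, 1)    # 6.7
round(inplace_kib)    # 469
ratio                 # 15000
```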
t functions
The package also includes "t" functions, like tcosine(). These relate to their counterparts as tcrossprod() relates to crossprod() in base R. So if cosine() operates on the columns of the input matrix, then tcosine() operates on the rows. Another way to think of it is tcosine(x) = cosine(t(x)).
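The relationship can be illustrated with the base R pair that the "t" functions are modeled on. The sketch below checks the base R identity directly and, only if the coop package happens to be installed, the analogous identity for tcosine() and cosine().

```r
set.seed(2)
x <- matrix(rnorm(5 * 3), 5, 3)

# Base R analogy: tcrossprod(x) gives the same result as crossprod(t(x)).
same_base <- all.equal(tcrossprod(x), crossprod(t(x)))
same_base

# The same relationship for the package functions, if coop is installed.
if (requireNamespace("coop", quietly = TRUE)) {
  print(all.equal(coop::tcosine(x), coop::cosine(t(x))))
}
```

In both cases the "t" variant avoids forming the explicit transpose.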
Multiple storage schemes for the input data are accepted. For dense matrices, an ordinary R matrix input is accepted. For sparse matrices, a matrix in COO format, namely simple_triplet_matrix from the slam package, is accepted.
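As a rough illustration of the COO (coordinate, or triplet) layout, the sketch below extracts the row index, column index, and value of each nonzero entry of a small dense matrix in base R. The slam call is guarded and runs only if that package is installed; the variable names are illustrative.

```r
# A small dense matrix with mostly zero entries.
x <- matrix(0, 4, 3)
x[1, 1] <- 2
x[3, 2] <- 5
x[4, 3] <- 1

# COO (triplet) form: row index i, column index j, and value v per nonzero.
nz <- which(x != 0, arr.ind = TRUE)
i <- nz[, "row"]
j <- nz[, "col"]
v <- x[nz]

# The same data as a slam simple_triplet_matrix, if slam is installed.
if (requireNamespace("slam", quietly = TRUE)) {
  s <- slam::simple_triplet_matrix(i = i, j = j, v = v,
                                   nrow = nrow(x), ncol = ncol(x))
  print(all.equal(as.matrix(s), x, check.attributes = FALSE))
}
```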
The implementation for dense matrix inputs is dominated by a symmetric rank-k update via the BLAS subroutine dsyrk; see the package vignette for a discussion of the algorithm implementation and complexity.
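To make the shape of that computation concrete, here is a minimal base R sketch of covariance written as "center the columns, then form t(xc) %*% xc", which is the symmetric product a rank-k update computes. The helper covar_ref is hypothetical and is not the package's implementation; the vignette remains the reference for the actual algorithm.

```r
# Covariance as a symmetric product: center the columns,
# then form t(xc) %*% xc with a single crossprod() call.
# covar_ref is a hypothetical helper name, not a coop function.
covar_ref <- function(x) {
  xc <- sweep(x, 2, colMeans(x))    # subtract each column's mean
  crossprod(xc) / (nrow(x) - 1)     # unbiased scaling, as in cov()
}

set.seed(3)
x <- matrix(rnorm(10 * 3), 10, 3)
all.equal(covar_ref(x), cov(x))
```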
The implementation for two dense vector inputs is dominated by the product t(x) %*% y performed by the BLAS subroutine dgemm and the normalizing products t(y) %*% y, each computed via the BLAS function dsyrk.
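For two vectors, the three products mentioned above combine as follows. This is a base R sketch with a hypothetical helper name, cosine_vec_ref; it mirrors the formula only and does not call the BLAS routines in the way the package does.

```r
# Cosine similarity of two vectors from three inner products.
# cosine_vec_ref is a hypothetical helper name, not a coop function.
cosine_vec_ref <- function(x, y) {
  xy <- drop(crossprod(x, y))   # t(x) %*% y
  xx <- drop(crossprod(x))      # t(x) %*% x
  yy <- drop(crossprod(y))      # t(y) %*% y
  xy / sqrt(xx * yy)
}

orth <- cosine_vec_ref(c(1, 0), c(0, 1))   # orthogonal vectors: 0
para <- cosine_vec_ref(c(1, 2), c(2, 4))   # parallel vectors: 1
c(orth, para)
```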
Drew Schmidt
Compute the cosine similarity matrix efficiently. The function syntax and behavior are largely modeled after those of the cosine() function from the lsa package, although with a very different implementation.
cosine(x, y, use = "everything", inverse = FALSE)
tcosine(x, y, use = "everything", inverse = FALSE)
x |
A numeric dataframe/matrix or vector. |
y |
A vector (when |
use |
The NA handler, as in R's |
inverse |
Logical; should the inverse covariance matrix be returned? |
See ?coop-package for implementation details.
The matrix of all pair-wise vector cosine
similarities of the columns.
Drew Schmidt
x <- matrix(rnorm(10*3), 10, 3)
coop::cosine(x)
coop::cosine(x[, 1], x[, 2])
An optimized, efficient implementation for computing covariance.
covar(x, y, use = "everything", inplace = FALSE, inverse = FALSE)
tcovar(x, y, use = "everything", inplace = FALSE, inverse = FALSE)
x |
A numeric dataframe/matrix or vector. |
y |
A vector (when |
use |
The NA handler, as in R's |
inplace |
Logical; if |
inverse |
Logical; should the inverse covariance matrix be returned? |
See ?coop-package for implementation details.
The covariance matrix.
Drew Schmidt
x <- matrix(rnorm(10*3), 10, 3)
coop::covar(x)
coop::covar(x[, 1], x[, 2])
An optimized, efficient implementation for computing the Pearson correlation.
pcor(x, y, use = "everything", inplace = FALSE, inverse = FALSE)
tpcor(x, y, use = "everything", inplace = FALSE, inverse = FALSE)
x |
A numeric dataframe/matrix or vector. |
y |
A vector (when |
use |
The NA handler, as in R's |
inplace |
Logical; if |
inverse |
Logical; should the inverse covariance matrix be returned? |
See ?coop for implementation details.
The Pearson correlation matrix.
Drew Schmidt
x <- matrix(rnorm(10*3), 10, 3)
coop::pcor(x)
coop::pcor(x[, 1], x[, 2])
A function to center (subtract mean) and/or scale (divide by standard deviation) data column-wise in a computationally efficient way.
scaler(x, center = TRUE, scale = TRUE)
x |
The input matrix. |
center, scale |
Logical; determine if the data should be centered and/or scaled. |
Unlike its R counterpart scale(), the arguments center and scale can only be logical values (and not vectors).
The centered/scaled data, with attributes as in R's scale().
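For comparison, the sketch below shows what column-wise centering and scaling amounts to for the default case (both enabled), using base R's scale() and the attributes it attaches. The guarded call assumes scaler() agrees with scale() for these defaults, as the description suggests; it runs only if coop is installed.

```r
set.seed(4)
x <- matrix(rnorm(10 * 3), 10, 3)

# Base R counterpart with logical center/scale arguments.
xs <- scale(x, center = TRUE, scale = TRUE)
attr(xs, "scaled:center")   # column means that were subtracted
attr(xs, "scaled:scale")    # column standard deviations that were divided out

# The same result written out by hand.
centered <- sweep(x, 2, colMeans(x))
manual   <- sweep(centered, 2, apply(x, 2, sd), "/")
all.equal(manual, xs, check.attributes = FALSE)

# Assumed to match for the defaults; runs only if coop is installed.
if (requireNamespace("coop", quietly = TRUE)) {
  print(all.equal(coop::scaler(x), xs, check.attributes = FALSE))
}
```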
Show the sparsity (as a count or proportion) of a matrix. For example, .99 sparsity means 99% of the values are zero. Similarly, a sparsity of 0 means the matrix is fully dense.
sparsity(x, proportion = TRUE)
x |
The matrix, stored as an ordinary R matrix or as a "simple triplet matrix" (from the slam package). |
proportion |
Logical; should a proportion or a count be returned? |
The implementation is very efficient for dense matrices. For sparse triplet matrices, the count is trivial.
The sparsity of the input matrix, as a proportion or a count.
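A naive base R reference for the dense case may help pin down the two return forms. The helper sparsity_ref is hypothetical, and it assumes that the "count" is the number of zero entries, in line with the proportion being the fraction of zeros.

```r
# Naive dense-matrix reference; sparsity_ref is a hypothetical helper name.
# Assumption: the count form is the number of zero entries.
sparsity_ref <- function(x, proportion = TRUE) {
  zeros <- sum(x == 0)
  if (proportion) zeros / length(x) else zeros
}

x <- matrix(0, 10, 10)
x[1:15] <- 1                                   # 15 nonzero entries out of 100

prop  <- sparsity_ref(x)                       # 0.85
count <- sparsity_ref(x, proportion = FALSE)   # 85
c(prop, count)
```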
Drew Schmidt
## Completely sparse matrix
x <- matrix(0, 10, 10)
coop::sparsity(x)

## 15% density / 85% sparsity
x[sample(length(x), size=15)] <- 1
coop::sparsity(x)
An optimized, efficient implementation for computing weighted covariance, correlation, and cosine similarity. Similar to R's cov.wt().
x |
A matrix or data.frame. |
wt |
A vector of weights or scalar weight. |
method |
Either "unbiased" or "ml". Unlike R, case is ignored. |
See ?coop-package
for implementation details.
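Since the weighted functions are described as similar to R's cov.wt(), the following sketch shows how the two method values behave in base R itself. It uses stats::cov.wt() only, where the maximum-likelihood option must be spelled "ML" in upper case; nothing here calls the package's weighted functions.

```r
set.seed(5)
x  <- matrix(rnorm(10 * 3), 10, 3)
n  <- nrow(x)
wt <- rep(1 / n, n)   # equal weights summing to 1

unb <- stats::cov.wt(x, wt = wt, method = "unbiased")$cov
ml  <- stats::cov.wt(x, wt = wt, method = "ML")$cov   # base R is case-sensitive

# With equal weights, "unbiased" matches cov() and "ML" divides by n instead of n - 1.
all.equal(unb, cov(x))
all.equal(ml, cov(x) * (n - 1) / n)
```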
Drew Schmidt
x <- matrix(rnorm(10*3), 10, 3)
cov.wt(x)