Title: | Tools for Tall Distributed Matrices |
---|---|
Description: | Many data science problems reduce to operations on very tall, skinny matrices. However, sometimes these matrices can be so tall that they are difficult to work with, or do not even fit into main memory. One strategy to deal with such objects is to distribute their rows across several processors. To this end, we offer an 'S4' class for tall, skinny, distributed matrices, called the 'shaq'. We also provide many useful numerical methods and statistics operations for operating on these distributed objects. The naming is a bit "tongue-in-cheek", with the class a play on the fact that 'Shaquille' 'ONeal' ('Shaq') is very tall, and he starred in the film 'Kazaam'. |
Authors: | Drew Schmidt [aut, cre], Wei-Chen Chen [aut], Mike Matheson [aut], George Ostrouchov [aut], ORNL [cph] |
Maintainer: | Drew Schmidt <[email protected]> |
License: | BSD 2-clause License + file LICENSE |
Version: | 0.1-0 |
Built: | 2024-11-23 06:23:31 UTC |
Source: | CRAN |
Many data science problems reduce to operations on very tall, skinny matrices. However, sometimes these matrices can be so tall that they are difficult to work with, or do not even fit into main memory. One strategy to deal with such objects is to distribute their rows across several processors. To this end, we offer an 'S4' class for tall, skinny, distributed matrices, called the 'shaq'. We also provide many useful numerical methods and statistics operations for operating on these distributed objects. The naming is a bit "tongue-in-cheek", with the class a play on the fact that 'Shaquille' 'ONeal' ('Shaq') is very tall, and he starred in the film 'Kazaam'.
Drew Schmidt [email protected], Wei-Chen Chen, Mike Matheson, and George Ostrouchov.
Programming with Big Data in R Website: http://r-pbd.org/
Some binary arithmetic operations for shaqs. All operations are vector-shaq or shaq-vector, but not shaq-shaq. See details section for more information.
## S4 method for signature 'shaq,shaq' e1 + e2 ## S4 method for signature 'shaq,numeric' e1 + e2 ## S4 method for signature 'numeric,shaq' e1 + e2 ## S4 method for signature 'shaq,shaq' e1 - e2 ## S4 method for signature 'shaq,numeric' e1 - e2 ## S4 method for signature 'numeric,shaq' e1 - e2 ## S4 method for signature 'shaq,shaq' e1 * e2 ## S4 method for signature 'shaq,numeric' e1 * e2 ## S4 method for signature 'numeric,shaq' e1 * e2 ## S4 method for signature 'shaq,shaq' e1 / e2 ## S4 method for signature 'shaq,numeric' e1 / e2 ## S4 method for signature 'numeric,shaq' e1 / e2
## S4 method for signature 'shaq,shaq' e1 + e2 ## S4 method for signature 'shaq,numeric' e1 + e2 ## S4 method for signature 'numeric,shaq' e1 + e2 ## S4 method for signature 'shaq,shaq' e1 - e2 ## S4 method for signature 'shaq,numeric' e1 - e2 ## S4 method for signature 'numeric,shaq' e1 - e2 ## S4 method for signature 'shaq,shaq' e1 * e2 ## S4 method for signature 'shaq,numeric' e1 * e2 ## S4 method for signature 'numeric,shaq' e1 * e2 ## S4 method for signature 'shaq,shaq' e1 / e2 ## S4 method for signature 'shaq,numeric' e1 / e2 ## S4 method for signature 'numeric,shaq' e1 / e2
e1 , e2
|
A shaq or a numeric vector. |
For binary operations involving two shaqs, they must be distributed identically.
A shaq.
Each operation is completely local.
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) y = ranshaq(runif, 10, 3) x + y x / 2 y + 1 finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) y = ranshaq(runif, 10, 3) x + y x / 2 y + 1 finalize() ## End(Not run)
Subsetting via `[`
for shaq objects.
## S4 method for signature 'shaq' x[i, j] ## S4 replacement method for signature 'shaq' x[i, j, ...] <- value
## S4 method for signature 'shaq' x[i, j] ## S4 replacement method for signature 'shaq' x[i, j, ...] <- value
x |
A shaq. |
i , j
|
Indices. NOTE currently only implemented for |
... |
Ignored. |
value |
Replacement value(s) for the |
A shaq.
Each operation is completely local.
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) y = x[, -1] y finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) y = x[, -1] y finalize() ## End(Not run)
Column binding for shaqs.
## S3 method for class 'shaq' cbind(..., deparse.level = 1)
## S3 method for class 'shaq' cbind(..., deparse.level = 1)
... |
A collection of shaqs. |
deparse.level |
Ignored. |
All shaqs should have the same number of rows. Additionally, all shaqs should be distributed in identical fashion.
A shaq.
The operation is completely local.
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) y = ranshaq(runif, 10, 1) cbind(x, y) finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) y = ranshaq(runif, 10, 1) cbind(x, y) finalize() ## End(Not run)
Column operations (currently sums/means) for shaq objects.
## S4 method for signature 'shaq' colSums(x, na.rm = FALSE, dims = 1L) ## S4 method for signature 'shaq' colMeans(x, na.rm = FALSE, dims = 1L)
## S4 method for signature 'shaq' colSums(x, na.rm = FALSE, dims = 1L) ## S4 method for signature 'shaq' colMeans(x, na.rm = FALSE, dims = 1L)
x |
A shaq. |
na.rm |
Should |
dims |
Ignored. |
A regular vector.
The operation consists of a local column sum operation, followed by an
allreduce()
call, quadratic on the number of columns.
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) cs = colSums(x) comm.print(cs) finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) cs = colSums(x) comm.print(cs) finalize() ## End(Not run)
Collapse a shaq into a regular matrix.
collapse(x)
collapse(x)
x |
A shaq. |
Only rank 0 will own the matrix on return.
A regular matrix (rank 0) or NULL
(everyone else).
Short answer: quite a bit. Each local submatrix has to be sent to rank 0.
## Not run: library(kazaam) dx = ranshaq(runif, 10, 3) x = collapse(dx) comm.print(x) finalize() ## End(Not run)
## Not run: library(kazaam) dx = ranshaq(runif, 10, 3) x = collapse(dx) comm.print(x) finalize() ## End(Not run)
Covariance and (pearson) correlation.
## S4 method for signature 'shaq' cov(x, y = NULL, use = "everything", method = "pearson") ## S4 method for signature 'shaq' cor(x, y = NULL, use = "everything", method = "pearson")
## S4 method for signature 'shaq' cov(x, y = NULL, use = "everything", method = "pearson") ## S4 method for signature 'shaq' cor(x, y = NULL, use = "everything", method = "pearson")
x |
A shaq. |
y |
At this time, this must be |
use |
NA handling rules, as with R's cov/cor functions. At this time, only "everything" is supported. |
method |
The cov/cor method. Currently only "pearson" is available. |
A regular matrix.
The operation is completely local except for forming the crossproduct, which
is an allreduce()
call, quadratic on the number of columns.
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) cov(x) cor(x) finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) cov(x) cor(x) finalize() ## End(Not run)
Conceptually, this computes t(x) %*% x
for a shaq x
.
## S4 method for signature 'shaq' crossprod(x, y = NULL)
## S4 method for signature 'shaq' crossprod(x, y = NULL)
x |
A shaq. |
y |
Must be |
A regular matrix.
The operation consists of a local crossproduct, followed by an
allreduce()
call, quadratic on the number of columns.
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) cp = crossprod(x) comm.print(cp) finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) cp = crossprod(x) comm.print(cp) finalize() ## End(Not run)
Expand a regular matrix owned on MPI rank 0 into a shaq.
expand(x)
expand(x)
x |
A regular matrix. |
A shaq.
Short answer: quite a bit. Each local submatrix has to be received from rank 0.
## Not run: library(kazaam) if (comm.rank() == 0){ x = matrix(runif(30), 10, 3) } else { x = NULL } dx = expand(x) dx finalize() ## End(Not run)
## Not run: library(kazaam) if (comm.rank() == 0){ x = matrix(runif(30), 10, 3) } else { x = NULL } dx = expand(x) dx finalize() ## End(Not run)
Getters for shaq objects.
## S4 method for signature 'shaq' nrow(x) ## S4 method for signature 'shaq' NROW(x) nrow.local(x) ## S4 method for signature 'shaq' nrow.local(x) ## S4 method for signature 'shaq' ncol(x) ## S4 method for signature 'shaq' NCOL(x) ncol.local(x) ## S4 method for signature 'shaq' ncol.local(x) ## S4 method for signature 'shaq' length(x) Data(x) ## S4 method for signature 'shaq' Data(x)
## S4 method for signature 'shaq' nrow(x) ## S4 method for signature 'shaq' NROW(x) nrow.local(x) ## S4 method for signature 'shaq' nrow.local(x) ## S4 method for signature 'shaq' ncol(x) ## S4 method for signature 'shaq' NCOL(x) ncol.local(x) ## S4 method for signature 'shaq' ncol.local(x) ## S4 method for signature 'shaq' length(x) Data(x) ## S4 method for signature 'shaq' Data(x)
x |
A shaq. |
Functions to return the number of rows (nrow()
and NROW()
),
the number of columns (ncol()
and NCOL()
), the length - or
product of the number of rows and cols - (length()
), and the local
submatrix (Data()
).
Each operation is completely local.
Linear regression (Gaussian GLM), logistic regression, and poisson regression model fitters.
reg.fit(x, y, maxiter = 100) logistic.fit(x, y, maxiter = 100) poisson.fit(x, y, maxiter = 100)
reg.fit(x, y, maxiter = 100) logistic.fit(x, y, maxiter = 100) poisson.fit(x, y, maxiter = 100)
x , y
|
The input data |
maxiter |
The maximum number of iterations. |
Each function is implemented with gradient descent using the conjugate
gradients method ("CG") of the optim()
function.
Both of x
and y
must be distributed in an identical fashion.
This means that the number of rows owned by each MPI rank should match, and
the data rows x
and response rows y
should be aligned.
Additionally, each MPI rank should own at least one row. Ideally they should
be load balanced, so that each MPI rank owns roughly the same amount of data.
The return is the output of an optim()
call.
The communication consists of an allreduce of 1 double (the local cost/objective function value) at each iteration of the optimization.
McCullagh, P. and Nelder, J.A., 1989. Generalized Linear Models, no. 37 in Monograph on Statistics and Applied Probability.
Duda, R.O., Hart, P.E. and Stork, D.G., 1973. Pattern classification (pp. 526-528). Wiley, New York.
## Not run: library(kazaam) comm.set.seed(1234, diff=TRUE) x = ranshaq(rnorm, 10, 3) y = ranshaq(function(i) sample(0:1, size=i, replace=TRUE), 10) fit = logistic.fit(x, y) comm.print(fit) finalize() ## End(Not run)
## Not run: library(kazaam) comm.set.seed(1234, diff=TRUE) x = ranshaq(rnorm, 10, 3) y = ranshaq(function(i) sample(0:1, size=i, replace=TRUE), 10) fit = logistic.fit(x, y) comm.print(fit) finalize() ## End(Not run)
Test if an object is a shaq.
is.shaq(x)
is.shaq(x)
x |
An R object. |
A logical value, indicating whether or not the input is a shaq.
The operation is completely local.
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) comm.print(is.shaq(x)) finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) comm.print(is.shaq(x)) finalize() ## End(Not run)
Coefficients of the linear model.
lm_coefs(x, y, tol = 1e-07)
lm_coefs(x, y, tol = 1e-07)
x , y
|
The input data |
tol |
Numerical tolerance for deciding rank. |
The model is fit using a QR factorization of the input x
. At this
time, that means
Both of x
and y
must be distributed in an identical fashion.
This means that the number of rows owned by each MPI rank should match, and
the data rows x
and labels y
should be aligned. Additionally,
each MPI rank should own at least one row. Ideally they should be load
balanced, so that each MPI rank owns roughly the same amount of data.
A regular vector.
The operation has the same communication as
## Not run: library(kazaam) comm.set.seed(1234, diff=TRUE) x = ranshaq(rnorm, 10, 3) y = ranshaq(runif, 10) fit = lm_coefs(x, y) comm.print(fit) finalize() ## End(Not run)
## Not run: library(kazaam) comm.set.seed(1234, diff=TRUE) x = ranshaq(rnorm, 10, 3) y = ranshaq(runif, 10) fit = lm_coefs(x, y) comm.print(fit) finalize() ## End(Not run)
Multiplies two distributed matrices, if they are conformable.
## S4 method for signature 'shaq,matrix' x %*% y
## S4 method for signature 'shaq,matrix' x %*% y
x |
A shaq. |
y |
A regular matrix, globally ownd on all ranks. Since the number of columns of a shaq should be small, this matrix should be small as well. |
The two shaqs must be distributed identically.
A shaq.
The operation is completely local.
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) y = matrix(1:9, 3, 3) x %*% y finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) y = matrix(1:9, 3, 3) x %*% y finalize() ## End(Not run)
Implementation of R's norm()
function for shaq objects.
## S4 method for signature 'shaq,ANY' norm(x, type = c("O", "I", "F", "M", "2"))
## S4 method for signature 'shaq,ANY' norm(x, type = c("O", "I", "F", "M", "2"))
x |
A shaq |
type |
The type of norm: one, infinity, frobenius, max-modulus, and spectral. |
If type == "O"
then the norm is calculated as the maximum of the
column sums.
If type == "I"
then the norm is calculated as the maximum absolute
value of the row sums.
If type == "F"
then the norm is calculated as the square root of the
sum of the square of the values of the matrix.
If type == "M"
then the norm is calculated as the max of the absolute
value of the values of the matrix.
If type == "2"
then the norm is calculated as the largest singular
value.
A number (length 1 regular vector).
If type == "O"
then the communication consists of an allreduce,
quadratic on the number of columns.
If type == "I"
then the communication conists of an allgather.
If type == "F"
then the communication is an allreduce, quadratic on
the number of columns.
If type == "M"
then the communication consists of an allgather.
If type == "2"
then the communication consists of the same as that of
an svd()
call: an allreduce, quadratic on the number of columns.
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) nm = norm(x) comm.print(nm) finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) nm = norm(x) comm.print(nm) finalize() ## End(Not run)
Performs the principal components analysis.
## S3 method for class 'shaq' prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, ...)
## S3 method for class 'shaq' prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, ...)
x |
A shaq. |
retx |
Should the rotated variables be returned? |
center |
Should columns are zero centered? |
scale. |
Should columns are rescaled to unit variance? |
tol |
Ignored. |
... |
Ignored. |
prcomp()
performs the principal components analysis on the data
matrix by taking the SVD. Sometimes core R and kazaam will disagree
slightly in what the rotated variables are because of how the SVD is
caluclated.
A list of elements sdev
, rotation
, center
, scale
,
and x
, as with R's own prcomp()
. The elements are,
respectively, a regular vector, a regular matrix, a regular vector, a regular
vector, and a shaq.
The communication is an allreduce()
call, quadratic on the number of
columns. Most of the run time should be dominated by relatively expensive
local operations.
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) pca = prcomp(x) comm.print(pca) finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) pca = prcomp(x) comm.print(pca) finalize() ## End(Not run)
Print method for a shaq.
## S4 method for signature 'shaq' print(x, ...) ## S4 method for signature 'shaq' show(object)
## S4 method for signature 'shaq' print(x, ...) ## S4 method for signature 'shaq' show(object)
x , object
|
A shaq. |
... |
Ignored |
The operation is completely local.
## Not run: library(kazaam) x = shaq(1, 10, 3) x # same as print(x) or comm.print(x) finalize() ## End(Not run)
## Not run: library(kazaam) x = shaq(1, 10, 3) x # same as print(x) or comm.print(x) finalize() ## End(Not run)
QR factorization.
qr_R(x) qr_Q(x, R)
qr_R(x) qr_Q(x, R)
x |
A shaq. |
R |
A regular matrix. This argument is optional, in that if it is not supplied explicitly, then it will be computed in the background. But if have already computed R, supplying it here will improve performance (by avoiding needlessly recomputing it). |
is formed by first forming the crossproduct
and taking
its Cholesky factorization. But then
. Inverting
is handled by an efficient triangular inverse routine.
Q (a shaq) or R (a regular matrix).
The operation is completely local except for forming the crossproduct, which
is an allreduce()
call, quadratic on the number of columns.
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) R = qr_R(x) comm.print(R) Q = qr_Q(x, R) Q finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) R = qr_R(x) comm.print(R) Q = qr_Q(x, R) Q finalize() ## End(Not run)
Generate a random shaq object.
ranshaq(generator, nrows, ncols, local = FALSE, ...)
ranshaq(generator, nrows, ncols, local = FALSE, ...)
generator |
A function, such as |
nrows , ncols
|
The number of rows |
local |
Is the problem size |
... |
Additional arguments passed to the generator. |
A shaq.
The operation is entirely local.
## Not run: library(kazaam) # a 10x3 shaq with random uniform data x = ranshaq(runif, 10, 3) x # a (comm.size() * 10)x3 shaq with random normal data y = ranshaq(rnorm, 10, 3, local=TRUE) y finalize() ## End(Not run)
## Not run: library(kazaam) # a 10x3 shaq with random uniform data x = ranshaq(runif, 10, 3) x # a (comm.size() * 10)x3 shaq with random normal data y = ranshaq(rnorm, 10, 3, local=TRUE) y finalize() ## End(Not run)
Centers and/or scales the columns of a distributed matrix.
scale(x, center = TRUE, scale = TRUE) ## S4 method for signature 'shaq,logical,logical' scale(x, center = TRUE, scale = TRUE)
scale(x, center = TRUE, scale = TRUE) ## S4 method for signature 'shaq,logical,logical' scale(x, center = TRUE, scale = TRUE)
x |
A shaq. |
center |
logical value, determines whether or not columns are zero centered |
scale |
logical value, determines whether or not columns are rescaled to unit variance |
A shaq.
The communication consists of two allreduce calls, each quadratic on the number of columns.
## Not run: library(kazaam) x = ranshaq(rnorm, 10, 3, mean=30, sd=10) x scale(x) finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(rnorm, 10, 3, mean=30, sd=10) x scale(x) finalize() ## End(Not run)
Setter functions for shaq objects. Generally not recommended unless you are sure you know what you're doing.
Data(x) <- value ## S4 replacement method for signature 'shaq' Data(x) <- value DATA(x) <- value ## S4 replacement method for signature 'shaq' DATA(x) <- value
Data(x) <- value ## S4 replacement method for signature 'shaq' Data(x) <- value DATA(x) <- value ## S4 replacement method for signature 'shaq' DATA(x) <- value
x |
A shaq. |
value |
The new data. |
Data<-
will perform checks on the inserted data and ensure that the
number of columns match across processors (requiring communication). It will
also udpate the number of rows as necessary.
DATA<-
will perform no checks, so use only if you're really sure that
you know what you're doing.
With Data<-
, a check on the global number of rows is performed. This
amounts to an allgather operation on a logical value (the local dimension
check).
Constructor for shaq objects.
shaq(Data, nrows, ncols, checks = TRUE) ## S3 method for class 'matrix' shaq(Data, nrows, ncols, checks = TRUE) ## S3 method for class 'numeric' shaq(Data, nrows, ncols, checks = TRUE)
shaq(Data, nrows, ncols, checks = TRUE) ## S3 method for class 'matrix' shaq(Data, nrows, ncols, checks = TRUE) ## S3 method for class 'numeric' shaq(Data, nrows, ncols, checks = TRUE)
Data |
The local submatrix. |
nrows , ncols
|
The GLOBAL number of rows and columns. |
checks |
Logical. Should some basic dimension checks be performed? Note that these require communication, and with many MPI ranks, could be expensive. |
If nrows
and/or ncols
is missing, then it will be imputed.
This means one must be especially careful to manually provide ncols
if some of ranks have "placeholder data" (a 0x0 matrix), which is typical
when reading from a subset of processors and then broadcasting out to the
remainder.
If checks=TRUE
, a check on the global number of rows is performed.
This amounts to an allgather operation on a logical value (the local
dimension check).
An S4 container for a distributed tall/skinny matrix.
The (conceptual) global (non-distributed) matrix should be distributed by row, meaning that each submatrix should own all of the columns of the global matrix. Most methods assume no other real structure, however for best performance (and for the methods which require it), one should try to organize their distributed data in a particular way.
First, adjacent MPI ranks should hold adjacent rows. So if the last row that
rank k
owns is i
, then the first row that rank k+1
owns
should be row i+1
. Additionally, any method that operates on two (or
more) shaq objects, the two shaqs should be distributed identically. By this
we mean that if the number of rows shaq A
owns on rank k
is
k_i
, then the number of rows shaq B
owns on rank k
should also be k_i
.
Finally, for best performance, one should generally try to keep the number of rows "balanced" (roughly equal) across processes, with perhaps the last "few" having one less row than the others.
DATA
The local submatrix.
nrows,
ncols The global matrix dimension.
Singular value decomposition.
## S4 method for signature 'shaq' svd(x, nu = min(n, p), nv = min(n, p), LINPACK = FALSE)
## S4 method for signature 'shaq' svd(x, nu = min(n, p), nv = min(n, p), LINPACK = FALSE)
x |
A shaq. |
nu |
number of left singular vectors to return. |
nv |
number of right singular vectors to return. |
LINPACK |
Ignored. |
The factorization works by first forming the crossproduct
and then taking its eigenvalue decomposition. In this case, the square root
of the eigenvalues are the singular values. If the left/right singular
vectors
or
are desired, then in either case,
is
computed (the eigenvectors). From these,
can be reconstructed, since
if
, then
.
A list of elements d
, u
, and v
, as with R's own
svd()
. The elements are, respectively, a regular vector, a shaq, and
a regular matrix.
The operation is completely local except for forming the crossproduct, which
is an allreduce()
call, quadratic on the number of columns.
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) svd = svd(x) comm.print(svd$d) # a globally owned vector svd$u # a shaq comm.print(svd$v) # a globally owned matrix finalize() ## End(Not run)
## Not run: library(kazaam) x = ranshaq(runif, 10, 3) svd = svd(x) comm.print(svd$d) # a globally owned vector svd$u # a shaq comm.print(svd$v) # a globally owned matrix finalize() ## End(Not run)