Title: | Generalized Linear Models (GLM) for Large Data Sets |
---|---|
Description: | Allows the user to carry out GLM on very large data sets. The package provides data_frame and data_matrix objects that store large data on disk. A data_frame is created with the data_frame() function, and further data can be appended with object$append(data). The data is stored as doubles in binary format; character columns are converted to factors and stored as numeric (binary) data, with a look-up table kept in a separate .meta_data file in the same folder. The data is stored in blocks, and the GLM fitting algorithm is modified to use a MapReduce-like procedure to fit the model. The functions bglm(), summary() and bglm_predict() are available for creating and post-processing models. The library requires Armadillo to be installed on your system. It may not function on Windows, since multi-core processing is done using mclapply(), which forks R on Unix/Linux-type operating systems. |
Authors: | Chibisi Chima-Okereke <[email protected]> |
Maintainer: | Chibisi Chima-Okereke <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.5 |
Built: | 2024-11-06 06:29:31 UTC |
Source: | CRAN |
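The MapReduce-like fitting idea in the description can be illustrated with a minimal base-R sketch. This is not the package's implementation: the helper names map_block and reduce_blocks, the in-memory list of blocks standing in for on-disk chunks, and the simulated data are all assumptions made for illustration. Each block contributes its piece of the weighted normal equations, the pieces are summed, and the coefficients are updated until they stop changing.

```r
# Base-R sketch of block-wise IRLS for logistic regression (not bigReg code).
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
dat <- data.frame(y = y, x = x)

# Split the rows into 4 blocks of 50, standing in for on-disk chunks.
blocks <- split(dat, rep(seq_len(4), each = 50))

# Map step (hypothetical helper): per-block pieces of the weighted normal equations.
map_block <- function(block, beta) {
  X <- cbind(1, block$x)
  eta <- drop(X %*% beta)
  mu <- plogis(eta)
  w <- mu * (1 - mu)               # IRLS weights for the logit link
  z <- eta + (block$y - mu) / w    # working response
  list(XWX = crossprod(X, w * X), XWz = crossprod(X, w * z))
}

# Reduce step (hypothetical helper): add the pieces from two blocks.
reduce_blocks <- function(a, b) {
  list(XWX = a$XWX + b$XWX, XWz = a$XWz + b$XWz)
}

beta <- c(0, 0)
for (iter in 1:25) {
  total <- Reduce(reduce_blocks, lapply(blocks, map_block, beta = beta))
  beta_new <- drop(solve(total$XWX, total$XWz))
  converged <- max(abs(beta_new - beta)) < 1e-8
  beta <- beta_new
  if (converged) break
}

# Reference fit on the full data with stats::glm for comparison.
fit <- glm(y ~ x, family = binomial(), data = dat)
```

Because the cross-products are sums over rows, only one block needs to be in memory at a time, and the blocks can be processed in parallel before the reduction.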
Function for creating control parameters for the GLM fit
.control(epsilon = 1e-08, maxit = 25, trace = TRUE)
epsilon |
defaults to 1E-8 |
maxit |
maximum number of iterations; defaults to 25 |
trace |
defaults to TRUE |
converts numeric vector to integer
asInteger(x)
x |
numeric vector |
Function to carry out generalized linear regression on a data_frame data object
bglm( formula, family = gaussian_(), data, weights = NULL, offset = NULL, start = NULL, control = list(), etastart = NULL, mustart = NULL )
formula |
formula that defines your regression model |
family |
family object, e.g. gaussian_(), binomial_(), poisson_(), quasipoisson_(), quasibinomial_(), Gamma_(), inverse.gaussian_(), quasi_() |
data |
data_frame object containing data for linear regression |
weights |
weights for the model |
offset |
offsets for the model |
start |
starting values for the coefficients |
control |
list of parameters for .control() function |
etastart |
starting values for the linear predictor |
mustart |
starting values for vector of means |
require(parallel)
data("plasma", package = "bigReg")
data_dir = tempdir()
plasma1 <- plasma
plasma1 <- data_frame(plasma1, 10, path = data_dir, nCores = 1)
plasma_glm <- bglm(ESR ~ fibrinogen + globulin, data = plasma1, family = binomial_("logit"))
summary(plasma_glm)
predict function for bglm object
bglm_predict( mf = stop("mf: model frame must be supplied"), object = stop("object: bglm object must be supplied"), type = stop("type: either \"link\", \"response\", \"terms\"") )
mf |
model frame |
object |
a bglm object |
type |
one of c("link", "response", "terms") |
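The difference between the "link" and "response" prediction types can be seen with stats::glm from base R. This is an illustration by analogy only; it assumes bglm_predict() follows the usual GLM convention and does not call the package. The "terms" type is not shown.

```r
# Base-R illustration with stats::glm (assumed analogous, not bigReg output).
fit <- glm(am ~ wt, family = binomial(), data = mtcars)

eta <- predict(fit, newdata = mtcars, type = "link")      # linear predictor scale
mu  <- predict(fit, newdata = mtcars, type = "response")  # mean (probability) scale

# The response scale is the inverse link applied to the link scale.
same <- isTRUE(all.equal(mu, family(fit)$linkinv(eta)))
```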
binomial family function
binomial_(link = "logit")
link |
character string naming the link function |
Function to carry out linear regression on a data_frame data object
blm( formula = stop("formula: not supplied"), data = stop("data: data not supplied"), control = list(), weights = NULL, offset = NULL )
formula |
formula that defines your regression model |
data |
data_frame object containing data for linear regression |
control |
list of parameters for .control() function |
weights |
weights for the model |
offset |
offsets for the model |
The CreateFactor function creates a factor from a numeric vector and a character vector of levels
CreateFactor(x, levels)
x |
numeric vector containing the numeric indices of the levels |
levels |
character vector levels |
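The mapping from numeric indices to a factor can be sketched in base R as follows. This is a rough equivalent for illustration, not the package's own code, and the values of codes and lvls are made up.

```r
# Base-R equivalent of building a factor from level indices (illustrative only).
codes <- c(1, 3, 2, 3)              # numeric indices into the levels
lvls <- c("low", "mid", "high")     # character vector of levels

f <- factor(lvls[codes], levels = lvls)
```

Passing levels explicitly keeps the level order as given rather than sorting it alphabetically, so the integer codes of the result match the input indices.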
Function to create a data_frame object. The data_frame object is held on disk: it is written to the folder given by path, in blocks (chunks). The data is written in binary format, as purely numerical data, using a C++ function, and a mapping back to the table is held in a ".meta_data" file in the folder. The table object accommodates numeric, factor, and character (converted to factor) columns.
data_frame( data = stop("data must be supplied"), chunkSize = stop("chunkSize must be specified, a good number is 50000"), path = stop("path must be specified"), nCores = parallel::detectCores(), ... )
data |
data.frame object to be converted into a data_frame object |
chunkSize |
number of rows to be used in each chunk |
path |
character string giving the path to the folder where the object will be created |
nCores |
the number of cores to use; defaults to parallel::detectCores() |
... |
not currently used. |
Creates a data_frame object
irisA <- data_frame(iris[1:75,], 10, "irisA", nCores = 1)
irisA$append(iris[76:150,])
irisA$head()
irisA$tail(10)
irisA$delete(); rm(irisA)
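The storage scheme described for data_frame (doubles in binary form, with character columns turned into factor codes and a separate look-up table) can be sketched in base R. The file names block_1.bin and lookup.rds and the layout of the look-up list are hypothetical; the package's actual ".meta_data" format is not documented here and may differ.

```r
# Base-R sketch of numeric block storage with a level look-up (not bigReg's format).
df <- data.frame(len = c(5.1, 4.9, 6.3),
                 species = c("setosa", "setosa", "virginica"),
                 stringsAsFactors = FALSE)

store_dir <- tempfile("block_store")
dir.create(store_dir)
bin_path <- file.path(store_dir, "block_1.bin")   # hypothetical file name
meta_path <- file.path(store_dir, "lookup.rds")   # hypothetical file name

# character -> factor -> numeric codes; the levels are kept as the look-up table.
species <- factor(df$species)
meta <- list(names = names(df), levels = list(species = levels(species)))
block <- cbind(df$len, as.numeric(species))

# Write the block as raw 8-byte doubles, column by column.
con <- file(bin_path, "wb")
writeBin(as.vector(block), con, size = 8)
close(con)
saveRDS(meta, meta_path)

# Read the block back and restore the character column from the look-up table.
con <- file(bin_path, "rb")
vals <- readBin(con, what = "double", n = length(block), size = 8)
close(con)
back <- matrix(vals, ncol = 2)
meta2 <- readRDS(meta_path)
restored <- data.frame(len = back[, 1],
                       species = meta2$levels$species[back[, 2]],
                       stringsAsFactors = FALSE)
```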
Function to create a data_matrix object. The data_matrix object is held on disk: it is written to the folder given by path, in blocks (chunks). The data is written in binary format, as purely numerical data, using a C++ function.
data_matrix( data = stop("data: matrix must be supplied"), chunkSize = stop("chunkSize must be specified, a good number is 50000"), path = stop("path must be specified"), nCores = parallel::detectCores(), ... )
data |
object to be converted into a data_matrix object |
chunkSize |
number of rows to be used in each chunk |
path |
character string giving the path to the folder where the object will be created |
nCores |
the number of cores to use; defaults to parallel::detectCores() |
... |
not used at the moment |
Creates a data_matrix object
family function
family_(distr, link)
distr |
character, one of "binomial", "poisson", "gaussian", "quasipoisson", "quasibinomial", "Gamma", "inverse.gaussian", "quasi" |
link |
character string naming the link function |
Gamma family function
Gamma_(link = "inverse")
link |
character string naming the link function |
gaussian family function
gaussian_(link = "identity")
link |
character string naming the link function |
inverse.gaussian family function
inverse.gaussian_(link = "1/mu^2")
link |
character string naming the link function |
function to load data_frame object
load_data_frame(path = stop("path: to data_frame folder must be supplied"))
path |
character string giving the path to the folder containing the object |
function to load data_matrix object
load_data_matrix(path = stop("path: to data_matrix folder must be supplied"))
path |
character string giving the path to the folder containing the object |
finds whether x is in y
myIn(x, y)
x |
item to be sought |
y |
vector to be matched against |
a function to create a sequence of integers
mySeq(start, end)
start |
integer from where sequence should start |
end |
integer where sequence should end |
Dataset from the HSAUR package
data(plasma)
a data.frame
...
HSAUR R package
data(plasma)
head(plasma)
poisson family function
poisson_(link = "log")
link |
character string naming the link function |
print function for the bglm object
## S3 method for class 'bglm'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
x |
bglm object to be displayed |
digits |
number of significant digits to use |
... |
not yet used |
print function for the blm object
## S3 method for class 'blm'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
x |
blm object to be displayed |
digits |
number of significant digits to use |
... |
not yet used |
print function for a data_frame
## S3 method for class 'data_frame'
print(x, ...)
x |
data_frame object to print |
... |
not used |
print function for a data_matrix
## S3 method for class 'data_matrix'
print(x, ...)
x |
data_matrix object to print |
... |
not used |
Function to print the summary object from the bglm object
## S3 method for class 'summary.bglm'
print( x, digits = max(3L, getOption("digits") - 3L), signif.stars = getOption("show.signif.stars"), ... )
x |
summary bglm object |
digits |
the number of significant digits to be displayed |
signif.stars |
passed to printCoefmat |
... |
arguments passed to |
Function to print the summary object from the blm object
## S3 method for class 'summary.blm'
print( x, digits = max(3L, getOption("digits") - 3L), signif.stars = getOption("show.signif.stars"), ... )
x |
summary blm object |
digits |
the number of significant digits to be displayed |
signif.stars |
passed to printCoefmat |
... |
arguments passed to |
Function to process one data block during the bglm fit (the map step of the algorithm)
process_bglm_block( mf, formula, mmCall, family, offset, weights, start, niter, etastart, mustart )
mf |
the data block to be processed |
formula |
the formula for the model |
mmCall |
the call object of the model |
family |
the family object for the model |
offset |
the model offset |
weights |
the model weights |
start |
the starting coefficient estimates |
niter |
the current number of iterations |
etastart |
the start for eta |
mustart |
the start for mu |
quasi family function
quasi_(link = "identity", variance = "constant")
link |
character string naming the link function |
variance |
character string naming the variance function |
quasibinomial family function
quasibinomial_(link = "logit")
link |
character string naming the link function |
quasipoisson family function
quasipoisson_(link = "log")
link |
character string naming the link function |
row binding for benchmarking
r_bind(x, y)
x |
first matrix to be bound together |
y |
second matrix to be bound together |
read data frame block from file
read_df_block(size, filePath, df, ncol, factors, factor_indices)
size |
number of elements in the block |
filePath |
path to where the block is stored |
df |
an empty list having the same number of elements as columns in the table |
ncol |
number of columns in the dataframe block |
factors |
list containing factors |
factor_indices |
numeric vector containing the indices that denote the factors |
read multiple blocks of data frames from file
read_df_blocks(size, filePaths, df, ncols, factors, factor_indices)
size |
number of elements in each block |
filePaths |
paths to where the blocks are stored |
df |
an empty list having the same number of elements as columns in the table |
ncols |
number of columns in the dataframe block |
factors |
list containing factors |
factor_indices |
numeric vector containing the indices that denote the factors |
read matrix block from file
read_matrix_block(filePath, size, ncol)
filePath |
path to file where matrix should be read from |
size |
total number of elements to be read |
ncol |
number of columns in the matrix |
read matrix blocks from file
read_matrix_blocks(filePaths, size, ncols)
filePaths |
file paths from where the matrix blocks will be read |
size |
numeric vector containing the number of elements in each block |
ncols |
number of columns in the matrix |
reads numeric vector from file
readNumericVector(size, filePath)
size |
the length of the numeric vector |
filePath |
path to file from which the numeric vector should be read |
The reduction function for the algorithm
sum_bglm_block(x1, x2)
x1 |
the first list object to be reduced |
x2 |
the second list object to be reduced |
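A reduction over two list objects can be sketched generically in base R. The field names XTX and XTy, the helper add_lists, and the toy data are assumptions for illustration; the actual contents of the lists that sum_bglm_block() combines are not documented here.

```r
# Base-R sketch of an element-wise list reduction (field names are hypothetical).
X <- cbind(1, c(1, 2, 3, 4, 5, 6))
y <- c(2, 4, 5, 4, 5, 7)

# Three blocks of two rows each, each yielding a list of partial cross-products.
idx <- split(seq_len(6), rep(1:3, each = 2))
parts <- lapply(idx, function(i) {
  Xi <- X[i, , drop = FALSE]
  list(XTX = crossprod(Xi), XTy = crossprod(Xi, y[i]))
})

# Add two lists component by component; names are taken from the first list.
add_lists <- function(x1, x2) Map(`+`, x1, x2)

total <- Reduce(add_lists, parts)
```

Since addition is associative, the blocks can be combined in any grouping, which is what allows partial results from different cores to be merged.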
summary function for the bglm object
## S3 method for class 'bglm'
summary(object, ...)
object |
bglm object to be summarized |
... |
not used |
summary function for the blm object
## S3 method for class 'blm'
summary(object, ...)
object |
blm object to be summarized |
... |
not used |
Singular value decomposition of the aggregated list from the XWXMatrix() and XWXMatrixW() functions
SVD(out, epsilon)
out |
list containing requisite computed values |
epsilon |
either machine epsilon or user-determined epsilon |
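One plausible way to use a singular value decomposition with a tolerance on aggregated cross-products is sketched below in base R. What SVD() computes internally is not stated here, so the helper svd_solve, the field names XTX and XTy, and the relative-threshold rule are all assumptions.

```r
# Base-R sketch: solve the normal equations via SVD with a tolerance (illustrative).
X <- cbind(1, c(1, 2, 3, 4, 5, 6))
y <- c(2, 4, 5, 4, 5, 7)
out <- list(XTX = crossprod(X), XTy = crossprod(X, y))  # hypothetical field names

svd_solve <- function(out, epsilon = .Machine$double.eps) {
  s <- svd(out$XTX)
  keep <- s$d > epsilon * max(s$d)     # ignore negligible singular values
  d_inv <- ifelse(keep, 1 / s$d, 0)
  # Pseudo-inverse applied to X'y: V diag(1/d) U' (X'y)
  drop(s$v %*% (d_inv * crossprod(s$u, out$XTy)))
}

beta <- svd_solve(out)

# Reference least-squares solution from base R.
ref <- unname(lm.fit(X, y)$coefficients)
```

Zeroing the reciprocal of near-zero singular values yields a minimum-norm solution when the design is rank deficient, instead of failing as a direct solve would.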
writes numeric vector to file
write_numeric_vector(v, filePath)
v |
numeric vector to be written to file |
filePath |
path to file where the numeric vector should be written |
writes numeric vector to file
writeNumericVector(v, filePath)
v |
numeric vector |
filePath |
path to file where the numeric vector should be written |
Calculation of iterative regression components
XWXMatrix(X, y)
X |
design matrix |
y |
dependent variable |
Calculation of iterative regression components
XWXMatrixW(X, y, W)
X |
design matrix |
y |
dependent variable |
W |
weights |