BigDataStatMeth provides statistical and linear algebra
operations for matrices stored in HDF5 files. The package is designed
for workflows in which matrices may be too large to be held entirely in
memory, while still allowing users to work with familiar R
functions.
The examples in this vignette use small matrices so that they can be run quickly and compared with standard in-memory R results. The same interface is intended for larger HDF5-backed matrices. All files created in the examples are written to temporary locations.
The recommended user-facing interface is based on
HDF5Matrix objects and standard R generics. Thus,
HDF5-backed matrices can be manipulated using calls such as
dim(), [, %*%,
crossprod(), scale(), cor(),
svd(), qr(), and prcomp().
Internally, BigDataStatMeth uses an R6 and C++ backend.
Advanced users may interact with lower-level interfaces when needed, but
this vignette focuses on the S3 interface because it allows code to
remain close to ordinary R matrix workflows.
The following table summarizes the main user-facing functionality
available for HDF5Matrix objects. It is not an exhaustive
list of all exported functions; rather, it highlights the standard
R-style methods and high-level helpers used throughout this vignette.
Additional domain-specific, lower-level, and compatibility functions are
documented in their corresponding help pages.
| Category | Representative calls |
|---|---|
| Core object handling | hdf5_create_matrix(),
hdf5_matrix(), dim(), nrow(),
ncol(), is_open(), close() |
| HDF5 inspection and I/O | list_datasets(),
hdf5_import(), hdf5_import_multiple(),
as.matrix(), as.data.frame() |
| Subsetting and assignment | X[i, j],
X[i, j] <- value |
| Dimension names | rownames(), colnames(),
dimnames() |
| Element-wise arithmetic | X + Y, X - Y,
X * Y, X / Y |
| Matrix algebra | %*%, crossprod(),
tcrossprod() |
| Binding | cbind(), rbind() |
| Aggregations | colSums(), rowSums(),
colMeans(), rowMeans(),
colVars(), rowVars(), colSds(),
rowSds(), colMins(), rowMins(),
colMaxs(), rowMaxs() |
| Scalar summaries | mean(), var(),
sd() |
| Normalization and transformations | scale(), sweep() |
| Correlation | cor() |
| Matrix decompositions | svd(), prcomp(),
eigen(), pseudoinverse() |
| Factorizations and solvers | qr(), chol(),
solve() |
| Diagonal operations | diag(), diag<-(),
diag_op(), diag_scale() |
| Split, reduce, and apply | split(), split_dataset(),
reduce(), hdf5_reduce(),
apply_function(), hdf5_apply() |
| Resource management and options | hdf5matrix_options(),
show_hdf5matrix_options(),
hdf5_close_all() |
| Specialized high-level helpers | selected bd* functions for utilities
without a direct standard R generic, such as
bdCreate_hdf5_group(), bdmove_hdf5_dataset(),
and bdWrite_hdf5_dimnames() |
Most examples in this vignette use the standard
HDF5Matrix/S3 interface. Some high-level helper functions
keep the bd* prefix because they provide specialized
functionality that does not map directly to an existing base R generic.
These functions remain part of the package API and are documented in
their corresponding help pages.
Some S3 operators, such as element-wise arithmetic, follow the standard R syntax and do not expose additional arguments in the function call. Global options make it possible to configure block-wise processing, parallelization, the number of threads, and compression for such operations.
old_opts <- hdf5matrix_options()
old_opts
$paral
NULL
$block_size
NULL
$threads
NULL
$compression
NULLFor example, the following settings enable parallel execution with two threads, use a fixed block size, and set the compression level used for new output datasets when not explicitly specified by a method call.
hdf5matrix_options(
paral = TRUE,
threads = 2L,
block_size = 512L,
compression = 6L
)
hdf5matrix_options()
$paral
[1] TRUE
$block_size
[1] 512
$threads
[1] 2
$compression
[1] 6When an individual method provides explicit arguments, those
arguments should be used for local control of that operation. The global
options are especially useful for generic operators where the standard R
call is kept unchanged. Additional operation-specific parameters are
documented in the corresponding help pages, for example
?svd.HDF5Matrix, ?prcomp.HDF5Matrix,
?qr.HDF5Matrix, and ?hdf5matrix_options.
The function hdf5_create_matrix() creates an HDF5
dataset and returns an HDF5Matrix object pointing to it. In
the example below, a standard R matrix is written to an HDF5 file and
then manipulated through the S3 interface.
h5file <- tempfile(fileext = ".h5")
set.seed(123)
X <- matrix(rnorm(500 * 100), nrow = 500, ncol = 100)
X_h5 <- hdf5_create_matrix(
filename = h5file,
dataset = "data/X",
data = X,
overwrite = TRUE
)
X_h5
HDF5Matrix object
File: /tmp/Rtmpgjoek3/filed051193099b.h5
Path: data/X
Dimensions: 500 x 100
Type:
Status: OPEN
dim(X_h5)
[1] 500 100
nrow(X_h5)
[1] 500
ncol(X_h5)
[1] 100The datasets stored in an HDF5 file can be listed with
list_datasets(). Existing datasets can be reopened with
hdf5_matrix().
list_datasets(h5file)
[1] "data/X"
X_h5_reopened <- hdf5_matrix(
filename = h5file,
path = "data/X"
)
dim(X_h5_reopened)
[1] 500 100Operations that return an HDF5Matrix object write their
results to an HDF5 dataset. The printed object shows the file and
dataset path. Some methods, such as crossprod() and
tcrossprod(), allow the output group and dataset name to be
specified explicitly. Other S3 operators, such as element-wise
arithmetic, currently use package-defined output names.
Delimited text files can be imported directly into HDF5 with
hdf5_import(). The following example uses the small
cholesterol data file distributed with the package.
In the final package layout, the file should be available under
inst/extdata/colesterol.csv. The fallback below is included
only to allow this vignette to be tested against older source layouts in
which the same file was stored elsewhere.
# Generate reproducible example data matching the cholesterol dataset structure
set.seed(4214123)
n <- 1000L
colesterol <- data.frame(
TCholesterol = rnorm(n, mean = 195.8, sd = 30.1),
Age = rnorm(n, mean = 58.4, sd = 9.9),
Insulin = rnorm(n, mean = 12.5, sd = 7.4),
Creatinine = rnorm(n, mean = 0.798, sd = 0.098),
BUN = rnorm(n, mean = 14.6, sd = 3.4),
LLDR = rnorm(n, mean = 0.981, sd = 0.450),
Triglycerides= rnorm(n, mean = 136.6, sd = 55.1),
HDL_C = rnorm(n, mean = 48.2, sd = 9.2),
LDL_C = rnorm(n, mean = 116.4, sd = 25.5),
Sex = rbinom(n, 1, prob = 0.58)
)
csv_file <- tempfile(fileext = ".csv")
write.csv(colesterol, csv_file, row.names = FALSE)
h5_csv <- tempfile(fileext = ".h5")
cholesterol_h5 <- hdf5_import(
source = csv_file,
filename = h5_csv,
dataset = "cholesterol/data",
sep = ",",
header = TRUE,
overwrite = TRUE
)
cholesterol_h5
HDF5Matrix object
File: /tmp/Rtmpgjoek3/filed05625eec92.h5
Path: cholesterol/data
Dimensions: 1000 x 10
Type:
Status: OPEN
dim(cholesterol_h5)
[1] 1000 10
cholesterol_h5[1:5, 1:min(6, ncol(cholesterol_h5))]
TCholesterol Age Insulin Creatinine BUN LLDR
[1,] 210.6167 71.49745 2.109771 0.7787582 15.315296 0.7306967
[2,] 231.8875 53.95919 15.652611 0.8264965 9.401774 1.2230963
[3,] 191.7023 56.31360 7.744493 0.9293400 5.071212 1.2897677
[4,] 214.3665 45.09847 22.989093 0.8592065 14.564008 0.8122450
[5,] 202.0006 39.70275 11.293069 0.7682224 18.492814 1.0190207The [ operator can be used to read subsets from an
HDF5Matrix object. Only the requested rows and columns are
read.
X_h5[1:5, 1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.56047565 -0.6018928 -0.99579872 -0.8209867 -0.5116037 -0.6788076
[2,] -0.23017749 -0.9936986 -1.03995504 -0.3072572 0.2369379 0.5743127
[3,] 1.55870831 1.0267851 -0.01798024 -0.9020980 -0.5415892 -0.7045145
[4,] 0.07050839 0.7510613 -0.13217513 0.6270687 1.2192276 -0.5339841
[5,] 0.12928774 -1.5091665 -2.54934277 1.1203550 0.1741359 0.7743846
sub_X <- X_h5[1:20, 1:10]
dim(sub_X)
[1] 20 10The replacement form [<- writes values back to the
HDF5 dataset. Changes are applied directly to the data stored on
disk.
X_edit <- hdf5_create_matrix(
h5file,
"data/X_edit",
data = X[1:10, 1:6],
overwrite = TRUE
)
X_edit[1, 1] <- 999
X_edit[1:3, 1:3]
[,1] [,2] [,3]
[1,] 999.0000000 -0.6018928 -0.99579872
[2,] -0.2301775 -0.9936986 -1.03995504
[3,] 1.5587083 1.0267851 -0.01798024Dimension names can also be stored and retrieved with the usual R accessors.
DN_h5 <- hdf5_create_matrix(
h5file,
"data/dimnames_example",
data = matrix(seq_len(12), nrow = 4, ncol = 3),
overwrite = TRUE
)
rownames(DN_h5) <- paste0("sample_", seq_len(nrow(DN_h5)))
colnames(DN_h5) <- paste0("feature_", seq_len(ncol(DN_h5)))
rownames(DN_h5)
[1] "sample_1" "sample_2" "sample_3" "sample_4"
colnames(DN_h5)
[1] "feature_1" "feature_2" "feature_3"
dimnames(DN_h5)
[[1]]
[1] "sample_1" "sample_2" "sample_3" "sample_4"
[[2]]
[1] "feature_1" "feature_2" "feature_3"The function as.matrix() reads an HDF5-backed matrix
into memory. It is useful for small examples or final results, but
should be used with care for very large datasets.
X_small <- as.matrix(X_h5[1:10, 1:5])
X_small
[,1] [,2] [,3] [,4] [,5]
[1,] -0.56047565 -0.60189285 -0.99579872 -0.8209867 -0.5116037
[2,] -0.23017749 -0.99369859 -1.03995504 -0.3072572 0.2369379
[3,] 1.55870831 1.02678506 -0.01798024 -0.9020980 -0.5415892
[4,] 0.07050839 0.75106130 -0.13217513 0.6270687 1.2192276
[5,] 0.12928774 -1.50916654 -2.54934277 1.1203550 0.1741359
[6,] 1.71506499 -0.09514745 1.04057346 2.1272136 -0.6152683
[7,] 0.46091621 -0.89594782 0.24972574 0.3661144 -1.8068930
[8,] -1.26506123 -2.07075107 2.41620737 -0.8747814 -0.6436811
[9,] -0.68685285 0.15012013 0.68519824 1.0244749 2.0460189
[10,] -0.44566197 -0.07921171 -0.44695931 0.9047589 -0.5607624Element-wise arithmetic between HDF5Matrix objects can
be performed using the standard arithmetic operators. The results are
stored as new HDF5-backed matrices.
set.seed(1)
A <- matrix(rnorm(300 * 80), nrow = 300, ncol = 80)
B <- matrix(rnorm(300 * 80), nrow = 300, ncol = 80)
C <- matrix(rnorm(80 * 40), nrow = 80, ncol = 40)
A_h5 <- hdf5_create_matrix(
h5file, "data/A", data = A,
overwrite = TRUE
)
B_h5 <- hdf5_create_matrix(
h5file, "data/B", data = B,
overwrite = TRUE
)
C_h5 <- hdf5_create_matrix(
h5file, "data/C", data = C,
overwrite = TRUE
)
S_h5 <- A_h5 + B_h5
D_h5 <- A_h5 - B_h5
S_h5
HDF5Matrix object
File: /tmp/Rtmpgjoek3/filed051193099b.h5
Path: OUTPUT/A_plus_B
Dimensions: 300 x 80
Type:
Status: OPEN
dim(S_h5)
[1] 300 80
all.equal(as.matrix(S_h5), A + B)
[1] TRUE
all.equal(as.matrix(D_h5), A - B)
[1] TRUEThe output datasets can be inspected directly from the file. This is useful when results need to be reopened later in the workflow.
Matrix multiplication uses the same %*% syntax as
ordinary R matrices, while the computation is performed block-wise on
the HDF5-backed data.
Crossproducts are available through the standard R calls. These methods also allow the output group and dataset name to be specified explicitly.
XtX_h5 <- crossprod(
A_h5,
outgroup = "RESULTS",
outdataset = "A_crossprod"
)
XXt_h5 <- tcrossprod(
A_h5,
outgroup = "RESULTS",
outdataset = "A_tcrossprod"
)
XtX_h5
HDF5Matrix object
File: /tmp/Rtmpgjoek3/filed051193099b.h5
Path: RESULTS/A_crossprod
Dimensions: 80 x 80
Type:
Status: OPEN
list_datasets(h5file, group = "RESULTS", recursive = TRUE)
[1] "A_crossprod" "A_tcrossprod"
all.equal(as.matrix(XtX_h5), crossprod(A))
[1] TRUE
all.equal(as.matrix(XXt_h5), tcrossprod(A))
[1] TRUEThe functions cbind() and rbind() can be
used to combine compatible HDF5-backed matrices by columns or rows.
A1_h5 <- hdf5_create_matrix(
h5file,
"bind/A1",
data = A[1:50, 1:10],
overwrite = TRUE
)
A2_h5 <- hdf5_create_matrix(
h5file,
"bind/A2",
data = A[1:50, 11:20],
overwrite = TRUE
)
Cbind_h5 <- cbind(
A1_h5, A2_h5,
out_group = "BIND_RESULTS",
out_dataset = "A1_A2_cbind",
overwrite = TRUE
)
Rbind_h5 <- rbind(
A1_h5, A1_h5,
out_group = "BIND_RESULTS",
out_dataset = "A1_A1_rbind",
overwrite = TRUE
)
dim(Cbind_h5)
[1] 50 20
dim(Rbind_h5)
[1] 100 10
all.equal(as.matrix(Cbind_h5), cbind(A[1:50, 1:10], A[1:50, 11:20]))
[1] TRUE
all.equal(as.matrix(Rbind_h5), rbind(A[1:50, 1:10], A[1:50, 1:10]))
[1] TRUEMany summary operations return standard R vectors or scalars. This is convenient when the output is small relative to the input matrix.
all.equal(colMeans(A_h5), colMeans(A))
[1] TRUE
all.equal(rowSums(A_h5), rowSums(A))
[1] TRUE
all.equal(colVars(A_h5), apply(A, 2, var))
[1] TRUE
all.equal(rowSds(A_h5), apply(A, 1, sd))
[1] TRUEThe scale() method centers and/or scales an HDF5-backed
matrix by blocks and stores the result on disk.
A_scaled_h5 <- scale(A_h5)
A_scaled <- scale(A)
all.equal(
as.matrix(A_scaled_h5),
A_scaled,
check.attributes = FALSE
)
[1] TRUECorrelation matrices can be computed directly from an
HDF5Matrix.
Some operations can be combined with small HDF5-backed vectors. The
following example subtracts column means from each column using
sweep().
col_means_h5 <- hdf5_create_matrix(
h5file,
"stats/A_col_means",
data = matrix(colMeans(A), nrow = 1),
overwrite = TRUE
)
A_centered_h5 <- sweep(A_h5, MARGIN = 2, STATS = col_means_h5, FUN = "-")
all.equal(
as.matrix(A_centered_h5),
sweep(A, MARGIN = 2, STATS = colMeans(A), FUN = "-"),
check.attributes = FALSE
)
[1] TRUEBigDataStatMeth implements several matrix decompositions
for HDF5-backed matrices. The examples below use small matrices to
validate results against standard R computations.
The SVD method can center and scale the input before decomposition. The returned singular values are stored in memory, while singular vectors are stored as HDF5-backed matrices.
set.seed(123)
X_svd <- matrix(rnorm(120 * 300), nrow = 120, ncol = 300)
X_svd_h5 <- hdf5_create_matrix(
h5file,
"decomp/X_svd",
data = X_svd,
overwrite = TRUE
)
svd_h5 <- svd(
X_svd_h5,
nu = 10,
nv = 10,
center = TRUE,
scale = TRUE,
overwrite = TRUE
)
head(svd_h5$d)
[1] 27.48157 27.27928 26.93254 26.60646 26.34913 25.95616
dim(svd_h5$u)
[1] 120 10
dim(svd_h5$v)
[1] 300 10The first singular values can be compared with the corresponding in-memory R calculation.
svd_r <- svd(scale(X_svd), nu = 10, nv = 10)
all.equal(
svd_h5$d[1:10],
svd_r$d[1:10],
tolerance = 1e-6
)
[1] TRUEFor large matrices, svd() can use a block-wise strategy.
The parameter k controls the number of local SVDs per
incremental level, whereas q controls the number of
incremental levels used to combine intermediate decompositions. Larger
values may reduce the size of each local problem, but they can also
increase overhead. Therefore, these parameters should be selected
according to the matrix dimensions and the available hardware. The
argument threads controls the number of threads used by
parallel parts of the computation.
The arguments nu and nv follow the same
idea as in base::svd(): nu controls how many
left singular vectors are returned, and nv controls how
many right singular vectors are returned. Requesting only the leading
components is useful when the complete set of singular vectors is not
needed.
The prcomp() method provides a PCA interface for
HDF5-backed matrices. The result includes standard PCA summaries
together with HDF5-backed matrices for quantities such as scores and
loadings. The argument ncomponents controls the number of
principal components computed; a value of 0 requests all
available components.
set.seed(124)
n <- 180
p <- 40
group <- rep(c("Group 1", "Group 2", "Group 3"), each = n / 3)
X_pca <- matrix(rnorm(n * p), nrow = n, ncol = p)
X_pca[group == "Group 2", 1:8] <- X_pca[group == "Group 2", 1:8] + 1.5
X_pca[group == "Group 3", 9:16] <- X_pca[group == "Group 3", 9:16] - 1.5
X_pca_h5 <- hdf5_create_matrix(
h5file,
"decomp/X_pca",
data = X_pca,
overwrite = TRUE
)
pca_h5 <- prcomp(
X_pca_h5,
center = TRUE,
scale. = TRUE,
ncomponents = 5,
overwrite = TRUE
)
pca_h5
HDF5PCA - Principal Component Analysis (on-disk)
File : /tmp/Rtmpgjoek3/filed051193099b.h5
PCs : 5
center : TRUE | scale: TRUE
sdev[1:3]: 2.1775, 1.4867, 1.3418
cumvar% : 11.85, 17.38, 21.88 ...
rotation : 40 x 40 [HDF5Matrix]
x (ind.) : 180 x 40 [HDF5Matrix]
head(pca_h5$sdev)
[1] 2.177534 1.486692 1.341799 1.305983 1.264360The PCA scores are stored in the x component of the
returned object. When x is an HDF5-backed matrix, the
required columns can be subsetted and converted to memory for
visualization. Here the synthetic groups are used only to make the
low-dimensional structure visible.
class(pca_h5$x)
[1] "HDF5Matrix" "HDF5Matrix" "R6"
dim(pca_h5$x)
[1] 180 40
pca_scores <- as.matrix(pca_h5$x[, 1:2])
pca_df <- data.frame(
PC1 = pca_scores[, 1],
PC2 = pca_scores[, 2],
group = group
)
if (requireNamespace("ggplot2", quietly = TRUE)) {
ggplot2::ggplot(pca_df, ggplot2::aes(PC1, PC2, colour = group)) +
ggplot2::geom_point(size = 2.4, alpha = 0.85) +
ggplot2::stat_ellipse(linewidth = 0.6, alpha = 0.7) +
ggplot2::theme_minimal(base_size = 12) +
ggplot2::theme(
legend.position = "top",
panel.grid.minor = ggplot2::element_blank()
) +
ggplot2::labs(
title = "PCA of an HDF5-backed matrix",
subtitle = "Scores returned by prcomp.HDF5Matrix()",
x = "PC1",
y = "PC2",
colour = "Synthetic group"
)
} else {
plot(
pca_df$PC1,
pca_df$PC2,
pch = 19,
xlab = "PC1",
ylab = "PC2",
main = "PCA of an HDF5-backed matrix"
)
}PCA scores computed from an HDF5-backed matrix.
The QR decomposition returns HDF5-backed Q and
R matrices. The argument thin controls whether
a reduced decomposition is returned. With thin = TRUE,
Q has only the columns needed to reconstruct the input
matrix, which is usually preferable for tall matrices because it avoids
storing unnecessary columns. With thin = FALSE, a full
Q matrix is returned.
Rather than comparing the signs of individual columns, the example
validates the factorization and the orthogonality of Q.
QR_h5 <- qr(A_h5, thin = TRUE, overwrite = TRUE)
Q <- as.matrix(QR_h5$Q)
R <- as.matrix(QR_h5$R)
all.equal(Q %*% R, A, tolerance = 1e-6)
[1] TRUE
all.equal(crossprod(Q), diag(ncol(Q)), tolerance = 1e-6)
[1] TRUEFor tall-skinny matrices, the QR method can use the TSQR path. The
method = "auto" setting selects the backend according to
the matrix shape, while method = "tsqr" requests the
tall-skinny algorithm explicitly.
For symmetric positive-definite matrices, chol() and
solve() can be used directly on HDF5Matrix
objects.
set.seed(321)
Z <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)
SPD <- crossprod(Z) + diag(1e-6, 50)
SPD_h5 <- hdf5_create_matrix(
h5file,
"decomp/SPD",
data = SPD,
overwrite = TRUE
)
Ch_h5 <- chol(SPD_h5, overwrite = TRUE)
Inv_h5 <- solve(SPD_h5, overwrite = TRUE)
Ch <- as.matrix(Ch_h5)
chol_error <- min(
max(abs(crossprod(Ch) - SPD)),
max(abs(tcrossprod(Ch) - SPD))
)
chol_error < 1e-6
[1] TRUE
all.equal(as.matrix(Inv_h5), solve(SPD), tolerance = 1e-6)
[1] TRUEAdditional decompositions are available through S3 methods. For
example, eigen() can be applied to symmetric HDF5-backed
matrices, and pseudoinverse() computes the Moore–Penrose
pseudoinverse.
split() divides an HDF5Matrix into
equal-sized blocks stored as separate datasets in the same HDF5 file.
Use n_blocks to specify the number of blocks, or
block_size to specify the maximum number of rows (or
columns, if bycols = TRUE) per block.
fn <- tempfile(fileext = ".h5")
X <- hdf5_create_matrix(fn, "data/X", data = matrix(rnorm(600), 60, 10))
blocks <- split(X, n_blocks = 3) # returns a named list: block_0, block_1, block_2
dim(blocks[["block_0"]]) # 20 x 10
lapply(blocks, close)
close(X)
hdf5_close_all()
unlink(fn)split_dataset() is an equivalent with a cleaner
signature that omits the f and drop parameters
inherited from base::split. Both produce identical
results.
Note: BigDataStatMeth::split() produces an error because
split is a base generic — this is the same
behaviour as BigDataStatMeth::cor() or
BigDataStatMeth::svd(). The correct call is simply
split(X, n_blocks = 3) after loading the package.
reduce() accumulates all datasets in the same HDF5 group
as x into a single dataset using "+" or
"-". This is useful after a split operation or when partial
results from parallel computation are stored as separate datasets in a
group.
fn <- tempfile(fileext = ".h5")
hdf5_create_matrix(fn, "parts/A", data = matrix(1:12, 3, 4))
hdf5_create_matrix(fn, "parts/B", data = matrix(1:12, 3, 4))
hdf5_create_matrix(fn, "parts/C", data = matrix(1:12, 3, 4))
entry <- hdf5_matrix(fn, "parts/A")
result <- reduce(entry, func = "+") # sums A + B + C
as.matrix(result) # each element equals 3x the original
hdf5_close_all()
unlink(fn)hdf5_reduce(filename, group, ...) provides the same
operation when you only have the file path and group name, without
needing an open HDF5Matrix object.
apply_function() applies a predefined algebraic or
statistical operation to a named set of datasets within the HDF5 group
of x.
This is not equivalent to
base::apply(). It does not apply an arbitrary R
function row- or column-wise. Instead, it dispatches built-in C++
operations — QR decomposition, cross-products, Cholesky, and others — to
a batch of datasets in the file. Support for row/column-wise application
of arbitrary R functions (apply.HDF5Matrix) is planned for
a future release.
fn <- tempfile(fileext = ".h5")
hdf5_create_matrix(fn, "data/A", data = matrix(rnorm(50), 5, 10))
hdf5_create_matrix(fn, "data/B", data = matrix(rnorm(50), 5, 10))
X <- hdf5_matrix(fn, "data/A")
res <- apply_function(X, datasets = c("A", "B"),
func = "CrossProd", out_group = "RESULTS")
hdf5_close_all()
unlink(fn)hdf5_apply(filename, group, datasets, func, outgroup, ...)
provides the same operation when you only have the file path, without an
open object.
HDF5 datasets can be stored with different compression levels. Lower
compression levels generally reduce writing time but use more disk
space. Higher compression levels may reduce file size but can increase
the time needed to write and read data. The default value in
BigDataStatMeth is compression = 6, which
provides a balanced setting for many workflows.
The following small example illustrates the trade-off between writing time and file size. It is not intended as a formal benchmark.
set.seed(123)
X_cmp <- round(matrix(rnorm(2500 * 250), nrow = 2500, ncol = 250), 2)
f0 <- tempfile(fileext = ".h5")
f6 <- tempfile(fileext = ".h5")
t0 <- system.time(
hdf5_create_matrix(
f0, "data/X",
data = X_cmp,
compression = 0,
overwrite = TRUE
)
)
t6 <- system.time(
hdf5_create_matrix(
f6, "data/X",
data = X_cmp,
compression = 6,
overwrite = TRUE
)
)
data.frame(
compression = c(0, 6),
elapsed = c(t0[["elapsed"]], t6[["elapsed"]]),
file_size_MB = round(file.info(c(f0, f6))$size / 1024^2, 3)
)
compression elapsed file_size_MB
1 0 0.004 4.771
2 6 0.260 1.151Formal performance comparisons should be run on the target hardware and with representative datasets.
When datasets are deleted or overwritten in an HDF5 file, the
physical file size on disk does not always decrease immediately.
BigDataStatMeth creates HDF5 files using a file space
management strategy that allows free space inside the file to be tracked
and reused by subsequent writes. Thus, space released by removed
datasets can be reused within the same file instead of unnecessarily
increasing the file size after repeated operations.
This behavior helps control file growth during workflows that create,
overwrite, or remove intermediate datasets. Nevertheless, reducing the
physical size of an existing HDF5 file may require repacking the file.
The HDF Group provides external command-line tools for this purpose. In
particular, the h5repack
user guide describes how to rewrite an HDF5 file into a new
compacted or reconfigured file. This is an external HDF5 utility and is
not executed from this vignette.
HDF5-backed objects keep file handles open while they are in use. In ordinary interactive workflows these handles are released when objects are garbage-collected. Users can also close objects explicitly.
The function hdf5_close_all() closes the HDF5 handles
currently tracked by the package, including open file handles. After
this call, HDF5-backed objects that pointed to those handles should be
reopened before further use.
Calling gc() may help trigger R finalizers for objects
that are no longer referenced, which can be useful when repeatedly
re-running code during development.
The examples above use the standard R interface, which is the
recommended entry point for most users. However,
BigDataStatMeth also provides a C++ infrastructure for
developers who need to implement new scalable methods.
Internally, the package defines C++ classes for managing HDF5 files, groups, and datasets, together with block-wise routines for linear algebra and statistical operations. These are the same computational building blocks used by the R/S3 interface. As a result, developers can reuse the package infrastructure from Rcpp-based code instead of implementing HDF5 file management, block iteration, and numerical operations from scratch.
This makes the package useful not only as an end-user R package, but also as a development framework for extending HDF5-backed statistical computing in C++. Since the purpose of this vignette is to introduce the standard R interface, the C++ API is not discussed in detail here. Nevertheless, it is an important part of the package architecture and provides the computational infrastructure behind the R/S3 methods shown above.
sessionInfo()
R version 4.6.0 (2026-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BigDataStatMeth_2.0.2 knitr_1.51 BiocStyle_2.41.0
loaded via a namespace (and not attached):
[1] vctrs_0.7.3 cli_3.6.6 rlang_1.2.0
[4] xfun_0.58 S7_0.2.2 jsonlite_2.0.0
[7] data.table_1.18.4 labeling_0.4.3 glue_1.8.1
[10] buildtools_1.0.0 RCurl_1.98-1.19 htmltools_0.5.9
[13] maketools_1.3.2 sys_3.4.3 sass_0.4.10
[16] scales_1.4.0 rmarkdown_2.31 grid_4.6.0
[19] evaluate_1.0.5 jquerylib_0.1.4 MASS_7.3-65
[22] bitops_1.0-9 fastmap_1.2.0 yaml_2.3.12
[25] lifecycle_1.0.5 BiocManager_1.30.27 compiler_4.6.0
[28] RColorBrewer_1.1-3 Rcpp_1.1.1-1.1 farver_2.1.2
[31] digest_0.6.39 R6_2.6.1 bslib_0.11.0
[34] withr_3.0.2 gtable_0.3.6 tools_4.6.0
[37] ggplot2_4.0.3 cachem_1.1.0