Overview

BigDataStatMeth provides statistical and linear algebra operations for matrices stored in HDF5 files. The package is designed for workflows in which matrices may be too large to be held entirely in memory, while still allowing users to work with familiar R functions.

The examples in this vignette use small matrices so that they can be run quickly and compared with standard in-memory R results. The same interface is intended for larger HDF5-backed matrices. All files created in the examples are written to temporary locations.

library(BigDataStatMeth)

User-facing interface and global options

The recommended user-facing interface is based on HDF5Matrix objects and standard R generics. Thus, HDF5-backed matrices can be manipulated using calls such as dim(), [, %*%, crossprod(), scale(), cor(), svd(), qr(), and prcomp().

Internally, BigDataStatMeth uses an R6 and C++ backend. Advanced users may interact with lower-level interfaces when needed, but this vignette focuses on the S3 interface because it allows code to remain close to ordinary R matrix workflows.

The following table summarizes the main user-facing functionality available for HDF5Matrix objects. It is not an exhaustive list of all exported functions; rather, it highlights the standard R-style methods and high-level helpers used throughout this vignette. Additional domain-specific, lower-level, and compatibility functions are documented in their corresponding help pages.

Category	Representative calls
Core object handling	`hdf5_create_matrix()`, `hdf5_matrix()`, `dim()`, `nrow()`, `ncol()`, `is_open()`, `close()`
HDF5 inspection and I/O	`list_datasets()`, `hdf5_import()`, `hdf5_import_multiple()`, `as.matrix()`, `as.data.frame()`
Subsetting and assignment	`X[i, j]`, `X[i, j] <- value`
Dimension names	`rownames()`, `colnames()`, `dimnames()`
Element-wise arithmetic	`X + Y`, `X - Y`, `X * Y`, `X / Y`
Matrix algebra	`%*%`, `crossprod()`, `tcrossprod()`
Binding	`cbind()`, `rbind()`
Aggregations	`colSums()`, `rowSums()`, `colMeans()`, `rowMeans()`, `colVars()`, `rowVars()`, `colSds()`, `rowSds()`, `colMins()`, `rowMins()`, `colMaxs()`, `rowMaxs()`
Scalar summaries	`mean()`, `var()`, `sd()`
Normalization and transformations	`scale()`, `sweep()`
Correlation	`cor()`
Matrix decompositions	`svd()`, `prcomp()`, `eigen()`, `pseudoinverse()`
Factorizations and solvers	`qr()`, `chol()`, `solve()`
Diagonal operations	`diag()`, `diag<-()`, `diag_op()`, `diag_scale()`
Split, reduce, and apply	`split()`, `split_dataset()`, `reduce()`, `hdf5_reduce()`, `apply_function()`, `hdf5_apply()`
Resource management and options	`hdf5matrix_options()`, `show_hdf5matrix_options()`, `hdf5_close_all()`
Specialized high-level helpers	selected `bd*` functions for utilities without a direct standard R generic, such as `bdCreate_hdf5_group()`, `bdmove_hdf5_dataset()`, and `bdWrite_hdf5_dimnames()`

Most examples in this vignette use the standard HDF5Matrix/S3 interface. Some high-level helper functions keep the bd* prefix because they provide specialized functionality that does not map directly to an existing base R generic. These functions remain part of the package API and are documented in their corresponding help pages.

Global options

Some S3 operators, such as element-wise arithmetic, follow the standard R syntax and do not expose additional arguments in the function call. Global options make it possible to configure block-wise processing, parallelization, the number of threads, and compression for such operations.

old_opts <- hdf5matrix_options()
old_opts
$paral
NULL

$block_size
NULL

$threads
NULL

$compression
NULL

For example, the following settings enable parallel execution with two threads, use a fixed block size, and set the compression level used for new output datasets when not explicitly specified by a method call.

hdf5matrix_options(
  paral = TRUE,
  threads = 2L,
  block_size = 512L,
  compression = 6L
)

hdf5matrix_options()
$paral
[1] TRUE

$block_size
[1] 512

$threads
[1] 2

$compression
[1] 6

When an individual method provides explicit arguments, those arguments should be used for local control of that operation. The global options are especially useful for generic operators where the standard R call is kept unchanged. Additional operation-specific parameters are documented in the corresponding help pages, for example ?svd.HDF5Matrix, ?prcomp.HDF5Matrix, ?qr.HDF5Matrix, and ?hdf5matrix_options.

Creating, opening, and inspecting HDF5 datasets

The function hdf5_create_matrix() creates an HDF5 dataset and returns an HDF5Matrix object pointing to it. In the example below, a standard R matrix is written to an HDF5 file and then manipulated through the S3 interface.

h5file <- tempfile(fileext = ".h5")

set.seed(123)
X <- matrix(rnorm(500 * 100), nrow = 500, ncol = 100)

X_h5 <- hdf5_create_matrix(
  filename = h5file,
  dataset = "data/X",
  data = X,
  overwrite = TRUE
)

X_h5
HDF5Matrix object
  File: /tmp/Rtmpgjoek3/filed051193099b.h5
  Path: data/X
  Dimensions: 500 x 100
  Type: 
  Status: OPEN
dim(X_h5)
[1] 500 100
nrow(X_h5)
[1] 500
ncol(X_h5)
[1] 100

The datasets stored in an HDF5 file can be listed with list_datasets(). Existing datasets can be reopened with hdf5_matrix().

list_datasets(h5file)
[1] "data/X"

X_h5_reopened <- hdf5_matrix(
  filename = h5file,
  path = "data/X"
)

dim(X_h5_reopened)
[1] 500 100

Operations that return an HDF5Matrix object write their results to an HDF5 dataset. The printed object shows the file and dataset path. Some methods, such as crossprod() and tcrossprod(), allow the output group and dataset name to be specified explicitly. Other S3 operators, such as element-wise arithmetic, currently use package-defined output names.

Importing tabular data

Delimited text files can be imported directly into HDF5 with hdf5_import(). The following example uses the small cholesterol data file distributed with the package.

In the final package layout, the file should be available under inst/extdata/colesterol.csv. The fallback below is included only to allow this vignette to be tested against older source layouts in which the same file was stored elsewhere.

# Generate reproducible example data matching the cholesterol dataset structure
set.seed(4214123)
n <- 1000L

colesterol <- data.frame(
  TCholesterol = rnorm(n, mean = 195.8, sd = 30.1),
  Age          = rnorm(n, mean = 58.4,  sd = 9.9),
  Insulin      = rnorm(n, mean = 12.5,  sd = 7.4),
  Creatinine   = rnorm(n, mean = 0.798, sd = 0.098),
  BUN          = rnorm(n, mean = 14.6,  sd = 3.4),
  LLDR         = rnorm(n, mean = 0.981, sd = 0.450),
  Triglycerides= rnorm(n, mean = 136.6, sd = 55.1),
  HDL_C        = rnorm(n, mean = 48.2,  sd = 9.2),
  LDL_C        = rnorm(n, mean = 116.4, sd = 25.5),
  Sex          = rbinom(n, 1, prob = 0.58)
)

csv_file <- tempfile(fileext = ".csv")
write.csv(colesterol, csv_file, row.names = FALSE)

h5_csv <- tempfile(fileext = ".h5")

cholesterol_h5 <- hdf5_import(
  source = csv_file,
  filename = h5_csv,
  dataset = "cholesterol/data",
  sep = ",",
  header = TRUE,
  overwrite = TRUE
)

cholesterol_h5
HDF5Matrix object
  File: /tmp/Rtmpgjoek3/filed05625eec92.h5
  Path: cholesterol/data
  Dimensions: 1000 x 10
  Type: 
  Status: OPEN
dim(cholesterol_h5)
[1] 1000   10
cholesterol_h5[1:5, 1:min(6, ncol(cholesterol_h5))]
     TCholesterol      Age   Insulin Creatinine       BUN      LLDR
[1,]     210.6167 71.49745  2.109771  0.7787582 15.315296 0.7306967
[2,]     231.8875 53.95919 15.652611  0.8264965  9.401774 1.2230963
[3,]     191.7023 56.31360  7.744493  0.9293400  5.071212 1.2897677
[4,]     214.3665 45.09847 22.989093  0.8592065 14.564008 0.8122450
[5,]     202.0006 39.70275 11.293069  0.7682224 18.492814 1.0190207

Subsetting, assignment, dimnames, and conversion to memory

The [ operator can be used to read subsets from an HDF5Matrix object. Only the requested rows and columns are read.

X_h5[1:5, 1:6]
            [,1]       [,2]        [,3]       [,4]       [,5]       [,6]
[1,] -0.56047565 -0.6018928 -0.99579872 -0.8209867 -0.5116037 -0.6788076
[2,] -0.23017749 -0.9936986 -1.03995504 -0.3072572  0.2369379  0.5743127
[3,]  1.55870831  1.0267851 -0.01798024 -0.9020980 -0.5415892 -0.7045145
[4,]  0.07050839  0.7510613 -0.13217513  0.6270687  1.2192276 -0.5339841
[5,]  0.12928774 -1.5091665 -2.54934277  1.1203550  0.1741359  0.7743846

sub_X <- X_h5[1:20, 1:10]
dim(sub_X)
[1] 20 10

The replacement form [<- writes values back to the HDF5 dataset. Changes are applied directly to the data stored on disk.

X_edit <- hdf5_create_matrix(
  h5file,
  "data/X_edit",
  data = X[1:10, 1:6],
  overwrite = TRUE
)

X_edit[1, 1] <- 999
X_edit[1:3, 1:3]
            [,1]       [,2]        [,3]
[1,] 999.0000000 -0.6018928 -0.99579872
[2,]  -0.2301775 -0.9936986 -1.03995504
[3,]   1.5587083  1.0267851 -0.01798024

Dimension names can also be stored and retrieved with the usual R accessors.

DN_h5 <- hdf5_create_matrix(
  h5file,
  "data/dimnames_example",
  data = matrix(seq_len(12), nrow = 4, ncol = 3),
  overwrite = TRUE
)

rownames(DN_h5) <- paste0("sample_", seq_len(nrow(DN_h5)))
colnames(DN_h5) <- paste0("feature_", seq_len(ncol(DN_h5)))

rownames(DN_h5)
[1] "sample_1" "sample_2" "sample_3" "sample_4"
colnames(DN_h5)
[1] "feature_1" "feature_2" "feature_3"
dimnames(DN_h5)
[[1]]
[1] "sample_1" "sample_2" "sample_3" "sample_4"

[[2]]
[1] "feature_1" "feature_2" "feature_3"

The function as.matrix() reads an HDF5-backed matrix into memory. It is useful for small examples or final results, but should be used with care for very large datasets.

X_small <- as.matrix(X_h5[1:10, 1:5])
X_small
             [,1]        [,2]        [,3]       [,4]       [,5]
 [1,] -0.56047565 -0.60189285 -0.99579872 -0.8209867 -0.5116037
 [2,] -0.23017749 -0.99369859 -1.03995504 -0.3072572  0.2369379
 [3,]  1.55870831  1.02678506 -0.01798024 -0.9020980 -0.5415892
 [4,]  0.07050839  0.75106130 -0.13217513  0.6270687  1.2192276
 [5,]  0.12928774 -1.50916654 -2.54934277  1.1203550  0.1741359
 [6,]  1.71506499 -0.09514745  1.04057346  2.1272136 -0.6152683
 [7,]  0.46091621 -0.89594782  0.24972574  0.3661144 -1.8068930
 [8,] -1.26506123 -2.07075107  2.41620737 -0.8747814 -0.6436811
 [9,] -0.68685285  0.15012013  0.68519824  1.0244749  2.0460189
[10,] -0.44566197 -0.07921171 -0.44695931  0.9047589 -0.5607624

Matrix algebra

Element-wise arithmetic

Element-wise arithmetic between HDF5Matrix objects can be performed using the standard arithmetic operators. The results are stored as new HDF5-backed matrices.

set.seed(1)

A <- matrix(rnorm(300 * 80), nrow = 300, ncol = 80)
B <- matrix(rnorm(300 * 80), nrow = 300, ncol = 80)
C <- matrix(rnorm(80 * 40), nrow = 80, ncol = 40)

A_h5 <- hdf5_create_matrix(
  h5file, "data/A", data = A,
  overwrite = TRUE
)

B_h5 <- hdf5_create_matrix(
  h5file, "data/B", data = B,
  overwrite = TRUE
)

C_h5 <- hdf5_create_matrix(
  h5file, "data/C", data = C,
  overwrite = TRUE
)

S_h5 <- A_h5 + B_h5
D_h5 <- A_h5 - B_h5

S_h5
HDF5Matrix object
  File: /tmp/Rtmpgjoek3/filed051193099b.h5
  Path: OUTPUT/A_plus_B
  Dimensions: 300 x 80
  Type: 
  Status: OPEN
dim(S_h5)
[1] 300  80

all.equal(as.matrix(S_h5), A + B)
[1] TRUE
all.equal(as.matrix(D_h5), A - B)
[1] TRUE

The output datasets can be inspected directly from the file. This is useful when results need to be reopened later in the workflow.

list_datasets(h5file, group = "OUTPUT", recursive = TRUE)
[1] "A_minus_B" "A_plus_B"

Matrix multiplication

Matrix multiplication uses the same %*% syntax as ordinary R matrices, while the computation is performed block-wise on the HDF5-backed data.

M_h5 <- A_h5 %*% C_h5

M_h5
HDF5Matrix object
  File: /tmp/Rtmpgjoek3/filed051193099b.h5
  Path: OUTPUT/A_x_C
  Dimensions: 300 x 40
  Type: 
  Status: OPEN
dim(M_h5)
[1] 300  40
all.equal(as.matrix(M_h5), A %*% C)
[1] TRUE

Crossproducts

Crossproducts are available through the standard R calls. These methods also allow the output group and dataset name to be specified explicitly.

XtX_h5 <- crossprod(
  A_h5,
  outgroup = "RESULTS",
  outdataset = "A_crossprod"
)

XXt_h5 <- tcrossprod(
  A_h5,
  outgroup = "RESULTS",
  outdataset = "A_tcrossprod"
)

XtX_h5
HDF5Matrix object
  File: /tmp/Rtmpgjoek3/filed051193099b.h5
  Path: RESULTS/A_crossprod
  Dimensions: 80 x 80
  Type: 
  Status: OPEN
list_datasets(h5file, group = "RESULTS", recursive = TRUE)
[1] "A_crossprod"  "A_tcrossprod"

all.equal(as.matrix(XtX_h5), crossprod(A))
[1] TRUE
all.equal(as.matrix(XXt_h5), tcrossprod(A))
[1] TRUE

Binding matrices

The functions cbind() and rbind() can be used to combine compatible HDF5-backed matrices by columns or rows.

A1_h5 <- hdf5_create_matrix(
  h5file,
  "bind/A1",
  data = A[1:50, 1:10],
  overwrite = TRUE
)

A2_h5 <- hdf5_create_matrix(
  h5file,
  "bind/A2",
  data = A[1:50, 11:20],
  overwrite = TRUE
)

Cbind_h5 <- cbind(
  A1_h5, A2_h5,
  out_group = "BIND_RESULTS",
  out_dataset = "A1_A2_cbind",
  overwrite = TRUE
)

Rbind_h5 <- rbind(
  A1_h5, A1_h5,
  out_group = "BIND_RESULTS",
  out_dataset = "A1_A1_rbind",
  overwrite = TRUE
)

dim(Cbind_h5)
[1] 50 20
dim(Rbind_h5)
[1] 100  10

all.equal(as.matrix(Cbind_h5), cbind(A[1:50, 1:10], A[1:50, 11:20]))
[1] TRUE
all.equal(as.matrix(Rbind_h5), rbind(A[1:50, 1:10], A[1:50, 1:10]))
[1] TRUE

Statistical operations

Many summary operations return standard R vectors or scalars. This is convenient when the output is small relative to the input matrix.

all.equal(colMeans(A_h5), colMeans(A))
[1] TRUE
all.equal(rowSums(A_h5), rowSums(A))
[1] TRUE

all.equal(colVars(A_h5), apply(A, 2, var))
[1] TRUE
all.equal(rowSds(A_h5), apply(A, 1, sd))
[1] TRUE

The scale() method centers and/or scales an HDF5-backed matrix by blocks and stores the result on disk.

A_scaled_h5 <- scale(A_h5)
A_scaled <- scale(A)

all.equal(
  as.matrix(A_scaled_h5),
  A_scaled,
  check.attributes = FALSE
)
[1] TRUE

Correlation matrices can be computed directly from an HDF5Matrix.

Cor_h5 <- cor(A_h5)

all.equal(
  as.matrix(Cor_h5),
  cor(A),
  tolerance = 1e-6
)
[1] TRUE

Some operations can be combined with small HDF5-backed vectors. The following example subtracts column means from each column using sweep().

col_means_h5 <- hdf5_create_matrix(
  h5file,
  "stats/A_col_means",
  data = matrix(colMeans(A), nrow = 1),
  overwrite = TRUE
)

A_centered_h5 <- sweep(A_h5, MARGIN = 2, STATS = col_means_h5, FUN = "-")

all.equal(
  as.matrix(A_centered_h5),
  sweep(A, MARGIN = 2, STATS = colMeans(A), FUN = "-"),
  check.attributes = FALSE
)
[1] TRUE

Matrix decompositions

BigDataStatMeth implements several matrix decompositions for HDF5-backed matrices. The examples below use small matrices to validate results against standard R computations.

Singular value decomposition

The SVD method can center and scale the input before decomposition. The returned singular values are stored in memory, while singular vectors are stored as HDF5-backed matrices.

set.seed(123)
X_svd <- matrix(rnorm(120 * 300), nrow = 120, ncol = 300)

X_svd_h5 <- hdf5_create_matrix(
  h5file,
  "decomp/X_svd",
  data = X_svd,
  overwrite = TRUE
)

svd_h5 <- svd(
  X_svd_h5,
  nu = 10,
  nv = 10,
  center = TRUE,
  scale = TRUE,
  overwrite = TRUE
)

head(svd_h5$d)
[1] 27.48157 27.27928 26.93254 26.60646 26.34913 25.95616
dim(svd_h5$u)
[1] 120  10
dim(svd_h5$v)
[1] 300  10

The first singular values can be compared with the corresponding in-memory R calculation.

svd_r <- svd(scale(X_svd), nu = 10, nv = 10)

all.equal(
  svd_h5$d[1:10],
  svd_r$d[1:10],
  tolerance = 1e-6
)
[1] TRUE

For large matrices, svd() can use a block-wise strategy. The parameter k controls the number of local SVDs per incremental level, whereas q controls the number of incremental levels used to combine intermediate decompositions. Larger values may reduce the size of each local problem, but they can also increase overhead. Therefore, these parameters should be selected according to the matrix dimensions and the available hardware. The argument threads controls the number of threads used by parallel parts of the computation.

The arguments nu and nv follow the same idea as in base::svd(): nu controls how many left singular vectors are returned, and nv controls how many right singular vectors are returned. Requesting only the leading components is useful when the complete set of singular vectors is not needed.

svd_blk_h5 <- svd(
  X_svd_h5,
  nu = 5,
  nv = 5,
  k = 4,
  q = 1,
  threads = 2,
  overwrite = TRUE
)

head(svd_blk_h5$d)
[1] 27.48157 27.27928 26.93254 26.60646 26.34913
dim(svd_blk_h5$u)
[1] 120   5
dim(svd_blk_h5$v)
[1] 300   5

Principal component analysis

The prcomp() method provides a PCA interface for HDF5-backed matrices. The result includes standard PCA summaries together with HDF5-backed matrices for quantities such as scores and loadings. The argument ncomponents controls the number of principal components computed; a value of 0 requests all available components.

set.seed(124)

n <- 180
p <- 40
group <- rep(c("Group 1", "Group 2", "Group 3"), each = n / 3)

X_pca <- matrix(rnorm(n * p), nrow = n, ncol = p)
X_pca[group == "Group 2", 1:8] <- X_pca[group == "Group 2", 1:8] + 1.5
X_pca[group == "Group 3", 9:16] <- X_pca[group == "Group 3", 9:16] - 1.5

X_pca_h5 <- hdf5_create_matrix(
  h5file,
  "decomp/X_pca",
  data = X_pca,
  overwrite = TRUE
)

pca_h5 <- prcomp(
  X_pca_h5,
  center = TRUE,
  scale. = TRUE,
  ncomponents = 5,
  overwrite = TRUE
)

pca_h5
HDF5PCA - Principal Component Analysis (on-disk)
  File     : /tmp/Rtmpgjoek3/filed051193099b.h5
  PCs      : 5
  center   : TRUE  |  scale: TRUE
  sdev[1:3]: 2.1775, 1.4867, 1.3418
  cumvar%  : 11.85, 17.38, 21.88 ...
  rotation : 40 x 40  [HDF5Matrix]
  x (ind.) : 180 x 40  [HDF5Matrix]
head(pca_h5$sdev)
[1] 2.177534 1.486692 1.341799 1.305983 1.264360

The PCA scores are stored in the x component of the returned object. When x is an HDF5-backed matrix, the required columns can be subsetted and converted to memory for visualization. Here the synthetic groups are used only to make the low-dimensional structure visible.

class(pca_h5$x)
[1] "HDF5Matrix" "HDF5Matrix" "R6"        
dim(pca_h5$x)
[1] 180  40

pca_scores <- as.matrix(pca_h5$x[, 1:2])
pca_df <- data.frame(
  PC1 = pca_scores[, 1],
  PC2 = pca_scores[, 2],
  group = group
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  ggplot2::ggplot(pca_df, ggplot2::aes(PC1, PC2, colour = group)) +
    ggplot2::geom_point(size = 2.4, alpha = 0.85) +
    ggplot2::stat_ellipse(linewidth = 0.6, alpha = 0.7) +
    ggplot2::theme_minimal(base_size = 12) +
    ggplot2::theme(
      legend.position = "top",
      panel.grid.minor = ggplot2::element_blank()
    ) +
    ggplot2::labs(
      title = "PCA of an HDF5-backed matrix",
      subtitle = "Scores returned by prcomp.HDF5Matrix()",
      x = "PC1",
      y = "PC2",
      colour = "Synthetic group"
    )
} else {
  plot(
    pca_df$PC1,
    pca_df$PC2,
    pch = 19,
    xlab = "PC1",
    ylab = "PC2",
    main = "PCA of an HDF5-backed matrix"
  )
}

PCA scores computed from an HDF5-backed matrix.

QR decomposition

The QR decomposition returns HDF5-backed Q and R matrices. The argument thin controls whether a reduced decomposition is returned. With thin = TRUE, Q has only the columns needed to reconstruct the input matrix, which is usually preferable for tall matrices because it avoids storing unnecessary columns. With thin = FALSE, a full Q matrix is returned.

Rather than comparing the signs of individual columns, the example validates the factorization and the orthogonality of Q.

QR_h5 <- qr(A_h5, thin = TRUE, overwrite = TRUE)

Q <- as.matrix(QR_h5$Q)
R <- as.matrix(QR_h5$R)

all.equal(Q %*% R, A, tolerance = 1e-6)
[1] TRUE
all.equal(crossprod(Q), diag(ncol(Q)), tolerance = 1e-6)
[1] TRUE

For tall-skinny matrices, the QR method can use the TSQR path. The method = "auto" setting selects the backend according to the matrix shape, while method = "tsqr" requests the tall-skinny algorithm explicitly.

X_tsqr <- matrix(rnorm(600 * 30), nrow = 600, ncol = 30)
X_tsqr_h5 <- hdf5_create_matrix(
  h5file,
  "decomp/X_tsqr",
  data = X_tsqr,
  overwrite = TRUE
)

QR_tsqr_h5 <- qr(
  X_tsqr_h5,
  thin = TRUE,
  method = "tsqr",
  threads = 2L,
  overwrite = TRUE
)

dim(QR_tsqr_h5$Q)
[1] 600  30
dim(QR_tsqr_h5$R)
[1] 30 30

Cholesky decomposition and inverse

For symmetric positive-definite matrices, chol() and solve() can be used directly on HDF5Matrix objects.

set.seed(321)
Z <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)
SPD <- crossprod(Z) + diag(1e-6, 50)

SPD_h5 <- hdf5_create_matrix(
  h5file,
  "decomp/SPD",
  data = SPD,
  overwrite = TRUE
)

Ch_h5 <- chol(SPD_h5, overwrite = TRUE)
Inv_h5 <- solve(SPD_h5, overwrite = TRUE)

Ch <- as.matrix(Ch_h5)
chol_error <- min(
  max(abs(crossprod(Ch) - SPD)),
  max(abs(tcrossprod(Ch) - SPD))
)
chol_error < 1e-6
[1] TRUE

all.equal(as.matrix(Inv_h5), solve(SPD), tolerance = 1e-6)
[1] TRUE

Eigen decomposition and pseudoinverse

Additional decompositions are available through S3 methods. For example, eigen() can be applied to symmetric HDF5-backed matrices, and pseudoinverse() computes the Moore–Penrose pseudoinverse.

Eig_h5 <- eigen(
  SPD_h5,
  symmetric = TRUE,
  k = 5L,
  overwrite = TRUE
)

head(Eig_h5$values)
[1] 430.0639 406.9944 392.0909 374.7714 365.4983
dim(Eig_h5$vectors)
[1] 50  5

Pinv_h5 <- pseudoinverse(
  A_h5,
  overwrite = TRUE
)

dim(Pinv_h5)
[1]  80 300

Dataset operations: split, reduce, and apply

Split

split() divides an HDF5Matrix into equal-sized blocks stored as separate datasets in the same HDF5 file. Use n_blocks to specify the number of blocks, or block_size to specify the maximum number of rows (or columns, if bycols = TRUE) per block.

fn     <- tempfile(fileext = ".h5")
X      <- hdf5_create_matrix(fn, "data/X", data = matrix(rnorm(600), 60, 10))
blocks <- split(X, n_blocks = 3)   # returns a named list: block_0, block_1, block_2
dim(blocks[["block_0"]])           # 20 x 10
lapply(blocks, close)
close(X)
hdf5_close_all()
unlink(fn)

split_dataset() is an equivalent with a cleaner signature that omits the f and drop parameters inherited from base::split. Both produce identical results.

Note: BigDataStatMeth::split() produces an error because split is a base generic — this is the same behaviour as BigDataStatMeth::cor() or BigDataStatMeth::svd(). The correct call is simply split(X, n_blocks = 3) after loading the package.

Reduce

reduce() accumulates all datasets in the same HDF5 group as x into a single dataset using "+" or "-". This is useful after a split operation or when partial results from parallel computation are stored as separate datasets in a group.

fn <- tempfile(fileext = ".h5")
hdf5_create_matrix(fn, "parts/A", data = matrix(1:12, 3, 4))
hdf5_create_matrix(fn, "parts/B", data = matrix(1:12, 3, 4))
hdf5_create_matrix(fn, "parts/C", data = matrix(1:12, 3, 4))

entry  <- hdf5_matrix(fn, "parts/A")
result <- reduce(entry, func = "+")   # sums A + B + C
as.matrix(result)                     # each element equals 3x the original
hdf5_close_all()
unlink(fn)

hdf5_reduce(filename, group, ...) provides the same operation when you only have the file path and group name, without needing an open HDF5Matrix object.

Apply function

apply_function() applies a predefined algebraic or statistical operation to a named set of datasets within the HDF5 group of x.

This is not equivalent to base::apply(). It does not apply an arbitrary R function row- or column-wise. Instead, it dispatches built-in C++ operations — QR decomposition, cross-products, Cholesky, and others — to a batch of datasets in the file. Support for row/column-wise application of arbitrary R functions (apply.HDF5Matrix) is planned for a future release.

fn <- tempfile(fileext = ".h5")
hdf5_create_matrix(fn, "data/A", data = matrix(rnorm(50), 5, 10))
hdf5_create_matrix(fn, "data/B", data = matrix(rnorm(50), 5, 10))

X   <- hdf5_matrix(fn, "data/A")
res <- apply_function(X, datasets = c("A", "B"),
                      func = "CrossProd", out_group = "RESULTS")
hdf5_close_all()
unlink(fn)

hdf5_apply(filename, group, datasets, func, outgroup, ...) provides the same operation when you only have the file path, without an open object.

Compression and HDF5 file space management

Compression

HDF5 datasets can be stored with different compression levels. Lower compression levels generally reduce writing time but use more disk space. Higher compression levels may reduce file size but can increase the time needed to write and read data. The default value in BigDataStatMeth is compression = 6, which provides a balanced setting for many workflows.

The following small example illustrates the trade-off between writing time and file size. It is not intended as a formal benchmark.

set.seed(123)
X_cmp <- round(matrix(rnorm(2500 * 250), nrow = 2500, ncol = 250), 2)

f0 <- tempfile(fileext = ".h5")
f6 <- tempfile(fileext = ".h5")

t0 <- system.time(
  hdf5_create_matrix(
    f0, "data/X",
    data = X_cmp,
    compression = 0,
    overwrite = TRUE
  )
)

t6 <- system.time(
  hdf5_create_matrix(
    f6, "data/X",
    data = X_cmp,
    compression = 6,
    overwrite = TRUE
  )
)

data.frame(
  compression = c(0, 6),
  elapsed = c(t0[["elapsed"]], t6[["elapsed"]]),
  file_size_MB = round(file.info(c(f0, f6))$size / 1024^2, 3)
)
  compression elapsed file_size_MB
1           0   0.004        4.771
2           6   0.260        1.151

Formal performance comparisons should be run on the target hardware and with representative datasets.

HDF5 file space management

When datasets are deleted or overwritten in an HDF5 file, the physical file size on disk does not always decrease immediately. BigDataStatMeth creates HDF5 files using a file space management strategy that allows free space inside the file to be tracked and reused by subsequent writes. Thus, space released by removed datasets can be reused within the same file instead of unnecessarily increasing the file size after repeated operations.

This behavior helps control file growth during workflows that create, overwrite, or remove intermediate datasets. Nevertheless, reducing the physical size of an existing HDF5 file may require repacking the file. The HDF Group provides external command-line tools for this purpose. In particular, the h5repack user guide describes how to rewrite an HDF5 file into a new compacted or reconfigured file. This is an external HDF5 utility and is not executed from this vignette.

Managing HDF5 resources

HDF5-backed objects keep file handles open while they are in use. In ordinary interactive workflows these handles are released when objects are garbage-collected. Users can also close objects explicitly.

close(X_h5_reopened)

The function hdf5_close_all() closes the HDF5 handles currently tracked by the package, including open file handles. After this call, HDF5-backed objects that pointed to those handles should be reopened before further use.

Calling gc() may help trigger R finalizers for objects that are no longer referenced, which can be useful when repeatedly re-running code during development.

hdf5matrix_options(
  paral = old_opts$paral,
  block_size = old_opts$block_size,
  threads = old_opts$threads,
  compression = old_opts$compression
)

hdf5_close_all()
gc()
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 1552754 83.0    3136827 167.6  2892078 154.5
Vcells 3628564 27.7    8388608  64.0  5182923  39.6

Extending BigDataStatMeth through C++

The examples above use the standard R interface, which is the recommended entry point for most users. However, BigDataStatMeth also provides a C++ infrastructure for developers who need to implement new scalable methods.

Internally, the package defines C++ classes for managing HDF5 files, groups, and datasets, together with block-wise routines for linear algebra and statistical operations. These are the same computational building blocks used by the R/S3 interface. As a result, developers can reuse the package infrastructure from Rcpp-based code instead of implementing HDF5 file management, block iteration, and numerical operations from scratch.

This makes the package useful not only as an end-user R package, but also as a development framework for extending HDF5-backed statistical computing in C++. Since the purpose of this vignette is to introduce the standard R interface, the C++ API is not discussed in detail here. Nevertheless, it is an important part of the package architecture and provides the computational infrastructure behind the R/S3 methods shown above.

Session information

sessionInfo()
R version 4.6.0 (2026-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BigDataStatMeth_2.0.2 knitr_1.51            BiocStyle_2.41.0     

loaded via a namespace (and not attached):
 [1] vctrs_0.7.3         cli_3.6.6           rlang_1.2.0        
 [4] xfun_0.58           S7_0.2.2            jsonlite_2.0.0     
 [7] data.table_1.18.4   labeling_0.4.3      glue_1.8.1         
[10] buildtools_1.0.0    RCurl_1.98-1.19     htmltools_0.5.9    
[13] maketools_1.3.2     sys_3.4.3           sass_0.4.10        
[16] scales_1.4.0        rmarkdown_2.31      grid_4.6.0         
[19] evaluate_1.0.5      jquerylib_0.1.4     MASS_7.3-65        
[22] bitops_1.0-9        fastmap_1.2.0       yaml_2.3.12        
[25] lifecycle_1.0.5     BiocManager_1.30.27 compiler_4.6.0     
[28] RColorBrewer_1.1-3  Rcpp_1.1.1-1.1      farver_2.1.2       
[31] digest_0.6.39       R6_2.6.1            bslib_0.11.0       
[34] withr_3.0.2         gtable_0.3.6        tools_4.6.0        
[37] ggplot2_4.0.3       cachem_1.1.0

Working with HDF5-Backed Matrices in BigDataStatMeth