| Title: | Uniform Data Model and 'Zarr' Interchange for Single-Cell Omics |
|---|---|
| Description: | A lightweight interchange layer for single-cell and spatial omics data, built on the L-star model of labelled axes and typed fields over them, serialized to the 'Zarr' format. Provides bidirectional converters ("profiles") for 'Seurat', 'SingleCellExperiment', 'Conos', and 'pagoda2' objects, including collections of heterogeneous samples, via a shared C++ core ('libstar') so the same store is readable from R, 'Python', and C++. |
| Authors: | Peter Kharchenko [aut, cre] |
| Maintainer: | Peter Kharchenko <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-06-22 19:42:46 UTC |
| Source: | https://github.com/cran/lstar |
For a cells x genes sparse matrix and a per-cell group assignment, returns each group's
sum, sum-of-squares, and number of expressing cells per gene, computed over log1p(m) (or raw).
Cluster statistics and marker tables are built from this reduction; the same C++ core backs
the WASM and Python bindings.
col_sum_by_group(m, code, ngroups, lognorm = TRUE)
m |
a |
code |
length- |
ngroups |
number of groups. |
lognorm |
compute over |
a list with sum, sumsq, n_expr, each a flat row-major (group, gene) numeric vector
of length ngroups * ncol(m).
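The reduction and its flat row-major layout can be sketched in plain R. This is a reference sketch, not the libstar kernel: `group_stats_ref` is a hypothetical name, the input is densified for clarity, and the assumption that `code` holds 0-based group codes is not confirmed by the text above.

```r
# Plain-R reference for per-group sum, sum-of-squares and expressing-cell counts.
# Assumption: `code` is a 0-based group code per cell (row of m).
group_stats_ref <- function(m, code, ngroups, lognorm = TRUE) {
  x <- as.matrix(m)
  if (lognorm) x <- log1p(x)
  g <- factor(code, levels = seq_len(ngroups) - 1L)
  keep <- !is.na(g)
  acc <- function(v) {
    out <- matrix(0, ngroups, ncol(x))
    rs <- rowsum(v[keep, , drop = FALSE], as.integer(g[keep]))
    out[as.integer(rownames(rs)), ] <- rs
    out
  }
  # t() then as.vector() flattens a (group, gene) matrix in row-major order,
  # so entry (g, j) lands at 0-based offset g * ngenes + j.
  list(
    sum    = as.vector(t(acc(x))),
    sumsq  = as.vector(t(acc(x^2))),
    n_expr = as.vector(t(acc((x > 0) * 1)))
  )
}

# 4 cells x 3 genes, two groups.
m <- matrix(c(1, 0, 2,
              0, 3, 0,
              4, 0, 0,
              0, 0, 5), nrow = 4, byrow = TRUE)
st <- group_stats_ref(m, code = c(0, 1, 0, 1), ngroups = 2, lognorm = FALSE)
ngenes <- ncol(m)
st$sum[1 * ngenes + 2 + 1]  # group 1, gene 2 (both 0-based)
```

Reshaping a flat result with `matrix(st$sum, nrow = 2, byrow = TRUE)` recovers the groups x genes table.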
A collection keeps each sample's own data — per-sample cells.<s>/genes.<s> axes and
<field>.<s> fields, over gene sets that may overlap, differ, or be entirely disjoint across
samples — alongside a samples axis and a union cells axis carrying the joint analysis (a shared
embedding, clusters, a graph). This is lstar's "a collection is not one aligned tensor" model, built
from any list of separately-processed samples rather than hand-assembled.
collection_from(samples, joint = NULL, sample_field = "sample", prefix_cells = TRUE)
samples |
a named list of per-sample |
joint |
optional named list of fields over the union cells (the integration outputs): a matrix
becomes a joint embedding; a factor/character vector a clustering (inducing a factor axis); a
|
sample_field |
name of the design label recording each union cell's sample (default |
prefix_cells |
prefix each cell label with its sample name so union labels are unique (default |
an lstar_dataset of kind "collection".
mk <- function(s, nc, genes) {
  ds <- list(kind = "sample", axes = list(), fields = list())
  ds$axes$cells <- list(labels = paste0("c", seq_len(nc)), origin = "observed", role = "observation")
  ds$axes$genes <- list(labels = genes, origin = "observed", role = "feature")
  m <- as(Matrix::Matrix(matrix(rpois(nc * length(genes), 2), nc), sparse = TRUE), "CsparseMatrix")
  ds$fields$counts <- list(role = "measure", span = c("cells", "genes"), state = "raw", values = m)
  class(ds) <- "lstar_dataset"; ds
}
col <- collection_from(list(A = mk("A", 5, paste0("g", 1:8)),
                            B = mk("B", 7, paste0("g", 3:12)))) # divergent gene sets
col$kind
Accessor: a field's value by name.
field_value(ds, name)
ds |
an |
name |
field name |
the field's values (a vector, matrix, or sparse Matrix), or NULL if absent.
Gathers the de.<factor>.<stat> fields (score/lfc/pval/padj over the factor x genes axes) into a
long-form data frame: one row per (group, gene) with whichever statistics are present.
lstar_markers(ds, factor, top = NULL, sort_by = "score", descending = TRUE)
ds |
an |
factor |
the factor-axis name the DE was computed over (e.g. |
top |
optional: keep only the top-N genes per group (by |
sort_by |
statistic to rank within a group (default |
descending |
sort descending (default |
a data frame with columns group, gene, and the available statistics.
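The per-group ranking implied by `top`, `sort_by` and `descending` can be illustrated on a long-form table in base R. The data and the helper `top_per_group` below are hypothetical and only mimic the described return shape; they are not the package's implementation.

```r
# Hypothetical long-form marker table: one row per (group, gene).
markers <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  gene  = c("g1", "g2", "g3", "g1", "g2", "g3"),
  score = c(2.5, 0.3, 4.1, 1.0, 3.2, 0.7),
  stringsAsFactors = FALSE
)

top_per_group <- function(df, top, sort_by = "score", descending = TRUE) {
  # Order rows by group, then by the chosen statistic within each group.
  key <- if (descending) -df[[sort_by]] else df[[sort_by]]
  df <- df[order(df$group, key), , drop = FALSE]
  # Rank rows within each group and keep the first `top` of each.
  rank_in_group <- ave(seq_len(nrow(df)), df$group, FUN = seq_along)
  out <- df[rank_in_group <= top, , drop = FALSE]
  rownames(out) <- NULL
  out
}

top2 <- top_per_group(markers, top = 2)
top2
```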
Read an L* Zarr store into an R dataset.
lstar_read(path)
path |
path to a |
an lstar_dataset: a list with axes and fields, each field's values
assembled as a base vector, matrix, or Matrix sparse matrix.
p <- tempfile(fileext = ".lstar.zarr")
ds <- list(kind = "sample", axes = list(), fields = list())
ds$axes$cells <- list(labels = paste0("c", 1:3), origin = "observed", role = "observation")
ds$fields$depth <- list(role = "measure", span = "cells", values = c(1, 2, 3))
class(ds) <- "lstar_dataset"
lstar_write(ds, p)
ds2 <- lstar_read(p)
field_value(ds2, "depth")
Reads genes [g_lo, g_hi) (0-based, half-open) of a CSC measure as a dgCMatrix (cells x genes),
decoding only the store chunks that overlap the range. The general block-read primitive a consumer
drives to build out-of-core reductions over an L* store without implementing them in lstar.
lstar_read_block(path, field, g_lo, g_hi, cell_names = NULL, gene_names = NULL)
path |
path to a |
field |
name of a |
g_lo, g_hi
|
0-based, half-open gene (column) range |
cell_names, gene_names
|
optional dimnames for the returned matrix (default |
a dgCMatrix (cells x genes) holding the requested gene columns.
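The 0-based, half-open convention and the block-driven reduction pattern can be sketched against an in-memory matrix. `read_block_ref` is a hypothetical stand-in that slices a `dgCMatrix` rather than reading a store; only the index arithmetic is the point here.

```r
library(Matrix)

# In-memory stand-in for a stored cells x genes CSC measure (3 cells, 5 genes).
m <- sparseMatrix(i = c(1, 2, 3, 1), j = c(1, 2, 4, 5),
                  x = c(1, 2, 3, 4), dims = c(3, 5))

# 0-based, half-open [g_lo, g_hi) corresponds to R columns (g_lo + 1):g_hi.
read_block_ref <- function(m, g_lo, g_hi) {
  m[, seq_len(g_hi - g_lo) + g_lo, drop = FALSE]
}

# Tile the gene axis into half-open blocks of width `w` and reduce block by block.
ngenes <- ncol(m)
w <- 2L
starts <- seq(0L, ngenes - 1L, by = w)
ends <- pmin(starts + w, ngenes)
col_totals <- numeric(ngenes)
for (k in seq_along(starts)) {
  blk <- read_block_ref(m, starts[k], ends[k])
  col_totals[seq_len(ends[k] - starts[k]) + starts[k]] <- colSums(blk)
}
col_totals
```

Because the ranges are half-open, adjacent blocks share a boundary value without overlapping, and the last block is simply clipped to the number of genes.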
Gathers the requested gene columns from a chunked CSC store, decoding each touched chunk at most once (an ascending sweep over sorted-unique columns), then restores the caller's order. Efficient for scattered subsets (e.g. overdispersed genes for PCA) – unlike a per-column read it does not re-decode a chunk once per gene it contains.
lstar_read_genes(path, field, genes, all_genes, cell_names = NULL)
path |
path to a |
field |
name of a |
genes |
the gene columns to gather, as names (matched against |
all_genes |
the field's full gene-label vector, used to resolve |
cell_names |
optional row names (cells) for the returned matrix (default |
a dgCMatrix (cells x length(genes)) in the caller's requested gene order.
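The "ascending sweep, then restore the caller's order" idea can be sketched with index arithmetic on an in-memory matrix. This is an illustration under the assumption that gene names resolve to column positions via `match()`; it does not read a store or decode chunks.

```r
library(Matrix)

all_genes <- paste0("g", 1:6)
m <- sparseMatrix(i = c(1, 2, 3, 1, 2), j = c(1, 2, 4, 5, 6),
                  x = c(1, 2, 3, 4, 5), dims = c(3, 6),
                  dimnames = list(paste0("c", 1:3), all_genes))

genes <- c("g5", "g2", "g1")          # scattered request, not in store order
idx <- match(genes, all_genes)        # resolve names to column positions

# Sweep the touched columns once, in ascending order.
sweep_cols <- sort(unique(idx))
gathered <- m[, sweep_cols, drop = FALSE]

# Restore the caller's requested order.
res <- gathered[, match(idx, sweep_cols), drop = FALSE]
colnames(res)
```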
Streaming, fused pseudobulk: per cell-group sums of each gene, computed in one threaded pass over a
chunked CSC store, with the same optional depth-normalized log1p view as stream_col_stats() (the
counterpart of pagoda2's colSumByFacView). group is a per-cell integer bucket in [0, ngroups)
in store row order (out-of-range cells are skipped; pass NA cells as 0 for an explicit <NA> row).
lstar_stream_col_sum_by_group(path, field, group, ngroups, lognorm = FALSE, depth = NULL, depthScale = 1, block = 4096L, n_threads = 1L)
path |
path to a |
field |
name of a |
group |
integer per-cell group bucket in |
ngroups |
number of groups (the result has one row per group). |
lognorm |
compute over the |
depth |
optional per-cell depth vector (length = n cells, store row order) for the normalized view. |
depthScale |
depth scaling factor used with |
block |
streaming block size in cells per pass (default 4096). |
n_threads |
threads per block reduction: 1 = serial (default), N = N threads, 0 = all cores. |
a ngroups x ngenes numeric matrix (row g = the per-gene sums for group g).
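The accumulation pattern behind a streamed pseudobulk can be sketched on a dense in-memory matrix: process a bounded block at a time, add each cell's row to its group's row, and skip cells whose bucket falls outside `[0, ngroups)`. `pseudobulk_blocks` is a hypothetical helper; the real function streams a chunked store in threaded C++, and treating `NA` buckets as skipped is an assumption of this sketch.

```r
pseudobulk_blocks <- function(x, group, ngroups, block = 2L) {
  out <- matrix(0, ngroups, ncol(x))
  for (lo in seq(1L, nrow(x), by = block)) {
    rows <- lo:min(lo + block - 1L, nrow(x))
    g <- group[rows]
    ok <- !is.na(g) & g >= 0L & g < ngroups   # out-of-range cells are skipped
    for (r in rows[ok]) {
      out[group[r] + 1L, ] <- out[group[r] + 1L, ] + x[r, ]
    }
  }
  out
}

x <- matrix(1:15, nrow = 5)            # 5 cells x 3 genes
group <- c(0L, 1L, 5L, 1L, 1L)         # cell 3 has an out-of-range bucket
pb <- pseudobulk_blocks(x, group, ngroups = 2L)
pb
```

The result does not depend on the block size; only the amount of data held at once changes.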
Write an R dataset to an L* Zarr store.
lstar_write(ds, path, chunk_elems = NULL, compression = c("none", "gzip", "zlib"), level = 5L)
ds |
an |
path |
output store path (a |
chunk_elems |
if non-NULL, chunk each array along its first axis so each chunk holds about
this many elements (e.g. |
compression |
chunk codec: |
level |
compression level 1-9 (default 5), used when |
the output path, invisibly.
lstar_read(), lstar_read_block()
Print an L* dataset
## S3 method for class 'lstar_dataset'
print(x, ...)
x |
an |
... |
ignored, for S3 compatibility |
x, invisibly (called for the side effect of printing a summary of axes and fields).
The inverse of write_conos(): rebuilds the per-sample Pagoda2 objects (raw counts plus the
stored PCA reduction) and restores the joint graph, embedding and clustering(s), returning a live
conos::Conos object ready for plotting, marker detection and label transfer. Re-running
runGraph() will recompute the per-sample variance model (not stored).
read_conos(ds)
ds |
an |
a conos::Conos object.
Read a SingleCellExperiment into an L* dataset.
read_sce(sce)
sce |
a |
an lstar_dataset of kind "sample".
Reads the matrix from h5ad as an HDF5Array::H5ADMatrix – a DelayedMatrix that stays on disk
(genes x cells, the Bioconductor convention) – and wraps it in a SingleCellExperiment. The
matrix is never materialized. Pairs with Python lstar.convert_to_h5ad(store, h5ad).
read_sce_backed(h5ad, layer = NULL, assay_name = "counts")
h5ad |
path to an .h5ad file. |
layer |
h5ad layer to read; |
assay_name |
name for the SCE assay (default inferred: |
a SingleCellExperiment whose assay is a disk-backed DelayedMatrix.
Handles Seurat v3/v4 (Assay) and v5 (Assay5); a v5 assay split by sample
(split(assay, f = ...)) is read as an L* collection. The detected versions are recorded in
ds$profiles.
read_seurat(so, assay = SeuratObject::DefaultAssay(so))
so |
a |
assay |
assay to read (default: the default assay) |
an lstar_dataset (of kind "sample", or "collection" for a split assay).
Reads the matrix from h5ad with BPCells (open_matrix_anndata_hdf5) so it stays on disk as a
streaming IterableMatrix – already oriented genes x cells (rownames genes, colnames cells), the
Seurat convention – and wraps it in a Seurat v5 Assay5. The matrix is never materialized: peak
memory is a few megabytes regardless of atlas size. Typical use is the bounded end of an L*
conversion, after lstar.convert_to_h5ad(store, h5ad) in Python:
read_seurat_backed(h5ad, group = "X", assay = "RNA", project = "lstar")
h5ad |
path to an .h5ad file (its |
group |
h5ad group to read as the matrix (default |
assay |
name for the Seurat assay (default |
project |
Seurat project name. |
so <- read_seurat_backed("atlas.h5ad") # counts live on disk (BPCells)
so <- Seurat::NormalizeData(so) # Seurat v5 ops stream off disk
a Seurat object whose assay counts are a disk-backed BPCells matrix.
Computes the zero-aware per-gene mean, variance, and number of expressing cells of a cells x genes measure directly from an L* store, reading it block-by-block so the whole matrix never
lands in memory – the C++/R counterpart of Python's stream_col_stats. Bounded memory requires a
chunked store (one written with chunk_elems set, e.g. by a streamed conversion); on an unchunked
store it still works but reads the data array whole. The field must be CSC (gene-major); use it
for HVG selection / variance modeling over an atlas too large to load.
stream_col_stats(path, field, block = 4096L, n_threads = 1L, lognorm = FALSE, depth = NULL, depthScale = 1, population = FALSE)
path |
path to an L* store ( |
field |
measure field name (e.g. |
block |
number of gene columns per streamed block (default 4096). |
n_threads |
threading policy for each block's reduction: 1 = serial (default), N = N threads,
|
lognorm |
reduce over |
depth |
optional per-cell depth vector (length n cells, in store row order). When given, each
nonzero is normalized to |
depthScale |
depth scaling factor (default 1) used with |
population |
if |
a list with mean, var (length ngenes numeric) and nnz (length ngenes integer).
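"Zero-aware" statistics can be computed from the stored nonzeros alone, because implicit zeros contribute nothing to the sum or the sum of squares but still count toward the number of cells. The sketch below works on an in-memory `dgCMatrix`; `col_stats_ref` is a hypothetical name, and reading `population = TRUE` as "divide by n instead of n - 1" is an assumption, since the argument's description is cut off above.

```r
library(Matrix)

col_stats_ref <- function(m, population = FALSE) {
  stopifnot(inherits(m, "dgCMatrix"))
  n <- nrow(m)
  s <- numeric(ncol(m)); ss <- numeric(ncol(m)); nnz <- integer(ncol(m))
  for (j in seq_len(ncol(m))) {
    # Stored values of column j live in x[(p[j] + 1):p[j + 1]] (p is 0-based).
    v <- m@x[seq_len(m@p[j + 1] - m@p[j]) + m@p[j]]
    s[j] <- sum(v); ss[j] <- sum(v^2); nnz[j] <- sum(v != 0)
  }
  mu <- s / n                            # zeros count in the denominator
  denom <- if (population) n else n - 1  # assumption: population => divide by n
  list(mean = mu, var = (ss - n * mu^2) / denom, nnz = nnz)
}

# 4 cells x 3 genes; the third gene is all zeros.
dense <- matrix(c(0, 1, 0, 2,
                  3, 0, 0, 0,
                  0, 0, 0, 0), nrow = 4)
w <- which(dense != 0, arr.ind = TRUE)
m <- sparseMatrix(i = w[, 1], j = w[, 2], x = dense[w], dims = dim(dense))
st <- col_stats_ref(m)
st
```

The same three accumulators (sum, sum of squares, nonzero count) can be updated block by block, which is what makes a bounded-memory pass possible.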
Build an L* dataset from a Conos object (a collection of samples).
write_conos(co, clustering = NULL)
co |
a |
clustering |
optional name of a joint clustering in |
an lstar_dataset of kind "collection".
*.lstar.zarr) store.
Writes counts (raw, cell x gene), the embedding, cluster/cell-type/QC labels, and the
viewer profile's cluster sufficient stats + marker tables ([email protected]). Computes the
cluster stats with the shared libstar kernel.
write_pagoda2(p2, path = NULL, grouping = "leiden")
p2 |
a Pagoda2 (pagoda2.1) object. |
path |
output store path ( |
grouping |
a |
an lstar_dataset (invisibly if written).
Measures become assays (transposed to genes x cells), embeddings become reducedDims, and
arity-1 cell fields become colData.
write_sce(ds)
ds |
an |
a SingleCellExperiment object.
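The orientation change mentioned above (cells x genes in an L* measure, genes x cells in an assay) amounts to a sparse transpose that carries the dimnames along. A minimal illustration with the Matrix package, using made-up labels and no SingleCellExperiment dependency:

```r
library(Matrix)

# A cells x genes measure, as stored in an L* dataset (labels are made up).
counts <- sparseMatrix(i = c(1, 2, 3), j = c(1, 1, 2), x = c(5, 1, 2),
                       dims = c(3, 2),
                       dimnames = list(paste0("c", 1:3), c("gA", "gB")))

# genes x cells, the Bioconductor and Seurat convention.
assay <- t(counts)
dim(assay)
rownames(assay)
```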
Measures over (cells, genes) become assay layers (transposed to Seurat's genes x cells
orientation); embeddings become DimReducs; arity-1 cell fields become meta.data.
write_seurat(ds)
ds |
an |
a Seurat object.