| Title: | Bayesian Context Trees for Discrete Sequence Data |
|---|---|
| Description: | Models discrete sequential data using Bayesian Context Trees. Context trees, also known as Variable Length Markov Chains (VLMCs), are parsimonious Markov models where the order of dependence can vary with the observed past. Provides a generic 'R6' class structure that exposes the full tree for building custom algorithms, exact Bayesian inference via a bottom-up recursive algorithm (closed-form marginal likelihood, Maximum A Posteriori (MAP) tree, exact posterior probabilities, and exact sampling from the posterior), a frequentist estimator via the context algorithm with likelihood-ratio pruning, simulation utilities, and a Metropolis-Hastings sampler. See Paulichen and Freguglia (2026) <doi:10.48550/arXiv.2603.25806>. |
| Authors: | Victor Freguglia [aut, cre] (ORCID: <https://orcid.org/0000-0002-6189-4453>), Thiago Paulichen [ctb] (ORCID: <https://orcid.org/0009-0002-5505-2409>) |
| Maintainer: | Victor Freguglia <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.0.0 |
| Built: | 2026-05-12 23:03:54 UTC |
| Source: | https://github.com/cran/bacontrees |
The data objects abc_vec and abc_list contain sequences generated using a Variable Length Markov Chain (VLMC) with a specified context tree and transition probabilities.
abc_list abc_vecabc_list abc_vec
abc_vec: A character vector of length 1000 representing a single sequence generated by the VLMC.
abc_list: A list of three character vectors, each of length 1000, representing independent sequences generated by the VLMC.
An object of class character of length 1000.
The VLMC used to generate the data was based on an alphabet of three symbols (a, b, and c) with a context tree constructed to a maximal depth of 3. The active contexts and their corresponding transition probabilities are as follows:
*.a: c(0.10, 0.20, 0.70)
*.b: c(0.33, 0.33, 0.34)
*.c.a: c(0.20, 0.10, 0.70)
*.c.b: c(0.01, 0.98, 0.01)
*.c.c: c(0.40, 0.40, 0.20)
The sequences were generated using a seed value of 1 to ensure reproducibility.
The data was generated using the rvlmc function in the R package bacontrees.
# Access the single sequence data(abc_vec) print(abc_vec) # Access the list of sequences data(abc_list) print(abc_list)# Access the single sequence data(abc_vec) print(abc_vec) # Access the list of sequences data(abc_list) print(abc_list)
The baConTree class extends ContextTree to support Bayesian inference, including Dirichlet priors, context prior weights, and Metropolis-Hastings sampling for posterior inference on context trees.
This class provides methods for running MCMC and extracting posterior samples for Bayesian context tree models.
Gets the sampled chain stored.
A numeric (when log = TRUE) or a brob (when log = FALSE).
Invisibly returns the object itself. As a side-effect, the MAP tree is set as the active tree.
A list of length two, containing the prior and posterior probabilities
(or their logarithms, if log = TRUE), in this order.
bacontrees::ContextTree -> baConTree
bacontrees::ContextTree$activateByCode()bacontrees::ContextTree$activateFromContexts()bacontrees::ContextTree$activateMaximal()bacontrees::ContextTree$activateRoot()bacontrees::ContextTree$activeTreeCode()bacontrees::ContextTree$getActiveNodes()bacontrees::ContextTree$getAlphabet()bacontrees::ContextTree$getChildrenNodes()bacontrees::ContextTree$getDataset()bacontrees::ContextTree$getGrowableNodes()bacontrees::ContextTree$getInnerNodes()bacontrees::ContextTree$getLeaves()bacontrees::ContextTree$getMaximalDepth()bacontrees::ContextTree$getParentNode()bacontrees::ContextTree$getPrunableNodes()bacontrees::ContextTree$getSiblingNodes()bacontrees::ContextTree$growActive()bacontrees::ContextTree$igraph()bacontrees::ContextTree$nodeExists()bacontrees::ContextTree$print()bacontrees::ContextTree$pruneActive()bacontrees::ContextTree$root()bacontrees::ContextTree$setData()bacontrees::ContextTree$validate()new()
baConTree$new(
data,
maximalDepth = 5,
alpha,
priorWeights,
initialTree = c("map", "root")
)dataEither a vector with discrete data or a list of vectors.
maximalDepthDepth of the maximal tree considered.
alphaHyperparameter for the Dirichlet prior distribution of probabilities.
priorWeightsA function to be evaluated at each node that returns its weight in the prior distribution.
initialTreeEither "map" (default) or "root". Controls the
initial active tree: "map" activates the MAP tree immediately after
initialization, while "root" starts from the root-only tree.
runMetropolisHastings()
baConTree$runMetropolisHastings(steps)
stepsNumber of steps to run the Metropolis Hastings algorithm for.
This method supports progress monitoring via the progressr package.
Users can wrap the function call in with_progress() to display a progress
bar while the function executes. If no progress handler is registered, the
function will run without showing progress.
To enable progress, register a handler and wrap the function call in
with_progress().
getChain()
Chain generated via Metropolis Hastings algorithm.
baConTree$getChain()
sampleTree()
Samples a context tree exactly from the prior or posterior distribution using the per-node branching probabilities.
baConTree$sampleTree(type = c("prior", "posterior"))typeEither "prior" or "posterior". Determines which branching
probabilities are used for sampling. "prior" uses
priorBranchingProbability and "posterior" uses
posteriorBranchingProbability.
The algorithm starts from the root-only tree and at each non-leaf node
the tree is grown (via growActive) with probability equal to the node's
branching probability; otherwise the node remains a leaf of the sampled
tree and its subtree is never visited.
This yields an exact sample from the prior or posterior
distribution over context trees.
The branching probability at a node is
prod(children sigmaPrior) / sigmaPrior for the prior, and the
analogous quantity for the posterior, so that it equals one minus the
probability that the process stops at that node.
Returns the active tree code of the sampled tree (a
character string as produced by activeTreeCode()). As a side-effect,
the object's active tree is set to the sampled tree.
getMarginalLikelihood()
Marginal likelihood of the data under the Bayesian context tree model.
baConTree$getMarginalLikelihood(log = TRUE)
logLogical. If TRUE (default), returns the log marginal likelihood
as a plain R numeric. If FALSE, returns the marginal likelihood as a
brob object.
This is computed from sigmaPosterior/sigmaPrior at the root node,
which equals the sum over all trees of prior(T) * p(data | T),
normalised by the prior partition function.
activateMap()
Activates the smallest Maximum a Posteriori (MAP) tree under the Bayesian context tree model.
baConTree$activateMap()
Starting from the root node (initially active) and proceeding recursively
through its descendants. The method compares the posterior weight at the node
with the maximum posterior value attainable over all sub-tree configurations below it.
If stopping at the node achieves the maximum posterior (node$extra$isMapLeaf = TRUE) the
node remains active and the recursion stops along that branch (i.e., the sub-tree below
the node is pruned). Otherwise, the node is deactivated, its children are activated, and
the procedure continues recursively.
activeTreeProbabilities()
Computes the prior and posterior probabilities of the active tree under the Bayesian context tree model.
baConTree$activeTreeProbabilities(log = FALSE)
logLogical. If FALSE (default), returns the prior and posterior
probabilities of the active tree. If TRUE, returns their logarithms.
The probabilities are obtained by taking the product of the weights associated
with all active nodes in the tree. These quantities are then normalized by the
corresponding normalizing constants, namely the value of sigmaPrior at the
root node for the prior and the value of sigmaPosterior at the root for the posterior.
For numerical stability, all computations are performed on the logarithmic scale, which avoids underflow when dealing with small values.
clone()
The objects of this class are cloneable with this method.
baConTree$clone(deep = FALSE)
deepWhether to make a deep clone.
bt <- baConTree$new(abc_list, maximalDepth = 3, alpha = 0.01, priorWeights = function(node) exp(-1/3*node$getDepth())) bt$runMetropolisHastings(300) chain <- bt$getChain()bt <- baConTree$new(abc_list, maximalDepth = 3, alpha = 0.01, priorWeights = function(node) exp(-1/3*node$getDepth())) bt$runMetropolisHastings(300) chain <- bt$getChain()
The ContextTree class represents a variable-length Markov context tree, supporting construction, manipulation, and data assignment for context tree models. It manages nodes, active/inactive states, and provides methods for growing, pruning, and validating the tree.
This class is the core data structure for context tree modeling, supporting both root and maximal initialization, and efficient management of tree structure and data.
nodesList of nodes from a context tree (both active and non-active). Read-only.
new()
Initializes a ContextTree object with a given maximal depth.
If dataset is provided, the alphabet is inferred from data.
ContextTree$new(dataset = NULL, maximalDepth = 3, alphabet = NULL)
datasetA Sequence object, a character vector with a single observed chain or a list of vectors of observed chains to be set as data for the context tree.
maximalDepthDepth of the maximal tree considered.
alphabetEither an object of class Alphabet or a character vector with the symbols for the alphabet considered in the context tree.
root()
ContextTree$root()
Returns the Context Tree root node.
validate()
ContextTree$validate()
Returns a logical value incating whether the Context Tree is valid.
getDataset()
ContextTree$getDataset()
Returns the dataset as a Sequence object.
getAlphabet()
ContextTree$getAlphabet()
Returns the alphabet related to the Context Tree.
getMaximalDepth()
ContextTree$getMaximalDepth()
Returns the maximal depth of the tree.
getActiveNodes()
ContextTree$getActiveNodes(idx = TRUE)
idxA logical value. If TRUE, the function will return the index (path) of the node as a string. If FALSE, returns a list of nodes.
Returns a list of active nodes (leaf nodes of the active tree).
activeTreeCode()
ContextTree$activeTreeCode()
Returns a character value representing the active tree.
activateRoot()
Sets the active tree to be the one containing only the root node.
ContextTree$activateRoot()
activateMaximal()
Activates the leaf nodes of the maximal Context Tree.
ContextTree$activateMaximal()
activateByCode()
Sets the active tree to be the one corresponding to a tree code obtained
from the activeTreeCode method.
ContextTree$activateByCode(code)
codeThe tree code for the tree to be activated.
codeThe tree code for the tree to be activated.
activateFromContexts()
Sets the active tree to match a specified set of contexts, given as a character vector or a brace-enclosed comma-separated string. The contexts must be compatible with the tree's existing alphabet and maximal depth.
ContextTree$activateFromContexts(contexts)
contextsCharacter vector or string. A vector of context paths
(e.g., c("*.a", "*.b.a", "*.b.b")) or a single brace-enclosed
string (e.g., "\{*.a, *.b.a, *.b.b\}").
getLeaves()
ContextTree$getLeaves(idx = TRUE)
idxA logical value. If TRUE, the function will return the index (path) of the node as a string. If FALSE, returns a list of nodes.
Returns the leaf nodes of the maximal Context Tree (regardless of the current active tree).
getInnerNodes()
ContextTree$getInnerNodes(idx = TRUE)
idxA logical value. If TRUE, the function will return the index (path) of the node as a string. If FALSE, returns a list of nodes.
Returns the inner nodes nodes of the active Context Tree. Inner nodes are nodes that are in the path between the root (including) and active nodes.
nodeExists()
ContextTree$nodeExists(path)
pathA string representing the path of a node.
TRUE if a node with a given path exists in the Context Tree.
getParentNode()
ContextTree$getParentNode(path, idx = TRUE)
pathA string representing the path of a node.
idxA logical value. If TRUE, the function will return the index (path) of the node as a string. If FALSE, returns a list of nodes.
Returns the parent node of the node in a given path.
getChildrenNodes()
ContextTree$getChildrenNodes(path, idx = TRUE)
pathA string representing the path of a node.
idxA logical value. If TRUE, the function will return the index (path) of the node as a string. If FALSE, returns a list of nodes.
Returns the parent node of the node in a given path.
getSiblingNodes()
ContextTree$getSiblingNodes(path, idx = TRUE)
pathA string representing the path of a node.
idxA logical value. If TRUE, the function will return the index (path) of the node as a string. If FALSE, returns a list of nodes.
Returns the sibling nodes of the node in a given path.
growActive()
Replaces the node of a given path by its children in the active tree.
ContextTree$growActive(path)
pathA string representing the path of a node.
pruneActive()
Replaces the node of a given path and its siblings by the parent node in the active tree.
ContextTree$pruneActive(path)
pathA string representing the path of a node.
getGrowableNodes()
ContextTree$getGrowableNodes()
Returns a list of paths of active nodes that have children that can be activated.
getPrunableNodes()
ContextTree$getPrunableNodes()
Returns a list of paths of active nodes that have all siblings active.
setData()
Sets data for the Context Tree by setting the counts of occurrences of each symbol of the alphabet within each context (node) of the tree.
ContextTree$setData(dataset)
datasetA Sequence object, a character vector with a single observed chain or a list of vectors of observed chains to be set as data for the context tree.
igraph()
ContextTree$igraph(activeOnly = TRUE)
activeOnlylogical value. If TRUE, only the nodes in the active tree are plotted (including internal nodes).
Converts the ContextTree to an igraph object. All attributes
in the extra field of nodes are included in the attributes of each node
for the igraph.
Returns an igraph object.
print()
Prints the current active Context Tree and the counts for each context.
ContextTree$print()
clone()
The objects of this class are cloneable with this method.
ContextTree$clone(deep = FALSE)
deepWhether to make a deep clone.
tree <- ContextTree$new(abc_list, maximalDepth = 3) tree$activateMaximal() print(tree)tree <- ContextTree$new(abc_list, maximalDepth = 3) tree$activateMaximal() print(tree)
Fits a VLMC model to discrete sequence data using the context algorithm, performing likelihood ratio tests to prune the context tree.
fit_vlmc(data, cutoff = 10, max_length = 6)fit_vlmc(data, cutoff = 10, max_length = 6)
data |
Either a character vector (single sequence) or a list of character vectors (multiple sequences). |
cutoff |
Numeric. Cutoff value for the (log) likelihood ratio test statistic used for pruning. |
max_length |
Integer. Depth of the maximal tree considered. |
The function builds a maximal context tree, computes log-likelihoods for each node, and prunes nodes whose likelihood ratio test statistic is below the cutoff. The result is a pruned context tree representing the fitted VLMC.
A ContextTree object representing the fitted VLMC as the active tree.
data(abc_list) fit <- fit_vlmc(abc_list, cutoff = 20, max_length = 4) print(fit$getActiveNodes())data(abc_list) fit <- fit_vlmc(abc_list, cutoff = 20, max_length = 4) print(fit$getActiveNodes())
Runs the Metropolis-Hastings algorithm to sample from the posterior distribution of context trees given sequence data, Dirichlet priors, and context prior weights.
metropolis_vlmc( data, n_steps, max_depth = 6, alpha = 0.001, context_weights = function(node) 1, burnin = 100, thin = 1 )metropolis_vlmc( data, n_steps, max_depth = 6, alpha = 0.001, context_weights = function(node) 1, burnin = 100, thin = 1 )
data |
Either a character vector (single sequence) or a list of character vectors (multiple sequences). |
n_steps |
Integer. Number of MCMC steps to run. |
max_depth |
Integer. Maximum depth for the context tree. |
alpha |
Numeric. Dirichlet prior parameter for transition probabilities. |
context_weights |
Function. Returns the prior weight for a given node. |
burnin |
Integer. Number of initial iterations to discard from posterior summaries. |
thin |
Integer. Thinning interval for posterior summaries. |
The Markov chain is initialized from the maximal tree (all leaves active).
This function supports progress monitoring via the progressr package. Wrap the call in with_progress() to display a progress bar.
An object of class metropolis_vlmc with elements:
df: Data frame of context sets, tree codes, counts, and posterior probabilities.
codes: List of context sets for each tree code.
chain: The full chain of tree codes sampled.
data(abc_list) result <- metropolis_vlmc(abc_list, n_steps = 1000, max_depth = 3) print(result)data(abc_list) result <- metropolis_vlmc(abc_list, n_steps = 1000, max_depth = 3) print(result)
Plot method for baConTree class
## S3 method for class 'baConTree' plot(x, ...)## S3 method for class 'baConTree' plot(x, ...)
x |
A |
... |
Not used. |
Plots the full maximal tree. Each node is labelled and filled
according to its posterior branching probability,
i.e. the probability that the node has children in the true context tree
given the data. Values close to 1 (dark blue) indicate near-certain
branching; values close to 0 (white) indicate the node is likely a leaf.
Active edges are drawn solid and inactive edges dotted, matching the
convention of plot.ContextTree.
The plot is done using ggraph and can be further modified.
Plot method for ContextTree class
## S3 method for class 'ContextTree' plot(x, ..., activeOnly = TRUE)## S3 method for class 'ContextTree' plot(x, ..., activeOnly = TRUE)
x |
A |
... |
Not used. |
activeOnly |
logical value. If TRUE, only the nodes in the active tree are plotted (including internal nodes). |
The edges of the tree have their width corresponding to the number of occurrences of contexts matching the child node.
The plot is done using ggraph and can be further modified.
This function simulates one or more sequences using a Variable Length Markov Chain (VLMC) model, where the memory (context) length can vary depending on the context tree provided.
rvlmc(n, alphabet, context_list, context_probs)rvlmc(n, alphabet, context_list, context_probs)
n |
An integer specifying the length of the sequence to generate, or a vector of lengths for multiple sequences. |
alphabet |
A character vector representing the set of symbols used in the sequence. |
context_list |
A character vector specifying the contexts of the VLMC.
Each context is a dot-separated string starting with |
context_probs |
A list of numeric probability vectors, one per element
of |
The function generates sequences by traversing the context tree defined by
context_list and context_probs. For each position in the sequence, the
most specific matching context in context_list is used to sample the next
symbol according to the corresponding probability vector.
The first H symbols (where H is the depth of the deepest context) are
sampled uniformly from alphabet as an initialisation step and are not
drawn from the VLMC model.
If n is a single integer, returns a character vector of length n
representing the generated sequence. If n is a vector of length > 1,
returns a list of character vectors, each of the specified length.
# Define parameters for the VLMC n <- 1000 alphabet <- c("a", "b", "c") context_list <- c("*.a", "*.b", "*.c.a", "*.c.b", "*.c.c") context_probs <- list( c(0.10, 0.20, 0.70), # for *.a c(0.33, 0.33, 0.34), # for *.b c(0.20, 0.10, 0.70), # for *.c.a c(0.01, 0.98, 0.01), # for *.c.b c(0.40, 0.40, 0.20) # for *.c.c ) # Generate a single sequence sequence <- rvlmc(n, alphabet, context_list, context_probs) print(sequence) # Generate multiple sequences of different lengths n_vec <- c(100, 200, 150) sequences <- rvlmc(n_vec, alphabet, context_list, context_probs) str(sequences)# Define parameters for the VLMC n <- 1000 alphabet <- c("a", "b", "c") context_list <- c("*.a", "*.b", "*.c.a", "*.c.b", "*.c.c") context_probs <- list( c(0.10, 0.20, 0.70), # for *.a c(0.33, 0.33, 0.34), # for *.b c(0.20, 0.10, 0.70), # for *.c.a c(0.01, 0.98, 0.01), # for *.c.b c(0.40, 0.40, 0.20) # for *.c.c ) # Generate a single sequence sequence <- rvlmc(n, alphabet, context_list, context_probs) print(sequence) # Generate multiple sequences of different lengths n_vec <- c(100, 200, 150) sequences <- rvlmc(n_vec, alphabet, context_list, context_probs) str(sequences)
The Sequence class provides a unified representation of one or more
sequences of categorical symbols. When initialized, the input is encoded as
integers according to a user-defined or automatically inferred alphabet.
The class accepts:
a list of character or numeric vectors (multiple sequences)
a single character or numeric vector (one sequence)
The alphabet can be provided manually or inferred via infer_alphabet(),
which must return an object containing a field symbols defining the
valid symbol set.
Internally, all sequences are stored as integer vectors using
as.numeric(factor(..., levels = Alphabet$symbols)).
An R6 object of class Sequence.
dataA list of integer-encoded sequences.
nA vector with the length of each sequence.
AlphabetAn object defining the symbol set used for encoding.
initialize(x, alphabet = NULL)Create a new object from input.
print()Display a compact summary.
This class represents a tree node with a path, depth, and various methods for managing its state and validating its path.
extraList to hold extra information.
countsNumeric vector to hold counts (active binding; validated on write).
new()
Initializaes the TreeNode
TreeNode$new(path)
pathA character string representing the path of the node.
print()
Print the path of the node
TreeNode$print(...)
...Additional arguments passed to the cat function.
activate()
Activate the node.
This method sets the node's active status to TRUE.
TreeNode$activate()
deactivate()
Deactivate the node.
This method sets the node's active status to FALSE.
TreeNode$deactivate()
isActive()
TreeNode$isActive()
Logical value indicating if the node is active.
getDepth()
TreeNode$getDepth()
Integer value representing the depth of the node.
isLeaf()
TreeNode$isLeaf()
Returns a logical value indicating whether the node is a leaf (of the maximal tree, not the active tree).
getPath()
TreeNode$getPath()
Character string representing the path of the node.
getParentPath()
TreeNode$getParentPath()
Character string representing the parent path of the node.
getChildrenPaths()
TreeNode$getChildrenPaths()
Character string representing the paths for the children of a node.
validatePath()
Validate the node path against an alphabet
This method validates the node's path by checking if all elements of the path (excluding the first element) are in the provided alphabet.
TreeNode$validatePath(Alphabet)
AlphabetAn Alphabet object containing a symbols vector to validate against.
Logical value indicating if the path is valid.
clone()
The objects of this class are cloneable with this method.
TreeNode$clone(deep = FALSE)
deepWhether to make a deep clone.