---
title: "1. Getting started: basic analysis and trajectory trees"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{1. Getting started: basic analysis and trajectory trees}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>",
                      fig.width = 8, fig.height = 5, out.width = "100%")
library(transitiontrees)
```

`transitiontrees` fits a **variable-depth prediction suffix tree** to categorical sequence data and reports it as a tidy, pathway-centric set of tables and plots. A fixed-order Markov chain assumes memory is the *same length everywhere*; a variable-order tree lets the **data decide, per context, how much history matters**.

This first vignette walks the core workflow end to end and finishes with the two **trajectory trees** that draw the sequences forward in time. The other vignettes go further: *Complete analysis case* reads one dataset all the way through, *Ecosystem compatibility* shows the `tna` / `Nestimate` hand-off (and `TraMineR`-export compatibility), *Advanced analysis* covers tuning, bootstrapping and comparison, and *Visualization* tours every plot.

## 1. Fit

`context_tree()` accepts a wide character matrix or data.frame, a list of character vectors, or a long event log. We start with the bundled `trajectories` matrix (138 learners x 15 time-steps, three engagement states; trailing `NA`s mark dropouts).

```{r fit}
data(trajectories)
dim(trajectories)

tree <- context_tree(trajectories, max_depth = 3L, min_count = 5L)
tree
```

`max_depth` caps how long a history (context) may be; `min_count` is the minimum number of times a context must occur to earn its own node.

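To make the two arguments concrete, here is a minimal base-R sketch of the counting idea only. It is not the package's implementation: `count_contexts()` and the toy `seqs` list are hypothetical names introduced for illustration, and treating every contiguous run of states as a candidate context is a simplifying assumption. The sketch tallies every run of up to `max_depth` consecutive states and keeps those seen at least `min_count` times.

```{r sketch-contexts}
# Hypothetical helper, not part of transitiontrees: tally every contiguous
# run of 1..max_depth states and keep those seen at least min_count times.
count_contexts <- function(seqs, max_depth = 2L, min_count = 2L) {
  runs <- character(0)
  for (s in seqs) {
    s <- s[!is.na(s)]                 # drop NA padding (toy data: trailing only)
    n <- length(s)
    for (k in seq_len(max_depth)) {
      if (n < k) next                 # sequence too short for a run of length k
      for (i in seq_len(n - k + 1L)) {
        runs <- c(runs, paste(s[i:(i + k - 1L)], collapse = " -> "))
      }
    }
  }
  counts <- table(runs)
  sort(counts[counts >= min_count], decreasing = TRUE)
}

# Toy data (an assumption for illustration), three short sequences.
seqs <- list(c("A", "A", "B"), c("A", "B", "B"), c("A", "A", NA))
kept <- count_contexts(seqs, max_depth = 2L, min_count = 2L)
kept
```

In this toy example `B -> B` occurs only once, so it falls below `min_count = 2` and would not earn a node of its own; raising `max_depth` only adds longer runs as candidates, it does not guarantee they survive the count threshold.
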
A **long event log** is reshaped internally -- just name the columns:

```{r long}
data(group_regulation_long)
head(group_regulation_long)

tree_long <- context_tree(group_regulation_long,
                          actor = "Actor", time = "Time", action = "Action",
                          max_depth = 2L, min_count = 5L)
n_nodes(tree_long)
```

## 2. Inspect the fit

```{r inspect}
summary(tree)
model_fit(tree)   # logLik, df, nobs, AIC, BIC, perplexity
```

Perplexity is the effective number of equally likely next states; it sits below the uniform baseline (the alphabet size, here 3) when history is informative.

## 3. The pathway tables

Every accessor returns a plain `data.frame` in one canonical schema, so the views join cleanly. Pathways read left-to-right oldest-to-newest (`A -> B -> C`); the root context is shown as `(start)`.

```{r tables}
common_pathways(tree, top = 6)      # by frequency
divergent_pathways(tree, top = 6)   # by divergence from the shorter history
sharp_pathways(tree, top = 6)       # by how peaked the next-state prediction is
```

`changes_prediction = TRUE` flags a context whose single most likely next state differs from its parent's -- the histories where memory overturns the corpus-wide default.

The lesson the tables teach together: **common is not the same as informative**. The most frequent pathways are the backbone of the corpus; the divergent ones carry the insight.

## 4. Prune to the reliable tree

A context can survive fitting yet not earn its depth. `prune_tree()` collapses contexts whose extra history is not a significant gain over their parent (default: a likelihood-ratio G-squared test).

```{r prune}
pruned <- prune_tree(tree, criterion = "G2", alpha = 0.05)
pruned
```

The pruned tree's banner reports its node count and the criterion used -- compare it to the unpruned `tree` printed in section 1 to see how much the G-squared test removed.

## 5. Predict

```{r predict}
predict(pruned, c("Active", "Active"), type = "class")            # most likely next
round(predict(pruned, c("Active", "Active"), type = "prob"), 3)   # full distribution
```

When an exact context is missing from the tree, prediction backs off to the longest matching suffix -- the property that makes a *variable*-order model robust: it never refuses to predict, it just uses as much history as it has evidence for.

## 6. A first tree plot

Just `plot()` the tree. The default is a horizontal layout: node size is the context count, the colour is the most-recent state, and the predicted next state sits under each node.

```{r plot, fig.width = 14, fig.height = 8}
plot(pruned)
```

`plot()` also takes `style = "dendrogram"`, `"icicle"`, or `"interactive"` for the same tree in other layouts -- the *Visualization* vignette tours all four.

## 7. Trajectory trees: where sequences go, and how predictably

The context tree reads *backwards* -- a node is a suffix, the most-recent state. The same sequences can be drawn *forwards* as a **trajectory tree**: start at the left, every path is a run of states unfolding in time.

Forward trajectories are most informative on a richer alphabet, so we switch to the bundled `ai_long` log -- one row per AI-prompting move (eight move types: `Execute`, `Investigate`, `Plan`, ...), with a session id. `context_tree()` reads it directly.

```{r traj-fit}
data(ai_long)

tree_ai <- context_tree(ai_long,
                        actor = "project", session = "session_id", action = "code",
                        max_depth = 3L, min_count = 10L)
pruned_ai <- prune_tree(tree_ai)
tree_ai
```

`plot_trajectories()` draws the forward prefix tree and colours the one tree two ways.

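The forward view rests on simple prefix counting. As a rough base-R illustration of that idea (not how `plot_trajectories()` is implemented; `prefix_counts()` and the toy `sessions` list are hypothetical names made up for this sketch), the code counts how many sequences begin with each prefix, which is the kind of quantity a frequency colouring would display.

```{r sketch-prefixes}
# Hypothetical helper, not part of transitiontrees: count how many
# sequences begin with each prefix of length 1..max_depth.
prefix_counts <- function(seqs, max_depth = 3L) {
  prefixes <- unlist(lapply(seqs, function(s) {
    depth <- min(length(s), max_depth)
    vapply(seq_len(depth),
           function(k) paste(s[seq_len(k)], collapse = " -> "),
           character(1))
  }))
  counts <- table(prefixes)
  data.frame(path = names(counts),
             n = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy sessions (an assumption for illustration), using three of the move types.
sessions <- list(c("Plan", "Execute", "Investigate"),
                 c("Plan", "Execute", "Execute"),
                 c("Plan", "Investigate"),
                 c("Execute", "Plan"))
paths <- prefix_counts(sessions, max_depth = 2L)
paths[order(-paths$n), ]
```

Each row is one forward path from the opening move; every session contributes once to each of its prefixes, so a path's count can never exceed the count of the shorter path it extends.
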
### By frequency -- how many sequences walk each path

```{r traj-frequency, fig.width = 11, fig.height = 7}
plot_trajectories(tree_ai, measure = "frequency", min_count = 20L)
```

Node fill and edge width both scale to the number of sessions on each path, so the thick, dark branches are the prompting routines most projects actually follow -- the corpus's highways from the opening move outward.

### By predictability -- how confidently the model calls each step

```{r traj-predictability, fig.width = 11, fig.height = 7}
plot_trajectories(pruned_ai, measure = "predictability", min_count = 20L)
```

Same nodes and edges, but each edge is now coloured by `P(state | history)` from the model. Reading the two side by side separates **traffic** from **predictability**: an edge that is wide (frequency) but pale (predictability) is a *decision point* -- many sessions reach it, but the next move is genuinely open. Those forks are where behaviour is decided rather than executed.

## Where to go next

| You want to... | See vignette |
|---|---|
| Read one dataset all the way through | *Complete analysis case* |
| Feed in a `tna` / `Nestimate` object (or `TraMineR` export) | *Ecosystem compatibility* |
| Tune, bootstrap, and compare cohorts | *Advanced analysis* |
| Tour every plot style | *Visualization* |