Package 'tidylda'

Title: Latent Dirichlet Allocation Using 'tidyverse' Conventions
Description: Implements an algorithm for Latent Dirichlet Allocation (LDA), Blei et al. (2003), using style conventions from the 'tidyverse', Wickham et al. (2019) <doi:10.21105/joss.01686>, and 'tidymodels', Kuhn et al. Fitting is done via collapsed Gibbs sampling. Also implements several novel features for LDA, such as guided models and transfer learning, based on ongoing and as yet unpublished research.
Authors: Tommy Jones [aut, cre], Brendan Knapp [ctb], Barum Park [ctb]
Maintainer: Tommy Jones <[email protected]>
License: MIT + file LICENSE
Version: 0.0.5
Built: 2025-03-01 07:56:11 UTC
Source: CRAN

Augment method for tidylda objects


augment appends observation-level model outputs.


## S3 method for class 'tidylda'
augment(
  x,
  data,
  type = c("class", "prob"),
  document_col = "document",
  term_col = "term",
  ...
)



x: an object of class tidylda


data: a tidy tibble containing one row per original document-token pair, such as is returned by tdm_tidiers, with column names c("document", "term") at a minimum.


type: one of either "class" or "prob"


document_col: character specifying the name of the column that corresponds to document IDs. Defaults to "document".


term_col: character specifying the name of the column that corresponds to term/token IDs. Defaults to "term".


...: other arguments passed to methods, currently not used


The key statistic for augment is P(topic | document, token) = P(topic | token) * P(token | document). P(topic | token) are the entries of the 'lambda' matrix in the tidylda object passed with x. P(token | document) is taken to be the frequency of each token normalized within each document.
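
The statistic above can be illustrated with a minimal base R sketch. The lambda matrix and token counts below are invented toy values, not output from tidylda, and all variable names are assumptions made for illustration.

```r
# Toy lambda matrix: P(topic | token), 2 topics x 3 tokens.
# matrix() fills by column, so each column (token) sums to 1.
lambda <- matrix(
  c(0.9, 0.1,
    0.5, 0.5,
    0.2, 0.8),
  nrow = 2,
  dimnames = list(c("t1", "t2"), c("gene", "cell", "trial"))
)

# Toy token counts for a single document (invented values).
counts <- c(gene = 3, cell = 1, trial = 0)

# P(token | document): token frequency normalized within the document.
p_token_doc <- counts / sum(counts)

# P(topic | document, token) = P(topic | token) * P(token | document),
# obtained by scaling each column of lambda by that token's frequency.
p_topic_doc_token <- sweep(lambda, 2, p_token_doc, "*")

# Analogue of type = "class": most probable topic per token.
# which.max() returns the smallest index when topics are tied.
top_topic <- apply(p_topic_doc_token, 2, which.max)
```

Here the columns of p_topic_doc_token play the role of the per-topic columns appended when type = "prob", and top_topic plays the role of the single column appended when type = "class".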


augment returns a tidy tibble containing one row per document-token pair, with one or more columns appended, depending on the value of type.

If type = 'prob', then one column per topic is appended. Its value is P(topic | document, token).

If type = 'class', then the most-probable topic for each document-token pair is returned. If multiple topics are equally probable, then the topic with the smallest index is returned by default.

Probabilistic coherence of topics


Calculates the probabilistic coherence of a topic or topics. This approximates semantic coherence or human understandability of a topic.


calc_prob_coherence(beta, data, m = 5)



beta: A numeric matrix or a numeric vector. The vector, or the rows of the matrix, represent the numeric relationship between topic(s) and terms. For example, this relationship may be p(word|topic) or p(topic|word).


data: A document term matrix or term co-occurrence matrix. The preferred class is a dgCMatrix-class. However, there is support for any Matrix-class object as well as several other commonly-used classes such as matrix, dfm, DocumentTermMatrix, and simple_triplet_matrix.


m: An integer for the number of words to be used in the calculation. Defaults to 5.


For each pair of words {a, b} in the top M words in a topic, probabilistic coherence calculates P(b|a) - P(b), where {a} is more probable than {b} in the topic. For example, suppose the top 4 words in a topic are {a, b, c, d}. Then, we calculate

1. P(b|a) - P(b), P(c|a) - P(c), P(d|a) - P(d)
2. P(c|b) - P(c), P(d|b) - P(d)
3. P(d|c) - P(d)

All 6 differences are averaged together.
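
As a worked illustration, the base R sketch below computes this quantity by hand for one topic, using the convention P(b|a) - P(b) with {a} ranked above {b}. The document-term matrix and the term ranking are invented, probabilities are estimated as document frequencies (an assumption), and all pairwise differences are averaged together as described. It is not the package's implementation.

```r
# Toy document-term matrix: 4 documents x 3 terms (invented data).
dtm <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    1, 0, 0,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("a", "b", "c"))
)

# Top terms of the topic, ordered from most to least probable (assumed).
top_terms <- c("a", "b", "c")

# Binary presence/absence, then document-level joint probabilities.
present <- (dtm[, top_terms] > 0) * 1
p_joint <- crossprod(present) / nrow(dtm)  # P(x and y); diagonal is P(x)
p_marginal <- diag(p_joint)

# Every pair (i, j) with i ranked above j: P(j | i) - P(j).
pairs <- combn(length(top_terms), 2)
diffs <- apply(pairs, 2, function(ij) {
  hi <- ij[1]
  lo <- ij[2]
  p_joint[hi, lo] / p_marginal[hi] - p_marginal[lo]
})

# Probabilistic coherence of the topic: the average of all differences.
coherence <- mean(diffs)
```

With M top terms there are M * (M - 1) / 2 differences, which gives the 6 differences of the four-word example.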


Returns an object of class numeric corresponding to the probabilistic coherence of the input topic(s).


# Load a pre-formatted dtm
data(nih_sample_dtm)

# fit a model
model <- tidylda(
  data = nih_sample_dtm[1:20, ], k = 5,
  iterations = 100, burnin = 50
)

# probabilistic coherence of each topic, using the top 5 words
calc_prob_coherence(beta = model$beta, data = nih_sample_dtm, m = 5)

Glance method for tidylda objects


glance constructs a single-row summary "glance" of a tidylda topic model.


## S3 method for class 'tidylda'
glance(x, ...)



x: an object of class tidylda


...: other arguments passed to methods, currently not used


glance returns a one-row tibble with the following columns:

num_topics: the number of topics in the model
num_documents: the number of documents used for fitting
num_tokens: the number of tokens covered by the model
iterations: number of total Gibbs iterations run
burnin: number of burn-in Gibbs iterations run


dtm <- nih_sample_dtm

lda <- tidylda(data = dtm, k = 10, iterations = 100, burnin = 75)

glance(lda)


Abstracts and metadata from NIH research grants awarded in 2014


This dataset holds information on research grants awarded by the National Institutes of Health (NIH) in 2014. The data set was downloaded in approximately January of 2015. It includes both 'projects' and 'abstracts' files.




For nih_sample, a tibble of 100 randomly-sampled grants' abstracts and metadata. For nih_sample_dtm, a dgCMatrix-class representing the document term matrix of abstracts from 100 randomly-sampled grants.


National Institutes of Health ExPORTER

Draw from the marginal posteriors of a tidylda topic model


Sample from the marginal posteriors of a tidylda topic model. This is useful for quantifying uncertainty around the parameters of beta or theta.


posterior(x, ...)

## S3 method for class 'tidylda'
posterior(x, matrix, which, times, ...)



x: An object of class tidylda.


...: Other arguments, currently not used.


matrix: A character of either 'theta' or 'beta', indicating from which matrix to draw posterior samples.


which: Row index of theta, for document, or beta, for topic, from which to draw samples. which may also be a vector of indices to sample from multiple documents or topics simultaneously.


times: Integer, number of samples to draw.


posterior returns a tibble with one row per parameter per sample.

Returns a data frame where each row is a single sample from the posterior. Each column is the distribution over a single parameter. The variable var is a facet for subsetting by document (for theta) or topic (for beta).
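
To make the idea of sampling from a marginal posterior concrete, here is a minimal base R sketch. It assumes the textbook LDA result (see Heinrich, 2005) that the posterior of one row of theta is Dirichlet with parameters equal to topic counts plus alpha. The helper rdirichlet_sketch, the counts, and alpha are hypothetical; this is not how posterior is implemented.

```r
set.seed(42)

# Hypothetical helper: draw `times` samples from a Dirichlet distribution
# by normalizing independent gamma draws (base R has no Dirichlet sampler).
rdirichlet_sketch <- function(times, shape) {
  draws <- matrix(
    rgamma(times * length(shape), shape = rep(shape, each = times)),
    nrow = times
  )
  # dividing by rowSums() recycles down columns, normalizing each row
  draws / rowSums(draws)
}

# Invented posterior parameters for one document: topic counts plus alpha.
topic_counts <- c(12, 3, 0)
alpha <- 0.1
samples <- rdirichlet_sketch(times = 100, shape = topic_counts + alpha)

# One row per sample, one column per topic. Quantiles across samples
# summarize the uncertainty around each topic proportion.
intervals <- apply(samples, 2, quantile, probs = c(0.05, 0.5, 0.95))
```

The spread of each column across samples is the kind of uncertainty around theta or beta that the samples are meant to quantify.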


Heinrich, G. (2005) Parameter estimation for text analysis. Technical report.


# load some data
data(nih_sample_dtm)

# fit a model
m <- tidylda(
  data = nih_sample_dtm[1:20, ], k = 5,
  iterations = 200, burnin = 175
)

# sample from the marginal posterior corresponding to topic 1
t1 <- posterior(
  x = m,
  matrix = "beta",
  which = 1,
  times = 100
)

# sample from the marginal posterior corresponding to documents 5 and 6
d5 <- posterior(
  x = m,
  matrix = "theta",
  which = c(5, 6),
  times = 100
)

Get predictions from a Latent Dirichlet Allocation model


Obtains predictions of topics for new documents from a fitted LDA model.


## S3 method for class 'tidylda'
predict(
  object,
  new_data,
  type = c("prob", "class", "distribution"),
  method = c("gibbs", "dot"),
  iterations = NULL,
  burnin = -1,
  no_common_tokens = c("default", "zero", "uniform"),
  times = 100,
  threads = 1,
  verbose = TRUE,
  ...
)



object: a fitted object of class tidylda


new_data: a DTM or TCM of class dgCMatrix or a numeric vector


type: one of "prob", "class", or "distribution". Defaults to "prob".


method: one of either "gibbs" or "dot". If "gibbs", Gibbs sampling is used and iterations must be specified.


iterations: If method = "gibbs", an integer number of iterations for the Gibbs sampler to run. A future version may include automatic stopping criteria.


burnin: If method = "gibbs", an integer number of burnin iterations. If burnin is greater than -1, the entries of the resulting "theta" matrix are an average over all iterations greater than burnin. Behavior is the same as documented in tidylda.


no_common_tokens: behavior when encountering documents that have no tokens in common with the model. Options are "default", "zero", or "uniform". See Details, below, for an explanation of each behavior.


times: Integer, number of samples to draw if type = "distribution". Ignored if type is "class" or "prob". Defaults to 100.


threads: Number of parallel threads, defaults to 1. Note: currently ignored; only single-threaded prediction is implemented.


verbose: Logical. Do you want to print a progress bar out to the console? Only active if method = "gibbs". Defaults to TRUE.


...: Additional arguments, currently unused


If predict.tidylda encounters documents that have no tokens in common with the model in object, it will engage in one of three behaviors based on the setting of no_common_tokens.

"default" (the default) sets all topics to 0 for offending documents. This enables continued computations downstream in a way that NA would not. However, if no_common_tokens == "default", then predict.tidylda will emit a warning for every such document it encounters.

"zero" has the same behavior as "default", but it emits a message instead of a warning.

"uniform" sets every topic to 1/k for offending documents. It does not emit a warning or message.
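
The three behaviors can be sketched in base R as follows. The function handle_no_common_tokens is a hypothetical stand-in written for illustration. It is not part of tidylda, and the warning and message texts are invented.

```r
# Hypothetical helper (not part of tidylda): the topic values assigned to
# one document that shares no tokens with the model, for k topics.
handle_no_common_tokens <- function(k,
                                    no_common_tokens = c("default", "zero", "uniform")) {
  no_common_tokens <- match.arg(no_common_tokens)
  switch(no_common_tokens,
    default = {
      # all zeros, with a warning
      warning("document has no tokens in common with the model")
      rep(0, k)
    },
    zero = {
      # all zeros, with a message instead of a warning
      message("document has no tokens in common with the model")
      rep(0, k)
    },
    # every topic gets 1/k, silently
    uniform = rep(1 / k, k)
  )
}

handle_no_common_tokens(k = 4, no_common_tokens = "uniform")
```

Note that only "uniform" yields a row that sums to 1. The other two options yield a row of zeros, which downstream code should be prepared to handle.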


The output depends on whether the user sets type to "prob", "class", or "distribution". If "prob", the default, returns a "theta" matrix with one row per document and one column per topic. If "class", returns a vector with the topic index of the most likely topic in each document. If "distribution", returns a tibble with one row per parameter per sample. The number of samples is set by the times argument.


# load some data
data(nih_sample_dtm)

# fit a model
m <- tidylda(
  data = nih_sample_dtm[1:20, ], k = 5,
  iterations = 200, burnin = 175
)

# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  iterations = 200, burnin = 175
)

# predict on held-out documents using the dot product
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")

# compare the methods
barplot(rbind(p1[1, ], p2[1, ]), beside = TRUE, col = c("red", "blue"))

# predict classes on held out documents
p3 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  type = "class",
  iterations = 100, burnin = 75
)

# predict distribution on held out documents
p4 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  type = "distribution",
  iterations = 100, burnin = 75,
  times = 10
)

Print Method for tidylda


Print a summary for objects of class tidylda


## S3 method for class 'tidylda'
print(x, digits = max(3L, getOption("digits") - 3L), n = 5, ...)



x: an object of class tidylda


digits: minimal number of significant digits


n: Number of rows to show in each displayed tibble.


...: further arguments passed to or from other methods


Silently returns x


dtm <- nih_sample_dtm

lda <- tidylda(data = dtm, k = 10, iterations = 100)



print(lda, digits = 2)

Update a Latent Dirichlet Allocation topic model


Update an LDA model using collapsed Gibbs sampling.


## S3 method for class 'tidylda'
refit(
  object,
  new_data,
  iterations = NULL,
  burnin = -1,
  prior_weight = 1,
  additional_k = 0,
  additional_eta_sum = 250,
  optimize_alpha = FALSE,
  calc_likelihood = FALSE,
  calc_r2 = FALSE,
  return_data = FALSE,
  threads = 1,
  verbose = TRUE,
  ...
)



object: a fitted object of class tidylda.


new_data: A document term matrix or term co-occurrence matrix of class dgCMatrix.


iterations: Integer number of iterations for the Gibbs sampler to run.


burnin: Integer number of burnin iterations. If burnin is greater than -1, the resulting "beta" and "theta" matrices are an average over all iterations greater than burnin.


prior_weight: Numeric, 0 or greater, or NA. The weight of beta from the base model as a prior. See Details, below.


additional_k: Integer number of topics to add, defaults to 0.


additional_eta_sum: Numeric magnitude of prior for additional topics. Ignored if additional_k is 0. Defaults to 250.


optimize_alpha: Logical. Experimental. Do you want to optimize alpha every iteration? Defaults to FALSE.


calc_likelihood: Logical. Do you want to calculate the log likelihood every iteration? Useful for assessing convergence. Defaults to FALSE.


calc_r2: Logical. Do you want to calculate R-squared after the model is trained? Defaults to FALSE.


return_data: Logical. Do you want new_data returned as part of the model object? Defaults to FALSE.


threads: Number of parallel threads, defaults to 1.


verbose: Logical. Do you want to print a progress bar out to the console? Defaults to TRUE.


...: Additional arguments, currently unused


refit allows you to (a) update the probabilities (i.e. weights) of a previously-fit model with new data or additional iterations and (b) optionally use beta of a previously-fit LDA topic model as the eta prior for the new model. This is tuned with prior_weight: setting prior_weight = NA gives (a) only, while a numeric prior_weight also gives (b).

prior_weight tunes how strong the base model is represented in the prior. If prior_weight = 1, then the tokens from the base model's training data have the same relative weight as tokens in new_data. In other words, it is like just adding training data. If prior_weight is less than 1, then tokens in new_data are given more weight. If prior_weight is greater than 1, then the tokens from the base model's training data are given more weight.

If prior_weight is NA, then the new eta is equal to eta from the old model, with new tokens folded in. (For handling of new tokens, see below.) Effectively, this just controls how the sampler initializes (described below), but does not give prior weight to the base model.

Instead of initializing token-topic assignments in the manner for new models (see tidylda), the update initializes in 2 steps:

First, topic-document probabilities (i.e. theta) are obtained by a call to predict.tidylda using method = "dot" for the documents in new_data. Next, both beta and theta are passed to an internal function, initialize_topic_counts, which assigns topics to tokens in a manner approximately proportional to the posteriors and executes a single Gibbs iteration.

refit handles the addition of new vocabulary by adding a flat prior over new tokens. Specifically, each entry in the new prior is equal to the 10th percentile of eta from the old model. The resulting model will have the total vocabulary of the old model plus any new vocabulary tokens. In other words, after running refit.tidylda, ncol(beta) >= ncol(new_data), where beta is from the new model and new_data is the additional data.

You can add additional topics by setting the additional_k parameter to an integer greater than zero. New entries to alpha have a flat prior equal to the median value of alpha in the old model. (Note that if alpha itself is a flat prior, i.e. scalar, then the new topics have the same value for their prior.) New entries to eta have a shape from the average of all previous topics in eta and scaled by additional_eta_sum.
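
The prior construction for new tokens and additional topics can be sketched in base R. All values below are invented, and one reading of the text is an assumption here: "scaled by additional_eta_sum" is taken to mean that each new row of eta is rescaled to sum to additional_eta_sum. This is an illustration, not the code refit runs.

```r
# Invented priors from an "old" model: 2 topics x 3 tokens.
eta_old <- matrix(
  c(0.05, 0.20, 0.10,
    0.30, 0.05, 0.15),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("gene", "cell", "trial"))
)
alpha_old <- c(0.1, 0.3)

# New vocabulary: flat prior equal to the 10th percentile of the old eta.
new_token_prior <- unname(quantile(eta_old, probs = 0.1))
eta_new_vocab <- cbind(eta_old, protein = new_token_prior)

# Additional topics: alpha gets the median of the old alpha.
additional_k <- 1
additional_eta_sum <- 250
alpha_new <- c(alpha_old, rep(median(alpha_old), additional_k))

# New rows of eta: the average shape of the existing topics, rescaled so
# that each new row sums to additional_eta_sum (assumed interpretation).
shape <- colMeans(eta_new_vocab)
eta_row <- shape / sum(shape) * additional_eta_sum
eta_added <- matrix(rep(eta_row, each = additional_k), nrow = additional_k)
eta_new <- rbind(eta_new_vocab, eta_added)
```

In this toy case eta_new has one more column (the new token) and one more row (the new topic) than eta_old.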


Returns an S3 object of class c("tidylda").


Updates are, as of this writing, almost-surely useful, but their behaviors have not been optimized or well-studied. Caveat emptor!


# load a document term matrix
data(nih_sample_dtm)

d1 <- nih_sample_dtm[1:50, ]

d2 <- nih_sample_dtm[51:100, ]

# fit a model
m <- tidylda(d1,
  k = 10,
  iterations = 200, burnin = 175
)

# update an existing model by adding documents using old model as prior
m2 <- refit(
  object = m,
  new_data = rbind(d1, d2),
  iterations = 200,
  burnin = 175,
  prior_weight = 1
)

# use an old model to initialize new model and not use old model as prior
m3 <- refit(
  object = m,
  new_data = d2, # new documents only
  iterations = 200,
  burnin = 175,
  prior_weight = NA
)

# add topics while updating a model by adding documents
m4 <- refit(
  object = m,
  new_data = rbind(d1, d2),
  additional_k = 3,
  iterations = 200,
  burnin = 175
)

Tidy a matrix from a tidylda topic model


Tidy the result of a tidylda topic model


## S3 method for class 'tidylda'
tidy(x, matrix, log = FALSE, ...)

## S3 method for class 'matrix'
tidy(x, matrix, log = FALSE, ...)



x: an object of class tidylda or an individual beta, theta, or lambda matrix.


matrix: the matrix to tidy; one of 'beta', 'theta', or 'lambda'


log: do you want to have the result on a log scale? Defaults to FALSE


...: other arguments passed to methods, currently not used


Returns a tibble.

If matrix = "beta" then the result is a table of one row per topic and token with the following columns: topic, token, beta

If matrix = "theta" then the result is a table of one row per document and topic with the following columns: document, topic, theta

If matrix = "lambda" then the result is a table of one row per topic and token with the following columns: topic, token, lambda


  • tidy(matrix): Tidy an individual matrix. Useful for predictions and called from tidy.tidylda.


If log = TRUE, then "log_" will be prepended to the name of the third column of the resulting table, e.g. "beta" becomes "log_beta".


dtm <- nih_sample_dtm

lda <- tidylda(data = dtm, k = 10, iterations = 100, burnin = 75)

tidy_beta <- tidy(lda, matrix = "beta")

tidy_theta <- tidy(lda, matrix = "theta")

tidy_lambda <- tidy(lda, matrix = "lambda")

Fit a Latent Dirichlet Allocation topic model


Fit a Latent Dirichlet Allocation topic model using collapsed Gibbs sampling.


tidylda(
  data,
  k,
  iterations = NULL,
  burnin = -1,
  alpha = 0.1,
  eta = 0.05,
  optimize_alpha = FALSE,
  calc_likelihood = TRUE,
  calc_r2 = FALSE,
  threads = 1,
  return_data = FALSE,
  verbose = TRUE,
  ...
)



data: A document term matrix or term co-occurrence matrix. The preferred class is a dgCMatrix-class. However, there is support for any Matrix-class object as well as several other commonly-used classes such as matrix, dfm, DocumentTermMatrix, and simple_triplet_matrix.


k: Integer number of topics.


iterations: Integer number of iterations for the Gibbs sampler to run.


burnin: Integer number of burnin iterations. If burnin is greater than -1, the resulting "beta" and "theta" matrices are an average over all iterations greater than burnin.


alpha: Numeric scalar or vector of length k. This is the prior for topics over documents.


eta: Numeric scalar, numeric vector of length ncol(data), or numeric matrix with k rows and ncol(data) columns. This is the prior for words over topics.


optimize_alpha: Logical. Do you want to optimize alpha every iteration? Defaults to FALSE. See Details, below, for more information.


calc_likelihood: Logical. Do you want to calculate the log likelihood every iteration? Useful for assessing convergence. Defaults to TRUE.


calc_r2: Logical. Do you want to calculate R-squared after the model is trained? Defaults to FALSE. See calc_lda_r2.


threads: Number of parallel threads, defaults to 1. See Details, below.


return_data: Logical. Do you want data returned as part of the model object? Defaults to FALSE.


verbose: Logical. Do you want to print a progress bar out to the console? Defaults to TRUE.


...: Additional arguments, currently unused


This function calls a collapsed Gibbs sampler for Latent Dirichlet Allocation written using the excellent Rcpp package. Some implementation notes follow:

Topic-token and topic-document assignments are not initialized based on a uniform-random sampling, as is common. Instead, topic-token probabilities (i.e. beta) are initialized by sampling from a Dirichlet distribution with eta as its parameter. The same is done for topic-document probabilities (i.e. theta) using alpha. Then an internal function is called (initialize_topic_counts) to run a single Gibbs iteration to initialize assignments of tokens to topics and topics to documents.

When you use burn-in iterations (i.e. burnin is greater than -1), the resulting beta and theta matrices are calculated by averaging over every iteration after the specified number of burn-in iterations. If you do not use burn-in iterations, then the matrices are calculated from the last run only. Ideally, you'd burn in every iteration before convergence, then average over the chain after it has converged (and thus every observation is independent).
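
The effect of burnin on the returned estimates can be sketched for a single parameter in base R. The helper summarize_chain and the chain values are hypothetical. The sampler itself averages whole beta and theta matrices rather than one number.

```r
# Hypothetical helper (not part of tidylda): summarize a chain of
# per-iteration estimates of a single parameter.
summarize_chain <- function(chain, burnin = -1) {
  if (burnin > -1) {
    # average over every iteration after the burn-in period
    mean(chain[seq_along(chain) > burnin])
  } else {
    # no burn-in: use the final iteration only
    chain[length(chain)]
  }
}

# Invented estimates from 8 Gibbs iterations.
chain <- c(0.9, 0.5, 0.3, 0.22, 0.18, 0.21, 0.19, 0.20)

summarize_chain(chain, burnin = 4)  # mean of iterations 5 through 8
summarize_chain(chain)              # last iteration only
```

Averaging after burn-in smooths out iteration-to-iteration noise, at the cost of mixing in pre-convergence draws if burnin is set too low.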

If you set optimize_alpha to TRUE, then each element of alpha is proportional to the number of times each topic has been sampled that iteration, averaged with the value of alpha from the previous iteration. This lets you start with a symmetric alpha and drift into an asymmetric one. However, (a) this probably means that convergence will take longer to happen, or convergence may not happen at all, and (b) I make no guarantees that doing this will give you any benefit or that it won't hurt your model. Caveat emptor!

The log likelihood calculation is the same that can be found on page 9 of The only difference is that the version in tidylda allows eta to be a vector or matrix. (Vector used in this function, matrix used for model updates in refit.tidylda.) At present, the log likelihood function appears to be ok for assessing convergence, i.e. it has the right shape. However, it is, as of this writing, returning positive numbers rather than the expected negative numbers. Looking into that, but in the meantime caveat emptor once again.

Parallelism is not currently implemented. The threads argument is a placeholder for planned enhancements.


Returns an S3 object of class tidylda. See new_tidylda.


# load some data
data(nih_sample_dtm)

# fit a model
m <- tidylda(
  data = nih_sample_dtm[1:20, ], k = 5,
  iterations = 200, burnin = 175
)

# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  iterations = 200, burnin = 175
)

# predict on held-out documents using the dot product method
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")

# compare the methods
barplot(rbind(p1[1, ], p2[1, ]), beside = TRUE, col = c("red", "blue"))