Title: | Latent Dirichlet Allocation Using 'tidyverse' Conventions |
---|---|
Description: | Implements an algorithm for Latent Dirichlet Allocation (LDA), Blei et al. (2003) <https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf>, using style conventions from the 'tidyverse', Wickham et al. (2019) <doi:10.21105/joss.01686>, and 'tidymodels', Kuhn et al. <https://tidymodels.github.io/model-implementation-principles/>. Fitting is done via collapsed Gibbs sampling. Also implements several novel features for LDA, such as guided models and transfer learning, based on ongoing and as yet unpublished research. |
Authors: | Tommy Jones [aut, cre], Brendan Knapp [ctb], Barum Park [ctb] |
Maintainer: | Tommy Jones <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.5 |
Built: | 2024-12-31 08:11:25 UTC |
Source: | CRAN |
tidylda objects
augment appends observation-level model outputs.
## S3 method for class 'tidylda'
augment(
  x,
  data,
  type = c("class", "prob"),
  document_col = "document",
  term_col = "term",
  ...
)
x |
an object of class |
data |
a tidy tibble containing one row per original document-token pair, such as is returned by tdm_tidiers with column names c("document", "term") at a minimum. |
type |
one of either "class" or "prob" |
document_col |
character specifying the name of the column that corresponds to document IDs. Defaults to |
term_col |
character specifying the name of the column that corresponds to term/token IDs. Defaults to |
... |
other arguments passed to methods, currently not used |
The key statistic for augment is P(topic | document, token) = P(topic | token) * P(token | document). P(topic | token) are the entries of the 'lambda' matrix in the tidylda object passed with x. P(token | document) is taken to be the frequency of each token normalized within each document.
augment returns a tidy tibble containing one row per document-token pair, with one or more columns appended, depending on the value of type.
If type = 'prob', then one column per topic is appended. Its value is P(topic | document, token).
If type = 'class', then the most-probable topic for each document-token pair is returned. If multiple topics are equally probable, then the topic with the smallest index is returned by default.
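The calculation described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the `lambda` matrix and the token counts are made-up toy values, and the product is left unnormalized (rescaling each row to sum to one would not change the most probable topic).

```r
# Toy P(topic | token): 2 topics x 3 tokens, each column sums to 1.
# All values here are invented for illustration.
lambda <- matrix(
  c(0.9, 0.1,
    0.5, 0.5,
    0.2, 0.8),
  nrow = 2,
  dimnames = list(c("t1", "t2"), c("gene", "cell", "trial"))
)

# Tidy document-token counts, one row per document-token pair.
pairs <- data.frame(
  document = c("d1", "d1", "d2", "d2"),
  term = c("gene", "cell", "cell", "trial"),
  n = c(3, 1, 2, 2),
  stringsAsFactors = FALSE
)

# P(token | document): token frequency normalized within each document.
pairs$p_token_doc <- pairs$n / ave(pairs$n, pairs$document, FUN = sum)

# P(topic | token) * P(token | document): one column per topic,
# analogous to type = "prob". Row i is scaled by p_token_doc[i].
prob <- t(lambda[, pairs$term]) * pairs$p_token_doc

# Most probable topic per pair, analogous to type = "class".
# which.max() returns the first maximum, so ties go to the smallest index.
topic_class <- apply(prob, 1, which.max)
```

In this toy input the pair (d1, cell) is an exact tie between the two topics, so it is assigned topic 1.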
Calculates the probabilistic coherence of a topic or topics. This approximates semantic coherence or human understandability of a topic.
calc_prob_coherence(beta, data, m = 5)
beta |
A numeric matrix or a numeric vector. The vector, or the rows of the matrix, represent the numeric relationship between topic(s) and terms. For example, this relationship may be p(word|topic) or p(topic|word). |
data |
A document term matrix or term co-occurrence matrix. The preferred
class is a |
m |
An integer for the number of words to be used in the calculation. Defaults to 5 |
For each pair of words {a, b} in the top M words in a topic, probabilistic coherence calculates P(b|a) - P(b), where {a} is more probable than {b} in the topic. For example, suppose the top 4 words in a topic are {a, b, c, d}. Then, we calculate
1. P(b|a) - P(b), P(c|a) - P(c), P(d|a) - P(d)
2. P(c|b) - P(c), P(d|b) - P(d)
3. P(d|c) - P(d)
All 6 differences are averaged together.
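The definition P(b|a) - P(b) can be illustrated with a small base R sketch. It assumes probabilities are estimated from document-level occurrence (a term either appears in a document or it does not), which may differ from how `calc_prob_coherence` estimates them; `prob_coherence_sketch` and the toy matrix are hypothetical and exist only for this example.

```r
# Toy document-term matrix: 4 documents x 4 terms (counts, invented).
dtm <- matrix(
  c(2, 1, 0, 0,
    1, 1, 1, 0,
    0, 1, 0, 1,
    1, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("a", "b", "c", "d"))
)

# One topic's term weights, e.g. p(word | topic).
beta_row <- c(a = 0.4, b = 0.3, c = 0.2, d = 0.1)

# Hypothetical helper: average of P(b|a) - P(b) over ordered top-m pairs.
# Assumes m >= 2 and a named beta_row whose names are columns of dtm.
prob_coherence_sketch <- function(beta_row, dtm, m = 5) {
  m <- min(m, length(beta_row))
  top <- names(sort(beta_row, decreasing = TRUE))[seq_len(m)]
  present <- dtm[, top, drop = FALSE] > 0  # does the term occur in the doc?
  p <- colMeans(present)                   # P(term)
  diffs <- numeric(0)
  for (i in seq_len(m - 1)) {
    for (j in (i + 1):m) {
      docs_with_a <- present[, i]
      # P(b | a): share of documents containing a that also contain b
      p_b_given_a <- if (any(docs_with_a)) mean(present[docs_with_a, j]) else 0
      diffs <- c(diffs, p_b_given_a - p[j])
    }
  }
  mean(diffs)
}

coherence <- prob_coherence_sketch(beta_row, dtm, m = 4)
```

With four top words the inner loops visit exactly the six ordered pairs listed above.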
Returns an object of class numeric corresponding to the probabilistic coherence of the input topic(s).
# Load a pre-formatted dtm and topic model
data(nih_sample_dtm)

# fit a model
set.seed(12345)
model <- tidylda(
  data = nih_sample_dtm[1:20, ],
  k = 5,
  iterations = 100,
  burnin = 50
)

calc_prob_coherence(beta = model$beta, data = nih_sample_dtm, m = 5)
tidylda objects
glance constructs a single-row summary "glance" of a tidylda topic model.
## S3 method for class 'tidylda'
glance(x, ...)
x |
an object of class |
... |
other arguments passed to methods, currently not used |
glance returns a one-row tibble with the following columns:
num_topics: the number of topics in the model
num_documents: the number of documents used for fitting
num_tokens: the number of tokens covered by the model
iterations: number of total Gibbs iterations run
burnin: number of burn-in Gibbs iterations run
dtm <- nih_sample_dtm
lda <- tidylda(data = dtm, k = 10, iterations = 100, burnin = 75)
glance(lda)
This dataset holds information on research grants awarded by the National Institutes of Health (NIH) in 2014. The data set was downloaded in approximately January of 2015. It includes both 'projects' and 'abstracts' files.
data("nih_sample")
For nih_sample, a tibble of 100 randomly-sampled grants' abstracts and metadata. For nih_sample_dtm, a dgCMatrix-class representing the document term matrix of abstracts from 100 randomly-sampled grants.
National Institutes of Health ExPORTER https://reporter.nih.gov/exporter
Sample from the marginal posteriors of a tidylda topic model. This is useful for quantifying uncertainty around the parameters of beta or theta.
posterior(x, ...)

## S3 method for class 'tidylda'
posterior(x, matrix, which, times, ...)
x |
An object of class |
... |
Other arguments, currently not used. |
matrix |
A character of either 'theta' or 'beta', indicating from which matrix to draw posterior samples. |
which |
Row index of |
times |
Integer, number of samples to draw. |
posterior returns a tibble with one row per parameter per sample.
Returns a data frame where each row is a single sample from the posterior. Each column is the distribution over a single parameter. The variable var is a facet for subsetting by document (for theta) or topic (for beta).
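For intuition, a Dirichlet draw can be generated by normalizing independent gamma variates. The sketch below assumes that the marginal posterior for one document's row of theta is Dirichlet with parameter equal to that document's topic counts plus alpha, in the spirit of Heinrich (2005). The counts and alpha are toy values, and `rdirichlet_sketch` is a hypothetical helper, not a tidylda function.

```r
# Hypothetical helper: draw `times` samples from Dirichlet(conc)
# by normalizing gamma variates. Returns a times x length(conc) matrix.
rdirichlet_sketch <- function(times, conc) {
  draws <- matrix(
    rgamma(times * length(conc), shape = conc, rate = 1),
    nrow = times, byrow = TRUE  # column j always uses shape conc[j]
  )
  draws / rowSums(draws)
}

set.seed(12345)
topic_counts <- c(12, 3, 0, 5)  # toy topic counts for one document
alpha <- 0.1                    # toy symmetric prior

samples <- rdirichlet_sketch(times = 100, conc = topic_counts + alpha)
colnames(samples) <- paste0("topic_", seq_along(topic_counts))

# Empirical 95% interval for each topic's share in this document.
intervals <- apply(samples, 2, quantile, probs = c(0.025, 0.975))
```

Each row of `samples` is one draw of the document's topic distribution, so column-wise quantiles give a simple uncertainty band per parameter.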
Heinrich, G. (2005) Parameter estimation for text analysis. Technical report. http://www.arbylon.net/publications/text-est.pdf
# load some data
data(nih_sample_dtm)

# fit a model
set.seed(12345)

m <- tidylda(
  data = nih_sample_dtm[1:20, ],
  k = 5,
  iterations = 200,
  burnin = 175
)

# sample from the marginal posterior corresponding to topic 1
t1 <- posterior(
  x = m,
  matrix = "beta",
  which = 1,
  times = 100
)

# sample from the marginal posterior corresponding to documents 5 and 6
d5 <- posterior(
  x = m,
  matrix = "theta",
  which = c(5, 6),
  times = 100
)
Obtains predictions of topics for new documents from a fitted LDA model.
## S3 method for class 'tidylda'
predict(
  object,
  new_data,
  type = c("prob", "class", "distribution"),
  method = c("gibbs", "dot"),
  iterations = NULL,
  burnin = -1,
  no_common_tokens = c("default", "zero", "uniform"),
  times = 100,
  threads = 1,
  verbose = TRUE,
  ...
)
object |
a fitted object of class |
new_data |
a DTM or TCM of class |
type |
one of "prob", "class", or "distribution". Defaults to "prob". |
method |
one of either "gibbs" or "dot". If "gibbs" Gibbs sampling is used
and |
iterations |
If |
burnin |
If |
no_common_tokens |
behavior when encountering documents that have no tokens
in common with the model. Options are " |
times |
Integer, number of samples to draw if |
threads |
Number of parallel threads, defaults to 1. Note: currently ignored; only single-threaded prediction is implemented. |
verbose |
Logical. Do you want to print a progress bar out to the console?
Only active if |
... |
Additional arguments, currently unused |
If predict.tidylda encounters documents that have no tokens in common with the model in object, it will engage in one of three behaviors based on the setting of no_common_tokens.
default (the default) sets all topics to 0 for offending documents. This enables continued computations downstream in a way that NA would not. However, if no_common_tokens == "default", then predict.tidylda will emit a warning for every such document it encounters.
zero has the same behavior as default, but it emits a message instead of a warning.
uniform sets every topic to 1/k for offending documents. It does not emit a warning or message.
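The three behaviors can be sketched in base R as follows. `handle_no_overlap` is a hypothetical helper written for this illustration, not a tidylda function, and the matrices are toy values; it only mirrors the documented behavior and makes no claim about how predict.tidylda is implemented.

```r
# Hypothetical helper mirroring the three documented modes.
handle_no_overlap <- function(theta, no_overlap,
                              mode = c("default", "zero", "uniform")) {
  mode <- match.arg(mode)
  if (!any(no_overlap)) {
    return(theta)
  }
  if (mode == "uniform") {
    theta[no_overlap, ] <- 1 / ncol(theta)  # silent: no warning, no message
    return(theta)
  }
  theta[no_overlap, ] <- 0
  for (d in rownames(theta)[no_overlap]) {
    txt <- paste("document", d, "has no tokens in common with the model")
    if (mode == "default") warning(txt, call. = FALSE) else message(txt)
  }
  theta
}

# Toy vocabulary and new documents: d2 contains only an unseen token.
model_vocab <- c("gene", "cell", "trial")
new_dtm <- matrix(
  c(1, 0, 2, 0,
    0, 0, 0, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("d1", "d2"), c("gene", "cell", "trial", "zebra"))
)
shared <- intersect(colnames(new_dtm), model_vocab)
no_overlap <- rowSums(new_dtm[, shared, drop = FALSE]) == 0

# Toy predicted theta (2 documents x 2 topics).
theta <- matrix(
  c(0.7, 0.3,
    0.5, 0.5),
  nrow = 2, byrow = TRUE,
  dimnames = list(rownames(new_dtm), c("t1", "t2"))
)

theta_uniform <- handle_no_overlap(theta, no_overlap, mode = "uniform")
theta_zero <- suppressMessages(handle_no_overlap(theta, no_overlap, mode = "zero"))
```

Zero rows keep downstream matrix arithmetic working where NA would propagate, while uniform rows remain valid probability distributions.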
type gives different outputs depending on whether the user selects "prob", "class", or "distribution". If "prob", the default, returns a "theta" matrix with one row per document and one column per topic. If "class", returns a vector with the topic index of the most likely topic in each document. If "distribution", returns a tibble with one row per parameter per sample. The number of samples is set by the times argument.
# load some data
data(nih_sample_dtm)

# fit a model
set.seed(12345)

m <- tidylda(
  data = nih_sample_dtm[1:20, ],
  k = 5,
  iterations = 200,
  burnin = 175
)

str(m)

# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  iterations = 200,
  burnin = 175
)

# predict on held-out documents using the dot product
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")

# compare the methods
barplot(rbind(p1[1, ], p2[1, ]), beside = TRUE, col = c("red", "blue"))

# predict classes on held out documents
p3 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  type = "class",
  iterations = 100,
  burnin = 75
)

# predict distribution on held out documents
p4 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  type = "distribution",
  iterations = 100,
  burnin = 75,
  times = 10
)
Print a summary for objects of class tidylda
## S3 method for class 'tidylda'
print(x, digits = max(3L, getOption("digits") - 3L), n = 5, ...)
x |
an object of class |
digits |
minimal number of significant digits |
n |
Number of rows to show in each displayed |
... |
further arguments passed to or from other methods |
Silently returns x
dtm <- nih_sample_dtm

lda <- tidylda(data = dtm, k = 10, iterations = 100)

print(lda)

lda

print(lda, digits = 2)
Update an LDA model using collapsed Gibbs sampling.
## S3 method for class 'tidylda'
refit(
  object,
  new_data,
  iterations = NULL,
  burnin = -1,
  prior_weight = 1,
  additional_k = 0,
  additional_eta_sum = 250,
  optimize_alpha = FALSE,
  calc_likelihood = FALSE,
  calc_r2 = FALSE,
  return_data = FALSE,
  threads = 1,
  verbose = TRUE,
  ...
)
object |
a fitted object of class |
new_data |
A document term matrix or term co-occurrence matrix of class dgCMatrix. |
iterations |
Integer number of iterations for the Gibbs sampler to run. |
burnin |
Integer number of burnin iterations. If |
prior_weight |
Numeric, 0 or greater or |
additional_k |
Integer number of topics to add, defaults to 0. |
additional_eta_sum |
Numeric magnitude of prior for additional topics.
Ignored if |
optimize_alpha |
Logical. Experimental. Do you want to optimize alpha
every iteration? Defaults to |
calc_likelihood |
Logical. Do you want to calculate the log likelihood every iteration?
Useful for assessing convergence. Defaults to |
calc_r2 |
Logical. Do you want to calculate R-squared after the model is trained?
Defaults to |
return_data |
Logical. Do you want |
threads |
Number of parallel threads, defaults to 1. |
verbose |
Logical. Do you want to print a progress bar out to the console?
Defaults to |
... |
Additional arguments, currently unused |
refit allows you to (a) update the probabilities (i.e. weights) of a previously-fit model with new data or additional iterations and (b) optionally use beta of a previously-fit LDA topic model as the eta prior for the new model. This is tuned by setting beta_as_prior = FALSE or beta_as_prior = TRUE respectively.
prior_weight tunes how strongly the base model is represented in the prior. If prior_weight = 1, then the tokens from the base model's training data have the same relative weight as tokens in new_data. In other words, it is like just adding training data. If prior_weight is less than 1, then tokens in new_data are given more weight. If prior_weight is greater than 1, then the tokens from the base model's training data are given more weight.
If prior_weight is NA, then the new eta is equal to eta from the old model, with new tokens folded in. (For handling of new tokens, see below.) Effectively, this just controls how the sampler initializes (described below), but does not give prior weight to the base model.
Instead of initializing token-topic assignments in the manner for new models (see tidylda), the update initializes in 2 steps:
First, topic-document probabilities (i.e. theta) are obtained by a call to predict.tidylda using method = "dot" for the documents in new_data. Next, both beta and theta are passed to an internal function, initialize_topic_counts, which assigns topics to tokens in a manner approximately proportional to the posteriors and executes a single Gibbs iteration.
refit handles the addition of new vocabulary by adding a flat prior over new tokens. Specifically, each entry in the new prior is equal to the 10th percentile of eta from the old model. The resulting model will have the total vocabulary of the old model plus any new vocabulary tokens. In other words, after running refit.tidylda, ncol(beta) >= ncol(new_data), where beta is from the new model and new_data is the additional data.
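A minimal base R sketch of this vocabulary extension follows. `extend_eta_sketch` is a hypothetical helper, the matrices are toy values, and taking the percentile over all entries of the old eta matrix (rather than per topic) is an assumption.

```r
# Hypothetical helper: append columns for unseen tokens, each filled with
# the 10th percentile of the old eta (taken over all entries, an assumption).
extend_eta_sketch <- function(eta_old, new_vocab) {
  added <- setdiff(new_vocab, colnames(eta_old))
  if (length(added) == 0) {
    return(eta_old)
  }
  fill <- quantile(eta_old, probs = 0.1, names = FALSE)
  new_cols <- matrix(
    fill,
    nrow = nrow(eta_old), ncol = length(added),
    dimnames = list(rownames(eta_old), added)
  )
  cbind(eta_old, new_cols)
}

# Toy old prior: 2 topics x 3 tokens.
eta_old <- matrix(
  c(0.05, 0.10, 0.20, 0.05, 0.30, 0.10),
  nrow = 2,
  dimnames = list(c("t1", "t2"), c("gene", "cell", "trial"))
)
new_vocab <- c("gene", "cell", "trial", "mouse")

eta_new <- extend_eta_sketch(eta_old, new_vocab)
```

The extended matrix keeps every old token and adds one column per new token, which is consistent with ncol(beta) >= ncol(new_data) after refitting.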
You can add additional topics by setting the additional_k parameter to an integer greater than zero. New entries to alpha have a flat prior equal to the median value of alpha in the old model. (Note that if alpha itself is a flat prior, i.e. scalar, then the new topics have the same value for their prior.) New entries to eta have a shape from the average of all previous topics in eta and scaled by additional_eta_sum.
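The prior extension for added topics can be sketched in base R. This is an illustration only: `add_topics_sketch` is a hypothetical helper, alpha is assumed to be stored as a vector with one entry per topic, eta as a topics-by-tokens matrix, and "scaled by additional_eta_sum" is read as "each new row sums to additional_eta_sum", which is an assumption.

```r
# Hypothetical helper: extend alpha and eta for `additional_k` new topics.
add_topics_sketch <- function(alpha, eta, additional_k,
                              additional_eta_sum = 250) {
  # New alpha entries: median of the old alpha.
  alpha_new <- c(alpha, rep(median(alpha), additional_k))

  # New eta rows: average shape of existing topics, rescaled so that
  # each new row sums to additional_eta_sum (assumed interpretation).
  shape <- colMeans(eta)
  new_row <- shape / sum(shape) * additional_eta_sum
  eta_new <- rbind(
    eta,
    matrix(new_row, nrow = additional_k, ncol = ncol(eta), byrow = TRUE)
  )
  list(alpha = alpha_new, eta = eta_new)
}

# Toy priors: 3 topics x 4 tokens.
alpha <- c(0.1, 0.3, 0.2)
eta <- matrix(
  c(1, 2, 3, 4,
    2, 2, 2, 2,
    4, 3, 2, 1),
  nrow = 3, byrow = TRUE
)

extended <- add_topics_sketch(alpha, eta, additional_k = 3)
```

A larger additional_eta_sum makes the new topics start closer to the average of the existing ones; a smaller value lets the new data shape them more freely.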
Returns an S3 object of class c("tidylda").
Updates are, as of this writing, almost surely useful, but their behaviors have not been optimized or well-studied. Caveat emptor!
# load a document term matrix
data(nih_sample_dtm)

d1 <- nih_sample_dtm[1:50, ]

d2 <- nih_sample_dtm[51:100, ]

# fit a model
m <- tidylda(d1,
  k = 10,
  iterations = 200, burnin = 175
)

# update an existing model by adding documents using old model as prior
m2 <- refit(
  object = m,
  new_data = rbind(d1, d2),
  iterations = 200,
  burnin = 175,
  prior_weight = 1
)

# use an old model to initialize new model and not use old model as prior
m3 <- refit(
  object = m,
  new_data = d2, # new documents only
  iterations = 200,
  burnin = 175,
  prior_weight = NA
)

# add topics while updating a model by adding documents
m4 <- refit(
  object = m,
  new_data = rbind(d1, d2),
  additional_k = 3,
  iterations = 200,
  burnin = 175
)
tidylda topic model
Tidy the result of a tidylda topic model.
## S3 method for class 'tidylda'
tidy(x, matrix, log = FALSE, ...)

## S3 method for class 'matrix'
tidy(x, matrix, log = FALSE, ...)
x |
an object of class |
matrix |
the matrix to tidy; one of |
log |
do you want to have the result on a log scale? Defaults to |
... |
other arguments passed to methods, currently not used |
Returns a tibble.
If matrix = "beta" then the result is a table of one row per topic and token with the following columns: topic, token, beta.
If matrix = "theta" then the result is a table of one row per document and topic with the following columns: document, topic, theta.
If matrix = "lambda" then the result is a table of one row per topic and token with the following columns: topic, token, lambda.
tidy(matrix): Tidy an individual matrix. Useful for predictions and called from tidy.tidylda.
If log = TRUE then "log_" will be prepended to the name of the third column of the resulting table, e.g. "beta" becomes "log_beta".
dtm <- nih_sample_dtm

lda <- tidylda(data = dtm, k = 10, iterations = 100, burnin = 75)

tidy_beta <- tidy(lda, matrix = "beta")

tidy_theta <- tidy(lda, matrix = "theta")

tidy_lambda <- tidy(lda, matrix = "lambda")
Fit a Latent Dirichlet Allocation topic model using collapsed Gibbs sampling.
tidylda(
  data,
  k,
  iterations = NULL,
  burnin = -1,
  alpha = 0.1,
  eta = 0.05,
  optimize_alpha = FALSE,
  calc_likelihood = TRUE,
  calc_r2 = FALSE,
  threads = 1,
  return_data = FALSE,
  verbose = TRUE,
  ...
)
data |
A document term matrix or term co-occurrence matrix. The preferred
class is a |
k |
Integer number of topics. |
iterations |
Integer number of iterations for the Gibbs sampler to run. |
burnin |
Integer number of burnin iterations. If |
alpha |
Numeric scalar or vector of length |
eta |
Numeric scalar, numeric vector of length |
optimize_alpha |
Logical. Do you want to optimize alpha every iteration?
Defaults to |
calc_likelihood |
Logical. Do you want to calculate the log likelihood every iteration?
Useful for assessing convergence. Defaults to |
calc_r2 |
Logical. Do you want to calculate R-squared after the model is trained?
Defaults to |
threads |
Number of parallel threads, defaults to 1. See Details, below. |
return_data |
Logical. Do you want |
verbose |
Logical. Do you want to print a progress bar out to the console?
Defaults to |
... |
Additional arguments, currently unused |
This function calls a collapsed Gibbs sampler for Latent Dirichlet Allocation written using the excellent Rcpp package. Some implementation notes follow:
Topic-token and topic-document assignments are not initialized based on a uniform-random sampling, as is common. Instead, topic-token probabilities (i.e. beta) are initialized by sampling from a Dirichlet distribution with eta as its parameter. The same is done for topic-document probabilities (i.e. theta) using alpha. Then an internal function is called (initialize_topic_counts) to run a single Gibbs iteration to initialize assignments of tokens to topics and topics to documents.
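The initialization idea can be sketched in base R. This is not the package's internal initialize_topic_counts: `rdirichlet_sketch` is a hypothetical helper, the corpus is a toy, and drawing each token's topic with probability proportional to theta[d, k] * beta[k, v] is an assumed stand-in for the single Gibbs iteration described above.

```r
# Hypothetical helper: n Dirichlet(conc) draws via normalized gamma variates.
rdirichlet_sketch <- function(n, conc) {
  draws <- matrix(
    rgamma(n * length(conc), shape = conc, rate = 1),
    nrow = n, byrow = TRUE
  )
  draws / rowSums(draws)
}

set.seed(12345)
k <- 3
vocab <- c("gene", "cell", "trial", "mouse")
n_docs <- 2

# beta: k x V, each row drawn from Dirichlet(eta) with symmetric eta = 0.05.
beta0 <- rdirichlet_sketch(k, rep(0.05, length(vocab)))
colnames(beta0) <- vocab

# theta: D x k, each row drawn from Dirichlet(alpha) with symmetric alpha = 0.1.
theta0 <- rdirichlet_sketch(n_docs, rep(0.1, k))

# Assumed stand-in for the initializing pass: assign each token of
# document 1 a topic with probability proportional to theta * beta.
tokens_doc1 <- c("gene", "gene", "cell")
z_doc1 <- vapply(
  tokens_doc1,
  function(v) sample.int(k, size = 1, prob = theta0[1, ] * beta0[, v]),
  integer(1)
)
```

Starting from Dirichlet draws rather than uniform assignments means the initial counts already reflect the sparsity implied by small alpha and eta.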
When you use burn-in iterations (i.e. burnin is set above its default of -1), the resulting beta and theta matrices are calculated by averaging over every iteration after the specified number of burn-in iterations. If you do not use burn-in iterations, then the matrices are calculated from the last run only. Ideally, you'd burn in every iteration before convergence, then average over the chain after it has converged (and thus every observation is independent).
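The averaging can be sketched in base R as follows. The per-iteration values are random stand-ins rather than output from a real sampler, the sketch assumes a burnin value below zero means "no burn-in" (the signature's default is -1), and it averages normalized matrices directly, which may differ from what the package does internally.

```r
set.seed(12345)
iterations <- 200
burnin <- 175
n_docs <- 3
k <- 2

# Stand-in for a chain of per-iteration theta estimates:
# an array indexed by document x topic x iteration.
chain <- array(
  runif(n_docs * k * iterations),
  dim = c(n_docs, k, iterations)
)

if (burnin > -1) {
  # Average every iteration after the burn-in period.
  keep <- chain[, , (burnin + 1):iterations, drop = FALSE]
  theta_hat <- apply(keep, c(1, 2), mean)
} else {
  # No burn-in: use the final iteration only.
  theta_hat <- chain[, , iterations]
}
```

With iterations = 200 and burnin = 175, the estimate is the mean of the last 25 iterations.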
If you set optimize_alpha to TRUE, then each element of alpha is proportional to the number of times each topic has been sampled that iteration, averaged with the value of alpha from the previous iteration. This lets you start with a symmetric alpha and drift into an asymmetric one. However, (a) this probably means that convergence will take longer or may not happen at all, and (b) I make no guarantees that doing this will give you any benefit or that it won't hurt your model. Caveat emptor!
The log likelihood calculation is the same as the one found on page 9 of https://arxiv.org/pdf/1510.08628.pdf. The only difference is that the version in tidylda allows eta to be a vector or matrix. (A vector is used in this function; a matrix is used for model updates in refit.tidylda.) At present, the log likelihood function appears to be adequate for assessing convergence, i.e. it has the right shape. However, as of this writing it returns positive numbers rather than the expected negative numbers. This is being looked into; in the meantime, caveat emptor once again.
Parallelism is not currently implemented. The threads argument is a placeholder for planned enhancements.
Returns an S3 object of class tidylda. See new_tidylda.
# load some data
data(nih_sample_dtm)

# fit a model
set.seed(12345)
m <- tidylda(
  data = nih_sample_dtm[1:20, ],
  k = 5,
  iterations = 200,
  burnin = 175
)

str(m)

# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  iterations = 200,
  burnin = 175
)

# predict on held-out documents using the dot product method
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")

# compare the methods
barplot(rbind(p1[1, ], p2[1, ]), beside = TRUE, col = c("red", "blue"))