| Title: | Fast Topic Models Using Varimax |
|---|---|
| Description: | Fits topic models using varimax-rotated principal component analysis (PCA), following the "vintage factor analysis" approach of Rohe & Zheng (2020) <doi:10.48550/arXiv.2004.05387>. Leverages truncated PCA via 'irlba' for sparse matrices, enabling fast model fitting on large corpora. Includes an information-theoretic approach to vocabulary selection, 'broom'-compatible tidiers for extracting word-topic and topic-document matrices into a tidy data workflow, and samplers for constructing simulated corpora for benchmarking and method evaluation. |
| Authors: | D. Hicks [aut, cre, cph] (ORCID: <https://orcid.org/0000-0001-7945-4416>) |
| Maintainer: | D. Hicks <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.1 |
| Built: | 2026-05-30 17:09:53 UTC |
| Source: | https://github.com/cran/tmfast |
Useful links:
Report bugs at https://github.com/dhicks/tmfast/issues
For the sparse case, an alias for tidytext::cast_sparse
build_matrix(data, row, column, value, ..., sparse = TRUE)
data |
Dataframe |
row |
Column name to use as row names, as string or symbol |
column |
Column name to use as column names, as string or symbol |
value |
Column name to use as matrix values, as string or symbol |
... |
Other arguments, passed to |
sparse |
Should the matrix be a |
A matrix or sparse Matrix object, with one row for each unique value in the row column, one column for each unique value in the column column, and with as many non-zero values as there are rows in data.
data.frame(id = c(1, 1, 2, 2) + 4, cols = c('a', 'b', 'a', 'b'), vals = 1:4) |>
    build_matrix(row = id, column = 'cols', value = vals)
Computes pairwise Hellinger distances between topics from one or two fitted models. Tokens missing from a beta dataframe are filled with probability 0 before comparison, so both models need not share the same vocabulary.
compare_betas(beta1, beta2 = NULL, vocab)
beta1 |
Tidy beta dataframe with columns |
beta2 |
Optional second tidy beta dataframe in the same format. If
|
vocab |
Character vector of vocabulary tokens used to align the column
space of both matrices. Tokens in |
Numeric matrix of Hellinger distances. Dimensions are k1 × k1 when
beta2 = NULL, or k1 × k2 when two beta dataframes are supplied, where
k1 and k2 are the number of topics in each model.
set.seed(42)
vocab = letters[1:5]
make_beta = function(k) {
    rdirichlet(k, rep(1, length(vocab))) |>
        tibble::as_tibble(.name_repair = ~vocab) |>
        dplyr::mutate(topic = paste0('t', dplyr::row_number())) |>
        tidyr::pivot_longer(-topic, names_to = 'token', values_to = 'beta')
}
beta1 = make_beta(3)
beta2 = make_beta(4)
compare_betas(beta1, vocab = vocab)
compare_betas(beta1, beta2, vocab = vocab)
Draw a collection of documents
draw_corpus(N, theta, phi)
N |
Length of documents |
theta |
Topic distribution for all documents, |
phi |
Word distribution for all topics, |
Standard pattern for generating a simulated DTM suitable for tmfast():
set.seed(42)
theta = rdirichlet(n_docs, alpha = 1, k = n_topics)
phi = rdirichlet(n_topics, alpha = 0.1, k = vocab_size)
corpus = draw_corpus(rep(doc_length, n_docs), theta, phi)
model = tmfast(corpus, n = n_topics)
alpha = 1 for theta gives uniform topic mixing; alpha = 0.1 for phi
gives sparse, topic-specific word distributions. doc_length should be large
enough that the full vocabulary is likely to appear (50–200 words per document
is typical for a small simulated example).
Document-term matrix, as a tibble, with columns doc, word, and n
Other generators:
journal_specific(),
peak_alpha(),
rdirichlet()
set.seed(42)
theta = rdirichlet(30, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 30), theta, phi)
head(corpus)
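The generative process behind a simulated corpus can be sketched in base R. This is an illustration only, not the package's implementation: `draw_corpus_sketch` is a hypothetical name, and it assumes theta is a documents x topics matrix and phi a topics x vocabulary matrix, as in the examples above. Drawing each token's topic from theta and then its word from phi is equivalent to a single multinomial draw from the mixture theta %*% phi, which is what the sketch does.

```r
# Hypothetical helper: draw word counts for each document from its topic mixture.
draw_corpus_sketch = function(N, theta, phi) {
    word_probs = theta %*% phi    # documents x vocabulary word probabilities
    rows = lapply(seq_len(nrow(theta)), function(d) {
        counts = rmultinom(1, size = N[d], prob = word_probs[d, ])[, 1]
        keep = counts > 0         # keep only words that actually occur
        data.frame(doc = d, word = which(keep), n = counts[keep])
    })
    do.call(rbind, rows)
}

set.seed(42)
theta_toy = rbind(c(1, 0), c(0, 1))             # each document uses a single topic
phi_toy = rbind(c(1, 0, 0), c(0, 0.5, 0.5))     # topic 1 only ever emits word 1
corpus_toy = draw_corpus_sketch(c(20L, 30L), theta_toy, phi_toy)
corpus_toy
```

The result is a long dataframe with columns doc, word, and n, matching the shape described for the return value; words here are integer indices rather than labels.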
Entropy of a distribution
entropy(p, base = 2)
p |
Discrete probability distribution |
base |
Desired base for entropy, eg, 2 for bits |
Calculated Shannon entropy
entropy(c(0.5, 0.5))
entropy(c(0.9, 0.1))
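For reference, Shannon entropy can be written in a few lines of base R. This is a minimal sketch rather than the package's code: `entropy_sketch` is a hypothetical name, and dropping zero-probability entries (treating 0 log 0 as 0) is an assumption.

```r
# Hypothetical helper: Shannon entropy of a discrete probability distribution.
entropy_sketch = function(p, base = 2) {
    p = p[p > 0]                     # assumption: 0 * log(0) is treated as 0
    -sum(p * log(p, base = base))
}

entropy_sketch(c(0.5, 0.5))   # a fair coin carries exactly 1 bit
entropy_sketch(c(0.9, 0.1))   # a biased coin carries less than 1 bit
```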
A sample P = <p1, p2, ..., pk> from a Dirichlet distribution with parameter alpha = <alpha1, alpha2, ..., alphak> can be treated as a categorical probability distribution with entropy $H(P) = -\sum_{i=1}^{k} p_i \log_2 p_i$. This function calculates the expected entropy given alpha.
expected_entropy(alpha, k = NULL)
alpha |
Dirichlet parameter |
k |
If length(alpha) is 1, number of components in symmetric Dirichlet distribution |
Expected entropy in bits (log2 scale)
alpha = peak_alpha(50, 1)
set.seed(1357)
rdirichlet(500, alpha) |>
    apply(1, entropy) |>
    mean()
expected_entropy(alpha)
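The expected entropy of a Dirichlet draw has a standard closed form in terms of the digamma function: with a0 = sum(alpha), it is digamma(a0 + 1) - sum(alpha / a0 * digamma(alpha + 1)) in nats. The sketch below applies that formula and converts to bits. Whether the package computes the value this way is an assumption, and `expected_entropy_sketch` is a hypothetical name.

```r
# Hypothetical helper: expected Shannon entropy (in bits) of a draw from Dirichlet(alpha).
expected_entropy_sketch = function(alpha, k = NULL) {
    if (length(alpha) == 1 && !is.null(k)) alpha = rep(alpha, k)   # symmetric case
    a0 = sum(alpha)
    nats = digamma(a0 + 1) - sum(alpha / a0 * digamma(alpha + 1))
    nats / log(2)                                                   # nats to bits
}

expected_entropy_sketch(c(1, 1))      # flat Dirichlet over two components
expected_entropy_sketch(100, k = 4)   # tightly concentrated near uniform, close to log2(4) = 2
```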
Given a (rank n) PCA fit, return a rank k < n varimax fit
fit_varimax(
    k,
    pca,
    feature_names,
    obs_names,
    varimax_fn = stats::varimax,
    varimax_opts = NULL,
    positive_skew = TRUE,
    x = NULL
)
k |
Desired rank of the fitted varimax model |
pca |
Fitted PCA model or |
feature_names |
Names of the features (eg, data columns) |
obs_names |
Names of the observations (eg, data rows) |
varimax_fn |
Function to use for varimax rotation |
varimax_opts |
Options passed to |
positive_skew |
Should negative-skewed factors be flipped to have positive skew? |
x |
PCA scores matrix (n_obs x max_k), as returned by |
After the initial rotation, factors with negative skew (left tails) are flipped to have positive skew when positive_skew = TRUE.
pca must contain $rotation (feature loadings matrix) and $sdev (standard deviations
per PC); $x (PC scores matrix) is also required unless x is supplied directly.
List with components
- loadings: Rotated feature loadings
- rotmat: Rotation matrix
- scores: Rotated observation scores
set.seed(42)
theta = rdirichlet(50, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 50), theta, phi)
dtm = tidytext::cast_sparse(corpus, doc, word, n)
pca = irlba::prcomp_irlba(dtm, n = 5)
fit_varimax(k = 3, pca = pca,
            feature_names = colnames(dtm),
            obs_names = rownames(dtm))
Calculates Hellinger distance between rows of one or two matrices or tidied topic model dataframes.
hellinger(topics1, ...)

## S3 method for class 'Matrix'
hellinger(topics1, topics2 = NULL, ...)

## S3 method for class 'matrix'
hellinger(...)

## S3 method for class 'data.frame'
hellinger(
    topics1,
    id1 = "document",
    cat1 = "topic",
    prob1 = "prob",
    topics2 = NULL,
    id2 = "document",
    cat2 = "topic",
    prob2 = "prob",
    df = FALSE,
    ...
)
topics1 |
First matrix ( |
... |
Not used; required for S3 method compatibility. |
topics2 |
Optional second matrix ( |
id1 |
Unit identifier column in |
cat1 |
Category identifier column in |
prob1 |
Probability value column in |
id2 |
Unit identifier column in |
cat2 |
Category identifier column in |
prob2 |
Probability value column in |
df |
Should the function return the matrix of Hellinger distances (default) or a tidy dataframe? (data.frame method only) |
Matrix of size nrow(topics1) × nrow(topics1), or nrow(topics1) × nrow(topics2) when topics2 is supplied (Matrix/matrix methods), or a matrix or tidy dataframe of Hellinger distances (data.frame method).
# Matrix / matrix method
set.seed(2022-06-09)
topics1 = rdirichlet(3, rep(5, 5))
topics2 = rdirichlet(3, rep(5, 5))
hellinger(topics1)
hellinger(topics1, topics2)

# data.frame method
set.seed(2022-06-09)
topics1 = rdirichlet(3, rep(5, 5)) |>
    tibble::as_tibble(rownames = 'doc_id') |>
    dplyr::mutate(doc_id = stringr::str_c('doc_', doc_id)) |>
    tidyr::pivot_longer(-doc_id, names_to = 'topic', values_to = 'gamma')
topics2 = rdirichlet(3, rep(5, 5)) |>
    tibble::as_tibble(rownames = 'doc_id') |>
    dplyr::mutate(doc_id = stringr::str_c('doc_', as.integer(doc_id) + 5)) |>
    tidyr::pivot_longer(-doc_id, names_to = 'topic', values_to = 'gamma')
hellinger(topics1, doc_id, prob1 = 'gamma', df = TRUE)
hellinger(topics1, doc_id, prob1 = 'gamma',
          topics2 = topics2, id2 = doc_id, prob2 = 'gamma')
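The Hellinger distance between two discrete distributions p and q is sqrt(sum((sqrt(p) - sqrt(q))^2) / 2), which ranges from 0 (identical) to 1 (disjoint supports). A base R sketch of the row-by-row computation follows. `hellinger_sketch` is a hypothetical name, and the vectorized expansion it uses is one convenient way to compute the distances, not necessarily the package's.

```r
# Hypothetical helper: Hellinger distance between each row of m1 and each row of m2.
hellinger_sketch = function(m1, m2 = m1) {
    s1 = sqrt(m1)
    s2 = sqrt(m2)
    # sum_j (sqrt(p_j) - sqrt(q_j))^2 = sum(p) + sum(q) - 2 * sum_j sqrt(p_j * q_j)
    sq = outer(rowSums(m1), rowSums(m2), '+') - 2 * s1 %*% t(s2)
    sqrt(pmax(sq, 0) / 2)    # pmax guards against tiny negative rounding errors
}

p = rbind(c(1, 0, 0), c(0.5, 0.5, 0))
q = rbind(c(0, 0, 1))
hellinger_sketch(p)      # 2 x 2 matrix with zeros on the diagonal
hellinger_sketch(p, q)   # 2 x 1 matrix; both entries are 1 because the supports are disjoint
```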
tmfast
Apply varimax rotation for a value of k less than the maximum already included in the fitted tmfast object.
insert_topics(fitted, k, x = NULL)
fitted |
Fitted |
k |
Desired number of topics for new model |
x |
Data matrix (document-term matrix), as Matrix object (eg, using |
tmfast object, as fitted, with additional topic model inserted
set.seed(42)
theta = rdirichlet(50, 1, k = 4)
phi = rdirichlet(4, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 50), theta, phi)
model = tmfast(corpus, n = c(3, 4))
insert_topics(model, k = 2)
Generates a corpus with Mj documents from k journals, each of which has a characteristic topic. Fits a varimax topic model of rank k, rotates the word-topic distribution to align with the true values, and reports Hellinger distance comparisons for each topic (word-topic) and document (topic-doc).
journal_specific(
    k = 5,
    Mj = 100,
    topic_peak = 0.8,
    topic_scale = 10,
    word_beta = 0.01,
    vocab = 10 * Mj * k,
    size = 3,
    mu = 300,
    bigjournal = FALSE,
    verbose = TRUE
)
k |
Number of topics/journals |
Mj |
Number of documents from each journal |
topic_peak |
Peak value for the asymmetric Dirichlet prior for true topic-doc distributions |
topic_scale |
Scale for the asymmetric Dirichlet prior for true topic-doc distributions |
word_beta |
Parameter for the symmetric Dirichlet prior for true word-topic distributions |
vocab |
Size of the vocabulary |
size |
Size parameter for the negative binomial distribution of document lengths |
mu |
Mean parameter for the negative binomial distribution of document lengths |
bigjournal |
Should the first journal have documents 10x as long (on average) as the others? |
verbose |
When TRUE, sends messages about the progress of the simulation |
A one-row tibble::tibble() with columns:
Mean Hellinger distance between true and fitted word-topic distributions
List-column of per-topic Hellinger distances
Mean Hellinger distance between true and fitted document-topic distributions
List-column of per-document Hellinger distances
Other generators:
draw_corpus(),
peak_alpha(),
rdirichlet()
journal_specific(k = 2, Mj = 10, vocab = 50, verbose = FALSE)
Extract a PCA/varimax loadings matrix
loadings(x, ...)

## Default S3 method:
loadings(x, ...)
x |
Object to dispatch on |
... |
Passed to methods |
An object of class "loadings" (from stats), structured as a
matrix with vocabulary terms as rows and varimax factors as columns. Values are
the loading (weight) of each term on each factor.
set.seed(42)
theta = rdirichlet(50, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 50), theta, phi)
model = tmfast(corpus, n = 3)
loadings(model, k = 3)

v = stats::varimax(matrix(runif(20), nrow = 5))
loadings(v)
Calculates $\log_2 n \times \delta H$, the log total occurrence times information gain (relative to the uniform distribution) for each term. I prefer this for vocabulary selection over methods such as TF-IDF.
ndH(dataf, doc_col, term_col, count_col)
dataf |
Tidy document-term matrix |
doc_col |
Column of |
term_col |
Column of |
count_col |
Column of |
Dataframe with columns
- `{{ term_col }}`, term
- `dH`, information gain relative to uniform distribution over documents
- `n`, total count of term occurrence
- `ndH`, $\log_2 n \times \delta H$
library(dplyr)
library(tidytext)
library(janeaustenr)
austen_df = austen_books() |>
    unnest_tokens(term, text, token = 'words') |>
    mutate(author = 'Jane Austen') |>
    count(author, book, term)
ndH(austen_df, book, term, n)
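One reading of the ndH statistic can be sketched in base R on a small count matrix. The sketch assumes that the information gain dH for a term is log2 of the number of documents minus the entropy of that term's distribution across documents; this interpretation, the matrix input, and the name `ndH_sketch` are all assumptions made for illustration, not the package's interface.

```r
# Hypothetical helper: log2(n) * dH per term, from a documents x terms count matrix.
ndH_sketch = function(counts) {
    n_docs = nrow(counts)
    n = colSums(counts)                     # total occurrences of each term
    p = sweep(counts, 2, n, '/')            # p(doc | term); each column sums to 1
    H = apply(p, 2, function(col) {
        col = col[col > 0]
        -sum(col * log2(col))               # entropy of the term's spread over documents
    })
    dH = log2(n_docs) - H                   # gain relative to the uniform distribution
    data.frame(term = colnames(counts), dH = dH, n = n, ndH = log2(n) * dH)
}

counts = rbind(doc1 = c(the = 10, whale = 8),
               doc2 = c(the = 10, whale = 0))
ndH_sketch(counts)
```

In this toy input, "the" is spread evenly over both documents and so has no information gain, while "whale" occurs in only one document and receives the maximum gain of 1 bit.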
An alternative to ndH() that uses information gain relative to a distribution of documents that is proportional to length. With the uniform distribution and dramatic differences in document lengths (eg, over a few orders of magnitude), high-ndH terms tend to be distinctive terms from very long documents. With the length-proportional distribution, high information-gain terms are more likely to come from shorter documents. Informal testing suggests this approach performs better than ndH() when document lengths vary widely.
ndR(dataf, doc_col, term_col, count_col)
dataf |
Tidy document-term matrix |
doc_col |
Column of |
term_col |
Column of |
count_col |
Column of |
Dataframe with columns
- `{{ term_col }}`, term
- `n`, total count of term occurrence
- `dR`, information gain relative to length-proportional distribution over documents
- `ndR`, $\log_2 n \times \delta R$
library(dplyr)
library(tidytext)
library(janeaustenr)
austen_df = austen_books() |>
    unnest_tokens(term, text, token = 'words') |>
    mutate(author = 'Jane Austen') |>
    count(author, book, term)
ndR(austen_df, book, term, n)
This function allows us to quickly define an alpha parameter for a Dirichlet distribution with a single (presumably high) peak*scale value at component i and all other components a uniform (presumably low) value (1-peak)/(k-1)*scale.
peak_alpha(k, i, peak = 0.8, scale = 1)
k |
Number of components |
i |
Index for the component that takes value |
peak |
Value for the single peak component |
scale |
Scaling factor applied to all concentration parameters |
Vector of length k
Other generators:
draw_corpus(),
journal_specific(),
rdirichlet()
peak_alpha(5, 2)
peak_alpha(5, 2, peak = 0.9, scale = 10)
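The construction described above (peak*scale at component i, (1-peak)/(k-1)*scale everywhere else) is short enough to write out directly. The following is a sketch of that description under the hypothetical name `peak_alpha_sketch`, not the package source.

```r
# Hypothetical helper: Dirichlet parameter with one high component and k - 1 equal low ones.
peak_alpha_sketch = function(k, i, peak = 0.8, scale = 1) {
    alpha = rep((1 - peak) / (k - 1) * scale, k)   # low value for every component
    alpha[i] = peak * scale                         # overwrite the single peak
    alpha
}

peak_alpha_sketch(5, 2)
sum(peak_alpha_sketch(5, 2, peak = 0.9, scale = 10))   # the components always sum to scale
```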
Project new data into PCA score space
## S3 method for class 'varimaxes'
predict(object, newdata, ...)
object |
Fitted |
newdata |
Document-term matrix (observations x terms) to project |
... |
Not used; included for S3 method compatibility. |
Projects newdata through the PCA rotation stored in object, returning
raw PCA scores (not varimax scores). Intended for use in pipelines that combine
new data with an existing fitted model (e.g., insert_topics()). Fragile: newdata
must share the vocabulary of the training DTM, and the centering/scaling stored in
object must match how the training data was prepared.
Memory warning: scale() coerces sparse matrices to dense. For large DTMs,
this can be a substantial memory hazard. This mirrors the behavior of prcomp_irlba
itself, which is why PCA scores are computed once at fit time and not re-projected
on demand.
Matrix of PCA scores (n_obs x max_k)
set.seed(42)
theta = rdirichlet(50, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 50), theta, phi)
model = tmfast(corpus, n = 3)
theta2 = rdirichlet(5, 1, k = 3)
newdocs = draw_corpus(rep(200L, 5), theta2, phi) |>
    tidytext::cast_sparse(doc, word, n)
predict(model, newdocs)
Sample from the Dirichlet distribution
rdirichlet(n, alpha, k = NULL)
n |
Number of samples (rows) to draw |
alpha |
Concentration parameters; either length 1 or length > 1
If length 1, assumes symmetric Dirichlet; |
k |
Number of components (columns); ignored if |
A matrix of n rows and length(alpha) or k columns
Other generators:
draw_corpus(),
journal_specific(),
peak_alpha()
rdirichlet(10, .1, 5)
rdirichlet(10, c(.8, .1, .1))
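A common way to sample from a Dirichlet distribution is to draw independent Gamma(alpha_j, 1) variates and normalize each row to sum to 1. The base R sketch below uses that construction. Whether the package samples this way is an assumption, and `rdirichlet_sketch` is a hypothetical name.

```r
# Hypothetical helper: n Dirichlet draws via normalized gamma variates.
rdirichlet_sketch = function(n, alpha, k = NULL) {
    if (length(alpha) == 1 && !is.null(k)) alpha = rep(alpha, k)   # symmetric case
    k = length(alpha)
    # rgamma recycles alpha across draws; filling by row puts alpha_j in column j
    g = matrix(rgamma(n * k, shape = alpha), nrow = n, ncol = k, byrow = TRUE)
    g / rowSums(g)                                                  # each row sums to 1
}

set.seed(42)
draws = rdirichlet_sketch(10, c(.8, .1, .1))
dim(draws)
rowSums(draws)
```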
Given a tidied dataframe of topic-doc or word-topic distributions and an exponent, renormalizes the distributions.
renorm(tidy_df, group_col, p_col, exponent, keep_original = FALSE)
tidy_df |
The tidied distribution dataframe |
group_col |
Grouping column, RHS of the conditional probability distribution, eg, topics for word-topic distributions |
p_col |
Column containing the probability for each category (eg, word) conditional on the group (eg, topic) |
exponent |
Exponent to use in renormalization |
keep_original |
Keep original probabilities? |
A dataframe with (if keep_original is TRUE) an added column of the form p_col_rn containing the renormalized probabilities or (if keep_original is FALSE) renormalized values in p_col.
set.seed(42)
theta = rdirichlet(50, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 50), theta, phi)
model = tmfast(corpus, n = 3)
beta = tidy(model, matrix = 'beta', k = 3)
pwr = target_power(beta, topic, beta, target_entropy = 2)
renorm(beta, topic, beta, exponent = pwr)
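As a rough illustration of renormalizing with an exponent, the sketch below raises each probability to the given power and then rescales within each group so the group sums to 1. That this power-then-normalize step is what renorm() does is an assumption based on its description, and `renorm_sketch` is a hypothetical name operating on plain vectors rather than a tidy dataframe.

```r
# Hypothetical helper: power transform, then renormalize within each group.
renorm_sketch = function(p, group, exponent) {
    powered = p^exponent
    powered / ave(powered, group, FUN = sum)   # divide by the per-group total
}

p = c(0.7, 0.2, 0.1, 0.5, 0.5)
group = c('t1', 't1', 't1', 't2', 't2')
out = renorm_sketch(p, group, exponent = 2)
out
tapply(out, group, sum)   # each group still sums to 1
```

An exponent above 1 sharpens each distribution (lower entropy), and an exponent below 1 flattens it.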
Extract varimax rotation
rotation(x, ...)
x |
Object to dispatch on |
... |
Passed to methods |
A numeric k x k orthogonal rotation matrix, where k is the number of requested factors. This is the varimax rotation matrix used to transform PCA loadings into the rotated factor solution.
set.seed(42)
theta = rdirichlet(50, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 50), theta, phi)
model = tmfast(corpus, n = 3)
rotation(model, k = 3)
Extract item scores from a fitted PCA/varimax model
scores(x, ...)
x |
Object to dispatch on |
... |
Passed to methods |
A numeric matrix with documents as rows and varimax factors as columns. Values are the factor score for each document on each factor.
set.seed(42)
theta = rdirichlet(50, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 50), theta, phi)
model = tmfast(corpus, n = 3)
scores(model, k = 3)
After https://stats.stackexchange.com/questions/521582/controlling-the-entropy-of-a-distribution
solve_power(p, target_H, return_full = FALSE)
p |
Initial distribution |
target_H |
Desired entropy for the transformed distribution |
return_full |
Return the full uniroot() output? |
Numeric value of the desired exponent
p = c(0.5, 0.3, 0.2)
solve_power(p, target_H = 1.0)
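The idea of solving for an exponent that hits a target entropy can be sketched with stats::uniroot(). The helper names, the power-then-normalize transform, and the search interval c(0.01, 50) are assumptions made for illustration. uniroot() will stop with an error if the target entropy is not bracketed by that interval.

```r
# Hypothetical helpers: entropy in bits and the power transform of a distribution.
entropy_bits = function(p) {
    p = p[p > 0]
    -sum(p * log2(p))
}
power_transform = function(p, b) p^b / sum(p^b)

# Find the exponent b at which the transformed distribution has entropy target_H.
solve_power_sketch = function(p, target_H, interval = c(0.01, 50)) {
    gap = function(b) entropy_bits(power_transform(p, b)) - target_H
    uniroot(gap, interval = interval, tol = 1e-9)$root
}

p = c(0.5, 0.3, 0.2)
b = solve_power_sketch(p, target_H = 1.0)
b                                      # greater than 1: p must be sharpened to lose entropy
entropy_bits(power_transform(p, b))    # approximately the 1.0 bit target
```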
Given a tidied dataframe of topic-doc or word-topic distributions and a target entropy, find the mean exponent needed to adjust the temperature of each distribution to approximately match the target entropy.
target_power(tidy_df, group_col, p_col, target_entropy)
tidy_df |
The tidied distribution dataframe |
group_col |
Grouping column, RHS of the conditional probability distribution, eg, topics for word-topic distributions |
p_col |
Column containing the probability for each category (eg, word) conditional on the group (eg, topic) |
target_entropy |
Target entropy |
Mean exponent to renormalize to the target entropy
set.seed(42)
theta = rdirichlet(50, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 50), theta, phi)
model = tmfast(corpus, n = 3)
beta = tidy(model, matrix = 'beta', k = 3)
target_power(beta, topic, beta, target_entropy = 2)
Extract gamma or beta matrices for all topics
tidy_all(x, matrix = "beta", ...)
x |
|
matrix |
Desired matrix, |
... |
Other arguments, passed to |
A long dataframe, with one row per word-topic or topic-doc combination. Column names depend on the value of matrix.
set.seed(42)
theta = rdirichlet(50, 1, k = 4)
phi = rdirichlet(4, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 50), theta, phi)
model = tmfast(corpus, n = c(3, 4))
tidy_all(model, matrix = 'beta')
Extract beta and gamma matrices from tmfast objects
## S3 method for class 'tmfast'
tidy(
    x,
    k,
    matrix = "beta",
    df = TRUE,
    exponent = NULL,
    keep_original = FALSE,
    rotation = NULL,
    ...
)
x |
|
k |
Index (number of topics/factors) |
matrix |
Desired matrix, either word-topic ( |
df |
Return a long dataframe (default) or wide matrix? |
exponent |
Renormalize the probabilities using a given exponent. Applies only for |
keep_original |
If renormalizing, return original (pre-renormalized) probabilities? |
rotation |
Optional rotation matrix; see details |
... |
Not used; required for S3 method compatibility |
If rotation is not NULL, loadings/scores will be rotated. This might be used to align the fitted topics with known true topics, as in the journal_specific simulation. Loadings are left-multiplied by the given rotation, while scores are right-multiplied by the transpose of the given rotation.
A long dataframe, with one row per word-topic or topic-doc combination. Column names depend on the value of matrix.
set.seed(42)
theta = rdirichlet(50, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 50), theta, phi)
model = tmfast(corpus, n = 3)
tidy(model, k = 3, matrix = 'beta')
tidy(model, k = 3, matrix = 'gamma')
Fit a topic model using PCA+varimax
tmfast(dtm, n, row = "doc", column = "word", value = "n", verbose = FALSE, ...)
dtm |
Document-term matrix. Either an object inheriting from |
n |
Number of topics to return |
row |
In dataframe |
column |
In dataframe |
value |
In dataframe |
verbose |
Should |
... |
Other arguments, passed to |
If dtm is not a matrix, it will be cast to a sparse matrix using tidytext::cast_sparse()
As per varimax_irlba, of class tmfast
2-dimensional "discursive space" representation of relationships between documents using Hellinger distances and t-SNE.
tsne(x, ...)

## S3 method for class 'data.frame'
tsne(x, doc_ids, perplexity = NULL, df = TRUE, ...)

## S3 method for class 'tmfast'
tsne(x, k, perplexity = NULL, df = TRUE, ...)

## S3 method for class 'STM'
tsne(x, doc_ids, perplexity = NULL, df = TRUE, ...)
x |
Fitted topic model ( |
... |
Passed to methods |
doc_ids |
Vector of document IDs, in the same order as rows in |
perplexity |
Perplexity parameter for t-SNE. By default, minimum of 30
and |
df |
Return a dataframe with columns |
k |
Number of topics |
Algorithm checks distances to 3*perplexity nearest neighbors. Rtsne
loses rownames (document IDs); these are either extracted from the tmfast
object or passed separately for an STM object. Use set.seed() before
calling for reproducibility.
See df
tsne(data.frame): Method for tidied gamma dataframes
tsne(tmfast): Method for fitted tmfast objects
tsne(STM): Method for fitted STM objects
set.seed(42)
theta = rdirichlet(50, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 30)
corpus = draw_corpus(rep(50L, 50), theta, phi)
fitted = tmfast(corpus, n = 3)
tsne(fitted, k = 3, df = TRUE)
2-dimensional "discursive space" representation of relationships between documents using Hellinger distances and UMAP.
umap(x, ...)

## S3 method for class 'matrix'
umap(x, include_data = FALSE, df = TRUE, ...)

## S3 method for class 'tmfast'
umap(x, k, ...)

## S3 method for class 'STM'
umap(x, doc_ids, ...)
x |
Fitted |
... |
Passed to methods |
include_data |
Return the distance matrix inside the umap object?
Default |
df |
Return a tibble with columns |
k |
Number of topics |
doc_ids |
Character vector of document IDs |
Tibble with columns document, x, y when df = TRUE; otherwise
an object of class umap with components layout, knn, and config.
umap(matrix): Method for distance matrices
umap(tmfast): Method for fitted tmfast objects
umap(STM): Method for fitted STM objects
gamma = rdirichlet(26, 1, 5)
rownames(gamma) = letters
h_gamma = hellinger(gamma)
umap(h_gamma, df = TRUE)

set.seed(42)
theta = rdirichlet(30, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 30)
corpus = draw_corpus(rep(50L, 30), theta, phi)
fitted = tmfast(corpus, n = 3)
umap(fitted, 3)
Extract n principal components from the matrix mx using irlba, then rotate the solution using varimax
varimax_irlba(
    mx,
    n,
    prcomp_fn = irlba::prcomp_irlba,
    prcomp_opts = NULL,
    varimax_fn = stats::varimax,
    varimax_opts = NULL,
    retx = TRUE
)
mx |
Matrix of interest |
n |
Number of principal components / varimax factors to return; can take a vector of values |
prcomp_fn |
Function to use to extract principal components |
prcomp_opts |
List of options to pass to |
varimax_fn |
Function to use for varimax rotation |
varimax_opts |
List of options to pass to |
retx |
Whether to return the input matrix |
A list of class varimaxes, with elements
- totalvar: Total variance, from PCA
- sdev: Standard deviations of the extracted principal components
- x: If retx is TRUE, the input matrix mx
- rotation: Rotation matrix (variable loadings) from PCA
- varimaxes: A list of class varimaxes, containing one fitted varimax model for each value of n, with further elements
    - loadings: Varimax-rotated standardized loadings
    - rotmat: Varimax rotation matrix
    - scores: Varimax-rotated observation scores
set.seed(42)
theta = rdirichlet(50, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 50), theta, phi)
dtm = tidytext::cast_sparse(corpus, doc, word, n)
varimax_irlba(dtm, n = 3)
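The overall PCA-then-varimax pipeline can be sketched with base R alone. This is an illustration under several assumptions, not the package's implementation: stats::prcomp() stands in for irlba::prcomp_irlba() on a small dense matrix, loadings are scaled by the component standard deviations before rotation, rotated scores are taken as the PCA scores times the rotation matrix, and skew is measured by the third central moment. The toy data and all variable names are hypothetical.

```r
set.seed(42)
mx = matrix(rpois(200 * 12, lambda = 3), nrow = 200)   # toy counts: 200 observations x 12 features
k = 3

pca = stats::prcomp(mx, rank. = k)                      # dense stand-in for truncated PCA
raw_loadings = pca$rotation %*% diag(pca$sdev[1:k])     # scale directions by PC standard deviations
vm = stats::varimax(raw_loadings)                       # returns $loadings and $rotmat

rotated_loadings = unclass(vm$loadings)                 # features x k
rotated_scores = pca$x %*% vm$rotmat                    # observations x k

# Flip factors whose loadings are negatively skewed so that every factor has positive skew
third_moment = function(x) mean((x - mean(x))^3)
flip = ifelse(apply(rotated_loadings, 2, third_moment) < 0, -1, 1)
rotated_loadings = sweep(rotated_loadings, 2, flip, '*')
rotated_scores = sweep(rotated_scores, 2, flip, '*')

dim(rotated_loadings)
dim(rotated_scores)
```

Flipping a factor's sign in both the loadings and the scores leaves their product unchanged, so the flip only affects interpretation, not the fitted reconstruction.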