Title: | Topic Models |
---|---|
Description: | Provides an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM) by David M. Blei and co-authors and the C++ code for fitting LDA models using Gibbs sampling by Xuan-Hieu Phan and co-authors. |
Authors: | Bettina Grün [aut, cre] , Kurt Hornik [aut] , David M Blei [ctb, cph] (VEM estimation of LDA and CTM), John D Lafferty [ctb, cph] (VEM estimation of CTM), Xuan-Hieu Phan [ctb, cph] (MCMC estimation of LDA), Makoto Matsumoto [ctb, cph] (Mersenne Twister RNG), Takuji Nishimura [ctb, cph] (Mersenne Twister RNG), Shawn Cokus [ctb] (Mersenne Twister RNG) |
Maintainer: | Bettina Grün <[email protected]> |
License: | GPL-2 |
Version: | 0.2-17 |
Built: | 2024-11-19 06:44:11 UTC |
Source: | CRAN |
Associated Press data from the First Text Retrieval Conference (TREC-1) 1992.
data("AssociatedPress")
The data set is an object of class "DocumentTermMatrix"
provided by package tm. It is a document-term matrix which
contains the term frequency of 10473 terms in 2246 documents.
Accompanying material to the source code for fitting LDA models provided by David M. Blei and co-authors. Downloaded from: http://www.cs.columbia.edu/~blei/
D. Harman (1992) Overview of the first text retrieval conference (TREC-1). In Proceedings of the First Text Retrieval Conference (TREC-1), 1–20.
Estimate a correlated topic model (CTM) using, for example, the VEM algorithm.
CTM(x, k, method = "VEM", control = NULL, model = NULL, ...)
x | Object of class "DocumentTermMatrix". |
k | Integer; number of topics. |
method | The method to be used for fitting; currently only "VEM" is supported. |
control | A named list of the control parameters for estimation or an object of class "CTM_VEMcontrol". |
model | Object of class "CTM" with which the model fitting is initialized. |
... | Currently not used. |
The C code for CTM from David M. Blei and co-authors is used to estimate and fit a correlated topic model.
CTM() returns an object of class "CTM".
Bettina Gruen
Blei D.M., Lafferty J.D. (2007). A Correlated Topic Model of Science. The Annals of Applied Statistics, 1(1), 17–35.
data("AssociatedPress", package = "topicmodels")
ctm <- CTM(AssociatedPress[1:20,], k = 2)
Determine the Hellinger distance between the rows of two data matrices or, if the second argument is missing, between the rows of one data matrix.
## Default S3 method:
distHellinger(x, y, ...)
## S3 method for class 'simple_triplet_matrix'
distHellinger(x, y, ...)
x | A data matrix. |
y | A data matrix. |
... | Currently not used. |
A matrix containing the distances.
Bettina Gruen
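As a minimal sketch of the quantity involved, the following base R code computes Hellinger distances between the rows of two matrices whose rows are probability vectors. The helper name hellinger_rows is hypothetical and not part of topicmodels, and the 1/2 normalization inside the square root is an assumption taken from the common textbook definition; distHellinger() may scale its result differently.

```r
# Hypothetical helper (not part of topicmodels): Hellinger distance between
# every row of x and every row of y, assuming each row is a probability vector.
hellinger_rows <- function(x, y = x) {
  sx <- sqrt(x)
  sy <- sqrt(y)
  # Squared Euclidean distance between the square-rooted rows.
  d2 <- outer(rowSums(sx^2), rowSums(sy^2), "+") - 2 * tcrossprod(sx, sy)
  # Guard against tiny negative values caused by rounding, then halve.
  sqrt(pmax(d2, 0) / 2)
}

p <- rbind(c(0.5, 0.5, 0), c(0, 0, 1))
q <- rbind(c(0.5, 0.5, 0))

# Result has one row per row of p and one column per row of q.
h <- hellinger_rows(p, q)
h
```

Identical rows give a distance of zero and rows with disjoint support give the maximum distance under this normalization.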
Dublin Core metadata for papers published in the Journal of Statistical Software (JSS) from 1996 until mid-2010.
data("JSS_papers")
A list matrix of character vectors, with rows corresponding to papers and the 15 columns giving the respective Dublin Core elements (variables).
Variables title and description give the title and the abstract of the paper, respectively, and creator gives the authors (each entry is a character vector with the names of the individual authors).
Metadata were obtained from the JSS OAI repository at https://www.jstatsoft.org/oai via package OAIHarvester (https://CRAN.R-project.org/package=OAIHarvester). Records not corresponding to papers (such as book reviews) were dropped.
See the documentation of package OAIHarvester for more information on Dublin Core and OAI, and https://www.jstatsoft.org/ for information about JSS.
data("JSS_papers")
## Inspect the first records:
head(JSS_papers)
## Numbers of papers by year:
table(strftime(as.Date(unlist(JSS_papers[, "date"])), "%Y"))
## Frequent authors:
head(sort(table(unlist(JSS_papers[, "creator"])), decreasing = TRUE))
Estimate an LDA model using, for example, the VEM algorithm or Gibbs sampling.
LDA(x, k, method = "VEM", control = NULL, model = NULL, ...)
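x | Object of class "DocumentTermMatrix". |
k | Integer; number of topics. |
method | The method to be used for fitting; currently "VEM" or "Gibbs" are supported. |
control | A named list of the control parameters for estimation or an object of class "LDAcontrol". |
model | Object of class "LDA" with which the model fitting is initialized. |
... | Optional arguments. For method = "Gibbs", seed words and their weights can be supplied to fit a seeded topic model. |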
The C code for LDA from David M. Blei and co-authors is used to estimate and fit a latent Dirichlet allocation model with the VEM algorithm. For Gibbs sampling, the C++ code from Xuan-Hieu Phan and co-authors is used.
When Gibbs sampling is used for fitting the model, seed words with their additional weights for the prior parameters can be specified in order to be able to fit seeded topic models.
LDA() returns an object of class "LDA".
Bettina Gruen
Blei D.M., Ng A.Y., Jordan M.I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
Phan X.H., Nguyen L.M., Horiguchi S. (2008). Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. In Proceedings of the 17th International World Wide Web Conference (WWW 2008), pages 91–100, Beijing, China.
Lu, B., Ott, M., Cardie, C., Tsou, B.K. (2011). Multi-aspect Sentiment Analysis with Topic Models. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops, pages 81–88.
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2)
lda_inf <- posterior(lda, AssociatedPress[21:30,])
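The call above uses the default VEM method. As a hedged sketch of the Gibbs sampling alternative, the following fits the same small subset with method = "Gibbs". The control values (seed, burnin, iter, thin) and the object names are illustrative assumptions chosen to keep the run short, not recommended settings.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

# Illustrative control values only: a short burn-in followed by a short chain.
gibbs_control <- list(seed = 2024L, burnin = 50L, iter = 100L, thin = 100L)

lda_gibbs <- LDA(AssociatedPress[1:20, ], k = 2, method = "Gibbs",
                 control = gibbs_control)

# The fitted object is a Gibbs-specific subclass of "LDA".
class(lda_gibbs)
```

Fixing the seed in the control list makes the sampled topic assignments reproducible across runs.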
Data from the lda package is transformed to a document-term matrix. This data format can be used to fit topic models using package topicmodels.
Data in form of a document-term matrix is transformed to the LDA format used by package lda.
ldaformat2dtm(documents, vocab, omit_empty = TRUE)
dtm2ldaformat(x, omit_empty = TRUE)
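documents | A list of documents in the format used by package lda. |
vocab | A character vector of the vocabulary terms, in the format used by package lda. |
x | An object of class "DocumentTermMatrix". |
omit_empty | A logical indicating if empty documents should be removed when converting the objects. By default empty documents are removed. |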
An object of class "DocumentTermMatrix" is returned by ldaformat2dtm() and a list with components "documents" and "vocab" by dtm2ldaformat().
Bettina Gruen
if (require("lda")) {
  data("cora.documents", package = "lda")
  data("cora.vocab", package = "lda")
  dtm <- ldaformat2dtm(cora.documents, cora.vocab)
  cora <- dtm2ldaformat(dtm)
  all.equal(cora, list(documents = cora.documents, vocab = cora.vocab))
}
Compute the log-likelihood.
Compute the log-likelihood of a "TopicModel" object. For "VEM" objects, the sum of the log-likelihoods of all documents, given the parameters for the topic distribution and for the word distribution of each topic, is approximated using the variational parameters. This approximation underestimates the log-likelihood by the Kullback-Leibler divergence between the variational posterior probability and the true posterior probability.
Compute the log-likelihoods of the "TopicModel" objects contained in the "Gibbs_list" object.
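The description above can be illustrated with a short, hedged sketch: for a model fitted with VEM, the value returned by logLik() should agree with the sum of the per-document values stored in the loglikelihood slot. The subset size, the seed and the object names are assumptions made for illustration.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

# Small VEM fit; the seed is set only to make the run reproducible.
lda <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 1L))

# Log-likelihood of the corpus (variational approximation for VEM fits).
ll <- logLik(lda)
as.numeric(ll)

# Per-document contributions kept in the fitted object.
doc_ll <- lda@loglikelihood
length(doc_ll)
sum(doc_ll)
```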
Determine the perplexity of a fitted model.
perplexity(object, newdata, ...)
## S4 method for signature 'VEM,simple_triplet_matrix'
perplexity(object, newdata, control, ...)
## S4 method for signature 'Gibbs,simple_triplet_matrix'
perplexity(object, newdata, control, use_theta = TRUE, estimate_theta = TRUE, ...)
## S4 method for signature 'Gibbs_list,simple_triplet_matrix'
perplexity(object, newdata, control, use_theta = TRUE, estimate_theta = TRUE, ...)
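object | Fitted model; an object of class "VEM", "Gibbs" or "Gibbs_list" (see the method signatures). |
newdata | If missing, the perplexity for the data to which the model was fitted is determined. For objects fitted using Gibbs sampling newdata needs to be specified. |
control | If missing, the control slot of object is used. |
use_theta | Object of class "logical". |
estimate_theta | Object of class "logical". |
... | Further arguments passed to the different methods. |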
The specified control is modified to ensure that (1) estimate.beta=FALSE and (2) nstart=1.
For "Gibbs_list" objects the control is further modified to have (1) iter=thin and (2) best=TRUE and the model is fitted to the new data with this control for each available iteration. The perplexity is then determined by averaging over the same number of iterations.
If a list is supplied as object, it is assumed that it consists of several models which were fitted using different starting configurations.
A numeric value.
Bettina Gruen
Blei D.M., Ng A.Y., Jordan M.I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
Griffiths T.L., Steyvers M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences of the United States of America, 101, Suppl. 1, 5228–5235.
Newman D., Asuncion A., Smyth P., Welling M. (2009). Distributed Algorithms for Topic Models. Journal of Machine Learning Research, 10, 1801–1828.
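A minimal usage sketch for a model fitted with VEM follows: the perplexity is first determined for the training data (newdata missing) and then for held-out documents. The split into rows 1 to 20 and 21 to 30, the seed and the object names are assumptions made for illustration.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

# Illustrative split of a small subset into training and held-out documents.
train <- AssociatedPress[1:20, ]
test <- AssociatedPress[21:30, ]

lda <- LDA(train, k = 2, control = list(seed = 1L))

# Perplexity on the data the model was fitted to.
perp_train <- perplexity(lda)

# Perplexity on documents not used for fitting.
perp_test <- perplexity(lda, newdata = test)

c(train = perp_train, test = perp_test)
```

Lower values indicate that the model assigns higher probability to the observed words; comparing held-out perplexity across different numbers of topics is one common way to choose k.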
Determine the posterior probabilities of the topics for each document and of the terms for each topic for a fitted topic model.
## S4 method for signature 'TopicModel,missing'
posterior(object, newdata, ...)
## S4 method for signature 'TopicModel,ANY'
posterior(object, newdata, control = list(), ...)
object | An object of class "TopicModel". |
newdata | If missing the posteriors for the original observations are returned. |
control | A named list of the control parameters for estimation or a suitable control object. |
... | Currently not used. |
Bettina Gruen
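As a hedged sketch of typical use, the following extracts the posterior for the fitted documents and then for new documents. It relies on the returned list having components named terms and topics; the subset sizes, the seed and the object names are assumptions made for illustration.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

lda <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 1L))

# Posterior for the documents the model was fitted to.
post <- posterior(lda)
dim(post$terms)   # one row per topic, one column per term
dim(post$topics)  # one row per document, one column per topic

# Posterior topic distribution for documents not used for fitting.
post_new <- posterior(lda, AssociatedPress[21:30, ])
dim(post_new$topics)
```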
Function to extract the most likely terms for each topic or the most likely topics for each document.
## S4 method for signature 'TopicModel'
terms(x, k, threshold, ...)
## S4 method for signature 'TopicModel'
topics(x, k, threshold, ...)
x | Object of class "TopicModel". |
k | The maximum number of terms/topics returned. By default set to 1 if no threshold is given. |
threshold | Only the terms/topics which are more likely than the threshold are returned. |
... | Further arguments passed to |
A list or matrix containing the most likely terms for each topic or the most likely topics for each document.
Bettina Gruen
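A short, hedged usage sketch: after fitting a small model, terms() with k = 5 lists the five most likely terms per topic, and topics() without further arguments gives the single most likely topic per document. The subset size, the seed and the object names are assumptions made for illustration.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

lda <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 1L))

# Five most likely terms for each of the two topics.
top_terms <- terms(lda, 5)
top_terms

# Most likely topic for each of the 20 documents.
main_topic <- topics(lda)
main_topic
```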
Fitted topic model.
Objects of class "LDA" are returned by LDA() and of class "CTM" by CTM().
Class "TopicModel" contains
call: Object of class "call".
Dim: Object of class "integer"; number of documents and terms.
control: Object of class "TopicModelcontrol"; options used for estimating the topic model.
k: Object of class "integer"; number of topics.
terms: Vector containing the term names.
documents: Vector containing the document names.
beta: Object of class "matrix"; logarithmized parameters of the word distribution for each topic.
gamma: Object of class "matrix"; parameters of the posterior topic distribution for each document.
iter: Object of class "integer"; the number of iterations made.
logLiks: Object of class "numeric"; the vector of kept intermediate log-likelihood values of the corpus. See loglikelihood for how the log-likelihood is determined.
n: Object of class "integer"; number of words in the data used.
wordassignments: Object of class "simple_triplet_matrix"; most probable topic for each observed word in each document.
Class "VEM" contains
loglikelihood: Object of class "numeric"; the log-likelihood of each document given the parameters for the topic distribution and for the word distribution of each topic. It is approximated using the variational parameters and underestimates the log-likelihood by the Kullback-Leibler divergence between the variational posterior probability and the true posterior probability.
Class "LDA" extends class "TopicModel" and has the additional slots
loglikelihood: Object of class "numeric"; the posterior likelihood of the corpus conditional on the topic assignments is returned.
alpha: Object of class "numeric"; parameter of the Dirichlet distribution for topics over documents.
Class "LDA_Gibbs" extends class "LDA" and has the additional slots
seed: Either NULL or object of class "simple_triplet_matrix"; parameter for the prior distribution of the word distribution for topics if seeded.
z: Object of class "integer"; topic assignments of words ordered by terms with suitable repetition within documents.
Class "CTM" extends class "TopicModel" and has the additional slots
mu: Object of class "numeric"; mean of the topic distribution on the logit scale.
Sigma: Object of class "matrix"; variance-covariance matrix of topics on the logit scale.
Class "CTM_VEM" extends classes "CTM" and "VEM" and has the additional slots
nusquared: Object of class "matrix"; variance of the variational distribution on the parameter mu.
Bettina Gruen
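The slots listed above can be read directly from a fitted object with the @ operator. The following is a minimal sketch; the subset size, the seed and the object names are assumptions made for illustration. Because beta is stored on the log scale, it is exponentiated before being read as term probabilities.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

lda <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 1L))

# Number of topics, and number of documents and terms.
lda@k
lda@Dim

# beta holds logarithmized word distributions, one row per topic.
term_prob <- exp(lda@beta)
dim(term_prob)
rowSums(term_prob)

# gamma holds the posterior topic distribution, one row per document.
dim(lda@gamma)

# Term names are stored alongside the parameters.
head(lda@terms)
```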
Classes to control the estimation of topic models. All of them inherit from the virtual base class "TopicModelcontrol".
Objects can be created from named lists.
Class "TopicModelcontrol" contains
seed: Object of class "integer"; used to set the seed in the external code for VEM estimation and to call set.seed for Gibbs sampling. For Gibbs sampling it can also be set to NA (default) to avoid changing the seed of the random number generator in the model fitting call.
verbose: Object of class "integer". If a positive integer, then the progress is reported every verbose iterations. If 0 (default), no output is generated during model fitting.
save: Object of class "integer". If a positive integer, the estimated model is saved every save iterations. If 0 (default), no intermediate results are saved during model fitting.
prefix: Object of class "character"; path indicating where to save the intermediate results.
nstart: Object of class "integer". Number of repeated random starts.
best: Object of class "logical"; if TRUE only the model with the maximum (posterior) likelihood is returned, by default equals TRUE.
keep: Object of class "integer"; if a positive integer, the log-likelihood is saved every keep iterations.
estimate.beta: Object of class "logical"; controls whether beta, the term distribution of the topics, is estimated (TRUE) or kept fixed (FALSE), by default equals TRUE.
Class "VEMcontrol" contains
var: Object of class "OPTcontrol"; controls the variational inference for a single document, by default iter.max equals 500 and tol 10^-6.
em: Object of class "OPTcontrol"; controls the variational EM algorithm, by default iter.max equals 1000 and tol 10^-4.
initialize: Object of class "character"; one of "random", "seeded" and "model", by default equals "random".
Class "LDAcontrol" extends class "TopicModelcontrol" and has the additional slots
alpha: Object of class "numeric"; initial value for alpha.
Class "LDA_VEMcontrol" extends classes "LDAcontrol" and "VEMcontrol" and has the additional slots
estimate.alpha: Object of class "logical"; indicates whether the parameter alpha is estimated (TRUE) or fixed a priori (FALSE), by default equals TRUE.
Class "LDA_Gibbscontrol" extends class "LDAcontrol" and has the additional slots
delta: Object of class "numeric"; initial value for delta, by default equals 0.1.
iter: Object of class "integer"; number of Gibbs iterations (after omitting the burnin iterations), by default equals 2000.
thin: Object of class "integer"; number of omitted in-between Gibbs iterations, by default equals iter.
burnin: Object of class "integer"; number of omitted Gibbs iterations at beginning, by default equals 0.
initialize: Object of class "character"; one of "random", "beta" and "z", by default equals "random".
Class "CTM_VEMcontrol" extends classes "TopicModelcontrol" and "VEMcontrol" and has the additional slots
cg: Object of class "OPTcontrol"; controls the conjugate gradient iterations in fitting the variational mean and variance per document, by default iter.max equals 500 and tol 10^-5.
Class "OPTcontrol" contains
iter.max: Object of class "integer"; maximum number of iterations.
tol: Object of class "numeric"; tolerance for convergence check.
Bettina Gruen
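Since control objects can be created from named lists, a plain list passed as control to LDA() is usually enough. The following sketch sets a few of the slots described above for a VEM fit, with nested lists for the "OPTcontrol" slots em and var. All values shown are illustrative assumptions rather than recommended settings.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

# Illustrative settings: keep alpha fixed at 0.1 and loosen the stopping rules.
vem_control <- list(
  seed = 1L,
  alpha = 0.1,
  estimate.alpha = FALSE,
  em = list(iter.max = 50L, tol = 1e-3),
  var = list(iter.max = 100L, tol = 1e-5)
)

lda <- LDA(AssociatedPress[1:20, ], k = 2, method = "VEM",
           control = vem_control)

# The list is turned into a control object stored with the fitted model.
class(lda@control)
lda@control@em@iter.max
lda@control@estimate.alpha

# With estimate.alpha = FALSE the supplied alpha is kept.
lda@alpha
```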