Title: Estimation of the Structural Topic and Sentiment-Discourse Model for Text Analysis
Description: The Structural Topic and Sentiment-Discourse (STS) model allows researchers to estimate topic models with document-level metadata that determines both topic prevalence and sentiment-discourse. The sentiment-discourse is modeled as a document-level latent variable for each topic that modulates the word frequency within a topic. These latent topic sentiment-discourse variables are controlled by the document-level metadata. The STS model can be useful for regression analysis with text data in addition to topic modeling's traditional use of descriptive analysis. The method was developed in Chen and Mankad (2024) <doi:10.1287/mnsc.2022.00261>.
Authors: Shawn Mankad [aut, cre], Li Chen [aut]
Maintainer: Shawn Mankad <[email protected]>
License: MIT + file LICENSE
Version: 1.2
Built: 2024-12-25 07:14:43 UTC
Source: CRAN
This package implements the Structural Topic and Sentiment-Discourse (STS) model, which allows researchers to estimate topic models with document-level metadata that determines both topic prevalence and sentiment-discourse. The sentiment-discourse is modeled as a document-level latent variable for each topic that modulates the word frequency within a topic. These latent topic sentiment-discourse variables are controlled by the document-level metadata. The STS model can be useful for regression analysis with text data in addition to topic modeling's traditional use of descriptive analysis.
Function to fit the model: sts
Functions for post-estimation: estimateRegns, topicExclusivity, topicSemanticCoherence, heldoutLikelihood, plotRepresentativeDocs, findRepresentativeDocs, printTopWords, plot.STS
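A minimal end-to-end sketch tying these functions together (it mirrors the per-function examples below, using the gadarian data shipped with stm; maxIter is kept small purely for speed):

library("tm"); library("stm"); library("sts")
# prepare the corpus with stm's utilities
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian, verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)
out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1)
# fit the STS model (low max iteration number just for testing)
sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 2)
# post-estimation
printTopWords(sts_estimate)
regns <- estimateRegns(sts_estimate, ~ treatment*pid_rep, out)
printRegnTables(x = regns)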
Author: Shawn Mankad and Li Chen
Maintainer: Shawn Mankad <[email protected]>
Chen, L. and Mankad, S. (2024) "A Structural Topic and Sentiment-Discourse Model for Text Analysis." Management Science. <doi:10.1287/mnsc.2022.00261>
Estimates regression tables for prevalence and sentiment-discourse.
estimateRegns(object, prevalence_sentiment, corpus)
object: An sts object.
prevalence_sentiment: A formula object with no response variable, or a design matrix with the covariates. If a formula, the variables must be contained in corpus$meta.
corpus: The document-term matrix to be modeled, as a sparse term-count matrix with one row per document and one column per term. The object must be a list with each element corresponding to a document. Each document is represented as an integer matrix with two rows and one column per unique vocabulary word in the document. The first row contains the 1-indexed vocabulary entry and the second row contains the number of times that term appears. This is the same format used by the stm package.
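A quick way to inspect this format, assuming documents prepared with stm's prepDocuments as in the examples below:

# each element of out$documents is an integer matrix with two rows:
# row 1 holds 1-indexed vocabulary IDs, row 2 holds the term counts
out$documents[[1]]
out$vocab[out$documents[[1]][1, ]]  # map the vocabulary IDs back to words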
Estimates Gamma coefficients (along with standard errors, p-values, etc.) to assess how document-level metadata determine prevalence and sentiment-discourse.
Returns a list of tables with regression coefficient estimates. The first K elements pertain to prevalence; the last K elements pertain to sentiment-discourse.
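For example, with K = 3 topics the returned list has six tables; an indexing sketch (sts_estimate and out as in the example below):

K <- 3
regns <- estimateRegns(sts_estimate, ~ treatment*pid_rep, out)
regns[[1]]      # prevalence regression table for topic 1
regns[[K + 1]]  # sentiment-discourse regression table for topic 1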
library("tm"); library("stm"); library("sts") temp<-textProcessor(documents=gadarian$open.ended.response, metadata=gadarian, verbose = FALSE) out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE) out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1) ## low max iteration number just for testing sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 2) regns <- estimateRegns(sts_estimate, ~treatment*pid_rep, out) printRegnTables(x = regns)
library("tm"); library("stm"); library("sts") temp<-textProcessor(documents=gadarian$open.ended.response, metadata=gadarian, verbose = FALSE) out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE) out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1) ## low max iteration number just for testing sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 2) regns <- estimateRegns(sts_estimate, ~treatment*pid_rep, out) printRegnTables(x = regns)
Extracts documents with the highest prevalence for a given topic
findRepresentativeDocs(object, corpus_text, topic, n = 3)
object: Model output from sts.
corpus_text: Vector of text documents, usually contained in the output of prepDocuments.
topic: A single topic number.
n: Number of documents to extract.
# Examples with the Gadarian data
library("tm"); library("stm"); library("sts")
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian, verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)
out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1)
## low max iteration number just for testing
sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 2)
docs <- findRepresentativeDocs(sts_estimate, out$meta$open.ended.response, topic = 3, n = 4)
plotRepresentativeDocs(docs, text.cex = 0.7, width = 100)
Compute the heldout log-likelihood of the STS model
heldoutLikelihood(object, missing)
object: An sts object, typically fit on a corpus prepared with make.heldout (as in the example below).
missing: List of which words and documents are in the heldout set.
expected.heldout is the average of the held-out log-likelihood values for each document.
library("tm"); library("stm"); library("sts") temp<-textProcessor(documents=gadarian$open.ended.response, metadata=gadarian, verbose = FALSE) out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE) out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1) out_ho <- make.heldout(out$documents, out$vocab) out_ho$meta <- out$meta ## low max iteration number just for testing sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out_ho, K = 3, maxIter = 2, verbose = FALSE) heldoutLikelihood(sts_estimate, out_ho$missing)$expected.heldout
library("tm"); library("stm"); library("sts") temp<-textProcessor(documents=gadarian$open.ended.response, metadata=gadarian, verbose = FALSE) out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE) out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1) out_ho <- make.heldout(out$documents, out$vocab) out_ho$meta <- out$meta ## low max iteration number just for testing sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out_ho, K = 3, maxIter = 2, verbose = FALSE) heldoutLikelihood(sts_estimate, out_ho$missing)$expected.heldout
Produces a plot of the most likely words and their probabilities for each topic for different levels of sentiment for an STS object.
## S3 method for class 'STS'
plot(x, n = 10, topics = NULL, lowerPercentile = 0.05, upperPercentile = 0.95, ...)
x: Model output from sts.
n: Sets the number of words used to label each topic. In perspective plots it approximately sets the total number of words in the plot; n must be greater than or equal to 2.
topics: Vector of topics to display. Defaults to all topics.
lowerPercentile: Percentile used to calculate a representative negative-sentiment document.
upperPercentile: Percentile used to calculate a representative positive-sentiment document.
...: Additional parameters passed to plotting functions.
# Examples with the Gadarian data
library("tm"); library("stm"); library("sts")
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian, verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)
out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1)
## low max iteration number just for testing
sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 2)
plot(sts_estimate)
plot(sts_estimate, n = 10, topics = c(1, 2))
Produces a plot of the text of documents that load most heavily on topics for an STS object
plotRepresentativeDocs(object, text.cex = 1, width = 100)
object: Documents extracted with findRepresentativeDocs (see the example below).
text.cex: Size of the text; defaults to 1.
width: Width of the plotting window; defaults to 100.
# Examples with the Gadarian data
library("tm"); library("stm"); library("sts")
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian, verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)
out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1)
## low max iteration number just for testing
sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 2)
docs <- findRepresentativeDocs(sts_estimate, out$meta$open.ended.response, topic = 3, n = 1)
plotRepresentativeDocs(docs, text.cex = 0.7, width = 100)
Prints estimated regression tables from estimateRegns().
printRegnTables(x, topics = NULL, digits = max(3L, getOption("digits") - 3L), signif.stars = getOption("show.signif.stars"), ...)
x: The estimated regression tables from estimateRegns().
topics: Vector of topics to display. Defaults to all topics.
digits: Minimum number of significant digits to be used for most numbers.
signif.stars: Logical; if TRUE, p-values are additionally encoded visually as 'significance stars' to help scanning of long coefficient tables. Defaults to the show.signif.stars option.
...: Other arguments suitable for stats::printCoefmat().
Prints estimated regression tables from estimateRegns() to the console.
library("tm"); library("stm"); library("sts") temp<-textProcessor(documents=gadarian$open.ended.response, metadata=gadarian, verbose = FALSE) out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE) out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1) ## low max iteration number just for testing sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 2) regns <- estimateRegns(sts_estimate, ~treatment*pid_rep, out) printRegnTables(x = regns)
library("tm"); library("stm"); library("sts") temp<-textProcessor(documents=gadarian$open.ended.response, metadata=gadarian, verbose = FALSE) out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE) out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1) ## low max iteration number just for testing sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 2) regns <- estimateRegns(sts_estimate, ~treatment*pid_rep, out) printRegnTables(x = regns)
Prints the top words for each topic for low, average, and high levels of sentiment-discourse.
printTopWords(object, n = 10, lowerPercentile = 0.05, upperPercentile = 0.95)
object: Model output from sts.
n: Number of words to print to console for each topic.
lowerPercentile: Percentile used to calculate a representative negative-sentiment document.
upperPercentile: Percentile used to calculate a representative positive-sentiment document.
# Examples with the Gadarian data
library("tm"); library("stm"); library("sts")
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian, verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)
out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1)
## low max iteration number just for testing
sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 2)
printTopWords(sts_estimate)
Estimation of the STS model using variational EM. The function takes a sparse representation of a document-term matrix, covariates for each document, and an integer number of topics, and returns fitted model parameters. See an overview of functions in the package here: sts-package.
sts(prevalence_sentiment, initializationVar, corpus, K, maxIter = 100, convTol = 1e-05, initialization = "anchor", kappaEstimation = "adjusted", verbose = TRUE, parallelize = FALSE, stmSeed = NULL)
prevalence_sentiment: A formula object with no response variable, or a design matrix with the covariates. The variables must be contained in corpus$meta.
initializationVar: A formula with a single variable for use in the initialization of latent sentiment. This argument is usually the key experimental variable (e.g., a review rating or a binary indicator of the experiment/control group).
corpus: The document-term matrix to be modeled, as a sparse term-count matrix with one row per document and one column per term. The object must be a list with each element corresponding to a document. Each document is represented as an integer matrix with two rows and one column per unique vocabulary word in the document. The first row contains the 1-indexed vocabulary entry and the second row contains the number of times that term appears. This is the same format used by the stm package.
K: A positive integer (2 or greater) representing the desired number of topics.
maxIter: A positive integer representing the maximum number of VEM iterations allowed.
convTol: Convergence tolerance for the variational EM estimation algorithm; default value = 1e-5.
initialization: Character argument that allows the user to specify an initialization method. The default choice is "anchor"; setting initialization = "stm" initializes from a prefit STM model supplied via stmSeed.
kappaEstimation: A character input specifying how kappa should be estimated; defaults to "adjusted".
verbose: A logical flag indicating whether information should be printed to the screen.
parallelize: A logical flag indicating whether to parallelize the estimation using all but one of the CPU cores on the local machine.
stmSeed: A prefit STM model object used to initialize the STS model. Note this is ignored unless initialization = "stm".
This is the main function for estimating the Structural Topic and Sentiment-Discourse (STS) model. Users provide a corpus of documents and a number of topics. Each word in a document comes from exactly one topic, and each document is represented by the proportion of its words that come from each of the topics. The document-specific covariates affect how much a topic is discussed (prevalence) and the way in which it is discussed (sentiment-discourse).
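A schematic of the model's structure, pieced together from the components returned below (a paraphrase rather than the exact specification; see Chen and Mankad (2024) for details): for each document $d$ with covariates $x_d$, the per-topic prevalence and sentiment-discourse values are collected in a latent vector $\alpha_d$ with

$$\alpha_d \sim \mathcal{N}(\Gamma^\top x_d, \Sigma),$$

and the sentiment-discourse component of $\alpha_d$ shifts each topic's word distribution through the kappa coefficients around the baseline log-transformed occurrence rates $m_v$.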
An object of class sts, with components:
alpha: Estimated prevalence and sentiment-discourse values for each document and topic.
gamma: Estimated regression coefficients that determine prevalence and sentiment-discourse for each topic.
kappa: Estimated kappa coefficients that determine sentiment-discourse and the topic-word distributions.
sigma_inv: Inverse of the covariance matrix for the alpha parameters.
sigma: Covariance matrix for the alpha parameters.
elbo: The ELBO at each iteration of the estimation algorithm.
mv: The baseline log-transformed occurrence rate of each word in the corpus.
runtime: Time elapsed in seconds.
vocab: Vocabulary vector used.
mu: Mean (fitted) values for alpha, i.e., the document-level covariates multiplied by the estimated Gamma for each document.
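A short sketch of inspecting the fitted object, using the component names documented above:

sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 2)
str(sts_estimate$alpha)              # prevalence and sentiment-discourse values per document and topic
plot(sts_estimate$elbo, type = "l")  # ELBO trajectory across VEM iterations
head(sts_estimate$vocab)             # vocabulary used in estimation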
Roberts, M., Stewart, B., Tingley, D., and Airoldi, E. (2013) "The structural topic model and applied social science." In Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation.
Roberts M., Stewart, B. and Airoldi, E. (2016) "A model of text for experimentation in the social sciences" Journal of the American Statistical Association.
Chen, L. and Mankad, S. (2024) "A Structural Topic and Sentiment-Discourse Model for Text Analysis." Management Science. <doi:10.1287/mnsc.2022.00261>
# An example using the Gadarian data from the stm package. From raw text to
# fitted model using textProcessor(), which leverages the tm package
library("tm"); library("stm"); library("sts")
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian, verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)
out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1)
## low max iteration number just for testing
sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 1, verbose = FALSE)
Function to report on the contents of STS objects
## S3 method for class 'STS'
summary(object, ...)
object: An STS object.
...: Additional arguments affecting the summary.
Summary prints a short statement about the model and then runs printTopWords.
Calculate an exclusivity metric for an STS model.
topicExclusivity(object, M = 10, frexw = 0.7)
object: Model output from sts.
M: The number of top words to consider per topic.
frexw: The FREX weight.
Roberts et al. (2014) proposed an exclusivity measure to help with topic model selection. The measure also incorporates information on word frequency: it is based on the FREX labeling metric (see Roberts et al. 2014), with the weight set to 0.7 in favor of exclusivity by default.
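For reference, FREX combines a word's frequency under a topic and its exclusivity to that topic as a weighted harmonic mean of their empirical CDF values. A sketch of the standard definition from the stm literature (the package's exact computation may differ in minor details):

$$\mathrm{FREX}_{k,v} = \left( \frac{w}{\mathrm{ECDF}\left( \beta_{k,v} \big/ \textstyle\sum_{j} \beta_{j,v} \right)} + \frac{1 - w}{\mathrm{ECDF}\left( \beta_{k,v} \right)} \right)^{-1},$$

where $w$ is the frexw argument and $\beta_{k,v}$ is the probability of word $v$ under topic $k$.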
a numeric vector containing exclusivity for each topic
Mimno, D., Wallach, H. M., Talley, E., Leenders, M., and McCallum, A. (2011, July). "Optimizing semantic coherence in topic models." In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 262-272). Association for Computational Linguistics.
Bischof and Airoldi (2012) "Summarizing topical content with word frequency and exclusivity" In Proceedings of the International Conference on Machine Learning.
Roberts, M., Stewart, B., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S., Albertson, B., et al. (2014). "Structural topic models for open ended survey responses." American Journal of Political Science, 58(4), 1064-1082.
# An example using the Gadarian data from the stm package.
# From raw text to fitted model using textProcessor(), which leverages the tm package
library("tm"); library("stm"); library("sts")
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian, verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)
out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1)
## low max iteration number just for testing
sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 2)
topicExclusivity(sts_estimate)
Calculates semantic coherence for an STS model.
topicSemanticCoherence(object, corpus, M = 10)
object: Model output from sts.
corpus: The document-term matrix to be modeled, as a sparse term-count matrix with one row per document and one column per term. The object must be a list with each element corresponding to a document. Each document is represented as an integer matrix with two rows and one column per unique vocabulary word in the document. The first row contains the 1-indexed vocabulary entry and the second row contains the number of times that term appears. This is the same format used by the stm package.
M: The number of top words to consider per topic.
a numeric vector containing semantic coherence for each topic
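Semantic coherence follows Mimno et al. (2011): for the $M$ most probable words $v_1, \dots, v_M$ of topic $k$, it sums log co-occurrence ratios over document counts. A sketch of the standard definition:

$$C_k = \sum_{i=2}^{M} \sum_{j=1}^{i-1} \log \frac{D(v_i, v_j) + 1}{D(v_j)},$$

where $D(v_i, v_j)$ is the number of documents containing both $v_i$ and $v_j$, and $D(v_j)$ is the number of documents containing $v_j$.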
# An example using the Gadarian data from the stm package. From raw text to
# fitted model using textProcessor(), which leverages the tm package
library("tm"); library("stm"); library("sts")
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian, verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)
out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1)
## low max iteration number just for testing
sts_estimate <- sts(~ treatment*pid_rep, ~ noTreatment, out, K = 3, maxIter = 2, verbose = FALSE)
topicSemanticCoherence(sts_estimate, out)