| Title: | Validating Topic Coherence and Topic Labels |
|---|---|
| Description: | By creating crowd-sourcing tasks that can be easily posted and results retrieved using Amazon's Mechanical Turk (MTurk) API, researchers can use this solution to validate the quality of topics obtained from unsupervised or semi-supervised learning methods, and the relevance of topic labels assigned. This helps ensure that the topic modeling results are accurate and useful for research purposes. See Ying and others (2022) <doi:10.1101/2023.05.02.538599>. For more information, please visit <https://github.com/Triads-Developer/Topic_Model_Validation>. |
| Authors: | Luwei Ying [aut, cre], Jacob Montgomery [aut], Brandon Stewart [aut] |
| Maintainer: | Luwei Ying <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 1.2.1 |
| Built: | 2024-12-22 06:48:05 UTC |
| Source: | CRAN |
A data frame of 20 example R4WSI0 tasks, 5 of which are gold-standard and 15 of which are not.
data(allR4WSItasktest)
A data frame of 20 rows and 6 columns.
- topic: Index of topics
- id: Index of topics
- doc: Example documents associated with each topic
- opt1: Word set option 1
- opt2: Word set option 2
- opt3: Word set option 3
- optcrt: Word set option 4, also the correct choice
Check Agreement Rate between Identical Trials
checkAgree(results1, results2, key, type = NULL)
| results1 | first batch of results; output of getResults() |
| results2 | second batch of results; output of getResults() |
| key | the local task record; output of recordTasks() |
| type | Task structure to be specified. Must be one of "WI" (word intrusion), "T8WSI" (top 8 word set intrusion), "R4WSI" (random 4 word set intrusion), "LI" (label intrusion), and "OL" (optimal label) |
Evaluate workers' performance by the agreement rate between identical trials (note that the two inputs, results1 and results2, must come from the same set of tasks). Returns 1) the exact agreement rate, where both workers agree on the exact same choice, and 2) the binary agreement rate, where both workers get the task either right or wrong simultaneously.
The agreement rates are returned as numeric values.
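A minimal usage sketch (not run), assuming 'res_batch1' and 'res_batch2' are two batches of answers to the same tasks retrieved with getResults(), and 'rec' is the local record from recordTasks():

```r
# Compare two identical trials of the same R4WSI batch
agreement <- checkAgree(results1 = res_batch1,
                        results2 = res_batch2,
                        key = rec,
                        type = "R4WSI")
```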
Combine the mass of words with the same root
combMass(mod = NULL, vocab = NULL, beta = NULL)
| mod | A fitted structural topic model. |
| vocab | A character vector specifying the words in the corpus. Usually, it can be found in the topic model output. |
| beta | A matrix of word probabilities for each topic. Each row represents a topic and each column represents a word. Note this should not be in logged form. |
Use as a preparatory step for validating unstemmed topic models.
A list with two elements:
| newvocab | A matrix of new vocabulary. Each row represents a topic and each column represents a unique stemmed word. |
| newbeta | A matrix of new beta values. Each row represents a topic and each column represents the sum of the probabilities of the words with the same root. |
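A minimal sketch using the bundled 'modtest' STM object; alternatively, 'vocab' and 'beta' can be supplied directly from any topic model output:

```r
data(modtest)
# Combine the probability mass of words that share the same stem, topic by topic
mass <- combMass(mod = modtest)
# mass$newvocab and mass$newbeta can then be passed on to validateTopic()
```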
Evaluate results
evalResults(results, key, type = NULL)
| results | results of human choices; output of getResults() |
| key | the local task record; output of recordTasks() |
| type | Task structure to be specified. Must be one of "WI" (word intrusion), "T8WSI" (top 8 word set intrusion), "R4WSI" (random 4 word set intrusion), "LI" (label intrusion), and "OL" (optimal label) |
Evaluate worker performance against gold-standard HITs; returns the accuracy rate (proportion correct) for a specified batch.
A list containing the gold-standard HIT correct rate, gold-standard HIT correct rate by workers, and non-gold-standard HIT correct rate
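A minimal sketch (not run), assuming 'res' was retrieved with getResults() and 'rec' is the local record from recordTasks():

```r
# Score the retrieved batch against the local answer key
eval_out <- evalResults(results = res, key = rec, type = "R4WSI")
str(eval_out)  # gold-standard and non-gold-standard correct rates
```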
Get results from MTurk
getResults( batch_id = "unspecified", hit_ids, retry = TRUE, retry_in_seconds = 60, AWS_id = Sys.getenv("AWS_ACCESS_KEY_ID"), AWS_secret = Sys.getenv("AWS_SECRET_ACCESS_KEY"), sandbox = getOption("pyMTurkR.sandbox", TRUE) )
| batch_id | any number or string to annotate the batch |
| hit_ids | HIT ids returned from the MTurk API, i.e., the output of sendTasks() |
| retry | if TRUE, retry retrieving results from the MTurk API five times; defaults to TRUE |
| retry_in_seconds | defaults to 60 seconds |
| AWS_id | AWS_ACCESS_KEY_ID |
| AWS_secret | AWS_SECRET_ACCESS_KEY |
| sandbox | sandbox setting |
This function works for complete or incomplete batches.
A data frame with the following columns:
| batch_id | an annotation for the batch |
| local_task_id | an identifier for the task in the batch |
| mturk_hit_id | the ID of the HIT in MTurk |
| assignment_id | the ID of the assignment in MTurk |
| worker_id | the ID of the worker who completed the assignment |
| result | the worker's response to the task |
| completed_at | the time when the worker submitted the assignment |
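A minimal sketch (not run), assuming 'sent' is the output of sendTasks() and the AWS credentials are available as environment variables; the batch id is an arbitrary annotation:

```r
# Retrieve results for a posted batch (works even if the batch is incomplete)
res <- getResults(batch_id = "validation-batch-1",
                  hit_ids = sent,          # output of sendTasks()
                  retry = TRUE,
                  retry_in_seconds = 60,
                  sandbox = TRUE)
```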
Data frame of 5 example gold-standard R4WSI0 Tasks.
data(goldR4WSItest)
A data frame of 5 rows and 6 columns.
- topic: Index of topics
- doc: Example documents associated with each topic
- opt1: Word set option 1
- opt2: Word set option 2
- opt3: Word set option 3
- optcrt: Word set option 4, also the correct choice
An output from the make.heldout function of the stm package.
data(heldouttest)
A list of the heldout documents, vocab, and missing.
See https://CRAN.R-project.org/package=stm for more details.
Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. "Stm: An R package for structural topic models." Journal of Statistical Software 91 (2019): 1-40.
Example Answer Keys
data(keypostedtest)
A list of two data frames, similar to recordtest.
- data.frame1: A data frame of tasks with the optcrt column indicating the machine-predicted choice.
- data.frame2: A data frame of tasks with randomized choices. Exactly the same as what would be sent online.
A list of two with the words (the most frequent form in each topic) and the corresponding word probabilities.
data(masstest)
A list of two.
- vocab: A matrix of words for each topic. Each row represents a topic and each column represents a word. Words with the same root are represented only by the most common form in that topic.
- beta: A matrix of combined word probabilities for each topic. Each row represents a topic and each column represents a combined word.
Mix the gold-standard tasks with the tasks that need to be validated
mixGold(tasks, golds)
| tasks | All tasks that need to be validated |
| golds | Gold-standard tasks with the same structure |
A data frame with the same structure as the input, where gold-standard tasks are randomly inserted
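A minimal sketch using the bundled example tasks; in practice 'tasks' would come from validateTopic() or validateLabel() and 'golds' from a user-prepared gold-standard set:

```r
data(R4WSItasktest)
data(goldR4WSItest)
# Randomly insert the 5 gold-standard tasks among the 15 regular tasks
mixed <- mixGold(tasks = R4WSItasktest, golds = goldR4WSItest)
```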
A structural topic model (STM) object generated from the stm package using a random sample of US senators' Facebook posts.
data(modtest)
An STM object.
See https://CRAN.R-project.org/package=stm for more details.
Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. "Stm: An R package for structural topic models." Journal of Statistical Software 91 (2019): 1-40.
Pick the optimal label from candidate labels
pickLabel( n, text.predict = NULL, text.name = "text", top1.name = "top1", labels.index = NULL, candidate.labels = NULL )
| n | The number of desired tasks |
| text.predict | A data frame or matrix containing both the text and the indicator(s) of the model-predicted topic(s). |
| text.name | variable name in 'text.predict' that indicates the text |
| top1.name | variable name in 'text.predict' that indicates the top1 model-predicted topic |
| labels.index | The topic index in correspondence with the labels, e.g., c(10, 12, 15). |
| candidate.labels | A list of vectors containing the user-defined labels assigned to the topics. Must be of the same length and in the same order as 'labels.index'. |
Users need to specify four plausible labels for each topic
A matrix with n rows and 6 columns (topic, doc, opt1, opt2, opt3, optcrt) where optcrt is the correct label that was picked.
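A minimal sketch (not run); 'predicted_docs' and the candidate labels below are illustrative placeholders, with four plausible labels supplied per topic:

```r
ol_tasks <- pickLabel(n = 15,
                      text.predict = predicted_docs,   # documents plus predicted topics (assumed)
                      text.name = "text",
                      top1.name = "top1",
                      labels.index = c(10, 12, 15),
                      candidate.labels = list(
                        c("Health care", "Education", "Economy", "Veterans"),
                        c("Immigration", "Trade", "Guns", "Energy"),
                        c("Foreign policy", "Agriculture", "Taxes", "Environment")))
```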
Plot results
plotResults(path, x, n, taskname, ...)
| path | path to store the plot |
| x | a vector of counts of successes; could be obtained from getResults() |
| n | a vector of counts of trials |
| taskname | the name of the task for labeling, e.g., Word Intrusion, Optimal Label |
| ... | additional arguments to be passed to the plot function |
Visualize the accuracy rate (proportion correct) for a specified batch
Nothing is returned; a plot is created and saved as a pdf file.
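A minimal sketch, assuming 'x' and 'n' were tallied from the retrieved results; the file path is illustrative:

```r
# Plot proportion correct for two batches and save the figure as a pdf
plotResults(path = "wi_accuracy.pdf",
            x = c(18, 25),                 # counts of correct answers per batch
            n = c(20, 30),                 # counts of trials per batch
            taskname = "Word Intrusion")
```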
Data of 15 example R4WSI0 Tasks structured as a matrix.
data(R4WSItasktest)
A matrix with 15 rows and 6 columns.
- topic: Index of topics
- doc: Example documents associated with each topic
- opt1: Word set option 1
- opt2: Word set option 2
- opt3: Word set option 3
- optcrt: Word set option 4, also the correct choice
Please note that the difference between the R4WSI0 examples used here and the R4WSI tasks is that the R4WSI tasks do not present any documents.
Reformat tasks to facilitate sending to MTurk
recordTasks(type, tasks, path)
| type | (character) one of WI, T8WSI, R4WSI |
| tasks | (data.frame) outputs from validateTopic(), validateLabel(), or mixGold() if users mix in gold-standard HITs |
| path | (character) path to record the tasks (with meta-information) |
Randomize the order of options and record the tasks in a specified local directory
A list of two data frames, containing the original tasks and the randomized options respectively.
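A minimal sketch (not run), assuming 'mixed' is the output of mixGold() (or of validateTopic() directly); the path is illustrative:

```r
# Randomize option order and keep a local record of the batch
rec <- recordTasks(type = "R4WSI",
                   tasks = mixed,
                   path = "r4wsi_record.Rdata")
```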
Local record generated by the recordTasks function.
data(recordtest)
A list of two data frames.
- data.frame1: A data frame of tasks with the optcrt column indicating the machine-predicted choice.
- data.frame2: A data frame of tasks with randomized choices. Exactly the same as what would be sent online.
To be compared with the answers from the online workers to evaluate the topic model performance.
Example Results Retrieved from Mturk
data(resultstest)
A data frame of ten example tasks retrieved from MTurk with or without online workers' answers.
- assignment_id: Assignment id, assigned by MTurk. If 0, the task hasn't been completed.
- batch_id: User-specified batch id.
- completed_at: Timestamp when the task was completed. If 0, the task hasn't been completed.
- local_task_id: Local task id.
- mturk_hit_id: MTurk HIT id, assigned by MTurk.
- result: Choice made by the worker (1-4). If 0, the task hasn't been completed.
- worker_id: MTurk worker id. If 0, the task hasn't been completed.
Send prepared tasks to MTurk and record the API-returned HIT ids.
sendTasks( hit_type = NULL, hit_layout = NULL, type = NULL, tasksrecord = NULL, tasksids = NULL, HITidspath = NULL, n_assignments = "1", expire_in_seconds = as.character(60 * 60 * 8), batch_annotation = NULL )
| hit_type | found on the MTurk requester's dashboard |
| hit_layout | found on the MTurk requester's dashboard |
| type | one of WI, T8WSI, R4WSI |
| tasksrecord | output of recordTasks() |
| tasksids | ids of tasks to send, in numeric form. If left unspecified, the whole batch will be posted |
| HITidspath | path to record the returned HIT ids |
| n_assignments | number of assignments per task. For validation tasks, 1 is almost always desired |
| expire_in_seconds | defaults to 8 hours |
| batch_annotation | add if needed |
Pairs the local ids with MTurk ids and saves them to the specified paths.
A list containing two elements:
- current_HIT_ids: A vector of the HIT IDs returned by the API.
- map_ids: A data frame that maps the task ids to their corresponding HIT ids.
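A minimal sketch (not run); the HIT type and layout ids are placeholders copied from the requester dashboard, and 'rec' is the output of recordTasks():

```r
sent <- sendTasks(hit_type = "YOUR_HIT_TYPE_ID",       # placeholder
                  hit_layout = "YOUR_HIT_LAYOUT_ID",   # placeholder
                  type = "R4WSI",
                  tasksrecord = rec,
                  HITidspath = "hit_ids.Rdata",
                  n_assignments = "1",
                  expire_in_seconds = as.character(60 * 60 * 8),
                  batch_annotation = "r4wsi-validation-1")
```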
An output from the prepDocuments function of the stm package.
data(stmPreptest)
A list containing a documents object and a vocab object.
See https://CRAN.R-project.org/package=stm for more details.
Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. "Stm: An R package for structural topic models." Journal of Statistical Software 91 (2019): 1-40.
The 'Topic_Model_Validation' repository is a collection of scripts and functions for performing topic modeling and evaluating topic models. This document provides an overview of the different scripts and functions in the repository and their purpose.
## Python Scripts
### evaluate.py
The 'evaluate.py' script provides functions for evaluating the performance of topic models on different datasets and tasks. The functions within this script include:
- R4WSItasktest(): Evaluates the performance of a topic model on the R4WSI task, which involves predicting the top k words for a given topic.
- allR4WSItasktest(): Evaluates the performance of a topic model on multiple versions of the R4WSI task.
- goldR4WSItest(): Evaluates the performance of a topic model on a gold-standard R4WSI dataset.
- heldouttest(): Evaluates the performance of a topic model on held-out data.
- keypostedtest(): Evaluates the performance of a topic model on a key-posted dataset.
- masstest(): Evaluates the performance of a topic model on a massive dataset.
- modtest(): Evaluates the performance of a topic model on a given dataset.
- resultstest(): Evaluates the performance of a topic model on a given dataset and stores the results.
### record.py
The 'record.py' script provides a function for storing the results of topic model evaluations. The function within this script is:
- record(): Stores the results of topic model evaluations.
## R Scripts
### lda.R
The 'lda.R' script provides functions for performing Latent Dirichlet Allocation (LDA) topic modeling on text data. The functions within this script include:
- lda_model(): Fits an LDA model to text data.
### lsa.R
The 'lsa.R' script provides functions for performing Latent Semantic Analysis (LSA) topic modeling on text data. The functions within this script include:
- lsa_model(): Fits an LSA model to text data.
### evaluate.R
The 'evaluate.R' script provides functions for evaluating the performance of topic models using various metrics, such as perplexity and coherence. The functions within this script include:
- evaluate_model(): Evaluates the performance of a topic model using various metrics.
### helpers.R
The 'helpers.R' script provides various helper functions that are used by the other scripts in the repository. The functions within this script include:
- clean_text(): Cleans and preprocesses text data for use in topic modeling.
- read_data(): Reads in text data from a file.
- write_data(): Writes text data to a file.
Create validation tasks for labels assigned to the topics in the topic model of choice.
validateLabel( type, n, text.predict = NULL, text.name = "text", top1.name = "top1", top2.name = "top2", top3.name = "top3", labels = NULL, labels.index = NULL, labels.add = NULL )
| type | Task structure to be specified. Must be one of "LI" (Label Intrusion) and "OL" (Optimal Label). |
| n | The number of desired tasks |
| text.predict | A data frame or matrix containing both the text and the indicator(s) of the model-predicted topic(s). |
| text.name | variable name in 'text.predict' that indicates the text |
| top1.name | variable name in 'text.predict' that indicates the top1 model-predicted topic |
| top2.name | variable name in 'text.predict' that indicates the top2 model-predicted topic |
| top3.name | variable name in 'text.predict' that indicates the top3 model-predicted topic |
| labels | The user-defined labels assigned to the topics |
| labels.index | The topic index in correspondence with the labels, e.g., c(10, 12, 15). Must be of the same length and in the same order as 'labels'. |
| labels.add | Labels from other broad categories. Defaults to NULL. Users can specify them to evaluate how well different broad categories are distinguished from one another. |
Users need to pick a topic model that they deem to be good and label the topics they later would like to use as measures.
A matrix containing the validation tasks, with six value columns (topic, doc, opt1, opt2, opt3, optcrt):
- topic: The topic index associated with the document.
- doc: The text of the document.
- opt1: The first option label presented to the user.
- opt2: The second option label presented to the user.
- opt3: The third option label presented to the user.
- optcrt: The correct label for the document.
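A minimal sketch (not run); 'predicted_docs' is assumed to contain the document text plus top1/top2/top3 predicted-topic columns, and the labels are illustrative:

```r
li_tasks <- validateLabel(type = "LI",
                          n = 20,
                          text.predict = predicted_docs,
                          text.name = "text",
                          top1.name = "top1",
                          top2.name = "top2",
                          top3.name = "top3",
                          labels = c("Health care", "Immigration", "Veterans"),
                          labels.index = c(10, 12, 15))
```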
Create validation tasks for topic model selection
validateTopic(type, n, text = NULL, vocab, beta, theta = NULL, thres = 20)
| type | Task structure to be specified. Must be one of "WI" (word intrusion), "T8WSI" (top 8 word set intrusion), and "R4WSI" (random 4 word set intrusion). |
| n | The number of desired tasks |
| text | The pool of documents to be shown to the MTurk workers |
| vocab | A character vector specifying the words in the corpus. Usually, it can be found in the topic model output. |
| beta | A matrix of word probabilities for each topic. Each row represents a topic and each column represents a word. Note this should not be in logged form. |
| theta | A matrix of topic proportions. Each row represents a document and each column represents a topic. Must be specified if type = "T8WSI" or "R4WSI". |
| thres | the threshold to draw words from; defaults to the top 20 words |
Users need to fit their own topic models.
A matrix of validation tasks. Each row represents a task and each column represents an aspect of a task: the topic label, the document text (for "T8WSI" and "R4WSI"), and five words, of which four are non-intrusive and one is intrusive.
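A minimal sketch (not run) using the bundled 'modtest' STM object; 'docs' is assumed to be the corresponding pool of raw documents, and the slot names follow the stm package's conventions:

```r
data(modtest)
r4wsi_tasks <- validateTopic(type = "R4WSI",
                             n = 10,
                             text = docs,                               # raw documents (assumed)
                             vocab = modtest$vocab,
                             beta = exp(modtest$beta$logbeta[[1]]),     # un-logged word probabilities
                             theta = modtest$theta,
                             thres = 20)
```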