Package: sentopics 0.7.4

Olivier Delmarcelle

sentopics: Tools for Joint Sentiment and Topic Analysis of Textual Data

A framework that joins topic modeling and sentiment analysis of textual data. The package implements a fast Gibbs sampling estimation of Latent Dirichlet Allocation (Griffiths and Steyvers (2004) <doi:10.1073/pnas.0307752101>) and Joint Sentiment/Topic Model (Lin, He, Everson and Ruger (2012) <doi:10.1109/TKDE.2011.48>). It offers a variety of helpers and visualizations to analyze the result of topic modeling. The framework also allows enriching topic models with dates and externally computed sentiment measures. A flexible aggregation scheme enables the creation of time series of sentiment or topical proportions from the enriched topic models. Moreover, a novel method jointly aggregates topic proportions and sentiment measures to derive time series of topical sentiment.

Authors: Olivier Delmarcelle [aut, cre], Samuel Borms [ctb], Chenghua Lin [cph], Yulan He [cph], Jose Bernardo [cph], David Robinson [cph], Julia Silge [cph]

sentopics_0.7.4.tar.gz
sentopics_0.7.4.tar.gz (r-4.5-noble), sentopics_0.7.4.tar.gz (r-4.4-noble)
sentopics_0.7.4.tgz (r-4.4-emscripten), sentopics_0.7.4.tgz (r-4.3-emscripten)
sentopics.pdf | sentopics.html
sentopics/json (API)
NEWS

# Install 'sentopics' in R:
install.packages('sentopics', repos = 'https://cloud.r-project.org')

Bug tracker: https://github.com/odelmarcelle/sentopics/issues (2 issues)

Uses libs:
  • openblas – Optimized BLAS
  • c++ – GNU Standard C++ Library v3
  • openmp – GCC OpenMP (GOMP) support library

Score: 3.30 | Downloads: 300 | Exports: 38 | Dependencies: 22

Last updated 6 months ago from: b5438b3e47. Checks: 2 OK, 1 NOTE. Indexed: no.

Target               | Result | Latest binary
Doc / Vignettes      | OK     | Mar 20 2025
R-4.5-linux-x86_64   | OK     | Mar 20 2025
R-4.4-linux-x86_64   | NOTE   | Mar 20 2025

Exports: as.JST, as.LDA, as.LDA_lda, as.rJST, as.sentopicmodel, as.tokens, chainsDistances, chainsScores, coherence, compute_PicaultRenault_scores, docvars, fit, fit.sentopicmodel, grow, JST, LDA, LDAvis, melt, mergeTopics, plot_proportion_topics, plot_sentiment_breakdown, plot_sentiment_topics, plot_topWords, proportion_topics, reset, rJST, sentiment_breakdown, sentiment_series, sentiment_topics, sentopicmodel, sentopics_date, sentopics_date<-, sentopics_labels, sentopics_labels<-, sentopics_sentiment, sentopics_sentiment<-, tokens, topWords

Dependencies: cli, data.table, fastmatch, generics, glue, ISOcodes, jsonlite, lattice, lifecycle, magrittr, Matrix, quanteda, Rcpp, RcppArmadillo, RcppHungarian, RcppProgress, rlang, SnowballC, stopwords, stringi, xml2, yaml

Basic usage

Rendered from Basic_usage.Rmd using knitr::rmarkdown on Mar 20 2025.

Last update: 2024-04-19
Started: 2022-03-10

Topical time series

Rendered from Topical_time_series.Rmd using knitr::rmarkdown on Mar 20 2025.

Last update: 2024-04-19
Started: 2022-03-10

Citation

To cite package ‘sentopics’ in publications use:

Delmarcelle O (2024). sentopics: Tools for Joint Sentiment and Topic Analysis of Textual Data. R package version 0.7.4, https://CRAN.R-project.org/package=sentopics.

Corresponding BibTeX entry:

  @Manual{,
    title = {sentopics: Tools for Joint Sentiment and Topic Analysis of
      Textual Data},
    author = {Olivier Delmarcelle},
    year = {2024},
    note = {R package version 0.7.4},
    url = {https://CRAN.R-project.org/package=sentopics},
  }

Readme and manuals

sentopics

Installation

A stable version of sentopics is available on CRAN:

install.packages("sentopics")

The latest development version can be installed from GitHub:

devtools::install_github("odelmarcelle/sentopics") 

The development version requires the appropriate tools to compile C++ and Fortran source code.
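
One way to verify that a compiler toolchain is available is through the pkgbuild package (a sketch; pkgbuild is not a dependency of sentopics, and this check does not specifically cover a Fortran compiler):

# install.packages("pkgbuild")
pkgbuild::has_build_tools(debug = TRUE)  # reports whether a working toolchain is found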

Basic usage

Using a sample of press conferences from the European Central Bank, an LDA model is easily created from a list of tokenized texts. See https://quanteda.io for details on tokens input objects and pre-processing functions.
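
For readers starting from raw text, a tokens object could be prepared with quanteda along these lines (a minimal sketch using two made-up documents; the actual pre-processing choices depend on the corpus):

library("quanteda")
# Two illustrative documents (hypothetical)
txts <- c(doc1 = "The Governing Council decided to keep interest rates unchanged.",
          doc2 = "Inflation in the euro area is expected to decline gradually.")
toks <- tokens(txts, remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_tolower() |>
  tokens_remove(stopwords("en"))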

library("sentopics")
print(ECB_press_conferences_tokens, 2)
# Tokens consisting of 3,860 documents and 5 docvars.
# 1_1 :
#  [1] "outcome"           "meeting"           "decision"         
#  [4] ""                  "ecb"               "general"          
#  [7] "council"           "governing_council" "executive"        
# [10] "board"             "accordance"        "escb"             
# [ ... and 7 more ]
# 
# 1_2 :
#  [1] ""              "state"         "government"    "member"       
#  [5] "executive"     "board"         "ecb"           "president"    
#  [9] "vice"          "president"     "date"          "establishment"
# [ ... and 13 more ]
# 
# [ reached max_ndoc ... 3,858 more documents ]
set.seed(123)
lda <- LDA(ECB_press_conferences_tokens, K = 3, alpha = .1)
lda <- fit(lda, 100)
lda
# An LDA model with 3 topics. Currently fitted by 100 Gibbs sampling iterations.
# ------------------Useful methods------------------
# fit       :Estimate the model using Gibbs sampling
# topics    :Return the most important topic of each document
# topWords  :Return a data.table with the top words of each topic/sentiment
# plot      :Plot a sunburst chart representing the estimated mixtures
# This message is displayed once per session, unless calling `print(x, extended = TRUE)`

There are various ways to extract results from the model: either access the estimated mixtures directly from the lda object or use one of the helper functions.

# The document-topic distributions
head(lda$theta) 
#       topic
# doc_id      topic1    topic2      topic3
#    1_1 0.005780347 0.9884393 0.005780347
#    1_2 0.004291845 0.9914163 0.004291845
#    1_3 0.015873016 0.9682540 0.015873016
#    1_4 0.009708738 0.9805825 0.009708738
#    1_5 0.008849558 0.9823009 0.008849558
#    1_6 0.006993007 0.9160839 0.076923077
# The document-topic in a 'long' format & optionally with meta-data
head(melt(lda, include_docvars = FALSE))
#     topic    .id        prob
#    <fctr> <char>       <num>
# 1: topic1    1_1 0.005780347
# 2: topic1    1_2 0.004291845
# 3: topic1    1_3 0.015873016
# 4: topic1    1_4 0.009708738
# 5: topic1    1_5 0.008849558
# 6: topic1    1_6 0.006993007
# The most probable words per topic
topWords(lda, output = "matrix") 
#       topic1        topic2              topic3           
#  [1,] "growth"      "governing_council" "euro_area"      
#  [2,] "annual"      "fiscal"            "economic"       
#  [3,] "rate"        "euro_area"         "growth"         
#  [4,] "price"       "country"           "price"          
#  [5,] "loan"        "growth"            "risk"           
#  [6,] "monetary"    "policy"            "inflation"      
#  [7,] "inflation"   "reform"            "development"    
#  [8,] "euro_area"   "structural"        "price_stability"
#  [9,] "development" "market"            "quarter"        
# [10,] "financial"   "bank"              "outlook"

Two visualizations are also implemented: plot_topWords() displays the most probable words of each topic and plot() summarizes the topic proportions and their top words.

plot(lda)
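
The top-words chart can likewise be produced directly from the fitted model; with the default arguments this should reduce to:

plot_topWords(lda)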

After incorporating date and sentiment metadata (if they are not already present in the tokens input), the time-series functions allow studying the evolution of topic proportions and related sentiment.

sentopics_date(lda)  |> head(2)
#       .id      .date
#    <char>     <Date>
# 1:    1_1 1998-06-09
# 2:    1_2 1998-06-09
sentopics_sentiment(lda) |> head(2)
#       .id  .sentiment
#    <char>       <num>
# 1:    1_1 -0.01470588
# 2:    1_2 -0.02500000
proportion_topics(lda, period = "month") |> head(2)
#                topic1    topic2     topic3
# 1998-06-01 0.04004786 0.9100265 0.04992568
# 1998-07-01 0.17387955 0.7276814 0.09843903
plot_sentiment_breakdown(lda, period = "quarter", rolling_window = 3)
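
If the tokens input did not already carry this metadata, the internal date and sentiment fields could be set manually through the replacement functions sentopics_date<- and sentopics_sentiment<-. A minimal sketch, assuming my_dates and my_sentiment are hypothetical vectors with one value per document (see the package manual for the exact expected formats):

# Hypothetical vectors, one element per document of the model
sentopics_date(lda) <- my_dates           # document dates, e.g. a Date vector
sentopics_sentiment(lda) <- my_sentiment  # externally computed numeric sentiment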

Advanced usage

Refer to the vignettes of the package for a more extensive introduction to its features:

vignette("Basic_usage", package = "sentopics")
vignette("Topical_time_series", package = "sentopics")

Help Manual

Help page | Topics
Tools for joining sentiment and topic analysis (sentopics) | sentopics-package sentopics
Conversions from other packages to LDA | as.LDA as.LDA.keyATM_output as.LDA.LDA_Gibbs as.LDA.LDA_VEM as.LDA.STM as.LDA.textmodel_lda as.LDA_lda
Convert back a dfm to a tokens object | as.tokens.dfm
Distances between topic models (chains) | chainsDistances
Compute scores of topic models (chains) | chainsScores
Coherence of estimated topics | coherence
Compute scores using the Picault-Renault lexicon | compute_PicaultRenault_scores
Corpus of press conferences from the European Central Bank | ECB_press_conferences
Tokenized press conferences | ECB_press_conferences_tokens
Estimate a topic model | fit.JST fit.LDA fit.multiChains fit.rJST fit.sentopicmodel grow grow.JST grow.LDA grow.multiChains grow.rJST grow.sentopicmodel
Download press conferences from the European Central Bank | get_ECB_press_conferences
Download and pre-process speeches from the European Central Bank | get_ECB_speeches
Create a Joint Sentiment/Topic model | JST
Create a Latent Dirichlet Allocation model | LDA
Visualize a LDA model using 'LDAvis' | LDAvis
Loughran-McDonald lexicon | LoughranMcDonald
Replacement generic for 'data.table::melt()' | melt
Melt for sentopicmodels | melt.sentopicmodel
Merge topics into fewer themes | mergeTopics
Picault-Renault lexicon | PicaultRenault
Regression dataset based on Picault & Renault (2017) | PicaultRenault_data
Plot the distances between topic models (chains) | plot.multiChains
Plot a topic model using Plotly | plot.sentopicmodel
Print method for sentopics models | print.JST print.LDA print.rJST print.sentopicmodel
Compute the topic or sentiment proportion time series | plot_proportion_topics proportion_topics
Re-initialize a topic model | reset
Create a Reversed Joint Sentiment/Topic model | rJST rJST.default rJST.LDA
Breakdown the sentiment into topical components | plot_sentiment_breakdown sentiment_breakdown
Compute a sentiment time series | sentiment_series
Compute time series of topical sentiments | plot_sentiment_topics sentiment_topics
Internal date | sentopics_date sentopics_date<-
Setting topic or sentiment labels | sentopics_labels sentopics_labels<-
Internal sentiment | sentopics_sentiment sentopics_sentiment<-
Extract the most representative words from topics | plot_topWords topWords