--- title: "SentimentAnalysis Vignette" author: - "Stefan Feuerriegel" - "Nicolas Proellochs" date: "`r Sys.Date()`" bibliography: bibliography.bib output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{SentimentAnalysis Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The `SentimentAnalysis` package introduces a powerful toolchain facilitating the sentiment analysis of textual contents in R. This implementation utilizes various existing dictionaries, such as QDAP, Harvard IV and Loughran-McDonald. Furthermore, it can also create customized dictionaries. The latter function uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable. Finally, all methods can be easily compared using built-in evaluation routines. # Introduction Sentiment analysis is a research branch located at the heart of natural language processing (NLP), computational linguistics and text mining. It refers to any measures by which subjective information is extracted from textual documents. In other words, it extracts the polarity of the expressed opinion in a range spanning from positive to negative. As a result, one may also refer to sentiment analysis as *opinion mining* [@Pang.2008]. ## Applications in research Sentiment analysis has received great traction lately [@Ravi.2015; @Pang.2008], which we explore in the following. Current research in finance and the social sciences utilizes sentiment analysis to understand human decisions in response to textual materials. This immediately reveals manifold implications for practitioners, as well as those involved in the fields of finance research and the social sciences: researchers can use R to extract text components that are relevant for readers and test their hypotheses on this basis. By the same token, practitioners can measure which wording actually matters to their readership and enhance their writing accordingly [@ECIS.2015]. We demonstrate below the added benefits in two case studies drawn from finance and the social sciences. ## Applications in practice Several applications demonstrate the uses of sentiment analysis for organizations and enterprises: * **Finance:** Investors in financial markets refer to textual information in the form of financial news disclosures before exercising ownership in stocks. Interestingly, they rely not only on quantitative numbers, but also soft information, such as tone and sentiment [@Henry.2008; @Loughran.2011; @Tetlock.2007], which thereby strongly influences stock prices. By utilizing sentiment analysis, automated traders can automatically analyze the sentiment conveyed in financial disclosures in order to trigger investment decisions within milliseconds. * **Marketing:** Marketing departments are often interested in tracking brand image. For that purpose, they collect large volumes of user opinions from social media and evaluate the feelings of individuals towards brands, products and services. Practitioners in the field of marketing can exploit these insights to enhance their wording according to the feedback of their readership. * **Rating and review platforms:** Rating and review platforms fulfill a valuable function by collecting user ratings or preferences for certain products and services. Here, one can automatically process large volumes of user-generated content and exploit the knowledge gained thereby. 
# Setup of the SentimentAnalysis package

Even though sentiment analysis has received great traction lately, the available tools are not yet living up to the needs of researchers. The `SentimentAnalysis` package is intended to partially close this gap and offer capabilities that most research demands. First, simply install the package `SentimentAnalysis` from CRAN. Afterwards, one merely needs to load it as follows:

```{r}
# install.packages("SentimentAnalysis")
library(SentimentAnalysis)
```

# Brief demonstration

```{r}
# Analyze a single string to obtain a binary response (positive / negative)
sentiment <- analyzeSentiment("Yeah, this was a great soccer game for the German team!")
convertToBinaryResponse(sentiment)$SentimentQDAP
```

```{r}
# Create a vector of strings
documents <- c("Wow, I really like the new light sabers!",
               "That book was excellent.",
               "R is a fantastic language.",
               "The service in this restaurant was miserable.",
               "This is neither positive nor negative.",
               "The waiter forgot about my dessert -- what poor service!")

# Analyze sentiment
sentiment <- analyzeSentiment(documents)

# Extract dictionary-based sentiment according to the QDAP dictionary
sentiment$SentimentQDAP

# View sentiment direction (i.e. positive, neutral and negative)
convertToDirection(sentiment$SentimentQDAP)

response <- c(+1, +1, +1, -1, 0, -1)

compareToResponse(sentiment, response)
compareToResponse(sentiment, convertToBinaryResponse(response))

plotSentimentResponse(sentiment$SentimentQDAP, response)
```

The `SentimentAnalysis` package removes most of the effort for the user: it recognizes that the user has inserted a vector of strings and thus automatically performs a set of default preprocessing operations from text mining. Hence, it tokenizes each document and finally converts the input into a document-term matrix. All of these operations are undertaken without manual specification. The `analyzeSentiment()` routine also accepts other input formats in case the user has already performed a preprocessing step or wants to implement a specific set of operations.

# Functionality

The following sections present the functionality in terms of working with different input formats and the underlying dictionaries.

## Interface

The `SentimentAnalysis` package provides interfaces for several input formats, among which are:

* Vector of strings.
* `DocumentTermMatrix` and `TermDocumentMatrix` as implemented in the `tm` package [@Feinerer.2008].
* `Corpus` object as implemented by the `tm` package [@Feinerer.2008].

We provide examples in the following.

### Vector of strings

```{r}
documents <- c("This is good", "This is bad", "This is inbetween")
convertToDirection(analyzeSentiment(documents)$SentimentQDAP)
```

### Corpus object

```{r}
library(tm)
corpus <- VCorpus(VectorSource(documents))
convertToDirection(analyzeSentiment(corpus)$SentimentQDAP)
```

### Document-term matrix

```{r}
dtm <- preprocessCorpus(corpus)
convertToDirection(analyzeSentiment(dtm)$SentimentQDAP)
```

Since the package can work directly with a document-term matrix, this allows one to use customized preprocessing operations in the first place. Afterwards, one can utilize the `SentimentAnalysis` package for the computation of sentiment scores. For instance, one can replace the stopwords with those from a different list, or even perform tailored synonym merging, among other options. By default, the package uses the built-in routines `transformIntoCorpus()` to convert the input into a `Corpus` object and `preprocessCorpus()` to convert it into a `DocumentTermMatrix`.
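To make this chain explicit, the following sketch replicates the default conversion manually. Any of the intermediate steps could be substituted by customized operations before the resulting matrix is handed over to the sentiment analysis.

```{r,eval=FALSE}
# Replicate the default conversion chain manually:
# vector of strings -> Corpus -> DocumentTermMatrix
corpus <- transformIntoCorpus(documents)
dtm <- preprocessCorpus(corpus)

# The document-term matrix can then be passed to analyzeSentiment()
convertToDirection(analyzeSentiment(dtm)$SentimentQDAP)
```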
## Built-in dictionaries

The `SentimentAnalysis` package entails four different dictionaries:

* Harvard-IV dictionary
* Henry's Financial dictionary [@Henry.2008]
* Loughran-McDonald Financial dictionary [@Loughran.2011]
* QDAP dictionary from the package [`qdapDictionaries`](https://cran.r-project.org/package=qdapDictionaries)

All of them can be manually inspected and even accessed as follows:

```{r}
# Make dictionary available in the current R environment
data(DictionaryHE)
# Display the internal structure
str(DictionaryHE)

# Access dictionary as an object of type SentimentDictionary
dict.HE <- loadDictionaryHE()
# Print summary statistics of dictionary
summary(dict.HE)

data(DictionaryLM)
str(DictionaryLM)
```

## Dictionary functions

The `SentimentAnalysis` package distinguishes between three different types of dictionaries. They differ in the data they store, which ultimately also controls which methods of sentiment analysis one can apply. The dictionaries are as follows:

* `SentimentDictionaryWordlist` contains a list of words belonging to a single category. For instance, it can bundle a list of uncertainty words in order to compute the ratio of uncertainty words in a particular document.
* `SentimentDictionaryBinary` stores two lists of words, one for positive and one for negative entries. This allows one to later compute the polarity of a document on a scale from very positive to very negative. However, the categories are not further distinguished or rated, i.e. all positive words are assigned the same degree of positivity.
* `SentimentDictionaryWeighted` allows words to take on continuous sentiment scores. This allows one, for instance, to rate *increase* as being more positive than *improve*. These weights can then be utilized within a linear model. For this purpose, the `SentimentDictionaryWeighted` also entails an intercept. It can also store an additional factor in order to revert the weighting by an inverse document frequency.

### SentimentDictionaryWordlist

```{r}
d <- SentimentDictionaryWordlist(c("uncertain", "possible", "likely"))
summary(d)

# Alternative call
d <- SentimentDictionary(c("uncertain", "possible", "likely"))
summary(d)
```

### SentimentDictionaryBinary

```{r}
d <- SentimentDictionaryBinary(c("increase", "rise", "more"),
                               c("fall", "drop"))
summary(d)

# Alternative call
d <- SentimentDictionary(c("increase", "rise", "more"),
                         c("fall", "drop"))
summary(d)
```

### SentimentDictionaryWeighted

```{r}
d <- SentimentDictionaryWeighted(c("increase", "decrease", "exit"),
                                 c(+1, -1, -10),
                                 rep(NA, 3))
summary(d)

# Alternative call
d <- SentimentDictionary(c("increase", "decrease", "exit"),
                         c(+1, -1, -10),
                         rep(NA, 3))
summary(d)
```

# Dictionary generation

The following example shows how the `SentimentAnalysis` package can extract statistically relevant textual drivers based on an exogenous response variable. The details of this method are presented in [@PLOSONE.2018], while we provide a brief summary here.

Let $R$ denote a response variable in the form of a vector. Furthermore, the variables $w_1, \ldots, w_n$ give the number of occurrences of each word $w_i$ in a document. The methodology then estimates a linear model

$$R = \alpha + \beta_1 w_1 + \ldots + \beta_n w_n$$

with intercept $\alpha$ and coefficients $\beta_1, \ldots, \beta_n$. The estimation routine is based on LASSO regularization, which implicitly performs variable selection. In so doing, it sets some of the coefficients $\beta_i$ to exactly zero. The remaining words can then be ranked by polarity according to their coefficients.

```{r}
# Create a vector of strings
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)

# Generate dictionary with LASSO regularization
dict <- generateDictionary(documents, response)
dict
summary(dict)
```

In practice, users have several options for fine-tuning. Among these, they can disable the intercept $\alpha$ and fix it to zero, or standardize the response variable $R$. In addition, it is possible to replace the LASSO with any variant of the elastic net, simply by changing the argument `alpha`.
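The following sketch illustrates such fine-tuning. The argument names `intercept` and `alpha` are inferred from the description above and should be verified against `?generateDictionary` for the installed version of the package.

```{r,eval=FALSE}
# Disable the intercept of the linear model and fix it to zero
# (argument name assumed from the package documentation)
dict <- generateDictionary(documents, response, intercept=FALSE)

# Replace the LASSO by an elastic net with mixing parameter alpha=0.5,
# i.e. an equal weighting of the L1 and L2 penalties
dict <- generateDictionary(documents, response, alpha=0.5)
```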
Finally, one can save and reload dictionaries using `read()` and `write()` as follows:

```{r,eval=FALSE}
write(dict, file="dictionary.dict")
dict <- read("dictionary.dict")
```

## Performance evaluation

Ultimately, several routines allow one to explore the generated dictionary further. On the one hand, a simple overview can be displayed by means of the `summary()` routine. On the other hand, a kernel density estimation can visualize the distribution of positive and negative words. For instance, one can identify whether the opinionated words are skewed towards either end of the polarity scale. Lastly, the `compareDictionaries()` routine can compare the generated dictionary to dictionaries from the literature. It automatically computes various metrics, among which are the overlap and the correlation.
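As a sketch of the kernel density visualization, the call below assumes that `plot()` dispatches on objects of class `SentimentDictionaryWeighted`, as generated above.

```{r,eval=FALSE}
# Kernel density estimate of the word weights; mass concentrated at
# either end of the scale indicates skewed polarity
plot(dict)
```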
```{r}
# Compare the generated dictionary to the QDAP dictionary
compareDictionaries(dict,
                    loadDictionaryQDAP())

# Apply the generated dictionary to the original documents
sentiment <- predict(dict, documents)
compareToResponse(sentiment, response)
plotSentimentResponse(sentiment, response)
```

The following example demonstrates how a generated dictionary can be used for predicting the sentiment of out-of-sample data. In addition, the code then evaluates the prediction performance by comparing it to the built-in dictionaries.

```{r}
test_documents <- c("This is neither good nor bad",
                    "What a good idea!",
                    "Not bad")
test_response <- c(0, 1, 1)

pred <- predict(dict, test_documents)

compareToResponse(pred, test_response)
plotSentimentResponse(pred, test_response)

compareToResponse(analyzeSentiment(test_documents), test_response)
```

## Configuration of preprocessing

When desired, one can implement a tailored preprocessing stage that adapts to specific needs. The following code snippet demonstrates such an adaptation. In particular, the `SentimentAnalysis` package ships a function `ngram_tokenize()` in order to extract $n$-grams from the corpus. This does not affect the results of the built-in dictionaries, but rather changes the features used as part of dictionary generation.

```{r}
corpus <- VCorpus(VectorSource(documents))
tdm <- TermDocumentMatrix(corpus,
                          control=list(wordLengths=c(1, Inf),
                                       tokenize=function(x) ngram_tokenize(x, char=FALSE,
                                                                           ngmin=1, ngmax=2)))
rownames(tdm)

dict <- generateDictionary(tdm, response)
summary(dict)
dict
```

## Performance optimization

Once users have decided upon a preferred rule, they can adapt the `analyzeSentiment()` routine by restricting it to calculate only the rules of interest. Such behavior can be implemented by changing the default value of the argument `rules`. See the following code snippet for an example:

```{r}
sentiment <- analyzeSentiment(documents,
                              rules=list("SentimentLM"=list(ruleSentiment,
                                                            loadDictionaryLM())))
sentiment
```

## Language support and extensibility

`SentimentAnalysis` can be adapted for use with languages other than English. In order to do this, one needs to introduce changes at two points:

* **Preprocessing:** The built-in routines use a parameter `language="english"` to perform all preprocessing operations for the English language. Instead, one might prefer to change stemming and stopwords to the desired language. If one wishes to make further changes to the preprocessing, it might be beneficial to replace the automatic preprocessing with one's own routines, which then return a `DocumentTermMatrix`.
* **Dictionary:** If one has a response or baseline variable, one can use the dictionary generation approach that is shipped with `SentimentAnalysis`. This can then automatically generate a dictionary of positive and negative words for the given language. Otherwise, if one has no baseline variable at hand, one needs to load a dictionary for that language. It might be worthwhile to search online for pre-defined lists of positive and negative words.

The following example demonstrates how `SentimentAnalysis` can be adapted to work with a sample in German. Here, we supply a positive and a negative document in the variable `documents`. Afterwards, we introduce a very small dictionary of positive and negative words, which is stored in `dictionaryGerman`. Finally, we use `analyzeSentiment()` to perform a sentiment analysis with the following changes: first of all, we supply `language="german"` to ensure that all preprocessing operations are performed for the German language. Additionally, we define a custom rule `GermanSentiment` that uses our customized dictionary.

```{r}
documents <- c("Das ist ein gutes Resultat",
               "Das Ergebnis war schlecht")

dictionaryGerman <- SentimentDictionaryBinary(c("gut"),
                                              c("schlecht"))

sentiment <- analyzeSentiment(documents,
                              language="german",
                              rules=list("GermanSentiment"=list(ruleSentiment,
                                                                dictionaryGerman)))
sentiment

convertToBinaryResponse(sentiment$GermanSentiment)
```

Similarly, one can implement a dictionary with custom sentiment scores.

```{r}
woorden <- c("goed", "slecht")
scores <- c(0.8, -0.5)
dictionaryDutch <- SentimentDictionaryWeighted(woorden, scores)

documents <- "dit is heel slecht"

sentiment <- analyzeSentiment(documents,
                              language="dutch",
                              rules=list("DutchSentiment"=list(ruleLinearModel,
                                                               dictionaryDutch)))
sentiment
```

Notes:

* The argument `rules` is a named list of approaches, where each entry specifies a combination of a rule and a dictionary.
* Caution is needed when working with stemming. The default routines of `SentimentAnalysis` automatically perform stemming. Therefore, it is necessary to include stemmed terms in the original dictionary. One can easily achieve such a conversion by calling `tm::stemDocument()`, as the sketch below demonstrates.
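For instance, the following sketch stems a small set of German words before constructing the dictionary, so that the entries match the stemmed document terms. The word lists are hypothetical and merely serve as an illustration.

```{r,eval=FALSE}
# Stem the dictionary entries so that they match the stemmed document terms
positiveWords <- tm::stemDocument(c("gutes", "fantastisch"), language="german")
negativeWords <- tm::stemDocument(c("schlechtes", "miserabel"), language="german")

dictionaryGermanStemmed <- SentimentDictionaryBinary(positiveWords,
                                                     negativeWords)
summary(dictionaryGermanStemmed)
```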
# Worked examples

The following example shows the usage of `SentimentAnalysis` in an applied setting. More precisely, we utilize Reuters oil-related news from the `tm` package.

```{r}
library(tm)
data("crude")

# Analyze sentiment
sentiment <- analyzeSentiment(crude)

# Count positive and negative news releases
table(convertToBinaryResponse(sentiment$SentimentLM))

# News releases with highest and lowest sentiment
crude[[which.max(sentiment$SentimentLM)]]$meta$heading
crude[[which.min(sentiment$SentimentLM)]]$meta$heading

# View summary statistics of sentiment variable
summary(sentiment$SentimentLM)

# Visualize distribution of standardized sentiment variable
hist(scale(sentiment$SentimentLM))

# Compute cross-correlation
cor(sentiment[, c("SentimentLM", "SentimentHE", "SentimentQDAP")])

# Crude oil news between 1987-02-26 and 1987-03-02
datetime <- do.call(c, lapply(crude, function(x) x$meta$datetimestamp))

plotSentiment(sentiment$SentimentLM)
plotSentiment(sentiment$SentimentLM, x=datetime, cumsum=TRUE)
```

# Word counting

`SentimentAnalysis` can also be used to count the words in documents with the help of `countWords()`.

```{r}
# Count words (without stopwords)
countWords(documents)

# Count all words (including stopwords)
countWords(documents, removeStopwords=FALSE)
```

Note: The package has a built-in rule `ruleWordCount()`, which yields the "WordCount" column when calling `analyzeSentiment()`. However, this rule is subject to the preprocessing inside `analyzeSentiment()` and is thus likely to return different results: by default, the preprocessing removes stopwords, excludes words with three or fewer letters and might apply a sparsity operation. Hence, one should always use `countWords()` when working with word counts.
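The difference is easy to verify. The sketch below contrasts the "WordCount" column of `analyzeSentiment()` with the output of `countWords()` for a short example sentence.

```{r,eval=FALSE}
# Word count after the default preprocessing of analyzeSentiment() ...
analyzeSentiment("This is a very short example")$WordCount

# ... typically differs from the count reported by countWords()
countWords("This is a very short example")
```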
# Outlook

The current version leaves open avenues for further enhancement. In the future, we see the following items as potential subjects for improvement:

* **Negations:** We envision a generic negation rule object that can be injected in order to negate fixed windows or apply complex negation rules [@DSS.2016].
* **Multi-language support:** The current version has built-in dictionaries for the English language only. We think that the package would benefit greatly from support of further languages. In such a setup, one would not need to adapt the preprocessing routines, as the underlying `tm` package already supports further languages [@Feinerer.2008]. Instead, users would only need to tailor the applied dictionaries.

We cordially invite everyone to contribute source code, dictionaries and further demos.

# License

**SentimentAnalysis** is released under the [MIT License](https://opensource.org/license/mit/).

Copyright (c) 2021 Stefan Feuerriegel & Nicolas Pröllochs

# References