---
title: "sentixr with tidytext and quanteda"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{sentixr with tidytext and quanteda}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)
```

`sentixr` is designed to work alongside other R packages for text analysis. If you prefer the `tidytext` ecosystem or the `quanteda` framework, you can export the lexicons and use them in your existing pipelines.

## Setup

```{r}
library(sentixr)
```

We'll use the example data provided by the package:

```{r}
data(recensioni_tv)
recensioni_tv
```

## sentixr with tidytext

For the `tidytext` ecosystem, `sentixr` provides access to its lexicons in a tidy format, ready for joining.

```{r}
library(tidytext)
library(dplyr) # provides the joins, group_by(), summarise() and mutate() used below
```

### Get the Lexicon

Use `get_sentix()` to retrieve the lexicon as a tidy tibble with only two columns. For `tidytext` workflows on raw text (without lemmatization), the **MAL** lexicon is preferred because it contains inflected forms, which match the output of standard tokenizers.

```{r}
# Get the MAL lexicon (inflected forms)
mal_dict <- get_sentix("MAL")
head(mal_dict)
```

### Tokenize and Join

Use `tidytext::unnest_tokens()` to split the text into words.

```{r}
# Tokenize: one row per word
tidy_text <- recensioni_tv |>
  unnest_tokens(word, text)
```

```{r}
# Join with the lexicon, keeping unmatched words
tidy_sent <- tidy_text |>
  left_join(mal_dict, by = "word")
head(tidy_sent)
```

Here, `left_join()` keeps all words, even those without a score, so that the token count (`n_tokens`) remains accurate: `score` is `NA` for words not found in the lexicon. Alternatively, use `inner_join()` to keep only the words present in the lexicon.
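The difference between the two joins can be seen on a small example. The sketch below uses hypothetical toy tokens and a toy lexicon (`toy_tokens`, `toy_lexicon`), not the real MAL entries or `recensioni_tv`, and assumes `dplyr` is installed.

```{r}
library(dplyr)

# Hypothetical toy data, for illustration only
toy_tokens <- tibble(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  word   = c("film", "bello", "noioso", "trama", "ottima")
)
toy_lexicon <- tibble(
  word  = c("bello", "noioso", "ottima"),
  score = c(0.6, -0.5, 0.8)
)

# left_join() keeps every token; words missing from the lexicon get score = NA
kept_all <- left_join(toy_tokens, toy_lexicon, by = "word")

# inner_join() drops tokens that are missing from the lexicon
kept_scored <- inner_join(toy_tokens, toy_lexicon, by = "word")

nrow(kept_all)
nrow(kept_scored)
```

With these toy inputs, the left join returns all five tokens (two of them with `score = NA`), while the inner join returns only the three scored ones, so any token count computed afterwards would differ.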

### Analyze

To get the sentiment scores, you can use the native `sentix_summarize()` function for a quick analysis of the joined data:

```{r}
# Calculate average sentiment per document
sentix_summarize(tidy_sent, simplify = FALSE)
```

Alternatively, if you prefer custom metrics (e.g., standard deviation or median), you can group and summarize manually:

```{r}
# Manual summary with dplyr
tidy_sent |>
  group_by(doc_id) |>
  summarise(
    sentiment = mean(score, na.rm = TRUE),
    n_tokens = n(),
    n_scored = sum(!is.na(score))
  )
```

## Polarity Analysis

To perform polarity analysis (counting positive vs. negative words), retrieve the lexicon with `polarity = TRUE`, which assigns "positive"/"negative" labels, and join as before.

```{r}
# Get MAL with polarity labels
polar_dict <- get_sentix("MAL", polarity = TRUE)
head(polar_dict)
```

```{r}
# Join with tokenized text
tidy_text |>
  left_join(polar_dict, by = "word") |>
  head()
```

You can also derive polarity labels from continuous scores yourself with the `make_polarity()` function. Here the threshold is set to `0.125`: scores above 0.125 are "positive", scores below -0.125 are "negative", and scores in between are "neutral".

```{r}
mal_dict |>
  mutate(polarity = make_polarity(score, threshold = 0.125)) |>
  head()
```

Or, to convert all numeric scores directly into polarity labels:

```{r}
get_elita() |>
  mutate(across(where(is.numeric), ~ make_polarity(.x))) |>
  tail()
```

You can also set different thresholds for the positive and negative classes by providing a vector of two values, such as `c(0.125, -0.135)`.

## sentixr with quanteda

`sentixr` also includes helper functions that convert its lexicons into `quanteda::dictionary` objects, which makes them easy to use within the `quanteda` framework.

```{r}
library(quanteda)
```

```{r data}
data(recensioni_tv)
sentix_toks <- corpus(recensioni_tv) |>
  tokens(remove_punct = TRUE)
```

### Creating a quanteda Dictionary

The `df_to_dict()` function converts a data frame lexicon into a `quanteda` dictionary that can be used, for example, with `tokens_lookup()` or `dfm_lookup()`.

If the package `quanteda.sentiment` is installed, the function automatically assigns the appropriate polarity or valence attributes, making the dictionary compatible with `textstat_valence()` or `textstat_polarity()`.

The helper functions `df_to_valence()` and `df_to_polar()` are also available for explicit control.

```{r}
# Convert MAL to a valence dictionary
my_dict <- df_to_dict(mal_dict)
```

This is equivalent to:

```{r eval=FALSE}
df_to_valence(MAL)
```

If the `quanteda.sentiment` package is installed, the valence scores are automatically assigned to the dictionary's "valence" attribute, and the dictionary is ready for use with `textstat_valence()`:

```{r, eval = FALSE}
# Compute valence
quanteda.sentiment::textstat_valence(sentix_toks, dictionary = my_dict)
#>   doc_id  sentiment
#> 1   doc1  0.2689482
#> 2   doc2 -0.1755017
#> 3   doc3  0.2788701
#> 4   doc4  0.1295423
#> 5   doc5 -0.0208181
```

Otherwise, the function creates a standard `quanteda` dictionary.

To get a polarity dictionary:

```{r eval=FALSE}
my_dict2 <- get_sentix("MAL", polarity = TRUE) |>
  # useful if there are numeric columns other than 'polarity'
  df_to_polar()
```

This is equivalent to applying `df_to_dict()` to the polarity version of the lexicon:

```{r}
my_dict2 <- df_to_dict(polar_dict)
```

```{r, eval = FALSE}
# Compute polarity scores
quanteda.sentiment::textstat_polarity(sentix_toks, dictionary = my_dict2)
#>   doc_id sentiment
#> 1   doc1 2.8332133
#> 2   doc2 0.0000000
#> 3   doc3 1.4663371
#> 4   doc4 0.9555114
#> 5   doc5 0.0000000
```
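
Without `quanteda.sentiment`, a polarity dictionary can still be used for simple category counts with `tokens_lookup()`. The following is a minimal sketch, assuming a hypothetical hand-built lexicon (`toy_lex`) and toy documents in place of the output of `df_to_dict()` and `recensioni_tv`; the words and labels are illustrative only.

```{r}
library(quanteda)

# Hypothetical toy polarity lexicon (not the real MAL entries)
toy_lex <- data.frame(
  word     = c("bello", "ottimo", "brutto", "noioso"),
  polarity = c("positive", "positive", "negative", "negative")
)

# Build a plain quanteda dictionary with one key per polarity label
toy_dict <- dictionary(split(toy_lex$word, toy_lex$polarity))

# Toy documents, tokenized as in the section above
toy_toks <- tokens(
  c(d1 = "Film bello e ottimo cast", d2 = "Finale brutto e noioso"),
  remove_punct = TRUE
)

# Replace matched tokens with their dictionary key, then count keys per document
toy_counts <- dfm(tokens_lookup(toy_toks, dictionary = toy_dict))
toy_counts
```

The resulting document-feature matrix has one column per polarity label, which can then be combined into whatever polarity index suits the analysis.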