sentixr is designed
to be used with other R packages for text analysis. If you prefer the
tidytext ecosystem or the quanteda framework,
you can export the lexicons and use them in your existing pipelines.
We’ll use the example data provided by the package:
data(recensioni_tv)
recensioni_tv
#> doc_id
#> 1 doc1
#> 2 doc2
#> 3 doc3
#> 4 doc4
#> 5 doc5
#> text
#> 1 Ottimo prodotto, la qualità dell'immagine è buona, colori molto vivi.
#> 2 Ho riscontrato subito problemi; mi sono ostinato a fare delle prove, purtroppo senza risultati
#> 3 La tv è molto bella, ma la qualità dell'audio ha delle mancanze.
#> 4 Il prodotto va benissimo. C'è da dire che il costo irrisorio corrisponde ad alcuni limiti.
#> 5 I colori sono eccessivamente saturi, per non parlare dell'audio, a dir poco pessimo!
For the tidytext ecosystem, sentixr provides access to its lexicons
in a tidy format, ready for joining.
Use get_sentix() to retrieve the lexicon as a tidy tibble with only
two columns.
For tidytext workflows on raw text (without
lemmatization), the MAL lexicon is preferred because it
contains inflected forms, matching the output of standard
tokenizers.
Use tidytext::unnest_tokens() to split the text into
words.
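As a minimal sketch of the tokenization step, the snippet below uses an inline copy of two of the example reviews so that it runs without sentixr installed; with the package loaded, recensioni_tv would be passed instead. The object name tidy_text matches the one used in the join, while the commented get_sentix("MAL") call is an assumption about how mal_dict is created.

```r
library(tidytext)

# Inline copy of two example reviews (stand-in for recensioni_tv)
reviews <- data.frame(
  doc_id = c("doc1", "doc3"),
  text = c(
    "Ottimo prodotto, la qualità dell'immagine è buona, colori molto vivi.",
    "La tv è molto bella, ma la qualità dell'audio ha delle mancanze."
  )
)

# One row per token; unnest_tokens() lowercases and strips punctuation by default
tidy_text <- unnest_tokens(reviews, output = word, input = text)

# Assumed call for the lexicon used in the join (requires sentixr):
# mal_dict <- sentixr::get_sentix("MAL")

head(tidy_text)
```

Because the tokens are lowercased, inflected surface forms, they can be matched directly against an inflected-form lexicon such as MAL.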
# Join with lexicon
tidy_sent <- tidy_text |>
  left_join(mal_dict, by = "word")
head(tidy_sent)
#> doc_id word score
#> 1 doc1 ottimo 0.5625000
#> 2 doc1 prodotto 0.1250000
#> 3 doc1 la 0.0000000
#> 4 doc1 qualità 0.3631757
#> 5 doc1 dell'immagine NA
#> 6 doc1 è 0.1256011
Here, left_join() keeps all words, including those without a score,
so that the token count (n_tokens) remains accurate: score is NA
for words not found in the lexicon.
Alternatively, inner_join() can be used to keep only
the words present in the lexicon.
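The difference between the two joins can be sketched on a three-token toy example. The token and lexicon tables below are hand-built stand-ins (the two scores are copied from the output above), not objects returned by sentixr.

```r
library(dplyr)

# Toy token table and toy lexicon (stand-ins for tidy_text and mal_dict)
tokens <- data.frame(
  doc_id = "doc1",
  word = c("ottimo", "prodotto", "dell'immagine")
)
lexicon <- data.frame(
  word = c("ottimo", "prodotto"),
  score = c(0.5625, 0.125)
)

# left_join() keeps the unmatched token with score = NA
all_tokens <- left_join(tokens, lexicon, by = "word")

# inner_join() drops it
scored_only <- inner_join(tokens, lexicon, by = "word")

nrow(all_tokens)   # 3
nrow(scored_only)  # 2
```

The mean score is the same either way once NA values are removed; only the token count differs, which matters if you later normalize by document length.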
To get the sentiment scores, you can use the native
sentix_summarize() function for a quick analysis on the
joined data:
# Calculate average sentiment per document
sentix_summarize(tidy_sent, simplify = FALSE)
#> # A tibble: 5 × 4
#> doc_id score n_tokens n_scored
#> <chr> <dbl> <int> <int>
#> 1 doc1 0.239 10 9
#> 2 doc2 -0.160 14 11
#> 3 doc3 0.195 12 10
#> 4 doc4 0.115 15 9
#> 5 doc5 -0.0208 13 8
Alternatively, if you prefer custom metrics (e.g., standard deviation or median), you can manually group and summarize:
# Manual summary with dplyr
tidy_sent |>
  group_by(doc_id) |>
  summarise(
    sentiment = mean(score, na.rm = TRUE),
    n_tokens = n(),
    n_scored = sum(!is.na(score))
  )
#> # A tibble: 5 × 4
#> doc_id sentiment n_tokens n_scored
#> <chr> <dbl> <int> <int>
#> 1 doc1 0.239 10 9
#> 2 doc2 -0.160 14 11
#> 3 doc3 0.195 12 10
#> 4 doc4 0.115 15 9
#> 5 doc5 -0.0208 13 8
To perform polarity analysis (counting positive vs. negative words),
retrieve the lexicon with polarity = TRUE (which assigns
“positive”, “negative”, or “neutral” labels), and join as before.
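The counting step itself can be sketched with dplyr::count(). The joined table below is typed in by hand from the first six doc1 tokens and their labels, so the snippet runs without sentixr; the column name n_words is chosen here for illustration only.

```r
library(dplyr)

# Hand-typed stand-in for a token table already joined to polarity labels
joined <- data.frame(
  doc_id = "doc1",
  word = c("ottimo", "prodotto", "la", "qualità", "dell'immagine", "è"),
  polarity = c("positive", "positive", "neutral", "positive", NA, "positive")
)

# Drop tokens missing from the lexicon, then tally labels per document
polarity_counts <- joined |>
  filter(!is.na(polarity)) |>
  count(doc_id, polarity, name = "n_words")

polarity_counts
```

Whether neutral words should enter the tally is a design choice; filtering them out before count() leaves only the positive and negative totals.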
# Get MAL with polarity labels
polar_dict <- get_sentix("MAL", polarity = TRUE)
head(polar_dict)
#> # A tibble: 6 × 2
#> word polarity
#> <chr> <chr>
#> 1 genere_lucilia negative
#> 2 gotta negative
#> 3 pianississimo negative
#> 4 posse positive
#> 5 siboglinidae positive
#> 6 cacaphony positive
# Join with tokenized text
tidy_text |>
  left_join(polar_dict, by = "word") |>
  head()
#> doc_id word polarity
#> 1 doc1 ottimo positive
#> 2 doc1 prodotto positive
#> 3 doc1 la neutral
#> 4 doc1 qualità positive
#> 5 doc1 dell'immagine <NA>
#> 6 doc1 è positive
It is also possible to derive polarity labels from continuous
scores with custom cut-offs, using the make_polarity() function.
Here the threshold is set to 0.125: as the output shows, scores of
0.125 or above are “positive”, scores of -0.125 or below are “negative”,
and scores in between are “neutral”:
mal_dict |>
  mutate(polarity = make_polarity(score,
                                  threshold = 0.125)) |>
  head()
#> # A tibble: 6 × 3
#> word score polarity
#> <chr> <dbl> <chr>
#> 1 genere_lucilia -0.25 negative
#> 2 gotta -0.375 negative
#> 3 pianississimo -0.125 negative
#> 4 posse 0.125 positive
#> 5 siboglinidae 0.125 positive
#> 6 cacaphony 0.777 positive
Or, to directly convert all numeric scores into polarity labels:
get_elita() |>
  mutate(across(where(is.numeric),
                ~ make_polarity(.x))) |>
  tail()
#> # A tibble: 6 × 4
#> lemma valenza attivazione dominanza
#> <chr> <chr> <chr> <chr>
#> 1 spropositato positive positive positive
#> 2 strascico negative negative negative
#> 3 suggestivo positive positive positive
#> 4 ufficialmente neutral neutral negative
#> 5 verificare negative positive negative
#> 6 vorticoso positive positive negative
It is also possible to set different thresholds for the positive and
negative classes by providing a vector of two values, such as
c(0.125, -0.135).
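To make the asymmetric case concrete, here is a small base R stand-in for the labelling logic. The function label_polarity() is hypothetical and is not sentixr's implementation; it assumes that the first value is the positive cut-off, the second the negative one, and that both boundaries are inclusive, as the output for threshold = 0.125 suggests.

```r
# Hypothetical stand-in for the thresholding logic (not sentixr code)
label_polarity <- function(score, threshold = c(0.125, -0.135)) {
  pos_cut <- threshold[1]  # assumed: first value is the positive cut-off
  neg_cut <- threshold[2]  # assumed: second value is the negative cut-off

  labels <- rep("neutral", length(score))
  labels[which(score >= pos_cut)] <- "positive"
  labels[which(score <= neg_cut)] <- "negative"
  labels[is.na(score)] <- NA_character_
  labels
}

label_polarity(c(0.2, -0.13, -0.2, NA))
#> [1] "positive" "neutral"  "negative" NA
```

With the package itself, the corresponding call would presumably be make_polarity(score, threshold = c(0.125, -0.135)), assuming the vector goes through the same threshold argument as the single value.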
sentixr also includes helper functions to convert its
lexicons into quanteda::dictionary objects, facilitating
integration with the quanteda framework.
The df_to_dict() function converts a data frame lexicon
into a quanteda dictionary that can be used, for example, with
tokens_lookup() or dfm_lookup().
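For orientation, the sketch below builds a quanteda dictionary by hand from a small polarity data frame and applies it with tokens_lookup(). It only illustrates the kind of object such a conversion produces and how the lookup behaves; it does not reproduce what df_to_dict() does internally. The four-word lexicon and its labels are made up for the example.

```r
library(quanteda)

# Made-up toy lexicon with polarity labels (not taken from MAL)
polar_df <- data.frame(
  word = c("ottimo", "buona", "pessimo", "problemi"),
  polarity = c("positive", "positive", "negative", "negative")
)

# One dictionary key per label, each holding its words
manual_dict <- dictionary(split(polar_df$word, polar_df$polarity))

toks <- tokens(
  c(doc1 = "Ottimo prodotto, la qualità dell'immagine è buona",
    doc5 = "audio a dir poco pessimo"),
  remove_punct = TRUE
)

# Replace matched tokens with their key; unmatched tokens are dropped by default
hits <- tokens_lookup(toks, dictionary = manual_dict)

# Document-feature matrix of key counts
dfm(hits)
```

Matching is case-insensitive by default, so "Ottimo" is found even though the lexicon entry is lowercase.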
If the package quanteda.sentiment is installed, the
function will automatically assign the appropriate polarity or valence
attributes, making the dictionary compatible with
textstat_valence() or textstat_polarity().
Other helper functions include df_to_valence() and
df_to_polar() for explicit control.
that is equivalent to:
If the quanteda.sentiment package is installed, the
valence scores will be automatically assigned to the dictionary’s
“valence” attribute, and the dictionary will be ready for use with:
# Compute valence
quanteda.sentiment::textstat_valence(sentix_toks, dictionary = my_dict)
#> doc_id sentiment
#> 1 doc1 0.2689482
#> 2 doc2 -0.1755017
#> 3 doc3 0.2788701
#> 4 doc4 0.1295423
#> 5 doc5 -0.0208181
If quanteda.sentiment is not installed, the function creates a standard quanteda dictionary instead.
To get a polarity dictionary:
my_dict2 <- get_sentix("MAL", polarity = TRUE) |>
  # if there are other numeric columns, other than 'polarity'
  df_to_polar()
which is equivalent to applying df_to_dict() to the
polarity version of the lexicon: