| Title: | Lexicons and Tools for Italian Sentiment Analysis |
|---|---|
| Description: | Lexicons and tools to perform sentiment analysis on Italian texts. Lexicons included: Sentix 3.0, MAL, ElIta VAD and basic emotions (Plutchik's wheel of emotions). For more details about the lexicons, see Basile & Nissim (2013), "Sentiment Analysis on Italian Tweets", <https://aclanthology.org/W13-1614/>; Vassallo et al. (2019), "The Tenuousness of Lemmatization in Lexicon-based Sentiment Analysis", <https://aclanthology.org/2019.clicit-1.79/>; Di Palma (2024), "ELIta: A New Italian Language Resource for Emotion Analysis", <https://aclanthology.org/2024.clicit-1.36/>. |
| Authors: | Agnese Vardanega [aut, cre] (ORCID: <https://orcid.org/0000-0002-1419-9896>), Valerio Basile [aut] (ORCID: <https://orcid.org/0000-0001-8110-6832>), Eliana Di Palma [aut] (ORCID: <https://orcid.org/0000-0003-2154-2696>), Giuliano Gabrieli [aut] (ORCID: <https://orcid.org/0009-0005-1153-5662>), Marco Vassallo [aut] (ORCID: <https://orcid.org/0000-0001-7016-6549>) |
| Maintainer: | Agnese Vardanega <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.2.0 |
| Built: | 2026-06-24 10:40:50 UTC |
| Source: | https://github.com/cran/sentixr |
Converts a data frame (tibble) containing a lexicon into a quanteda dictionary
with valence or polarity.
Requires the quanteda package. If the quanteda.sentiment package is also
installed, the polarity or valence attributes are detected and assigned
automatically. Otherwise, a standard quanteda dictionary is created.
The function is a wrapper for df_to_valence() and df_to_polar(), and
automatically determines, where possible, the most appropriate type of
dictionary for the input data frame (see Details).
Note: The function cannot handle duplicate entries, and removes rows with NAs.
df_to_dict(
  x,
  word_field = NULL,
  type = "auto",
  polar_field = "polarity",
  polar_map = NULL
)
x |
A |
word_field |
A string with the name of the column containing the terms.
If |
type |
The type of dictionary to create. Can be |
polar_field |
A string with the name of the column containing the
categories (polarities; e.g. "Positive", "Negative").
Defaults to |
polar_map |
A named character vector to manually map dictionary keys
to standard polarity values ( |
The function handles the sentiment scores or categories as follows:
Valence Dictionaries: The names of the numeric columns are used
as dictionary keys. When there is only one
numeric column, the word_field is used as the key name
(see quanteda.sentiment::valence if installed).
Polarity Dictionaries:
The character or factor column (other
than the word_field) is used to group terms into the categories
(polar_field) that are then associated with the standard
"polarity" attribute ("pos", "neg", optionally "neut";
see quanteda.sentiment::polarity if installed).
The "polarity" attribute is assigned via the polar_map argument, or
automatically if the categories in the polar_field are explicit:
"positive", "negative" (and, optionally, "neutral"; case-insensitive).
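The choice between the two dictionary types can be pictured with a small base R sketch. `guess_dict_type()` is a hypothetical helper written only for illustration; it is not part of the package, and the actual `type = "auto"` logic may differ (in particular, giving numeric columns precedence is an assumption).

```r
# Hypothetical helper (not a package function): guess which kind of
# dictionary suits a lexicon data frame, based on its column types.
guess_dict_type <- function(x, word_field) {
  other <- x[setdiff(names(x), word_field)]
  has_numeric <- any(vapply(other, is.numeric, logical(1)))
  has_category <- any(vapply(
    other,
    function(col) is.character(col) || is.factor(col),
    logical(1)
  ))
  # Assumption: numeric score columns take precedence over categories.
  if (has_numeric) "valence" else if (has_category) "polarity" else "plain"
}

# Toy lexicons with made-up values
scores <- data.frame(lemma = c("bello", "brutto"), score = c(0.8, -0.7),
                     stringsAsFactors = FALSE)
labels <- data.frame(lemma = c("bello", "brutto"),
                     polarity = c("positive", "negative"),
                     stringsAsFactors = FALSE)

guess_dict_type(scores, "lemma")
guess_dict_type(labels, "lemma")
```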
A quanteda::dictionary2 object.
df_to_valence(), df_to_polar(), quanteda::dictionary()
if (requireNamespace("quanteda")) {
  # only numeric fields are present
  my_dict <- get_sentix()
  df_to_dict(my_dict)

  # no numeric fields are present
  my_dict <- get_sentix(polarity = TRUE)
  df_to_dict(my_dict)
}
Converts a data frame (tibble) containing a lexicon into a quanteda dictionary
with polarity, to be used with quanteda.sentiment::textstat_polarity().
Requires the quanteda package. If the quanteda.sentiment package is also
installed, the polarity attribute is detected and assigned automatically.
Otherwise, a standard quanteda dictionary is created.
Note: The function cannot handle duplicate entries, and removes rows with NAs.
df_to_polar(x, word_field = NULL, polar_field = "polarity", polar_map = NULL)
x |
A |
word_field |
A string with the name of the column containing the terms.
If |
polar_field |
A string with the name of the column containing the
categories (polarities; e.g. "Positive", "Negative").
Defaults to |
polar_map |
A named character vector to manually map dictionary keys
to standard polarity values ( |
The function handles the sentiment categories as follows:
The character or factor column (other
than the word_field) is used to group terms into the categories
(polar_field) that are then associated with the standard
"polarity" attribute ("pos", "neg", optionally "neut";
see quanteda.sentiment::polarity if installed).
The "polarity" attribute is assigned via the polar_map argument, or
automatically if the categories in the polar_field are explicit:
"positive", "negative" (and, optionally, "neutral"; case-insensitive).
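As a rough illustration of the grouping and of the automatic, case-insensitive mapping, the base R sketch below uses a toy lexicon with made-up entries. It only mimics the idea and is not the package's implementation; the object names are hypothetical.

```r
# Toy lexicon (made-up entries) with an explicit category column
lexicon <- data.frame(
  lemma = c("bello", "brutto", "tavolo", "felice"),
  polarity = c("Positive", "Negative", "Neutral", "Positive"),
  stringsAsFactors = FALSE
)

# Group the terms by category: one character vector per dictionary key
keys <- split(lexicon$lemma, lexicon$polarity)

# Case-insensitive match of explicit category names to the standard values
standard <- c(positive = "pos", negative = "neg", neutral = "neut")
polarity_attr <- standard[tolower(names(keys))]
names(polarity_attr) <- names(keys)

keys
polarity_attr
```

When the category labels are not explicit (for instance "good" and "bad"), no such automatic match is possible, which is the case the polar_map argument is meant to cover.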
A quanteda::dictionary2 object.
df_to_dict(), df_to_valence(), quanteda::dictionary()
if (requireNamespace("quanteda")) {
  # Create a polarity dictionary from sentix
  my_dict <- get_sentix(polarity = TRUE)
  my_pol_dict <- df_to_polar(my_dict)
}
Converts a data frame (tibble) containing a lexicon into a quanteda dictionary
with valence, to be used with quanteda.sentiment::textstat_valence().
Requires the quanteda package. If the quanteda.sentiment package is also
installed, the valence attribute is detected and assigned automatically.
Otherwise, a standard quanteda dictionary is created.
Note: The function cannot handle duplicate entries, and removes rows with NAs.
df_to_valence(x, word_field = NULL)
x |
A |
word_field |
A string with the name of the column containing the terms.
If |
The names of the numeric columns are used as dictionary keys. When there is
only one numeric column, the word_field is used as the key name
(see quanteda.sentiment::valence if installed).
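The key-naming rule can be sketched in base R: each numeric column becomes one key holding a named vector of scores, and a single numeric column falls back to the name of the word field. The data and object names below are illustrative assumptions, not package internals.

```r
# Toy lexicon with two numeric columns (made-up values)
vad <- data.frame(
  lemma = c("gioia", "paura"),
  valence = c(0.9, -0.8),
  arousal = c(0.6, 0.7),
  stringsAsFactors = FALSE
)
word_field <- "lemma"

# Numeric columns provide the dictionary keys ...
num_cols <- names(vad)[vapply(vad, is.numeric, logical(1))]
# ... unless there is only one, in which case the word field names the key
keys <- if (length(num_cols) == 1) word_field else num_cols

# One named numeric vector (term -> score) per key
valences <- lapply(num_cols, function(col) setNames(vad[[col]], vad[[word_field]]))
names(valences) <- keys

valences
```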
A quanteda::dictionary2 object.
df_to_dict(), df_to_polar(), quanteda::dictionary()
if (requireNamespace("quanteda")) {
  # Create a valence dictionary from elita_VAD
  data(elita_VAD)
  elita_dict <- df_to_valence(elita_VAD)
}
A dataset containing scores for 6,905 Italian lexical entries (lemmas and emojis) on the eight basic emotions of Plutchik’s wheel, together with the dyad love, formed by the combination of trust and joy (Plutchik 1980). Scores use a four-point scale: "not associated" (0), "weakly associated" (0.25), "moderately associated" (0.75), and "strongly associated" (1).
This dataset is a subset of the broader ELIta framework: see
elita_VAD for the VAD dimensional approach (Valence, Arousal, and
Dominance).
data(elita_basic)
A tibble with 6,905 rows and 10 columns:
Italian lemmas and emojis (character).
Joy: 0 to +1 (double).
Sadness: 0 to +1 (double).
Anger: 0 to +1 (double).
Disgust: 0 to +1 (double).
Fear: 0 to +1 (double).
Trust: 0 to +1 (double).
Surprise: 0 to +1 (double).
Anticipation: 0 to +1 (double).
Love: 0 to +1 (double).
The dataset is distributed under the Creative Commons Universal License (CC0 1.0).
GitHub Repository: https://github.com/elianadipalma/ELIta
Di Palma, E. (2024a). ELIta: A New Italian Language Resource for Emotion Analysis. Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), 297–307. https://aclanthology.org/2024.clicit-1.36/
Di Palma, E. (2024b). ELIta (Emotion Lexicon for Italian). http://hdl.handle.net/20.500.11752/OPEN-1036
Plutchik, R. (1980). A general psychoevolutionary theory of emotion. In R. Plutchik & H. Kellerman (eds.), Theories of Emotion (pp. 3–33). Academic Press.
data(elita_basic)
get_elita(dict = "elita_basic")
A dataset containing scores for 6,905 Italian lexical entries (lemmas and emojis) on the VAD dimensions (Valence, Arousal, and Dominance; see Russell 1980).
This dataset is a subset of the broader ELIta framework.
See elita_basic, for basic discrete emotions (Plutchik's wheel).
data(elita_VAD)
A tibble with 6,905 rows and 4 columns:
Italian lemmas and emojis (character).
Valence (unpleasant - pleasant): -4 to +4 (double).
Arousal (calm - excited/active): -4 to +4 (double).
Dominance (submissive/controlled - dominant/in control): -4 to +4 (double).
The dataset is distributed under the Creative Commons Universal License (CC0 1.0).
GitHub Repository: https://github.com/elianadipalma/ELIta
Di Palma, E. (2024a). ELIta: A New Italian Language Resource for Emotion Analysis. Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), 297–307. https://aclanthology.org/2024.clicit-1.36/
Di Palma, E. (2024b). ELIta (Emotion Lexicon for Italian). http://hdl.handle.net/20.500.11752/OPEN-1036
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.
data(elita_VAD)
# To rescale scores to -1, +1
get_elita(dict = "elita_VAD")
A utility function to access ELIta-family lexicons (elita_basic,
elita_VAD) with convenient defaults.
For all lexicons, it returns the key entry (lemma or word) and score
columns, suitable for joining.
For elita_VAD, which contains scores on a -4 to +4 scale, scores are
"centered" by default (divided by 4 to map to a -1 to +1 theoretical range).
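The arithmetic behind the default rescaling is a plain division by 4. The sketch below applies it to a few made-up raw scores; it does not call the package, and the variable names are hypothetical.

```r
# Made-up raw VAD scores on the original -4 to +4 scale
raw <- c(-4, -2, 0, 1, 4)

# Dividing by 4 maps the theoretical range onto -1 to +1
rescaled <- raw / 4

rescaled
range(rescaled)
```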
get_elita(dict = "elita_VAD", rescale = "default")
dict |
The name of the lexicon to retrieve. Must be one of:
|
rescale |
Character string indicating the rescaling method applied to the scores. Options are:
This argument only applies |
A tibble.
elita_VAD, elita_basic, get_sentix()
# Get the default elita_VAD lexicon (centered scores)
my_dict_VAD <- get_elita("elita_VAD")

# Get elita_VAD without any rescaling
my_dict_VAD <- get_elita("elita_VAD", rescale = "none")

# Get elita_basic lexicon
my_dict_basic <- get_elita("elita_basic")
A utility function to access Sentix-family lexicons (sentix, MAL) with
convenient defaults.
For all lexicons, it returns by default the key entry (lemma or word) and
score columns, suitable for joining. Polarity classification can be computed
via make_polarity().
Other columns (polypathy_index) are accessible via arguments.
get_sentix(
  dict = "sentix",
  polypathy = FALSE,
  polarity = FALSE,
  polar_field = "polarity",
  threshold = 0
)
dict |
The name of the lexicon to retrieve. Must be one of: |
polypathy |
Logical. If |
polarity |
Logical. If |
polar_field |
Character string. The name of the new polarity column.
Defaults to |
threshold |
Numeric. The threshold for |
A tibble.
sentix, MAL, get_elita(), make_polarity()
# Get the default sentix lexicon (key and score)
my_dict <- get_sentix()

# Get the sentix lexicon with polarity field
my_dict <- get_sentix(polarity = TRUE)

# Get MAL and polypathy index
my_dict_poly <- get_sentix("MAL", polypathy = TRUE)
Utility function for adding polarity columns to sentiment lexicons, to be used
within mutate.
Classifies numeric sentiment scores into "positive", "negative", or "neutral", based on a specified threshold (defaults to 0).
make_polarity(score, threshold = 0)
score |
A numeric vector of sentiment scores. |
threshold |
A numeric vector.
If length 1 (i.e. Scores |
A character vector with "positive", "negative", or "neutral".
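The classification rule can be sketched in base R. `classify_polarity()` is a hypothetical stand-in written for illustration, not the package function. It assumes that a single threshold t defines a symmetric neutral band from -t to +t, that two values give the upper and lower cut-offs, and that scores exactly on a cut-off count as neutral; the actual behaviour of make_polarity() may differ on these points.

```r
# Hypothetical re-implementation of the idea (assumptions in the comments)
classify_polarity <- function(score, threshold = 0) {
  # Assumption: one value t means the symmetric band [-t, +t]
  if (length(threshold) == 1) {
    threshold <- c(abs(threshold), -abs(threshold))
  }
  upper <- max(threshold)
  lower <- min(threshold)

  out <- rep("neutral", length(score))
  out[which(score > upper)] <- "positive"
  out[which(score < lower)] <- "negative"
  out[is.na(score)] <- NA_character_  # missing scores stay missing
  out
}

# Symmetric threshold
classify_polarity(c(-0.5, 0, 0.1, 0.3), threshold = 0.125)

# Asymmetric thresholds: upper cut-off 0.125, lower cut-off -0.135
classify_polarity(c(-0.13, -0.14, 0.2), threshold = c(0.125, -0.135))
```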
library(dplyr)  # provides mutate() and across()

sentix |> mutate(polarity = make_polarity(score))

# with custom threshold
elita_VAD |> mutate(across(where(is.numeric), ~ make_polarity(.x, 0.125)))

# with custom asymmetric thresholds
get_sentix("MAL") |>
  mutate(polarity = make_polarity(score, threshold = c(0.125, -0.135)))
MAL (Morphologically-inflected Affective Lexicon) is an affective lexicon
for the Italian language. It expands sentix with inflected forms from
Morph-it! (Vassallo et al. 2019; see Zanchetta & Baroni 2005), and can
therefore be used without lemmatization.
It contains 295,032 inflected forms (field word), with associated affective
scores, and an index of polypathy.
Affective scores are inherited from the corresponding sentix entries
(lemmas).
data(MAL)
A tibble with 297,592 rows and 4 columns:
Italian inflected forms (character).
Sentiment valence: -1 to +1 (double).
Index of ambiguity: "0", "1", "2", "3" (ordered factor; see sentix for details).
The dataset is distributed under the CC BY-SA 4.0 license.
Zenodo Repository: https://zenodo.org/records/18709688.
Vassallo, M., Gabrieli, G., Basile, V., & Bosco, C. (2019). The Tenuousness of Lemmatization in Lexicon-based Sentiment Analysis. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), pages 520–525, Bari, Italy. CEUR Workshop Proceedings. https://aclanthology.org/2019.clicit-1.79/
Zanchetta, E., & Baroni, M. (2005). Morph-it! A free corpus-based morphological resource for the Italian language. In Proceedings of Corpus Linguistics Conference Series 2005, University of Birmingham. https://cris.unibo.it/handle/11585/15321
data(MAL)
get_sentix(dict = "MAL")
A dataset containing 5 sentences in Italian, derived from TV reviews on Amazon, for testing and demonstrating the package functions.
data(recensioni_tv)
A tibble with 5 rows and 2 variables:
Unique identifier for the document (doc1 to doc5)
The text content of the reviews
Sentix is an affective lexicon for the Italian language (Basile & Nissim 2013; Basile et al. 2025).
It includes 68,190 Italian lemmas (field lemma) with associated affective
scores and an index of polypathy (see Details).
data(sentix)
A tibble with 68,190 rows and 4 columns:
Italian lemmas (character).
Sentiment valence: -1 to +1 (double).
Index of ambiguity (see Details): "0", "1", "2", "3" (ordered factor).
The polypathy_index provides information on the
ambivalence and stability of the sentiment scores, on the basis of the
original multiple entries for each lemma.
The values are interpreted as follows:
"0": No multiple entries for the lemma.
"1": Multiple entries with a low range (max - min) of original scores.
"2": Multiple entries with a high range of original scores.
"3": Multiple entries with a high range of original scores, and ambivalence (sign change).
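To make the four levels concrete, the sketch below assigns a level to the original scores of a single lemma. `polypathy_level()` is a hypothetical helper, and the cut-off of 0.5 separating a "low" from a "high" range is an arbitrary value chosen for illustration; the source does not state the cut-off actually used to build the lexicon.

```r
# Hypothetical helper: derive a polypathy level from the original scores
# of one lemma. The cutoff between "low" and "high" range is an assumption.
polypathy_level <- function(scores, cutoff = 0.5) {
  if (length(scores) < 2) return("0")        # no multiple entries
  spread <- max(scores) - min(scores)        # range (max - min)
  if (spread <= cutoff) return("1")          # low range
  sign_change <- min(scores) < 0 && max(scores) > 0
  if (sign_change) "3" else "2"              # high range, with or without ambivalence
}

polypathy_level(0.4)            # single entry
polypathy_level(c(0.4, 0.5))    # close scores
polypathy_level(c(0.1, 0.9))    # distant scores, same sign
polypathy_level(c(-0.5, 0.6))   # distant scores, opposite signs
```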
The dataset is distributed under the CC BY-SA 4.0 license.
Zenodo Repository: https://zenodo.org/records/15609186.
GitHub Repository: https://github.com/valeriobasile/sentix
Basile, V., & Nissim, M. (2013). Sentiment Analysis on Italian Tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107, Atlanta, Georgia. Association for Computational Linguistics. https://aclanthology.org/W13-1614/.
Basile, V., Nissim, M., Bosco, C., Vassallo, M., & Gabrieli, G. (2025). Sentix (3.1). Zenodo. doi:10.5281/zenodo.15609185.
data(sentix)
get_sentix()
Annotates a character vector or a data frame of texts using
udpipe (providing
a model where necessary) and joins the results with a selected sentiment
lexicon.
sentix_annotate(
  x,
  model = NULL,
  text_field = NULL,
  docid_field = NULL,
  dict = "sentix",
  rescale = "default",
  simplify = TRUE,
  ...
)
x |
A character vector, a data frame, or a list of texts. |
model |
A UDpipe model used for annotation. The following options are supported:
|
text_field |
A character string specifying the name of the column
containing the text to be parsed.
If |
docid_field |
A character string specifying the name of the column
containing document identifiers. If |
dict |
The name of the lexicon to use. Can be one of the Sentix family
( |
rescale |
Character string indicating the rescaling method for scores:
|
simplify |
Logical. Defaults to |
... |
Additional arguments passed to
the lexicon retrieval function (see |
This function uses udpipe to process the
input texts, then joins the tokenized output with the specified lexicon, using
dplyr join. It performs two main steps:
Parsing: Texts are tokenized, tagged, and lemmatized. For
larger corpora, the function supports parallel processing by passing
the argument parallel.cores to udpipe.
Matching: The parsed tokens are joined with the selected
lexicon using left_join.
All tokens are thus preserved in the output to maintain context, with
NA assigned to tokens not found in the lexicon.
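The matching step can be illustrated without udpipe. The sketch below uses a hand-made token table and a toy lexicon with made-up scores, and emulates a left join in base R with match(), which gives the same result as a left join when lexicon keys are unique. It is an illustration of the idea only, not the package's code.

```r
# Hand-made stand-in for parsed output (one row per token)
tokens <- data.frame(
  doc_id = "doc1",
  token_id = 1:4,
  token = c("una", "bella", "giornata", "serena"),
  lemma = c("uno", "bello", "giornata", "sereno"),
  stringsAsFactors = FALSE
)

# Toy lexicon with made-up scores; "uno" is deliberately absent
lexicon <- data.frame(
  lemma = c("bello", "giornata", "sereno"),
  score = c(0.8, 0.2, 0.6),
  stringsAsFactors = FALSE
)

# Left join on lemma: every token is kept, unmatched lemmas get NA
annotated <- tokens
annotated$score <- lexicon$score[match(tokens$lemma, lexicon$lemma)]

annotated
```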
A tibble with one row per token, containing the following
columns: doc_id, sentence_id, token_id, token, lemma, upos, one
or more sentiment score columns, named after those of the selected lexicon.
When simplify = FALSE it will include standard UDpipe columns (see
as.data.frame.udpipe_connlu), plus sentiment score
columns.
get_sentix(), get_elita(), sentix_summarize()
## Not run:
# This example is not executed because it requires the udpipe package and
# downloading a model

# Auto-download model
ann_df <- sentix_annotate("Oggi è una bella giornata")

# Use a local model file in the working directory (i.e. if already
# downloaded)
ann_df <- sentix_annotate("Uso un modello locale.", model = "local")

# Use specific model path and lexicon
ann_df <- sentix_annotate("Oggi è una bella giornata",
                          model = "path/to/model.udpipe",
                          dict = "elita_VAD")

# With a data frame, and a loaded model
data("recensioni_tv")
model <- udpipe::udpipe_load_model("italian-isdt-ud-2.5-191206.udpipe")
ann_df <- sentix_annotate(recensioni_tv, model = model)
## End(Not run)
Calculates sentiment scores and, optionally, ambiguity metrics, aggregating token-level sentiment annotations to the document level.
sentix_summarize(
  x,
  aggregation = "mean",
  cols = NULL,
  by = "doc_id",
  simplify = FALSE,
  ambiguity = "3"
)
x |
A data frame containing at least a |
aggregation |
Character. |
cols |
Character vector, specifying columns to summarize. If |
by |
Character vector, specifying the column(s) to group by. Defaults
to |
simplify |
Logical. Defaults to |
ambiguity |
Character. The minimum |
This function takes the output of sentix_annotate(), or any data frame or
tibble with at least a doc_id column and sentiment scores (numeric columns).
Metrics Calculated:
score: the average (or sum) of the sentiment columns.
ambiguity: n_poly / n_scored (if polypathy_index is present).
n_tokens: total valid tokens, excluding punctuation. UDpipe's CoNLL-U
format expands Multi-Word Tokens (MWTs) into their syntactic components,
including articulated prepositions: e.g., 'nella' becomes 'in' + 'la'.
The count only considers the components (e.g., 'nella' counts for 2
tokens, not 3).
n_scored: tokens with at least one sentiment score.
n_poly: count of ambiguous tokens, based on the ambiguity level
setting, and if the column polypathy_index is present in the lexicon.
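The metrics above can be reproduced by hand on a small annotated table. The base R sketch below uses made-up tokens and scores, and `summarize_doc()` is a hypothetical helper, not the package function. It assumes that the ambiguity setting is the minimum polypathy_index counted as ambiguous, and it does not model the multi-word token handling described for n_tokens.

```r
# Made-up token-level annotations for two documents
ann <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc1", "doc2", "doc2", "doc2"),
  upos = c("DET", "ADJ", "NOUN", "PUNCT", "ADJ", "NOUN", "PUNCT"),
  score = c(NA, 0.8, 0.2, NA, -0.6, NA, NA),
  polypathy_index = c(NA, "0", "3", NA, "1", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: document-level metrics for one document
summarize_doc <- function(d, ambiguity = "3") {
  valid <- d[d$upos != "PUNCT", ]            # tokens excluding punctuation
  scored <- valid[!is.na(valid$score), ]     # tokens found in the lexicon
  level <- as.integer(as.character(scored$polypathy_index))
  # Assumption: tokens at or above the chosen level count as ambiguous
  n_poly <- sum(level >= as.integer(ambiguity))
  data.frame(
    doc_id = d$doc_id[1],
    score = mean(scored$score),
    n_tokens = nrow(valid),
    n_scored = nrow(scored),
    n_poly = n_poly,
    ambiguity = n_poly / nrow(scored)
  )
}

# One row per document
result <- do.call(rbind, lapply(split(ann, ann$doc_id), summarize_doc))
result
```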
A tibble with one row per document.
get_sentix(), sentix, sentix_annotate()
## Not run:
# This example is not executed because it requires the udpipe package and
# downloading a model
testo <- "Oggi è una bella giornata. Uscirò a fare una passeggiata"

# With the output of sentix_annotate
ann_df <- sentix_annotate(testo, model = "local")
sentix_summarize(ann_df)

# With only basic measures
sentix_summarize(ann_df, simplify = TRUE)

# With custom grouping (e.g., per sentence)
sentix_summarize(ann_df, by = c("doc_id", "sentence_id"))

# With the output of sentix_annotate, ambiguity and other intermediate
# measures
ann_df <- sentix_annotate(testo, polypathy = TRUE, model = "local")
sentix_summarize(ann_df)
## End(Not run)