---
title: "sentixr with tidytext and quanteda" 
output: rmarkdown::html_vignette 
vignette: >
  %\VignetteIndexEntry{sentixr with tidytext and quanteda}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE, warning = FALSE, message = FALSE,
  comment = "#>"
)
```

`sentixr` is designed to work alongside other R packages for text analysis. Whether you prefer the `tidytext` ecosystem or the `quanteda` framework, you can export its lexicons and use them in your existing pipelines.

## Setup

```{r}
library(sentixr)
```

We'll use the example data provided by the package:

```{r}
data(recensioni_tv)
recensioni_tv
```

## sentixr with tidytext

For the `tidytext` ecosystem, `sentixr` provides access to its lexicons in a tidy format, ready for joining.

```{r}
library(tidytext)
library(dplyr) # provides left_join(), group_by(), summarise(), mutate()
```




### Get the Lexicon

Use `get_sentix()` to retrieve the lexicon as a tidy tibble with only two columns: the `word` and its `score`.

For `tidytext` workflows on raw text (without lemmatization), the **MAL** lexicon is preferred because it contains inflected forms, matching the output of standard tokenizers.

```{r}
# Get the MAL lexicon (inflected forms)
mal_dict <- get_sentix("MAL")
head(mal_dict)
```

### Tokenize and Join

Use `tidytext::unnest_tokens()` to split the text into words.

```{r}
# Tokenize
tidy_text <- recensioni_tv |> 
  unnest_tokens(word, text)
```
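
As a small illustration of what the tokenization step produces, the sketch below runs `unnest_tokens()` on a made-up tibble. `toy_reviews` is hypothetical and only mimics the `doc_id`/`text` layout assumed for the example data. By default, `unnest_tokens()` lowercases the text and strips punctuation, which matters when matching tokens against lexicon entries.

```{r}
library(tidytext)
library(dplyr)

# Hypothetical reviews, for illustration only
toy_reviews <- tibble(
  doc_id = c("doc1", "doc2"),
  text   = c("Film BELLO, davvero!", "Trama noiosa.")
)

# One row per word; the original text column is dropped by default
toy_words <- toy_reviews |>
  unnest_tokens(word, text)

toy_words
```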

```{r}
# Join with lexicon
tidy_sent <- tidy_text |>
  left_join(mal_dict, by = "word")

head(tidy_sent)
```


Here, `left_join()` keeps all words, including those without a score, so that the token count (`n_tokens` in the summary below) remains accurate; `score` is `NA` for words not found in the lexicon.

Alternatively, use `inner_join()` to keep only the words present in the lexicon.
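
The difference between the two joins can be seen on a small, made-up example. The tibbles `toy_tokens` and `toy_lexicon` are hypothetical and only mimic the `word`/`score` layout used above; they are not part of `sentixr`.

```{r}
library(dplyr)

# Hypothetical tokens and lexicon, for illustration only
toy_tokens <- tibble(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  word   = c("film", "davvero", "bello", "trama", "noiosa")
)
toy_lexicon <- tibble(
  word  = c("bello", "noiosa"),
  score = c(0.6, -0.5)
)

# left_join() keeps every token; unmatched words get score = NA
kept_all <- left_join(toy_tokens, toy_lexicon, by = "word")

# inner_join() keeps only the tokens found in the lexicon
kept_scored <- inner_join(toy_tokens, toy_lexicon, by = "word")

nrow(kept_all)     # 5 rows: one per token
nrow(kept_scored)  # 2 rows: only "bello" and "noiosa"
```

With `inner_join()`, a per-document `n()` would count scored words only, not all tokens.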

### Analyze

To get the sentiment scores, you can use the native `sentix_summarize()` function for a quick analysis on the joined data:


```{r}
# Calculate average sentiment per document
sentix_summarize(tidy_sent, simplify = FALSE)
```


Alternatively, if you prefer custom metrics (e.g., standard deviation or median), you can manually group and summarize:


```{r}
# Manual summary with dplyr
tidy_sent |>
  group_by(doc_id) |>
  summarise(
    sentiment = mean(score, na.rm = TRUE),
    n_tokens = n(),
    n_scored = sum(!is.na(score))
  )
```
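
The same pattern extends to the other metrics mentioned above. The sketch below computes the median, the standard deviation, and the share of scored tokens on a made-up table; `toy_sent` is hypothetical and only mimics the `doc_id`/`word`/`score` layout of the joined data.

```{r}
library(dplyr)

# Hypothetical joined tokens, for illustration only
toy_sent <- tibble(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2", "doc2"),
  word   = c("bello", "film", "ottimo", "noiosa", "trama", "brutta"),
  score  = c(0.6, NA, 0.8, -0.5, NA, -0.7)
)

toy_summary <- toy_sent |>
  group_by(doc_id) |>
  summarise(
    median_score = median(score, na.rm = TRUE),
    sd_score     = sd(score, na.rm = TRUE),
    # share of tokens that received a score
    coverage     = mean(!is.na(score))
  )

toy_summary
```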

### Polarity Analysis


To perform polarity analysis (counting positive vs. negative words), retrieve the lexicon with `polarity = TRUE`, which assigns "positive"/"negative" labels, and join as before.

```{r}
# Get MAL with polarity labels
polar_dict <- get_sentix("MAL", polarity = TRUE)
head(polar_dict)
```

```{r}
# Join with tokenized text
tidy_text |>
  left_join(polar_dict, by = "word") |>
  head()
```

You can also derive polarity labels from continuous scores with custom cut-offs, using the `make_polarity()` function. Here the threshold is set to `0.125`, so scores above 0.125 are "positive", scores below -0.125 are "negative", and scores in between are "neutral":


```{r}
mal_dict |> 
  mutate(polarity = make_polarity(score, 
                                  threshold = 0.125)) |> 
  head()
```

Or, to convert every numeric column of a lexicon into polarity labels in one step (here the one returned by `get_elita()`):

```{r}
get_elita() |> 
  mutate(across(where(is.numeric), 
                ~ make_polarity(.x))) |> 
  tail()
```


You can also set different thresholds for the positive and negative classes by passing a vector of two values, such as `c(0.125, -0.135)`.
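
The thresholding logic described above can be sketched in base R. The helper `label_polarity()` and its arguments `upper` and `lower` are hypothetical and are not the `sentixr` implementation; in particular, treating scores exactly on a cut-off as "neutral" is an assumption.

```{r}
# Hypothetical helper, not the sentixr implementation:
# `upper` is the positive cut-off, `lower` the negative one.
label_polarity <- function(score, upper = 0.125, lower = -upper) {
  out <- rep("neutral", length(score))
  out[which(score > upper)] <- "positive"
  out[which(score < lower)] <- "negative"
  out[is.na(score)] <- NA_character_  # missing scores stay missing
  out
}

scores <- c(0.50, 0.13, 0.10, -0.13, -0.14, NA)

# Symmetric threshold: +/- 0.125
symmetric <- label_polarity(scores)

# Asymmetric thresholds, mirroring c(0.125, -0.135)
asymmetric <- label_polarity(scores, upper = 0.125, lower = -0.135)

data.frame(scores, symmetric, asymmetric)
```

In this sketch, the score `-0.13` is "negative" under the symmetric threshold but "neutral" once the negative cut-off is lowered to `-0.135`.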

## sentixr with quanteda

`sentixr` also includes helper functions to convert its lexicons into `quanteda::dictionary` objects, facilitating integration with the `quanteda` framework.

```{r}
library(quanteda)
```

```{r data}
data(recensioni_tv)
sentix_toks <- corpus(recensioni_tv) |>
  tokens(remove_punct = TRUE)
```



### Creating a quanteda Dictionary

The `df_to_dict()` function converts a data frame lexicon into a `quanteda` dictionary that can be used, for example, with `tokens_lookup()` or `dfm_lookup()`.

If the `quanteda.sentiment` package is installed, the function automatically assigns the appropriate polarity or valence attributes, making the dictionary compatible with `textstat_polarity()` or `textstat_valence()`.

For explicit control, use the helper functions `df_to_valence()` and `df_to_polar()`.

```{r}
# Convert MAL to a valence dictionary
my_dict <- df_to_dict(mal_dict)
```

This is equivalent to:

```{r eval=FALSE}
df_to_valence(mal_dict)
```

With `quanteda.sentiment` installed, the valence scores are stored in the dictionary's "valence" attribute, and the dictionary is ready for use with `textstat_valence()`:

```{r, eval = FALSE}
# Compute valence
quanteda.sentiment::textstat_valence(sentix_toks, dictionary = my_dict)
#>   doc_id  sentiment
#> 1   doc1  0.2689482
#> 2   doc2 -0.1755017
#> 3   doc3  0.2788701
#> 4   doc4  0.1295423
#> 5   doc5 -0.0208181
```

If `quanteda.sentiment` is not installed, `df_to_dict()` creates a standard `quanteda` dictionary instead.

To get a polarity dictionary:

```{r eval=FALSE}
my_dict2 <- get_sentix("MAL", polarity = TRUE) |> 
  # useful if the data frame has numeric columns other than 'polarity'
  df_to_polar()
```


This is equivalent to applying `df_to_dict()` to the polarity version of the lexicon:

```{r}
my_dict2 <- df_to_dict(polar_dict)
```



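With `quanteda.sentiment` installed, polarity scores can then be computed with `textstat_polarity()`:

```{r, eval = FALSE}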
# Compute polarity scores
quanteda.sentiment::textstat_polarity(sentix_toks, 
                                      dictionary = my_dict2)
#>   doc_id sentiment
#> 1   doc1 2.8332133
#> 2   doc2 0.0000000
#> 3   doc3 1.4663371
#> 4   doc4 0.9555114
#> 5   doc5 0.0000000
```
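
Without `quanteda.sentiment`, a standard dictionary can still be used with `tokens_lookup()`, as mentioned earlier. The sketch below builds a dictionary by hand with `quanteda::dictionary()` instead of `df_to_dict()`, so that it is self-contained; `toy_polar`, `toy_dict`, and the example sentences are hypothetical.

```{r}
library(quanteda)

# Hypothetical polarity lexicon, for illustration only
toy_polar <- data.frame(
  word     = c("bello", "ottimo", "noiosa", "brutta"),
  polarity = c("positive", "positive", "negative", "negative")
)

# One dictionary key per polarity label
toy_dict <- dictionary(split(toy_polar$word, toy_polar$polarity))

toy_toks <- tokens(
  c(doc1 = "Un film bello, anzi ottimo.",
    doc2 = "Trama noiosa e regia brutta, ma finale bello."),
  remove_punct = TRUE
)

# Replace tokens with their dictionary key and count matches per document
toy_counts <- dfm(tokens_lookup(toy_toks, dictionary = toy_dict))
toy_counts
```

The resulting document-feature matrix holds raw counts of positive and negative matches per document, which can be combined into any custom polarity measure.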