Title: | An Integrated Framework for Textual Sentiment Time Series Aggregation and Prediction |
---|---|
Description: | Optimized prediction based on textual sentiment, accounting for the intrinsic challenge that sentiment can be computed and pooled across texts and time in various ways. See Ardia et al. (2021) <doi:10.18637/jss.v099.i02>. |
Authors: | Samuel Borms [aut, cre], David Ardia [aut], Keven Bluteau [aut], Kris Boudt [aut], Jeroen Van Pelt [ctb], Andres Algaba [ctb] |
Maintainer: | Samuel Borms <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.0 |
Built: | 2024-12-15 07:47:28 UTC |
Source: | CRAN |
The sentometrics package is an integrated framework for textual sentiment time series aggregation and prediction. It accounts for the intrinsic challenge that, for a given text, sentiment can be computed in many different ways, as well as the large number of possibilities to pool sentiment across texts and time. This additional layer of manipulation does not exist in standard text mining and time series analysis packages. The package therefore integrates the fast quantification of sentiment from texts, the aggregation into different sentiment time series and the optimized prediction based on these measures.
Corpus (features) generation: sento_corpus, add_features, as.sento_corpus
Sentiment computation and aggregation into sentiment measures: ctr_agg, sento_lexicons, compute_sentiment, aggregate.sentiment, as.sentiment, sento_measures, peakdocs, peakdates, aggregate.sento_measures
Sparse modeling: ctr_model, sento_model
Prediction and post-modeling analysis: predict.sento_model, attributions
Please cite the package in publications. Use citation("sentometrics").
Maintainer: Samuel Borms
Authors:
David Ardia
Keven Bluteau
Kris Boudt
Other contributors:
Jeroen Van Pelt [contributor]
Andres Algaba [contributor]
Ardia, Bluteau, Borms and Boudt (2021). The R Package sentometrics to Compute, Aggregate, and Predict with Textual Sentiment. Journal of Statistical Software 99(2), 1-40, doi:10.18637/jss.v099.i02.
Ardia, Bluteau and Boudt (2019). Questioning the news about economic growth: Sparse forecasting using thousands of news-based sentiment values. International Journal of Forecasting 35, 1370-1386, doi:10.1016/j.ijforecast.2018.10.010.
Useful links:
Report bugs at https://github.com/SentometricsResearch/sentometrics/issues
Adds new feature columns, either user-supplied or based on keyword(s)/regex pattern search, to a provided sento_corpus or a quanteda corpus object.
add_features(corpus, featuresdf = NULL, keywords = NULL, do.binary = TRUE, do.regex = FALSE)
corpus |
a |
featuresdf |
a named |
keywords |
a named |
do.binary |
a |
do.regex |
a |
If a provided feature name is already part of the corpus, it will be replaced. The featuresdf and keywords arguments can be provided at the same time, or only one of them, leaving the other at NULL. We use the stringi package for searching the keywords. The do.regex argument points to the corresponding elements in keywords. For elements set to FALSE, we transform the keywords into a simple regular expression, involving "\b" for exact word boundary matching and (if multiple keywords) | as OR operator. The elements associated to TRUE do not undergo this transformation and are evaluated as given, provided the corresponding keywords vector consists of only one expression. For a large corpus and/or complex regex patterns, this function may require some patience. Scaling between 0 and 1 is performed via min-max normalization, per column.
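The keyword handling and the scaling described above can be sketched in base R. This is an illustration only: the package itself relies on stringi, whereas the hypothetical helpers keywords_to_regex() and min_max() below use gregexpr() and are not part of sentometrics. How a constant column is scaled is also an assumption of this sketch.

```r
# Hypothetical helper (not part of sentometrics): build the simple regex
# that the text describes for keywords with do.regex = FALSE.
keywords_to_regex <- function(keywords) {
  paste0("\\b", keywords, "\\b", collapse = "|")
}

# Hypothetical helper: min-max normalization of one feature column.
min_max <- function(x) {
  rng <- range(x)
  # assumption of this sketch: a constant column is mapped to 0
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

pattern <- keywords_to_regex(c("Obama", "US president"))

texts <- c("Obama met the US president.", "Obamacare debate", "Nothing relevant")

# count the matches per text (gregexpr() returns -1 when nothing matches)
counts <- vapply(gregexpr(pattern, texts, perl = TRUE),
                 function(m) sum(m > 0), numeric(1))

binary <- as.numeric(counts > 0)  # analogous to do.binary = TRUE
scaled <- min_max(counts)         # analogous to do.binary = FALSE
```

The word boundaries are what keep "Obamacare" from counting as a match for "Obama"; the generated pattern is the same string as the pres2 regex used in the package example.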
An updated corpus object.
Samuel Borms
set.seed(505)

# construct a corpus and add (a) feature(s) to it
corpus <- quanteda::corpus_sample(
  sento_corpus(corpusdf = sentometrics::usnews), 500
)
corpus1 <- add_features(corpus,
                        featuresdf = data.frame(random = runif(quanteda::ndoc(corpus))))
corpus2 <- add_features(corpus,
                        keywords = list(pres = "president", war = "war"),
                        do.binary = FALSE)
corpus3 <- add_features(corpus,
                        keywords = list(pres = c("Obama", "US president")))
corpus4 <- add_features(corpus,
                        featuresdf = data.frame(all = 1),
                        keywords = list(pres1 = "Obama|US [p|P]resident",
                                        pres2 = "\\bObama\\b|\\bUS president\\b",
                                        war = "war"),
                        do.regex = c(TRUE, TRUE, FALSE))

sum(quanteda::docvars(corpus3, "pres")) ==
  sum(quanteda::docvars(corpus4, "pres2")) # TRUE

# adding a complementary feature
nonpres <- data.frame(nonpres = as.numeric(!quanteda::docvars(corpus3, "pres")))
corpus3 <- add_features(corpus3, featuresdf = nonpres)
Aggregates textual sentiment scores at sentence- or document-level into a panel of textual sentiment measures. Can also be used to aggregate sentence-level sentiment scores into document-level sentiment scores. This function is called within the sento_measures function.
## S3 method for class 'sentiment'
aggregate(x, ctr, do.full = TRUE, ...)
x |
a |
ctr |
output from a |
do.full |
if |
... |
not used. |
A document-level sentiment object or a fully aggregated sento_measures object.
Samuel Borms, Keven Bluteau
compute_sentiment, ctr_agg, sento_measures
set.seed(505)

data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

# computation of sentiment
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                     list_valence_shifters[["en"]])
l2 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                     list_valence_shifters[["en"]][, c("x", "t")])
sent1 <- compute_sentiment(corpusSample, l1, how = "counts")
sent2 <- compute_sentiment(corpusSample, l2, do.sentence = TRUE)
sent3 <- compute_sentiment(as.character(corpusSample), l2, do.sentence = TRUE)
ctr <- ctr_agg(howTime = c("linear"), by = "year", lag = 3)

# aggregate into sentiment measures
sm1 <- aggregate(sent1, ctr)
sm2 <- aggregate(sent2, ctr)

# two-step aggregation (first into document-level sentiment)
sd2 <- aggregate(sent2, ctr, do.full = FALSE)
sm3 <- aggregate(sd2, ctr)

# aggregation of a sentiment data.table
cols <- c("word_count", names(l2)[-length(l2)])
sd3 <- sent3[, lapply(.SD, sum), by = "id", .SDcols = cols]
Aggregates sentiment measures by combining across the provided lexicons, features, and time weighting scheme dimensions. For do.global = FALSE, the combination occurs by taking the mean of the relevant measures. For do.global = TRUE, this function aggregates all sentiment measures into a weighted global textual sentiment measure for each of the dimensions.
## S3 method for class 'sento_measures'
aggregate(
  x,
  features = NULL,
  lexicons = NULL,
  time = NULL,
  do.global = FALSE,
  do.keep = FALSE,
  ...
)
x |
a |
features |
a |
lexicons |
a |
time |
a |
do.global |
a |
do.keep |
a |
... |
not used. |
If do.global = TRUE, the measures are constructed from weights that indicate the importance (and sign) along each component from the lexicons, features, and time dimensions. There is no restriction in terms of allowed weights. For example, the global index based on the supplied lexicon weights ("globLex") is obtained first by multiplying every sentiment measure with its corresponding weight (meaning, the weight given to the lexicon the sentiment is computed with), then by taking the average per date.
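The two steps behind "globLex" can be illustrated on a toy panel in base R. This is a minimal sketch and not the package's internal code: the measure values, the weights, and the "lexicon--feature--time" column naming used to look up each measure's lexicon are assumptions made for the example.

```r
# Toy panel of sentiment measures: rows are dates, columns are measures.
# The "lexicon--feature--time" column naming is an assumption of this sketch.
measures <- matrix(
  c(0.2, -0.1,
    0.4,  0.0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("2020-01-01", "2020-01-02"),
                  c("LM_en--wsj--linear", "HENRY_en--wsj--linear"))
)

# Hypothetical lexicon weights (importance and sign).
lexWeights <- c(LM_en = 0.3, HENRY_en = 0.1)

# Look up, for every measure, the weight of the lexicon it was computed with.
lexOfMeasure <- sub("--.*$", "", colnames(measures))
w <- unname(lexWeights[lexOfMeasure])

# Step 1: multiply every measure by its weight. Step 2: average per date.
weighted <- sweep(measures, 2, w, `*`)
globLex <- rowMeans(weighted)
```

The same two steps would apply with feature or time weights in place of the lexicon weights.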
If do.global = FALSE, a modified sento_measures object, with the aggregated sentiment measures, including updated information and statistics, but the original sentiment scores data.table untouched.
If do.global = TRUE, a data.table with the different types of weighted global sentiment measures, named "globLex", "globFeat", "globTime" and "global", with "date" as the first column. The last measure is an average of the three other measures.
Samuel Borms
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                    list_valence_shifters[["en"]])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "year", lag = 3)
sento_measures <- sento_measures(corpusSample, l, ctr)

# aggregation across specified components
smAgg <- aggregate(sento_measures,
                   time = list(W = c("equal_weight", "linear")),
                   features = list(journals = c("wsj", "wapo")),
                   do.keep = TRUE)

# aggregation in full
dims <- get_dimensions(sento_measures)
smFull <- aggregate(sento_measures,
                    lexicons = list(L = dims[["lexicons"]]),
                    time = list(T = dims[["time"]]),
                    features = list(F = dims[["features"]]))

# "global" aggregation
smGlobal <- aggregate(sento_measures, do.global = TRUE,
                      lexicons = c(0.3, 0.1),
                      features = c(1, -0.5, 0.3, 1.2),
                      time = NULL)

## Not run: 
# aggregation won't work, but produces informative error message
aggregate(sento_measures,
          time = list(W = c("equal_weight", "almon1")),
          lexicons = list(LEX = c("LM_en")),
          features = list(journals = c("notInHere", "wapo")))
## End(Not run)
Extracts the sentiment measures data.table in either wide (by default) or long format.
## S3 method for class 'sento_measures'
as.data.table(x, keep.rownames = FALSE, format = "wide", ...)
x |
a |
keep.rownames |
see |
format |
a single |
... |
not used. |
The panel of sentiment measures under sento_measures[["measures"]], in wide or long format.
Samuel Borms
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

sm <- sento_measures(sento_corpus(corpusdf = usnews[1:200, ]),
                     sento_lexicons(list_lexicons["LM_en"]),
                     ctr_agg(lag = 3))

data.table::as.data.table(sm)
data.table::as.data.table(sm, format = "long")
Converts a properly structured sentiment table into a sentiment object that can be used for further aggregation with the aggregate.sentiment function. This makes it possible to start from sentiment scores not necessarily computed with compute_sentiment.
as.sentiment(s)
s |
a |
A sentiment object.
Samuel Borms
set.seed(505)

data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")

ids <- paste0("id", 1:200)
dates <- sample(seq(as.Date("2015-01-01"), as.Date("2018-01-01"), by = "day"), 200, TRUE)
word_count <- sample(150:850, 200, replace = TRUE)
sent <- matrix(rnorm(200 * 8), nrow = 200)
s1 <- s2 <- data.table::data.table(id = ids, date = dates, word_count = word_count, sent)
s3 <- data.frame(id = ids, date = dates, word_count = word_count, sent,
                 stringsAsFactors = FALSE)
s4 <- compute_sentiment(usnews$texts[201:400],
                        sento_lexicons(list_lexicons["GI_en"]),
                        "counts", do.sentence = TRUE)
m <- "method"

colnames(s1)[-c(1:3)] <- paste0(m, 1:8)
sent1 <- as.sentiment(s1)

colnames(s2)[-c(1:3)] <- c(paste0(m, 1:4, "--", "feat1"),
                           paste0(m, 1:4, "--", "feat2"))
sent2 <- as.sentiment(s2)

colnames(s3)[-c(1:3)] <- c(paste0(m, 1:3, "--", "feat1"),
                           paste0(m, 1:3, "--", "feat2"),
                           paste0(m, 4:5))
sent3 <- as.sentiment(s3)

s4[, "date" := rep(dates, s4[, max(sentence_id), by = id][[2]])]
sent4 <- as.sentiment(s4)

# further aggregation from then on is easy...
sentMeas1 <- aggregate(sent1, ctr_agg(lag = 10))
sent5 <- aggregate(sent4, ctr_agg(howDocs = "proportional"), do.full = FALSE)
Converts most common quanteda and tm corpus objects into a sento_corpus object. Appropriate available metadata is integrated as features; for a quanteda corpus, this can come from docvars(x), for a tm corpus, only meta(x, type = "indexed") metadata is considered.
as.sento_corpus(x, dates = NULL, do.clean = FALSE)
x |
a quanteda |
dates |
an optional sequence of dates as |
do.clean |
see |
A sento_corpus object, as returned by the sento_corpus function.
Samuel Borms
corpus, SimpleCorpus, VCorpus, sento_corpus
data("usnews", package = "sentometrics")
txt <- system.file("texts", "txt", package = "tm")
reuters <- system.file("texts", "crude", package = "tm")

# reshuffle usnews data.frame for use in quanteda and tm
dates <- usnews$date
usnews$wrong <- "notNumeric"
colnames(usnews)[c(1, 3)] <- c("doc_id", "text")

# conversion from a quanteda corpus
qcorp <- quanteda::corpus(usnews, text_field = "text", docid_field = "doc_id")
corp1 <- as.sento_corpus(qcorp)
corp2 <- as.sento_corpus(qcorp, sample(dates)) # overwrites "date" column

# conversion from a tm SimpleCorpus corpus (DataframeSource)
tmSCdf <- tm::SimpleCorpus(tm::DataframeSource(usnews))
corp3 <- as.sento_corpus(tmSCdf)

# conversion from a tm SimpleCorpus corpus (DirSource)
tmSCdir <- tm::SimpleCorpus(tm::DirSource(txt))
corp4 <- as.sento_corpus(tmSCdir, dates[1:length(tmSCdir)])

# conversion from a tm VCorpus corpus (DataframeSource)
tmVCdf <- tm::VCorpus(tm::DataframeSource(usnews))
corp5 <- as.sento_corpus(tmVCdf)

# conversion from a tm VCorpus corpus (DirSource)
tmVCdir <- tm::VCorpus(tm::DirSource(reuters),
                       list(reader = tm::readReut21578XMLasPlain))
corp6 <- as.sento_corpus(tmVCdir, dates[1:length(tmVCdir)])
Computes the attributions to predictions for a (given) number of dates at all possible sentiment dimensions, based on the coefficients associated to each sentiment measure, as estimated in the provided model object.
attributions(
  model,
  sento_measures,
  do.lags = TRUE,
  do.normalize = FALSE,
  refDates = NULL,
  factor = NULL
)
model |
a |
sento_measures |
the |
do.lags |
a |
do.normalize |
a |
refDates |
the dates (as |
factor |
the factor level as a single |
See sento_model for an elaborate modeling example including the calculation and plotting of attributions. The attribution for logistic models is represented in terms of log odds. For binomial models, it is calculated with respect to the last factor level or factor column. A NULL value for document-level attribution on a given date means no documents are directly implicated in the associated prediction.
A list of class attributions, with "documents", "lags", "lexicons", "features" and "time" as attribution dimensions. The last four dimensions are data.tables having a "date" column and the other columns the different components of the dimension, with the attributions as values. Document-level attribution is further decomposed into a data.table per date, with "id", "date" and "attrib" columns. If do.lags = FALSE, the "lags" element is set to NULL.
Samuel Borms, Keven Bluteau
Given a corpus of texts, computes sentiment per document or sentence using the valence shifting augmented bag-of-words approach, based on the lexicons provided and a choice of aggregation across words.
compute_sentiment(
  x,
  lexicons,
  how = "proportional",
  tokens = NULL,
  do.sentence = FALSE,
  nCore = 1
)
x |
either a |
lexicons |
a |
how |
a single |
tokens |
a |
do.sentence |
a |
nCore |
a positive |
For a separate calculation of positive (resp. negative) sentiment, provide distinct positive (resp. negative) lexicons (see the do.split option in the sento_lexicons function). All NAs are converted to 0, under the assumption that this is equivalent to no sentiment. By default tokens = NULL, meaning the corpus is internally tokenized as unigrams, with punctuation and numbers but not stopwords removed. All tokens are converted to lowercase, in line with what the sento_lexicons function does for the lexicons and valence shifters. Word counts are based on that same tokenization.
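As a rough base R approximation of the default tokenization described above (lowercase unigrams, with punctuation and numbers removed and stopwords kept), one could write the following. The helper tokenize_unigrams() is hypothetical and will not reproduce the package's tokenizer exactly, for example around apostrophes or hyphens.

```r
# Hypothetical helper (not part of sentometrics): lowercase unigram tokens,
# dropping punctuation and digits but keeping stopwords.
tokenize_unigrams <- function(text) {
  cleaned <- gsub("[[:punct:][:digit:]]+", " ", tolower(text))
  tokens <- strsplit(trimws(cleaned), "[[:space:]]+")[[1]]
  tokens[nzchar(tokens)]
}

tokens <- tokenize_unigrams("Markets rallied 3 times in 2020, analysts say.")

# the word count follows from the same tokenization
word_count <- length(tokens)
```

Supplying your own list of tokens through the tokens argument bypasses the internal tokenization altogether.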
If x is a sento_corpus object: a sentiment object, i.e., a data.table of sentiment scores with an "id", a "date" and a "word_count" column, and all lexicon-feature sentiment scores columns. The tokenized sentences are not provided but can be obtained as stringi::stri_split_boundaries(texts, type = "sentence"). A sentiment object can be aggregated (into time series) with the aggregate.sentiment function.
If x is a quanteda corpus object: a sentiment scores data.table with an "id" and a "word_count" column, and all lexicon-feature sentiment scores columns.
If x is a tm SimpleCorpus object, a tm VCorpus object, or a character vector: a sentiment scores data.table with an auto-created "id" column, a "word_count" column, and all lexicon sentiment scores columns.
When do.sentence = TRUE, an additional "sentence_id" column is added alongside the "id" column.
If the lexicons argument has no "valence" element, the sentiment computed corresponds to simple unigram matching with the lexicons [unigrams approach]. If valence shifters are included in lexicons with a corresponding "y" column, the polarity of a word detected from a lexicon gets multiplied with the associated value of a valence shifter if it appears right before the detected word (examples: not good or can't defend) [bigrams approach]. If the valence table contains a "t" column, valence shifters are searched for in a cluster centered around a detected polarity word [clusters approach]. The latter approach is a simplified version of the one utilized by the sentimentr package. A cluster amounts to four words before and two words after a polarity word. A cluster never overlaps with a preceding one. Roughly speaking, the polarity of a cluster is calculated as $n(1 + 0.80d)S + \sum s$. The polarity score of the detected word is $S$, $s$ represents the polarities of any other sentiment words in the cluster, and $d$ is the difference between the number of amplifiers (t = 2) and the number of deamplifiers (t = 3). If there is an odd number of negators (t = 1), $n = -1$ and amplifiers are counted as deamplifiers, else $n = 1$.
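To make the clusters approach more concrete, here is a minimal base R sketch that assumes the cluster polarity takes the rough form n(1 + 0.80d)S + sum(s), with n = -1 for an odd number of negators and n = 1 otherwise. The function cluster_polarity() and its argument names are hypothetical, and the handling of "amplifiers are counted as deamplifiers" is one possible reading, not the package's internal code.

```r
# Hypothetical helper illustrating the rough cluster formula
# n * (1 + 0.80 * d) * S + sum(s); not the package's internal code.
cluster_polarity <- function(S, s = numeric(0), n_negators = 0,
                             n_amplifiers = 0, n_deamplifiers = 0) {
  if (n_negators %% 2 == 1) {
    # odd number of negators: flip the sign and treat amplifiers as deamplifiers
    n <- -1
    d <- 0 - (n_amplifiers + n_deamplifiers)
  } else {
    n <- 1
    d <- n_amplifiers - n_deamplifiers
  }
  n * (1 + 0.80 * d) * S + sum(s)
}

plain     <- cluster_polarity(S = 1)                    # e.g. "good"
amplified <- cluster_polarity(S = 1, n_amplifiers = 1)  # e.g. "very good"
negated   <- cluster_polarity(S = 1, n_negators = 1)    # e.g. "not good"
```

Under these assumptions an amplifier scales the detected polarity by 1.8, while a single negator flips its sign.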
The sentence-level sentiment calculation approaches each sentence as if it is a document. Depending on the input, either the unigrams, bigrams or clusters approach is used. We enhanced the latter approach to follow the default sentimentr settings more closely. They use a cluster of five words before and two words after a polarized word. The cluster is limited to the words after a previous comma and before a next comma. Adversative conjunctions (t = 4) are accounted for here. The cluster is reweighted based on the value $1 + 0.25adv$, where $adv$ is the difference between the number of adversative conjunctions found before and after the polarized word.
Samuel Borms, Jeroen Van Pelt, Andres Algaba
data("usnews", package = "sentometrics")
txt <- system.file("texts", "txt", package = "tm")
reuters <- system.file("texts", "crude", package = "tm")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
l2 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                     list_valence_shifters[["en"]])
l3 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                     list_valence_shifters[["en"]][, c("x", "t")])

# from a sento_corpus object - unigrams approach
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 200)
sent1 <- compute_sentiment(corpusSample, l1, how = "proportionalPol")

# from a character vector - bigrams approach
sent2 <- compute_sentiment(usnews[["texts"]][1:200], l2, how = "counts")

# from a corpus object - clusters approach
corpusQ <- quanteda::corpus(usnews, text_field = "texts")
corpusQSample <- quanteda::corpus_sample(corpusQ, size = 200)
sent3 <- compute_sentiment(corpusQSample, l3, how = "counts")

# from an already tokenized corpus - using the 'tokens' argument
toks <- as.list(quanteda::tokens(corpusQSample, what = "fastestword"))
sent4 <- compute_sentiment(corpusQSample, l1[1], how = "counts", tokens = toks)

# from a SimpleCorpus object - unigrams approach
scorp <- tm::SimpleCorpus(tm::DirSource(txt))
sent5 <- compute_sentiment(scorp, l1, how = "proportional")

# from a VCorpus object - unigrams approach
## in contrast to what as.sento_corpus(vcorp) would do, the
## sentiment calculator handles multiple character vectors within
## a single corpus element as separate documents
vcorp <- tm::VCorpus(tm::DirSource(reuters))
sent6 <- compute_sentiment(vcorp, l1)

# from a sento_corpus object - unigrams approach with tf-idf weighting
sent7 <- compute_sentiment(corpusSample, l1, how = "TFIDF")

# sentence-by-sentence computation
sent8 <- compute_sentiment(corpusSample, l1, how = "proportionalSquareRoot",
                           do.sentence = TRUE)

# from a (fake) multilingual corpus
usnews[["language"]] <- "en" # add language column
usnews$language[1:100] <- "fr"
lEn <- sento_lexicons(list("FEEL_en" = list_lexicons$FEEL_en_tr,
                           "HENRY" = list_lexicons$HENRY_en),
                      list_valence_shifters$en)
lFr <- sento_lexicons(list("FEEL_fr" = list_lexicons$FEEL_fr),
                      list_valence_shifters$fr)
lexicons <- list(en = lEn, fr = lFr)
corpusLang <- sento_corpus(corpusdf = usnews[1:250, ])
sent9 <- compute_sentiment(corpusLang, lexicons, how = "proportional")
Summarizes the sento_corpus object and returns insights about the evolution of documents, features and tokens over time.
corpus_summarize(x, by = "day", features = NULL)
x |
is a |
by |
a single |
features |
a |
This function summarizes the sento_corpus object by generating statistics about documents, features and tokens over time. The insights can be narrowed down to a chosen set of metadata features. The same tokenization as in the sentiment calculation in compute_sentiment is used.
Returns a list containing:
stats |
a |
plots |
a |
Jeroen Van Pelt, Samuel Borms, Andres Algaba
data("usnews", package = "sentometrics")
corpus <- sento_corpus(usnews)

# summary of corpus by day
summary1 <- corpus_summarize(corpus)

# summary of corpus by month for both journals
summary2 <- corpus_summarize(corpus, by = "month",
                             features = c("wsj", "wapo"))
Sets up control object for (computation of textual sentiment and) aggregation into textual sentiment measures.
ctr_agg(
  howWithin = "proportional",
  howDocs = "equal_weight",
  howTime = "equal_weight",
  do.sentence = FALSE,
  do.ignoreZeros = TRUE,
  by = "day",
  lag = 1,
  fill = "zero",
  alphaExpDocs = 0.1,
  alphasExp = seq(0.1, 0.5, by = 0.1),
  do.inverseExp = FALSE,
  ordersAlm = 1:3,
  do.inverseAlm = TRUE,
  aBeta = 1:4,
  bBeta = 1:4,
  weights = NULL,
  tokens = NULL,
  nCore = 1
)
howWithin |
a single |
howDocs |
a single |
howTime |
a |
do.sentence |
see |
do.ignoreZeros |
a |
by |
a single |
lag |
a single |
fill |
a single |
alphaExpDocs |
a single |
alphasExp |
a |
do.inverseExp |
a |
ordersAlm |
a |
do.inverseAlm |
a |
aBeta |
a |
bBeta |
a |
weights |
optional own weighting scheme(s), used if provided as a |
tokens |
see |
nCore |
see |
For available options on how aggregation can occur (via the howWithin, howDocs and howTime arguments), inspect get_hows. The control parameters associated to howDocs are used both for aggregation across documents and across sentences.
A list encapsulating the control parameters.
Samuel Borms, Keven Bluteau
measures_fill, almons, compute_sentiment
set.seed(505)

# simple control function
ctr1 <- ctr_agg(howTime = "linear", by = "year", lag = 3)

# more elaborate control function (particular attention to time weighting schemes)
ctr2 <- ctr_agg(howWithin = "proportionalPol",
                howDocs = "exponential",
                howTime = c("equal_weight", "linear", "almon", "beta", "exponential", "own"),
                do.ignoreZeros = TRUE,
                by = "day",
                lag = 20,
                ordersAlm = 1:3,
                do.inverseAlm = TRUE,
                alphasExp = c(0.20, 0.50, 0.70, 0.95),
                aBeta = c(1, 3),
                bBeta = c(1, 3, 4, 7),
                weights = data.frame(myWeights = runif(20)),
                alphaExpDocs = 0.3)

# set up control function with one linear and two chosen Almon weighting schemes
a <- weights_almon(n = 70, orders = 1:3, do.inverse = TRUE, do.normalize = TRUE)
ctr3 <- ctr_agg(howTime = c("linear", "own"), by = "year", lag = 70,
                weights = data.frame(a1 = a[, 1], a2 = a[, 3]),
                do.sentence = TRUE)
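As a rough illustration of what a time weighting scheme over lag periods does, the base R sketch below smooths a short sentiment series with equal and linearly increasing weights. The weight formulas, the helper smooth_lags and the toy series are assumptions for illustration; they are not taken from the package's internals.

```r
lag <- 3

# Two illustrative weighting schemes, each summing to one
wEqual <- rep(1 / lag, lag)
wLinear <- (1:lag) / sum(1:lag)  # assumed: most recent period gets the largest weight

# Toy daily sentiment series (hypothetical values)
sent <- c(0.1, 0.3, -0.2, 0.4, 0.0)

# Hypothetical helper: weighted sum over the current and the (lag - 1) previous values
smooth_lags <- function(x, w) {
  n <- length(w)
  sapply(n:length(x), function(t) sum(w * x[(t - n + 1):t]))
}

smoothedEqual <- smooth_lags(sent, wEqual)
smoothedLinear <- smooth_lags(sent, wLinear)
```

Each weighting scheme yields its own smoothed series, which is why supplying several values to howTime multiplies the number of resulting sentiment measures.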
Sets up control object for linear or nonlinear modeling of a response variable onto a large panel of textual sentiment measures (and potentially other variables). See sento_model for details on the estimation and calibration procedure.
ctr_model(
  model = c("gaussian", "binomial", "multinomial"),
  type = c("BIC", "AIC", "Cp", "cv"),
  do.intercept = TRUE,
  do.iter = FALSE,
  h = 0,
  oos = 0,
  do.difference = FALSE,
  alphas = seq(0, 1, by = 0.2),
  lambdas = NULL,
  nSample = NULL,
  trainWindow = NULL,
  testWindow = NULL,
  start = 1,
  do.shrinkage.x = FALSE,
  do.progress = TRUE,
  nCore = 1
)
model |
a |
type |
a |
do.intercept |
a |
do.iter |
a |
h |
an |
oos |
a non-negative |
do.difference |
a |
alphas |
a |
lambdas |
a |
nSample |
a positive |
trainWindow |
a positive |
testWindow |
a positive |
start |
a positive |
do.shrinkage.x |
a |
do.progress |
a |
nCore |
a positive |
A list
encapsulating the control parameters.
Samuel Borms, Keven Bluteau
Tibshirani and Taylor (2012). Degrees of freedom in LASSO problems. The Annals of Statistics 40, 1198-1232, doi:10.1214/12-AOS1003.
Zou, Hastie and Tibshirani (2007). On the degrees of freedom of the LASSO. The Annals of Statistics 35, 2173-2192, doi:10.1214/009053607000000127.
# information criterion based model control functions
ctrIC1 <- ctr_model(model = "gaussian", type = "BIC", do.iter = FALSE, h = 0,
                    alphas = seq(0, 1, by = 0.10))
ctrIC2 <- ctr_model(model = "gaussian", type = "AIC", do.iter = TRUE, h = 4, nSample = 100,
                    do.difference = TRUE, oos = 3)

# cross-validation based model control functions
ctrCV1 <- ctr_model(model = "gaussian", type = "cv", do.iter = FALSE, h = 0,
                    trainWindow = 250, testWindow = 4, oos = 0, do.progress = TRUE)
ctrCV2 <- ctr_model(model = "binomial", type = "cv", h = 0, trainWindow = 250,
                    testWindow = 4, oos = 0, do.progress = TRUE)
ctrCV3 <- ctr_model(model = "multinomial", type = "cv", h = 2, trainWindow = 250,
                    testWindow = 4, oos = 2, do.progress = TRUE)
ctrCV4 <- ctr_model(model = "gaussian", type = "cv", do.iter = TRUE, h = 0, trainWindow = 45,
                    testWindow = 4, oos = 0, nSample = 70, do.progress = TRUE)
Differences the sentiment measures from a sento_measures object.
## S3 method for class 'sento_measures'
diff(x, lag = 1, differences = 1, ...)
x |
a |
lag |
a |
differences |
a |
... |
not used. |
A modified sento_measures object, with the measures replaced by the differenced measures as well as updated statistics.
Samuel Borms
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")], list_valence_shifters[["en"]])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "year", lag = 3)
sento_measures <- sento_measures(corpusSample, l, ctr)

# first-order difference sentiment measures with a lag of two
diffed <- diff(sento_measures, lag = 2, differences = 1)
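The lag and differences arguments have the same names as in base R's diff. The sketch below applies base diff to a plain numeric matrix of toy measures to show what a first-order difference with a lag of two looks like; that the sento_measures method behaves the same way column by column is an assumption here, not something the text above states.

```r
# Toy matrix with two "measures" observed at five dates (hypothetical values)
measures <- cbind(m1 = c(1, 4, 9, 16, 25),
                  m2 = c(2, 2, 3, 5, 8))

# First-order difference with a lag of two: x[t] - x[t - 2], column by column
d <- diff(measures, lag = 2, differences = 1)
print(d)

# Two observations are lost at the start of the series
nrow(measures) - nrow(d)
```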
Monthly news-based U.S. Economic Policy Uncertainty (EPU) index (Baker, Bloom and Davis, 2016). Goes from January 1985 to July 2018, and includes a binomial and a multinomial example series. The following columns are present:
date. Date as "yyyy-mm-01".
index. A numeric monthly index value.
above. A factor with value "above" if the index is greater than the mean of the entire series, else "below".
aboveMulti. A factor with values "above+", "above", "below" and "below-" if the index is greater than the 75% quantile and the 50% quantile, or smaller than the 50% quantile and the 25% quantile, respectively and in a mutually exclusive sense.
data("epu")
A data.frame with 403 rows and 4 columns.
Measuring Economic Policy Uncertainty. Retrieved August 24, 2018.
Baker, Bloom and Davis (2016). Measuring Economic Policy Uncertainty. The Quarterly Journal of Economics 131, 1593-1636, doi:10.1093/qje/qjw024.
data("epu", package = "sentometrics")
head(epu)
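The construction of the above and aboveMulti columns described earlier can be sketched in base R as follows. The index values are made up, and the handling of values that fall exactly on a quantile (here assigned to the lower class) is an assumption.

```r
# Hypothetical index values standing in for the EPU series
index <- c(60, 80, 100, 120, 140, 160, 180, 200)

# Binary series: above or below the mean of the entire series
above <- factor(ifelse(index > mean(index), "above", "below"))

# Four mutually exclusive classes based on the 25%, 50% and 75% quantiles
q <- quantile(index, probs = c(0.25, 0.50, 0.75))
aboveMulti <- cut(index,
                  breaks = c(-Inf, q, Inf),
                  labels = c("below-", "below", "above", "above+"))

table(above)
table(aboveMulti)
```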
Returns the dates of the sentiment time series.
get_dates(sento_measures)
sento_measures |
a |
The "date" column in sento_measures[["measures"]] as a character vector.
Samuel Borms
Returns the components across all three dimensions of the sentiment measures.
get_dimensions(sento_measures)
sento_measures |
a |
The "features", "lexicons" and "time" elements in sento_measures.
Samuel Borms
Outputs the supported aggregation arguments. Call for information purposes only. Used within ctr_agg to check if supplied aggregation hows are supported.
get_hows()
See the package's vignette for a detailed explanation of all aggregation options.
A list with the supported aggregation hows for arguments howWithin ("words"), howDocs ("docs") and howTime ("time"), to be supplied to ctr_agg.
Structures specific performance data for a set of different sento_modelIter objects as loss data. Can then be used, for instance, as an input to create a model confidence set (Hansen, Lunde and Nason, 2011) with the MCS package.
get_loss_data(models, loss = c("DA", "error", "errorSq", "AD", "accuracy"))
models |
a named |
loss |
a single |
A matrix
of loss data.
Samuel Borms
Hansen, Lunde and Nason (2011). The model confidence set. Econometrica 79, 453-497, doi:10.3982/ECTA5771.
## Not run:
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
data("epu", package = "sentometrics")

set.seed(505)

# construct a sento_measures object
corpusAll <- sento_corpus(corpusdf = usnews)
corpus <- quanteda::corpus_subset(corpusAll, date >= "1997-01-01" & date < "2014-10-01")
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")], list_valence_shifters[["en"]])

ctrA <- ctr_agg(howWithin = "proportionalPol", howDocs = "proportional",
                howTime = c("equal_weight", "linear"), by = "month", lag = 3)
sentMeas <- sento_measures(corpus, l, ctrA)

# prepare y and other x variables
y <- epu[epu$date %in% get_dates(sentMeas), "index"]
length(y) == nobs(sentMeas) # TRUE
x <- data.frame(runif(length(y)), rnorm(length(y))) # two other (random) x variables
colnames(x) <- c("x1", "x2")

# estimate different types of regressions
ctrM <- ctr_model(model = "gaussian", type = "AIC", do.iter = TRUE,
                  h = 0, nSample = 120, start = 50)
out1 <- sento_model(sentMeas, y, x = x, ctr = ctrM)
out2 <- sento_model(sentMeas, y, x = NULL, ctr = ctrM)
out3 <- sento_model(subset(sentMeas, select = "linear"), y, x = x, ctr = ctrM)
out4 <- sento_model(subset(sentMeas, select = "linear"), y, x = NULL, ctr = ctrM)

lossData <- get_loss_data(models = list(m1 = out1, m2 = out2, m3 = out3, m4 = out4),
                          loss = "errorSq")

mcs <- MCS::MCSprocedure(lossData)
## End(Not run)
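For intuition about the shape of such loss data, the base R sketch below builds a matrix with one column per model and one row per out-of-sample prediction. The prediction vectors are hypothetical, and the definitions used for the squared error and the absolute deviation are assumptions about what "errorSq" and "AD" stand for, not the package's code.

```r
# Realized values and predictions from two hypothetical models
realized <- c(1.0, 2.0, 3.0)
preds <- list(m1 = c(1.1, 1.8, 3.3),
              m2 = c(0.5, 2.5, 2.0))

# One column per model, one row per prediction date
errorSq <- sapply(preds, function(p) (p - realized)^2)  # assumed "errorSq"
AD <- sapply(preds, function(p) abs(p - realized))      # assumed "AD"

# Average loss per model
colMeans(errorSq)
colMeans(AD)
```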
A list containing all built-in lexicons as a data.table with two columns: a x column with the words, and a y column with the polarities. The list element names incorporate consecutively the name and language (based on the two-letter ISO code convention as in stopwords), and "_tr" as suffix if the lexicon is translated. The translation was done via Microsoft Translator through Microsoft Word. Only the entries that conform to the original language entry after retranslation, and those that have actually been translated, are kept. The last condition is assumed to be fulfilled when the translation differs from the original entry. All words are unigrams and in lowercase. The built-in lexicons are the following:
FEEL_en_tr
FEEL_fr (Abdaoui, Azé, Bringay and Poncelet, 2017)
FEEL_nl_tr
GI_en (General Inquirer, i.e. Harvard IV-4 combined with Lasswell)
GI_fr_tr
GI_nl_tr
HENRY_en (Henry, 2008)
HENRY_fr_tr
HENRY_nl_tr
LM_en (Loughran and McDonald, 2011)
LM_fr_tr
LM_nl_tr
Other useful lexicons can be found in the lexicon package, more specifically the datasets preceded by hash_sentiment_.
data("list_lexicons")
A list with all built-in lexicons, appropriately named as "NAME_language(_tr)".
FEEL lexicon. Retrieved November 1, 2017.
GI lexicon. Retrieved November 1, 2017.
HENRY lexicon. Retrieved November 1, 2017.
LM lexicon. Retrieved November 1, 2017.
Abdaoui, Azé, Bringay and Poncelet (2017). FEEL: French Expanded Emotion Lexicon. Language Resources & Evaluation 51, 833-855, doi:10.1007/s10579-016-9364-5.
Henry (2008). Are investors influenced by how earnings press releases are written?. Journal of Business Communication 45, 363-407, doi:10.1177/0021943608319388.
Loughran and McDonald (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance 66, 35-65, doi:10.1111/j.1540-6261.2010.01625.x.
data("list_lexicons", package = "sentometrics")
list_lexicons[c("FEEL_en_tr", "LM_en")]
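To show how a two-column lexicon of words (x) and polarities (y) can be used, here is a minimal base R sketch that scores a tokenized text by lookup. The lexicon entries and tokens are made up, a plain data.frame replaces the data.table for self-containment, and the scoring rules are simplified assumptions rather than the package's implementation.

```r
# Toy lexicon with the same two-column layout (hypothetical entries)
lex <- data.frame(x = c("good", "bad", "great"),
                  y = c(1, -1, 1),
                  stringsAsFactors = FALSE)

# A tokenized, lowercased text
tokens <- c("the", "outlook", "is", "good", "but", "risks", "are", "bad", "and", "bad")

# Look up each token; tokens not in the lexicon give NA and are ignored
polarity <- lex$y[match(tokens, lex$x)]
scoreCounts <- sum(polarity, na.rm = TRUE)

# Same score relative to the number of tokens
scoreProportional <- scoreCounts / length(tokens)
```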
A list containing all built-in valence word lists, as data.tables with three columns: a x column with the words, a y column with the values associated to each word, and a t column with the type of valence shifter (1 = negators, 2 = amplifiers, 3 = deamplifiers, 4 = adversative conjunctions). The list element names indicate the language (based on the two-letter ISO code convention as in stopwords) of the valence word list. All non-English word lists are translated via Microsoft Translator through Microsoft Word. Only the entries whose translation differs from the original entry are kept. All words are unigrams and in lowercase. The built-in valence word lists are available in the following languages:
English ("en")
French ("fr")
Dutch ("nl")
data("list_valence_shifters")
A list with all built-in valence word lists, appropriately named.
hash_valence_shifters (English valence shifters). Retrieved August 24, 2018.
data("list_valence_shifters", package = "sentometrics")
list_valence_shifters["en"]
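One simple way a valence word list can interact with a lexicon is to let a shifter multiply the polarity of the word that directly follows it. The base R sketch below illustrates this idea only; the word lists, the y values and the one-word look-back rule are all assumptions and do not describe how the package applies valence shifters.

```r
# Toy lexicon and valence word list (hypothetical entries and values)
lex <- data.frame(x = c("good", "bad"), y = c(1, -1), stringsAsFactors = FALSE)
val <- data.frame(x = c("not", "very"),
                  y = c(-1, 1.8),
                  t = c(1, 2),  # 1 = negator, 2 = amplifier
                  stringsAsFactors = FALSE)

tokens <- c("not", "good", "very", "bad")

# Polarity of each token (NA when the token is not in the lexicon)
polarity <- lex$y[match(tokens, lex$x)]

# Shifter value of the preceding token, defaulting to 1 (no shift)
previous <- c(NA, head(tokens, -1))
shift <- val$y[match(previous, val$x)]
shift[is.na(shift)] <- 1

score <- sum(polarity * shift, na.rm = TRUE)
```

Here "not good" contributes -1 and "very bad" contributes -1.8.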
Adds missing dates between earliest and latest date of a sento_measures object or two more extreme boundary dates, such that the time series are continuous date-wise. Fills in any missing date with either 0 or the most recent non-missing value.
measures_fill(sento_measures, fill = "zero", dateBefore = NULL, dateAfter = NULL)
sento_measures |
a |
fill |
an element of |
dateBefore |
a date as |
dateAfter |
a date as |
The dateBefore and dateAfter dates are converted according to the sento_measures[["by"]] frequency.
A modified sento_measures object.
Samuel Borms
# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = sentometrics::usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(sentometrics::list_lexicons[c("LM_en", "HENRY_en")],
                    sentometrics::list_valence_shifters[["en"]])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "day", lag = 7, fill = "none")
sento_measures <- sento_measures(corpusSample, l, ctr)

# fill measures
f1 <- measures_fill(sento_measures)
f2 <- measures_fill(sento_measures, fill = "latest")
f3 <- measures_fill(sento_measures, fill = "zero",
                    dateBefore = get_dates(sento_measures)[1] - 10,
                    dateAfter = tail(get_dates(sento_measures), 1) + 15)
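The two filling strategies can be illustrated on a plain data frame in base R. This is a sketch of the idea (complete the date range, then fill gaps with zero or with the most recent value), using made-up data and column names; it is not the package's implementation.

```r
# Observed sentiment on three non-consecutive days (hypothetical values)
obs <- data.frame(date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-05")),
                  sent = c(0.2, -0.1, 0.4))

# Complete daily date range between the earliest and latest date
full <- data.frame(date = seq(min(obs$date), max(obs$date), by = "day"))
filled <- merge(full, obs, by = "date", all.x = TRUE)

# Strategy 1: fill missing dates with zero
zeroFill <- filled
zeroFill$sent[is.na(zeroFill$sent)] <- 0

# Strategy 2: carry the most recent non-missing value forward
latestFill <- filled
idx <- cumsum(!is.na(latestFill$sent))  # position of the latest observed value
latestFill$sent <- latestFill$sent[!is.na(latestFill$sent)][idx]
```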
Updates a sento_measures object based on a new sento_corpus provided. Sentiment for the unseen corpus texts is calculated and aggregated by applying the control variables from the input sento_measures object.
measures_update(sento_measures, sento_corpus, lexicons)
sento_measures |
|
sento_corpus |
a |
lexicons |
a |
An updated sento_measures object.
Jeroen Van Pelt, Samuel Borms, Andres Algaba
sento_measures, compute_sentiment
data("usnews", package = "sentometrics")

corpus1 <- sento_corpus(usnews[1:500, ])
corpus2 <- sento_corpus(usnews[400:2000, ])

ctr <- ctr_agg(howTime = "linear", by = "year", lag = 3)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                    list_valence_shifters[["en"]])
sento_measures <- sento_measures(corpus1, l, ctr)
sento_measuresNew <- measures_update(sento_measures, corpus2, l)
Combines multiple sentiment objects with possibly different column names into a new sentiment object. Here, too, any resulting NA values are converted to zero.
## S3 method for class 'sentiment'
merge(...)
... |
|
The new, combined, sentiment object, ordered by "date" and "id".
Samuel Borms
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
l2 <- sento_lexicons(list_lexicons[c("FEEL_en_tr")])
l3 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en", "FEEL_en_tr")])

corp1 <- sento_corpus(corpusdf = usnews[1:200, ])
corp2 <- sento_corpus(corpusdf = usnews[201:450, ])
corp3 <- sento_corpus(corpusdf = usnews[401:700, ])

s1 <- compute_sentiment(corp1, l1, "proportionalPol")
s2 <- compute_sentiment(corp2, l1, "counts")
s3 <- compute_sentiment(corp3, l1, "counts")
s4 <- compute_sentiment(corp2, l1, "counts", do.sentence = TRUE)
s5 <- compute_sentiment(corp3, l2, "proportional", do.sentence = TRUE)
s6 <- compute_sentiment(corp3, l1, "counts", do.sentence = TRUE)
s7 <- compute_sentiment(corp3, l3, "UShaped", do.sentence = TRUE)

# straightforward row-wise merge
m1 <- merge(s1, s2, s3)
nrow(m1) == 700 # TRUE

# another straightforward row-wise merge
m2 <- merge(s4, s6)

# merge of sentence and non-sentence calculations
m3 <- merge(s3, s6)

# different methods adds columns
m4 <- merge(s4, s5)
nrow(m4) == nrow(m2) # TRUE

# different methods and weighting adds rows and columns
## rows are added only when the different weighting
## approach for a specific method gives other sentiment values
m5 <- merge(s4, s7)
nrow(m5) > nrow(m4) # TRUE
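The idea of combining tables with different column names and turning the resulting NA values into zero can be sketched in base R as below. This simplified version only stacks rows; the toy tables, the helper pad and the absence of any handling for overlapping documents are assumptions, so it should not be read as the method's actual logic.

```r
# Two toy sentiment tables with different score columns (hypothetical data)
s1 <- data.frame(id = c("a", "b"), lexA = c(0.1, 0.2), stringsAsFactors = FALSE)
s2 <- data.frame(id = "c", lexB = -0.3, stringsAsFactors = FALSE)

allCols <- union(names(s1), names(s2))

# Hypothetical helper: add the missing columns as zero, then align column order
pad <- function(d) {
  d[setdiff(allCols, names(d))] <- 0
  d[allCols]
}

merged <- rbind(pad(s1), pad(s2))
merged <- merged[order(merged$id), ]
print(merged)
```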
Returns the number of sentiment measures.
nmeasures(sento_measures)
sento_measures |
a |
The number of sentiment measures in the input sento_measures object.
Samuel Borms
Returns the number of data points available in the sentiment measures.
## S3 method for class 'sento_measures'
nobs(object, ...)
object |
a |
... |
not used. |
The number of rows (observations/data points) in object[["measures"]].
Samuel Borms
This function extracts the dates for which aggregated time series sentiment is most extreme (lowest, highest or both in absolute terms). The extracted dates are unique, even when, for example, all most extreme sentiment values (for different sentiment measures) occur on only one date.
peakdates(sento_measures, n = 10, type = "both", do.average = FALSE)
sento_measures |
a |
n |
a positive |
type |
a |
do.average |
a |
A vector of type "Date" corresponding to the n extracted sentiment peak dates.
Samuel Borms
set.seed(505)

data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")], list_valence_shifters[["en"]])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "month", lag = 3)
sento_measures <- sento_measures(corpusSample, l, ctr)

# extract the peaks
peaksAbs <- peakdates(sento_measures, n = 5)
peaksAbsQuantile <- peakdates(sento_measures, n = 0.50)
peaksPos <- peakdates(sento_measures, n = 5, type = "pos")
peaksNeg <- peakdates(sento_measures, n = 5, type = "neg")
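The notion of extracting the most extreme dates (highest, lowest, or largest in absolute value) can be shown with a single toy series in base R. The data and the selection rule below are assumptions for illustration; the function itself works across several sentiment measures at once.

```r
# One toy sentiment series over six days (hypothetical values)
dates <- as.Date("2020-01-01") + 0:5
sent <- c(0.5, -0.9, 0.1, 0.7, -0.2, 0.3)
n <- 2

# Most extreme in absolute terms, most positive, and most negative dates
peaksBoth <- dates[order(abs(sent), decreasing = TRUE)[1:n]]
peaksPos <- dates[order(sent, decreasing = TRUE)[1:n]]
peaksNeg <- dates[order(sent)[1:n]]
```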
This function extracts the documents with most extreme sentiment (lowest, highest or both in absolute terms). The extracted documents are unique, even when, for example, all most extreme sentiment values (across sentiment calculation methods) occur only for one document.
peakdocs(sentiment, n = 10, type = "both", do.average = FALSE)
sentiment |
a |
n |
a positive |
type |
a |
do.average |
a |
A vector of type "character" corresponding to the n extracted document identifiers.
Samuel Borms
set.seed(505)

data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])

corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 200)
sent <- compute_sentiment(corpusSample, l, how = "proportionalPol")

# extract the peaks
peaksAbs <- peakdocs(sent, n = 5)
peaksAbsQuantile <- peakdocs(sent, n = 0.50)
peaksPos <- peakdocs(sent, n = 5, type = "pos")
peaksNeg <- peakdocs(sent, n = 5, type = "neg")
Shows a plot of the attributions along the dimension provided, stacked per date.
## S3 method for class 'attributions'
plot(x, group = "features", ...)
x |
an |
group |
a value from |
... |
not used. |
See sento_model for an elaborate modeling example including the calculation and plotting of attributions. This function does not handle the plotting of the attribution of individual documents, since there are often a lot of documents involved and they appear only once at one date (even though a document may contribute to predictions at several dates, depending on the number of lags in the time aggregation).
Returns a simple ggplot object, which can be extended, or whose default elements can be altered, by using the + operator. By default, a legend is positioned at the top if the dimension has at most twelve components.
Samuel Borms, Keven Bluteau
Plotting method that shows all sentiment measures from the provided sento_measures object in one plot, or the average along one of the lexicons, features and time weighting dimensions.
## S3 method for class 'sento_measures'
plot(x, group = "all", ...)
x |
a |
group |
a value from |
... |
not used. |
Returns a simple ggplot object, which can be extended, or whose default elements can be altered, by using the + operator (see example). By default, a legend is positioned at the top if at most twelve line graphs are plotted and group is different from "all".
Samuel Borms
# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = sentometrics::usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(sentometrics::list_lexicons[c("LM_en")],
                    sentometrics::list_valence_shifters[["en"]])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "month", lag = 3)
sm <- sento_measures(corpusSample, l, ctr)

# plot sentiment measures
plot(sm, "features")

## Not run:
# adjust appearance of plot
library("ggplot2")
p <- plot(sm)
p <- p +
  scale_x_date(name = "year", date_labels = "%Y") +
  scale_y_continuous(name = "newName")
p
## End(Not run)
Displays a plot of all predictions made through the iterative model computation as incorporated in the input sento_modelIter object, as well as the corresponding true values.
## S3 method for class 'sento_modelIter'
plot(x, ...)
x |
a |
... |
not used. |
See sento_model for an elaborate modeling example including the plotting of out-of-sample performance.
Returns a simple ggplot object, which can be extended, or whose default elements can be altered, by using the + operator.
Samuel Borms
Prediction method for sento_model class, with usage along the lines of predict.glmnet, but simplified in terms of parameters.
## S3 method for class 'sento_model'
predict(object, newx, type = "response", offset = NULL, ...)
object |
a |
newx |
a data |
type |
type of prediction required, a value from |
offset |
not used. |
... |
not used. |
A prediction output depending on the type argument.
Samuel Borms
Scales and centers the sentiment measures from a sento_measures object, column-per-column. By default, the measures are normalized. NAs are removed first.
## S3 method for class 'sento_measures'
scale(x, center = TRUE, scale = TRUE)
x |
a |
center |
a |
scale |
a |
If one of the arguments center or scale is a matrix, this operation will be applied first, and any remaining centering or scaling is then computed on the resulting data.
A modified sento_measures object, with the measures replaced by the scaled measures as well as updated statistics.
Samuel Borms
data("usnews", package = "sentometrics") data("list_lexicons", package = "sentometrics") data("list_valence_shifters", package = "sentometrics") set.seed(505) # construct a sento_measures object to start with corpus <- sento_corpus(corpusdf = usnews) corpusSample <- quanteda::corpus_sample(corpus, size = 500) l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")]) ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "year", lag = 3) sento_measures <- sento_measures(corpusSample, l, ctr) # scale sentiment measures to zero mean and unit standard deviation sc1 <- scale(sento_measures) n <- nobs(sento_measures) m <- nmeasures(sento_measures) # subtract a matrix sc2 <- scale(sento_measures, center = matrix(runif(n * m), n, m), scale = FALSE) # divide every row observation based on a one-column matrix, then center sc3 <- scale(sento_measures, center = TRUE, scale = matrix(runif(n)))
Formalizes a collection of texts into a sento_corpus object derived from the quanteda corpus object. The quanteda package provides a robust text mining infrastructure (see their website), including a handy corpus manipulation toolset. This function performs a set of checks on the input data and prepares the corpus for further analysis by structurally integrating a date dimension and numeric metadata features.
sento_corpus(corpusdf, do.clean = FALSE)
corpusdf |
a |
do.clean |
a |
A sento_corpus object is a specialized instance of a quanteda corpus. Any quanteda function applicable to its corpus object can also be applied to a sento_corpus object. However, changing a given sento_corpus object too drastically using some of quanteda's functions might alter the very structure the corpus needs (as defined in the corpusdf argument) to serve as an input to other functions of the sentometrics package. Some functions, including corpus_sample and corpus_subset, do not change the actual corpus structure and may come in handy.
To add additional features, use add_features. Binary features are useful as a mechanism to select the texts that have to be integrated in the respective feature-based sentiment measure(s), but this selection applies only when do.ignoreZeros = TRUE. Because of this (implicit) selection, having complementary features (e.g., "economy" and "noneconomy") makes sense.
It is also possible to add one non-numerical feature, that is, "language", to designate the language of the corpus texts. When this feature is provided, a list of lexicons for different languages is expected in the compute_sentiment function.
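A minimal sketch of an input data.frame follows, assuming the column layout described for the built-in usnews data (id, date, texts, followed by numeric features). All values are made up, and the call to sento_corpus is only attempted when the package is installed.

```r
# Toy input in the layout of the built-in usnews data: id, date, texts,
# followed by numeric feature columns. All values are hypothetical.
corpusdf <- data.frame(
  id         = c("doc1", "doc2", "doc3"),
  date       = c("2020-01-15", "2020-02-03", "2020-02-20"),
  texts      = c("Markets rallied on strong earnings.",
                 "The local team lost again.",
                 "Inflation fears weigh on bond prices."),
  economy    = c(1, 0, 1),
  noneconomy = c(0, 1, 0),  # complementary to 'economy'
  stringsAsFactors = FALSE
)

# Complementary binary features sum to one for every text.
stopifnot(all(corpusdf$economy + corpusdf$noneconomy == 1))

# Only attempt the conversion when the package is available.
if (requireNamespace("sentometrics", quietly = TRUE)) {
  corp <- sentometrics::sento_corpus(corpusdf = corpusdf)
}
```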
A sento_corpus object, derived from a quanteda corpus object. The corpus is ordered by date.
Samuel Borms
data("usnews", package = "sentometrics")

# corpus construction
corp <- sento_corpus(corpusdf = usnews)

# take a random subset making use of quanteda
corpusSmall <- quanteda::corpus_sample(corp, size = 500)

# deleting a feature
quanteda::docvars(corp, field = "wapo") <- NULL

# deleting all features results in the addition of a dummy feature
quanteda::docvars(corp, field = c("economy", "noneconomy", "wsj")) <- NULL

## Not run: 
# to add or replace features, use the add_features() function...
quanteda::docvars(corp, field = c("wsj", "new")) <- 1
## End(Not run)

# corpus creation when no features are present
corpusDummy <- sento_corpus(corpusdf = usnews[, 1:3])

# corpus creation with a qualitative language feature
usnews[["language"]] <- "en"
usnews[["language"]][c(200:400)] <- "nl"
corpusLang <- sento_corpus(corpusdf = usnews)
Structures provided lexicon(s) and optionally valence words. One can for example combine (part of) the built-in lexicons from data("list_lexicons") with other lexicons, and add one of the built-in valence word lists from data("list_valence_shifters"). This function makes the output coherent, by converting all words to lowercase and checking for duplicates. All entries consisting of more than one word are discarded, as required for bag-of-words sentiment analysis.
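The cleaning steps described above can be illustrated on a toy lexicon in base R. This is a sketch of the idea only, not the package's code; in particular, keeping the first of several duplicated words is an assumption made here.

```r
# Toy lexicon with mixed case, a duplicate and a multi-word entry (hypothetical).
lex <- data.frame(
  x = c("Good", "bad", "good", "not bad", "Awful"),
  y = c(1, -1, 2, 0.5, -2),
  stringsAsFactors = FALSE
)

lex$x <- tolower(lex$x)             # convert all words to lowercase
lex <- lex[!duplicated(lex$x), ]    # assumption: keep the first duplicate
lex <- lex[!grepl("\\s", lex$x), ]  # discard entries of more than one word
rownames(lex) <- NULL

lex
```

After these steps only single lowercase words remain, each with one score, which is what a bag-of-words computation needs.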
sento_lexicons(lexiconsIn, valenceIn = NULL, do.split = FALSE)
lexiconsIn |
a named |
valenceIn |
a single valence word list as a |
do.split |
a |
A list of class sento_lexicons with each lexicon as a separate element according to its name, as a data.table, and optionally an element named valence that comprises the valence words. Every "x" column contains the words, every "y" column contains the scores. The "t" column for valence shifters contains the different types.
Samuel Borms
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

# lexicons straight from built-in word lists
l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])

# including a self-made lexicon, with and without valence shifters
lexIn <- c(list(myLexicon = data.table::data.table(w = c("nice", "boring"), s = c(2, -1))),
           list_lexicons[c("GI_en")])
valIn <- list_valence_shifters[["en"]]
l2 <- sento_lexicons(lexIn)
l3 <- sento_lexicons(lexIn, valIn)
l4 <- sento_lexicons(lexIn, valIn[, c("x", "y")], do.split = TRUE)
l5 <- sento_lexicons(lexIn, valIn[, c("x", "t")], do.split = TRUE)
l6 <- l5[c("GI_en_POS", "valence")] # preserves sento_lexicons class

## Not run: 
# include lexicons from lexicon package
lexIn2 <- list(hul = lexicon::hash_sentiment_huliu, joc = lexicon::hash_sentiment_jockers)
l7 <- sento_lexicons(c(lexIn, lexIn2), valIn)
## End(Not run)

## Not run: 
# faulty extraction, no replacement allowed
l5["valence"]
l2[0]
l3[22]
l4[1] <- l2[1]
l4[[1]] <- l2[[1]]
l4$GI_en_NEG <- l2$myLexicon
## End(Not run)
Wrapper function which assembles calls to compute_sentiment and aggregate. Serves as the most direct way towards a panel of textual sentiment measures as a sento_measures object.
sento_measures(sento_corpus, lexicons, ctr)
sento_corpus |
a |
lexicons |
a |
ctr |
output from a |
As a general rule, the names of the features, lexicons and time weighting schemes may not contain any ‘-’ symbol.
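The examples in this manual refer to measures by names such as "LM_en--economy--linear", which suggests that the lexicon, feature and time weighting scheme are joined by "--". Under that assumption, the base R sketch below shows how such names decompose and how a hypothetical feature name containing the symbol breaks the three-part structure.

```r
# Measure names as they appear in this manual's examples:
# lexicon, feature and time weighting scheme, joined by "--".
nms <- c("LM_en--economy--linear", "HENRY_en--wsj--equal_weight")

parts <- strsplit(nms, "--", fixed = TRUE)
comp <- do.call(rbind, parts)
colnames(comp) <- c("lexicon", "feature", "time")
comp

# A hypothetical feature named "us--economy" makes the split ambiguous.
bad <- "LM_en--us--economy--linear"
length(strsplit(bad, "--", fixed = TRUE)[[1]])  # 4 components instead of 3
```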
A sento_measures object, which is a list containing:
measures |
a |
features |
a |
lexicons |
a |
time |
a |
stats |
a |
sentiment |
the document-level sentiment scores |
attribWeights |
a |
ctr |
a |
Samuel Borms, Keven Bluteau
compute_sentiment, aggregate, measures_update
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")], list_valence_shifters[["en"]])
ctr <- ctr_agg(howWithin = "counts",
               howDocs = "proportional",
               howTime = c("equal_weight", "linear", "almon"),
               by = "month",
               lag = 3,
               ordersAlm = 1:3,
               do.inverseAlm = TRUE)
sento_measures <- sento_measures(corpusSample, l, ctr)
summary(sento_measures)
Linear or nonlinear penalized regression of any dependent variable on the large number of sentiment measures and potentially other explanatory variables. Either performs a regression given the provided variables at once, or computes regressions sequentially for a given sample size over a longer time horizon, with associated prediction performance metrics.
sento_model(sento_measures, y, x = NULL, ctr)
sento_measures |
a |
y |
a one-column |
x |
a named |
ctr |
output from a |
Models are computed using the elastic net regularization as implemented in the glmnet package, to account for the multidimensionality of the sentiment measures. Independent variables are normalized in the regression process, but coefficients are returned in their original space. For a helpful introduction to glmnet, we refer to their vignette. The optimal elastic net parameters lambda and alpha are calibrated either through an information criterion to be specified or through cross-validation (based on the "rolling forecasting origin" principle, using the train function). In the latter case, the training metric is automatically set to "RMSE" for a linear model and to "Accuracy" for a logistic model. We suppress many of the details that can be supplied to the glmnet and train functions we rely on, for the sake of user-friendliness.
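The "rolling forecasting origin" principle can be sketched in base R as a sequence of training windows, each followed by a test window. The helper rolling_origin below is hypothetical and assumes a fixed-size training window moved forward one observation at a time; the train function the package relies on may configure its slices differently. The argument names mirror the trainWindow and testWindow settings used in the examples of this manual.

```r
# Hypothetical helper: fixed-size training window followed by a test window,
# with the origin rolled forward one observation at a time.
rolling_origin <- function(n, trainWindow, testWindow) {
  stopifnot(n >= trainWindow + testWindow)
  starts <- seq_len(n - trainWindow - testWindow + 1)
  lapply(starts, function(s) {
    list(train = s:(s + trainWindow - 1),
         test  = (s + trainWindow):(s + trainWindow + testWindow - 1))
  })
}

slices <- rolling_origin(n = 100, trainWindow = 70, testWindow = 10)
length(slices)             # 21 train/test splits
range(slices[[1]]$train)   # 1 70
range(slices[[1]]$test)    # 71 80
```

Every test window lies strictly after its training window, so no future observation leaks into the calibration of alpha and lambda.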
If ctr$do.iter = FALSE, a sento_model object which is a list containing:
reg |
optimized regression, i.e., a model-specific glmnet object, including for example the estimated coefficients. |
model |
the input argument |
alpha |
calibrated alpha. |
lambda |
calibrated lambda. |
trained |
output from |
ic |
a |
dates |
sample reference dates as a two-element |
nVar |
a vector of size two, with respectively the number of sentiment measures, and the number of other explanatory variables inputted. |
discarded |
a named |
If ctr$do.iter = TRUE, a sento_modelIter object which is a list containing:
models |
all sparse regressions, i.e., separate |
alphas |
calibrated alphas. |
lambdas |
calibrated lambdas. |
performance |
a |
Samuel Borms, Keven Bluteau
ctr_model, glmnet, train, attributions
## Not run: 
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
data("epu", package = "sentometrics")

set.seed(505)

# construct a sento_measures object to start with
corpusAll <- sento_corpus(corpusdf = usnews)
corpus <- quanteda::corpus_subset(corpusAll, date >= "2004-01-01")
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
ctr <- ctr_agg(howWithin = "counts", howDocs = "proportional",
               howTime = c("equal_weight", "linear"),
               by = "month", lag = 3)
sento_measures <- sento_measures(corpus, l, ctr)

# prepare y and other x variables
y <- epu[epu$date %in% get_dates(sento_measures), "index"]
length(y) == nobs(sento_measures) # TRUE
x <- data.frame(runif(length(y)), rnorm(length(y))) # two other (random) x variables
colnames(x) <- c("x1", "x2")

# a linear model based on the Akaike information criterion
ctrIC <- ctr_model(model = "gaussian", type = "AIC", do.iter = FALSE, h = 4,
                   do.difference = TRUE)
out1 <- sento_model(sento_measures, y, x = x, ctr = ctrIC)

# attribution and prediction as post-analysis
attributions1 <- attributions(out1, sento_measures,
                              refDates = get_dates(sento_measures)[20:25])
plot(attributions1, "features")

nx <- nmeasures(sento_measures) + ncol(x)
newx <- runif(nx) * cbind(data.table::as.data.table(sento_measures)[, -1], x)[30:40, ]
preds <- predict(out1, newx = as.matrix(newx), type = "link")

# an iterative out-of-sample analysis, parallelized
ctrIter <- ctr_model(model = "gaussian", type = "BIC", do.iter = TRUE, h = 3,
                     oos = 2, alphas = c(0.25, 0.75), nSample = 75, nCore = 2)
out2 <- sento_model(sento_measures, y, x = x, ctr = ctrIter)
summary(out2)

# plot predicted vs. realized values
p <- plot(out2)
p

# a cross-validation based model, parallelized
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
ctrCV <- ctr_model(model = "gaussian", type = "cv", do.iter = FALSE,
                   h = 0, alphas = c(0.10, 0.50, 0.90), trainWindow = 70,
                   testWindow = 10, oos = 0, do.progress = TRUE)
out3 <- sento_model(sento_measures, y, x = x, ctr = ctrCV)
parallel::stopCluster(cl)
foreach::registerDoSEQ()
summary(out3)

# a cross-validation based model for a binomial target
yb <- epu[epu$date %in% get_dates(sento_measures), "above"]
ctrCVb <- ctr_model(model = "binomial", type = "cv", do.iter = FALSE,
                    h = 0, alphas = c(0.10, 0.50, 0.90), trainWindow = 70,
                    testWindow = 10, oos = 0, do.progress = TRUE)
out4 <- sento_model(sento_measures, yb, x = x, ctr = ctrCVb)
summary(out4)

## End(Not run)
Subsets rows of the sentiment measures based on their columns.
## S3 method for class 'sento_measures'
subset(x, subset = NULL, select = NULL, delete = NULL, ...)
x |
a |
subset |
a logical (non- |
select |
a |
delete |
see the |
... |
not used. |
A modified sento_measures object, with only the remaining rows and sentiment measures, including updated information and statistics, but the original sentiment scores data.table untouched.
Samuel Borms
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "year", lag = 3)
sm <- sento_measures(corpusSample, l, ctr)

# three specified indices in required list format
three <- as.list(
  stringi::stri_split(c("LM_en--economy--linear",
                        "HENRY_en--wsj--equal_weight",
                        "HENRY_en--wapo--equal_weight"),
                      regex = "--")
)

# different subsets
sub1 <- subset(sm, HENRY_en--economy--equal_weight >= 0.01)
sub2 <- subset(sm, date %in% get_dates(sm)[3:12])
sub3 <- subset(sm, 3:12)
sub4 <- subset(sm, 1:100) # warning

# different selections
sel1 <- subset(sm, select = "equal_weight")
sel2 <- subset(sm, select = c("equal_weight", "linear"))
sel3 <- subset(sm, select = c("linear", "LM_en"))
sel4 <- subset(sm, select = list(c("linear", "wsj"), c("linear", "economy")))
sel5 <- subset(sm, select = three)

# different deletions
del1 <- subset(sm, delete = "equal_weight")
del2 <- subset(sm, delete = c("linear", "LM_en"))
del3 <- subset(sm, delete = list(c("linear", "wsj"), c("linear", "economy")))
del4 <- subset(sm, delete = c("equal_weight", "linear")) # warning
del5 <- subset(sm, delete = three)
A collection of texts annotated by humans in terms of relevance to the U.S. economy or not. The texts come from two major journals in the U.S. (The Wall Street Journal and The Washington Post) and cover 4145 documents between 1995 and 2014. It contains the following information:
id. A character identifier.
date. Date as "yyyy-mm-dd".
texts. Texts in character format.
wsj. Equals 1 if the article comes from The Wall Street Journal.
wapo. Equals 1 if the article comes from The Washington Post (complementary to ‘wsj’).
economy. Equals 1 if the article is relevant to the U.S. economy.
noneconomy. Equals 1 if the article is not relevant to the U.S. economy (complementary to ‘economy’).
data("usnews")
A data.frame, formatted as required to be an input for sento_corpus.
Economic News Article Tone and Relevance. Retrieved November 1, 2017.
data("usnews", package = "sentometrics")
usnews[3192, "texts"]
usnews[1:5, c("id", "date", "texts")]
Computes Almon polynomial weighting curves. Handy to self-select specific time aggregation weighting schemes for input in ctr_agg using the weights argument.
weights_almon(n, orders = 1:3, do.inverse = TRUE, do.normalize = TRUE)
n |
a single |
orders |
a |
do.inverse |
|
do.normalize |
a |
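The Almon polynomial formula implemented is: $(1 - (1 - i/n)^{r})(1 - i/n)^{R - r}$, where $i$ is the lag index ordered from 1 to $n$, $r$ is an Almon order and $R$ is the maximum order. The inverse is computed by changing $i/n$ to $1 - i/n$.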
A data.frame of all Almon polynomial weighting curves, of size length(orders) (times two if do.inverse = TRUE).
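A minimal base R sketch of such curves follows. It assumes the weight for lag i and order r is (1 - (1 - i/n)^r) * (1 - i/n)^(R - r), with R the largest order, and that normalization divides each curve by its sum. The function almon_sketch and its column names are hypothetical and need not match the output of weights_almon exactly.

```r
# Sketch of Almon weighting curves under the assumed formula
# (1 - (1 - i/n)^r) * (1 - i/n)^(R - r), with R the largest order.
almon_sketch <- function(n, orders = 1:3, do.inverse = TRUE, do.normalize = TRUE) {
  R <- max(orders)
  x <- (1:n) / n  # standardized lag index i/n
  curve <- function(z, r) (1 - (1 - z)^r) * (1 - z)^(R - r)
  out <- list()
  for (r in orders) {
    out[[paste0("almon", r)]] <- curve(x, r)
    # the inverse curve replaces i/n by 1 - i/n
    if (do.inverse) out[[paste0("almon", r, "_inv")]] <- curve(1 - x, r)
  }
  out <- as.data.frame(out)
  if (do.normalize) out <- as.data.frame(lapply(out, function(w) w / sum(w)))
  out
}

w <- almon_sketch(n = 10)
dim(w)      # 10 rows, 6 curves (3 orders, each with its inverse)
colSums(w)  # every curve sums to one after normalization
```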
Computes Beta weighting curves as in Ghysels, Sinko and Valkanov (2007). Handy to self-select specific time aggregation weighting schemes for input in ctr_agg using the weights argument.
weights_beta(n, a = 1:4, b = 1:4, do.normalize = TRUE)
n |
a single |
a |
a |
b |
a |
do.normalize |
a |
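The Beta weighting abides by the following formula: $f(i/n; a, b) / \sum_{i} f(i/n; a, b)$, where $i$ is the lag index ordered from 1 to $n$, $a$ and $b$ are two decay parameters, and $f(x; a, b) = \frac{x^{a - 1}(1 - x)^{b - 1}\Gamma(a + b)}{\Gamma(a)\Gamma(b)}$, where $\Gamma(\cdot)$ is the gamma function.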
A data.frame of beta weighting curves per combination of a and b. If n = 1, all weights are set to 1.
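The Beta weighting can be sketched in base R as below, assuming the density form of Ghysels, Sinko and Valkanov (2007), f(x; a, b) = x^(a - 1) (1 - x)^(b - 1) gamma(a + b) / (gamma(a) gamma(b)), evaluated at x = i/n. The function beta_sketch and its column names are hypothetical, not the package's own.

```r
# Sketch of Beta weighting curves, one per combination of a and b.
beta_sketch <- function(n, a = 1:4, b = 1:4, do.normalize = TRUE) {
  x <- (1:n) / n  # standardized lag index i/n
  f <- function(x, a, b) {
    x^(a - 1) * (1 - x)^(b - 1) * gamma(a + b) / (gamma(a) * gamma(b))
  }
  grid <- expand.grid(a = a, b = b)  # every combination of a and b
  out <- lapply(seq_len(nrow(grid)), function(k) {
    if (n == 1) return(1)            # a single lag gets a weight of 1
    w <- f(x, grid$a[k], grid$b[k])
    if (do.normalize) w / sum(w) else w
  })
  names(out) <- paste0("beta_a", grid$a, "_b", grid$b)
  as.data.frame(out)
}

wb <- beta_sketch(n = 10)
dim(wb)      # 10 rows, 16 curves (4 values of a times 4 values of b)
colSums(wb)  # every curve sums to one after normalization
```

With a = b = 1 the curve is flat, so every lag receives the same weight; larger values of a or b shift the mass towards one end of the lag window.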
Ghysels, Sinko and Valkanov (2007). MIDAS regressions: Further results and new directions. Econometric Reviews 26, 53-90, doi:10.1080/07474930600972467.
Computes exponential weighting curves. Handy to self-select specific time aggregation weighting schemes for input in ctr_agg using the weights argument.
weights_exponential(
  n,
  alphas = seq(0.1, 0.5, by = 0.1),
  do.inverse = FALSE,
  do.normalize = TRUE
)
n |
a single |
alphas |
a |
do.inverse |
|
do.normalize |
a |
A data.frame of exponential weighting curves per value of alphas.