NEWS


quanteda 4.1

Bug fixes and stability enhancements

Changes and additions

quanteda 4.0.2 (2024-04-24)

Bug fixes and stability enhancements

quanteda 4.0.1 (2024-04-08)

Bug fixes and stability enhancements

quanteda 4.0.0 (2024-04-04)

Changes and additions

Removals

Deprecations

Bug fixes and stability enhancements

quanteda 3.3.1 (2023-05-18)

Bug fixes and stability enhancements

quanteda 3.3.0 (2023-04-07)

Changes and additions

Bug fixes and stability enhancements

quanteda 3.2.5

Changes and additions

quanteda 3.2.4 (2022-12-08)

Bug fixes and stability enhancements

Fixes test failures caused by recent changes to Matrix package behaviours on some operating systems.

quanteda 3.2.3 (2022-08-29)

Bug fixes and stability enhancements

quanteda 3.2.2 (2022-08-09)

Bug fixes and stability enhancements

quanteda 3.2.1 (2022-03-01)

Bug fixes and stability enhancements

Changes and additions

quanteda 3.2

Bug fixes and stability enhancements

Changes and additions

quanteda 3.1

Bug fixes and stability enhancements

Changes and additions

Deprecations

quanteda 3.0

quanteda 3.0 is a major release that improves functionality, completes the modularisation of the package begun in v2.0, further improves function consistency by removing previously deprecated functions, and enhances workflow stability and consistency by deprecating some shortcut steps built into some functions.

Changes and additions

Deprecations

The main potentially breaking changes in version 3 relate to the deprecation or elimination of shortcut steps that allowed functions requiring tokens inputs to skip the tokens creation step. We did this to require users to take more direct control of tokenization options, or to substitute the alternative tokeniser of their choice (and then coerce its output to tokens via as.tokens()). This also allows our function behaviour to be more consistent, with each function performing a single task, rather than combining tasks (such as tokenisation and constructing a matrix).

The most common example involves constructing a dfm directly from a character or corpus object. Formerly, this would construct a tokens object internally before creating the dfm, and allowed passing arguments to tokens() via .... This is now deprecated, although still functional with a warning.

We strongly encourage either creating a tokens object first, or piping the tokens return to dfm() using %>%. (See examples below.)

We have also deprecated direct character or corpus inputs to kwic(), since this also requires a tokenised input.
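As a minimal sketch of the recommended workflow: the sample texts and object names below are illustrative only, and the base R pipe `|>` (R 4.1 or later) is assumed here in place of `%>%`.

```r
library(quanteda)

# Illustrative input texts (not taken from the package)
txt <- c(d1 = "We must secure the blessings of liberty.",
         d2 = "Liberty and security are not the same thing.")

# Step 1: tokenise explicitly, choosing the tokens() options yourself
toks <- tokens(txt, remove_punct = TRUE)

# Step 2: construct the dfm from the tokens object
dfmat <- dfm(toks)

# Equivalent form that pipes the tokens return into dfm()
dfmat_piped <- txt |> tokens(remove_punct = TRUE) |> dfm()

# kwic() likewise takes the tokens object rather than the raw text
kw <- kwic(toks, pattern = "libert*", window = 2)
```

Keeping tokenisation as its own step makes it easy to swap in a different tokeniser later without touching the dfm() or kwic() calls.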

The full listing of deprecations is:

Removals

Bug fixes and stability enhancements

quanteda 2.1.2 (2020-09-23)

Changes

Bug fixes and stability enhancements

quanteda 2.1.1 (2020-07-27)

Changes

Bug fixes and stability enhancements

quanteda 2.1.0 (2020-07-05)

Changes

Bug fixes and stability enhancements

quanteda 2.0.1 (2020-03-18)

Changes

Bug fixes and stability enhancements

quanteda 2.0

Changes

quanteda 2.0 introduces some major changes, detailed here.

  1. New corpus object structure.

    The internals of the corpus object have been redesigned, and now are based around a character vector with meta- and system-data in attributes. These are all updated to work with the existing extractor and replacement functions. If you were using these before, then you should not even notice the change. Docvars are now handled separately from the texts, in the same way that docvars are handled for tokens objects.

  2. New metadata handling.

    Corpus-level metadata is now inserted in a user metadata list via meta() and meta<-(). metacorpus() is kept as a synonym for meta(), for backwards compatibility. Additional system-level corpus information is also recorded automatically when an object is created.

    Document-level metadata is deprecated, and now all document-level information is simply a "docvar". For backward compatibility, metadoc() is kept and will insert document variables (docvars) with the name prefixed by an underscore.

  3. Corpus objects now store default summary statistics for efficiency. When these are present, summary.corpus() retrieves them rather than computing them on the fly.

  4. New index operators for core objects. The main change here is to redefine the $ operator for corpus, tokens, and dfm objects (all objects that retain docvars) to allow this operator to access single docvars by name. Some other index operators have been redefined as well, such as [.corpus returning a slice of a corpus, and [[.corpus returning the texts from a corpus.

    See the full details at https://github.com/quanteda/quanteda/wiki/indexing_core_objects.

  5. *_subset() functions.

    The subset argument now must be logical, and the select argument has been removed. (This is part of base::subset() but has never made sense, either in quanteda or base.)

  6. Return format from textstat_simil() and textstat_dist().

    Now defaults to a sparse matrix from the Matrix package, but coercion methods are provided for as.data.frame(), to make these functions return a data.frame just like the other textstat functions. Additional coercion methods are provided for as.dist(), as.simil(), and as.matrix().

  7. settings functions (and related slots and object attributes) are gone. These are now replaced by a new meta(x, type = "object") that records object-specific meta-data, including settings such as the n for tokens (to record the ngrams).

  8. All included data objects are upgraded to the new formats. This includes the three corpus objects, the single dfm data object, and the LSD 2015 dictionary object.

  9. New print methods for core objects (corpus, tokens, dfm, dictionary) now exist, each with new global options to control the number of documents shown, as well as the length of a text snippet (corpus), the tokens (tokens), dfm cells (dfm), or keys and values (dictionary). Similar to the extended printing options for dfm objects, printing of corpus objects now allows for brief summaries of the texts to be printed, and for the number of documents and the length of the previews to be controlled by new global options.

  10. All textmodels and related functions have been moved to a new package quanteda.textmodels. This makes them easier to maintain and update, and keeps the size of the core package down.

  11. quanteda v2 implements major changes to the tokens() constructor. These are designed to simplify the code and its maintenance in quanteda, to allow users to work with other (external) tokenizers, and to improve consistency across the tokens processing options. Changes include:

    • A new method tokens.list(x, ...) constructs a tokens object from a named list of characters, allowing users to tokenize texts using some other function (or package) such as tokenize_words(), tokenize_sentences(), or tokenize_tweets() from the tokenizers package, or the list returned by spacyr::spacy_tokenize(). This allows users to use their choice of tokenizer, as long as it returns a named list of characters. With tokens.list(), all tokens processing (remove_*) options can be applied, or the list can be converted directly to a tokens object without processing using as.tokens.list().

    • All tokens options are now intervention options, to split or remove things that by default are not split or removed. All remove_* options to tokens() now remove them from tokens objects by calling tokens.tokens(), after constructing the object. "Pre-processing" is now actually post-processing using tokens_*() methods internally, after a conservative tokenization on token boundaries. This both improves performance and improves consistency in handling special characters (e.g. Twitter characters) across different tokenizer engines. (#1503, #1446, #1801)

    Note that tokens.tokens() will remove what is found, but cannot "undo" a removal -- for instance it cannot replace missing punctuation characters if these have already been removed.

    • The option remove_hyphens has been removed and replaced by split_hyphens. This option is FALSE by default, which preserves infix (internal) hyphens rather than splitting them. This behaviour is implemented in both the what = "word" and what = "word2" tokenizer options.

    • The option remove_twitter has been removed. The new what = "word" is a smarter tokenizer that preserves social media tags, URLs, and email addresses. "Tags" are defined as valid social media hashtags and usernames (using Twitter rules for validity), and they are kept intact rather than having their # and @ punctuation characters removed, even if remove_punct = TRUE.
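Several of the v2 changes listed above can be sketched together. The texts, the docvar name `year`, and the metadata field `source` are made up for illustration, and the calls assume a reasonably recent quanteda release.

```r
library(quanteda)

# Illustrative two-document corpus with one docvar (hypothetical data)
corp <- corpus(c(d1 = "A self-driving car.",
                 d2 = "Follow @quanteda for #rstats news!"),
               docvars = data.frame(year = c(2019, 2020)))

# Corpus-level user metadata via meta<-() and meta()
meta(corp, "source") <- "illustrative example"
meta(corp, "source")

# `$` accesses a single docvar by name; `[` returns a slice of the corpus
corp$year
corp_2020 <- corp[2]

# *_subset() takes a logical condition evaluated on the docvars
corp_recent <- corpus_subset(corp, year >= 2020)

# Infix hyphens are kept unless split_hyphens = TRUE is requested
toks_default <- tokens(corp)
toks_split <- tokens(corp, split_hyphens = TRUE)

# Social media tags are intended to survive punctuation removal
toks_nopunct <- tokens(corp, remove_punct = TRUE)

# A named list from an external tokeniser can be coerced as is,
# or passed through tokens() to apply the remove_* options
ext <- list(d1 = c("already", "tokenised", "!"),
            d2 = c("by", "another", "tool"))
toks_ext <- as.tokens(ext)
toks_ext_clean <- tokens(ext, remove_punct = TRUE)
```

Printing `toks_default`, `toks_split`, and `toks_nopunct` side by side is a quick way to see how each intervention option changes the result.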

New features

Behaviour changes

Bug fixes and stability enhancements

Other improvements

quanteda 1.5.2 (2019-11-26)

New features

Bug fixes

quanteda 1.5.1 (2019-07-30)

New features

Bug fixes and stability enhancements

quanteda 1.5.0 (2019-07-04)

New features

Behaviour changes

Bug fixes and stability enhancements

quanteda 1.4.1 (2019-02-26)

Bug fixes and stability enhancements

quanteda 1.4.0 (2019-01-30)

Bug fixes and stability enhancements

New features

Behaviour changes

quanteda 1.3.14 (2018-11-19)

Bug fixes and stability enhancements

New Features

quanteda 1.3.13 (2018-11-01)

Bug fixes and stability enhancements

New Features

Behaviour changes

quanteda 1.3.4 (2018-07-15)

Bug fixes and stability enhancements

New Features

quanteda 1.3.0 (2018-06-05)

New Features

Behaviour changes

Bug fixes

quanteda 1.2.0 (2018-04-15)

New Features

Bug fixes and stability enhancements

Behaviour changes

quanteda 1.1.1 (2018-03-07)

New Features

Bug fixes and stability enhancements

Performance improvements

Behaviour changes

quanteda 1.0.0 (2018-01-28)

New Features

Bug fixes and stability enhancements

Behaviour Changes

quanteda 0.99.12 (2017-10-06)

New Features

Bug fixes and stability enhancements

Behaviour Changes

quanteda 0.99.9 (2017-09-22)

New Features

Bug fixes and stability enhancements

quanteda 0.99 (2017-08-15)

New Features

Bug fixes and stability enhancements

Behaviour changes

quanteda 0.9.9-65 (2017-05-26)

New features

Behaviour changes

Bug fixes and stability enhancements

quanteda 0.9.9-50 (2017-04-20)

New features

Bug fixes and stability enhancements

quanteda 0.9.9-24 (2017-02-13)

New features

Behaviour changes

Bug fixes

quanteda 0.9.9-17 (2017-01-27)

New features

Bug fixes

quanteda 0.9.9-3

Bug fixes

New features

This release has some major changes to the API, described below.

Data objects

Renamed data objects

new name | original name | notes
:--------|:------------- |:-----
data_char_sampletext | exampleString |
data_char_mobydick | mobydickText |
data_dfm_LBGexample | LBGexample |

Renamed internal data objects

The following objects have been renamed, but will not affect user-level functionality because they are primarily internal. Their man pages have been moved to a common ?data-internal man page, hidden from the index, but linked from some of the functions that use them.

new name | original name | notes
:--------|:------------- |:-----
data_int_syllables | englishSyllables | (used by textcount_syllables())
data_char_wordlists | wordlists | (used by readability())
data_char_stopwords | .stopwords | (used by stopwords())

Deprecated data objects

In v.0.9.9 the old names remain available, but are deprecated.

new name | original name | notes
:--------|:------------- |:-----
data_char_ukimmig2010 | ukimmigTexts |
data_corpus_irishbudget2010 | ie2010Corpus |
data_char_inaugural | inaugTexts |
data_corpus_inaugural | inaugCorpus |

Deprecated functions

The following functions will still work, but issue a deprecation warning:

new function | deprecated function | constructs:
:--------|:------------- |:-------
tokens | tokenize() | tokens class object
corpus_subset | subset.corpus | corpus class object
corpus_reshape | changeunits | corpus class object
corpus_sample | sample | corpus class object
corpus_segment | segment | corpus class object
dfm_compress | compress | dfm class object
dfm_lookup | applyDictionary | dfm class object
dfm_remove | removeFeatures.dfm | dfm class object
dfm_sample | sample.dfm | dfm class object
dfm_select | selectFeatures.dfm | dfm class object
dfm_smooth | smoother | dfm class object
dfm_sort | sort.dfm | dfm class object
dfm_trim | trim.dfm | dfm class object
dfm_weight | weight | dfm class object
textplot_wordcloud | plot.dfm | (plot)
textplot_xray | plot.kwic | (plot)
textstat_readability | readability | data.frame
textstat_lexdiv | lexdiv | data.frame
textstat_simil | similarity | dist
textstat_dist | similarity | dist
featnames | features | character
nsyllable | syllables | (named) integer
nscrabble | scrabble | (named) integer
tokens_ngrams | ngrams | tokens class object
tokens_skipgrams | skipgrams | tokens class object
tokens_toupper | toUpper.tokens, toUpper.tokenizedTexts | tokens, tokenizedTexts
tokens_tolower | toLower.tokens, toLower.tokenizedTexts | tokens, tokenizedTexts
char_toupper | toUpper.character, toUpper.character | character
char_tolower | toLower.character, toLower.character | character
tokens_compound | joinTokens, phrasetotoken | tokens class object
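To illustrate the renamed functions, here is a small sketch that uses only the new names. The input texts are invented, and the argument names (`min_termfreq`, `scheme`) are those of recent quanteda releases, which may differ from the v0.9.9 signatures.

```r
library(quanteda)

# Illustrative texts (hypothetical data)
toks <- tokens(c(d1 = "a b a c", d2 = "b c d"))   # tokens() replaces tokenize()
dfmat <- dfm(toks)

# dfm_trim() replaces trim.dfm: keep features occurring at least twice
dfmat_trim <- dfm_trim(dfmat, min_termfreq = 2)

# dfm_weight() replaces weight: convert counts to within-document proportions
dfmat_prop <- dfm_weight(dfmat, scheme = "prop")

# featnames() replaces features()
featnames(dfmat_trim)

# tokens_ngrams() replaces ngrams
toks_bigram <- tokens_ngrams(toks, n = 2)
```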

New functions

The following are new to v0.9.9 (and not associated with deprecated functions):

new function | description | output class
:--------|:------------- |:-------
fcm() | constructor for a feature co-occurrence matrix | fcm
fcm_select | selects features from an fcm | fcm
fcm_remove | removes features from an fcm | fcm
fcm_sort | sorts an fcm in alphabetical order of its features | fcm
fcm_compress | compacts an fcm | fcm
fcm_tolower | lowercases the features of an fcm and compacts | fcm
fcm_toupper | uppercases the features of an fcm and compacts | fcm
dfm_tolower | lowercases the features of a dfm and compacts | dfm
dfm_toupper | uppercases the features of a dfm and compacts | dfm
sequences | experimental collocation detection | sequences
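A brief sketch of the fcm functions, written against the interface of recent quanteda releases (the v0.9.9 arguments may have differed). The texts and the window size are arbitrary choices for illustration.

```r
library(quanteda)

# Illustrative tokens (hypothetical data)
toks <- tokens(c(d1 = "a b c a b", d2 = "b c d"))

# Count co-occurrences of features within a window of 2 tokens
fcmat <- fcm(toks, context = "window", window = 2)

# Keep only selected features
fcmat_small <- fcm_select(fcmat, pattern = c("a", "b"))

# Uppercase the feature names, compacting any that become identical
fcmat_upper <- fcm_toupper(fcmat)
```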

Deleted functions and data objects

deleted function or object | reason
:--------|:-------------
encodedTextFiles.zip | moved to the readtext package
describeTexts | deprecated several versions ago for summary.character
textfile | moved to package readtext
encodedTexts | moved to package readtext, as data_char_encodedtexts
findSequences | replaced by sequences

Other new features

quanteda 0.9.8 (2016-07-28)

New Features

Bug fixes

Changes

quanteda 0.9.6

Bug fixes >= 0.9.6-3

Bug fixes

quanteda 0.9.4 (2016-02-21)

Bug fixes

quanteda 0.9.2

Bug fixes

quanteda 0.9.0

Bug Fixes

quanteda 0.8.6

Bug fixes

quanteda 0.8.4

Bug fixes

quanteda 0.8.2

Bug Fixes

Deletions

API changes

Imminent Changes

quanteda 0.8.0

Syntax changes and workflow streamlining

The workflow is now more logical and more streamlined, with a new workflow vignette as well as a design vignette explaining the principles behind the workflow and the commands that encourage this workflow. The document also details the development plans and things remaining to be done on the project.

Encoding detection and conversion

The newly rewritten command encoding() detects the encoding of character, corpus, and corpusSource objects (created by textfile). When creating a corpus using corpus(), conversion to UTF-8 is automatic if an encoding other than UTF-8, ASCII, or ISO-8859-1 is detected.

Major infrastructural changes

The tokenization, cleaning, lower-casing, and dfm construction functions now use the stringi package, based on the ICU library. This not only yields substantial speed improvements, but also handles Unicode characters and strings more correctly.

Other changes

Bug fixes

quanteda 0.7.3

quanteda 0.7.2 (2015-04-07)

quanteda 0.7.1

Many major changes to the syntax in this version.

quanteda 0.7.0

quanteda 0.6.6

quanteda 0.6.5

quanteda 0.6.4

quanteda 0.6.3

quanteda 0.6.2

quanteda 0.6.1

quanteda 0.6.0

quanteda 0.5.8

Classification and scaling methods

quanteda 0.5.7

New arguments for dfm()

quanteda 0.5.6

quanteda 0.5.5

quanteda 0.5.4

quanteda 0.5.3

quanteda 0.5.2

quanteda 0.5.1

quanteda 0.5.0

Lots of new functions

Old functions vastly improved

Better object and class design

More complete documentation