Title: | A Companion to the Multi-CAST Collection |
---|---|
Description: | Provides a basic interface for accessing annotation data from the Multi-CAST collection, a database of spoken natural language texts edited by Geoffrey Haig and Stefan Schnell. The collection draws from a diverse set of languages and has been annotated across multiple levels. Annotation data is downloaded on request from the servers of the University of Bamberg. See the Multi-CAST website <https://multicast.aspra.uni-bamberg.de/> for more information and a list of related publications. |
Authors: | Nils Norman Schiborr [aut, cre] |
Maintainer: | Nils Norman Schiborr <[email protected]> |
License: | CC BY 4.0 |
Version: | 2.0.0 |
Built: | 2024-11-13 06:23:05 UTC |
Source: | CRAN |
mc_clauses counts the number of clause units (bounded by the <##> or <#> GRAID annotation symbols) in a multicastR table.
mc_clauses(x, bytext = FALSE, printToConsole = FALSE)
x |
A |
bytext |
Logical. If |
printToConsole |
Logical. If |
A data.frame with the corpus, text (if bytext is TRUE), the number of valid clause units in each corpus (nClause), the total number of clause units (nAll), the number of clause units not analyzed (nNC), and the percentage the latter make up of the total (pNC).
multicast, mc_index, mc_metadata, mc_referents, mc_clauses
## Not run:
# count clause units in the most recent version
# of the Multi-CAST data, by corpus
n <- mc_clauses(multicast())

# count by text instead
m <- mc_clauses(multicast(), bytext = TRUE)

# number of clause units in the whole collection
sum(n$nClauses)

## End(Not run)
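The counting described for mc_clauses can be approximated on a toy table without downloading anything. The sketch below is not the package's implementation. It assumes that clause boundaries appear in a graid column as glosses beginning with # or ##, that unanalyzed clause units carry an nc suffix (for example #nc), and that nAll is the sum of nClause and nNC. The helper count_clauses and the toy data are hypothetical.

```r
# Toy stand-in for a multicastR table; only the columns needed here.
toy <- data.frame(
  corpus = c("A", "A", "A", "A", "B", "B", "B"),
  graid  = c("##", "np.h:s", "#", "#nc", "##", "v:pred", "##nc"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count clause boundary markers per corpus.
count_clauses <- function(x) {
  # a boundary gloss starts with one or two hash signs
  is_boundary <- grepl("^##?", x$graid)
  # assumed marker for clause units that were not analyzed
  is_nc <- is_boundary & grepl("nc$", x$graid)

  n_all <- tapply(is_boundary, x$corpus, sum)
  n_nc  <- tapply(is_nc, x$corpus, sum)

  data.frame(
    corpus  = names(n_all),
    nClause = as.vector(n_all - n_nc),
    nAll    = as.vector(n_all),
    nNC     = as.vector(n_nc),
    pNC     = as.vector(round(100 * n_nc / n_all, 2)),
    stringsAsFactors = FALSE
  )
}

res <- count_clauses(toy)
print(res)
```

The result has one row per corpus, mirroring the shape of the table described above with bytext = FALSE.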
mc_index downloads a tabular index of the versions of the Multi-CAST corpus data from the servers of the University of Bamberg. The value in the leftmost version column may be passed to the multicast method for access to earlier versions of the annotations.
mc_index()
A data.frame with five columns:
[, 1] version
Version key. Used for the vkey argument of other functions in this package.
[, 2] date
Publication date in YYYY-MM-DD format.
[, 3] corpora
Number of corpora (languages).
[, 4] texts
Number of texts.
[, 5] size
Total file size in kilobytes.
multicast, mc_metadata, mc_referents, mc_clauses
## Not run:
# retrieve version index
mc_index()

## End(Not run)
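For reproducible analyses it can be useful to pin the version key that was current at a given date. The following sketch shows one way to pick such a key from a table shaped like the one mc_index returns. The rows are invented placeholders, not real Multi-CAST versions, and the variable names idx, cutoff, and vkey are illustrative only.

```r
# Invented stand-in for the table returned by mc_index();
# the values are placeholders, not real Multi-CAST versions.
idx <- data.frame(
  version = c(9903, 9811, 9702),
  date    = c("1999-03-01", "1998-11-01", "1997-02-01"),
  corpora = c(5, 4, 3),
  texts   = c(50, 40, 30),
  size    = c(3000, 2500, 2000),
  stringsAsFactors = FALSE
)

# keep only versions published on or before the cutoff date
cutoff   <- as.Date("1998-12-31")
eligible <- idx[as.Date(idx$date) <= cutoff, ]

# of those, take the most recent one
vkey <- eligible$version[which.max(as.Date(eligible$date))]
print(vkey)
```

The resulting key could then be passed on as the vkey argument, for example multicast(vkey).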
mc_metadata downloads a table with metadata on the texts and speakers in the Multi-CAST collection from the servers of the University of Bamberg.
mc_metadata(vkey = NULL)
vkey |
A four-digit number specifying the requested version of the
metadata. Must be one of the version keys listed in the first column of
|
A data.frame containing metadata on the Multi-CAST collection. The table has the following eight columns:
[, 1] corpus
The name of the corpus.
[, 2] text
The title of the text.
[, 3] type
The text type, either TN 'traditional narrative', AN 'autobiographical narrative', or SN 'stimulus-based narrative'.
[, 4] recorded
The year (YYYY) the text was recorded.
[, 5] speaker
The identifier for the speaker.
[, 6] gender
The speaker's gender.
[, 7] age
The speaker's age at the time of recording. Approximate values are prefixed with a c.
[, 8] born
The speaker's birth year (YYYY). Approximate values are prefixed with a c.
multicast, mc_index, mc_referents, mc_clauses
## Not run:
# retrieve the most recent version of the Multi-CAST metadata
mc_metadata()

# retrieve the version of the metadata published in January 2021
mc_metadata(2101)

# join the metadata to a table with annotation values
mc <- multicast()
merge(mc, mc_metadata(), by = c("corpus", "text"))

## End(Not run)
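Because approximate values in the age and born columns are prefixed with a c, these columns cannot be used as numbers directly. The sketch below shows one possible way to separate the prefix from the value. It assumes that both columns are stored as character strings and that the c is immediately followed by the digits. The rows are invented, and the derived columns age_approx and age_num are hypothetical names.

```r
# Invented metadata rows; the values are placeholders.
meta <- data.frame(
  corpus = c("A", "A", "B"),
  text   = c("t1", "t2", "t1"),
  age    = c("34", "c60", NA),
  born   = c("1970", "c1945", NA),
  stringsAsFactors = FALSE
)

# flag approximate values, then strip the prefix and convert to numbers
meta$age_approx <- grepl("^c", meta$age)
meta$age_num    <- as.numeric(sub("^c", "", meta$age))

# the same treatment for the birth year
meta$born_approx <- grepl("^c", meta$born)
meta$born_num    <- as.numeric(sub("^c", "", meta$born))

print(meta)
```

Keeping the flag in a separate column preserves the information that a value is approximate after the conversion.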
mc_referents downloads a tabular list of all discourse referents occurring in those texts in the Multi-CAST collection that have been annotated with the RefIND scheme (Schiborr et al. 2018). The data are downloaded from the servers of the University of Bamberg.
mc_referents(vkey = NULL)
vkey |
A four-digit number specifying the requested version of the list
of referents. Must be one of the version keys listed in the first column of
|
A data.frame containing a list of referents for all texts with RefIND annotations in the Multi-CAST collection. The table has the following eight columns:
[, 1] corpus
The name of the corpus.
[, 2] text
The title of the text.
[, 3] refind
The four-digit referent index, unique to each referent in a text.
[, 4] label
The label used for the referent.
[, 5] description
A short description of the referent.
[, 6] class
The semantic class of the referent. Legend: hum = human, anm = animate, inm = inanimate, bdp = body part, mss = mass, loc = location, tme = time, abs = abstract.
[, 7] relations
Relations of the referent to other referents. Legend: < = set member of (partial co-reference), > = includes (split antecedence), M = part-whole.
[, 8] notes
Annotators' notes on the referent and its properties.
multicast, mc_index, mc_metadata, mc_clauses
## Not run:
# retrieve the most recent version of the Multi-CAST list of referents
mc_referents()

# retrieve the lists of referents published in January 2021
mc_referents(2021)

# join the list of referents to a table with annotation values
mc <- multicast()
merge(mc, mc_referents(), by = c("corpus", "text", "refind"), all.x = TRUE)

## End(Not run)
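The effect of all.x = TRUE in such a join can be seen on a small invented example. The two tables below only imitate the relevant columns, and refind is assumed to be stored as a character string. Since merge sorts its result by the key columns, the sketch also carries a hypothetical position column pos so that the original token order can be restored afterwards.

```r
# Invented annotation rows; one token carries no referent index.
ann <- data.frame(
  corpus = "A",
  text   = "t1",
  gword  = c("w1", "w2", "w3"),
  refind = c("0001", NA, "0002"),
  stringsAsFactors = FALSE
)
ann$pos <- seq_len(nrow(ann))  # remember the original token order

# Invented list of referents for the same text.
refs <- data.frame(
  corpus = "A",
  text   = "t1",
  refind = c("0001", "0002"),
  label  = c("ref_one", "ref_two"),
  class  = c("hum", "inm"),
  stringsAsFactors = FALSE
)

# left join: every annotation row is kept, unmatched rows get NA
joined <- merge(ann, refs, by = c("corpus", "text", "refind"), all.x = TRUE)

# merge() reorders rows by key, so restore the token order
joined <- joined[order(joined$pos), ]
print(joined)
```

Without all.x = TRUE, the token without a referent index would be dropped from the result.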
multicast downloads corpus data from the Multi-CAST collection (Haig & Schnell 2015) from the servers of the University of Bamberg. As the Multi-CAST collection is continuously evolving through the addition of further data sets and the revision of older annotations, the multicast function takes an optional argument vkey to select earlier versions of the annotation data, ensuring scientific accountability and the reproducibility of results.
multicast(vkey = NULL)
vkey |
A four-digit number specifying the requested version of the
annotation data. Must be one of the version keys listed in the first column of
|
A data.frame with eleven columns:
[, 1] corpus
The name of the corpus.
[, 2] text
The name of the text.
[, 3] uid
The utterance identifier. Uniquely identifies an utterance within a text.
[, 4] gword
Grammatical words. The tokenized utterances in the object language.
[, 5] gloss
Morphological glosses following the Leipzig Glossing Rules.
[, 6] graid
Annotations with the GRAID scheme (Haig & Schnell 2014).
[, 7] gform
The form symbol of a GRAID gloss.
[, 8] ganim
The person-animacy symbol of a GRAID gloss.
[, 9] gfunc
The function symbol of a GRAID gloss.
[, 10] refind
Referent tracking using the RefIND scheme (Schiborr et al. 2018).
[, 11] isnref
Annotations of the information status of newly introduced referents.
The Multi-CAST annotation data accessed by this package are published under a Creative Commons Attribution 4.0 International (CC-BY 4.0) licence (https://creativecommons.org/licenses/by/4.0/). Please refer to the Multi-CAST website for information on how to give proper credit to its contributors.
Data from the Multi-CAST collection should be cited as:
Haig, Geoffrey & Schnell, Stefan (eds.). 2015. Multi-CAST: Multilingual Corpus of Annotated Spoken Texts. (https://multicast.aspra.uni-bamberg.de/) (Accessed date.)
If for some reason you need to cite this package specifically, please refer to citation("multicastR").
Haig, Geoffrey & Schnell, Stefan. 2014. Annotations using GRAID (Grammatical Relations and Animacy in Discourse): Introduction and guidelines for annotators. Version 7.0. (https://multicast.aspra.uni-bamberg.de/#annotations)
Schiborr, Nils N. & Schnell, Stefan & Thiele, Hanna. 2018. RefIND – Referent Indexing in Natural-language Discourse: Annotation guidelines. Version 1.1. (https://multicast.aspra.uni-bamberg.de/#annotations)
mc_index, mc_metadata, mc_referents, mc_clauses
## Not run:
# retrieve and print the most recent version of the
# Multi-CAST annotations
multicast()

# retrieve the version of the annotation data published
# in January 2021
multicast(2021)

## End(Not run)
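Once a table with the columns listed above is available, the split GRAID columns lend themselves to simple tabulation. The sketch below works on invented rows and does not download anything. It assumes that a gloss such as np.h:s is split into gform np, ganim h, and gfunc s, and that glosses without a person-animacy symbol leave ganim empty. These assumptions about the cell values are illustrative and should be checked against the actual data.

```r
# Invented rows imitating the GRAID-related columns of a multicastR table.
toy <- data.frame(
  graid = c("np.h:s", "0.h:a", "pro.1:s", "np:p"),
  gform = c("np", "0", "pro", "np"),
  ganim = c("h", "h", "1", ""),
  gfunc = c("s", "a", "s", "p"),
  stringsAsFactors = FALSE
)

# keep only arguments in subject function (s or a)
subjects <- toy[toy$gfunc %in% c("s", "a"), ]

# cross-tabulate the form of the expression against its function
tab <- table(form = subjects$gform, func = subjects$gfunc)
print(tab)
```

The same pattern applies to other column pairs, for example ganim against gfunc.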
The multicastR package provides a basic interface for accessing the annotated corpus data in the Multi-CAST collection (edited by Geoffrey Haig and Stefan Schnell), a database of spoken natural language texts that draws from a diverse set of languages. The corpus data are downloaded on command from the servers of the University of Bamberg via the multicast method. Details on the Multi-CAST project and a list of publications can be found online at https://multicast.aspra.uni-bamberg.de/.
The Multi-CAST annotation data accessed by this package are published under a Creative Commons Attribution 4.0 International (CC-BY 4.0) licence (https://creativecommons.org/licenses/by/4.0/). Please refer to the Multi-CAST website for information on how to give proper credit to its contributors.
Data from the Multi-CAST collection should be cited as:
Haig, Geoffrey & Schnell, Stefan (eds.). 2015. Multi-CAST: Multilingual Corpus of Annotated Spoken Texts. (https://multicast.aspra.uni-bamberg.de/) (Accessed date.)
If for some reason you need to cite this package specifically, please refer to citation("multicastR").
multicast, mc_index, mc_metadata, mc_referents, mc_clauses