Package: tok 0.1.2
tok: Fast Text Tokenization
Interfaces with the 'Hugging Face' tokenizers library to provide implementations of today's most used tokenizers such as the 'Byte-Pair Encoding' algorithm <https://huggingface.co/docs/tokenizers/index>. It's extremely fast for both training new vocabularies and tokenizing texts.
Authors:
Downloads: tok_0.1.2.tar.gz (source) | tok_0.1.2.tar.gz (r-4.5-noble) | tok_0.1.2.tar.gz (r-4.4-noble)
Documentation: tok.pdf | tok.html
tok/json (API)
NEWS
# Install tok in R:
install.packages('tok', repos = c('https://cran.r-universe.dev', 'https://cloud.r-project.org'))
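After installation, a minimal encode/decode round trip might look like the sketch below. It assumes `tokenizer$from_pretrained()` can reach the Hugging Face Hub to download the "gpt2" vocabulary; check the package manual for the exact method signatures.

```r
library(tok)

# Load a pretrained GPT-2 byte-level BPE tokenizer from the Hub
tk <- tokenizer$from_pretrained("gpt2")

# Encode text into an encoding object holding token ids
enc <- tk$encode("Hello, tokenizers!")
enc$ids              # integer token ids

# Decode the ids back into text
tk$decode(enc$ids)
```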
Bug tracker: https://github.com/mlverse/tok/issues
Last updated 8 days ago from: 58dd61d3aa
Exports: decoder_byte_level, encoding, model_bpe, model_unigram, model_wordpiece, normalizer_nfc, normalizer_nfkc, pre_tokenizer, pre_tokenizer_byte_level, pre_tokenizer_whitespace, processor_byte_level, tok_decoder, tok_model, tok_normalizer, tok_processor, tok_trainer, tokenizer, trainer_bpe, trainer_unigram, trainer_wordpiece
Readme and manuals
Help Manual
Help page | Topics |
---|---|
tok: Fast Text Tokenization | tok-package tok |
Byte level decoder | decoder_byte_level |
Encoding | encoding |
BPE model | model_bpe |
An implementation of the Unigram algorithm | model_unigram |
An implementation of the WordPiece algorithm | model_wordpiece |
NFC normalizer | normalizer_nfc |
NFKC normalizer | normalizer_nfkc |
Generic class for pre-tokenizers | pre_tokenizer |
Byte level pre tokenizer | pre_tokenizer_byte_level |
Whitespace pre-tokenizer: splits using the regex `\w+|[^\w\s]+` | pre_tokenizer_whitespace |
Byte Level post processor | processor_byte_level |
Generic class for decoders | tok_decoder |
Generic class for tokenization models | tok_model |
Generic class for normalizers | tok_normalizer |
Generic class for processors | tok_processor |
Generic training class | tok_trainer |
Tokenizer | tokenizer |
BPE trainer | trainer_bpe |
Unigram tokenizer trainer | trainer_unigram |
WordPiece tokenizer trainer | trainer_wordpiece |
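The model, pre-tokenizer, and trainer classes listed above combine into a trainable pipeline. The following is a hedged sketch of training a small BPE vocabulary from local text files; `files` is a hypothetical character vector of paths, and the method names (`$train`, `$save`) mirror the upstream 'tokenizers' API and should be verified against the package manual.

```r
library(tok)

# Build a tokenizer around an (initially empty) BPE model
tk <- tokenizer$new(model_bpe$new())
tk$pre_tokenizer <- pre_tokenizer_whitespace$new()

# Configure a BPE trainer and learn a vocabulary from text files
trainer <- trainer_bpe$new(vocab_size = 5000)
tk$train(files, trainer)    # 'files': character vector of .txt paths (assumed)

# Serialize the trained tokenizer for later use
tk$save("tokenizer.json")
```

The same pattern applies with `model_unigram`/`trainer_unigram` or `model_wordpiece`/`trainer_wordpiece` swapped in for the BPE pair.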