Package: textTinyR 1.1.8

Lampros Mouselimis

textTinyR: Text Processing for Small or Big Data Files

textTinyR offers functions for splitting, parsing, tokenizing and creating a vocabulary for big text data files. It includes functions for building a document-term matrix and extracting information from it (term associations, most frequent terms), for calculating token statistics (collocations, look-up tables, string dissimilarities), and for working with sparse matrices. It also includes functions for word vector representations (e.g. 'GloVe', 'fasttext') and for calculating (pairwise) text document dissimilarities. The source code is written in 'C++11' and exported to R through the 'Rcpp', 'RcppArmadillo' and 'BH' packages.

Authors: Lampros Mouselimis [aut, cre]

textTinyR_1.1.8.tar.gz
textTinyR_1.1.8.tar.gz (r-4.5-noble) | textTinyR_1.1.8.tar.gz (r-4.4-noble)
textTinyR_1.1.8.tgz (r-4.4-emscripten) | textTinyR_1.1.8.tgz (r-4.3-emscripten)
textTinyR.pdf | textTinyR.html
textTinyR/json (API)
NEWS

# Install 'textTinyR' in R:
install.packages('textTinyR', repos = c('https://cran.r-universe.dev', 'https://cloud.r-project.org'))
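
Once the package is installed, the string-dissimilarity functions listed under Exports can be tried along the following lines. This is only a minimal sketch: the argument names (`s1`, `s2`, `word1`, `word2`, `n_grams`, `sentence1`, `sentence2`, `split_separator`) are recalled from the package manual and should be treated as assumptions to confirm against the help pages, and the input strings are made up for illustration.

```r
# Minimal sketch; argument names are assumptions recalled from the manual.
# Confirm with ?levenshtein_distance, ?dice_distance and ?cosine_distance.
library(textTinyR)

# Levenshtein (edit) distance between two single words
lev <- levenshtein_distance(s1 = "kitten", s2 = "sitting")

# Dice measure for two words, computed on character bigrams
dice <- dice_distance(word1 = "kitten", word2 = "sitting", n_grams = 2)

# Cosine measure for two multi-word strings, split on spaces
cosd <- cosine_distance(sentence1 = "the quick brown fox",
                        sentence2 = "the quick red fox",
                        split_separator = " ")

print(c(levenshtein = lev, dice = dice, cosine = cosd))
```

Each call takes plain character strings and returns a single number, so the results can be collected into one named vector for comparison.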

Bug tracker: https://github.com/mlampros/texttinyr/issues

Uses libs:
  • openblas: Optimized BLAS
  • c++: GNU Standard C++ Library v3
  • openmp: GCC OpenMP (GOMP) support library

4.89 score | 1 package | 196 scripts | 1.3k downloads | 30 exports | 7 dependencies

Last updated 12 months ago from:2662a61144. Checks: OK: 2. Indexed: no.

Target | Result | Date
Doc / Vignettes | OK | Oct 30 2024
R-4.5-linux-x86_64 | OK | Oct 30 2024

Exports: batch_compute, big_tokenize_transform, bytes_converter, cluster_frequency, COS_TEXT, cosine_distance, Count_Rows, dense_2sparse, dice_distance, dims_of_word_vecs, Doc2Vec, JACCARD_DICE, levenshtein_distance, load_sparse_binary, matrix_sparsity, read_characters, read_rows, save_sparse_binary, select_predictors, sparse_Means, sparse_Sums, sparse_term_matrix, TEXT_DOC_DISSIM, text_file_parser, text_intersect, token_stats, tokenize_transform_text, tokenize_transform_vec_docs, utf_locale, vocabulary_parser

Dependencies: BH, data.table, lattice, Matrix, R6, Rcpp, RcppArmadillo

Functionality of the textTinyR package

Rendered from functionality_of_textTinyR_package.Rmd using knitr::rmarkdown on Oct 30 2024.

Last update: 2021-10-13
Started: 2017-01-07

Word vectors - doc2vec - text clustering

Rendered from word_vectors_doc2vec.Rmd using knitr::rmarkdown on Oct 30 2024.

Last update: 2021-10-13
Started: 2018-04-03

Readme and manuals

Help Manual

Help page | Topics
Compute batches | batch_compute
String tokenization and transformation for big data sets | big_tokenize_transform
Bytes converter of a text file (KB, MB or GB) | bytes_converter
Frequencies of an existing cluster object | cluster_frequency
Cosine similarity for text documents | COS_TEXT
Cosine distance of two character strings (each string consists of more than one word) | cosine_distance
Number of rows of a file | Count_Rows
Convert a dense matrix to a sparse matrix | dense_2sparse
Dice similarity of words using n-grams | dice_distance
Dimensions of a word vectors file | dims_of_word_vecs
Conversion of text documents to word-vector-representation features (Doc2Vec) | Doc2Vec
Jaccard or Dice similarity for text documents | JACCARD_DICE
Levenshtein distance of two words | levenshtein_distance
Load a sparse matrix in binary format | load_sparse_binary
Sparsity percentage of a sparse matrix | matrix_sparsity
Read a specific number of characters from a text file | read_characters
Read a specific number of rows from a text file | read_rows
Save a sparse matrix in binary format | save_sparse_binary
Exclude highly correlated predictors | select_predictors
rowMeans and colMeans for a sparse matrix | sparse_Means
rowSums and colSums for a sparse matrix | sparse_Sums
Term matrices and statistics (document-term-matrix, term-document-matrix) | sparse_term_matrix
Dissimilarity calculation of text documents | TEXT_DOC_DISSIM
Text file parser | text_file_parser
Intersection of words or letters in tokenized text | text_intersect
Token statistics | token_stats
String tokenization and transformation (character string or path to a file) | tokenize_transform_text
String tokenization and transformation (vector of documents) | tokenize_transform_vec_docs
UTF locale for the available languages | utf_locale
Returns the vocabulary counts for small or medium (xml and not only) files | vocabulary_parser