| Title: | A Fast and Flexible Pipeline for Text Classification |
|---|---|
| Description: | A high-level pipeline that simplifies text classification into three streamlined steps: preprocessing, model training, and standardized prediction. It unifies the interface for multiple algorithms (including 'glmnet', 'ranger', 'xgboost', and 'naivebayes') and memory-efficient sparse matrix vectorization methods (Bag-of-Words, Term Frequency, TF-IDF, and Binary). Users can go from raw text to a fully evaluated sentiment model, complete with ROC-optimized thresholds, in just a few function calls. The resulting model artifact automatically aligns the vocabulary of new datasets during the prediction phase, safely appending predicted classes and probability matrices directly to the user's original dataframe to preserve metadata. |
| Authors: | Alabhya Dahal [aut, cre] |
| Maintainer: | Alabhya Dahal <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.3.4 |
| Built: | 2026-05-17 07:32:20 UTC |
| Source: | https://github.com/cran/quickSentiment |
This function takes a character vector of new documents and transforms it into a document-feature matrix (DFM) with exactly the same features as a pre-fitted training DFM, ensuring consistency for prediction.

    BOW_test(doc, fit)

| doc | A character vector of new documents to be processed. |
|---|---|
| fit | A fitted BoW object returned by BOW_train(). |

A quanteda dfm aligned to the training features.

    train_txt <- c("apple orange banana", "apple apple")
    fit <- BOW_train(train_txt, weighting_scheme = "bow")
    new_txt <- c("banana pear", "orange apple")
    test_dfm <- BOW_test(new_txt, fit)
    test_dfm

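To make the idea of feature alignment concrete, here is a minimal base-R sketch. It is not the package's implementation: `count_terms()` is a hypothetical helper and the whitespace tokenizer is an assumption. It shows how terms unseen during training (here "pear") are dropped and how training terms missing from a new document become zero columns.

```r
# Hypothetical helper: count whitespace-separated tokens against a fixed vocabulary.
count_terms <- function(docs, vocab) {
  tokens <- strsplit(tolower(docs), "\\s+")
  counts <- matrix(
    0, nrow = length(docs), ncol = length(vocab),
    dimnames = list(paste0("doc", seq_along(docs)), vocab)
  )
  for (i in seq_along(docs)) {
    # Tokens outside the vocabulary become NA in the factor and are not counted.
    counts[i, ] <- as.numeric(table(factor(tokens[[i]], levels = vocab)))
  }
  counts
}

train_txt <- c("apple orange banana", "apple apple")
new_txt   <- c("banana pear", "orange apple")

# The vocabulary is learned from the training text only.
vocab <- sort(unique(unlist(strsplit(tolower(train_txt), "\\s+"))))

# New documents are counted against that vocabulary, so columns always match.
aligned <- count_terms(new_txt, vocab)
aligned
```

Because the column set is fixed by the training vocabulary, a model fitted on the training matrix can score `aligned` without any reordering or padding.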
Train a Bag-of-Words Model

    BOW_train(doc, weighting_scheme = "bow", ngram_size = 1)

| doc | A character vector of documents to be processed. |
|---|---|
| weighting_scheme | A string specifying the weighting to apply. Defaults to "bow". |
| ngram_size | An integer specifying the maximum n-gram size. For example, 'ngram_size = 1' creates unigrams only; 'ngram_size = 2' creates unigrams and bigrams. Defaults to 1. |

An object of class "qs_bow_fit" containing:
dfm_template: a quanteda dfm template
weighting_scheme: the weighting used
ngram_size: the n-gram size used

    txt <- c("text one", "text two text")
    fit <- BOW_train(txt, weighting_scheme = "bow")
    fit$dfm_template

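The four weightings named in the package description (Bag-of-Words, Term Frequency, TF-IDF, and Binary) can be illustrated on a tiny count matrix. This is a sketch of common textbook definitions in base R; the exact formulas the package applies (for example the logarithm base or any smoothing in TF-IDF) are not stated in this manual, so treat the choices below as assumptions.

```r
# Term counts for the two documents "text one" and "text two text".
counts <- matrix(
  c(1, 1, 0,
    2, 0, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("text", "one", "two"))
)

# Bag-of-Words: the raw counts.
bow <- counts

# Binary: 1 if the term occurs in the document, 0 otherwise.
binary <- (counts > 0) * 1

# Term frequency: counts divided by document length.
tf <- counts / rowSums(counts)

# TF-IDF (assumed variant): count * log10(N / document frequency).
doc_freq <- colSums(counts > 0)
idf <- log10(nrow(counts) / doc_freq)
tfidf <- sweep(counts, 2, idf, `*`)

tfidf
```

Under this variant, a term that appears in every document ("text") gets an inverse document frequency of zero, so it contributes nothing to the TF-IDF features.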
Evaluate Model Performance (ROC and Precision-Recall)

    evaluate_performance(predicted_probs, actual_classes, positive_label)

| predicted_probs | Numeric vector of predicted probabilities for the positive class. |
|---|---|
| actual_classes | Factor or character vector of the actual true labels. |
| positive_label | Character string. The target class you want to evaluate. |

An object of class 'quickSentiment_eval', which is a list containing the following metrics:
| target_class | Character. The specific positive label used for the evaluation. |
|---|---|
| auc_roc | Numeric. The Area Under the Receiver Operating Characteristic curve. |
| best_threshold_roc | Numeric. The optimal probability threshold that maximizes Youden's J statistic. |
| auc_pr | Numeric. The Area Under the Precision-Recall curve. |
| best_threshold_pr | Numeric. The probability threshold that maximizes the F1-Score. |
| accuracy_at_best | Numeric. The overall accuracy of the model if 'best_threshold_pr' is applied. |
| roc | S3 object containing the ROC curve data and metrics. |
| prc | S3 object containing the Precision-Recall curve data and metrics. |
| threshold_summary | A data frame summarizing Accuracy, Precision, Recall, and F1 at 0.1 threshold increments. |

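The two threshold criteria above can be sketched in base R. This is an illustration under assumptions, not the package's code: the scores and labels are made up, `metrics_at()` is a hypothetical helper, and only a 0.1 grid is searched, whereas the package may evaluate every point on the curve. Youden's J is sensitivity + specificity - 1, and F1 is the harmonic mean of precision and recall.

```r
# Toy scores and labels (made up for illustration).
probs  <- c(0.95, 0.85, 0.75, 0.55, 0.45, 0.35, 0.25, 0.05)
actual <- c("P",  "P",  "N",  "P",  "N",  "P",  "N",  "N")
positive_label <- "P"

# Hypothetical helper: confusion-matrix metrics at one threshold.
metrics_at <- function(threshold) {
  pred_pos <- probs >= threshold
  is_pos   <- actual == positive_label
  tp <- sum(pred_pos & is_pos)
  fp <- sum(pred_pos & !is_pos)
  fn <- sum(!pred_pos & is_pos)
  tn <- sum(!pred_pos & !is_pos)
  precision   <- if (tp + fp == 0) 0 else tp / (tp + fp)
  recall      <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  f1 <- if (precision + recall == 0) 0 else
    2 * precision * recall / (precision + recall)
  c(threshold = threshold,
    accuracy  = (tp + tn) / length(probs),
    precision = precision,
    recall    = recall,
    f1        = f1,
    youden_j  = recall + specificity - 1)
}

# One row per threshold, similar in spirit to 'threshold_summary'.
thresholds  <- seq(0.1, 0.9, by = 0.1)
summary_tbl <- as.data.frame(t(sapply(thresholds, metrics_at)))

# which.max() returns the first maximum, so ties go to the lowest threshold.
best_roc <- summary_tbl$threshold[which.max(summary_tbl$youden_j)]
best_pr  <- summary_tbl$threshold[which.max(summary_tbl$f1)]
summary_tbl
```

The two criteria need not agree in general: Youden's J weighs true negatives through specificity, while F1 ignores them, so F1 tends to suit imbalanced data where the positive class is the one of interest.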
This function trains a logistic regression model using Lasso regularization via the glmnet package. It uses cross-validation to automatically find the optimal regularization strength (lambda).

    logit_model(
      train_vectorized,
      Y,
      test_vectorized,
      parallel = FALSE,
      tune = FALSE
    )

| train_vectorized | The training feature matrix (e.g., a 'dfm' from quanteda). This should be a sparse matrix. |
|---|---|
| Y | The response variable for the training set. Should be a factor for classification. |
| test_vectorized | The test feature matrix, which must have the same features as 'train_vectorized'. |
| parallel | Logical |
| tune | Logical |

A list containing four elements:
| pred | A vector of class predictions for the test set. |
|---|---|
| probs | A matrix of predicted probabilities. |
| model | The final, trained 'cv.glmnet' model object. |
| best_lambda | The optimal lambda value found during cross-validation. |


    ## Not run:
    # Create dummy vectorized training and test data
    train_matrix <- matrix(runif(100), nrow = 10, ncol = 10)
    test_matrix <- matrix(runif(50), nrow = 5, ncol = 10)
    # Provide column names (vocabulary) required by glmnet
    colnames(train_matrix) <- paste0("word", 1:10)
    colnames(test_matrix) <- paste0("word", 1:10)
    y_train <- factor(sample(c("P", "N"), 10, replace = TRUE))
    # Run logistic regression model (glmnet)
    model_results <- logit_model(train_matrix, y_train, test_matrix)
    ## End(Not run)

Multinomial Naive Bayes for Text Classification

    nb_model(train_vectorized, Y, test_vectorized, parallel = FALSE, tune = FALSE)

| train_vectorized | The training feature matrix (e.g., a 'dfm' from quanteda). This should be a sparse matrix. |
|---|---|
| Y | The response variable for the training set. Should be a factor for classification. |
| test_vectorized | The test feature matrix, which must have the same features as 'train_vectorized'. |
| parallel | Logical |
| tune | Logical. If TRUE, tests different Laplace smoothing values. |

A list containing four elements:
| pred | A vector of class predictions for the test set. |
|---|---|
| probs | A matrix of predicted probabilities. |
| model | The final, trained 'naivebayes' model object. |
| best_lambda | Placeholder (NULL) for pipeline consistency. |


    # 1. Create dummy numeric matrices with BOTH row and column names
    train_matrix <- matrix(
      as.numeric(sample(0:5, 100, replace = TRUE)),
      nrow = 10, ncol = 10,
      dimnames = list(paste0("doc", 1:10), paste0("word", 1:10))
    )
    test_matrix <- matrix(
      as.numeric(sample(0:5, 50, replace = TRUE)),
      nrow = 5, ncol = 10,
      dimnames = list(paste0("doc", 1:5), paste0("word", 1:10))
    )
    # 2. Create dummy target variable
    y_train <- factor(sample(c("P", "N"), 10, replace = TRUE))
    # 3. Run model
    model_results <- nb_model(train_matrix, y_train, test_matrix)
    print(model_results$pred)

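For readers who want to see what multinomial Naive Bayes with Laplace smoothing computes, the following base-R sketch may help. It is an illustration only and does not reproduce nb_model() or the 'naivebayes' package internals; `fit_mnb()` and `predict_mnb()` are hypothetical names, and the data are made up.

```r
train_counts <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 2, 1,
    0, 1, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("good", "great", "bad", "awful"))
)
y_train <- factor(c("P", "P", "N", "N"))

# Hypothetical helper: estimate log priors and smoothed log likelihoods.
fit_mnb <- function(x, y, laplace = 1) {
  classes <- levels(y)
  # Per-class total count of every term, plus the smoothing constant.
  totals <- t(sapply(classes, function(k) colSums(x[y == k, , drop = FALSE])))
  totals <- totals + laplace
  list(
    log_prior = log(as.numeric(table(y)[classes]) / length(y)),
    log_lik   = log(totals / rowSums(totals))
  )
}

# Hypothetical helper: class probabilities and hard predictions.
predict_mnb <- function(model, x) {
  # Log posterior up to a constant: counts times log likelihoods, plus log prior.
  scores <- x %*% t(model$log_lik)
  scores <- sweep(scores, 2, model$log_prior, `+`)
  # Row-wise softmax, shifted by the row maximum for numerical stability.
  probs <- exp(scores - apply(scores, 1, max))
  probs <- probs / rowSums(probs)
  pred <- colnames(probs)[max.col(probs, ties.method = "first")]
  list(pred = factor(pred, levels = colnames(probs)), probs = probs)
}

test_counts <- matrix(
  c(1, 1, 0, 0,
    0, 0, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("new1", "new2"), colnames(train_counts))
)

model <- fit_mnb(train_counts, y_train, laplace = 1)
out   <- predict_mnb(model, test_counts)
out$pred
```

The `laplace` constant keeps a term that never occurred in a class (such as "bad" in class "P") from forcing a zero probability; tuning it trades off that protection against flattening the likelihoods.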
This function takes pre-cleaned text and its target labels and handles the data splitting, vectorization, model training, and evaluation.

    pipeline(
      vect_method,
      model_name,
      text_vector,
      sentiment_vector,
      n_gram = 1,
      tune = FALSE,
      parallel = FALSE
    )

| vect_method | A string specifying the vectorization method. Defaults to |
|---|---|
| model_name | A string specifying the model to train. Defaults to |
| text_vector | A character vector containing the **preprocessed** text. |
| sentiment_vector | A vector or factor containing the target labels (e.g., ratings). |
| n_gram | The n-gram size to use for BoW/TF-IDF. Defaults to 1. |
| tune | Logical. If TRUE, the pipeline will perform hyperparameter tuning for the selected model. Defaults to FALSE. |
| parallel | Logical. If TRUE, runs model training in parallel. Defaults to FALSE. |

A list containing the trained model object, the DFM template, class levels, and a comprehensive evaluation report.

    df <- data.frame(
      text = c("good product", "excellent", "loved it", "great quality",
               "bad service", "terrible", "hated it", "awful experience",
               "not good", "very bad", "fantastic", "wonderful"),
      y = c("P", "P", "P", "P", "N", "N", "N", "N", "N", "N", "P", "P")
    )
    out <- pipeline("bow", "naive_bayes", text_vector = df$text, sentiment_vector = df$y)

Plot Precision-Recall Curve

    ## S3 method for class 'quickSentiment_prc'
    plot(x, ...)

| x | An object of class 'quickSentiment_prc'. |
|---|---|
| ... | Additional graphical parameters. |

Plot ROC Curve

    ## S3 method for class 'quickSentiment_roc'
    plot(x, ...)

| x | An object of class 'quickSentiment_roc'. |
|---|---|
| ... | Additional graphical parameters. |

This function provides a comprehensive and configurable pipeline for cleaning raw text data. It handles a variety of common preprocessing steps including removing URLs and HTML, lowercasing, stopword removal, and lemmatization.

    pre_process(
      doc_vector,
      remove_brackets = TRUE,
      remove_urls = TRUE,
      remove_html = TRUE,
      remove_nums = FALSE,
      remove_emojis_flag = TRUE,
      to_lowercase = TRUE,
      remove_punct = TRUE,
      remove_stop_words = TRUE,
      custom_stop_words = NULL,
      keep_words = NULL,
      lemmatize = TRUE,
      retain_negations = TRUE
    )

| doc_vector | A character vector where each element is a document. |
|---|---|
| remove_brackets | A logical value indicating whether to remove text in square brackets. |
| remove_urls | A logical value indicating whether to remove URLs and email addresses. |
| remove_html | A logical value indicating whether to remove HTML tags. |
| remove_nums | A logical value indicating whether to remove numbers. |
| remove_emojis_flag | A logical value indicating whether to remove common emojis. |
| to_lowercase | A logical value indicating whether to convert text to lowercase. |
| remove_punct | A logical value indicating whether to remove punctuation. |
| remove_stop_words | A logical value indicating whether to remove English stopwords. |
| custom_stop_words | A character vector of additional custom words to remove (e.g., c("rt", "via")). Default is NULL. |
| keep_words | A character vector of words to protect from deletion (e.g., c("no", "not", "nor")). Default is NULL. |
| lemmatize | A logical value indicating whether to lemmatize words to their dictionary form. |
| retain_negations | Logical. If TRUE, the negation words in qs_negations are protected from stopword removal. |

A character vector of the cleaned and preprocessed text.

    raw_text <- c(
      "This is a <b>test</b>! Visit https://example.com",
      "Email me at [email protected] [important]"
    )
    # Basic preprocessing with defaults
    clean_text <- pre_process(raw_text)
    print(clean_text)
    # Keep punctuation and stopwords
    clean_text_no_stop <- pre_process(
      raw_text,
      remove_stop_words = FALSE,
      remove_punct = FALSE
    )
    print(clean_text_no_stop)

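As a rough picture of what the regex-based cleaning steps involve, here is a base-R sketch covering HTML tags, URLs, email addresses, bracketed text, lowercasing, and punctuation. The patterns and the function name `sketch_clean()` are assumptions made for illustration; pre_process() may use different patterns and a different order, and stopword removal and lemmatization are left out.

```r
# Hypothetical helper: a few regex-based cleaning steps, applied in order.
sketch_clean <- function(x) {
  x <- gsub("<[^>]+>", " ", x, perl = TRUE)                 # HTML tags
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # URLs
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)               # email-like tokens
  x <- gsub("\\[[^\\]]*\\]", " ", x, perl = TRUE)           # text in square brackets
  x <- tolower(x)                                           # lowercase
  x <- gsub("[[:punct:]]+", " ", x, perl = TRUE)            # punctuation
  x <- gsub("\\s+", " ", x, perl = TRUE)                    # collapse whitespace
  trimws(x)
}

# The email address below is a made-up placeholder.
raw <- c(
  "This is a <b>test</b>! Visit https://example.com",
  "Email me at someone@example.org [important]"
)
cleaned <- sketch_clean(raw)
cleaned
```

Order matters in a chain like this: URLs and email addresses are removed before punctuation, because stripping punctuation first would break them into ordinary-looking tokens that the later patterns could no longer recognize.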
This is a generic prediction function that handles different model types and ensures consistent preprocessing and vectorization for new, unseen text.

    predict_sentiment(pipeline_object, text_column, threshold = 0.5)

| pipeline_object | A list object returned by the main 'pipeline()' function. It must contain the trained model, DFM template, preprocessing function, and n-gram settings. |
|---|---|
| text_column | A string specifying the column name of the text to predict. |
| threshold | Numeric. Optional custom threshold for binary classification. If NULL, uses the optimized threshold from training (if available). |

A data frame containing the 'predicted_class' and probability columns.

    if (exists("my_artifacts")) {
      dummy_df <- data.frame(text = c("loved it", "hated it"), stringsAsFactors = FALSE)
      preds <- predict_sentiment(my_artifacts, df = dummy_df, text_column = "text")
    }

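The role of the threshold in binary prediction, and the idea of appending results to the original data frame, can be sketched in base R as follows. This is not the package's code: `apply_threshold()` is a hypothetical helper, the probabilities are made up, and the probability column names (`prob_N`, `prob_P`) are assumptions; only 'predicted_class' is named in this manual.

```r
# Hypothetical helper: turn positive-class probabilities into labels and
# append them to the input data frame so existing columns are preserved.
apply_threshold <- function(df, prob_pos, threshold = 0.5, labels = c("N", "P")) {
  stopifnot(nrow(df) == length(prob_pos))
  df$predicted_class <- ifelse(prob_pos >= threshold, labels[2], labels[1])
  df[[paste0("prob_", labels[1])]] <- 1 - prob_pos
  df[[paste0("prob_", labels[2])]] <- prob_pos
  df
}

new_df <- data.frame(
  id = 1:3,
  text = c("loved it", "hated it", "it was fine"),
  stringsAsFactors = FALSE
)
prob_pos <- c(0.91, 0.12, 0.48)  # made-up positive-class probabilities

default_preds <- apply_threshold(new_df, prob_pos)
lenient_preds <- apply_threshold(new_df, prob_pos, threshold = 0.4)
lenient_preds
```

Lowering the threshold from 0.5 to 0.4 flips the borderline third document to the positive class, which is the kind of shift an ROC- or F1-optimized threshold would introduce.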
Print quickSentiment Evaluation Results

    ## S3 method for class 'quickSentiment_eval'
    print(x, ...)

| x | An object of class 'quickSentiment_eval'. |
|---|---|
| ... | Further arguments passed to or from other methods. |

A character vector of 25 common negation words. These words are automatically
protected by the pre_process function when retain_negations = TRUE
to prevent standard stopword lists from destroying sentiment polarity.

    qs_negations

An object of class character of length 25.
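Why protecting negations matters can be shown with a short base-R sketch. The stopword list, the three-word negation subset, and the helper `remove_stop_words()` below are all hypothetical stand-ins; they are not the package's 25-word qs_negations vector or its internal logic.

```r
stop_words <- c("i", "the", "is", "was", "not", "no", "it", "this")
negations  <- c("not", "no", "never")  # hypothetical subset, for illustration

# Hypothetical helper: drop stopwords, except those listed in 'protect'.
remove_stop_words <- function(docs, stop_words, protect = character(0)) {
  drop <- setdiff(stop_words, protect)
  vapply(
    strsplit(docs, "\\s+"),
    function(tok) paste(tok[!tok %in% drop], collapse = " "),
    character(1)
  )
}

docs <- c("this is not good", "the service was great")

unprotected <- remove_stop_words(docs, stop_words)
protected   <- remove_stop_words(docs, stop_words, protect = negations)
protected
```

Without protection, "this is not good" collapses to "good" and reads as positive; with the negations protected it becomes "not good", which keeps the polarity available to the model (especially when bigrams are used).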
This function trains a Random Forest model using the high-performance ranger package. It natively utilizes sparse matrices (dgCMatrix) to avoid memory exhaustion and utilizes Out-Of-Bag (OOB) error for rapid hyperparameter tuning.

    rf_model(train_vectorized, Y, test_vectorized, parallel = FALSE, tune = FALSE)

| train_vectorized | The training feature matrix (e.g., a 'dfm' from quanteda). |
|---|---|
| Y | The response variable for the training set. Should be a factor. |
| test_vectorized | The test feature matrix, which must have the same features as 'train_vectorized'. |
| parallel | Logical |
| tune | Logical. If TRUE, tunes 'mtry' using native OOB error. |

A list containing four elements:
| pred | A vector of class predictions for the test set. |
|---|---|
| probs | A matrix of predicted probabilities. |
| model | The final, trained 'ranger' model object. |
| best_lambda | Placeholder (NULL) for pipeline consistency. |


    ## Not run:
    # Create dummy vectorized training and test data
    train_matrix <- matrix(runif(100), nrow = 10, ncol = 10)
    test_matrix <- matrix(runif(50), nrow = 5, ncol = 10)
    # Provide column names (vocabulary) required by ranger
    colnames(train_matrix) <- paste0("word", 1:10)
    colnames(test_matrix) <- paste0("word", 1:10)
    y_train <- factor(sample(c("P", "N"), 10, replace = TRUE))
    # Run random forest model
    model_results <- rf_model(train_matrix, y_train, test_matrix)
    ## End(Not run)

This function trains a model using the xgboost package. It is highly efficient and natively supports sparse matrices, making it ideal for text data. It automatically handles both binary and multi-class classification problems.

    xgb_model(train_vectorized, Y, test_vectorized, parallel = FALSE, tune = FALSE)

| train_vectorized | The training feature matrix (e.g., a 'dfm' from quanteda). |
|---|---|
| Y | The response variable for the training set. Should be a factor. |
| test_vectorized | The test feature matrix, which must have the same features as 'train_vectorized'. |
| parallel | Logical |
| tune | Logical |

A list containing four elements:
| pred | A vector of class predictions for the test set. |
|---|---|
| probs | A matrix of predicted probabilities. |
| model | The final, trained 'xgb.Booster' model object. |
| best_lambda | Placeholder (NULL) for pipeline consistency. |


    ## Not run:
    # Create dummy vectorized training and test data
    train_matrix <- matrix(runif(100), nrow = 10, ncol = 10)
    test_matrix <- matrix(runif(50), nrow = 5, ncol = 10)
    # Provide column names (vocabulary) required by xgboost
    colnames(train_matrix) <- paste0("word", 1:10)
    colnames(test_matrix) <- paste0("word", 1:10)
    y_train <- factor(sample(c("P", "N"), 10, replace = TRUE))
    # Run xgboost model
    model_results <- xgb_model(train_matrix, y_train, test_matrix)
    ## End(Not run)