Main Text Processing Function for AI-Based Variable Extraction
Description
This function processes batches of text data using AI models to extract machine-readable variables. It implements robust error handling and retry logic to ensure reliable processing even with API failures.
Usage
main_func(df, to_code_max_id, n_post, provider, worker_env = NULL)
Arguments
df |
Data frame. Input data subset containing text to be processed. |
to_code_max_id |
Integer. Maximum number of posts to process in this batch. |
n_post |
Integer. Number of posts to process per API call (typically 15). |
provider |
Character. AI provider to use, either "OpenAI" or "Groq". |
worker_env |
Environment or NULL. When running in a parallel worker, the worker's
package environment. If NULL (default), |
Details
The function implements a robust processing loop that:
- Samples posts randomly to avoid processing-order bias
- Makes API calls in manageable batches
- Validates and cleans AI responses
- Handles API failures with a failure counter and early stopping
- Tracks missing IDs for potential reprocessing
- Collects data frames into a list for later binding
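The steps above can be sketched in base R as follows. This is a minimal illustration, not the package's actual implementation: mock_api(), process_sketch(), the max_failures argument, and the two-column id,value response layout are all hypothetical stand-ins, and the real function calls an AI provider instead.

```r
# Hypothetical stand-in for an API call: returns one CSV line per post.
mock_api <- function(batch) {
  paste(batch$id, toupper(batch$text), sep = ",", collapse = "\n")
}

# Sketch of the loop: sample, batch, validate, count failures, track missing IDs.
# max_failures is an assumed name for the early-stopping threshold.
process_sketch <- function(df, to_code_max_id, n_post, max_failures = 3,
                           api = mock_api) {
  # Sample posts in random order, capped at to_code_max_id
  todo <- df[sample(nrow(df), min(to_code_max_id, nrow(df))), , drop = FALSE]
  results <- list()
  failures <- 0
  missing_ids <- c()

  while (nrow(todo) > 0 && failures < max_failures) {
    # Take the next batch of at most n_post posts
    batch <- todo[seq_len(min(n_post, nrow(todo))), , drop = FALSE]
    todo <- todo[-seq_len(nrow(batch)), , drop = FALSE]

    # A failed call counts toward early stopping; its IDs are marked missing
    response <- tryCatch(api(batch), error = function(e) NULL)
    if (is.null(response)) {
      failures <- failures + 1
      missing_ids <- c(missing_ids, batch$id)
      next
    }

    parsed <- read.csv(text = response, header = FALSE,
                       col.names = c("id", "value"),
                       colClasses = c("integer", "character"))
    # IDs that were sent but did not come back
    missing_ids <- c(missing_ids, setdiff(batch$id, parsed$id))
    results[[length(results) + 1]] <- parsed
  }

  # Posts never attempted because of early stopping are also missing
  missing_ids <- c(missing_ids, todo$id)
  list(results = results, missing_ids = missing_ids, failures = failures)
}

posts <- data.frame(id = 1:10, text = letters[1:10])
out <- process_sketch(posts, to_code_max_id = 10, n_post = 4)
```

Collecting each parsed batch in a list and binding once at the end avoids repeatedly copying a growing data frame inside the loop.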
The AI models are instructed to return data in CSV format for efficient parsing. The function expects every response to contain exactly N columns, where N is the value configured through set_parameters().
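One way to enforce such a column-count check is sketched below. The helper parse_csv_response() and its n_cols argument are hypothetical and only illustrate the idea; the package's own validation may differ.

```r
# Hypothetical helper: keep only CSV lines that have exactly n_cols fields.
parse_csv_response <- function(response, n_cols) {
  lines <- trimws(strsplit(response, "\n", fixed = TRUE)[[1]])
  lines <- lines[nzchar(lines)]  # drop blank lines

  rows <- lapply(lines, function(line) {
    # scan() honours double-quoted fields that contain commas
    fields <- scan(text = line, what = character(), sep = ",",
                   quote = "\"", quiet = TRUE)
    if (length(fields) == n_cols) fields else NULL
  })
  rows <- Filter(Negate(is.null), rows)

  # Nothing usable came back
  if (length(rows) == 0) return(NULL)

  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- paste0("V", seq_len(n_cols))
  out
}

# The second line has too few fields and is dropped
response <- "1,a,b\n2,c\n3,\"d,e\",f\n"
cleaned <- parse_csv_response(response, n_cols = 3)
```

Checking each line separately means one malformed row does not invalidate the rest of the batch; the dropped rows can then be treated as missing IDs.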
Value
List of data frames. Each element holds the successfully processed results of one API call, including the extracted variables. Returns an empty list if no call succeeds.
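Because the return value may be an empty list, downstream code may want to guard the binding step. A minimal base-R sketch, where combine_results() is a hypothetical helper and not part of the package:

```r
# Hypothetical helper: bind a list of data frames, tolerating an empty list.
combine_results <- function(result) {
  if (length(result) == 0) {
    return(data.frame())  # zero-row, zero-column placeholder
  }
  do.call(rbind, result)
}

batches <- list(
  data.frame(id = 1:2, var1 = c("a", "b")),
  data.frame(id = 3L, var1 = "c")
)
combined <- combine_results(batches)
nothing <- combine_results(list())
```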
See Also
gpt_func(), groq_func(), make_value_row()
Examples
## Not run:
# For set_parameters(n_variables = 6):
instruction <- "Extract variables: var1,var2,var3,var4,var5,var6"
result <- main_func(
  df = my_data_subset,
  to_code_max_id = 30,
  n_post = 15,
  provider = "OpenAI"
)
processed_data <- bind_rows(result)  # Combine all dataframes
## End(Not run)