---
title: "Source intelligence and AI tools"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Source intelligence and AI tools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options("tibble.print_min" = 5L, "tibble.print_max" = 5L)
library(cohortBuilder)
```

Beyond manually configuring filters, `cohortBuilder` can inspect a source and
describe or build filters for you. These features are also the foundation for
integrating a cohort with a Large Language Model (LLM), so an assistant can
explore the data and apply filters on the user's behalf.

This article covers four building blocks:

- `describe()` - attach human-readable descriptions to datasets and variables,
- `autofilter()` - auto-generate filters from the data structure,
- `shape()` - return a structured summary of datasets and filters,
- AI tools (`R/ai_tools.R`) - expose the cohort to an `ellmer` chat.

## Describing a source

`describe()` builds a small description object (its text plus any extra fields)
that you attach to a source via the `description` argument of `set_source()`.

The `description` is a nested list keyed by dataset name. Within each dataset,
the special key `dataset_` describes the dataset itself, and any other key
describes a variable of that dataset:

```{r}
iris_source <- set_source(
  tblist(iris = iris),
  description = list(
    iris = list(
      dataset_ = describe("Edgar Anderson's measurements of iris flowers."),
      Species = describe("Iris species.", domain = c("setosa", "versicolor", "virginica"))
    )
  )
)
```

Extra named arguments to `describe()` (such as `domain` above) are stored
alongside the text and can be picked up by other features - for example
`autofilter()` uses a supplied `domain` instead of scanning the data.

`describe()` also accepts a `label` - a short, human-readable name for the field.
When the field describes a variable, `autofilter()` reuses the label as the
generated filter's `name` (the underlying `variable` is unchanged), which is
handy for giving filters friendlier names in a GUI:

```{r}
labelled_source <- set_source(
  tblist(iris = iris),
  description = list(
    iris = list(
      Species = describe("the species of iris", label = "Iris species")
    )
  )
) |>
  autofilter(attach_as = "meta")

species_filter <- purrr::detect(
  labelled_source$available_filters,
  ~ .x@id == "iris-Species"
)
species_filter@name
```

## Generating filters automatically

`autofilter()` analyses each column of the source and creates a filter suited
to its type (using filter rules such as `rule_character`, `rule_factor`,
`rule_numeric`, `rule_Date`, `rule_POSIXct`). The mapping is roughly:

| Column type | Filter type |
|-------------|-------------|
| character / factor | `discrete` (or `discrete_text` when all values are unique) |
| numeric / integer | `range` |
| Date | `date_range` |
| POSIXct | `datetime_range` |

The `attach_as` argument controls where the generated filters go. With
`attach_as = "step"` (the default) the filters are added as a filtering step,
so the cohort is immediately filterable:

```{r}
iris_cohort <- set_source(tblist(iris = iris)) |>
  autofilter(attach_as = "step") |>
  cohort()

sum_up(iris_cohort)
```

With `attach_as = "meta"` the filters are stored in `source$available_filters`
rather than applied.
This is the "menu" of filters a GUI or an LLM can choose from, without forcing
them onto the data:

```{r}
meta_source <- iris_source |>
  autofilter(attach_as = "meta")

length(meta_source$available_filters)
```

When a `domain` was provided via `describe()`, the generated filter inherits it
instead of scanning the data:

```{r}
species_filter <- purrr::detect(
  meta_source$available_filters,
  ~ .x@id == "iris-Species"
)
species_filter@domain
```

## Inspecting a source with `shape()`

`shape(source)` returns a structured `list(datasets, filters)` describing the
source - ideal for programmatic inspection or passing to an LLM:

- `datasets` maps each dataset name to its description text (or `NA`).
- `filters` is keyed by filter id; each entry is a list with `name`, `dataset`,
  `type`, `description`, `variables`, and `domain`. The `description` combines
  the filter-level description with the per-variable descriptions
  (single-variable filters show the bare variable description; multi-variable
  filters prefix each with its variable name).

```{r}
result <- shape(meta_source)

# Dataset descriptions
result$datasets

# One filter entry
str(result$filters$`iris-Species`)
```

When a filter's own `@domain` is unset, `shape()` falls back to the domain
stored in the source's metadata statistics, so the `domain` field is populated
whenever possible.

**Note.** Called with a `field` (and optional `subfield`),
`shape(source, field, subfield)` instead performs a description-text lookup -
this is the form used internally by `Cohort$show_help()`.

## Connecting a cohort to an LLM

The functions in `R/ai_tools.R` wrap cohort operations as tools an
[`ellmer`](https://ellmer.tidyverse.org/) chat can call. Each tool is a
`cb_tool` object (a function plus a name, description, and argument schema).
The built-in tool factories each take a cohort and return a `cb_tool`:

| Tool factory | Purpose |
|--------------|---------|
| `cb_tool_filters_meta()` | Return available-filter metadata (via `shape()`) as JSON |
| `cb_tool_describe_state()` | Describe current steps, filters, and pending state |
| `cb_tool_get_data_summary()` | Report row counts per dataset and step |
| `cb_tool_get_code()` | Return reproducible filtering code |
| `cb_tool_add_filters()` | Add filters (no values) to a new or existing step |
| `cb_tool_set_filter_values()` | Set values on existing filters |
| `cb_tool_apply_filters()` | Add filters and set their values in one call |
| `cb_tool_toggle_filters()` | Activate / deactivate filters |
| `cb_tool_clear_filters()` | Reset filters to their defaults |
| `cb_tool_remove_filters()` | Remove filters from a step |
| `cb_tool_remove_step()` | Remove the last step |
| `cb_tool_run()` | Run the pipeline (when auto-run is disabled) |

A `cb_tool` prints its name, description, and arguments:

```{r}
coh <- cohort(meta_source)
tool <- cb_tool_filters_meta(coh)
print(tool)
```

For LLM-driven filtering to work, the source must expose a menu of filters via
`autofilter(attach_as = "meta")` so the assistant knows what it can apply.

To register tools with an `ellmer` chat, use `cb_register_tool()` for a single
tool or `cb_register_tools()` to register all of them at once:

```{r, eval = FALSE}
library(ellmer)

source <- set_source(tblist(iris = iris)) |>
  autofilter(attach_as = "meta")
coh <- cohort(source)

chat <- chat_openai()
chat |> cb_register_tools(coh)

chat$chat("Filter the data to setosa flowers with sepal length over 5")
```

By default the cohort runs automatically after each tool modifies it. Set
`options(cb_tool_run_cohort = FALSE)` to require an explicit `cb_run` call
instead.

To trace which tools the LLM invokes (and with which arguments), set
`options(cb_tool_verbose = TRUE)`.
Each call then emits an informative `message()` such as
`[cohortBuilder AI tool] cb_apply_filters (filters = ...; action = new_step)`.
Logging is off by default, so tools stay silent during normal use.

**Note.** The AI tools require the suggested `ellmer` package.
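
The two options described above can be switched on for a single session and put back afterwards. The following is a minimal sketch using only base R; the names `old_opts` and `has_ellmer` are illustrative and not part of `cohortBuilder`, and the tool registration itself is left as a placeholder comment.

```r
# Turn on tool-call tracing and turn off auto-run.
# options() returns the previous values, which lets us restore them later.
old_opts <- options(cb_tool_verbose = TRUE, cb_tool_run_cohort = FALSE)

# The AI tools need the suggested 'ellmer' package, so check for it up front
# (has_ellmer is an illustrative name, not a cohortBuilder object).
has_ellmer <- requireNamespace("ellmer", quietly = TRUE)
if (!has_ellmer) {
  message("Install 'ellmer' to use the cohortBuilder AI tools.")
}

# ... create the chat, register the tools, and talk to the model here ...

# Restore whatever the options were before (unset options are removed again).
options(old_opts)
```

Restoring the saved values keeps the tracing and manual-run behaviour from leaking into later code in the same session.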