Implementing custom filters

cohortBuilder provides seven built-in filter types suitable for most common filtering tasks:

  • discrete,
  • discrete_text,
  • range,
  • date_range,
  • datetime_range,
  • multi_discrete,
  • query.

If none of the above filters meets your need, you can create a custom one. Below we describe how custom filters work and walk through an example: creating a logical filter.

Filter architecture

Filters in cohortBuilder are S7 objects that inherit from the CbFilter base class. Each filter type is an S7 class with properties describing its configuration (dataset, variable, value, etc.).

Filtering logic is implemented via S7 multi-dispatch generics that dispatch on both the filter class and the source class. This means you can define how a filter type behaves for different source backends (e.g. tblist, database, etc.).

The key generics are:

Generic Purpose
cb_filter_data(filter, source, data_object) Apply filter to data
cb_get_filter_stats(filter, source, data_object) Compute filter statistics
cb_plot_filter_data(filter, source, data_object) Plot filter data summary
cb_get_filter_data(filter, source, data_object) Get filter-related data subset
cb_get_filter_defaults(filter, source, data_object, cache_object) Get default parameter values
cb_filter_to_expr(filter, source) Generate reproducible code expression

Built-in filter example

Let’s look at a built-in discrete filter to understand the structure:

spec_filter <- filter("discrete", value = "setosa", dataset = "iris", variable = "Species")
spec_filter
#> <cohortBuilder::CbFilterDiscrete>
#>  @ type       : chr "discrete"
#>  @ id         : chr "iris-Species"
#>  @ name       : chr "Species"
#>  @ active     : logi TRUE
#>  @ description: NULL
#>  @ domain     : NULL
#>  @ step_id    : NULL
#>  @ extra      : list()
#>  @ private    :List of 1
#>  .. $ input_param: chr "value"
#>  @ dataset    : chr "iris"
#>  @ variable   : chr "Species"
#>  @ value      : chr "setosa"
#>  @ keep_na    : logi TRUE

The filter is an S7 object with properties accessible via @:

spec_filter@type
#> [1] "discrete"
spec_filter@variable
#> [1] "Species"
spec_filter@value
#> [1] "setosa"
spec_filter@dataset
#> [1] "iris"

All user-facing parameters can be retrieved as a list:

str(get_filter_params(spec_filter))
#> List of 10
#>  $ type       : chr "discrete"
#>  $ id         : chr "iris-Species"
#>  $ name       : chr "Species"
#>  $ active     : logi TRUE
#>  $ description: NULL
#>  $ domain     : NULL
#>  $ dataset    : chr "iris"
#>  $ variable   : chr "Species"
#>  $ value      : chr "setosa"
#>  $ keep_na    : logi TRUE

Creating a custom filter: logical filter

To create a custom filter you need to:

  1. Define an S7 class inheriting from CbFilter.
  2. Register the type so filter("logical", ...) works.
  3. Implement S7 methods for each source type you want to support.

Step 1: Define the S7 filter class

CbFilterLogical <- S7::new_class("CbFilterLogical",
  parent = CbFilter,
  properties = list(
    dataset = S7::class_character,
    variable = S7::class_character,
    value = S7::class_any,
    keep_na = S7::class_logical
  ),
  constructor = function(id = NULL, name = NULL, variable, value = NA,
                         dataset, keep_na = TRUE, description = NULL,
                         active = getOption("cb_active_filter", default = TRUE), ...) {
    if (is.null(id)) id <- paste(dataset, variable, sep = "-")
    if (is.null(name)) name <- variable
    S7::new_object(S7::S7_object(),
      type = "logical", id = id, name = name, input_param = "value",
      variable = variable, value = value, dataset = dataset,
      keep_na = keep_na, active = active, description = description,
      extra = list(...)
    )
  }
)

Key points:

  • parent = CbFilter makes it part of the filter hierarchy.
  • input_param = "value" tells the system which property holds the filtering value.
  • extra = list(...) captures any additional parameters (e.g. gui_input for shinyCohortBuilder).
  • value = NA means no filtering is applied by default.

Step 2: Register the filter type

register_filter_type("logical", CbFilterLogical)

Now filter("logical", ...) works:

my_filter <- filter("logical", variable = "is_setosa", dataset = "iris", value = TRUE)
my_filter@type
#> [1] "logical"
my_filter@value
#> [1] TRUE

Step 3: Implement methods for tblist source

Each method dispatches on (filter_class, source_class). We use tblist_class (provided by cohortBuilder) as the source class for the built-in list-of-tables backend.

filter_data – filtering logic

S7::method(cb_filter_data, list(CbFilterLogical, tblist_class)) <- function(filter, source, data_object, ...) {
  dataset <- filter@dataset
  variable <- filter@variable
  value <- filter@value
  keep_na <- filter@keep_na

  if (keep_na && !identical(value, NA)) {
    data_object[[dataset]] <- data_object[[dataset]] |>
      dplyr::filter(!!sym(variable) == !!value | is.na(!!sym(variable)))
  }
  if (!keep_na && identical(value, NA)) {
    data_object[[dataset]] <- data_object[[dataset]] |>
      dplyr::filter(!is.na(!!sym(variable)))
  }
  if (!keep_na && !identical(value, NA)) {
    data_object[[dataset]] <- data_object[[dataset]] |>
      dplyr::filter(!!sym(variable) == !!value)
  }
  attr(data_object[[dataset]], "filtered") <- TRUE
  data_object
}

Note:

  • When value = NA, no filtering is applied (unless keep_na = FALSE, which removes NA rows).
  • The filtered attribute signals to binding logic that this dataset was touched.

get_stats – data statistics

S7::method(cb_get_filter_stats, list(CbFilterLogical, tblist_class)) <- function(
  filter, source, data_object, name, ...
) {
  dataset <- filter@dataset
  variable <- filter@variable
  if (missing(name)) {
    name <- c("n_data", "choices", "n_missing")
  }
  column <- data_object[[dataset]][[variable]]
  stats <- list(
    choices = if ("choices" %in% name) as.list(table(stats::na.omit(column))),
    n_data = if ("n_data" %in% name) length(stats::na.omit(column)),
    n_missing = if ("n_missing" %in% name) sum(is.na(column))
  )
  if (length(name) == 1L) stats[[name]] else stats[name]
}

Tip: evaluating only the requested stat (via the name parameter) improves performance with large data.

plot_data – visualization

S7::method(cb_plot_filter_data, list(CbFilterLogical, tblist_class)) <- function(filter, source, data_object, ...) {
  dataset <- filter@dataset
  variable <- filter@variable
  if (nrow(data_object[[dataset]])) {
    data_object[[dataset]][[variable]] |> table() |> prop.table() |> graphics::barplot(...)
  } else {
    graphics::barplot(0.0, ylim = c(0.0, 0.1), main = "No data")
  }
}

get_defaults – default parameter values

S7::method(cb_get_filter_defaults, list(CbFilterLogical, tblist_class)) <- function(
  filter, source, data_object, cache_object, ...
) {
  list(value = names(cache_object$choices))
}

filter_to_expr – reproducible code

S7::method(cb_filter_to_expr, list(CbFilterLogical, tblist_class)) <- function(filter, source, ...) {
  dataset <- filter@dataset
  variable <- filter@variable
  value <- filter@value
  keep_na <- filter@keep_na

  if (keep_na && !identical(value, NA)) {
    rlang::expr({
      data_object[[!!dataset]] <- data_object[[!!dataset]] |>
        dplyr::filter(!!sym(variable) == !!value | is.na(!!sym(variable)))
    })
  } else if (!keep_na && identical(value, NA)) {
    rlang::expr({
      data_object[[!!dataset]] <- data_object[[!!dataset]] |>
        dplyr::filter(!is.na(!!sym(variable)))
    })
  } else if (!keep_na && !identical(value, NA)) {
    rlang::expr({
      data_object[[!!dataset]] <- data_object[[!!dataset]] |>
        dplyr::filter(!!sym(variable) == !!value)
    })
  } else {
    NULL
  }
}

Using the filter

Now we can use our filter for building a cohort. For the example we’ll use an extended iris table:

iris2 <- dplyr::mutate(iris, is_setosa = Species == "setosa")
coh <- set_source(tblist(iris = iris2)) |>
  cohort(
    filter("logical", dataset = "iris", variable = "is_setosa", value = TRUE)
  ) |>
  run()

Let’s verify the filtering worked:

nrow(get_data(coh)$iris)
#> [1] 50
collapse::funique(get_data(coh)$iris$is_setosa)
#> [1] TRUE

We can also check statistics and reproducible code:

stat(coh, step_id = 1L, filter_id = 1L)
#> $n_data
#> [1] 50
#> 
#> $choices
#> $choices$`TRUE`
#> [1] 50
#> 
#> 
#> $n_missing
#> [1] 0
code(coh, include_methods = NULL, include_action = NULL)
#> source <- list(dtconn = tblist(iris = iris2))
#> data_object <- source$dtconn
#> step_id <- "1"
#> data_object[["iris"]] <- dplyr::filter(data_object[["iris"]], is_setosa == TRUE |
#>     is.na(is_setosa))

Extra parameters

Filter constructors accept ... which is stored in filter@extra. This is useful for passing additional configuration consumed by other packages (e.g. gui_input for shinyCohortBuilder):

my_filter <- filter("logical",
  dataset = "iris", variable = "is_setosa", value = TRUE,
  gui_input = "checkbox"
)
my_filter@extra$gui_input
#> [1] "checkbox"
get_filter_params(my_filter)$gui_input
#> [1] "checkbox"

Extra parameters are included in get_filter_params() output, persisted in get_state(), and can be updated via update_filter().

Implementing filters in a package

When creating custom filters in a separate package:

  1. Define your S7 filter class(es) and export them.
  2. Call register_filter_type() in your package’s .onLoad().
  3. Register S7 methods for your filter × source combinations.
  4. Add S7 (>= 0.2.0) and cohortBuilder to Imports in DESCRIPTION.
  5. Call S7::methods_register() in your .onLoad() to register S7 methods.
# R/zzz.R
.onLoad <- function(libname, pkgname) {
  S7::methods_register()
  cohortBuilder::register_filter_type("logical", CbFilterLogical)
}