NEWS
canpumf 0.5.2 (2026-07-03)
New features
- Experimental list_statcan_pumf_catalogue() crawls the live Statistics Canada "Public use microdata" listing and returns one row per discovered survey edition (catalogue_id, Title, edition, format, url, product_url). It is a discovery counterpart to the curated list_canpumf_collection() and picks up newly released PUMFs automatically. Editions offered in several formats collapse to a single preferred row (CSV / flat text first), but genuinely distinct surveys or file-types that share a reference year are kept as separate rows (e.g. the GSS cycle and the Giving/Volunteering survey both released in 2007, or the census individual/family/household/hierarchical files for one year). Surveys distributed only by Electronic File Transfer report url = "(EFT)".
- The full crawl is expensive (hundreds of requests), so list_statcan_pumf_catalogue() caches its result for the duration of the R session and reuses it on subsequent calls with the same arguments. Pass refresh = TRUE to re-scrape the live catalogue and replace the cached result, e.g. to pick up a newly released survey mid-session.
- Census PUMF editions are decoded from their cenNN / nhsNN filename prefix (the 2011 cycle shipped as the National Household Survey) and ind / fam / hous / hier file type into canonical "YYYY (individuals)"-style strings matching list_canpumf_collection(). This is forward-compatible: the 2026 census PUMF will resolve automatically once released.
- list_statcan_pumf_catalogue() now returns a SeriesTitle column (the plain-language series name matching the acronym) alongside an edition-specific Title. For umbrella products whose catalogue title is only the series name (e.g. the consolidated General Social Survey, or a census year's individuals/hierarchical pair) the Title is synthesised as "<series> — <edition>", where the structural edition descriptor disambiguates colliding reference years ("General Social Survey — Cycle 16 (2002)", "Census of Population — 2021 (individuals)"); per-edition products keep Statistics Canada's own title.
- get_pumf() now resolves download URLs from the scraped catalogue first for the series the crawler covers (GSS, SHS, SFS, CPSS, CIS, CHS, ITS, CCAHS), so a newly released edition is downloadable without a package update. Series the crawler deliberately does not cover continue to resolve through the curated list_canpumf_collection(): LFS and Census (which keep their dedicated paths) and the Giving/Volunteering surveys (SGVP, which Statistics Canada ships under reused zip names the umbrella crawl cannot disambiguate). URL resolution never triggers a live crawl; it reads the cached catalogue.
- The package now ships a frozen snapshot of the full catalogue crawl (inst/extdata/pumf_catalogue.rds). It is the terminal, always-available fallback for both URL resolution and list_statcan_pumf_catalogue(): a freshly installed package with no user cache and no network still resolves every supported survey's download URL, so a change to the Statistics Canada website cannot silently break get_pumf() between releases. The shipped snapshot is regenerated at each release.
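
The session cache, the refresh = TRUE override and the frozen-snapshot fallback described in the notes above can be sketched in base R roughly as follows. This is an illustration only, not the package's implementation: list_catalogue(), crawl_live(), read_snapshot() and the online argument are hypothetical stand-ins, and the sketch uses a single cache key rather than keying on the call's arguments.

```r
# Session-level cache: an environment that lives as long as the R session.
.catalogue_cache <- new.env(parent = emptyenv())

# Hypothetical stand-in for the expensive live crawl; counts its invocations.
crawl_calls <- 0L
crawl_live <- function(online = TRUE) {
  if (!online) stop("network unreachable")
  crawl_calls <<- crawl_calls + 1L
  data.frame(catalogue_id = "demo-1", edition = "2024", source = "live")
}

# Hypothetical stand-in for reading the snapshot shipped with the package.
read_snapshot <- function() {
  data.frame(catalogue_id = "demo-1", edition = "2023", source = "snapshot")
}

list_catalogue <- function(refresh = FALSE, online = TRUE) {
  key <- "catalogue"
  cached <- exists(key, envir = .catalogue_cache, inherits = FALSE)
  # Reuse the session cache unless the caller asks for a re-scrape.
  if (!refresh && cached) {
    return(get(key, envir = .catalogue_cache))
  }
  result <- tryCatch(crawl_live(online), error = function(e) NULL)
  if (is.null(result)) {
    # Crawl failed: prefer an earlier cached result, else the frozen snapshot.
    if (cached) return(get(key, envir = .catalogue_cache))
    return(read_snapshot())
  }
  assign(key, result, envir = .catalogue_cache)
  result
}

offline <- list_catalogue(online = FALSE)  # no cache, no network: snapshot
first   <- list_catalogue()                # performs the crawl
second  <- list_catalogue()                # served from the session cache
third   <- list_catalogue(refresh = TRUE)  # crawls again, replaces the cache
```

In this sketch the snapshot is never written into the session cache, so a later call made once the network is back still performs a real crawl.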
Robustness
- get_pumf(), get_pumf_connection() and pumf_metadata() now fail gracefully when Statistics Canada is unreachable: a download failure no longer raises an error but instead emits an informative message and returns NULL. list_available_lfs_pumf_versions() likewise returns an empty result with a warning rather than erroring, matching the existing behaviour of list_canpumf_collection() and list_statcan_pumf_catalogue().
- close_pumf(NULL) is now a no-op, so it can be called unconditionally on a get_pumf() result that may be NULL.
- When options(canpumf.cache_path = ) is not set, the package now notes this once when attached and again on the first download, explaining that data is written to a temporary directory (and discarded at the end of the session) and how to configure a persistent cache. The underlying behaviour is unchanged: without a cache path, data is stored in tempdir() for the session.
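
As a rough illustration of the cache-path behaviour in the last note, the base R sketch below resolves the download directory from the canpumf.cache_path option and falls back to tempdir() with a one-time message. resolve_cache_dir() is a hypothetical helper and the "canpumf_cache" folder name is an arbitrary example; the package's own messages and timing (on attach and on first download) are simplified here to a single notice per session.

```r
# Remembers whether the "no cache path" notice has already been shown.
.notified <- new.env(parent = emptyenv())

# Hypothetical helper: where should downloaded data be written?
resolve_cache_dir <- function() {
  path <- getOption("canpumf.cache_path")
  if (is.null(path) || !nzchar(path)) {
    if (!isTRUE(.notified$done)) {
      message("canpumf.cache_path is not set; using tempdir() for this session.")
      .notified$done <- TRUE
    }
    return(tempdir())
  }
  path
}

old <- options(canpumf.cache_path = NULL)  # make sure the option is unset
unset_dir <- resolve_cache_dir()           # messages once, returns tempdir()

# Example persistent location (placed under tempdir() only to keep the demo tidy).
options(canpumf.cache_path = file.path(tempdir(), "canpumf_cache"))
set_dir <- resolve_cache_dir()             # returns the configured path

options(old)                               # restore the previous setting
```

For everyday use the options() call would typically go in a project or user .Rprofile so that downloads persist across sessions.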
Bug fixes
- Surveys whose StatCan ZIP archives carry accented path names stored in CP437/Latin-1 without the UTF-8 flag (e.g. the Survey of Household Spending 2017, whose data live under a Data - Données/ folder) now extract correctly on Linux and Windows. Previously utils::unzip() either errored with "invalid multibyte string" (Windows) or silently dropped the affected files under a non-UTF-8 locale (Linux), so the survey failed to import with "No parseable metadata files found". Extraction now uses zip::unzip() as the primary, locale-agnostic extractor on every platform (with the macOS ditto/system-unzip chain retained as a fallback for newer ZIP compression variants), giving uniform cross-platform behaviour. zip is a new dependency.
canpumf 0.5.1
New features
- Multi-module survey support. Surveys that ship several linked files sharing a respondent key are now modelled as several joinable tables in one DuckDB file. get_pumf() returns the survey's primary (respondent-level) module and emits a one-time message listing the available sibling modules; pumf_module(tbl, "<module>") opens a sibling on the same connection so the two are joinable, and announces the shared join key. Each module's join key is recorded in the registry (module_key) so it never has to be guessed (it varies: RECID, PUMFID, MICRO_ID, CASEID, IDNUM). Converted surveys include GSS cycle 16 / "Aging and Social Support" 2002 (MAIN + CG4 + CG6 + CR), GSS Time Use 1998/2010/2015/2022 (Main + Episode), the Survey of Household Spending 2017 (Interview + Diary, each with its own bootstrap weights), and the Giving/Volunteering/Participating cycles 1997–2010 (MAIN + GS/VD/GIVE/VOLNTR).
- close_pumf() now also accepts a DuckDB connection returned by get_pumf_connection(), closing it directly, in addition to a lazy dplyr::tbl() returned by get_pumf().
- New parse_pdf_codebook() metadata parser for StatCan bilingual PDF frequency codebooks. This recovers variable and value labels for surveys whose only machine-readable companion is the data file, notably CPSS cycle 1, which (unlike CPSS 2–6) ships no variables.csv. CPSS 1 now imports with full bilingual labels (parity with the other cycles) when pdftools is installed. Like the existing PDF data-dictionary parser, it is a label fallback that only fires when no command file or codebook CSV is found, and requires pdftools (Suggests).
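
To make the multi-module idea from the first note in this list concrete, here is a minimal base R sketch of what joining a respondent-level table to a sibling module on a shared key amounts to. The data frames and the AGEGRP and DURATION columns are invented toy data; PUMFID is used only because it is one of the key names listed above. The package itself performs such joins lazily inside DuckDB on tables returned by get_pumf() and pumf_module(), which this sketch does not attempt to reproduce.

```r
# Toy respondent-level module: one row per respondent.
main <- data.frame(
  PUMFID = c(1, 2, 3),
  AGEGRP = c("25-34", "35-44", "65+")
)

# Toy sibling module: several episode rows per respondent, same key.
episode <- data.frame(
  PUMFID   = c(1, 1, 2, 3, 3, 3),
  DURATION = c(60, 30, 480, 15, 45, 90)
)

# Join the sibling module to the primary module on the shared key.
joined <- merge(episode, main, by = "PUMFID")

# Aggregate an episode-level measure by a respondent-level attribute.
minutes_by_age <- aggregate(DURATION ~ AGEGRP, data = joined, FUN = sum)
```

Because the join key differs between surveys, reading it from recorded metadata rather than hard-coding it, as the registry's module_key does, avoids silent mismatches.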
Documentation
- New "Working with multi-module PUMF surveys" vignette showing how to load the primary module, open sibling modules with pumf_module(), join them inside DuckDB, and use get_pumf_connection() / close_pumf() directly.
- New "Bootstrap weights" vignette documenting the resampling method, how the weights are stored, stratification, estimating uncertainty, and the incremental re-run behaviour (reuse, adding replicates, and regeneration when rows are added).
Bug fixes
- get_pumf("LFS") (and other calls) no longer trigger spurious RStudio "Error in dbSendQuery(...)" Connections-pane popups. Transient internal DuckDB connections (status checks, write phases, BSW edits) are no longer registered in the RStudio Connections pane; only the final connection returned to the user is registered.
- add_bootstrap_weights() on an in-memory data.frame/tibble that already has replicate columns now extends the existing set (generating only the additional replicates) instead of regenerating a full set and producing duplicate column names. This matches the DuckDB-backed behaviour.
- add_bootstrap_weights() now handles rows added to a survey table that already has bootstrap weights correctly. Previously it generated replicates for the new rows in isolation (resampling only among the new rows), which is statistically wrong. It now deletes and regenerates the affected weights: every row when unstratified, or only the strata that gained rows when strata_cols are in effect (complete strata keep their existing weights).
- GSS Time Use 1998 now imports cleanly regardless of locale. Under a C locale (as in R CMD check) list.files() selected the Main module's SAS PROC FORMAT, which injected categorical codes onto continuous clock-time, duration, decimal-hour and birth-year variables; these are now declared force_numeric so their values are preserved. In addition, merge_metadata() no longer warns about label conflicts that arise solely from lossy supplement parsers (SAS labels, PDF dictionary/codebook); authoritative-source conflicts still warn.
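
The add_bootstrap_weights() fix for appended rows, described earlier in this list, comes down to regenerating weights for whole strata rather than for the new rows alone. The base R sketch below illustrates that selection logic on toy data. strata_to_regenerate() and regenerate() are hypothetical helpers, the bsw_1 column name is invented, and the simple draw-with-replacement count used here is a placeholder rather than the package's documented resampling method.

```r
# Toy survey table: rows 5 and 6 were appended after weights were generated,
# so their replicate weight is still missing.
survey <- data.frame(
  id      = 1:6,
  stratum = c("A", "A", "B", "B", "B", "C"),
  bsw_1   = c(1.2, 0.8, 2.0, 0.0, NA, NA)
)

# Strata that contain at least one row without a weight must be redone in full.
strata_to_regenerate <- function(data, strata_col, weight_col) {
  unique(data[[strata_col]][is.na(data[[weight_col]])])
}

# Placeholder resampling: draw n rows with replacement within each stale
# stratum and use the draw counts as the replicate weight.
regenerate <- function(data, strata_col, weight_col, stale) {
  for (s in stale) {
    rows <- which(data[[strata_col]] == s)
    n <- length(rows)
    draws <- sample.int(n, n, replace = TRUE)
    data[[weight_col]][rows] <- as.numeric(tabulate(draws, nbins = n))
  }
  data
}

set.seed(1)
stale   <- strata_to_regenerate(survey, "stratum", "bsw_1")
updated <- regenerate(survey, "stratum", "bsw_1", stale)
```

Stratum A is complete, so its existing weights are left untouched, while B and C are resampled over all of their rows. Without strata, the same logic treats the whole table as one stratum and regenerates every row.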
canpumf 0.5.0
Major changes
- Data is now imported into DuckDB (a breaking change, but existing code needs only slight modification)
- Adaptable metadata parsing registry
- Several more robust strategies for parsing metadata
- Better data download and import mechanics
- Extensive test suite to prevent regressions and catch if StatCan re-releases data with changed metadata