---
title: "Introduction to pslr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to pslr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(pslr)
```

## What pslr does

The [Public Suffix List](https://publicsuffix.org) (PSL) is a community-curated
list of the domain suffixes under which Internet users can directly register
names. `pslr` bundles a pinned snapshot of that list and implements the official
*prevailing-rule* algorithm to answer two core questions about a hostname:

* **Public suffix** (also called the effective top-level domain, *eTLD*): the
  suffix below which registrations happen, e.g. `co.uk` for `example.co.uk`.
* **Registrable domain** (*eTLD+1*): the public suffix plus the one label to its
  left that a registrant actually controls, e.g. `example.co.uk`.

```{r}
public_suffix("www.example.co.uk")
registrable_domain("www.example.co.uk")
```

The matcher is compiled with `cpp11` and needs no external system library.
Hostname canonicalization (case folding and Unicode/IDNA handling) is delegated
to the [`punycoder`](https://github.com/bart-turczynski/punycoder) package.

## Terminology

* **Rule**: a line in the list, such as `com`, `*.ck`, or `!www.ck`.
* **Normal rule**: a literal suffix (`com`, `co.uk`).
* **Wildcard rule**: `*.ck` means *every* label directly under `ck` is itself a
  public suffix.
* **Exception rule**: `!www.ck` carves a single name back out of a wildcard.
* **Default rule**: the spec's implicit `*`: any unlisted TLD label is treated
  as a public suffix.
* **Section**: the list is split into an **ICANN** part (the official domain
  hierarchy) and a **PRIVATE** part (suffixes operated by companies, e.g.
  `github.io`).

The prevailing rule is chosen as: an exception beats a wildcard, the longest
match beats shorter matches, and the implicit default applies only when nothing
else does.

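
The selection order described above can be sketched in base R against a toy
rule set. Everything here is illustrative: `toy_rules`, `rule_matches()`, and
`toy_public_suffix()` are hypothetical names, the rule vector is a hand-picked
subset rather than the real list, and the package's compiled matcher may be
organised quite differently.

```r
# Hypothetical toy rule set (a tiny subset, NOT the real Public Suffix List).
toy_rules <- c("com", "uk", "co.uk", "jp", "*.kobe.jp", "!city.kobe.jp")

# TRUE when every rule label equals the host label in the same position,
# counting from the right; "*" matches any single label.
rule_matches <- function(rule_labels, host_labels) {
  n <- length(rule_labels)
  if (n > length(host_labels)) return(FALSE)
  tail_labels <- utils::tail(host_labels, n)
  all(rule_labels == "*" | rule_labels == tail_labels)
}

# Sketch of prevailing-rule selection for one already-canonical ASCII host.
toy_public_suffix <- function(host, rules = toy_rules) {
  host_labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  is_exception <- startsWith(rules, "!")
  rule_labels <- strsplit(sub("^!", "", rules), ".", fixed = TRUE)
  hit <- vapply(rule_labels, rule_matches, logical(1),
                host_labels = host_labels)

  if (any(hit & is_exception)) {
    # An exception prevails; the suffix is the rule minus its leftmost label.
    n <- length(rule_labels[[which(hit & is_exception)[1]]]) - 1L
  } else if (any(hit)) {
    # Otherwise the matching rule with the most labels prevails.
    n <- max(lengths(rule_labels[hit]))
  } else {
    # Nothing matched: the implicit default "*" keeps the last label.
    n <- 1L
  }
  paste(utils::tail(host_labels, n), collapse = ".")
}

toy_public_suffix("www.example.co.uk")   # "co.uk"     (longest normal rule)
toy_public_suffix("a.b.kobe.jp")         # "b.kobe.jp" (wildcard)
toy_public_suffix("city.kobe.jp")        # "kobe.jp"   (exception)
toy_public_suffix("example.madeuptld")   # "madeuptld" (implicit default)
```

The sketch skips canonicalization, section filtering, and input validation, so
it is only meant to make the precedence order concrete.
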
```{r}
public_suffix("a.b.kobe.jp")   # a wildcard match under kobe.jp
public_suffix("city.kobe.jp")  # an exception match under kobe.jp
```

## Choosing a section

`section` selects which rules are eligible. Filtering happens *before*
prevailing-rule selection, so asking for one section never silently borrows a
rule from the other.

```{r}
# github.io is a PRIVATE rule sitting under the ICANN suffix io.
public_suffix("user.github.io", section = "all")      # default scope, both sections
public_suffix("user.github.io", section = "icann")    # the ICANN rule for io
public_suffix("user.github.io", section = "private")
```

### `section = "private"` fall-through

When you restrict to a section and the host matches no explicit rule there, the
query falls through to the implicit default rule rather than failing. A plain
ICANN host queried under `section = "private"` therefore resolves to its own
last label via the default rule:

```{r}
public_suffix("example.com", section = "private")
```

To distinguish "no explicit rule matched" from a real match, combine the section
with `unknown = "na"` (below).

## Unknown-suffix policy

By default an unlisted suffix is handled by the implicit `*` rule, so a made-up
TLD still yields a public suffix. Pass `unknown = "na"` to require an *explicit*
rule and get `NA` otherwise.

```{r}
public_suffix("example.madeuptld")                  # default rule
public_suffix("example.madeuptld", unknown = "na")  # explicit-only
```

### Explicit-membership queries

`is_public_suffix()` reports whether a host is itself a public suffix. Under the
default policy an unlisted single label is `TRUE` via the implicit rule; use
`unknown = "na"` to test explicit list membership instead.

```{r}
is_public_suffix("co.uk")
is_public_suffix("madeuptld")                  # TRUE via the implicit default rule
is_public_suffix("madeuptld", unknown = "na")  # explicit membership only
```

## Unicode and ASCII output

Input may be ASCII, Unicode, or A-label (`xn--`) hostnames; equivalent spellings
canonicalize to the same answer. Output is ASCII A-labels by default; pass
`output = "unicode"` to decode them.

```{r}
public_suffix("example.рф")                      # ASCII A-label by default
public_suffix("example.рф", output = "unicode")  # decoded to Unicode
public_suffix("example.xn--p1ai")                # the A-label spelling agrees
```

## Terminal dots

A single terminal root dot is preserved on hostname-shaped output, so a
fully-qualified name round-trips:

```{r}
public_suffix("www.example.com.")
registrable_domain("www.example.com.")
```

## Extracting and inspecting

`suffix_extract()` splits each host into subdomain, registrant label, and
suffix; `public_suffix_rule()` reports which rule prevailed, useful for
auditing.

```{r}
suffix_extract("blog.user.github.io")
public_suffix_rule(c("www.ck", "a.b.kobe.jp", "example.madeuptld"))
```

All query functions are vectorised, length- and name-preserving, and NA-safe.
Invalid input (URLs, IPv6, empty labels, dotted-decimal IPv4 literals, ...) is
`NA` by default; pass `invalid = "error"` to abort on the first invalid element.

## Refresh and the active list

The package ships with a pinned snapshot, so it works fully offline and the
bundled list is the default for every query. `psl_refresh()` is the *only*
function that touches the network: an explicit, HTTPS-only, validated download
into a user cache. `psl_use()` chooses which list backs the session.

```{r, eval = FALSE}
# Download and validate a fresh list into the user cache, then activate it:
psl_refresh(activate = TRUE)

# Switch the active list for this session:
psl_use("cache")                        # the latest refreshed snapshot
psl_use("bundled")                      # back to the shipped snapshot
psl_use("path", path = "my_list.dat")   # a custom file
```

Activation is session-only and validated before any state changes; a failed
refresh never replaces a working cache or active list.

## Reproducibility

A public-suffix result depends on both *which list* answered and *how hosts were
normalized*. `psl_version()` reports both (the source-snapshot provenance and
the runtime normalization identifiers), so a result can be reproduced later.
Record this row alongside reproducibility-sensitive output.

```{r}
psl_version()
```

`psl_rules()` exposes the active rule table itself:

```{r}
nrow(psl_rules("icann"))
head(psl_rules("private"), 3)
```

If the shipped index was generated under a different normalization profile or
Unicode version than the installed `punycoder`, the list is transparently
rebuilt in memory from source on activation, so an index is never mixed with
hosts normalized under a different profile.

## Security and scope notes

* **Hostnames, not URLs.** The query functions accept DNS hostnames. URL-shaped
  input is rejected as invalid; parse the host out of a URL first.
* **Explicit network only.** Nothing in package load, queries, examples, or
  tests touches the network. Only `psl_refresh()` does, and only when you call
  it. It is HTTPS-only, rejects embedded credentials and downgrade redirects,
  and enforces a source-size ceiling.
* **The PSL is advisory.** It is a best-effort community list, not an
  authoritative statement of ownership or a security boundary by itself. Treat a
  registrable-domain result as a heuristic for grouping, not proof of control.
* **Session-global active list.** The active list is per-session global state;
  there is no per-call list switching.
Concurrent per-list queries are out of scope for this release.
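
Because URL-shaped input is rejected, the host has to be separated from the URL
before querying. One rough way to do that with base R alone is sketched here.
`url_host()` is a hypothetical helper, not part of `pslr`; it relies on simple
pattern matching and assumes input with an explicit scheme, so a dedicated URL
parser is likely the safer choice for untrusted data.

```r
# Hypothetical helper (not part of pslr): strip everything but the host from
# a URL using base R pattern matching. Vectorised over `url`; NA stays NA.
url_host <- function(url) {
  x <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop the scheme
  x <- sub("[/?#].*$", "", x)                        # drop path, query, fragment
  x <- sub("^[^@]*@", "", x)                         # drop userinfo, if any
  sub(":[0-9]*$", "", x)                             # drop a trailing port
}

url_host("https://www.example.co.uk/path?q=1")
url_host("https://user:pw@Example.COM:8443/a#frag")
```

The helper leaves letter case alone, since case folding is already part of the
canonicalization step described earlier; its result can then be passed to
`registrable_domain()` or `public_suffix()`.
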