---
title: "Search plans and quota-aware retrieval"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Search plans and quota-aware retrieval}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

```{r setup}
library(scopusflow)
```

The Elsevier Scopus Search API is generous but bounded. A weekly quota limits how many requests you may make, a short-term rate limit caps how fast you may make them, and no single query will return more than its first 5000 records. This article shows how scopusflow works within those bounds so that a large retrieval is reproducible, efficient and resumable. The steps that contact the API need a key and are not run here; everything else runs offline.

## A query, built safely

Most queries combine a few terms under a field tag. `scopus_query()` assembles them without the bracket and tag mistakes that creep in when fragments are pasted together by hand.

```{r}
q <- scopus_query("language learning", "effect size", .field = "TITLE-ABS-KEY")
q
```

The recognised field tags, and what each one searches, are listed by `scopus_field_tags()`.

```{r}
scopus_field_tags()
```

## Describing the search as a plan

A plan records exactly what will be fetched, so it can be saved, reviewed and re-run. Partitioning by year is the recommended way to stay under the 5000-record ceiling, since each year becomes its own cell.

```{r}
plan <- scopus_plan(q, years = 2010:2020, partition = "year")
plan
```

Each cell carries the query, the year, the view and the page size. The page size deserves a moment's attention, because it is where quota is won or lost.

## Why page size is a quota decision

Scopus charges quota per request, not per record. A page may hold up to 200 records under the `STANDARD` view, or 25 under `COMPLETE`. Retrieving a thousand records in pages of 200 therefore costs five requests, where pages of 25 would cost forty.
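The arithmetic is easy to check in base R. This is only a sketch of the request count: `requests_needed()` is a hypothetical helper written for this example, not a scopusflow function, and the page limits are the two quoted above.

```{r}
# Hypothetical helper (not part of scopusflow): the number of requests
# needed to page through `n_records` at a given page size.
requests_needed <- function(n_records, page_size) {
  ceiling(n_records / page_size)
}

requests_needed(1000, page_size = 200)  # page limit quoted for STANDARD
requests_needed(1000, page_size = 25)   # page limit quoted for COMPLETE
```

The same thousand records cost eight times as many requests at the smaller page size.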
For that reason `page_size` defaults to the largest value the view allows. This is the same efficiency `rscopus` relies on, and it does not evade the quota: every request is still counted, and the 5000-record ceiling still holds.

```{r}
scopus_plan(q, view = "STANDARD")$page_size[1]
scopus_plan(q, view = "COMPLETE")$page_size[1]
```

## Sizing before spending

Counting is cheap and does not download records, so it is worth doing first. The count comes back with the parsed quota attached, which lets a workflow decide whether it has the allowance to proceed.

```{r eval = FALSE}
n <- scopus_count(q, years = 2010:2020)
n
attr(n, "quota")
```

## Fetching, with caching and resume

`scopus_fetch_plan()` runs each cell in turn. Given a cache directory, it writes each cell to disk as it completes, so a run interrupted halfway, or stopped by the quota, resumes from where it left off rather than paying for the same cells again.

```{r eval = FALSE}
records <- scopus_fetch_plan(
  plan,
  cache_dir = scopus_cache_dir(),
  resume = TRUE
)
records
```

The result is a `scopus_records` tibble, the same shape returned by `scopus_fetch()` for a single query and by the bundled `example_records`.

```{r}
example_records
```

## Combining separate retrievals

Results gathered in separate runs combine safely with `scopus_combine()`, which renumbers the records and can drop duplicates by Scopus identifier or DOI. This is preferable to `rbind()`, which would leave duplicate entry numbers.

```{r}
scopus_combine(example_records, example_records, dedupe = TRUE)
```

## When the ceiling bites

A query matching more than 5000 records cannot be retrieved in full from a single call; `scopus_fetch()` returns the first 5000 and warns. The remedy is the plan: split the search by year, or by any other facet, so that each cell stays under the ceiling. `scopus_count()` tells you in advance whether a split is needed.

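One way to see where a split is needed is to compare per-cell counts against the ceiling. The sketch below uses invented counts in place of real API results, so the numbers and the names `counts` and `ceiling_limit` are illustrative assumptions only.

```{r}
# Invented per-year match counts, standing in for the result of counting
# each year separately. These are not real Scopus figures.
counts <- c(`2018` = 3200, `2019` = 4800, `2020` = 6100)

ceiling_limit <- 5000  # the most records a single query will return

# Years whose cell would still be truncated and so needs a finer split
names(counts)[counts > ceiling_limit]
```

A year flagged this way would be divided again by some further facet before fetching.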

## Handling interruptions

Network and API problems are raised as typed conditions, all inheriting from `scopus_error`, so a long retrieval can respond to them rather than stopping dead.

```{r eval = FALSE}
result <- tryCatch(
  scopus_fetch_plan(plan, cache_dir = scopus_cache_dir()),
  scopus_error_rate_limit = function(e) {
    message("Rate limited; the cached cells are safe. Try again later.")
    NULL
  }
)
```

Because each completed cell is already cached, resuming after such a pause costs nothing for the work already done.
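
The same handler can drive an automatic retry. What follows is a minimal sketch, not scopusflow's own behaviour: `retry_rate_limited()` and `fake_fetch()` are hypothetical names, and the condition is built by hand, with an assumed class vector, so that the example runs offline.

```{r}
# Hypothetical retry helper (not part of scopusflow). It re-runs `task`
# whenever a condition of class `scopus_error_rate_limit` is signalled.
retry_rate_limited <- function(task, max_tries = 3, wait = 0) {
  for (i in seq_len(max_tries)) {
    out <- tryCatch(task(), scopus_error_rate_limit = function(e) e)
    if (!inherits(out, "scopus_error_rate_limit")) return(out)
    if (i < max_tries) Sys.sleep(wait)  # pause before the next attempt
  }
  stop("Still rate limited after ", max_tries, " tries.")
}

# Stand-in for a fetch that is rate limited twice and then succeeds.
# The class vector is an assumption made for this illustration.
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) {
    stop(structure(
      list(message = "rate limited", call = NULL),
      class = c("scopus_error_rate_limit", "scopus_error", "error", "condition")
    ))
  }
  "done"
}

retry_rate_limited(fake_fetch)
```

In a real run `task` would wrap the `scopus_fetch_plan()` call with a cache directory, and `wait` would be set to a pause long enough for the rate limit to clear.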