---
title: "Comparing topics over time"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Comparing topics over time}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
# The plotting chunks below need ggplot2 (a suggested package); skip them
# gracefully when it is not installed.
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
```

```{r setup}
library(scopusflow)
```

A common bibliometric question is not how large a literature is, but how its
internal emphasis shifts over time. Within deep-learning research, say, is the
share of work that also concerns medical imaging growing faster than the share
about computer vision? `scopus_compare_topics()` answers exactly this, and
`plot_scopus_comparison()` shows the answer. The comparison itself contacts the
API, so it is shown but not run; the plotting is reproduced offline from an
object of the same shape.

## What the comparison measures

For each year and each comparison term, the function counts the records
matching the reference topic *and* that term, and expresses it as a percentage
of the records matching the reference *alone*. A value of 30% for "computer
vision" in 2020 means that 30% of the deep-learning records that year also
mention computer vision. The reference is the denominator, so it sits at 100%
by construction and is not drawn.

```{r eval = FALSE}
cmp <- scopus_compare_topics(
  reference_query = "deep learning",
  comparison_terms = c("computer vision", "natural language processing",
                       "medical imaging", "drug discovery"),
  years = 2013:2021,
  field = "TITLE-ABS-KEY"
)
```

## The shape of the result

The result is a tidy table with one row per topic and year. We build one here
with the same columns so the rest of the article runs without a key. The
reference set grows over the period, which the uncertainty band will reflect.

```{r}
years <- 2013:2021
# Reference counts grow linearly from 400 to 1600 records per year.
ref_n <- round(seq(400, 1600, length.out = length(years)))
# Helper: a linear ramp of yearly counts, one value per year.
mk <- function(from, to) round(seq(from, to, length.out = length(years)))
counts <- list(
  "computer vision" = mk(140, 720),
  "natural language processing" = mk(90, 540),
  "medical imaging" = mk(15, 260),
  "drug discovery" = mk(8, 170)
)
cmp <- tibble::tibble(
  query = "q",
  query_type = c(rep("reference", length(years)),
                 rep("comparison", length(counts) * length(years))),
  abridged_query = c(rep("deep learning", length(years)),
                     rep(names(counts), each = length(years))),
  year = rep(years, length(counts) + 1),
  n = c(ref_n, unlist(counts, use.names = FALSE)),
  reference_n = rep(ref_n, length(counts) + 1),
  comparison_percentage = 100 * c(ref_n, unlist(counts, use.names = FALSE)) /
    rep(ref_n, length(counts) + 1),
  # Whole-period averages are hard-coded, rounded values for this mock.
  average_comparison_percentage = c(rep(100, length(years)),
                                    rep(c(40, 33, 15, 9), each = length(years)))
)
# Give the mock the scopus_comparison class.
class(cmp) <- c("scopus_comparison", class(cmp))
cmp
```

The `comparison_percentage` column is the per-year share, and
`average_comparison_percentage` is the same ratio computed over the whole
period, which is what orders the topics. A year in which the reference has no
records has no defined share and is recorded as `NA` rather than as a
misleading zero.

## A first plot

```{r eval = has_ggplot2, fig.alt = "Four application areas' share of the deep-learning literature from 2013 to 2021, with shaded uncertainty bands", fig.width = 8, fig.height = 4.6}
plot_scopus_comparison(cmp)
```

The chart uses whole-number year breaks, a colour-blind-safe palette and,
because there are only a few topics, labels the lines directly so the reader
need not match colours to a legend. Each label carries the topic's total record
count. The shaded band around each line is a Wilson stability range: it is wide
in the early years, when the reference set is small and the share would move
easily, and narrows as the literature grows.
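
To see why the band behaves this way, the range can be approximated by hand. The chunk below is a minimal sketch that assumes the band is the standard Wilson score interval at the 95% level; the level and any further details used inside `plot_scopus_comparison()` may differ. `wilson_range()` is a hypothetical helper written for this illustration, not a scopusflow function.

```{r}
# Hypothetical helper, not part of scopusflow: Wilson score interval for
# x matching records out of n reference records, returned in percent.
wilson_range <- function(x, n, level = 0.95) {
  z <- stats::qnorm(1 - (1 - level) / 2)
  p <- x / n
  denom <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  data.frame(lower = 100 * (centre - half), upper = 100 * (centre + half))
}

# Computer vision in the first and last mock years:
# 140 of 400 records in 2013, then 720 of 1600 records in 2021.
wilson_range(x = c(140, 720), n = c(400, 1600))
```

With these mock counts the 2013 range is wider than the 2021 range, because the denominator is four times smaller. For a single year, `stats::prop.test(x, n, correct = FALSE)$conf.int` gives the same interval as a proportion.
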
Because 'Scopus' returns exact counts rather than a sample, the band is
illustrative rather than a confidence interval, a point the
`plot_scopus_comparison()` help page sets out.

## Drawing the eye to one topic

When one topic is the focus of a figure, `highlight` draws it in an accent
colour and greys the rest, which keeps the context visible without letting it
compete.

```{r eval = has_ggplot2, fig.alt = "The same chart with the medical-imaging topic highlighted against the others in grey", fig.width = 8, fig.height = 4.6}
plot_scopus_comparison(cmp, highlight = "medical imaging")
```

## Adjusting the labels

The count suffix on each label can be turned off, and the uncertainty band can
be removed, when a cleaner look is wanted.

```{r eval = has_ggplot2, fig.alt = "The comparison chart without record counts or bands", fig.width = 8, fig.height = 4.6}
plot_scopus_comparison(cmp, pub_count_in_legend = FALSE, interval = FALSE)
```

The return value is an ordinary [ggplot2](https://ggplot2.tidyverse.org)
object, so any further adjustment, a different theme or a saved file, is one
`+` or one `ggplot2::ggsave()` away.

## Reading the result as a table

Sometimes the numbers matter more than the picture. Because the output is a
tibble, the usual tools apply: here are the topics ranked by their average
share.

```{r}
comp <- cmp[cmp$query_type == "comparison", ]
unique(comp[, c("abridged_query", "average_comparison_percentage")])
```
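
The shares themselves can also be recomputed from the raw counts with base R. The sketch below uses a small stand-in table, `toy`, holding the last three mock years for two topics under the same column names as `cmp`. It assumes that "the same ratio computed over the whole period" means pooling the counts before dividing; that reading is an interpretation of the description above, not something checked against the package source.

```{r}
# Stand-in table with the same column names as the mock `cmp`;
# the numbers are illustrative, not Scopus results.
toy <- data.frame(
  abridged_query = rep(c("computer vision", "medical imaging"), each = 3),
  year = rep(2019:2021, times = 2),
  n = c(575, 648, 720, 199, 229, 260),
  reference_n = rep(c(1300, 1450, 1600), times = 2)
)

# Per-year share: records matching the reference AND the term,
# as a percentage of records matching the reference alone.
toy$share <- 100 * toy$n / toy$reference_n

# Whole-period share (assumed definition): pool the counts first,
# rather than averaging the yearly percentages.
pooled <- aggregate(cbind(n, reference_n) ~ abridged_query, data = toy, FUN = sum)
pooled$average_share <- 100 * pooled$n / pooled$reference_n

# Rank topics from largest to smallest whole-period share.
pooled[order(-pooled$average_share), ]
```

Pooling weights each year by the size of its reference set, so years with many records count for more than the sparse early ones.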