Package 'inlpubs' reference manual

Title:	USGS INL Project Office Publications
Description:	Contains bibliographic information for the U.S. Geological Survey (USGS) Idaho National Laboratory (INL) Project Office.
Authors:	Jason C. Fisher [aut, cre], Kerri C. Treinen [aut], Allison R. Trcka [aut]
Maintainer:	Jason C. Fisher <[email protected]>
License:	CC0
Version:	1.1.3
Built:	2024-10-01 06:51:13 UTC
Source:	CRAN

Contributing Authors to INLPO Publications

Description

Authors who have contributed to the publications by the U.S. Geological Survey (USGS), Idaho Water Science Center, Idaho National Laboratory Project Office (INLPO).

Usage

authors

Format

An object of class 'author' that inherits behavior from the 'data.frame' class and includes the following columns:

author_id: Unique identifier for the author.
name: Name of author, surname first and initials or given name.
person: Information about the person, such as email address and ORCID identifier.
pub_id: Identifier(s) of the publication(s) the author has contributed to; refers to the primary key of the pubs data table.
total_pub: Total number of publications.
single_authored: Number of single-authored publications.
multi_authored: Number of multi-authored publications.
first_authored: Number of multi-authored publications where the researcher appears as first author.
first_year: First year author published.
last_year: Last year author published.
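
The summary columns can be illustrated with a small sketch. This is not the package's code: the toy table `pubs_toy`, its identifiers, and the helper `summarize_author` are hypothetical, and the sketch assumes that each publication lists its authors in byline order.

```r
# Toy publication records (hypothetical identifiers, not taken from the package).
# Each element of author_id lists a publication's authors in byline order.
pubs_toy <- data.frame(
  pub_id = c("Smith2001", "SmithJones2005", "JonesSmith2010"),
  year = c(2001L, 2005L, 2010L)
)
pubs_toy$author_id <- list("smith", c("smith", "jones"), c("jones", "smith"))

# Summarize one author's record from the toy table (hypothetical helper).
summarize_author <- function(id, pubs) {
  is_author <- vapply(pubs$author_id, function(a) id %in% a, logical(1))
  is_first <- vapply(pubs$author_id, function(a) a[1] == id, logical(1))
  n_authors <- lengths(pubs$author_id)
  data.frame(
    author_id = id,
    total_pub = sum(is_author),
    single_authored = sum(is_author & n_authors == 1L),
    multi_authored = sum(is_author & n_authors > 1L),
    first_authored = sum(is_first & n_authors > 1L),
    first_year = min(pubs$year[is_author]),
    last_year = max(pubs$year[is_author])
  )
}

smith <- summarize_author("smith", pubs_toy)
print(smith)
```

In this toy case the single-authored and multi-authored counts add up to the total, and only the 2005 record counts toward first_authored because that column is restricted to multi-authored publications.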

Source

Curated by INLPO staff.

Examples

# Subset Jason Fisher's information and display structure:
author <- authors["jfisher", ]
str(author, max.level = 3, width = 75, strict.width = "cut")

# Print author's given name:
author$person |> format(include = "given")

Extract Image from a PDF Document

Description

Extract an image from any PDF document. Requires that the pdftools and magick packages are available.

Usage

extract_pdf_image(
  input,
  output = tempfile(fileext = ".jpg"),
  page = 1,
  width = 300,
  depth = 8,
  quality = 70
)

Arguments

`input`	'character' string. File path to PDF document.
`output`	'character' string. Location to write the JPEG image file.
`page`	'integer' number. Page number in the document. Defaults to page 1.
`width`	'integer' number. Image width in pixels.
`depth`	'integer' number. Image color depth (either 8 or 16). Defaults to 8.
`quality`	'integer' number. JPEG quality, a number between 0 and 100. Defaults to 70.

Value

Returns the path to the image file.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

Examples

input <- system.file("extdata", "test.pdf", package = "inlpubs")
path <- extract_pdf_image(input)

unlink(path)

Extract Text from a PDF Document

Description

Extract text from any PDF document. Requires that the pdftools and tesseract packages are available.

Usage

extract_pdf_text(
  input,
  output = tempfile(fileext = ".txt"),
  dpi = 600,
  psm = 1
)

Arguments

`input`	'character' string. File path to PDF document.
`output`	'character' string. Location to write the text file.
`dpi`	'integer' number between 100 and 1200. Dots per inch (DPI). The resolution of an image, specifically the number of pixels per inch. For optimal optical character recognition (OCR) accuracy, 600 DPI (the default) is recommended.
`psm`	'integer' number between 0 and 13. Page Segmentation Mode (PSM). Describes the layout of the text to extract. PSM 1 (the default) automatically segments the page into different text areas and also detects the orientation and script of the text; it is the mode to use when processing two columns of text.

Value

Returns the path to the text file. Each page from the PDF is transcribed as a separate line in the file.
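
Because each page is written as one line, the output can be read back with base R so that element i of the result corresponds to page i. The sketch below uses a made-up stand-in file rather than real function output, so it runs without the package or its OCR dependencies.

```r
# Stand-in for a file written by extract_pdf_text(); the content is made up.
path <- tempfile(fileext = ".txt")
writeLines(
  c("Text of page one.", "Text of page two.", "Text of page three."),
  path
)

# Read the file back; under the one-line-per-page layout, element i is page i.
pages <- readLines(path)
n_pages <- length(pages)
page_two <- pages[2]

# Count whitespace-separated words on each page.
words_per_page <- lengths(strsplit(pages, "[[:space:]]+"))

unlink(path)
```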

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

Examples

## Not run: 
  input <- system.file("extdata", "test.pdf", package = "inlpubs")
  path <- extract_pdf_text(input)

  unlink(path)

## End(Not run)

Create Word Cloud

Description

Create a word cloud from a frequency table of words, and save it to a PNG file. Requires that the R packages htmltools, htmlwidgets, magick, webshot2, and wordcloud2 are available. System dependencies include the following: ImageMagick for displaying the PNG image, OptiPNG for PNG file compression, and Chrome or a Chromium-based browser with support for the Chrome DevTools protocol. Use the find_chromate function to find the path to the Chrome browser.

Usage

make_wordcloud(
  x,
  max_terms = 200,
  size = 1,
  shape = "circle",
  ellipticity = 0.65,
  ...,
  width = 910,
  output = NULL,
  display = FALSE
)

Arguments

`x`	'data.frame'. A frequency table of terms that includes a "term" column and a "freq" column.
`max_terms`	'integer' number. Maximum number of terms to include in the word cloud.
`size`	'numeric' number. Font size.
`shape`	'character' string. Shape of the “cloud” to draw. Possible shapes include a "circle", "cardioid", "diamond", "triangle-forward", "triangle", "pentagon", and "star".
`ellipticity`	'numeric' number. Degree of “flatness” of the shape to draw, a value between 0 and 1.
`...`	Additional arguments to be passed to the `wordcloud2` function.
`width`	'integer' number. Desired image width in pixels.
`output`	'character' string. Path to the output file. By default, the word cloud is written to a temporary file.
`display`	'logical' flag. Whether to display the saved PNG file in a graphics window. Requires access to the magick package.
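
A frequency table of the shape described for `x` can be built in base R. The sketch below is illustrative only: the words are made up, and truncating to the most frequent terms is an assumption about what `max_terms` does, not a description of the function's internals.

```r
# Made-up words standing in for text from a document.
words <- c("aquifer", "basalt", "aquifer", "tracer", "basalt", "aquifer", "well")

# Tabulate the words and order the counts from most to least frequent.
counts <- sort(table(words), decreasing = TRUE)

# Build the two-column frequency table with "term" and "freq" columns.
x <- data.frame(
  term = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)

# Keep at most `max_terms` of the most frequent terms (assumed behavior).
max_terms <- 2
x_top <- head(x, n = max_terms)
print(x_top)
```

If the package and its dependencies are installed, a table like `x` could then be passed as the first argument, for example `make_wordcloud(x, max_terms = 2)`.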

Value

File path to the word cloud plot in PNG format.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

Examples

## Not run: 
  d <- wordcloud2::demoFreq |> head(n = 10)
  colnames(d) <- c("term", "freq")
  file <- make_wordcloud(d, display = interactive())

  unlink(file)

## End(Not run)

Mine Text

Description

Performs a term frequency text analysis. A term is defined as a word or group of words.

Usage

mine_text(docs, ngmin = 1, ngmax = ngmin, sparse = NULL)

Arguments

`docs`	'list' or 'character' vector. Document text to analyze. Each list item contains the extracted text from a single document.
`ngmin`, `ngmax`	'integer' number. Splits strings into n-grams with the given minimum and maximum numbers of grams. An n-gram is an ordered sequence of n words taken from the body of a text. Requires that the RWeka package is available and that the environment variable JAVA_HOME points to where the Java software is located. Recommended for single text components only.
`sparse`	'numeric' number that is greater than 0 and less than 1. Sparsity threshold for a term, where the sparsity of a term is the proportion of documents in which the term does not appear. Terms that are more sparse than the threshold are removed. For example, if you set `sparse` to 0.99, only terms that are more sparse than 0.99 are removed. Conversely, at 0.01, only terms appearing in nearly every document are retained.
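
The n-gram and sparsity ideas can be sketched in base R without RWeka or Java. The helpers `make_ngrams` and `drop_sparse_terms` are hypothetical and are not the package's implementation; in particular, treating the threshold as a strict upper bound on sparsity is an assumption.

```r
# Split a string into lowercase words and paste consecutive runs of n words.
make_ngrams <- function(text, n) {
  clean <- tolower(gsub("[[:punct:]]", "", text))
  words <- strsplit(clean, "[[:space:]]+")[[1]]
  if (length(words) < n) {
    return(character(0))
  }
  starts <- seq_len(length(words) - n + 1L)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1L)], collapse = " "),
    character(1)
  )
}

bigrams <- make_ngrams("Pack my brown box.", n = 2)

# Toy term-by-document counts (rows are terms, columns are documents).
m <- matrix(
  c(1, 1, 1,
    2, 0, 1,
    0, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("brown", "dog", "box"), c("d1", "d2", "d3"))
)

# Sparsity of a term: the share of documents in which it does not appear.
sparsity <- rowMeans(m == 0)

# Keep terms whose sparsity is below the threshold (assumed strict bound).
drop_sparse_terms <- function(m, sparse) {
  m[rowMeans(m == 0) < sparse, , drop = FALSE]
}

kept <- rownames(drop_sparse_terms(m, sparse = 0.5))
```

Here "box" is absent from two of the three toy documents, so a threshold of 0.5 drops it, while a threshold close to 1 would keep all three terms.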

Details

HTML entities are decoded when the textutils package is available.

Value

A term-frequency data table giving the number of times each word occurs in the text. A column in the table represents a single component in the docs argument, and each row provides frequency counts for a particular word (also known as a 'term').
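
The layout of the returned table, with terms in rows and documents in columns, can be mimicked with base R. This is a minimal sketch and not the function's implementation: it only lowercases, strips punctuation, and counts words, whereas the package may apply further text processing.

```r
docs <- c(
  A = "The quick brown fox jumps over the lazy lazy dog.",
  B = "Pack my brown box."
)

# Lowercase, strip punctuation, and split each document into words.
tokens <- lapply(docs, function(s) {
  strsplit(tolower(gsub("[[:punct:]]", "", s)), "[[:space:]]+")[[1]]
})

# Count each term per document; rows are terms and columns are documents.
all_terms <- sort(unique(unlist(tokens)))
tf <- vapply(
  tokens,
  function(w) as.integer(table(factor(w, levels = all_terms))),
  integer(length(all_terms))
)
rownames(tf) <- all_terms
print(tf)
```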

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

Examples

d <- c(
  "The quick brown fox jumps over the lazy lazy dog.",
  "Pack my brown box.",
  "Jazz fly brown dog."
) |>
  mine_text()

d <- list(
  "A" = "The quick brown fox jumps over the lazy lazy dog.",
  "B" = c("Pack my brown box.", NA, "Jazz fly brown dog."),
  "C" = NA_character_
) |>
  mine_text()

Publications of the INLPO

Description

Bibliographic information for reports, articles, maps, and theses related to scientific monitoring and research conducted by the U.S. Geological Survey (USGS), Idaho Water Science Center, Idaho National Laboratory Project Office (INLPO).

Usage

pubs

Format

An object of class 'pub' that inherits behavior from the 'data.frame' class and includes the following columns:

pub_id: Unique identifier for the publication.
institution: Name of the institution that published and/or sponsored the report.
type: Type of publication.
text_ref: Text reference (also known as the in-text citation) that excludes the year of publication.
year: Year of publication.
author_id: Identifier(s) of the author(s); refers to the primary key of the authors data table.
title: Title of publication.
bibentry: Bibliographic entry of class bibentry.
abstract: Abstract of publication.
annotation: Annotation of publication.
annotation_src: Identifier for the annotation source publication (Knobel and others, 2005; Bartholomay, 2022).
files: File names associated with the publication.

Source

Many of these publications are available through the USGS Publications Warehouse.

References

Bartholomay, R.C., 2022, Historical development of the U.S. Geological Survey hydrological monitoring and investigative programs at the Idaho National Laboratory, Idaho, 2002-2020: U.S. Geological Survey Open-File Report 2022-1027 (DOE/ID-22256), 54 p., doi:10.3133/ofr20221027.

Knobel, L.L., Bartholomay, R.C., and Rousseau, J.P., 2005, Historical development of the U.S. Geological Survey hydrologic monitoring and investigative programs at the Idaho National Engineering and Environmental Laboratory, Idaho, 1949 to 2001: U.S. Geological Survey Open-File Report 2005–1223 (DOE/ID–22195), 93 p., doi:10.3133/ofr20051223.

Examples

# Subset Fisher and others (2012) and display structure:
id <- "FisherOthers2012"
pub <- pubs[id, ]
str(pub, max.level = 3, width = 75, strict.width = "cut")

# Print suggested citation:
attr(unclass(pub$bibentry[[1]])[[1]], which = "textVersion")

# Print authors' full names:
format(pub$bibentry[[1]]$author, include = c("given", "family"))

# Print abstract:
pub$abstract

Search Terms

Description

Pattern matches a search term within the term-frequency data table.

Usage

search_terms(
  x,
  data = inlpubs::terms,
  ignore.case = TRUE,
  ...,
  low_freq = 1,
  high_freq = Inf,
  simplify = TRUE
)

Arguments

`x`	'character' string. Term searched for in the term-frequency data table.
`data`	'term' and 'data.frame' class. Term-frequency data table. Defaults to the term frequencies from the INLPO publications; see the `terms` dataset for details.
`ignore.case`	'logical' flag. Whether to ignore character case during pattern matching.
`...`	Additional arguments passed to the `grep` function.
`low_freq`	'numeric' number. Lower frequency bound.
`high_freq`	'numeric' number. Upper frequency bound.
`simplify`	'logical' flag. Whether to return only the unique publication identifiers.

Value

A subset of the data table sorted by decreasing frequency.
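
The search logic described by the arguments can be approximated on a toy table with base R. The helper `search_sketch` and the table `terms_toy` are hypothetical, the sketch uses `grepl` for matching, and treating both frequency bounds as inclusive is an assumption rather than documented behavior.

```r
# Toy long-format term table (hypothetical terms and identifiers).
terms_toy <- data.frame(
  term = c("mlms", "MLMS data", "aquifer", "mlms", "tracer"),
  pub_id = c("PubA", "PubB", "PubA", "PubC", "PubC"),
  freq = c(12L, 3L, 40L, 1L, 7L),
  stringsAsFactors = FALSE
)

# Match a pattern, apply frequency bounds, and sort by decreasing frequency.
search_sketch <- function(x, data, ignore.case = TRUE, ...,
                          low_freq = 1, high_freq = Inf, simplify = TRUE) {
  hit <- grepl(x, data$term, ignore.case = ignore.case, ...)
  in_range <- data$freq >= low_freq & data$freq <= high_freq
  out <- data[hit & in_range, , drop = FALSE]
  out <- out[order(out$freq, decreasing = TRUE), , drop = FALSE]
  if (simplify) unique(out$pub_id) else out
}

# Unique publication identifiers for matching terms.
ids <- search_sketch("mlms", terms_toy)

# Full rows for matches that occur at least twice.
strong <- search_sketch("mlms", terms_toy, low_freq = 2, simplify = FALSE)
```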

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

Examples

search_terms("mlms")

out <- search_terms("mlms", simplify = FALSE)
head(out)

Term Frequency from INLPO Publications

Description

Term frequency from publications by the U.S. Geological Survey (USGS), Idaho Water Science Center, Idaho National Laboratory Project Office (INLPO).

Usage

terms

Format

An object of class 'term' that inherits behavior from the 'data.frame' class and includes the following columns:

term: Term, a word or group of words, represented by an ASCII character string in lowercase.
pub_id: Identifier for a publication; refers to the primary key of the pubs data table.
freq: Frequency count from text analysis.

Source

The publication text was sourced from the original PDF documents using the extract_pdf_text function, and term frequencies were extracted from the text using the mine_text function.
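
One way to picture the step from a term-by-document table to the long format of this dataset is the base R reshape below. It is a sketch under assumptions: the toy counts and identifiers are made up, and dropping zero counts is a guess about how the dataset was assembled.

```r
# Toy wide table: rows are terms and columns are publications.
tf <- matrix(
  c(2L, 0L,
    1L, 1L),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("lazy", "brown"), c("PubA", "PubB"))
)

# Convert to long format, one row per term and publication.
long <- as.data.frame(as.table(tf), stringsAsFactors = FALSE)
names(long) <- c("term", "pub_id", "freq")

# Drop term and publication pairs with a zero count (assumed step).
long <- long[long$freq > 0, ]
rownames(long) <- NULL
print(long)
```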

Examples

str(terms, max.level = 3, width = 75, strict.width = "cut")
