Package 'rio'

Title: A Swiss-Army Knife for Data I/O
Description: Streamlined data import and export by making assumptions that the user is probably willing to make: 'import()' and 'export()' determine the data format from the file extension, reasonable defaults are used for data import and export, web-based import is natively supported (including from SSL/HTTPS), compressed files can be read directly, and fast import packages are used where appropriate. An additional convenience function, 'convert()', provides a simple method for converting between file types.
Authors: Jason Becker [aut], Chung-hong Chan [aut, cre], David Schoch [aut], Geoffrey CH Chan [ctb], Thomas J. Leeper [aut], Christopher Gandrud [ctb], Andrew MacDonald [ctb], Ista Zahn [ctb], Stanislaus Stadlmann [ctb], Ruaridh Williamson [ctb], Patrick Kennedy [ctb], Ryan Price [ctb], Trevor L Davis [ctb], Nathan Day [ctb], Bill Denney [ctb], Alex Bokov [ctb], Hugo Gruson [ctb]
Maintainer: Chung-hong Chan <[email protected]>
License: GPL-2
Version: 1.2.2
Built: 2024-08-20 12:34:53 UTC
Source: CRAN


Character conversion of labelled data

Description

Convert labelled variables to character or factor

Usage

characterize(x, ...)

factorize(x, ...)

## Default S3 method:
characterize(x, ...)

## S3 method for class 'data.frame'
characterize(x, ...)

## Default S3 method:
factorize(x, coerce_character = FALSE, ...)

## S3 method for class 'data.frame'
factorize(x, ...)

Arguments

x

A vector or data frame.

...

additional arguments passed to methods

coerce_character

A logical indicating whether to additionally coerce character columns to factor (in factorize). Default FALSE.

Details

characterize converts a vector with a labels attribute of named levels into a character vector. factorize does the same but to factors. This can be useful at two stages of a data workflow: (1) importing labelled data from metadata-rich file formats (e.g., Stata or SPSS), and (2) exporting such data to plain text files (e.g., CSV) in a way that preserves information.

Value

a character vector (for characterize) or factor vector (for factorize)

See Also

gather_attrs()

Examples

## vector method
x <- structure(1:4, labels = c("A" = 1, "B" = 2, "C" = 3))
characterize(x)
factorize(x)

## data frame method
x <- data.frame(v1 = structure(1:4, labels = c("A" = 1, "B" = 2, "C" = 3)),
                v2 = structure(c(1,0,0,1), labels = c("foo" = 0, "bar" = 1)))
str(factorize(x))
str(characterize(x))

## Application
csv_file <- tempfile(fileext = ".csv")
## comparison of exported file contents
import(export(x, csv_file))
import(export(factorize(x), csv_file))

Convert from one file format to another

Description

This function constructs a data frame from a data file using import() and uses export() to write the data to disk in the format indicated by the file extension.

Usage

convert(in_file, out_file, in_opts = list(), out_opts = list())

Arguments

in_file

A character string naming an input file.

out_file

A character string naming an output file.

in_opts

A named list of options to be passed to import().

out_opts

A named list of options to be passed to export().
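As a minimal sketch of how in_opts and out_opts are forwarded, the following converts a CSV file to a TSV file while reading only a few rows and suppressing the header on output. The file names are hypothetical temporary paths, and the sketch assumes that nrows (for data.table::fread) and col.names (for data.table::fwrite) reach the underlying functions for delimited formats, as the import() and export() examples suggest.

```r
library(rio)

# Hypothetical temporary files used only for this illustration
csv_in <- tempfile(fileext = ".csv")
tsv_out <- tempfile(fileext = ".tsv")

# Create a small CSV file to convert
export(head(mtcars), csv_in)

# in_opts is passed to import(), out_opts is passed to export()
convert(
    csv_in, tsv_out,
    in_opts = list(nrows = 3),           # read only the first 3 data rows
    out_opts = list(col.names = FALSE)   # write the data without a header line
)

converted_lines <- readLines(tsv_out)
```

Keeping the options in two separate named lists makes it explicit which side of the conversion each option applies to.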

Value

A character string containing the name of the output file (invisibly).

See Also

Luca Braglia has created a Shiny app called rioweb that provides access to the file conversion features of rio through a web browser.

Examples

## For demo, a temp. file path is created with the file extension .dta (Stata)
dta_file <- tempfile(fileext = ".dta")
## .csv
csv_file <- tempfile(fileext = ".csv")
## .xlsx
xlsx_file <- tempfile(fileext = ".xlsx")


## Create a Stata data file
export(mtcars, dta_file)

## convert Stata to CSV and open converted file
convert(dta_file, csv_file)
import(csv_file)

## correct an erroneous file format
export(mtcars, xlsx_file, format = "tsv") ## DON'T DO THIS
## import(xlsx_file) ## ERROR
## convert the file by specifying `in_opts`
convert(xlsx_file, xlsx_file, in_opts = list(format = "tsv"))
import(xlsx_file)

## convert from the command line:
## Rscript -e "rio::convert('mtcars.dta', 'mtcars.csv')"

Export

Description

Write data.frame to a file

Usage

export(x, file, format, ...)

Arguments

x

A data frame, matrix, or single-item list containing a data frame to be written to a file. Exceptions to this rule: x can be a list of multiple data frames if the output file format is an OpenDocument Spreadsheet (.ods, .fods), Excel .xlsx workbook, .Rdata file, or HTML file, and x can be a variety of R objects if the output file format is RDS or JSON (see examples). To export a list of data frames to multiple files, use export_list() instead.

file

A character string naming a file. Must specify file and/or format.

format

An optional character string containing the file format, which can be used to override the format inferred from file. Alternatively, if file is not specified, a file named after the symbol name of x with the specified file extension is created. Must specify file and/or format. Shortcuts include: “,” (for comma-separated values), “;” (for semicolon-separated values), “|” (for pipe-separated values), and “dump” for base::dump().

...

Additional arguments for the underlying export functions. This can be used to specify non-standard arguments. See examples.

Details

This function exports a data frame or matrix into a file with file format based on the file extension (or the manually specified format, if format is specified).

The output file can be written to a compressed file or archive simply by adding an appropriate additional extension to the file argument, such as: “mtcars.csv.tar”, “mtcars.csv.zip”, or “mtcars.csv.gz”.
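A minimal sketch of compressed output, assuming the packages rio relies on for Gzip handling are available in the current installation (install_formats() may help if they are not). The temporary file name is hypothetical; the only point is the extra .gz extension.

```r
library(rio)

# Hypothetical temporary path; the additional ".gz" extension requests compression
gz_file <- tempfile(fileext = ".csv.gz")

# Write mtcars as a Gzip-compressed CSV file
export(mtcars, gz_file)

# Compressed files can be read back directly with import()
roundtrip <- import(gz_file)
```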

export supports many file formats. See the documentation for the underlying export functions for optional arguments that can be passed via ...

When exporting a data set that contains label attributes (e.g., if imported from an SPSS or Stata file) to a plain text file, characterize() can be a useful pre-processing step that records value labels into the resulting file (e.g., export(characterize(x), "file.csv")) rather than the numeric values.

Use export_list() to export a list of dataframes to separate files.
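The characterize() pre-processing step mentioned above can be sketched as follows. The data frame and its labels are made up for illustration; the labels attribute mimics what a Stata or SPSS import might carry.

```r
library(rio)

# Hypothetical labelled data: numeric codes with a value-label mapping
survey <- data.frame(
    sex = structure(c(1, 2, 1), labels = c(male = 1, female = 2))
)

csv_file <- tempfile(fileext = ".csv")

# Replace the numeric codes with their labels before writing plain text
export(characterize(survey), csv_file)

# The CSV file now stores the labels rather than the codes
reimported <- import(csv_file)
```

Without characterize(), the CSV file would contain only the numeric codes, and the mapping to labels would be lost.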

Value

The name of the output file as a character string (invisibly).

See Also

characterize(), import(), convert(), export_list()

Examples

## For demo, a temp. file path is created with the file extension .csv
csv_file <- tempfile(fileext = ".csv")
## .xlsx
xlsx_file <- tempfile(fileext = ".xlsx")

## create CSV to import
export(iris, csv_file)

## You can also export your data using a literal file name rather than a variable:
## export(mtcars, "car_data.csv")

## pass arguments to the underlying function
## data.table::fwrite is the underlying function and `col.names` is an argument
export(iris, csv_file, col.names = FALSE)

## export a list of data frames as worksheets
export(list(a = mtcars, b = iris), xlsx_file)

# NOT RECOMMENDED

## specify `format` to override default format
export(iris, xlsx_file, format = "csv") ## That's confusing
## You can also specify only the format; in the following case
## "mtcars.dta" is written [also confusing]

## export(mtcars, format = "stata")

Export list of data frames to files

Description

Use export() to export a list of data frames to a vector of file names or a filename pattern.

Usage

export_list(x, file, archive = "", ...)

Arguments

x

A list of data frames to be written to files.

file

A character string containing a single file name with a %s wildcard placeholder, or a character vector of file paths for multiple files to be written. If all elements of x are named, the names are used in place of %s; otherwise numbers are used.

archive

character. Either empty string (default) to save files in current directory, a path to a (new) directory, or a .zip/.tar file to compress all files into an archive.

...

Additional arguments passed to export().

Details

export() can export a list of data frames to a single multi-dataset file (e.g., an Rdata or Excel .xlsx file). Use export_list to export such a list to multiple files.
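A minimal sketch of writing a named list to separate files with a %s pattern. The list names and the use of tempdir() are assumptions made for illustration, so that nothing is written to the working directory.

```r
library(rio)

# Hypothetical named list; the names replace %s in the file pattern
dfs <- list(demo_cars = head(mtcars), demo_flowers = head(iris))

# Pattern pointing into the session's temporary directory
pattern <- file.path(tempdir(), "%s.csv")

# One CSV file is written per list element
written <- export_list(dfs, file = pattern)

expected <- file.path(tempdir(), c("demo_cars.csv", "demo_flowers.csv"))
file.exists(expected)
```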

Value

The name(s) of the output file(s) as a character vector (invisibly).

See Also

import(), import_list(), export()

Examples

## For demo, a temp. file path is created with the file extension .xlsx
xlsx_file <- tempfile(fileext = ".xlsx")
export(
    list(
        mtcars1 = mtcars[1:10, ],
        mtcars2 = mtcars[11:20, ],
        mtcars3 = mtcars[21:32, ]
    ),
    xlsx_file
)

# import a single sheet from a multi-object workbook
import(xlsx_file, sheet = "mtcars1")
# import all worksheets, the return value is a list
import_list(xlsx_file)
library('datasets')
export(list(mtcars1 = mtcars[1:10,],
            mtcars2 = mtcars[11:20,],
            mtcars3 = mtcars[21:32,]),
    xlsx_file <- tempfile(fileext = ".xlsx")
)

# import all worksheets
list_of_dfs <- import_list(xlsx_file)

# re-export as separate named files

## export_list(list_of_dfs, file = c("file1.csv", "file2.csv", "file3.csv"))

# re-export as separate files using a name pattern; using the names in the list
## This will be written as "mtcars1.csv", "mtcars2.csv", "mtcars3.csv"

## export_list(list_of_dfs, file = "%s.csv")

Gather attributes from data frame variables

Description

gather_attrs moves variable-level attributes to the data frame level and spread_attrs reverses that operation.

Usage

gather_attrs(x)

spread_attrs(x)

Arguments

x

A data frame.

Details

import() attempts to standardize the return value from the various import functions to the extent possible, thus providing a uniform data structure regardless of what import package or function is used. It achieves this by storing any optional variable-related attributes at the variable level (i.e., an attribute for mtcars$mpg is stored in attributes(mtcars$mpg) rather than attributes(mtcars)). gather_attrs moves these to the data frame level (i.e., in attributes(mtcars)). spread_attrs moves attributes back to the variable level.

Value

x, with variable-level attributes stored at the data frame level.

See Also

import(), characterize()
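As a rough sketch of the round trip described in Details, the following attaches a label attribute to each variable by hand, gathers the attributes to the data frame level, and then spreads them back. The data frame, the attribute name label, and the label text are illustrative assumptions; real imports (e.g., from Stata) may carry other attributes as well.

```r
library(rio)

# Hypothetical data frame with a variable-level attribute on every column
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
attr(df$id, "label") <- "Respondent ID"
attr(df$score, "label") <- "Test score"

# Move variable-level attributes to the data frame level
gathered <- gather_attrs(df)
attr(gathered, "label")

# Reverse the operation: attributes return to the individual variables
restored <- spread_attrs(gathered)
attr(restored$id, "label")
```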


Get File Info

Description

A utility function to retrieve the file information of a filename, path, or URL.

Usage

get_info(file)

get_ext(file)

Arguments

file

A character string containing a filename, file path, or URL.

Value

For get_info(), a list is returned with the following slots:

  • input: file extension or information used to identify the possible file format

  • format: file format, see format argument of import()

  • type: "import" (supported by default); "suggest" (supported by suggested packages, see install_formats()); "enhance" and "known" are not directly supported; NA is unsupported

  • format_name: name of the format

  • import_function: the function used to import this file

  • export_function: the function used to export this file

  • file: file

For get_ext(), only input (usually the file extension) is returned; retained for backward compatibility.
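A short sketch of reading individual slots from the returned list. The file name is only a string here; the file does not need to exist for the lookup.

```r
library(rio)

# Look up how rio would treat a file with this name
info <- get_info("starwars.xlsx")

info$input            # information used to identify the format
info$format           # format code, as accepted by import(format = ...)
info$import_function  # function rio would use to read the file

# get_ext() returns only the input slot
ext <- get_ext("starwars.xlsx")
```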

Examples

get_info("starwars.xlsx")
get_info("starwars.ods")
get_info("https://github.com/ropensci/readODS/raw/v2.1/starwars.ods")
get_info("~/duran_duran_rio.mp3")
get_ext("clipboard") ## "clipboard"
get_ext("https://github.com/ropensci/readODS/raw/v2.1/starwars.ods")

Import

Description

Read in a data.frame from a file. Exceptions to this rule are Rdata, RDS, and JSON input file formats, which return the originally saved object without changing its class.

Usage

import(
  file,
  format,
  setclass = getOption("rio.import.class", "data.frame"),
  which,
  ...
)

Arguments

file

A character string naming a file, a URL, a single compressed file (Gzip or Bzip2), or a .zip or .tar archive.

format

An optional character string code of file format, which can be used to override the format inferred from file. Shortcuts include: “,” (for comma-separated values), “;” (for semicolon-separated values), and “|” (for pipe-separated values).

setclass

An optional character vector specifying one or more classes to set on the import. By default, the return object is always a “data.frame”. Allowed values include “tbl_df”, “tbl”, or “tibble” (if using tibble), “arrow”, “arrow_table” (if using arrow table; the suggested package arrow must be installed) or “data.table” (if using data.table). Other values are ignored, such that a data.frame is returned. The parameter takes precedence over parameters in ... which set a different class.

which

This argument is used to control import from multi-object files; as a rule import only ever returns a single data frame (use import_list() to import multiple data frames from a multi-object file). If file is an archive format (zip and tar), which can be either a character string specifying a filename or an integer specifying which file (in locale sort order) to extract from the archive. A character string value is used as a regular expression, such that the extracted file is the first match of the regular expression against the file names in the archive. See the Which section below for how archives are handled. For Excel spreadsheets, this can be used to specify a sheet name or number. For .Rdata files, this can be an object name. For HTML files, it identifies which table to extract (from document order). Ignored otherwise.

...

Additional arguments passed to the underlying import functions. For example, this can control column classes for delimited file types, or control the use of haven for Stata and SPSS or readxl for Excel (.xlsx) format. See details below.

Details

This function imports a data frame or matrix from a data file with the file format based on the file extension (or the manually specified format, if format is specified).

import supports many file formats; get_info() reports how a given file name is handled.

import attempts to standardize the return value from the various import functions to the extent possible, thus providing a uniform data structure regardless of what import package or function is used. It achieves this by storing any optional variable-related attributes at the variable level (i.e., an attribute for mtcars$mpg is stored in attributes(mtcars$mpg) rather than attributes(mtcars)). If you would prefer these attributes to be stored at the data.frame-level (i.e., in attributes(mtcars)), see gather_attrs().

After importing metadata-rich file formats (e.g., from Stata or SPSS), it may be helpful to recode labelled variables to character or factor using characterize() or factorize() respectively.

Value

A data frame. If setclass is used, this data frame may have additional class attribute values, such as “tibble” or “data.table”.

Trust

For serialization formats (.R, .RDS, and .RData), please note that you should only load these files from trusted sources. This is because these formats are not necessarily for storing rectangular data and can also be used to store many other things, e.g. code. Importing these files could lead to arbitrary code execution. Please read the security principles by the R Project (Plummer, 2024). When importing these files via rio, you should affirm that you trust these files, i.e. trust = TRUE. See example below. If this affirmation is missing, the current version assumes trust to be true for backward compatibility and a deprecation notice will be printed. In the next major release (2.0.0), you must explicitly affirm your trust when importing these files.

Which

For compressed archives (zip and tar, where a compressed file can contain multiple files), the parameter which can be ambiguous because it could indicate two different concepts. For example, it is unclear for .xlsx.zip whether which refers to the selection of an exact file in the archive or the selection of an exact sheet in the decompressed Excel file. In these cases, rio assumes that which is only used for the selection of the file. After the selection of the file with which, rio will return the first item, e.g. the first sheet.

Please note, however, that .gz and .bz2 (e.g. .xlsx.gz) are compressed formats, but not archive formats. In those cases, which is used the same way as for the non-compressed format, e.g. selection of a sheet for Excel.
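For the non-archive case, a minimal sketch of selecting a worksheet with which, by name and by position. The workbook and its sheet names are hypothetical and are created in a temporary file only for this illustration.

```r
library(rio)

# Hypothetical two-sheet workbook
xlsx_file <- tempfile(fileext = ".xlsx")
export(list(cars = head(mtcars), flowers = head(iris)), xlsx_file)

# Select a sheet by name
by_name <- import(xlsx_file, which = "flowers")

# Select a sheet by position (the second sheet in the workbook)
by_position <- import(xlsx_file, which = 2)
```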

Note

For csv and txt files with row names exported from export(), it may be helpful to specify row.names as the column of the table that contains the row names. See example below.

References

Plummer, M (2024). Statement on CVE-2024-27322. https://blog.r-project.org/2024/05/10/statement-on-cve-2024-27322/

See Also

import_list(), characterize(), gather_attrs(), export(), convert()

Examples

## For demo, a temp. file path is created with the file extension .csv
csv_file <- tempfile(fileext = ".csv")
## .xlsx
xlsx_file <- tempfile(fileext = ".xlsx")
## create CSV to import
export(iris, csv_file)
## specify `format` to override default format: see export()
export(iris, xlsx_file, format = "csv")

## basic
import(csv_file)

## You can also import your data using a literal file name rather than a variable:
## import("starwars.csv"); import("mtcars.xlsx")

## Override the default format
## import(xlsx_file) # Error, it is actually not an Excel file
import(xlsx_file, format = "csv")

## import CSV as a `data.table`
import(csv_file, setclass = "data.table")

## import CSV as a tibble (or "tbl_df")
import(csv_file, setclass = "tbl_df")

## pass arguments to underlying import function
## data.table::fread is the underlying import function and `nrows` is its argument
import(csv_file, nrows = 20)

## data.table::fread has an argument `data.table` to set the class explicitly to data.table. The
## argument setclass, however, takes precedence over such undocumented features.
class(import(csv_file, setclass = "tibble", data.table = TRUE))

## the default import class can be set with options(rio.import.class = "data.table")
## options(rio.import.class = "tibble"), or options(rio.import.class = "arrow")

## Security
rds_file <- tempfile(fileext = ".rds")
export(iris, rds_file)

## You should only import serialized formats from trusted sources
## In this case, you can trust it because it's generated by you.
import(rds_file, trust = TRUE)

Import list of data frames

Description

Use import() to import a list of data frames from a vector of file names or from a multi-object file (Excel workbook, .Rdata file, compressed directory in a zip file or tar archive, or HTML file).

Usage

import_list(
  file,
  setclass = getOption("rio.import.class", "data.frame"),
  which,
  rbind = FALSE,
  rbind_label = "_file",
  rbind_fill = TRUE,
  ...
)

Arguments

file

A character string containing a single file name for a multi-object file (e.g., Excel workbook, zip file, tar archive, or HTML file), or a vector of file paths for multiple files to be imported.

setclass

An optional character vector specifying one or more classes to set on the import. By default, the return object is always a “data.frame”. Allowed values include “tbl_df”, “tbl”, or “tibble” (if using tibble), “arrow”, “arrow_table” (if using arrow table; the suggested package arrow must be installed) or “data.table” (if using data.table). Other values are ignored, such that a data.frame is returned. The parameter takes precedence over parameters in ... which set a different class.

which

If file is a single file path, this specifies which objects should be extracted (passed to import()'s which argument). Ignored otherwise.

rbind

A logical indicating whether to pass the import list of data frames through data.table::rbindlist().

rbind_label

If rbind = TRUE, a character string specifying the name of a column to add to the data frame indicating its source file.

rbind_fill

If rbind = TRUE, a logical indicating whether to set fill = TRUE in data.table::rbindlist() (and fill missing columns with NA).

...

Additional arguments passed to import(). Behavior may be unexpected if files are of different formats.

Details

When file is a vector of file paths and any files are missing, those files are ignored (with warnings) and this function will not raise any error. For compressed files, the file name must also contain information about the file format of all compressed files, e.g. files.csv.zip for this function to work.

Value

If rbind = FALSE (the default), a list of data frames. Otherwise, that list is passed to data.table::rbindlist() (with fill set by rbind_fill) and a data frame object of the class set by the setclass argument is returned; if this operation fails, the list is returned.
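A minimal sketch of combining several files into one data frame with rbind = TRUE. The two temporary CSV files and the column name "source" are assumptions chosen for illustration.

```r
library(rio)

# Two hypothetical CSV files with the same columns
file_a <- tempfile(fileext = ".csv")
file_b <- tempfile(fileext = ".csv")
export(head(mtcars, 3), file_a)
export(tail(mtcars, 2), file_b)

# Import both files and stack them; "source" identifies where each row came from
combined <- import_list(
    c(file_a, file_b),
    rbind = TRUE,
    rbind_label = "source"
)

dim(combined)
names(combined)
```

With rbind = FALSE the same call would instead return a list with one data frame per file.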

Trust

For serialization formats (.R, .RDS, and .RData), please note that you should only load these files from trusted sources. This is because these formats are not necessarily for storing rectangular data and can also be used to store many other things, e.g. code. Importing these files could lead to arbitrary code execution. Please read the security principles by the R Project (Plummer, 2024). When importing these files via rio, you should affirm that you trust these files, i.e. trust = TRUE. See example below. If this affirmation is missing, the current version assumes trust to be true for backward compatibility and a deprecation notice will be printed. In the next major release (2.0.0), you must explicitly affirm your trust when importing these files.

Which

For compressed archives (zip and tar, where a compressed file can contain multiple files), the parameter which can be ambiguous because it could indicate two different concepts. For example, it is unclear for .xlsx.zip whether which refers to the selection of an exact file in the archive or the selection of an exact sheet in the decompressed Excel file. In these cases, rio assumes that which is only used for the selection of the file. After the selection of the file with which, rio will return the first item, e.g. the first sheet.

Please note, however, that .gz and .bz2 (e.g. .xlsx.gz) are compressed formats, but not archive formats. In those cases, which is used the same way as for the non-compressed format, e.g. selection of a sheet for Excel.

References

Plummer, M (2024). Statement on CVE-2024-27322. https://blog.r-project.org/2024/05/10/statement-on-cve-2024-27322/

See Also

import(), export_list(), export()

Examples

## For demo, a temp. file path is created with the file extension .xlsx
xlsx_file <- tempfile(fileext = ".xlsx")
export(
    list(
        mtcars1 = mtcars[1:10, ],
        mtcars2 = mtcars[11:20, ],
        mtcars3 = mtcars[21:32, ]
    ),
    xlsx_file
)

# import a single sheet from a multi-object workbook
import(xlsx_file, sheet = "mtcars1")
# import all worksheets, the return value is a list
import_list(xlsx_file)

# import and rbind all worksheets, the return value is a data frame
import_list(xlsx_file, rbind = TRUE)

Install rio's ‘Suggests’ Dependencies

Description

This function installs various ‘Suggests’ dependencies for rio that expand its support to the full range of supported import and export formats. These packages are not installed or loaded by default in order to create a slimmer and faster package build, install, and load.

Usage

install_formats(...)

Arguments

...

Additional arguments passed to utils::install.packages().

Value

NULL

Examples

if (interactive()) {
    install_formats()
}

A Swiss-Army Knife for Data I/O

Description

The aim of rio is to make data file input and output as easy as possible. export() and import() serve as a Swiss-army knife for painless data I/O for data from almost any file format by inferring the data structure from the file extension, natively reading web-based data sources, setting reasonable defaults for import and export, and relying on efficient data import and export packages. An additional convenience function, convert(), provides a simple method for converting between file types.

Note that some of rio's functionality is provided by ‘Suggests’ dependencies, meaning they are not installed by default. Use install_formats() to make sure these packages are available for use.

Author(s)

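Maintainer: Chung-hong Chan [email protected]

Authors:

  • Jason Becker

  • David Schoch

  • Thomas J. Leeper

Other contributors:

  • Geoffrey CH Chan

  • Christopher Gandrud

  • Andrew MacDonald

  • Ista Zahn

  • Stanislaus Stadlmann

  • Ruaridh Williamson

  • Patrick Kennedy

  • Ryan Price

  • Trevor L Davis

  • Nathan Day

  • Bill Denney

  • Alex Bokov

  • Hugo Gruson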

References

datamods provides Shiny modules for importing data via rio.

GREA provides an RStudio add-in to import data using rio.

See Also

import(), import_list(), export(), export_list(), convert(), install_formats()

Examples

# export
library("datasets")
export(mtcars, csv_file <- tempfile(fileext = ".csv")) # comma-separated values
export(mtcars, rds_file <- tempfile(fileext = ".rds")) # R serialized
export(mtcars, sav_file <- tempfile(fileext = ".sav")) # SPSS

# import
x <- import(csv_file)
y <- import(rds_file)
z <- import(sav_file)

# convert sav (SPSS) to dta (Stata)
convert(sav_file, dta_file <- tempfile(fileext = ".dta"))

# cleanup
unlink(c(csv_file, rds_file, sav_file, dta_file))