Introduction to working with Data Packages

A Data Package is a collection of files that consists of both data, which can be any type of information such as images and CSV files, and metadata. These files are usually stored in one directory (possibly with subdirectories), although links to external data are possible. Metadata is data about data: it consists of the information needed by software to use the data and the information needed by users of the data, such as descriptions, names of authors and licences. The metadata is stored in a file in the directory that is usually called datapackage.json. The information in this file is what will be called the Data Package below.

As mentioned, it contains both information on the Data Package itself (title, description) and information on a number of Data Resources. The Data Resources describe the data files in the Data Package. They also contain information like a title and description, as well as information needed by software to use the data, such as the path to the data (the location of the data) and technical information such as how the data is stored. This information makes it easier to use the data. Below we show how to use the information in a Data Package to easily read and work with the data, and how to create a Data Package for our own data.

Overview of terminology

Below is an overview of some of the terminology associated with Data Packages.

Data Package

Contains one or more Data Resources.
Has a number of properties like title, name and description.

Data Resource

Contains data either as inline data in a data property or external data pointed to by a path property.
Has a number of properties, like title, name, encoding, …

Tabular Data Resource

Is a Data Resource with an additional set of properties and constraints.
Has a Table Schema.

Table Schema

Describes a tabular data set (a data set with rows and columns, as usually stored in a data.frame in R).
Has one or more Field Descriptors.

Field Descriptor

Describes a Field (a column) in a tabular data set.
Has a number of properties, like name and type.
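The nesting of these terms can be sketched with plain R lists. This is only an illustration of the hierarchy described above, not how the datapackage package stores its objects; all values (including the field types) are hypothetical.

```r
# Hypothetical descriptor; names and values are illustrative only.
# Field Descriptors: one per column
field_id     <- list(name = "id", type = "integer")
field_income <- list(name = "income", type = "number")

# Table Schema: has one or more Field Descriptors
table_schema <- list(fields = list(field_id, field_income))

# (Tabular) Data Resource: points to the data and has a Table Schema
resource <- list(
  name   = "employment",
  title  = "Employment status",
  path   = "employ.csv",
  schema = table_schema
)

# Data Package: has properties of its own and one or more Data Resources
data_package <- list(
  name      = "example",
  title     = "Example data set",
  resources = list(resource)
)

# Walk down the hierarchy: package -> resource -> schema -> field names
field_names <- vapply(data_package$resources[[1]]$schema$fields,
  function(f) f$name, character(1))
field_names
```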

tl;dr

open_datapackage() reads the metadata from datapackage.json. From the output below you can see that the Data Package has three Data Resources.

> library(datapackage, warn.conflicts = FALSE)
> dir <- system.file("examples/employ", package = "datapackage")
> dp <- open_datapackage(dir)
> dp
[example] Example data set for the datapackage package

This is an example data set to show how the datapackage package can be used to import data into R.
...

Location: </tmp/Rtmpo1cZ4i/Rinst33f69cf045/datapackage/examples/employ>
Resources:
[employment] Employment status
[codelist-gender] Code list for gender of person
[codelist-employ] Code list for employment status

To read the data belonging to one of the data resources:

> dta <- dp |> dp_resource("employment") |> dp_get_data()
> dta
          id        dob gender employ  income haspartner
1  368509515 1993-06-14      M      E 2691.80      FALSE
2  187844355 1961-10-08      X      U      NA      FALSE
3  273040044 1982-06-17      F      E  533.65      FALSE
4  963831798 1965-02-15      M      E  790.13      FALSE
5  854856378 1990-01-30      F      E  716.79       TRUE
6   20072760 1961-08-18      F      E 1651.60      FALSE
7  429782078 2019-09-23      M      N      NA      FALSE
8  711292034 1994-02-16      M      E  455.98      FALSE
9  949458305 2005-06-16      F      N      NA      FALSE
10 911459071 2007-02-27      F      N      NA      FALSE
11 921370403 1981-09-14      F      E 1461.36      FALSE
12  26901869 1981-03-08      F      E 1153.02       TRUE
13 640668848 1993-08-30      M      E  708.98       TRUE
14 996464509 1960-12-10      M      E 1088.99      FALSE
15  58820512 1962-10-13      M      E 2243.10      FALSE
16 288242988 2013-06-11      M      N      NA      FALSE
17 549758863 1990-06-09      M      E  719.69       TRUE
18 998045846 1973-01-25      F      E 1312.45      FALSE
19 902078272 1962-05-29      F      E  618.27      FALSE
20 594477489 1952-08-31      -      N      NA       TRUE

When the name of the data resource is known, the data can also be read directly from the data package without explicitly opening the Data Package:

> dta <- dp_load_from_datapackage(dir, "employment")
> dta
          id        dob gender employ  income haspartner
1  368509515 1993-06-14      M      E 2691.80      FALSE
2  187844355 1961-10-08      X      U      NA      FALSE
3  273040044 1982-06-17      F      E  533.65      FALSE
4  963831798 1965-02-15      M      E  790.13      FALSE
5  854856378 1990-01-30      F      E  716.79       TRUE
6   20072760 1961-08-18      F      E 1651.60      FALSE
7  429782078 2019-09-23      M      N      NA      FALSE
8  711292034 1994-02-16      M      E  455.98      FALSE
9  949458305 2005-06-16      F      N      NA      FALSE
10 911459071 2007-02-27      F      N      NA      FALSE
11 921370403 1981-09-14      F      E 1461.36      FALSE
12  26901869 1981-03-08      F      E 1153.02       TRUE
13 640668848 1993-08-30      M      E  708.98       TRUE
14 996464509 1960-12-10      M      E 1088.99      FALSE
15  58820512 1962-10-13      M      E 2243.10      FALSE
16 288242988 2013-06-11      M      N      NA      FALSE
17 549758863 1990-06-09      M      E  719.69       TRUE
18 998045846 1973-01-25      F      E 1312.45      FALSE
19 902078272 1962-05-29      F      E  618.27      FALSE
20 594477489 1952-08-31      -      N      NA       TRUE

With the convert_categories argument, categorical variables can be converted to factors:

> dta <- dp_load_from_datapackage(dir, "employment", 
+   convert_categories = "to_factor")
> dta
          id        dob  gender                 employ  income haspartner
1  368509515 1993-06-14    Male               Employed 2691.80      FALSE
2  187844355 1961-10-08   Other             Unemployed      NA      FALSE
3  273040044 1982-06-17  Female               Employed  533.65      FALSE
4  963831798 1965-02-15    Male               Employed  790.13      FALSE
5  854856378 1990-01-30  Female               Employed  716.79       TRUE
6   20072760 1961-08-18  Female               Employed 1651.60      FALSE
7  429782078 2019-09-23    Male Non-working-population      NA      FALSE
8  711292034 1994-02-16    Male               Employed  455.98      FALSE
9  949458305 2005-06-16  Female Non-working-population      NA      FALSE
10 911459071 2007-02-27  Female Non-working-population      NA      FALSE
11 921370403 1981-09-14  Female               Employed 1461.36      FALSE
12  26901869 1981-03-08  Female               Employed 1153.02       TRUE
13 640668848 1993-08-30    Male               Employed  708.98       TRUE
14 996464509 1960-12-10    Male               Employed 1088.99      FALSE
15  58820512 1962-10-13    Male               Employed 2243.10      FALSE
16 288242988 2013-06-11    Male Non-working-population      NA      FALSE
17 549758863 1990-06-09    Male               Employed  719.69       TRUE
18 998045846 1973-01-25  Female               Employed 1312.45      FALSE
19 902078272 1962-05-29  Female               Employed  618.27      FALSE
20 594477489 1952-08-31 Unknown Non-working-population      NA       TRUE

Or, they can be converted to the code class from the codelist package. This will preserve both the codes and the labels:

> library(codelist)
> dta <- dp_load_from_datapackage(dir, "employment", 
+   convert_categories = "to_code")
> dta
          id        dob     gender      employ  income haspartner
1  368509515 1993-06-14 M[Male]    E[Employed] 2691.80      FALSE
2  187844355 1961-10-08 X[Other]   U[Unemplo…]      NA      FALSE
3  273040044 1982-06-17 F[Female]  E[Employed]  533.65      FALSE
4  963831798 1965-02-15 M[Male]    E[Employed]  790.13      FALSE
5  854856378 1990-01-30 F[Female]  E[Employed]  716.79       TRUE
6   20072760 1961-08-18 F[Female]  E[Employed] 1651.60      FALSE
7  429782078 2019-09-23 M[Male]    N[Non-wor…]      NA      FALSE
8  711292034 1994-02-16 M[Male]    E[Employed]  455.98      FALSE
9  949458305 2005-06-16 F[Female]  N[Non-wor…]      NA      FALSE
10 911459071 2007-02-27 F[Female]  N[Non-wor…]      NA      FALSE
11 921370403 1981-09-14 F[Female]  E[Employed] 1461.36      FALSE
12  26901869 1981-03-08 F[Female]  E[Employed] 1153.02       TRUE
13 640668848 1993-08-30 M[Male]    E[Employed]  708.98       TRUE
14 996464509 1960-12-10 M[Male]    E[Employed] 1088.99      FALSE
15  58820512 1962-10-13 M[Male]    E[Employed] 2243.10      FALSE
16 288242988 2013-06-11 M[Male]    N[Non-wor…]      NA      FALSE
17 549758863 1990-06-09 M[Male]    E[Employed]  719.69       TRUE
18 998045846 1973-01-25 F[Female]  E[Employed] 1312.45      FALSE
19 902078272 1962-05-29 F[Female]  E[Employed]  618.27      FALSE
20 594477489 1952-08-31 -[Unknown] N[Non-wor…]      NA       TRUE

When the data resource name is omitted from dp_load_from_datapackage(), either the data resource with the same name as the data package or the first data resource is opened.
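The selection rule can be paraphrased in a few lines of base R. This is a sketch of the rule as stated above, with a hypothetical helper name (default_resource); it is not the package's actual implementation.

```r
# Hypothetical helper: pick the resource that is opened when no name is given.
default_resource <- function(package_name, resource_names) {
  if (package_name %in% resource_names) {
    package_name         # a resource named like the package wins
  } else {
    resource_names[1]    # otherwise fall back to the first resource
  }
}

# The example package is called "example", which is not a resource name
first_choice <- default_resource("example",
  c("employment", "codelist-gender", "codelist-employ"))

# A made-up package in which a resource shares the package name
same_name <- default_resource("iris", c("meta", "iris"))
```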

Getting information from a Data Package

Below we open an example Data Package that comes with the package:

> library(datapackage, warn.conflicts = FALSE)
> dir <- system.file("examples/employ", package = "datapackage")
> dp <- open_datapackage(dir)
> dp
[example] Example data set for the datapackage package

This is an example data set to show how the datapackage package can be used to import data into R.
...

Location: </tmp/Rtmpo1cZ4i/Rinst33f69cf045/datapackage/examples/employ>
Resources:
[employment] Employment status
[codelist-gender] Code list for gender of person
[codelist-employ] Code list for employment status

The print statement shows the name of the package (example), the title, the first paragraph of the description, the location of the Data Package and the Data Resources in the package. In this case there are three Data Resources:

> dp_nresources(dp)
[1] 3

The names are

> dp_resource_names(dp)
[1] "employment"      "codelist-gender" "codelist-employ"

Using the dp_resource() method, a Data Resource can be obtained from the Data Package:

> employ <- dp_resource(dp, "employment")
> employ
[employment] Employment status

Employment status, income and background properties of sample of persons.
...

Selected properties:
path     :"employ.csv"
format   :"csv"
mediatype:"text/csv"
encoding :"utf-8"
dialect  :List of 3
schema   :Table Schema [6] "id" "dob" "gender" "employ" "income" "haspartner"

The print statement again shows the name, title and description. It also shows that the data is in a CSV file named employ.csv. By default, print shows only a few properties of the Data Resource. To show all properties:

> print(employ, properties = NA)
[employment] Employment status

Employment status, income and background properties of sample of persons.
...

Selected properties:
path     :"employ.csv"
format   :"csv"
mediatype:"text/csv"
encoding :"utf-8"
dialect  :List of 3
schema   :Table Schema [6] "id" "dob" "gender" "employ" "income" "haspartner"

Using this information it should be possible to open the dataset. The data can be opened in R using the dp_get_data() method. Based on the information in the Data Resource this function will try to open the dataset using the correct functions in R (in this case read.csv()):

> dta <- dp_get_data(employ)
> head(dta)
         id        dob gender employ  income haspartner
1 368509515 1993-06-14      M      E 2691.80      FALSE
2 187844355 1961-10-08      X      U      NA      FALSE
3 273040044 1982-06-17      F      E  533.65      FALSE
4 963831798 1965-02-15      M      E  790.13      FALSE
5 854856378 1990-01-30      F      E  716.79       TRUE
6  20072760 1961-08-18      F      E 1651.60      FALSE

It is also possible to import the data directly from the Data Package object by specifying the resource for which the data needs to be imported.

> dta <- dp_get_data(dp, "employment")

The dp_get_data() method only supports a limited set of data formats. It is possible to provide a custom function to read the data using the reader argument of dp_get_data(). However, it is also possible to import the data ‘manually’ using the information in the Data Package. The path of the file in a Data Resource can be obtained using the dp_path() method:

> dp_path(employ)
[1] "employ.csv"

By default this will return the path as defined in the Data Package. This is either a path relative to the directory in which the Data Package is located or a URL. To open a file inside the Data Package one also needs the location of the Data Package. Using the full_path = TRUE argument, dp_path() will return the full path to the file:

> fn <- dp_path(employ, full_path = TRUE)

This path can be used to open the file manually:

> dta <- read.csv2(fn)
> head(dta)
         id        dob gender employ   income haspartner
1 368509515 1993-06-14      M      E €2 691,8          N
2 187844355 1961-10-08      X      U     <NA>          N
3 273040044 1982-06-17      F      E  €533,65          N
4 963831798 1965-02-15      M      E  €790,13          N
5 854856378 1990-01-30      F      E  €716,79          Y
6  20072760 1961-08-18      F      E €1 651,6          N

First, note that we had to ‘know’ that we had to use read.csv2 since the file uses the ‘;’ as field separator. Information like this is stored in the ‘dialect’ property of a data resource:

> dp_property(employ, "dialect")
$decimalChar
[1] ","

$delimiter
[1] ";"

$nullSequence
[1] "NA"
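As a sketch of how these dialect properties map onto base R, the three values can be passed to read.csv() as sep, dec and na.strings. The file below is a small hypothetical stand-in, not the employ.csv from the example package.

```r
# Dialect values copied from the output above
dialect <- list(decimalChar = ",", delimiter = ";", nullSequence = "NA")

# Small stand-in file (hypothetical contents) written in that dialect
fn_dialect <- tempfile(fileext = ".csv")
writeLines(c("id;score", "1;2,5", "2;NA", "3;10,25"), fn_dialect)

# Map the dialect properties onto the matching read.csv() arguments
dta_dialect <- read.csv(fn_dialect,
  sep        = dialect$delimiter,
  dec        = dialect$decimalChar,
  na.strings = dialect$nullSequence)
str(dta_dialect)
```

With the dialect applied, the score column is read as numeric instead of as character.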

Second, note that the field ‘income’ is not converted to numeric as this field contains euro symbols and uses a space as thousands separator. Information like this is stored in the field descriptor:

> dp_field(employ, "income")
Field Descriptor:
name       :"income"
title      :"Net income"
type       :"number"
bareNumber :FALSE
decimalChar:","
groupChar  :" "
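To make the role of these properties concrete, here is a minimal base R sketch of such a conversion. The helper parse_number() and its argument names are hypothetical and not part of the datapackage package; it assumes that bareNumber = FALSE means that leading and trailing non-numeric characters have to be stripped.

```r
# Hypothetical helper, not part of the datapackage package.
parse_number <- function(x, decimal_char = ".", group_char = "",
    bare_number = TRUE) {
  x <- as.character(x)
  # Remove the thousands separator, if one is declared
  if (nzchar(group_char)) x <- gsub(group_char, "", x, fixed = TRUE)
  # bare_number = FALSE: strip leading and trailing non-numeric characters
  if (!bare_number) {
    x <- sub("^[^0-9+-]+", "", x)
    x <- sub("[^0-9]+$", "", x)
  }
  # Normalise the decimal mark to "." before converting
  if (decimal_char != ".") x <- sub(decimal_char, ".", x, fixed = TRUE)
  as.numeric(x)
}

euro <- "\u20ac"  # the euro symbol, written as an escape for portability
income_raw <- c(paste0(euro, "2 691,8"), NA, paste0(euro, "533,65"),
  paste0(euro, "1 651,6"))
income <- parse_number(income_raw, decimal_char = ",", group_char = " ",
  bare_number = FALSE)
income
```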

dp_get_data() uses the information from the field descriptors and dialect to automatically convert variables as much as possible to their most fitting R types. This is done using the dp_apply_schema() function:

> dp_apply_schema(dta, employ)
          id        dob gender employ  income haspartner
1  368509515 1993-06-14      M      E 2691.80      FALSE
2  187844355 1961-10-08      X      U      NA      FALSE
3  273040044 1982-06-17      F      E  533.65      FALSE
4  963831798 1965-02-15      M      E  790.13      FALSE
5  854856378 1990-01-30      F      E  716.79       TRUE
6   20072760 1961-08-18      F      E 1651.60      FALSE
7  429782078 2019-09-23      M      N      NA      FALSE
8  711292034 1994-02-16      M      E  455.98      FALSE
9  949458305 2005-06-16      F      N      NA      FALSE
10 911459071 2007-02-27      F      N      NA      FALSE
11 921370403 1981-09-14      F      E 1461.36      FALSE
12  26901869 1981-03-08      F      E 1153.02       TRUE
13 640668848 1993-08-30      M      E  708.98       TRUE
14 996464509 1960-12-10      M      E 1088.99      FALSE
15  58820512 1962-10-13      M      E 2243.10      FALSE
16 288242988 2013-06-11      M      N      NA      FALSE
17 549758863 1990-06-09      M      E  719.69       TRUE
18 998045846 1973-01-25      F      E 1312.45      FALSE
19 902078272 1962-05-29      F      E  618.27      FALSE
20 594477489 1952-08-31      -      N      NA       TRUE

Finally, note that the path property of a Data Resource can be a vector of paths in case a single data set is stored in a set of files. It is then assumed that the files have the same format, so rbind should work on the data read from these files.
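The idea of combining a data set that is split over several files can be sketched in base R as follows. The file names and contents are hypothetical, and this is not necessarily how dp_get_data() handles multiple paths internally.

```r
# Create two hypothetical files with the same columns
dir_parts <- tempfile()
dir.create(dir_parts)
paths <- file.path(dir_parts, c("part1.csv", "part2.csv"))
write.csv(data.frame(id = 1:2, value = c(1.5, 2.5)), paths[1],
  row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(3.5, 4.5)), paths[2],
  row.names = FALSE)

# Read every file with the same reader and stack the results
parts <- lapply(paths, read.csv)
combined <- do.call(rbind, parts)
combined
```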

Below is an alternative way of importing the data belonging to a Data Resource. Here we use the pipe operator to chain the various commands to import the data set.

> dta <- dp_resource(dp, "employment") |> dp_get_data()
> head(dta)
         id        dob gender employ  income haspartner
1 368509515 1993-06-14      M      E 2691.80      FALSE
2 187844355 1961-10-08      X      U      NA      FALSE
3 273040044 1982-06-17      F      E  533.65      FALSE
4 963831798 1965-02-15      M      E  790.13      FALSE
5 854856378 1990-01-30      F      E  716.79       TRUE
6  20072760 1961-08-18      F      E 1651.60      FALSE

Reading properties from Data Packages and Data Resources

For many of the standard fields of a Data Package, methods are defined to obtain the values of these fields:

> dp_name(dp)
[1] "example"
> dp_description(dp)
[1] "This is an example data set to show how the datapackage package can be used to import data into R.\n\nIts main data resource is the `employ` resource which contains fictional data about individuals. The other data resources are supporting data sets."
> dp_description(dp, first_paragraph = TRUE)
[1] "This is an example data set to show how the datapackage package can be used to import data into R."
> dp_title(dp)
[1] "Example data set for the datapackage package"

The same holds for Data Resources:

> dp_title(employ)
[1] "Employment status"
> dp_resource(dp, "codelist-employ") |> dp_title()
[1] "Code list for employment status"
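Because the same method works on every Data Resource, the titles of all resources can be collected in one go. The sketch below only uses functions shown in this document; the requireNamespace() guard is an addition so that the snippet is skipped when the datapackage package is not installed.

```r
titles <- NULL
if (requireNamespace("datapackage", quietly = TRUE)) {
  library(datapackage, warn.conflicts = FALSE)
  dir <- system.file("examples/employ", package = "datapackage")
  dp <- open_datapackage(dir)
  # Look up each resource by name and ask for its title
  titles <- vapply(dp_resource_names(dp),
    function(name) dp_title(dp_resource(dp, name)),
    character(1))
}
titles
```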

For datapackage objects the following methods are currently defined (this list can be obtained using ?PropertiesDatapackage):

dp_contributors()
dp_created()
dp_description()
dp_id()
dp_keywords()
dp_name()
dp_title()

For dataresource objects the following methods are currently defined (this list can be obtained using ?PropertiesDataresource):

dp_bytes()
dp_encoding()
dp_description()
dp_format()
dp_hash()
dp_name()
dp_mediatype()
dp_path()
dp_schema()
dp_title()

The dp_path() method has a full_path argument that, when set to TRUE, returns the full path to the Data Resource's data and not just the path relative to the Data Package. The full path is needed when one wants to use the path to read the data.

> dp_path(employ)
[1] "employ.csv"
> dp_path(employ, full_path = TRUE)
[1] "/tmp/Rtmpo1cZ4i/Rinst33f69cf045/datapackage/examples/employ/employ.csv"
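Conceptually, the full path is the relative path combined with the location of the Data Package, while URLs are left untouched. A base R sketch of that idea, with a hypothetical helper name and made-up locations (not the package's own code):

```r
# Hypothetical helper: combine relative paths with the package directory,
# but leave URLs as they are.
resolve_path <- function(path, base_dir) {
  is_url <- grepl("^(https?|ftp)://", path)
  ifelse(is_url, path, file.path(base_dir, path))
}

resolved <- resolve_path(
  c("employ.csv", "https://example.org/data.csv"),
  base_dir = "/path/to/package")
resolved
```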

It is also possible to get other properties than the ones explicitly mentioned above using the dp_property() method:

> dp_property(employ, "encoding")
[1] "utf-8"

Working with categories

It is possible for fields to have a list of categories associated with them. Categories are usually stored inside the Field Descriptor. However, the datapackage package also supports lists of categories stored in a separate Data Resource (this is not part of the datapackage standard).
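What a list of categories does can be sketched in base R: each code in the data is looked up in a table of codes and labels. The codes and labels below are taken from the gender column shown earlier in this document; the column names code and label and the order of the rows are assumptions made for this illustration.

```r
# Assumed layout of a code list: one row per category
codelist_gender <- data.frame(
  code  = c("M", "F", "X", "-"),
  label = c("Male", "Female", "Other", "Unknown"),
  stringsAsFactors = FALSE
)

# A few codes as they appear in the data
gender <- c("M", "X", "F", "-")

# Use the codes as levels and the labels as their display values
gender_factor <- factor(gender,
  levels = codelist_gender$code,
  labels = codelist_gender$label)
gender_factor
```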

In the example resource, the fields ‘gender’ and ‘employ’ have categories associated with them:

> dta <- dp_resource(dp, "employment") |> dp_get_data()
> dta
          id        dob gender employ  income haspartner
1  368509515 1993-06-14      M      E 2691.80      FALSE
2  187844355 1961-10-08      X      U      NA      FALSE
3  273040044 1982-06-17      F      E  533.65      FALSE
4  963831798 1965-02-15      M      E  790.13      FALSE
5  854856378 1990-01-30      F      E  716.79       TRUE
6   20072760 1961-08-18      F      E 1651.60      FALSE
7  429782078 2019-09-23      M      N      NA      FALSE
8  711292034 1994-02-16      M      E  455.98      FALSE
9  949458305 2005-06-16      F      N      NA      FALSE
10 911459071 2007-02-27      F      N      NA      FALSE
11 921370403 1981-09-14      F      E 1461.36      FALSE
12  26901869 1981-03-08      F      E 1153.02       TRUE
13 640668848 1993-08-30      M      E  708.98       TRUE
14 996464509 1960-12-10      M      E 1088.99      FALSE
15  58820512 1962-10-13      M      E 2243.10      FALSE
16 288242988 2013-06-11      M      N      NA      FALSE
17 549758863 1990-06-09      M      E  719.69       TRUE
18 998045846 1973-01-25      F      E 1312.45      FALSE
19 902078272 1962-05-29      F      E  618.27      FALSE
20 594477489 1952-08-31      -      N      NA       TRUE

This is a string column but it has a ‘categories’ property set which points to a Data Resource in the Data Package. It is possible to get this list of

Creating a Data Package

This is shown in a separate vignette, Creating a Data Package.

Quickly saving to and reading from a Data Package

A quick way to create a Data Package from a given dataset is with the dp_save_as_datapackage() function:

> dir <- tempfile()
> data(iris)
> dp_save_as_datapackage(iris, dir)

And for reading:

> dp_load_from_datapackage(dir) |> head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2       1
2          4.9         3.0          1.4         0.2       1
3          4.7         3.2          1.3         0.2       1
4          4.6         3.1          1.5         0.2       1
5          5.0         3.6          1.4         0.2       1
6          5.4         3.9          1.7         0.4       1

This will either load the Data Resource with the same name as the Data Package or the first resource in the Data Package. It is also possible to specify the name of the Data Resource that should be read. Additional arguments are passed on to dp_get_data():

> dp_load_from_datapackage(dir, "iris", convert_categories = "to_factor", 
+   use_fread = TRUE)
Loading required namespace: data.table
 
     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
            <num>       <num>        <num>       <num>    <fctr>
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---                                                            
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica