A Data Package is a collection of files that consists of both data,
which can be any type of information such as images and CSV files, and
metadata. These files are usually stored in one directory (possibly with
subdirectories), although links to external data are possible. Metadata
is data about data: it consists of the information needed by software to
use the data and the information needed by users of the data, such as
descriptions, names of authors and licences. The metadata is stored in a
file in the directory that is usually called datapackage.json. The
information in this file is what will be called the Data Package below.
As mentioned, it contains both information on the data package itself
(title, description) and information on a number of Data Resources. The
Data Resources describe the data files in the data package. They contain
information such as a title and description, as well as the information
needed by software to use the data, such as the path to the data (the
location of the data) and technical information such as how the data is
stored. This information makes it easier to use the data. Below we show
how the information in a Data Package can be used to easily read in and
work with the data, and how to create a Data Package for our own data.
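To make this structure concrete, the sketch below mirrors the layout of a datapackage.json file as a nested R list. The values are taken from the example Data Package used in this document. The list itself is only an illustration built by hand; it is not an object produced by the datapackage package, and a real file may contain more properties.

```r
# Hand-built illustration of the metadata layout: package-level properties
# plus a list of Data Resources, each with its own descriptive and
# technical properties.
descriptor <- list(
  name = "example",
  title = "Example data set for the datapackage package",
  resources = list(
    list(
      name = "employment",
      title = "Employment status",
      path = "employ.csv",   # location of the data, relative to the package
      format = "csv",
      encoding = "utf-8"
    )
  )
)

# Show the nesting of package and resource properties
str(descriptor)
```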
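Below is an overview of some of the terminology associated with Data Packages.
Data Package: has properties such as title, name and description.
Data Resource: the data is either stored in the data property or is external data pointed to by a path property. Has properties such as title, name, encoding, …
Tabular data: data organised in rows and columns (a data.frame in R).
Field Descriptor: describes a field of the data. Has properties such as name and type.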
open_datapackage() reads the metadata from the datapackage.json file.
From the output below you can see that the data package has three data
resources.
> library(datapackage, warn.conflicts = FALSE)
> dir <- system.file("examples/employ", package = "datapackage")
> dp <- open_datapackage(dir)
> dp
[example] Example data set for the datapackage package
This is an example data set to show how the datapackage package can be used to import data into R.
...
Location: </tmp/Rtmpo1cZ4i/Rinst33f69cf045/datapackage/examples/employ>
Resources:
[employment] Employment status
[codelist-gender] Code list for gender of person
[codelist-employ] Code list for employment status
To read the data belonging to one of the data resources:
> dta <- dp |> dp_resource("employment") |> dp_get_data()
> dta
id dob gender employ income haspartner
1 368509515 1993-06-14 M E 2691.80 FALSE
2 187844355 1961-10-08 X U NA FALSE
3 273040044 1982-06-17 F E 533.65 FALSE
4 963831798 1965-02-15 M E 790.13 FALSE
5 854856378 1990-01-30 F E 716.79 TRUE
6 20072760 1961-08-18 F E 1651.60 FALSE
7 429782078 2019-09-23 M N NA FALSE
8 711292034 1994-02-16 M E 455.98 FALSE
9 949458305 2005-06-16 F N NA FALSE
10 911459071 2007-02-27 F N NA FALSE
11 921370403 1981-09-14 F E 1461.36 FALSE
12 26901869 1981-03-08 F E 1153.02 TRUE
13 640668848 1993-08-30 M E 708.98 TRUE
14 996464509 1960-12-10 M E 1088.99 FALSE
15 58820512 1962-10-13 M E 2243.10 FALSE
16 288242988 2013-06-11 M N NA FALSE
17 549758863 1990-06-09 M E 719.69 TRUE
18 998045846 1973-01-25 F E 1312.45 FALSE
19 902078272 1962-05-29 F E 618.27 FALSE
20 594477489 1952-08-31 - N NA TRUE
When the name of the data resource is known, the data can also be read directly from the data package without explicitly opening the Data Package:
> dta <- dp_load_from_datapackage(dir, "employment")
> dta
id dob gender employ income haspartner
1 368509515 1993-06-14 M E 2691.80 FALSE
2 187844355 1961-10-08 X U NA FALSE
3 273040044 1982-06-17 F E 533.65 FALSE
4 963831798 1965-02-15 M E 790.13 FALSE
5 854856378 1990-01-30 F E 716.79 TRUE
6 20072760 1961-08-18 F E 1651.60 FALSE
7 429782078 2019-09-23 M N NA FALSE
8 711292034 1994-02-16 M E 455.98 FALSE
9 949458305 2005-06-16 F N NA FALSE
10 911459071 2007-02-27 F N NA FALSE
11 921370403 1981-09-14 F E 1461.36 FALSE
12 26901869 1981-03-08 F E 1153.02 TRUE
13 640668848 1993-08-30 M E 708.98 TRUE
14 996464509 1960-12-10 M E 1088.99 FALSE
15 58820512 1962-10-13 M E 2243.10 FALSE
16 288242988 2013-06-11 M N NA FALSE
17 549758863 1990-06-09 M E 719.69 TRUE
18 998045846 1973-01-25 F E 1312.45 FALSE
19 902078272 1962-05-29 F E 618.27 FALSE
20 594477489 1952-08-31 - N NA TRUE
With the convert_categories argument, categorical variables can be
converted to factor:
> dta <- dp_load_from_datapackage(dir, "employment",
+ convert_categories = "to_factor")
> dta
id dob gender employ income haspartner
1 368509515 1993-06-14 Male Employed 2691.80 FALSE
2 187844355 1961-10-08 Other Unemployed NA FALSE
3 273040044 1982-06-17 Female Employed 533.65 FALSE
4 963831798 1965-02-15 Male Employed 790.13 FALSE
5 854856378 1990-01-30 Female Employed 716.79 TRUE
6 20072760 1961-08-18 Female Employed 1651.60 FALSE
7 429782078 2019-09-23 Male Non-working-population NA FALSE
8 711292034 1994-02-16 Male Employed 455.98 FALSE
9 949458305 2005-06-16 Female Non-working-population NA FALSE
10 911459071 2007-02-27 Female Non-working-population NA FALSE
11 921370403 1981-09-14 Female Employed 1461.36 FALSE
12 26901869 1981-03-08 Female Employed 1153.02 TRUE
13 640668848 1993-08-30 Male Employed 708.98 TRUE
14 996464509 1960-12-10 Male Employed 1088.99 FALSE
15 58820512 1962-10-13 Male Employed 2243.10 FALSE
16 288242988 2013-06-11 Male Non-working-population NA FALSE
17 549758863 1990-06-09 Male Employed 719.69 TRUE
18 998045846 1973-01-25 Female Employed 1312.45 FALSE
19 902078272 1962-05-29 Female Employed 618.27 FALSE
20 594477489 1952-08-31 Unknown Non-working-population NA TRUE
Or, they can be converted to the code class from the codelist package.
This will preserve both the codes and the labels:
> library(codelist)
> dta <- dp_load_from_datapackage(dir, "employment",
+ convert_categories = "to_code")
> dta
id dob gender employ income haspartner
1 368509515 1993-06-14 M[Male] E[Employed] 2691.80 FALSE
2 187844355 1961-10-08 X[Other] U[Unemplo…] NA FALSE
3 273040044 1982-06-17 F[Female] E[Employed] 533.65 FALSE
4 963831798 1965-02-15 M[Male] E[Employed] 790.13 FALSE
5 854856378 1990-01-30 F[Female] E[Employed] 716.79 TRUE
6 20072760 1961-08-18 F[Female] E[Employed] 1651.60 FALSE
7 429782078 2019-09-23 M[Male] N[Non-wor…] NA FALSE
8 711292034 1994-02-16 M[Male] E[Employed] 455.98 FALSE
9 949458305 2005-06-16 F[Female] N[Non-wor…] NA FALSE
10 911459071 2007-02-27 F[Female] N[Non-wor…] NA FALSE
11 921370403 1981-09-14 F[Female] E[Employed] 1461.36 FALSE
12 26901869 1981-03-08 F[Female] E[Employed] 1153.02 TRUE
13 640668848 1993-08-30 M[Male] E[Employed] 708.98 TRUE
14 996464509 1960-12-10 M[Male] E[Employed] 1088.99 FALSE
15 58820512 1962-10-13 M[Male] E[Employed] 2243.10 FALSE
16 288242988 2013-06-11 M[Male] N[Non-wor…] NA FALSE
17 549758863 1990-06-09 M[Male] E[Employed] 719.69 TRUE
18 998045846 1973-01-25 F[Female] E[Employed] 1312.45 FALSE
19 902078272 1962-05-29 F[Female] E[Employed] 618.27 FALSE
20 594477489 1952-08-31 -[Unknown] N[Non-wor…] NA TRUE
When the data resource name is omitted from dp_load_from_datapackage(),
either the data resource with the same name as the data package or the
first data resource is opened.
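The selection rule described above can be sketched in base R. The helper default_resource() is hypothetical and only restates the rule from the text; it is not the implementation used by the datapackage package.

```r
# Hypothetical helper: pick the resource named after the package if there
# is one, otherwise fall back to the first resource.
default_resource <- function(package_name, resource_names) {
  if (package_name %in% resource_names) {
    package_name
  } else {
    resource_names[1]
  }
}

# The example package is called "example" and has no resource with that
# name, so the first resource is selected.
first_choice <- default_resource(
  "example", c("employment", "codelist-gender", "codelist-employ"))

# A package that does contain a resource with its own name
same_name <- default_resource("iris", c("codes", "iris"))
```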
Below we open an example Data Package that comes with the package:
> library(datapackage, warn.conflicts = FALSE)
> dir <- system.file("examples/employ", package = "datapackage")
> dp <- open_datapackage(dir)
> dp
[example] Example data set for the datapackage package
This is an example data set to show how the datapackage package can be used to import data into R.
...
Location: </tmp/Rtmpo1cZ4i/Rinst33f69cf045/datapackage/examples/employ>
Resources:
[employment] Employment status
[codelist-gender] Code list for gender of person
[codelist-employ] Code list for employment status
The print statement shows the name of the package, example, the title,
the first paragraph of the description, the location of the Data Package
and the Data Resources in the package. In this case there are three Data
Resources, named employment, codelist-gender and codelist-employ.
Using the dp_resource() method on the Data Package, one can obtain a
Data Resource:
> employ <- dp_resource(dp, "employment")
> employ
[employment] Employment status
Employment status, income and background properties of sample of persons.
...
Selected properties:
path :"employ.csv"
format :"csv"
mediatype:"text/csv"
encoding :"utf-8"
dialect :List of 3
schema :Table Schema [6] "id" "dob" "gender" "employ" "income" "haspartner"
The print statement again shows the name, title and description. It also
shows that the data is in a CSV file named employ.csv. By default, print
shows only a few properties of the Data Resource. To show all
properties:
> print(employ, properties = NA)
[employment] Employment status
Employment status, income and background properties of sample of persons.
...
Selected properties:
path :"employ.csv"
format :"csv"
mediatype:"text/csv"
encoding :"utf-8"
dialect :List of 3
schema :Table Schema [6] "id" "dob" "gender" "employ" "income" "haspartner"
Using this information it should be possible to open the dataset. The
data can be opened in R using the dp_get_data() method. Based on the
information in the Data Resource, this function will try to open the
dataset using the correct functions in R (in this case read.csv()):
> dta <- dp_get_data(employ)
> head(dta)
id dob gender employ income haspartner
1 368509515 1993-06-14 M E 2691.80 FALSE
2 187844355 1961-10-08 X U NA FALSE
3 273040044 1982-06-17 F E 533.65 FALSE
4 963831798 1965-02-15 M E 790.13 FALSE
5 854856378 1990-01-30 F E 716.79 TRUE
6 20072760 1961-08-18 F E 1651.60 FALSE
It is also possible to import the data directly from the Data Package object by specifying the resource for which the data needs to be imported.
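A minimal sketch of this direct import is given below. It assumes that dp_get_data() accepts the Data Package object as its first argument and the resource name as its second; that calling convention is an assumption based on the sentence above and should be checked against the package documentation.

```r
library(datapackage, warn.conflicts = FALSE)

# Open the example Data Package that ships with the package
dir <- system.file("examples/employ", package = "datapackage")
dp <- open_datapackage(dir)

# Assumption: the resource name is passed as the second argument
dta <- dp_get_data(dp, "employment")
head(dta)
```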
The dp_get_data() method only supports a limited set of data formats. It
is also possible to provide a custom function to read the data using the
reader argument of dp_get_data(). However, it is also possible to import
the data ‘manually’ using the information in the Data Package. The path
of the file in a Data Resource can be obtained using the dp_path()
method.
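For the employment resource of the example Data Package, the call could look as follows. The block reopens the package so that it can be run on its own.

```r
library(datapackage, warn.conflicts = FALSE)

# Open the example Data Package and select the employment resource
dir <- system.file("examples/employ", package = "datapackage")
employ <- dp_resource(open_datapackage(dir), "employment")

# Path as stored in the Data Package (relative to the package directory)
rel_path <- dp_path(employ)
rel_path
```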
By default dp_path() returns the path as defined in the Data Package.
This is either a path relative to the directory in which the Data
Package is located or a URL. To open a file inside the Data Package one
also needs the location of the Data Package. With the full_path = TRUE
argument, dp_path() returns the full path to the file.
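A short sketch of obtaining the full path and storing it in a variable fn, the name used when the file is read manually:

```r
library(datapackage, warn.conflicts = FALSE)

# Open the example Data Package and select the employment resource
dir <- system.file("examples/employ", package = "datapackage")
employ <- dp_resource(open_datapackage(dir), "employment")

# Full path to the data file, usable by any function that reads files
fn <- dp_path(employ, full_path = TRUE)
file.exists(fn)
```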
This path can be used to open the file manually:
> dta <- read.csv2(fn)
> head(dta)
id dob gender employ income haspartner
1 368509515 1993-06-14 M E €2 691,8 N
2 187844355 1961-10-08 X U <NA> N
3 273040044 1982-06-17 F E €533,65 N
4 963831798 1965-02-15 M E €790,13 N
5 854856378 1990-01-30 F E €716,79 Y
6 20072760 1961-08-18 F E €1 651,6 N
First, note that we had to ‘know’ that we had to use read.csv2, since
the file uses ‘;’ as the field separator. Information like this is stored
in the ‘dialect’ property of a data resource.
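One way to inspect the dialect is sketched below. It assumes that dp_property() takes the Data Resource and the property name as its two arguments; the names of the entries inside the dialect are not assumed here and are simply printed.

```r
library(datapackage, warn.conflicts = FALSE)

# Open the example Data Package and select the employment resource
dir <- system.file("examples/employ", package = "datapackage")
employ <- dp_resource(open_datapackage(dir), "employment")

# Assumption: dp_property(x, name) returns the raw property value
dialect <- dp_property(employ, "dialect")

# Show the entries of the dialect, such as the field separator
str(dialect)
```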
Second, note that the field ‘income’ is not converted to numeric, as this field contains euro symbols and uses a space as thousands separator. Information like this is stored in the field descriptor:
> dp_field(employ, "income")
Field Descriptor:
name :"income"
title :"Net income"
type :"number"
bareNumber :FALSE
decimalChar:","
groupChar :" "
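To illustrate what the decimalChar and groupChar properties mean, the base R sketch below converts such strings by hand. The helper parse_number() is hypothetical and written only for this illustration; it is not a function of the datapackage package, and the sample values are copied from the output of read.csv2() above.

```r
# Hypothetical helper: convert formatted number strings to numeric using
# the decimal and grouping characters from a field descriptor.
parse_number <- function(x, decimal_char = ",", group_char = " ") {
  x <- gsub(group_char, "", x, fixed = TRUE)    # remove thousands separator
  x <- sub(decimal_char, ".", x, fixed = TRUE)  # use "." as decimal mark
  x <- gsub("[^0-9.+-]", "", x)                 # drop other text such as the euro sign
  as.numeric(x)
}

# "\u20ac" is the euro sign; values as they appear in the raw file
raw <- c("\u20ac2 691,8", NA, "\u20ac533,65")
income <- parse_number(raw)
income
```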
dp_get_data() uses the information from the field descriptors and
dialect to automatically convert variables as much as possible to their
most fitting R types. This is done using the dp_apply_schema() function:
> dp_apply_schema(dta, employ)
id dob gender employ income haspartner
1 368509515 1993-06-14 M E 2691.80 FALSE
2 187844355 1961-10-08 X U NA FALSE
3 273040044 1982-06-17 F E 533.65 FALSE
4 963831798 1965-02-15 M E 790.13 FALSE
5 854856378 1990-01-30 F E 716.79 TRUE
6 20072760 1961-08-18 F E 1651.60 FALSE
7 429782078 2019-09-23 M N NA FALSE
8 711292034 1994-02-16 M E 455.98 FALSE
9 949458305 2005-06-16 F N NA FALSE
10 911459071 2007-02-27 F N NA FALSE
11 921370403 1981-09-14 F E 1461.36 FALSE
12 26901869 1981-03-08 F E 1153.02 TRUE
13 640668848 1993-08-30 M E 708.98 TRUE
14 996464509 1960-12-10 M E 1088.99 FALSE
15 58820512 1962-10-13 M E 2243.10 FALSE
16 288242988 2013-06-11 M N NA FALSE
17 549758863 1990-06-09 M E 719.69 TRUE
18 998045846 1973-01-25 F E 1312.45 FALSE
19 902078272 1962-05-29 F E 618.27 FALSE
20 594477489 1952-08-31 - N NA TRUE
Finally, note that the path property of a Data Resource can be a vector
of paths in case a single data set is stored in a set of files. It is
then assumed that the files have the same format. Therefore, rbind
should work on these files.
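A base R sketch of what this means when reading such a resource by hand is shown below. The two temporary files and their contents are made up for the illustration and stand in for a Data Resource with two paths.

```r
# Two temporary CSV files with the same columns (illustrative data)
paths <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
write.csv(data.frame(id = 1:2, income = c(10.5, 20)), paths[1], row.names = FALSE)
write.csv(data.frame(id = 3:4, income = c(30, 40.25)), paths[2], row.names = FALSE)

# Read every file with the same reader and stack the results
parts <- lapply(paths, read.csv)
dta_all <- do.call(rbind, parts)
dta_all
```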
Below is an alternative way of importing the data belonging to a Data Resource. Here we use the pipe operator to chain the various commands to import the data set.
> dta <- dp_resource(dp, "employment") |> dp_get_data()
> head(dta)
id dob gender employ income haspartner
1 368509515 1993-06-14 M E 2691.80 FALSE
2 187844355 1961-10-08 X U NA FALSE
3 273040044 1982-06-17 F E 533.65 FALSE
4 963831798 1965-02-15 M E 790.13 FALSE
5 854856378 1990-01-30 F E 716.79 TRUE
6 20072760 1961-08-18 F E 1651.60 FALSE
For many of the standard fields of a Data Package, methods are defined to obtain the values of these fields:
> dp_name(dp)
[1] "example"
> dp_description(dp)
[1] "This is an example data set to show how the datapackage package can be used to import data into R.\n\nIts main data resource is the `employ` resource which contains fictional data about individuals. The other data resources are supporting data sets."
> dp_description(dp, first_paragraph = TRUE)
[1] "This is an example data set to show how the datapackage package can be used to import data into R."
> dp_title(dp)
[1] "Example data set for the datapackage package"
The same holds for Data Resources:
> dp_title(employ)
[1] "Employment status"
> dp_resource(dp, "codelist-employ") |> dp_title()
[1] "Code list for employment status"
For datapackage objects the following methods are currently defined
(this list can be obtained using ?PropertiesDatapackage):
dp_contributors()
dp_created()
dp_description()
dp_id()
dp_keywords()
dp_name()
dp_title()
For dataresource objects the following methods are currently defined
(this list can be obtained using ?PropertiesDataresource):
dp_bytes()
dp_encoding()
dp_description()
dp_format()
dp_hash()
dp_name()
dp_mediatype()
dp_path()
dp_schema()
dp_title()
The dp_path() method has a full_path argument that, when set to TRUE,
returns the full path to the Data Resource's data and not just the path
relative to the Data Package. The full path is needed when one wants to
use the path to read the data.
> dp_path(employ)
[1] "employ.csv"
> dp_path(employ, full_path = TRUE)
[1] "/tmp/Rtmpo1cZ4i/Rinst33f69cf045/datapackage/examples/employ/employ.csv"
It is also possible to get properties other than the ones explicitly
mentioned above using the dp_property() method.
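As a sketch, the mediatype of the employment resource could be requested by name. The two-argument form dp_property(x, name) is an assumption; the property used here is one that is shown in the printed Data Resource above.

```r
library(datapackage, warn.conflicts = FALSE)

# Open the example Data Package and select the employment resource
dir <- system.file("examples/employ", package = "datapackage")
employ <- dp_resource(open_datapackage(dir), "employment")

# Assumption: the property is looked up by its name in the metadata
mediatype <- dp_property(employ, "mediatype")
mediatype
```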
It is possible for fields to have a list of categories associated with
them. Categories are usually stored inside the Field Descriptor.
However, the datapackage package also supports lists of categories
stored in a separate Data Resource (this is not part of the datapackage
standard).
In the example resource, the fields ‘gender’ and ‘employ’ have categories associated with them:
> dta <- dp_resource(dp, "employment") |> dp_get_data()
> dta
id dob gender employ income haspartner
1 368509515 1993-06-14 M E 2691.80 FALSE
2 187844355 1961-10-08 X U NA FALSE
3 273040044 1982-06-17 F E 533.65 FALSE
4 963831798 1965-02-15 M E 790.13 FALSE
5 854856378 1990-01-30 F E 716.79 TRUE
6 20072760 1961-08-18 F E 1651.60 FALSE
7 429782078 2019-09-23 M N NA FALSE
8 711292034 1994-02-16 M E 455.98 FALSE
9 949458305 2005-06-16 F N NA FALSE
10 911459071 2007-02-27 F N NA FALSE
11 921370403 1981-09-14 F E 1461.36 FALSE
12 26901869 1981-03-08 F E 1153.02 TRUE
13 640668848 1993-08-30 M E 708.98 TRUE
14 996464509 1960-12-10 M E 1088.99 FALSE
15 58820512 1962-10-13 M E 2243.10 FALSE
16 288242988 2013-06-11 M N NA FALSE
17 549758863 1990-06-09 M E 719.69 TRUE
18 998045846 1973-01-25 F E 1312.45 FALSE
19 902078272 1962-05-29 F E 618.27 FALSE
20 594477489 1952-08-31 - N NA TRUE
The ‘employ’ field is a string column, but it has a ‘categories’ property set, which points to a Data Resource in the Data Package. It is possible to get this list of categories:
> dp_categorieslist(dta$employ)
code label missing
1 E Employed FALSE
2 U Unemployed FALSE
3 N Non-working-population FALSE
4 X Unknown TRUE
This list of categories can also be used to convert the field to factor:
> dp_to_factor(dta$employ)
[1] Employed Unemployed Employed
[4] Employed Employed Employed
[7] Non-working-population Employed Non-working-population
[10] Non-working-population Employed Employed
[13] Employed Employed Employed
[16] Non-working-population Employed Employed
[19] Employed Non-working-population
attr(,"fielddescriptor")
Field Descriptor:
name :"employ"
title :"Employment status"
type :"string"
categories:List of 1
Levels: Employed Unemployed Non-working-population Unknown
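The idea behind this conversion can be sketched in base R: the codes become the factor levels and the labels from the list of categories become the level labels. The category table below is copied from the output of dp_categorieslist() above; the short vector of codes is made up for the illustration, and this is not the package's own implementation.

```r
# List of categories as printed above (codes and their labels)
categories <- data.frame(
  code = c("E", "U", "N", "X"),
  label = c("Employed", "Unemployed", "Non-working-population", "Unknown"),
  stringsAsFactors = FALSE
)

# A few raw codes as they are stored in the data file
employ_codes <- c("E", "U", "E", "N")

# Map codes to labels; the order of the levels follows the category list
employ_factor <- factor(employ_codes,
  levels = categories$code, labels = categories$label)
employ_factor
```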
Using the convert_categories = "to_factor" argument of
dp_apply_schema() (which is called by dp_get_data()), it is also
possible to convert all fields that have an associated ‘categories’
property to factor:
> dta <- dp_resource(dp, "employment") |>
+ dp_get_data(convert_categories = "to_factor")
> dta
id dob gender employ income haspartner
1 368509515 1993-06-14 Male Employed 2691.80 FALSE
2 187844355 1961-10-08 Other Unemployed NA FALSE
3 273040044 1982-06-17 Female Employed 533.65 FALSE
4 963831798 1965-02-15 Male Employed 790.13 FALSE
5 854856378 1990-01-30 Female Employed 716.79 TRUE
6 20072760 1961-08-18 Female Employed 1651.60 FALSE
7 429782078 2019-09-23 Male Non-working-population NA FALSE
8 711292034 1994-02-16 Male Employed 455.98 FALSE
9 949458305 2005-06-16 Female Non-working-population NA FALSE
10 911459071 2007-02-27 Female Non-working-population NA FALSE
11 921370403 1981-09-14 Female Employed 1461.36 FALSE
12 26901869 1981-03-08 Female Employed 1153.02 TRUE
13 640668848 1993-08-30 Male Employed 708.98 TRUE
14 996464509 1960-12-10 Male Employed 1088.99 FALSE
15 58820512 1962-10-13 Male Employed 2243.10 FALSE
16 288242988 2013-06-11 Male Non-working-population NA FALSE
17 549758863 1990-06-09 Male Employed 719.69 TRUE
18 998045846 1973-01-25 Female Employed 1312.45 FALSE
19 902078272 1962-05-29 Female Employed 618.27 FALSE
20 594477489 1952-08-31 Unknown Non-working-population NA TRUE
When the codelist package is installed, it is also possible to convert
the column to a code vector:
> dta <- dp_resource(dp, "employment") |>
+ dp_get_data(convert_categories = "to_code")
> dta
id dob gender employ income haspartner
1 368509515 1993-06-14 M[Male] E[Employed] 2691.80 FALSE
2 187844355 1961-10-08 X[Other] U[Unemplo…] NA FALSE
3 273040044 1982-06-17 F[Female] E[Employed] 533.65 FALSE
4 963831798 1965-02-15 M[Male] E[Employed] 790.13 FALSE
5 854856378 1990-01-30 F[Female] E[Employed] 716.79 TRUE
6 20072760 1961-08-18 F[Female] E[Employed] 1651.60 FALSE
7 429782078 2019-09-23 M[Male] N[Non-wor…] NA FALSE
8 711292034 1994-02-16 M[Male] E[Employed] 455.98 FALSE
9 949458305 2005-06-16 F[Female] N[Non-wor…] NA FALSE
10 911459071 2007-02-27 F[Female] N[Non-wor…] NA FALSE
11 921370403 1981-09-14 F[Female] E[Employed] 1461.36 FALSE
12 26901869 1981-03-08 F[Female] E[Employed] 1153.02 TRUE
13 640668848 1993-08-30 M[Male] E[Employed] 708.98 TRUE
14 996464509 1960-12-10 M[Male] E[Employed] 1088.99 FALSE
15 58820512 1962-10-13 M[Male] E[Employed] 2243.10 FALSE
16 288242988 2013-06-11 M[Male] N[Non-wor…] NA FALSE
17 549758863 1990-06-09 M[Male] E[Employed] 719.69 TRUE
18 998045846 1973-01-25 F[Female] E[Employed] 1312.45 FALSE
19 902078272 1962-05-29 F[Female] E[Employed] 618.27 FALSE
20 594477489 1952-08-31 -[Unknown] N[Non-wor…] NA TRUE
This has the advantage that both the values/codes and the labels are kept together, and both can be used when coding, which can make code safer and more readable.
How to create a Data Package is shown in a separate vignette, Creating a Data Package.
A quick way to create a Data Package from a given dataset is with the
dp_save_as_datapackage() function.
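A minimal sketch of saving the built-in iris data set is given below. It assumes that dp_save_as_datapackage() takes the data set as its first argument and the target directory as its second; the directory is a temporary location chosen for the example and is stored in dir so that it can be used for reading afterwards.

```r
library(datapackage, warn.conflicts = FALSE)

# Temporary directory that will hold the new Data Package
dir <- tempfile()

# Assumption: arguments are the data set and the target directory
dp_save_as_datapackage(iris, dir)

# The directory should now contain the metadata file and the data
list.files(dir)
```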
And for reading:
> dp_load_from_datapackage(dir) |> head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 1
2 4.9 3.0 1.4 0.2 1
3 4.7 3.2 1.3 0.2 1
4 4.6 3.1 1.5 0.2 1
5 5.0 3.6 1.4 0.2 1
6 5.4 3.9 1.7 0.4 1
This will either load the Data Resource with the same name as the
Data Package or the first resource in the Data Package. It is also
possible to specify the name of the Data Resource that should be read.
Additional arguments are passed on to dp_get_data():
> dp_load_from_datapackage(dir, "iris", convert_categories = "to_factor",
+ use_fread = TRUE)
Loading required namespace: data.table
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<num> <num> <num> <num> <fctr>
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
---
146: 6.7 3.0 5.2 2.3 virginica
147: 6.3 2.5 5.0 1.9 virginica
148: 6.5 3.0 5.2 2.0 virginica
149: 6.2 3.4 5.4 2.3 virginica
150: 5.9 3.0 5.1 1.8 virginica