--- title: "Create Datasets that are Easy to Share Exchange and Extend" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Create Datasets that are Easy to Share Exchange and Extend} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(dataset) ``` The aim of the `dataset_df` is to create semantically richer datasets. ```{r createdataasetdf} small_country_dataset <- dataset_df( country_name = defined( c("AD", "LI"), label = "Country name", definition = "http://data.europa.eu/bna/c_6c2bb82d", namespace = "https://www.geonames.org/countries/$1/"), gdp = defined( c(3897, 7365), label = "Gross Domestic Product", unit = "million dollars", definition = "http://data.europa.eu/83i/aa/GDP"), dataset_bibentry = dublincore( title = "Small Country Dataset", creator = person("Jane", "Doe"), publisher = "Example Inc.") ) ``` The `defined` class columns retain the important semantic information about the label, exact definition link, and the unit of measure: ```{r varlabel} var_label(small_country_dataset$gdp) ``` ```{r varunit} var_unit(small_country_dataset$gdp) ``` ```{r vardefinition} var_definition(small_country_dataset$gdp) ``` And the `dataset_df` object has a the important semantic information about the dataset as a whole, with pre-filled default values for not yes assigned or announced information. The metadata of the entire dataset conforms to the two most universally used definitions, Dublin Core (DCTERMS) and DataCite (DataCite). Each standard Dublinc Core/DataCite metadata field has a simple interface function, for example, you can add the language of the dataset with the following assignment command: ```{r language} language(small_country_dataset) <- "en" ``` And to review the descriptive data of the entire dataset as a whole: ```{r bibentry} print(get_bibentry(small_country_dataset), "bibtex" ) ``` ## Wikidata The [wbdataset package](https://wbdataset.dataobservatory.eu/) is an extension of the dataset, which in turn is an R package that helps to exchange, publish and combine datasets more easily by improving their semantics. The _wbdataset_ extends the usability of dataset by connecting the Wikibase API with the R statistical environment. It is in an early, exploratory stage, and it is not yet on CRAN. ```{r install-wbdataset, eval=FALSE} devtools::install_github("dataobservatory-eu/wbdataset") ``` ``` # Select the following country profiles from Wikidata: wikidata_countries <- c( "http://www.wikidata.org/entity/Q756617", "http://www.wikidata.org/entity/Q347", "http://www.wikidata.org/entity/Q3908", "http://www.wikidata.org/entity/Q1246") # Retrieve their labels into a dataset called 'European countries': wikidata_countries_df <- get_item(qid = wikidata_countries, language ="en", title = "European countries", creator = person("Daniel", "Antal")) # Add properties from Wikidata as attributes and variables of this dataset: ds <- wikidata_countries_df %>% left_join_column( label = "ISO 3166-1 alpha-2 code", property = "P297" ) %>% left_join_column( property = "P1566", label = "Geonames ID", namespace = "https://www.geonames.org/") ``` ## Further data platforms We are planning an extension of the _dataset_ package to conform the datacube definition used by statistical data repositories and exchanges, which will require a management of the multi-dimensional datasets and the data structure definition of the Statistical Data and Metadata eXchange. This feature was part of the early release of dataset, but as reviewers pointed out, they were out of focus from the core functionality. ## Serialise to RDF Our package is aiming to work seamlessly with the rOpenSci [rdflib](https://github.com/ropensci/rdflib) package. For this purpose, we are working on a simple [turtle](https://www.w3.org/TR/turtle/) package that writes the [bibentry](https://dataset.dataobservatory.eu/articles/bibentry.html) about the dataset, and the dataset contents into an RDF-annotated text file, which can be then used or stored with _rdflib_. The tuRtle package is being co-developed with _dataset_, and will serialise the data and metadata of the `dataset_df` objects. ```{r tuRtle-installation, eval=FALSE} remotes::install_github("dataobservatory-eu/tuRtle", build = FALSE) ``` ```{r createdf, eval=FALSE} library(tuRtle) tdf <- data.frame (s = c("eg:01","eg:02", "eg:01", "eg:02", "eg:01" ), p = c("a", "a", "eg-var:", "eg-var:", "rdfs:label"), o = c("qb:Observation", "qb:Observation", "\"1\"^^", "\"2\"^^", '"Example observation"') ) knitr::kable(tdf) ``` |s |p |o | |:-----|:----------|:---------------------| |eg:01 |a |qb:Observation | |eg:02 |a |qb:Observation | |eg:01 |eg-var: |"1"^^ | |eg:02 |eg-var: |"2"^^ | |eg:01 |rdfs:label |"Example observation" | ```{r writettl, eval=FALSE} example_file<- file.path(tempdir(), "example_ttl.ttl") ttl_write(tdf=tdf, ttl_namespace=NULL, file_path=example_file) readLines(example_file) ```