Getting started with dkanr

Introduction to dkanr

The dkanr package is an R client for the DKAN REST API. dkanr implements all the methods available via the DKAN REST API and the DKAN datastore API. Additionally, it provides a few wrapper functions that make it easier to interact with the DKAN API from R.

This brief introduction shows how to download data from a specific dataset. Along the way, we will see how to:

  1. Set up a connection
  2. Locate a dataset
  3. Access its metadata
  4. Download the attached data

Step 1: Setting Up Your Connection

Connection without authentication

To set up a connection without authentication, you only need the site URL.

library(purrr)
library(dkanr)
library(dplyr)
dkanr_setup(url = 'https://data.louisvilleky.gov')

Authenticated connection

If authentication is required, you will also need to provide a valid username and password.

dkanr_setup(url = 'http://demo.getdkan.com',
            username = 'my_username',
            password = 'my_password')
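To keep credentials out of scripts, one option is to read them from environment variables. The sketch below assumes two hypothetical variable names, DKAN_USERNAME and DKAN_PASSWORD (they are chosen for this example and are not defined by dkanr), and only calls dkanr_setup() when both are set.

```r
# Read credentials from environment variables.
# DKAN_USERNAME and DKAN_PASSWORD are hypothetical names chosen for this
# example; dkanr itself does not define them.
read_dkan_credentials <- function() {
  user <- Sys.getenv("DKAN_USERNAME")
  pass <- Sys.getenv("DKAN_PASSWORD")
  # Sys.getenv() returns "" for unset variables
  if (!nzchar(user) || !nzchar(pass)) {
    return(NULL)
  }
  list(username = user, password = pass)
}

creds <- read_dkan_credentials()
if (!is.null(creds)) {
  dkanr::dkanr_setup(url = 'http://demo.getdkan.com',
                     username = creds$username,
                     password = creds$password)
} else {
  message("Credentials not set; skipping authenticated setup.")
}
```

The variables can be set in the shell or in a user-level .Renviron file, so that the password never appears in the script itself.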

You can verify that you are successfully connected by printing your connection information.

dkanr_settings()
## <dkanr settings>
##   Base URL:  https://data.louisvilleky.gov 
##   Cookie:  
##   Token:

Step 2: List all available datasets with list_nodes_all()

While exploring the offerings of a catalog, you can retrieve all the available datasets with a simple query.

# Get a list of all datasets
resp <- list_nodes_all(filters = c(type = 'dataset'), as = 'df')
# Print the first 10 datasets
resp %>%
  select(nid, title, uri) %>%
  arrange(title) %>%
  head(n = 10)
## # A tibble: 10 × 3
##    nid   title                             uri                                  
##    <chr> <chr>                             <chr>                                
##  1 8076  311 Service Requests              https://data.louisvilleky.gov/api/da…
##  2 4526  ABC License Data                  https://data.louisvilleky.gov/api/da…
##  3 4926  ALL Checks                        https://data.louisvilleky.gov/api/da…
##  4 2686  Abandoned Urban Property          https://data.louisvilleky.gov/api/da…
##  5 5566  Absenteeism                       https://data.louisvilleky.gov/api/da…
##  6 2781  Account Breakdown by Program Area https://data.louisvilleky.gov/api/da…
##  7 2166  Active Contractors                https://data.louisvilleky.gov/api/da…
##  8 8216  Active Permits                    https://data.louisvilleky.gov/api/da…
##  9 4496  Aerial Photogrids                 https://data.louisvilleky.gov/api/da…
## 10 5296  Air Emission Sources              https://data.louisvilleky.gov/api/da…

Step 3: Access metadata for a specific dataset node

Say you are interested in a specific dataset from the catalog, for instance the “Active Permits” dataset. You can retrieve this dataset’s metadata using its node ID.

  1. First, identify the dataset node ID
  2. Then, use the node ID to retrieve the dataset metadata

Identify the dataset node ID

# Print only the "Active Permits" dataset information
resp %>%
  filter(title == 'Active Permits') %>%
  select(nid, title, uri, type)
## # A tibble: 1 × 4
##   nid   title          uri                                                 type 
##   <chr> <chr>          <chr>                                               <chr>
## 1 8216  Active Permits https://data.louisvilleky.gov/api/dataset/node/8216 data…
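If you do not know the exact title, a partial, case-insensitive match can help narrow the list. The sketch below uses a small stand-in table (named catalog here, a hypothetical name) built from three rows of the listing shown earlier, so it runs without a connection; with a live connection you would filter resp instead.

```r
library(dplyr)

# Small stand-in for the catalog listing, using three rows from the output above
catalog <- tibble(
  nid   = c('8076', '2166', '8216'),
  title = c('311 Service Requests', 'Active Contractors', 'Active Permits')
)

# Case-insensitive partial match on the title
matches <- catalog %>%
  filter(grepl('permit', title, ignore.case = TRUE))

matches
```

grepl() accepts regular expressions, so the same approach extends to more specific patterns.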

Use the node ID to retrieve the dataset metadata

metadata <- retrieve_node(nid = '8216', as = 'list')
metadata
## <DKAN Node> #8216 
##   Type: dataset
##   Title: Active Permits
##   UUID: 7e83b96e-3b53-4fc5-9a4c-32af30571787
##   Created/Modified: 1467293251 / 1520511325
# Names of the first 30 metadata fields
names(metadata)[1:30]
##  [1] "vid"                       "uid"                      
##  [3] "title"                     "log"                      
##  [5] "status"                    "comment"                  
##  [7] "promote"                   "sticky"                   
##  [9] "vuuid"                     "nid"                      
## [11] "type"                      "language"                 
## [13] "created"                   "changed"                  
## [15] "tnid"                      "translate"                
## [17] "uuid"                      "revision_timestamp"       
## [19] "revision_uid"              "body"                     
## [21] "field_additional_info"     "field_author"             
## [23] "field_contact_email"       "field_contact_name"       
## [25] "field_data_dictionary"     "field_frequency"          
## [27] "field_granularity"         "field_license"            
## [29] "field_public_access_level" "field_related_content"
# Access specific metadata fields
metadata$title
## [1] "Active Permits"
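The Created/Modified values in the node summary appear to be Unix timestamps (seconds since 1970-01-01 UTC); that interpretation is an assumption of this sketch. Under it, they can be converted to date-times with base R. The two values are copied from the printed summary, and to_datetime is a hypothetical helper name.

```r
# Timestamps as printed in the node summary above
created_raw  <- '1467293251'
modified_raw <- '1520511325'

# Convert seconds since 1970-01-01 (UTC) to POSIXct.
# as.numeric() handles values stored as strings or as numbers.
to_datetime <- function(x) {
  as.POSIXct(as.numeric(x), origin = '1970-01-01', tz = 'UTC')
}

created  <- to_datetime(created_raw)
modified <- to_datetime(modified_raw)

format(created, '%Y-%m-%d')
format(modified, '%Y-%m-%d')
```

With a retrieved node, the same helper could be applied to metadata$created and metadata$changed, both of which appear in the field listing.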

Step 4: Access data for a specific resource node

Once you have identified a dataset of interest, you will probably want to download the actual data. Multiple data files and documents may be attached to a single dataset, so you’ll first need to list all the resources (data files and other types of documents) linked to the dataset you are interested in.

Here, a single resource is attached to the “Active Permits” dataset:

get_resource_nids(metadata)
## [1] "8221"

You can then use the resource node ID to retrieve its metadata.

metadata_rs <- retrieve_node(nid = '8221', as = 'list')
metadata_rs
## <DKAN Node> #8221 
##   Type: resource
##   Title: Active Permits
##   UUID: 65c4458b-1804-4bf2-b647-b2744648f647
##   Created/Modified: 1467293303 / 1520508729
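When a dataset has several resources, the same call can be repeated for each node ID. The helper below is a minimal sketch: retrieve_resources is a hypothetical name, and its retrieve argument is a design choice made for this example so that the helper can be tried with a stub instead of a live connection.

```r
# Retrieve metadata for every resource node ID of a dataset.
# `retrieve` defaults to dkanr::retrieve_node(); it is exposed as an argument
# (a choice made for this sketch) so the helper can be tried without a connection.
retrieve_resources <- function(nids,
                               retrieve = function(nid) {
                                 dkanr::retrieve_node(nid = nid, as = 'list')
                               }) {
  resources <- lapply(nids, retrieve)
  # Name each element by its node ID for easy lookup
  stats::setNames(resources, nids)
}

# Offline check with a stub in place of the API call
stub <- function(nid) list(nid = nid, type = 'resource')
demo <- retrieve_resources(c('8221'), retrieve = stub)
names(demo)
```

With a live connection, retrieve_resources(get_resource_nids(metadata)) would return one metadata list per attached resource.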

Data can then be downloaded in one of two ways:

  • as a batch download, or
  • via an API call.

Batch download

Retrieve the resource URL from the resource metadata:

get_resource_url(metadata_rs)
## [1] "https://data.louisvilleky.gov/sites//default//files//ActivePermits_7.csv"
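Once you have the URL, the file can be fetched with base R. The sketch below assumes the resource is a CSV file; normalize_url and download_csv are hypothetical helper names. Collapsing the doubled slashes in the path is optional, since servers commonly accept them, but it gives a cleaner URL.

```r
# Collapse repeated slashes in the path while leaving "https://" intact
normalize_url <- function(url) {
  gsub('(?<!:)/{2,}', '/', url, perl = TRUE)
}

# Download a CSV to a local file and read it into a data frame
download_csv <- function(url, destfile = tempfile(fileext = '.csv')) {
  utils::download.file(normalize_url(url), destfile, mode = 'wb')
  utils::read.csv(destfile, stringsAsFactors = FALSE)
}

resource_url <- 'https://data.louisvilleky.gov/sites//default//files//ActivePermits_7.csv'
normalize_url(resource_url)

# Requires network access, so it is left commented out here:
# permits <- download_csv(resource_url)
```

In practice, the URL would come from get_resource_url(metadata_rs) rather than being typed by hand.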

API call

Some data files can be queried directly through the API, but only those that have been imported into the DKAN datastore.

First, check whether the data file you are interested in is available from the DKAN datastore:

ds_is_available(metadata_rs)
## [1] TRUE

If it is, you can retrieve the data directly from the datastore. To do so, use the resource UUID (another unique identifier, distinct from the node ID):

ds_search_all(resource_id = metadata_rs$uuid, as = 'df') %>%
  select(PERMITNUMBER, PERMITTYPE, STATUS, SQUAREFEET)
## # A tibble: 100 × 4
##    PERMITNUMBER PERMITTYPE      STATUS SQUAREFEET
##    <chr>        <chr>           <chr>  <chr>     
##  1 54912        Building Permit Issued 842       
##  2 106485       Building Permit Issued 1300      
##  3 107817       Building Permit Issued 487       
##  4 113132       Building Permit Issued 3256      
##  5 110301       Building Permit Issued 400       
##  6 115478       Building Permit Issued 2938      
##  7 281965       Building Permit Issued 1050      
##  8 281380       Building Permit Issued 360       
##  9 283077       Building Permit Issued 1225      
## 10 278714       Building Permit Issued 14710     
## # ℹ 90 more rows
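Note that SQUAREFEET is returned as a character column in the output above, so you will likely want to convert it before doing any arithmetic. The sketch below uses a stand-in table (named permits here, a hypothetical name) built from the first three rows printed above, so it runs without a connection.

```r
library(dplyr)

# Stand-in for the datastore result, using the first three rows printed above
permits <- tibble(
  PERMITNUMBER = c('54912', '106485', '107817'),
  SQUAREFEET   = c('842', '1300', '487')
)

# Convert the character column to numeric
permits_num <- permits %>%
  mutate(SQUAREFEET = as.numeric(SQUAREFEET))

# Numeric operations now work as expected
sum(permits_num$SQUAREFEET)
```

as.numeric() returns NA, with a warning, for values that cannot be parsed, so it is worth checking for NAs after converting a full dataset.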