--- title: "Performing Multidimensional Queries (also Including Geographic Attributes)" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Performing Multidimensional Queries (also Including Geographic Attributes)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, echo = FALSE, message=FALSE} library(rolap) ``` # Introduction Once we developed a star database in R using the `rolap` package, in addition to exporting it to exploit it with other tools, we can perform multidimensional queries from R: The `rolap` package provides the functionality to define geographic attributes and formulate and execute simple queries on a multidimensional schema that includes them. The main objective of this document is to show the multidimensional query formulation and execution functionality offered by this package. First, the data model is briefly discussed: the possibility of defining stars and constellations. Then, the geographic attributes definition functionality is also shown. Next, the functions defined to support multidimensional queries are presented. Finally, finish with the conclusions. # Stars and constellations Strictly speaking, a *star* is composed of a fact table and several associated dimension tables. A *constellation* is made up of several stars that can share dimensions. In the `rolap` package they are treated in a unified way under the `star_database` class: It is used both to define stars and constellations. The variable `mrs_db`, obtained in the vignette titled *Obtaining and transforming flat tables*, `vignette("v05-flat-table-op")`, contains an object of class `star_database` that we will use in the example. ```{r} class(mrs_db) ``` We can see a representation of the tables it contains using the `draw_tables()` function, as shown below. ```{r} mrs_db |> draw_tables() ``` We can see that it is a constellation because it contains more than one fact table. # Include geographic information layers The `rolap` package allows us to include layers of geographic information associated with dimension attributes. The objective is to be able to obtain layers of geographic information with the data contained in the multidimensional database. In the considered case, the table of dimension *where* contains geographical information in the form of the coordinates (latitude and longitude) of each city. We can define the `city` field as a geographic attribute. ```{r} mrs_db_geo <- mrs_db |> define_geoattribute( dimension = "where", attribute = "city", from_attribute = c("long", "lat") ) ``` We can also associate the geographic information of a vector layer of points or polygons to an attribute (or set of attributes), using the same function. ```{r} mrs_db_geo <- mrs_db_geo |> define_geoattribute( dimension = "where", attribute = "state", from_layer = us_layer_state, by = "STUSPS" ) ``` The vector layer used is defined at the state level. If there is another field at another level of detail in the layer, the same layer can be used to define other attributes of coarser granularity. ```{r} mrs_db_geo <- mrs_db_geo |> define_geoattribute( dimension = "where", attribute = "region", from_layer = us_layer_state, by = "DIVISION" ) ``` If there is no field in the layer that allows us to establish the relationship, the information associated with other attributes of the dimension can be used. ```{r} mrs_db_geo_2 <- mrs_db_geo |> define_geoattribute( dimension = "where", attribute = "region", from_attribute = "state" ) ``` If there are still unrelated instances of the dimension, the `define_geoattribute()` function warns and the instances can be consulted using the `check_geoattribute_geometry()` function. Through these functions we have defined relationships between the attributes and vector layers of geographic information, at the level of detail we need, which we can later take advantage of. # Query functions A query is defined on a `star_database` object and the result of executing it is another `star_database` object. This section presents the functions available to define queries. ## `star_query()` From a `star_database` object, an empty `star_query` object is created where we can select fact measures, dimension attributes and filter dimension rows. Example: ```{r} sq <- mrs_db_geo |> star_query() ``` At least one fact table with one dimension must be included in each query. ## `select_fact()` To define the fact table to be consulted, its name is indicated, optionally, a vector of names of selected measures and another of aggregation functions are also indicated. If the name of any of the measures is not indicated, the measure corresponding to the number of rows added will be included, which is always included. If no aggregation function is included, those defined for the measures are considered. Examples: ```{r} sq_1 <- sq |> select_fact( name = "mrs_age", measures = "all_deaths", agg_functions = "MAX" ) ``` The measure is considered with the indicated aggregation function. In addition, the measure corresponding to the number of grouped records that make up the result is automatically included. ```{r} sq_2 <- sq |> select_fact(name = "mrs_age", measures = "all_deaths") ``` The measure is considered with the aggregation function defined in the multidimensional scheme. ```{r} sq_3 <- sq |> select_fact(name = "mrs_age") ``` Only the measure corresponding to the number of grouped records is included. ```{r} sq_4 <- sq |> select_fact(name = "mrs_age", measures = "all_deaths") |> select_fact(name = "mrs_cause") ``` In a query we can select several fact tables, at least we have to select one. ## `select_dimension()` To include a dimension in a `star_query` object, we have to define its name and a subset of the dimension attributes. If only the name of the dimension is indicated, it is considered that all its attributes should be added. Example: ```{r} sq_1 <- sq |> select_dimension(name = "where", attributes = c("city", "state")) ``` Only the indicated attributes of the dimension will be included. ```{r} sq_2 <- sq |> select_dimension(name = "where") ``` All attributes of the dimension will be included. ## `filter_dimension()` Allows us to define selection conditions for dimension rows. Conditions can be defined on any attribute of the dimension, not only on attributes selected in the query for the dimension. They can also be defined on unselected dimensions. Filtering is done using the function `dplyr::filter()`. Conditions are defined in exactly the same way as in that function. Example: ```{r} sq <- sq |> filter_dimension(name = "when", week <= " 3") |> filter_dimension(name = "where", city == "Bridgeport") ``` ## `run_query()` Once we have selected the facts, dimensions and defined the conditions on the instances of dimensions, we can execute the query to obtain the result. The query can be executed on any `star_database` object that has in its structure the elements that appear in it. If the `star_database` has geographic information associated with it, this will be filtered according to its conditions. Example: ```{r} sq <- star_query(mrs_db_geo) |> select_dimension(name = "where", attributes = c("region", "state")) |> select_dimension(name = "when", attributes = "year") |> select_fact(name = "mrs_age", measures = "all_deaths") |> select_fact(name = "mrs_cause", measures = "all_deaths") |> filter_dimension(name = "when", week <= " 3" & year >= "2010") mrs_db_geo_3 <- mrs_db_geo |> run_query(sq) class(mrs_db_geo_3) ``` The result of running a query is an object of the `star_database` class that meets the conditions defined in the query: Other queries can continue to be defined on this object. We can see a representation of the tables of the result, as shown below. ```{r} mrs_db_geo_3 |> draw_tables() ``` ## Exploitation of the result This section shows an example of how to exploit the result of the multidimensional query. The first thing we do is transform it into flat tables. ```{r} ft <- mrs_db_geo_3 |> as_single_tibble_list() ft_age <- ft[["mrs_age"]] ``` Below are the rows of one of the result tables. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(ft_age, split.table = Inf) ``` From the results in the form of flat tables, [`pivottabler`](https://CRAN.R-project.org/package=pivottabler) package can be used to present it in the form of pivot tables. ```{r} pt <- pivottabler::qpvt( ft_age, c("=", "region"), c("year"), c("Number of Deaths" = "sum(all_deaths)") ) pt$renderPivot() ``` # Obtaining layers with geographic information If in a `star_database` object we have attributes to which geographic information has been associated, we can obtain geographic information layers that include fact and dimension data. To include only the data we need, using query operations we can previously filter the `star_database` object. Once the geographic information layer is obtained, it can also be filtered to select the variables represented. The following sections show how to obtain a geographic information layer and the operations that we can perform on it. ## Get a `geolayer` object From a `star_database` object with some attribute to which we have associated geographic information (*geoattribute*), we can obtain a `geolayer` object. If there is more than one, we have to indicate the layer granularity geoattribute. ```{r} gl_state <- mrs_db_geo_3 |> as_geolayer(attribute = "state") ``` The `geolayer` object is composed of a layer of geographic information at the level of detail of the geoattribute and another where the variables of the recorded data are described. ```{r} layer_state <- gl_state |> get_layer() layer_state var_state <- gl_state |> get_variables() var_state ``` We can generate the geographic information layer so that it contains only the objects for which we have additional information, as we have done. ```{r} plot(sf::st_geometry(layer_state)) text( sf::st_coordinates(sf::st_centroid(sf::st_geometry(layer_state))), labels = layer_state$state, pos = 3, cex = 0.5 ) ``` We can also generate it for all objects in the original layer, even if they do not contain information. ```{r} layer_state_all <- gl_state |> get_layer(keep_all_variables_na = TRUE) plot(sf::st_shift_longitude(sf::st_geometry(layer_state_all))) ``` ## Operations on variables For the variables, in addition to being able to obtain them in `tibble` format, we can consult for those whose name we indicate or for all of their meaning, as shown below. ```{r} gl_state |> get_variable_description(c("var_01", "var_10")) vd <- gl_state |> get_variable_description() vd[c("var_01", "var_10")] ``` The variables are a `tibble` and we can select them using the `dplyr::filter()` function. ```{r} var_state_2 <- var_state |> dplyr::filter(year == '2016') ``` Once the variables we need have been filtered, we can filter the geographic information layer so that it will only contain these variables. ```{r} gl_state_2 <- gl_state |> set_variables(var_state_2) layer_state_2 <- gl_state_2 |> get_layer() ``` ```{r} plot(sf::st_geometry(layer_state_2)) text( sf::st_coordinates(sf::st_centroid(sf::st_geometry(layer_state_2))), labels = layer_state_2$state, pos = 3, cex = 0.5 ) ``` ## Data representation For example, for each state we are going to represent the percentage of pneumonia and influenza deaths registered, starting in 2010. First of all, we must define the query. ```{r} sq_2 <- star_query(mrs_db_geo) |> select_dimension(name = "where", attributes = "state") |> select_fact(name = "mrs_cause", measures = c("pneumonia_and_influenza_deaths", "all_deaths")) |> filter_dimension(name = "when", year >= "2010") ``` We run the query and get a `star_database` object as a result. ```{r} mrs_db_geo_3 <- mrs_db_geo |> run_query(sq_2) ``` We obtain a `geolayer` object. ```{r} gl_state_3 <- mrs_db_geo_3 |> as_geolayer(attribute = "state") ``` Finally, we represent it. ```{r} gl_state_3 |> get_variable_description() layer <- gl_state_3 |> get_layer() layer$tpc_deaths <- (layer$var_2 / layer$var_1) * 100 plot(layer[, "tpc_deaths"], main = "% pneumonia and influenza") ``` To maintain the definition of the new variable in the `geolayer` object, we can define that this is the new geographic information layer of it. ```{r} gl_state_3 <- gl_state_3 |> set_layer(layer) ``` If we want to treat the geographic information layer with another tool, we can export it along with the variable definition table in *GeoPackage* format. ```{r} f <- gl_state_3 |> as_GeoPackage(dir = tempdir()) sf::st_layers(f) ``` # Conclusions This document presents some of the querying possibilities that offers the `rolap` package. The queries are formulated on an object of class `star_database` and the result is another object of the same class on which additional queries can be made. Queries can be formulated about a star or set of stars or constellation. We can define attributes that have associated geographic information. If any of these geographic attributes are included in the result of a query, we obtain a geographic information layer in which the result of the query is defined in the form of variables that can be filtered and queried.