--- title: "Introduction to spaths" author: "Christian Düben" date: "Updated: March 18, 2024" output: rmarkdown::html_vignette: number_sections: true bibliography: ../inst/REFERENCES.bib vignette: > %\VignetteIndexEntry{Introduction to spaths} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction

What is the shortest path from Mombasa, Kenya to Marseille, France by boat? It is not the straight line between those locations, as it cuts through land. This package applies graph theory to gridded spatial data, e.g. deriving shortest paths between places taking barriers or cost surfaces into account. In the mentioned example, this would be delineating a path along the East African coast, through the Red Sea, the Suez Canal, and the Mediterranean. Apart from relating locations on Earth, `spaths` can also compute shortest paths more generally on spheres and planes.

`spaths` originates from a research project with larger data and more extensive computations than earlier R packages in this field could handle. Like other packages, previous versions of `spaths` used the `igraph` package, a wrapper for a C library, for graph theoretical algorithms. To further optimize computational performance and tailor the code to the gridded, spatial use case, `spaths` dropped the `igraph` dependency and now comes with its own C++ graph theory implementation.

Apart from computational performance, `spaths` is geared towards user friendliness. Users do not have to understand how graph-theoretical algorithms, transition functions, or spatial distance functions work. And via a set of parameters, they can choose an implementation that fits their machine's capabilities and the application's prerequisites. Another aspect of user friendliness is the ease of installation. Linux users without administrator rights often struggle to install geospatial or other packages that build on external system libraries. To prevent such issues, `spaths` comes with few dependencies. Apart from core packages by default installed with R, users only have to install `Rcpp` and `data.table`. Installing `terra` is optional and only required, when using `terra` objects as inputs.

The package is written to be a foundation upon which other packages build. As `igraph` has become a dependency for graph-theoretical software more broadly, `spaths` aims to be the dependency for applications connecting points in grids. The extensions of `spaths` can be in the geospatial domain, computing e.g. sailing vessel routes taking wind patterns and ocean currents into account, animal migration influenced by terrain topology and vegetation, or optimal paths for Mars explorations. However, they can also target biomedical studies or any other question connecting points in a grid. All you have to do for most extensions is to pass a custom transition function to the `tr_fun` parameter that tells `spaths` how to compute transitions costs between cells from one or more grid layers.

# Graphs

Before `spaths` applies graph-theoretical algorithms, like shortest paths identifications, it converts the spatial data into a graph. A graph consists of vertices, also called nodes, that are connected via edges. In `spaths`, grid cell centroids are vertices linked to neighboring cells' centroids through edges. Which neighbors are connected depends on the `contiguity` argument. Figure 1 illustrates the parameter's two options: rook's case contiguity and queen's case contiguity. In the former model, a cell is directly connected to its four horizontally and vertically adjacent neighbors. With the latter structure, a cell is also directly linked to the four diagonally adjacent neighbors, implying a total of eight direct neighbors.

```{r, out.width = "46%", out.height = "30%", fig.show = "hold", fig.align = "center", fig.cap = "Figure 1: Contiguity", echo = FALSE} knitr::include_graphics("Figure_1a.png") knitr::include_graphics("Figure_1b.png") ```

`spaths` uses queen's case contiguity by default as it often produces more appropriate shortest paths than rook's case contiguity does. The advantages of rook's case contiguity are that the fewer edges imply lower RAM requirements and shorter computational times of shortest path algorithms than with queen's case contiguity.

There are other types of contiguity that directly connect a cell to second order neighbors. This can produce paths that look smoother, as it permits the algorithm to traverse the grid at a larger set of angles. Yet, `spaths` does currently not implement such type of contiguity because it would allow the algorithm to jump over barriers that are one cell wide. In an application deriving ship routes, the algorithm could jump over e.g. islands and peninsulas.

All of the input grid's, i.e. `rst` argument's, non-`NA` pixels become part of the graph. In the example of ship route estimations, you would set all land pixels to `NA`, restricting ships to only traverse water cells. The number of of non-`NA` cells and thereby size of the graph is the most important determinant of RAM requirements and execution time. Crop the grid to the relevant area or set irrelevant pixels to `NA` to boost efficiency.

Shortest paths algorithms, like the here used Dijkstra's [-@Dijkstra1959] algorithm, minimize the sum of edge weights along the paths between origins and destinations. By default, the edge weight in `spaths` is the geographic distance between the centroids of the respective neighboring cells. Figure 2 illustrates edge weights as distance in kilometers between a cell at 1°N 1°E an its neighboring cells under queen's case contiguity in an unprojected, i.e. lonlat, grid of a one degree resolution. The use of kilometers is for visualization purposes and deviates from the actual implementation which uses meters in most cases, as the documentation states.

In calculating distances between locations distributed around the globe, it is usually best to use unprojected data. Projections, i.e. transfering coordinates from a spheroid or ellipsoid onto a two dimensional plane, necessarily distort the input. These distortions can severely bias distances between cells. Projections are designed for specific, often regional, applications. If you are not sure whether a projection allows for correct distances in your case, use unprojected data. This phenomenon is not specific to `spaths`, but holds for any GIS software. Another caveat of using projections is that `spaths`, like `terra`, does not connect cells across the projected grid's edges, even if the data is global. Meaning it does not connect the left most cells to the right most cells in the grid. However, when the data is unprojected and global, it does connect them. This applies to any `rst` input class, including a SpatRaster, a RasterLayer, and a matrix or a list of matrices with `spherical = TRUE`.

```{r, out.width = "46%", fig.show = "hold", fig.align = "center", fig.cap = "Figure 2: Edge Weights", echo = FALSE} knitr::include_graphics("Figure_2.png") ```

`spaths` defaults to the fastest ways of deriving geographic distances. It uses optimized Haversine distance computations for unprojected data on Earth or other spherical objects and optimized Euclidean distance computations for planar data. If you want account for Earth's ellipsoid shape rather than working with spherical distances and computational efficiency is not a top priority, set `dist_comp` to `"terra"`, which derives distances via `terra::distance`. `terra` is a great and performant package. What makes the `"terra"` option slower than `"spaths"` in `dist_comp` is that it triggers the function to derive distances between all neighboring cells separately and export them from R to C++ while the latter choice computes them in a way tailored to this application in the graph construction phase in C++. This implementation e.g. leverages the occurrence of repeated distance values in gridded data.

Edge weights do not have to be geographic straight line distances between cell centroids. With a custom transition function passed to `tr_fun`, you can freely define them in a different way. In this function, you may use the parameters `d` (distance between the pixel centroids), `x1` (x coordinate or longitude of the first cell), `x2` (x coordinate or longitude of the second cell), `y1` (y coordinate or latitude of the first cell), `y2` (y coordinate or latitude of the second cell), `v1` (`rst` layers' values from the first cell), `v2` (`rst` layers' values from the second cell), and `nc` (number of CPU cores set by `ncores`). The algorithm moves from the first to the neighboring second cell. If the data is unprojected, i.e. lonlat, or if `dist_comp = "terra"`, `d` is measured in meters. Otherwise, it uses the units of the CRS. If `rst` has one layer, the values are passed to `v1` and `v2` as vectors. Otherwise they are passed as a data table, if `v_matrix` is `FALSE` (default), and as a matrix, if `v_matrix` is `TRUE`. The function has to be callable from R, but does not have to be written in R. It can e.g. be a C++ function implemented via the `Rcpp` package.

A common transition function is Tobler's [-@Tobler1993] hiking function. It estimates the travel time between locations conditional on terrain topography. To illustrate this use case, imagine that the input grid is a digital elevation model. It consists of one layer and the pixel values document the location's elevation in meters. This `tr_fun` function then expresses the transition cost in terms of the hours it takes to travel between cell centroids according to Tobler: `function(d, v1, v2) d / (6000 * exp(-3.5 * abs((v2 - v1) / d + 0.05)))`. It combines the straight line distance in meters `d` with the altitude difference, i.e. the value of the second cell `v2` minus the value of the first cell `v1`. Tobler's [-@Tobler1993] hiking function produces asymmetric edge weights. Walking uphill is not equally fast as walking downhill. This results in a directed graph, meaning transition cost depends on the direction in which the algorithm traverses between grid cells. So, the `tr_directed` argument has to be `TRUE`.

With transition functions generating symmetric edge weights, `tr_directed` should always be set to `FALSE`, improving computational performance. Tobler's [-@Tobler1993] function would e.g. produce symmetric transition costs in the absence of the `+ 0.05`. The `tr_directed` parameter only has an effect in the presence of custom transition functions, i.e. if `tr_fun` is not `NULL`. Otherwise, the graph is always undirected.

As already mentioned, `spaths` is not restricted to geographic applications. Here is an arbitrary transition function combining various layers of `rst`: `function(v1, v2) v1[[1L]] * v2[[1L]] + v1[[2L]] * v2[[2L]]`. This multiplies the first cell's value in the first layer with the second cell's value in the first layer and adds it to the cells' second layer product.

It is recommended to use vectorized R functions like the ones above. It computes all edge weights without explicitly calling an iterative structure like `for`, `*apply`, or `foreach` loops. R loops are slow. If your function requires a loop, write it in C++. Here is an example with `v_matrix = TRUE`, the Armadillo C++ library, C++ 20, and `...` as a placeholder for the function body:

```{r, eval = FALSE} Rcpp::sourceCpp(code = ' #include // [[Rcpp::depends(RcppArmadillo)]] // [[Rcpp::plugins(cpp20)]] // [[Rcpp::export]] Rcpp::NumericVector example_tr_fun(arma::mat &v1, arma::mat &v2, Rcpp::NumericVector &d) { ... } ') ```
# Output

By default, `shortest_paths` returns distances. These distances are the sum of edge weights along the shortest paths between points. Without a custom transition function, it is the geographic distance between the centroids of the cells on a path between two points. It is the length of the path in meters, unless you use a projection with different units. With a custom transition function, the distance or length of the path is expressed in whatever units the transition function returns, such as hours of travel time with Tobler's [-@Tobler1993] hiking function.

```{r, eval = FALSE} set.seed(3L) input_grid <- terra::rast(crs = "epsg:4326", resolution = 2, vals = sample(c(1L, NA_integer_), 16200L, TRUE, c(0.8, 0.2))) origin_pts <- rnd_locations(3L, output_type = "SpatVector") destination_pts <- rnd_locations(3L, output_type = "SpatVector") shortest_paths(input_grid, origin_pts) origins destinations distances 1: 1 2 19627694 2: 1 3 7290325 3: 2 3 14467797 shortest_paths(input_grid, origin_pts, destination_pts) origins destinations distances 1: 1 1 13313293 2: 1 2 6158046 3: 1 3 15837664 4: 2 1 9137621 5: 2 2 16130624 6: 2 3 4810903 7: 3 1 15919393 8: 3 2 10787554 9: 3 3 19275995 ```

Instead of distances, `shortest_paths` can output the path lines or lines and distances jointly.

```{r, eval = FALSE} shortest_paths(input_grid, origin_pts, output = "lines") class : SpatVector geometry : lines dimensions : 3, 3 (geometries, attributes) extent : -179, 179, -63, -1 (xmin, xmax, ymin, ymax) coord. ref. : lon/lat WGS 84 (EPSG:4326) names : origins destinations connected type : values : 1 2 TRUE 1 3 TRUE 2 3 TRUE shortest_paths(input_grid, origin_pts, output = "both") class : SpatVector geometry : lines dimensions : 3, 3 (geometries, attributes) extent : -179, 179, -63, -1 (xmin, xmax, ymin, ymax) coord. ref. : lon/lat WGS 84 (EPSG:4326) names : origins destinations distances type : values : 1 2 1.963e+07 1 3 7.29e+06 2 3 1.447e+07 ```

When no destinations are specified, `shortest_paths` computes the paths between all origin combinations. If transition costs are symmetric, i.e. traveling from cell A to neighboring cell B is as expensive as traveling from B to A, the function by default only returns distances in one direction to boost computational efficiency and lower the RAM requirements of the return object. If you would like the output to report on both directions, set `bidirectional` to `TRUE`.

```{r, eval = FALSE} shortest_paths(input_grid, origin_pts) origins destinations distances 1: 1 2 19627694 2: 1 3 7290325 3: 2 3 14467797 shortest_paths(input_grid, origin_pts, bidirectional = TRUE) origins destinations distances 1: 1 2 19627694 2: 1 3 7290325 3: 2 3 14467797 4: 2 1 19627694 5: 3 1 7290325 6: 3 2 14467797 ```

The distance from a point to itself is zero. So, irrespective of what arguments you pass to `shortest_paths`, the function never returns paths from points to themselves.

If you do not want to connect all origins to all destinations, specify `pairwise = TRUE`. This connects the first origin to the first destination, the second origin to the second destination, etc. For computational optimization, the function can change the order of the results. In the example below, the first row contains the distance between the second origin and the second destination. Always check the output's origins and destinations variables regarding which points an estimate refers to.

```{r, eval = FALSE} shortest_paths(input_grid, origin_pts, destination_pts, pairwise = TRUE) origins destinations distances 1: 2 2 16130624 2: 1 1 13313293 3: 3 3 19275995 ```

The origins and destinations variables by default refer to the row numbers in the `origins` (and `destinations`) inputs. You can make `shortest_paths` to utilize other names by specifying a column in `origins` (and `destinations`) containing point names. These names can be of types `character`, `integer`, and `numeric`.

```{r, eval = FALSE} origin_pts$name <- letters[1:3] shortest_paths(input_grid, origin_pts, origin_names = "name") origins destinations distances 1: a b 19627694 2: a c 7290325 3: b c 14467797 ```

Unconnected points are marked with `Inf` in the `distances` variable, if `distance_type` is `"double"` or `"float"`, with `NA`, if `distance_type` is `"int"` or `"unsigned short int"`, and with a `connected` variable, if distances are not returned. Integers use `NA` rather than `Inf` because `Inf` is a numeric, not an integer, value.

If `output = "distances"`, the output is by default returned as a data table. Data tables are data frames and you can use them in methods expecting data frames. If you want the result to be a data frame only, not a data table, set `output_class` to `"data.frame"`.

```{r, eval = FALSE} shortest_paths(input_grid, origin_pts, output_class = "data.frame") origins destinations distances 1 1 2 19627694 2 1 3 7290325 3 2 3 14467797 ```

If `output` is `"lines"` or `"both"`, the the function returns a SpatVector, if `rst` is a SpatRaster or a RasterLayer, and a list, if `rst` is a matrix or a list of matrices. Explicitly setting `output_class` to `"list"` returns a list in any case. `output_class = "SpatVector"`, however, returns a SpatVector only if `rst` is a SpatRaster or a RasterLayer.

# Grid Updating

`NA` cells in `rst` act as barriers. They mark the cells which the algorithm must not travel through. What if these barriers move? In the example of ships moving between ports, these moving barriers could be Caribbean hurricanes. Ships do not go through the storms, but around them.

Assume each hurricane is documented with a separate polygon. You could mask the `rst` grid with the different polygons, create one grid per hurricane, and loop over these grids with `shortest_paths`. This would reestimate all shipping routes in each call of `shortest_paths`. Even lines not passing through the Caribbean, e.g. routes between India and Australia, would be recomputed. On top of that, each iteration would check the inputs and convert them into the format used by the algorithm. It is an inefficient strategy.

`shortest_paths` comes with an efficient solution to the moving barrier case. You pass the hurricane polygons to `update_rst` and the function computes the shortest paths in a hurricane-free grid and the grids subject to hurricanes. It just recomputes paths affected by a hurricane and is much more efficient than looping over `shortest_paths` is.

```{r, eval = FALSE} barrier <- terra::vect("POLYGON ((-179 -25, 100 -25, 100 -26, -179 -26, -179 -25))", crs = "epsg:4326") shortest_paths(input_grid, origin_pts, update_rst = barrier) origins destinations distances layer 1: 1 2 19627694 0 2: 1 3 7290325 0 3: 2 3 14467797 0 4: 1 2 19627694 1 5: 1 3 13207350 1 6: 2 3 15465933 1 barriers <- list(barrier, terra::vect("POLYGON ((0 20, 1 20, 1 -20, 0 -20, 0 20))", crs = "epsg:4326")) shortest_paths(input_grid, origin_pts, update_rst = barriers) origins destinations distances layer 1: 1 2 19627694 0 2: 1 3 7290325 0 3: 2 3 14467797 0 4: 1 2 19627694 1 5: 1 3 13207350 1 6: 2 3 15465933 1 7: 1 2 19813077 2 8: 1 3 7290325 2 9: 2 3 14467797 2 ```

Layer 0 is the hurricane-free base grid, layer 1 is the hurricane-free base grid updated with the first polygon, and layer 2 is the hurricane-free base grid updated with the second polygon. Each layer updates the unmodified `rst`, not the grid already updated by another polygon. `update_rst` sets cells covered by the respective polygon to `NA`. It never sets cells to any other value than `NA`.

The actual implementation does not truly update `rst` or the graph, but marks the respective cells as blocked in a more efficient way. The internal strategy obtains the same results, but is much faster than physically updating the grid would be. So, you should treat `update_rst` as a method of setting any `rst` cells that it intersects with to `NA`, irrespective of the optimized implementation.

# Shortest Paths between Non-Earth Locations

`spaths` is not limited to geographic locations on Earth. The functions can be applied to other scenarios connecting points in grids. These can be of a geographic nature, like other planets, or non-geographic subjects.

As an example, we consider astronauts walking on Mars. Non-Earth applications must provide `rst` as a matrix or a list of matrices. The SpatRaster and RasterLayer inputs are restricted to evaluations on Earth. If `rst` is a matrix or a list of matrices, the parameters `spherical`, `radius`, and `extent` define what that matrix refers to. In our Martian example, we define `rst` to be a global, unprojected grid of a two degree resolution.

```{r, eval = FALSE} set.seed(3L) input_grid <- matrix(sample(c(1L, NA_integer_), 16200L, TRUE, c(0.8, 0.2)), nrow = 90) ```

Because the data is unprojected, meaning it is expressed in degrees on a sphere, not points on a plane, we specify `spherical = TRUE`. Mars' radius is `3389500` meters and the grid's global nature implies an `extent` of `c(-180, 180, -90, 90)`. It stretches from -180 to 180 in the x dimension and -90 to 90 in the y dimension.

When `rst` is a matrix or a list of matrices, `origins` (and `destinations`) are supplied as a matrix, data frame, or data table of coordinates, with columns named `x` and `y`. The Martian example uses a data table.

```{r, eval = FALSE} set.seed(3L) origin_pts <- rnd_locations(3L) shortest_paths(input_grid, origin_pts, spherical = TRUE, radius = 3389500, extent = c(-180, 180, -90, 90)) origins destinations distances 1: 1 2 8229741 2: 1 3 5372309 3: 2 3 9088755 ```

`dist_comp = "terra"` is not available in non-Earth-based scenarios. Unless, you compute edge weights through a custom transition function, they are derived through the `dist_comp = "spaths"` methods. Custom transition functions work essentially like they do for Earth. A difference is that grid layers are not passed as layers of a single SpatRaster, but as matrices in a list. As in the SpatRaster case, the layers can be accessed by index and name.

```{r, eval = FALSE} set.seed(3L) input_grid <- list(even_terrain = input_grid, temperature = matrix(stats::rnorm(16200L, 20, 5), nrow = 90)) custom_tr <- function(v1, v2) v1[[1L]] * v2[[1L]] + v1[[2L]] * v2[[2L]] custom_tr <- function(v1, v2) v1[["even_terrain"]] * v2[["even_terrain"]] + v1[["temperature"]] * v2[["temperature"]] shortest_paths(input_grid, origin_pts, spherical = TRUE, radius = 3389500, extent = c(-180, 180, -90, 90), tr_fun = custom_tr) origins destinations distances 1: 1 2 15973.37 2: 1 3 10254.19 3: 2 1 15973.37 4: 2 3 17432.90 5: 3 1 10254.19 6: 3 2 Inf ```

Grid updating is not done with a SpatVector, but a vector of cell numbers, a matrix, or a list of either. The vector's cell numbers mark the cells to set to `NA`. `spaths` counts cells like `terra` does, starting with one in the top left, then increasing from left to right and afterwards from top to bottom. This differs from how R base matrices enumerate cells, which iterate first from top to bottom and second from left to right. If `update_rst` is a matrix, it must be of the same dimensions as `rst` and marks cells to be updated using `NA` values. Cells with non-`NA` values in an `update_rst` matrix are not updated.

```{r, eval = FALSE} set.seed(3L) input_grid <- matrix(sample(c(1L, NA_integer_), 16200L, TRUE, c(0.8, 0.2)), nrow = 90) barrier_vector <- sample.int(16200L, 10L) shortest_paths(input_grid, origin_pts, spherical = TRUE, radius = 3389500, extent = c(-180, 180, -90, 90), update_rst = barrier_vector) origins destinations distances layer 1: 1 2 8229741 0 2: 1 3 5372309 0 3: 2 3 9088755 0 4: 1 2 8229741 1 5: 1 3 5372309 1 6: 2 3 9088755 1 barrier_matrix <- matrix(rep.int(1L, 16200L), nrow = 90) barrier_matrix[sample.int(16200L, 10L)] <- NA_integer_ shortest_paths(input_grid, origin_pts, spherical = TRUE, radius = 3389500, extent = c(-180, 180, -90, 90), update_rst = barrier_matrix) origins destinations distances layer 1: 1 2 8229741 0 2: 1 3 5372309 0 3: 2 3 9088755 0 4: 1 2 8229741 1 5: 1 3 5372309 1 6: 2 3 9088755 1 ```

In each of the two cases, `update_rst` marks ten cells in `rst` to be set to `NA`, once using a vector and once using a matrix. In either approach, the ten cells do not affect the paths and accordingly the results are the same for both layers.

# Performance Optimization

Performance optimization is a central design principle of `spaths`. `shortest_paths`' runs with defaults that are optimal for the average use case. Nonetheless, there are some performance considerations that should generally be accounted for and various parameters that can help in tailoring the function execution to the application.

The number of non-`NA` pixels in `rst` has the largest influence on both computational time and RAM demand. Crop the `rst` to the relevant area and set any cells that the shortest paths are do surely no pass through to `NA`. The quantity of non-`NA` cells does not only determine the size of the graph, but the size of multiple intermediate objects, the number of edge weights to derive, and the number of places to consider in the shortest path algorithm.

To be more precise, the primary contributor to graph size is the number of edges: the quantity of links between cells. Of course, the number of edges increases in the grid cell count. There is an option to influence this link though. `shortest_paths` builds on the above described queen's case contiguity with up to eight edges per vertex. Changing `contiguity` to `"root"` cuts RAM consumption and computational time. Yet, it may induce less desireable paths than `contiguity = "queen"` does.

You can track the function's progress with `show_progress = TRUE`. This prints messages on stages of `shortest_paths`, including a progress bar, if the number of paths or `update_rst` elements is less than or equal to `bar_limit`. `bar_limit` defaults to 150. In the example below, the function prints three `=`, one per path. `show_progress = TRUE` is meant for testing purposes. Especially printing a progress bar from a parallelized function execution can prolong runtimes because the program limits writing to output to one thread at a time. If you choose to print a progress bar, do not set `bar_limit` too high, which can cause the output buffer to overflow and crash the program.

```{r, eval = FALSE} shortest_paths(input_grid, origin_pts, show_progress = TRUE) Checking arguments Converting spatial inputs Preparing algoritm inputs Starting distances calculation |---| |===| Generating output object origins destinations distances 1: 1 2 19627694 2: 1 3 7290325 3: 2 3 14467797 ```

Deriving lines is computationally more expensive and requires more RAM than deriving distances. Storing the coordinates of 10,000 lines comprised of 500,000 cells on average each, requires 74.5 GB RAM. Storing the distances associated with those 10,000 paths requires 78.1 KB RAM, or 0.0001 percent as much as storing the coordinates. This is, of course, not the only information that the function holds. There are intermediate objects etc. Yet, the computational requirements of assembling path lines should not be underestimated.

By default, `ncores` is `NULL` and `shortest_paths` parallelizes across all of the machine's CPU cores. It is implemented with OpenMP in C++. OpenMP is much more efficient than R level parallelism and scales quasi linearly in the number of cores. Each thread runs an iteration of the shortest paths algorithm. How many iterations there are depends on the number of origin and destination points. The software utilizes shared memory parallelism. Hence, objects that are shared across executions of the graph-theoretical algorithm, like the graph and `update_rst`, are not copied. Instead, all threads share the same representation. This does not mean that the RAM use does not increase in the number of cores. Each execution of the shortest path algorithm comes with its own objects, such as the priority queue managing the order of the cells to be visited, which need allocation. So, explicitly setting the number of cores to a value below the default is a way of reducing RAM consumption.

OpenMP is commonly not available on MacOS by default. This is not specific to `spaths`, but affects many modern R packages that are geared towards performance. It implies that the package may be much faster on Windows and Linux than on MacOS.

`shortest_paths` runs Dijsktra's algorithm to identify shortest paths between points. It runs the algorithm once per origin. If you provide one origin and ten destinations, the function produces ten paths and runs the algorithm once. If you provide two origins and five destinations, the function also produces ten paths, but runs the algorithm twice. The second example, therefore, takes longer than the first one. If the generated graph is undirected, it does not matter, if you pass two origins and five destinations or five origins and two destinations. `shortest_paths` automatically adjusts the direction to minimize the frequency with which Dijkstra's [-@Dijkstra1959] algorithm is called. In the example, it would be called twice in either case.

The algorithm by default derives the shortest paths from an origin to other cells of the same graph until all destination cells have been visited. This technique allows the algorithm to potentially stop before having visited each vertex. Yet, it comes at the cost of checking for each visited cell whether it is in the set of destinations. Hence, the default `early_stopping = TRUE` is efficient when point pairs are close to each other compared to the entirety of cells. If at least one points pair in an execution of the shortest paths algorithm is far from each other, the alternative `early_stopping = FALSE` can be faster. It derives the distance to all other cells and then picks the destinations cells from the result, avoiding the check for destination cells while iterating through the graph. `early_stopping = TRUE` and `early_stopping = FALSE` produce the same results, but differ in computational performance.

In a grid, the vector of distances between neighboring cells is made up of not that many unique and repeating values. That makes it efficient to precompute edge weights for the entire graph, and is the behavior which the default `pre = TRUE` induces. Computing individual edge weights while Dijkstra's [-@Dijkstra1959] algorithm runs, `pre = FALSE`, requires less RAM, as the function only stores one edge weight at a time, but is almost always much slower than precomputing them all jointly. So, setting `pre` to `FALSE` is one of the last options to consider, when the machine has insufficient RAM.

When `update_rst` is a list, there are two potential dimensions for parallelism. In iterating over the updated grids, the function could parallelize at the level of point connections, like in the base grid, or it could parallelize across grids, i.e. list elements of `update_rst`. By default the function chooses the latter, meaning `par_lvl` is `"update_rst"`. All connections in the grid updated with the first element of `update_rst` are handled by one core, all connections in the grid updated with the second element of `update_rst` are handled by another core, etc. If `par_lvl` is `"points"`, the function instead calls the former option. It computes all connections in the grid updated with the first element of `update_rst` in parallel, then computes all connections in the grid updated with the second element of `update_rst` in parallel, etc. The `par_lvl` argument only affects the grids updated with `update_rst` list elements. The unupdated base grid always uses the `"points"` strategy. Which `par_lvl` option is preferred depends on the number of recomputed paths and the number of `update_rst` list elements. Assume you run `shortest_paths` with 8 CPU cores, pass `update_rst` of two elements, and run the shortest paths algorithm 16 times per updated grid. `par_lvl = "update_rst"` would utilize two cores and leave six cores idle. `par_lvl = "points"`, in contrast, would use all 8 cores, commonly running the algorithm twice per core. It does not matter how many connections there are in total and how often Dijkstra's [-@Dijkstra1959] algorithm is called in the base grid. Only paths that are affected by the grid updating, i.e. where `update_rst` sets at least one cell on the path to `NA`, are recomputed.

Internally, `shortest_paths` stores paths in the form of cell numbers. By default, these are four byte signed integers, the same type R uses for integers. Only if `output` is `"lines"` or `"both"`, are these cell numbers converted to coordinates, usually two 8 byte double precision floating point numbers per cell, before the function returns. If `rst` has less than 65,535 non-`NA` cells, you can make use of 2 byte unsigned short integers instead. The `path_type = "unsigned short int"` option requires half as much RAM to store cell numbers as `path_type = "int"` demands, but is slower as it comes with type conversions from 4 byte signed integers. The results are the same, irrespective of which option you employ.

Another data type selection regards distances. `distance_type` defaults to the fastest and most precise option: double precision floating point numbers. `"double"` corresponds to the `numeric` type in R. It is an 8 byte type. Alternatively, you can choose 4 byte single precision floating point numbers (`"float"`), 4 byte signed integers (`"int"`), and 2 byte unsigned short integers (`"unsigned short int"`). Unlike `path_type`, `distance_type` changes the results. `"float"` stores distances between cells with lower precision than `"double"` does and `"int"` and `"unsigned short int"` round distances to integers. These deviations accumulate over a path and can induce marked differences in the result. Always test the effect in your application before choosing one of the less RAM demanding options. If `distance_type` is `"int"`, the distance between any cells, i.e. the sum of edge weights along any path, not just the returned ones, must not exceed 2,147,483,647. With `"unsigned short int"` that limit is 65,535, making it not applicable in most scenarios. `shortest_paths` does not check, if you meet these numerical constraints. The results are simply wrong, if you violate them. `"float"`, `"int"`, and `"unsigned short int"` are slower than `"double"` because they require type conversions.

# Contributions

Contributions to this package are highly welcome. You can submit them as a pull request to the `ChristianDueben/spaths` GitHub repository or via email to the maintainer email mentioned in the `DESCRIPTION` file. You may also build a package on top of `spaths`, as `movecost` and `leastcostpath` did with `gdistance`.

# FAQ ## Why do the lines look so angular, i.e. not really smooth?

This is because of the `contiguity` options that `spaths` implements: queen's and rook's case contiguity. As clarified above, these rules restrict the algorithm to traverse between directly adjacent cells only. Rook's case contiguity provides access to four neighbors spaced at 90° angles. Queen's case contiguity provides access to eight neighbors spaced at 45° angles. Incorporating types of contiguity that allow for traversing between second order neighbors would introduce a larger set of angles for the algorithm to choose from. However, this lets the function to skip over first degree neighbors. In the ship route example above, the ship could skip over land cells, such as islands or peninsulas. Hence, the lines are the shortest paths given the angles the algorithm may travel at.

## What object classes should I use for the inputs?

The recommendation is to use SpatRaster `rst`, SpatVector `origins` (and SpatVector `destinations`) objects when handling locations on Earth. The other classes also work, but the recommended classes tend to be the most appropriate in that they require little conversion and return output in a convenient format. If the data do not represent locations on Earth, you need to use a matrix or list of matrices as `rst` with a matrix or a data frame as `origins` (and `destinations`). Unlike a matrix, a data frame accepts `origin_names` and `destination_names` columns of a different type than the coordinates.

## What features will be added in the next updates?

The next features include alternative shortest paths algorithms, beyond Dijkstra's algorithm, and centrality measures. In the respective categories, the A* algorithm and closeness centrality will be first. How long it will take to publish these updates depends on the maintainer's academic career prospects.

# References