In lidR, the LAScatalog processing engine refers to the function catalog_apply(). This function is the core of the package and internally drives every other function capable of processing a LAScatalog, including clip_roi(), locate_trees(), rasterize_terrain(), decimate_points() and many others, as well as some experimental functions in third-party packages such as lidRplugins. This engine is powerful and versatile but also relatively hard to understand for new users, especially R beginners. This vignette documents how it works by going deeper and deeper inside the engine. It is highly recommended to read the vignette named LAScatalog formal class before this one, even though there is some overlap between the two vignettes.
When processing a LAScatalog, the area covered is split into chunks. A chunk can be seen as a square region of interest (ROI), and together the chunks cover the collection of files. The collection of files is not physically split; rather, the chunks correspond to a virtual splitting of the coverage. The chunks are then processed sequentially (one after the other) or in parallel, but always independently. To process each chunk, the corresponding point cloud is extracted from the collection of files and loaded into memory. Any function can be applied to these independent point clouds and the independent outputs are stored in a list. The list contains one output per chunk and, once every chunk is processed, the list is collapsed into a single continuous valid object such as a raster or a spatial vector. The roles of the catalog engine are to:
When using the LAScatalog engine, users only need to think about which function they want to apply over their coverage; all the above-mentioned features are managed internally. There are many possible processing tuning tools, which is why one may feel lost by all the options to consider. To simplify, we can split them into two categories of tools:
- High-level API: lidR functions that perform a given operation either on a LAS or a LAScatalog object transparently, in a straightforward way. For example, pixel_metrics(), rasterize_terrain(), and locate_trees() are high-level API. Rule of thumb: if the function catalog_apply() is not directly used, it is a high-level API. Processing options can be tuned with functions that start with opt_ (for option).
- Low-level API: catalog_apply() itself. This function is designed to build new high-level applications and is used internally by all the lidR functions, but it can also be used to build a user's own tools. Options can be tuned with the parameter .options of catalog_apply().

In the following vignette we will discuss the high-level API first, then the low-level API. The variables named ctg* will refer to a LAScatalog object in subsequent code.
The catalog engine takes care of making chunks and users can define the size of the chunks. By default this size is set to 0, meaning that a chunk is a file and thus each file will be processed sequentially. The chunk size is not the most important option. It is mainly intended for computers with limited memory, to avoid loading too many points at once. Reducing or increasing the chunk size does not modify the output; it only changes the memory used by controlling the quantity of points loaded at once. It is recommended to leave this option at 0, but sometimes it is a good idea to set a small chunk size to process particularly large files.
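For example, one can switch between processing by file and processing by small chunks (a minimal sketch; the plot call mirrors the ones used later in this vignette):

opt_chunk_size(ctg) <- 0    # default: one chunk = one file
plot(ctg, chunk = TRUE)
opt_chunk_size(ctg) <- 500  # 500 x 500 m chunks
plot(ctg, chunk = TRUE)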
Each chunk is loaded with a buffer around it to ensure that the independent point clouds are not affected by edge effects. For example, when computing a digital terrain model without a buffer, the terrain is poorly estimated at the edges of the point cloud because of the absence of context in the neighbourhood. The chunk buffer size is the most important option: a buffer that is too small can produce incorrect outputs. The default is 30 m, which is likely to be appropriate for most use cases.
opt_chunk_buffer(ctg) <- 0 # No buffer
plot(ctg, chunk = TRUE)
opt_chunk_buffer(ctg) <- 200 # 200 m buffer
plot(ctg, chunk = TRUE)
lidR functions always check that an appropriate buffer is set. For example, it is impossible to apply rasterize_terrain() with no buffer.
In some cases it might be suitable to control the alignment of the chunks to force a given pattern. This is a rare use case but the engine supports such a possibility. This option is obviously meaningless when processing by file. The chunk alignment is not the most important option and does not modify the output, but it may generate more, or fewer, chunks depending on the alignment. However, it might be very useful in the particular case of the function catalog_retile(), for example, to accurately control the new tiling pattern.
opt_chunk_size(ctg) <- 2000
opt_chunk_buffer(ctg) <- 0
plot(ctg, chunk = TRUE)
opt_chunk_size(ctg) <- 2000
opt_chunk_buffer(ctg) <- 0
opt_chunk_alignment(ctg) <- c(1000, 1000)
plot(ctg, chunk = TRUE)
clip_roi() is the single case where the control of the chunks is not respected. clip_roi() aims to extract a shape (rectangle, disc, polygon) as a single entity, so the chunk pattern does not make sense here.
When reading a single file with readLAS(), the option filter allows for discarding some points based on criteria. For example, -keep_first keeps only first returns and discards all non-first returns. It is important to understand that the discarded points are removed while reading and will never be loaded into memory. This is useful for processing only some points of interest, for example when computing metrics only on first returns above 2 m.
# Read all the points
readLAS("file.las")
# Only first returns above 2 m are loaded
readLAS("file.las", filter = "-keep_first -drop_z_below 2")
This works only when the readLAS() function is explicitly called. When using the catalog processing engine, readLAS() is called internally under the hood and users cannot explicitly set the filter argument. Instead, filter is propagated with the opt_filter() function. Internally, for each chunk, readLAS() will be called with filter = "-keep_first -drop_z_below 2" in every function that processes a LAScatalog, unless the documentation explicitly mentions otherwise. In the following examples, clip_circle() is used to extract a plot but only the points classified as ground or water will be read, and pixel_metrics() is used to compute metrics only on first returns above 2 meters. It does not mean the other points do not exist; they are simply not read.
opt_filter(ctg) <- "-keep_class 2 9"
las <- clip_circle(ctg, 273400, 5274500, 20)
opt_filter(ctg) <- "-keep_first -drop_z_below 2"
m <- pixel_metrics(ctg, .stdmetrics, 20)
Not all filters are appropriate everywhere. For example, the following is meaningless because it discards all the points that are needed to compute a DTM.
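A minimal sketch of such a meaningless setting (illustrative only): the ground points (class 2) are dropped even though rasterize_terrain() needs them.

# Drop the ground points, then try to build a DTM: this cannot work
opt_filter(ctg) <- "-drop_class 2"
dtm <- rasterize_terrain(ctg, 1, tin())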
When reading a single file with readLAS(), the option select allows for discarding some attributes of the points. For example, select = "ir" keeps only the intensity and the return number of each point and discards all the other attributes, such as the scan angle, the flags or the classification. Its role is only to save processing memory by not loading data that are not needed. Similarly to opt_filter(), the function opt_select() allows propagation of the argument select to readLAS().
# Load Intensity only
opt_select(ctg) <- "i"
las <- clip_circle(ctg, 273500, 5274500, 10)
# Load Intensity + Classification + ReturnNumber + NumberOfReturns
opt_select(ctg) <- "icrn"
las <- clip_circle(ctg, 273500, 5274500, 10)
However, this option is not always respected because many lidR functions already know which optimization they should apply. In the following example the 'classification' attribute is explicitly discarded, yet the creation of a DTM works because the function overwrites the user settings with something better (in this specific case xyzc).
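A minimal sketch of this behaviour (assuming rasterize_terrain() internally enforces the attributes it needs):

# The classification is explicitly discarded by the user...
opt_select(ctg) <- "xyz"
# ...but the DTM is computed anyway because the function loads "xyzc" internally
dtm <- rasterize_terrain(ctg, 1, tin())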
By default every function returns a continuous output within a single R object stored in memory, so it is immediately usable in the working environment in a straightforward way. For example, locate_trees() returns an sf/sfc_POINT object.
# Find the trees
trees <- locate_trees(ctg2, lmf(3))
# The trees are immediately usable in subsequent analyses
trees
#> Simple feature collection with 17865 features and 2 fields
#> Attribute-geometry relationship: 2 constant, 0 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension: XYZ
#> Bounding box: xmin: 481260.8 ymin: 3812980 xmax: 483299.6 ymax: 3816011
#> Projected CRS: NAD83 / UTM zone 12N
However, in some cases this behaviour might not be suitable, especially for a large collection of files that covers a broad area. In this case the computer might not be able to handle so much data and/or will run into trouble when merging the different independent chunks into a single object. This is why the engine has the capability to write the output of each independent chunk to disk. The function opt_output_files() can be used to set the path where the output of each chunk should be saved. This path is templated so the engine is able to create a different file name for each chunk. The general form is the following:
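A minimal sketch of the general form (the folder and file name prefix are placeholders):

opt_output_files(ctg) <- "/path/to/folder/filename_{XCENTER}_{YCENTER}"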
The templates can be XCENTER, YCENTER, XLEFT, YBOTTOM, XRIGHT, YTOP, ID or ORIGINALFILENAME. The templated string does not contain the file extension: the engine guesses the extension, and this works regardless of the type of output. For example:
# Force the results to be written on disk
opt_output_files(ctg2) <- paste0(tempdir(), "/tree_coordinate_{XLEFT}_{YBOTTOM}")
trees <- locate_trees(ctg2, lmf(3))
# The output has been modified by these options and it now gives
# the paths to the written files (here shapefiles)
trees
#> "/tmp/RtmpJQHPNz/tree_coordinate_481200_3812900.shp" "/tmp/RtmpJQHPNz/tree_coordinate_481300_3812900.shp" "/tmp/RtmpJQHPNz/tree_coordinate_481200_3813000.shp"
#> [4] "/tmp/RtmpJQHPNz/tree_coordinate_481300_3813000.shp"
There is one file per chunk, and thus processing by file with the template {ORIGINALFILENAME}, or its shortcut {*}, may be suitable to further match the output files with the point cloud files. However, depending on the size of the files and the capacity of the computer, this is not always possible (see section "Control of the chunk size").
In the previous example, several shapefiles were written on disk and there is no way to combine them into a single light R object. But sometimes it is possible to return a light object that aggregates all the written files. For rasters, for example, it is possible to build a virtual raster mosaic. The engine automatically does this when it is possible.
# Force the results to be written on disk
opt_output_files(ctg2) <- paste0(tempdir(), "/tree_coordinate_{XLEFT}_{YBOTTOM}")
chm <- rasterize_canopy(ctg2, 1, p2r())
# Many rasters have been written on disk
# but a light raster has been returned anyway
chm
#> class : RasterLayer
#> dimensions : 90, 90, 8100 (nrow, ncol, ncell)
#> resolution : 1, 1 (x, y)
#> extent : 481260, 481350, 3812921, 3813011 (xmin, xmax, ymin, ymax)
#> crs : +proj=utm +zone=12 +datum=NAD83 +units=m +no_defs
#> source : /tmp/RtmpZVJ2hy/rasterize_canopy.vrt
#> names : tree_coordinate_481260_3812921
#> values : 0, 32.07 (min, max)
There are two special cases. The first one is when raster files are written: they are merged into a valid RasterLayer using a virtual mosaic. The second one is when LAS or LAZ files are written to disk: they are merged into a LAScatalog.
opt_output_files(ctg2) <- "{tempdir()}/plot_{ID}"
new_ctg <- clip_circle(ctg2, x, y, 20)
new_ctg
#> class : LAScatalog (v1.2 format 0)
#> extent : 32.372, 163.136, 38.494, 198.636 (xmin, xmax, ymin, ymax)
#> coord. ref. : NAD83 / UTM zone 17N
#> area : 3895.031 m²
#> points : 44 points
#> density : 8 points/m²
#> num. files : 4
By default, rasters are written in GeoTiff files; spatial points, spatial polygons and spatial lines, either in sp or sf formats, are written in ESRI shapefiles; point clouds are written in las files; and tables are written in csv files. This can be modified at any time by users, but it corresponds to advanced settings and is thus discussed in a later section about advanced usages. However, the function opt_laz_compression() is a simple shortcut to switch from las to laz when writing a point cloud.
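For example, compression can be enabled with a single option:

opt_laz_compression(ctg) <- TRUE  # write compressed .laz files instead of .las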
The engine provides a real-time display of the progress that serves two purposes: (a) to follow the progression and (b) to help troubleshooting. It is enabled by default, and when a LAScatalog is being processed a chart is displayed in which the chunks are progressively coloured. No colour: the chunk is pending. Blue: the chunk is being processed. Green: the chunk has been processed. Orange: the chunk has been processed with a warning. Red: the chunk processing failed.
This can be disabled with:
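opt_progress(ctg) <- FALSE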
If a chunk produces a warning, this is rendered in real time with an orange colouring. However, the message(s) of the warning(s) are delayed and printed only at the end of the computation.
If a chunk produces an error, this is also rendered in real time. The computation stops as usual, but the error is handled in such a way that the code does not actually fail. The function returns a partial output, i.e. the output of the chunks that were computed successfully. A message is printed telling the user what the error was and where it occurred, and suggests loading this specific section of the catalog to investigate what went wrong. In the following case one tile was not classified, so it failed.
#> An error occurred when processing the chunk 9. Try to load this chunk with:
#> chunk <- readRDS("/tmp/RtmpTozKLu/chunk9.rds")
#> las <- readLAS(chunk)
#> No ground point found.
The engine logged the chunk, and it is easy to load this specific processing region for further investigation by copy-pasting the suggested code.
It is possible to restart the computation at a given chunk to produce the missing parts. In the previous example chunk number 9 failed, so we can restart at chunk 9:
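A minimal sketch, assuming the opt_restart() processing option:

opt_restart(ctg) <- 9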
The engine is also able to bypass errors. This behaviour can be enabled with opt_stop_early(), and the computation will then run to the end in any case. Chunks that failed will be missing and the output will contain holes with missing data. We do not recommend the use of this option because other errors will be bypassed as well without triggering any informative message. This option exists and can be useful if used carefully. Users should always try to fix any problems.
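A minimal sketch (assuming the default is TRUE, i.e. the engine stops at the first error):

opt_stop_early(ctg) <- FALSE  # keep processing the remaining chunks after a failure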
Sometimes some chunks are empty and we discover this only when loading the region of interest. This may happen when the chunk pattern is different from the tiling pattern, because some chunks may have been created that do not actually encompass any points. This is the case when a file is only partially populated, because the bounding box of the file/tile is bigger than its actual content; it often happens at the edge of the collection. In this case the engine displays the chunks in gray. This may also happen if filter is set in such a way that a lot of points are discarded and, in some chunks, all the points happen to be discarded. Those chunks are also displayed in gray.
The engine takes care of processing the chunks either sequentially or in parallel. The parallelization can be done on a single machine or on multiple machines by sending the chunks to different workers. The parallelization is performed using the framework provided by the future package, so understanding the basics of this package is required. To activate a parallelization strategy, users need to register a strategy with future. For example:
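library(future)
plan(multisession, workers = 4L)  # four local R sessions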
Now 4 chunks will be processed at a time and the engine monitoring displays the processing chunks in real time:
However, parallel processing is not magic. First of all, it loads several point clouds at once because several chunks are read at the same time; users must ensure they have enough RAM to support that. Second, there are strong overheads incurred when parallelizing tasks. Putting all cores on a task does not always make it faster!
In lidR, if some files overlap each other, a message is printed to alert users about potential problems.
ctg <- readLAScatalog("path/to/folder/")
#> Be careful, some tiles seem to overlap each other. lidR may return incorrect outputs with edge artifacts when processing this catalog.
Overlapping is not an issue by itself. The actual problem is duplicated points: because lidR makes arbitrary chunks, users are likely to load the same points twice if some areas are duplicated in the original dataset. Below are a few classical cases of overlapping files:
The files are already buffered: the engine is designed to work with a continuous dataset. If files are already buffered, users have several options to process the collection, but they all imply driving without a seatbelt. The first and best option is to skip the buffer at read time, assuming that the buffer points are flagged in some way, for example flagged as withheld:
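A minimal sketch, assuming the buffer points carry the withheld flag:

opt_filter(ctg) <- "-drop_withheld"  # discard the pre-computed buffer at read time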
If it is not the case, one can try to retile the LAScatalog with a negative buffer to remove the buffer.
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- "/data/unbufferd/{ORIGINALFILENAME}"
opt_chunk_buffer(ctg) <- -40
new_ctg <- catalog_retile(ctg)
It is also possible to process by file without a buffer by disabling all the internal controls with opt_wall_to_wall(). Doing so removes the guarantee of obtaining a correct, valid, continuous result. This is a free-wheel mode.
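A minimal sketch (not recommended):

opt_wall_to_wall(ctg) <- FALSE  # disable the internal wall-to-wall guarantees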
The files are flight-lines: when a file records a single flight transect, files are overlapping, of course, but this does not create regions with duplicated points. You can process normally and ignore the message. The engine will take care of making independent and buffered chunks.
There are some duplicated files: a few tiles appear more than once. When plotting the LAScatalog, the light blue being transparent, the overlapping regions appear darker. You must remove these duplicated files.
Other: there are certainly many other reasons to get overlaps.
It is possible to process only a subregion of the collection by flagging some files. In this case only the flagged files will be processed, but the neighbouring tiles will still be used to load a buffer so the local context beyond the processed tiles is not lost. In the figure below, 2 files are flagged processed and the display plots the other tiles almost white.
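A minimal sketch of how files can be flagged (the file indices are hypothetical):

ctg$processed <- FALSE         # do not process any file by default
ctg$processed[c(5, 6)] <- TRUE # process only the 5th and 6th files
plot(ctg)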
As mentioned in a previous section, when an error stops the computation the output is never NULL; it contains a valid output computed from the part of the catalog that was processed. It is a partial but valid output. The computation can be restarted from the failing chunk.
The low-level API refers to the use of catalog_apply(). This function drives every other one and is intended to be used by developers to create new processing routines that do not exist in lidR. When catalog_apply() is running, it respects all the processing options we have seen so far plus some developer's constraints that are not intended to be modified by users. catalog_apply() can map any R function, but such functions must respect some rules.
A function mapped by catalog_apply() must take a chunk as input and must read the chunk using readLAS(). So a valid function starts like this:
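A minimal skeleton (a sketch of the expected structure, not a complete routine):

routine <- function(chunk) {
  las <- readLAS(chunk)
  # ... computation on the loaded point cloud ...
}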
The size of the chunks, the size of the buffer and the positioning of the chunks depend on the options carried by the LAScatalog and set with the opt_chunk_*() functions. In addition, the select and filter arguments cannot be specified in readLAS(): they are controlled by opt_filter() and opt_select(), as seen in previous sections. The issue is that, since chunks of any size can be defined, it is possible to create chunks that overlap a tile but do not contain any points. So sometimes readLAS(chunk) may return a point cloud with 0 points (see section 'empty chunks'), in which case subsequent code is likely to fail. As a consequence, the function mapped by catalog_apply() must check for empty chunks and return NULL in that case. When NULL is returned, the engine understands that the chunk was empty.
The following code does not work because catalog_apply() checks internally that the function returns NULL for empty chunks.
routine <- function(chunk){
las <- readLAS(chunk)
}
catalog_apply(ctg, routine)
#> Error: User's function does not return NULL for empty chunks. Please see the documentation of catalog_apply.
The following code is valid:
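A sketch of the same routine with the required check for empty chunks:

routine <- function(chunk){
  las <- readLAS(chunk)
  if (is.empty(las)) return(NULL)  # mandatory: tell the engine the chunk was empty
}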
When reading a chunk with readLAS(), the LAS object read is a point cloud that encompasses the chunk plus the buffer. This LAS object has an extra attribute named buffer that records whether each point is in the buffered region or not: 0 means no buffer, while 1, 2, 3 and 4 refer, respectively, to the bottom, left, top and right buffers. If plotted, this is what could be seen.
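A minimal sketch of how this attribute can be used, for example to keep only the points that are not in the buffer:

las_core <- filter_poi(las, buffer == 0)  # drop the buffer points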
The chunk is formally a LAScluster object. This is a lightweight object that roughly contains the names of the files that need to be read, the extent of the chunk and some metadata used internally.
print(chunk)
#> class : LAScluster
#> features : 1
#> extent : 684800, 684950, 5017810, 5017960 (xmin, xmax, ymin, ymax)
#> crs : +proj=utm +zone=17 +datum=NAD83 +units=m +no_defs
raster::extent(), terra::ext() and sf::st_bbox() are valid functions that return the bounding box of the chunk. The bounding box of the chunk is the bounding box without the buffer.
raster::extent(chunk)
#> class : Extent
#> xmin : 684800
#> xmax : 684950
#> ymin : 5017810
#> ymax : 5017960
sf::st_bbox(chunk)
#> xmin ymin xmax ymax
#> 684800 5017810 684950 5017960
Being able to retrieve the true bounding box allows for removal of the buffer. Indeed, the point cloud is loaded with a buffer, and all subsequent computation will produce an output with the buffer included in the results. At the end the engine will merge everything and the buffered regions would appear twice or more. So the final output must be un-buffered by some means. Examples can be found in the documentation of catalog_apply(), in this vignette, or in the source code of lidR.
Sometimes we need to control the processing options carried by the LAScatalog to ensure that users do not use invalid options. For example, rasterize_terrain() ensures that the buffer is not 0.
opt_chunk_buffer(ctg) <- 0
rasterize_terrain(ctg, 1, tin())
#> Error: A buffer greater than 0 is required to process the catalog.
This can be reproduced with the option need_buffer = TRUE:
routine <- function(chunk){
las <- readLAS(chunk)
if (is.empty(las)) return(NULL)
}
options = list(need_buffer = TRUE)
catalog_apply(ctg, routine, .options = options)
#> Error: A buffer greater than 0 is required to process the catalog.
Other options are documented in the manual. They serve to make new high-level functions safe. The main idea is that developers programming new tools with catalog_apply() are expected to know what they are doing, but when providing new functions to third-party users or to collaborators in a high-level way we are never safe. This is why the catalog engine provides options to control the inputs so that other users won't use your tools incorrectly.
By default catalog_apply() returns a list with one output per chunk.
routine <- function(chunk){
las <- readLAS(chunk) # read the chunk
if (is.empty(las)) return(NULL) # exit if empty
ttop <- locate_trees(las, lmf(3)) # make any computation
ttop <- sf::st_crop(ttop, st_bbox(chunk)) # remove the buffer
return(ttop)
}
out <- catalog_apply(ctg, routine)
class(out)
#> [1] "list"
print(out[[1]])
#> Simple feature collection with 178 features and 2 fields
#> Attribute-geometry relationship: 2 constant, 0 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension: XYZ
#> Bounding box: xmin: 481260.8 ymin: 3812980 xmax: 483299.6 ymax: 3816011
#> Projected CRS: NAD83 / UTM zone 12N
This list must be reduced after catalog_apply(). The way to merge depends on the contents of the list. Here the list contains sf objects, so we can rbind the list.
out <- do.call(rbind, out)
print(out)
#> Simple feature collection with 17865 features and 2 fields
#> Attribute-geometry relationship: 2 constant, 0 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension: XYZ
#> Bounding box: xmin: 481260.8 ymin: 3812980 xmax: 483299.6 ymax: 3816011
#> Projected CRS: NAD83 / UTM zone 12N
But in practice a safe merging is not always trivial and it is annoying to do it manually. The engine supports automerging of Spatial*, sf, Raster*, stars, SpatRaster, data.frame and LAS objects.
options <- list(automerge = TRUE)
out <- catalog_apply(ctg, routine, .options = options)
print(out)
#> Simple feature collection with 17865 features and 2 fields
#> Attribute-geometry relationship: 2 constant, 0 aggregate, 0 identity
#> Geometry type: POINT
#> Dimension: XYZ
#> Bounding box: xmin: 481260.8 ymin: 3812980 xmax: 483299.6 ymax: 3816011
#> Projected CRS: NAD83 / UTM zone 12N
In the worst case the type is not supported and a list is returned anyway without failure.
catalog_apply() is not really intended to be used directly. It is intended to be used and hidden within more user-friendly functions referred to as the high-level API, i.e. functions intended to be used by third-party users. Let's assume we have designed a new application to identify dead trees, and let's call this function find_deadtrees(). This function takes a point cloud as input and returns an sf with the positions of the dead trees plus some attributes.
We now want to make this function work with a LAScatalog in the same way as all the other lidR functions. One option is the following. First, the function checks the input: if it is a LAScatalog, it makes use of catalog_apply() to apply itself. The function will then be fed with LAScluster objects, so we test if the input is a LAScluster and, if it is, we read the chunk and apply the function on the loaded point cloud. Finally, if the input is a LAS we perform the actual computation.
find_deadtrees <- function(las, param1, param2)
{
if (is(las, "LAScatalog")) {
options <- list(automerge = TRUE, need_buffer = TRUE)
dead_trees <- catalog_apply(las, find_deadtrees, param1 = param1, param2 = param2, .options = options)
return(dead_trees)
}
else if (is(las, "LAScluster")) {
bbox <- st_bbox(las)
las <- readLAS(las)
if (is.empty(las)) return(NULL)
dead_trees <- find_deadtrees(las, param1, param2)
dead_trees <- sf::st_crop(dead_trees, bbox)
return(dead_trees)
}
else if (is(las, "LAS")) {
# make an advanced computation
return(dead_trees)
}
else {
stop("This type is not supported.")
}
}
The function now works seamlessly both on a LAS and on a LAScatalog. We created a function find_deadtrees() that can be applied to a whole collection of files, in parallel or not, on multiple machines or not, that returns a valid continuous sf object, that handles errors nicely, that can optionally write the output of each chunk, and so on…
The following sections concern both high-level and low-level APIs and present some advanced use cases.
Using the processing option opt_output_files(), users can tell the engine to sequentially write the outputs to a drive with custom filenames. In the default settings, rasters are written in GeoTiff files; vectors, either in sp or sf formats, are written in ESRI shapefiles; point clouds are written in las files; and data.frames are written in csv files. This can be modified at any time, as documented in help("lidR-LAScatalog-drivers"). If the output of the function mapped by catalog_apply() is not a raster, a vector or a data.frame, the function will fail when writing the output.
routine <- function(chunk){
las <- readLAS(chunk)
if (is.empty(las)) return(NULL)
# Do some computation
# output is a list
return(output)
}
opt_output_files(ctg) <- "{tempdir()}/{ID}"
catalog_apply(ctg, routine)
#> Error: Trying to write an object of class 'list' but this type is not supported.
It is possible to create a new driver. We could, for example, write the list in a .rds file. This is done by creating a new entry named after the class of the object that needs to be written, here a list.
ctg@output_options$drivers$list <- list(
  write = base::saveRDS,          # function used to write the object
  object = "object",              # name of the argument that receives the object
  path = "file",                  # name of the argument that receives the file path
  extension = ".rds",             # file extension appended to the templated name
  param = list(compress = TRUE))  # extra parameters passed to the write function
More details can be found in help("lidR-LAScatalog-drivers") or on the corresponding wiki page.
This vignette does not cover the set-up required to make two or more machines able to communicate; there is a wiki page that covers some of this subject. In this section we assume that the user is able to connect to remote machines via SSH. Assuming a user has remote access to multiple machines and such machines are all able to read files from the same storage, the multi-machine parallelization is straightforward because it is driven by the future package. The only thing users need to do is to register a remote strategy:
library(future)
plan(remote, workers = c("localhost", "[email protected]", "[email protected]", "[email protected]"))
Internally, each chunk is sent to a worker. Remember, a chunk is not a subset of the point cloud. What is sent to each worker is instead a tiny internal object named LAScluster that roughly contains only the bounding box of the region of interest and the files that need to be read to load the corresponding point cloud. Consequently, the bandwidth needed to send a workload to each worker is virtually nil. This is also why the function mapped by catalog_apply() should start with readLAS(): each worker reads its own data using the paths provided by the LAScatalog.
For cheap multi-machine parallelization using several regular computers on a local network, the machines won't necessarily have access to shared storage, so a copy of the data is mandatory on each machine. If the data are accessible via the exact same path on each machine, it will work smoothly. However, if the data are not available via the same paths, it is possible to provide alternative directories where the files can be found.
This is covered by this wiki page.