Think Globally, Fit Locally (Saul and Roweis 2003)
Modeling spectral data has garnered wide interest in the last four decades. Spectroscopy is the study of the spectral response of a matrix (e.g. soil, plant material, seeds, etc.) when it interacts with electromagnetic radiation. This spectral response directly or indirectly relates to a wide range of compositional characteristics (chemical, physical or biological) of the matrix. Therefore, it is possible to develop empirical models that can accurately quantify properties of different matrices. In this respect, quantitative spectroscopy techniques are usually fast, non-destructive and cost-efficient in comparison to conventional laboratory methods used in the analyses of such matrices. This has resulted in the development of comprehensive spectral databases for several agricultural products comprising large amounts of observations. The size of such databases increases de facto their complexity. To analyze large and complex spectral data, one must then resort to numerical and statistical tools and methods such as dimensionality reduction, and local spectroscopic modeling based on spectral dissimilarity concepts.
The aim of the resemble
package is to provide tools to
efficiently and accurately extract meaningful quantitative information
from large and complex spectral databases. The core functionalities of
the package include:
Simply type and you will get the info you need:
citation(package = "resemble")
## To cite resemble in publications use:
##
## Ramirez-Lopez, L., and Stevens, A., and Viscarra Rossel, R., and
## Shen, Z., and Wadoux, A., and Breure, T. (2024). resemble: Regression
## and similarity evaluation for memory-based learning in spectral
## chemometrics. R package Vignette R package version 2.2.3.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{resemble-package,
## title = {resemble: Regression and similarity evaluation for memory-based learning in spectral chemometrics.},
## author = {Leonardo Ramirez-Lopez and Antoine Stevens and Claudio Orellano and Raphael {Viscarra Rossel} and Zefang Shen and Alex Wadoux and Timo Breure},
## publication = {R package Vignette},
## year = {2024},
## note = {R package version 2.2.3},
## url = {https://CRAN.R-project.org/package=resemble},
## }
This vignette uses the soil Near-Infrared (NIR) spectral dataset
provided in the package
prospectr
package (Stevens and Ramirez-Lopez
2024). The reason why we use this dataset is because soils
are one of the most complex matrices analyzed with NIR spectroscopy.
This spectral dataset/library was used in the challenge by Pierna and Dardenne (2008). The library contains NIR
absorbance spectra of dried and sieved 825 soil observations/samples.
These samples originate from agricultural fields collected from all over
the Walloon region in Belgium. The data are in an R
data.frame
object which is organized as follows:
Response variables:
Nt (Total Nitrogen in g/kg of dry soil): a numerical variable (values are available for 645 samples and missing for 180 samples).
Ciso (Carbon in g/100 g of dry soil): a numerical variable (values are available for 732 and missing for 93 samples).
CEC (Cation Exchange Capacity in meq/100 g of dry soil): A numerical variable (values are available for 447 and missing for 378 samples).
Predictor variables: the predictor variables are
in a matrix embedded in the data frame, which can be accessed via
NIRsoil$spc
. These variables contain the NIR absorbance
spectra of the samples recorded between the 1100 nm and 2498 nm of the
electromagnetic spectrum at 2 nm interval. Each column name in the
matrix of spectra represents a specific wavelength (in nm).
Set: a binary variable that indicates whether the samples belong to the training subset (represented by 1, 618 samples) or to the test subset (represented by 0, 207 samples).
Load the necessary packages and data:
The dataset can be loaded into R as follows:
This step aims at improving the signal quality of the spectra for
quantitative analysis. In this respect, the following standard methods
are applied using the package
prospectr
(Stevens and Ramirez-Lopez
2024):
# obtain a numeric vector of the wavelengths at which spectra is recorded
wavs <- NIRsoil$spc %>% colnames() %>% as.numeric()
# pre-process the spectra:
# - resample it to a resolution of 6 nm
# - use first order derivative
new_res <- 5
poly_order <- 1
window <- 5
diff_order <- 1
NIRsoil$spc_p <- NIRsoil$spc %>%
resample(wav = wavs, new.wav = seq(min(wavs), max(wavs), by = new_res)) %>%
savitzkyGolay(p = poly_order, w = window, m = diff_order)
new_wavs <- as.matrix(as.numeric(colnames(NIRsoil$spc_p)))
matplot(x = wavs, y = t(NIRsoil$spc),
xlab = "Wavelengths, nm",
ylab = "Absorbance",
type = "l", lty = 1, col = "#5177A133")
matplot(x = new_wavs, y = t(NIRsoil$spc_p),
xlab = "Wavelengths, nm",
ylab = "1st derivative",
type = "l", lty = 1, col = "#5177A133")
Both the raw absorbance spectra and the first derivative spectra are shown in Figure @ref(fig:plotspectra). The first derivative spectra represents the explanatory variables that will be used for all the examples throughout this document.
For more explicit examples, the NIRsoil
data is split
into training and testing subsets:
# training dataset
training <- NIRsoil[NIRsoil$train == 1, ]
# testing dataset
testing <- NIRsoil[NIRsoil$train == 0, ]
In the resemble package we use the following notation (Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. (2013)):
Training observations:
Xr
stands for the matrix of predictor variables in
the reference/training set (spectral data for calibration).
Yr
stands for the response variable(s) in the
reference/training set (dependent variable for calibration). In the
context of this package, Yr
is also referred as to
“side information”, which is a variable or set
of variables that are associated to the training observations that can
also be used to support or guide optimization during modeling, but that
not necessarily are part of the input of such models. For example, we
will see in latter sections that Yr
can be used in
Principal Component Analysis to help on deciding how many components are
optimal.
Testing observations:
Xu
stands for the matrix of predictor variables in
the unknown/test set (spectral data for validation/testing).
Yu
stands for the response variable(s) in the
unknown/test set (dependent variable for testing).
When conducting exploratory analysis of spectral data, we face the curse of dimensionality. It is such that we may be dealing with (using NIR spectra data as an example) hundreds to thousands of individual wavelengths for each spectrum. When one wants to find patterns in the data, spectral similarities and differences, or detect spectral outliers, it is necessary to reduce the dimension of the spectra while retaining important information.
Principal Component (PC) analysis and Partial Least Squares (PLS) decomposition methods assume that the meaningful structure the data intrinsically lies on a lower dimensional space. Both methods attempt to find a projection matrix that projects the original variables onto a less complex subspaces (represented by new few variables). These new variables mimic the original variability across observations. These two methods can be considered as the standard ones for dimensionality reduction in many fields of spectroscopic analysis.
The difference between PC and PLS is that in PC the objective is to find few new variables (which are orthogonal) that capture as much of the original data variance while in the latter the objective is to find few new variables that maximize their variance with respect to a set of one or more external variables (e.g. response variables or side information variables).
In PC and PLS the input spectra (X, of n × d dimensions) is decomposed into two main matrices: a matrix of scores (T) and a matrix of ladings (P), so that:
X = T P + ε where the dimensions of T and P are n × o and o × d, and where o represents a given number of components being retained and ε represents the reconstruction error. The maximum o (number of components) that can be retrieved is limited to min(n − 1, d). One interesting property of P is that it is equivalent to P−1. This implies that when the PC decomposition is estimated for a given set of observations (Xnew) the resulting P matrix can be directly used to project new spectra onto the same principal component space by: Tnew = Xnew P′
In the resemble
package, PC analysis and PLS
decomposition are available through the ortho_projection()
function which offers the following algorithms:
"pca"
: the standard method for PC analysis based on
the singular value decomposition algorithm.
"pca.nipals"
: this algorithm uses the non-linear
iterative partial least squares algorithm (NIPALS, H. Wold
1975) for the purpose of PC analysis.
"pls"
: Here, PLS decomposition also uses the NIPALS
algorithm, but in this case it makes use of side
information, which can be a variable or set of variables
that are associated to the training observations and
that are used to project the data. In this case, the variance between
the projected variables and the side
information variable(s) is maximized.
The PC analysis of the training spectra can be executed as follows:
# principal component (pc) analysis with the default
# method (singular value decomposition)
pca_tr <- ortho_projection(Xr = training$spc_p, method = "pca")
pca_tr
Plot the ortho_projection
object:
## Warning in par(opar): argument 1 does not name a graphical parameter
The code above shows that in this dataset, 7 components are required to explain around 97% of the original variance found in the spectra (Figure @ref(fig:plotpcsvariance)).
Equivalent results can be obtained with the NIPALS algorithm:
# principal component (pc) analysis with the default
# NIPALS algorithm
pca_nipals_tr <- ortho_projection(Xr = training$spc_p,
method = "pca.nipals")
pca_nipals_tr
The advantage of the NIPALS algorithm is that it can be faster than SVD when only few components are required.
For a PLS decomposition the method
argument is set to
"pls"
. In this case, side information (Yr
) is
required. In the following example, the side information used is the
Total Carbon (Ciso
):
# Partial Least Squares decomposition using
# Total carbon as side information
# (this might take some seconds)
pls_tr <- ortho_projection(Xr = training$spc_p,
Yr = training$Ciso,
method = "pls")
pls_tr
Note that in the previous code, for PLS projection the observations
with missing training$Ciso
are hold out, and then the
projection takes place. The missing observations are projected with the
resulting projection matrix and pooled together with the initial
results.
By default the ortho_projection()
function retains all
the first components that, alone, account for at least 1% of the
original variance of data. In the following section we will see that the
function also offers additional options that might be more convenient
for choosing the number of components.
Those options can be specified using the pc_selection
argument. The following options are all the ones available for that
purpose:
"var"
(default option):Those components that alone explain more than a given amount of the original spectral variance are retained. Example:
"cumvar"
:Only the first components that together explain at least a given amount of the original variance are retained. Example:
"opc"
:This is a more sophisticated method in which the selection of the components is based on the side information concept presented in Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. (2013). First let P be a sequence of retained components (so that P = 1, 2, ..., k). At each iteration, the function computes a dissimilarity matrix retaining pi components. The values in this side information variable are compared against the side information values of their most spectrally similar observations. The optimal number of components retrieved by the function is the one that minimizes the root mean squared differences (RMSD) in the case of continuous variables, or maximizes the kappa index in the case of categorical variables. The RMSD is calucated as follows:
where j(i) = NN(xri, Xr{−i})
represents a function to obtain the index of the nearest neighbor
observation found in Xr (excluding the ith observation) for xri,
yi is the
value of the side variable of the ith observation, yj(i)
is the value of the side variable of the nearest neighbor of the ith observation and m is the total number of
observations. Note that for the "opc"
method
Yr
is required (i.e. the side information of the
observations). Type help(sim_eval)
in the R
console to get more details on how the RMSD and kappa are calculated in
the function.
The rationale behind the "opc"
method is based on the
assumption that the closer two observations are in terms of their
explanatory variables (Xr
), the closer they may be in terms
of their side information (Yr
).
# This uses optimal component selection
# variation in training$spc_p
optimal_sel <- list(method = "opc", value = 40)
pca_tr_opc <- ortho_projection(Xr = training$spc_p,
Yr = training$Ciso,
method = "pca",
pc_selection = optimal_sel)
pca_tr_opc
In the example above, 11 components are required to represent the space in which the overall Total Carbon difference between each sample and its corresponding nearest neighbor is minimized. The following graph shows how the RMSD varies as a function of the number of components (Figure @ref(fig:pcrmsd)):
The following code exemplifies how the RMSD is calculated (only for the 11th component, Figure @ref(fig:rmsdscatter)):
# compute the dissimilarity matrix using all the retained scores
pc_diss <- f_diss(pca_tr_opc$scores, diss_method = "mahalanobis")
# get the nearest neighbor for each sample
nearest_n <- apply(pc_diss, MARGIN = 1, FUN = function(x) order(x)[2])
# compute the RMSD
rmsd <- sqrt(mean((training$Ciso - training$Ciso[nearest_n])^2, na.rm = TRUE))
rmsd
## [1] 0.8570114
# the RSMD for all the components is already available in
# ...$opc_evaluation
pca_tr_opc$opc_evaluation[pca_tr_opc$n_components, , drop = FALSE]
## pc rmsd_Yr
## 11 11 0.8570114
plot(training$Ciso[nearest_n],
training$Ciso,
ylab = "Ciso of the nearest neighbor, %", xlab = "Ciso, %",
col = "#D19C17CC", pch = 16)
grid()
"manual"
:The user explicitly defines how many components to retrieve. Example:
# This uses manual component selection
manual_sel <- list(method = "manual", value = 9)
# PC
pca_tr_manual <- ortho_projection(Xr = training$spc_p,
method = "pca",
pc_selection = manual_sel)
pca_tr_manual
# PLS
pls_tr_manual <- ortho_projection(Xr = training$spc_p,
Yr = training$Ciso,
method = "pls",
pc_selection = manual_sel)
pls_tr_manual
Both PC and PLS methods generate projection matrices that can be used
to project new observations onto the new lower dimensional score space
they were built for. In the case of PC analysis this projection matrix
is equivalent to the transposed matrix of loadings. The
predict
method along with a projection model can be used to
project new data:
optimal_sel <- list(method = "opc", value = 40)
# PLS
pls_tr_opc <- ortho_projection(Xr = training$spc_p,
Yr = training$Ciso,
method = "pls",
pc_selection = optimal_sel,
scale = TRUE)
# the pls projection matrix
pls_tr_opc$projection_mat
pls_projected <- predict(pls_tr_opc, newdata = testing$spc_p)
# PC
pca_tr_opc <- ortho_projection(Xr = training$spc_p,
Yr = training$Ciso,
method = "pca",
pc_selection = optimal_sel,
scale = TRUE)
# the pca projection matrix
t(pca_tr_opc$X_loadings)
pca_projected <- predict(pca_tr_opc, newdata = testing$spc_p)
The ortho_projection()
function allows to project two
separate datasets in one run. For example, training and testing data can
be passed to the function as follows:
optimal_sel <- list(method = "opc", value = 40)
pca_tr_ts <- ortho_projection(Xr = training$spc_p,
Xu = testing$spc_p,
Yr = training$Ciso,
method = "pca",
pc_selection = optimal_sel,
scale = TRUE)
plot(pca_tr_ts)
In the above code for PC analyisis, training
and
testing
datasets are pooled together and then projected and
split back for presenting the final results. For the opc
selection method, the dissimilarity matrices are built only for the
training
data and for the observations with available side
information (Total Carbon). These dissimilarity matrices are used only
to find the optimal number of PCs. Note that Xr
and
Yr
refer to the same observations. Also note that the
optimal number of PCs might not be the same as when testing
is not passed to the Xu
argument since the PC projection
model is built from a different pool of observations.
In the case of PLS, the observations used for projection necessarily
have to have side information available, therefore the missing values in
Yr
are hold out during the projection model building. For
these samples, the final projection matrix is use to project them into
the PLS space.
optimal_sel <- list(method = "opc", value = 40)
pls_tr_ts <- ortho_projection(Xr = training$spc_p,
Xu = testing$spc_p,
Yr = training$Ciso,
method = "pls",
pc_selection = optimal_sel,
scale = TRUE)
# the same PLS projection model can be obtained with:
pls_tr_ts2 <- ortho_projection(Xr = training$spc_p[!is.na(training$Ciso),],
Yr = training$Ciso[!is.na(training$Ciso)],
method = "pls",
pc_selection = optimal_sel,
scale = TRUE)
identical(pls_tr_ts$projection_mat, pls_tr_ts2$projection_mat)
The ortho_projection()
function allows to pass more than
one variable to Yr
(side information):
optimal_sel <- list(method = "opc", value = 40)
pls_multi_yr <- ortho_projection(Xr = training$spc_p,
Xu = testing$spc_p,
Yr = training[, c("Ciso", "Nt", "CEC")],
method = "pls",
pc_selection = optimal_sel,
scale = TRUE)
plot(pls_multi_yr)
In the above code for PLS projections using multivariate side
information, the PLS2 method (based on the NIPALS algorithm) is used
(see S. Wold
et al. 1983). The optimal component selection
(opc
) also uses the multiple variables passed to
Yr
, the RMSD is computed for each of the variables. Each
RMSD is then standardized and the final RMSD used for optimization is
their average. For the example above, this data can be accessed as
follows:
For PC analysis multivariate side information is also allowed for the
opc
method. Alternatively, a categorical variable can also
be used as side information for the opc
. In that case, the
kappa index is used instead of the RMSD.
Similarity/dissimilarity measures between objects are often estimated by means of distance measurements, the closer two objects are to one another, the higher the similarity between them. Dissimilarity or distance measures are useful for a number of applications, for example for outlier detection and nearest neighbors search.
The dissimilarity()
function is the main function for
measuring dissimilarities between observations. It is basically a
wrapper to other existing dissimilarity functions within the package
(see f_diss()
, cor_diss()
, sid()
and ortho_diss()
). It allows to compute dissimilarities
between:
all the observations in a single matrix.
observations in a matrix against observations in a second matrix.
The dissimilarity methods available in dissimilarity()
are as follows (see diss_method
argument):
"pca"
: Mahalanobis distance computed on the matrix
of scores of a PC projection of Xr
(and Xu
if
provided). PC projection is done using the singular value decomposition
(SVD) algorithm. Type help(ortho_diss)
for more details on
the function called by this method.
"pca.nipals"
: Mahalanobis distance computed on the
matrix of scores of a PC projection of Xr
(and
Xu
if provided). PC projection is done using the non-linear
iterative partial least squares (NIPALS) algorithm. Type
help(ortho_diss)
in the R
console for more
details on the function called by this method.
"pls"
: Mahalanobis distance computed on the matrix
of scores of a partial least squares projection of Xr
(and
Xu
if provided). In this case, Yr
is always
required. Type help(ortho_diss)
in the R
console for more details on the function called by this method.
"cor"
: correlation dissimilarity which is based on
the coefficient between observations. Type help(cor_diss)
in the R
console for more details on the function called by
this method.
"euclid"
: Euclidean distance between observations.
Type help(f_diss)
in the R
console for more
details on the function called by this method.
"cosine"
: Cosine distance between observations. Type
help(f_diss)
in the R
console for more details
on the function called by this method.
"sid"
: spectral information divergence between
observations. Type help(sid)
in the R
console
for more details on the function called by this method.
In this package, the orthogonal space dissimilarities refer to dissimilarity measures performed either in the PC space or in the PLS space.
Since we can assume that the meaningful structure the data lies on a lower dimensional space, we can also assume that this lower dimensional space is optimal to measure the dissimilarity between observations (Ramirez-Lopez, Behrens, Schmidt, Rossel, et al. 2013).
To measure the dissimilarity between observations (xi and xj), the Mahalanobis distance is computed on their corresponding projected score vectors (ti and tj) found in the matrix of scores (T):
$$d(x_i,x_j) = d(t_i,t_j) = \sqrt{\frac{1}{z}\sum(t_i - t_j) C^{-1}(t_i - t_j)'}$$ where z is the number of components used, C−1 is the inverse of the covariance matrix computed from the matrix of projected variables for all the observations T. Since the projected variables are orthogonal to each other, the resulting C−1 would be equivalent to a diagonal matrix with the variance of each T column in its main diagonal. Therefore, for this case of orthogonal spaces, the Mahalanobis distance is equivalent to the Euclidean distance applied on the variance-scaled T (De Maesschalck, Jouan-Rimbaud, and Massart 2000).
To compute orthogonal dissimilarities in the resemble
package, the dissimilarity()
function can be used as
follows:
# for PC dissimilarity using the default settings
pcd <- dissimilarity(Xr = training$spc_p,
diss_method = "pca")
dim(pcd$dissimilarity)
# for PC dissimilarity using the optimized component selection method
pcd2 <- dissimilarity(Xr = training$spc_p,
diss_method = "pca.nipals",
Yr = training$Ciso,
pc_selection = list("opc", 20),
return_projection = TRUE)
dim(pcd2$dissimilarity)
pcd2$dissimilarity
pcd2$projection # the projection used to compute the dissimilarity matrix
# for PLS dissimilarity
plsd <- dissimilarity(Xr = training$spc_p,
diss_method = "pls",
Yr = training$Ciso,
pc_selection = list("opc", 20),
return_projection = TRUE)
dim(plsd$dissimilarity)
plsd$dissimilarity
plsd$projection # the projection used to compute the dissimilarity matrix
To compute the correlation dissimilarity between training and testing observations:
# For PC dissimilarity using the optimized component selection method
pcd_tr_ts <- dissimilarity(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca.nipals",
Yr = training$Ciso,
pc_selection = list("opc", 20))
dim(pcd_tr_ts$dissimilarity)
# For PLS dissimilarity
plsd_tr_ts <- dissimilarity(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pls",
Yr = training$Ciso,
pc_selection = list("opc", 20))
dim(plsd_tr_ts$dissimilarity)
In the last two examples, matrices of 618 rows and 207 columns are retrieved. The number of rows is the same as in the training dataset while the number of columns is the same as in the testing dataset. The dissimilarity between the ith observation in the training dataset and the jth observation in the testing dataset is stored in the ith row and the jth column of the resulting dissimilarity matrices.
It is also possible to measure the dissimilarity between observations in a localized fashion. In this case, first a global dissimilarity matrix is computed. Then, by using this matrix for each target observation, a given set of k-nearest neighbors are identified. These neighbors (together with the target observation) are projected (from the original data space) onto a (local) orthogonal space (using the same parameters specified in the function). In this projected space the Mahalanobis distance between the target observation and its neighbors is recomputed. A missing value is assigned to the observations that do not belong to this set of neighbors (non-neighbor observations). In this case the dissimilarity matrix cannot be considered as a distance metric since it does not necessarily satisfies the symmetry condition for distance matrices (i.e. given two observations xi and xj, the local dissimilarity, d, between them is relative since generally d(xi, xj) ≠ d(xj, xi)).
For computing this type of localized dissimilarity matrix, two
arguments need to be passed to the dissimilarity()
function: .local
and pre_k
. These are not
formal arguments of the function, however, they are passed to the
ortho_diss()
function which is used by the
dissimilarity()
function for computing the dissimilarities
in the orthogonal spaces.
Here are two examples on how to perform localized dissimilarity computations:
# for localized PC dissimilarity using the optimized component selection method
# set the number of neighbors to retain
knn <- 200
local_pcd_tr_ts <- dissimilarity(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca",
Yr = training$Ciso,
pc_selection = list("opc", 20),
.local = TRUE,
pre_k = knn)
## The neighborhoods of 207 observations contain missing 'Yr' values.
## Check ...$neighborhood_info
dim(local_pcd_tr_ts$dissimilarity)
## [1] 618 207
# For PLS dissimilarity
local_plsd_tr_ts <- dissimilarity(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pls",
Yr = training$Ciso,
pc_selection = list("opc", 20),
.local = TRUE,
pre_k = knn)
## The neighborhoods of 207 observations contain missing 'Yr' values.
## Check ...$neighborhood_info
dim(local_plsd_tr_ts$dissimilarity)
## [1] 618 207
# check the dissimilarity scores between the first two
# observations in the testing dataset and the first 10
# observations in the training dataset
local_plsd_tr_ts$dissimilarity[1:10, 1:2]
## Xu_1 Xu_2
## Xr_1 * *
## Xr_2 1.9537271 2.0093944
## Xr_3 2.0663499 *
## Xr_4 2.3116881 1.6281155
## Xr_5 * *
## Xr_6 * 1.8760568
## Xr_7 2.1609596 2.0513451
## Xr_8 2.1993001 1.8352469
## Xr_9 2.1764284 2.0324282
## Xr_10 * *
## *: not a neighbor
Correlation dissimilarity is based on the Pearson’s ρ correlation coefficient between observations. The value of Pearson’s ρ varies between -1 and 1. A correlation of 1 between two observations would indicate that they are perfectly correlated and might have identical characteristics (i.e. they are can be considered as highly similar). A value of -1, conversely, would indicate that the two observations are perfectly negatively correlated (i.e. the two observations are highly dissimilar). The correlation dissimilarity implemented in the package scales the values between 0 (highest dissimilarity) and 1 (highest similarity). To measure d between two observations xi and xj based on the correlation dissimilarity the following equation is applied:
$$d(x_i, x_j) = \frac{1}{2} (1 - \rho(x_i, x_j))$$
Note that d cannot be considered as a distance metric since it does not satisfy the axiom of identity of indiscernibles. Therefore we prefer the use of the term dissimilarity.
The following code demonstrates how to compute the correlation dissimilarity between all observations in the training dataset:
cd_tr <- dissimilarity(Xr = training$spc_p, diss_method = "cor")
dim(cd_tr$dissimilarity)
cd_tr$dissimilarity
To compute the correlation dissimilarity between training and testing observations:
cd_tr_ts <- dissimilarity(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "cor")
dim(cd_tr_ts$dissimilarity)
cd_tr_ts$dissimilarity
Alternatively, the correlation dissimilarity can be computed using a moving window. In this respect, a window size term w is introduced to the original equation:
$$d(x_i, x_j; w) = \frac{1}{2 w}\sum_{k=1}^{p-w}1 - \rho(x_{i,\{k:k+w\}}, x_{j,\{k:k+w\}})$$
In this case, the correlation dissimilarity is computed by averaging
the moving window correlation measures. The introduction of the window
term increases the computational cost in comparison to the simple
correlation dissimilarity. The moving window correlation dissimilarity
can be computed by setting the diss_method
argument to
"cor"
and passing a window size value to the
ws
argument as follows:
In the computation of the Euclidean dissimilarity, each feature has equal significance. Hence, correlated variables which may represent irrelevant features, may have a disproportional influence on the final dissimilarity measurement (Brereton 2003). Therefore, it is not recommended to use this measure directly on the raw data. To compute the dissimilarity between two observations/vectors xi and xj the package uses the following equation:
$$d(x_i,x_j) = \sqrt{\frac{1}{p} \sum(x_i - x_j) (x_i - x_j)'}$$ where p represents the number of variables.
With the dissimilarity()
function the Euclidean
dissimilarity can be computed as follows:
# compute the dissimilarity between all the training observations
ed <- dissimilarity(Xr = training$spc_p, diss_method = "euclid")
ed$dissimilarity
The dist()
function in the R
package
stats
can also be used to compute Euclidean distances,
however the resemble
implementation tends to be faster
(especially for very large matrices):
# compute the dissimilarity between all the training observations
pre_time_resemble <- proc.time()
ed_resemble <- dissimilarity(Xr = training$spc_p, diss_method = "euclid")
post_time_resemble <- proc.time()
post_time_resemble - pre_time_resemble
pre_time_stats <- proc.time()
ed_stats <- dist(training$spc_p, method = "euclid")
post_time_stats <- proc.time()
post_time_stats - pre_time_stats
# scale the results of dist() based on the number of input columns
ed_stats_tr <- sqrt((as.matrix(ed_stats)^2)/ncol(training$spc_p))
ed_stats_tr[1:2, 1:3]
# compare resemble and R stats results of Euclidean distances
ed_resemble$dissimilarity[1:2, 1:3]
In the above code it can be seen that the results of the
dist()
require scaling based on the number of input
variables. This means that, by default, the values output by
dist()
increase with the number of input variables. This is
an effect that is already accounted for in the implementation of the
Euclidean (and also Mahalanobis) dissimilarity implementation of
resemble
.
Another advantage of the Euclidean dissimilarity in
resemble
over the one in R
stats
is that the one in resemble
allows the computation of the
dissimilarities between observations in two matrices:
This dissimilarity metric is also known as the “Spectral Angle Mapper” which has been extensively applied in remote sensing as a tool for unsupervised classification and spectral similarity analysis. The cosine dissimilarity between two observations (xi and xj) is calculated as:
$$d (x_i, x_j) = cos^{-1} \tfrac{\sum_{k=1}^{p} x_{i,k} x_{j,k} } {\sqrt{\sum_{k=1}^{p} x_{i,k}^2} \sqrt{\sum_{k=1}^{p} x_{j,k}^2}}$$ where p is the number of variables.
With the dissimilarity()
function the Euclidean
dissimilarity can be computed as follows:
The spectral information divergence (SID, Chang 2000) indicates how dissimilar are two observations based on their probability distributions. To account for the discrepancy between the distributions of two observations (xi and xj), the SID method uses the Kullback-Leibler divergence (kl, Kullback and Leibler 1951) measure. Since the kl is a non-symmetric measure, i.e. kl(xi, xj) ≠ kl(xj, xi), the dissimilarity between xi and xj based on this method is computed as:
d(xi, xj) = kl(xi, xj) + kl(xj, xi)
The following code can be used to compute the SID between the training and testing observations:
sid_tr_ts <- dissimilarity(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "sid")
dim(sid_tr_ts$dissimilarity)
sid_tr_ts$dissimilarity
See the sid()
function in the resemble
package for more details.
Usually, dissimilarity assessment is disregarded and the decision on what method to use is sometimes arbitrary. However, if the estimations of similarity/dissimilarity between observations from its predictor/explanatory variables fail to reflect the real or main similarity/dissimilarity, these estimations can be seen as useless for further analyses.
The package resemble
offers functionality for assessing
dissimilarity matrices. These assestments are based on first nearest
neighbor search (1-NN). In this section, the different methods to
measure dissimilarity between spectra are compared in terms of their
ability to retrieve 1-NNs observations with similar Total Carbon
(“Ciso”). This indicates how well the spectral similarity between
observations reflect their compositional similarity.
Compute a dissimilarty matrix for training$spc_p
using
the different methods:
# PC dissimilarity with default settings (variance-based
# of components)
pcad <- dissimilarity(training$spc_p, diss_method = "pca", scale = TRUE)
# PLS dissimilarity with default settings (variance-based
# of components)
plsd <- dissimilarity(training$spc_p, diss_method = "pls", Yr = training$Ciso,
scale = TRUE)
# PC dissimilarity with optimal selection of components
opc_sel <- list("opc", 30)
o_pcad <- dissimilarity(training$spc_p,
diss_method = "pca",
Yr = training$Ciso,
pc_selection = opc_sel,
scale = TRUE)
# PLS dissimilarity with optimal selection of components
o_plsd <- dissimilarity(training$spc_p,
diss_method = "pls",
Yr = training$Ciso,
pc_selection = opc_sel,
scale = TRUE)
# Correlation dissimilarity
cd <- dissimilarity(training$spc_p, diss_method = "cor", scale = TRUE)
# Moving window correlation dissimilarity
mcd <- dissimilarity(training$spc_p, diss_method = "cor", ws = 51, scale = TRUE)
# Euclidean dissimilarity
ed <- dissimilarity(training$spc_p, diss_method = "euclid", scale = TRUE)
# Cosine dissimilarity
cosd <- dissimilarity(training$spc_p, diss_method = "cosine", scale = TRUE)
# Spectral information divergence/dissimilarity
sinfd <- dissimilarity(training$spc_p, diss_method = "sid", scale = TRUE)
Use the sim_eval()
function with each dissimilarity
matrix to find the closest observation to each observation and compare
them in terms of the Ciso
variable:
Ciso <- as.matrix(training$Ciso)
ev <- NULL
ev[["pcad"]] <- sim_eval(pcad$dissimilarity, side_info = Ciso)
ev[["plsd"]] <- sim_eval(plsd$dissimilarity, side_info = Ciso)
ev[["o_pcad"]] <- sim_eval(o_pcad$dissimilarity, side_info = Ciso)
ev[["o_plsd"]] <- sim_eval(o_plsd$dissimilarity, side_info = Ciso)
ev[["cd"]] <- sim_eval(cd$dissimilarity, side_info = Ciso)
ev[["mcd"]] <- sim_eval(mcd$dissimilarity, side_info = Ciso)
ev[["ed"]] <- sim_eval(ed$dissimilarity, side_info = Ciso)
ev[["cosd"]] <- sim_eval(cosd$dissimilarity, side_info = Ciso)
ev[["sinfd"]] <- sim_eval(sinfd$dissimilarity, side_info = Ciso)
Table @ref(tab:tcomparisons) and Figure @ref(fig:pcomparisons) show
the results of the comparisons (for the training dataset) between the
Total Carbon of the observations and the Total Carbon of their most
similar samples (1-NN) according to the dissimilarity method used. In
the example, the spectral dissimilarity matrices that best reflect the
compositions similarity are those built with the pls with optimized
component selection (o_plsd
) and pca with
optimized component selection (o_pcad
).
comparisons <- lapply(names(ev),
FUN = function(x, label) {
irmsd <- x[[label]]$eval[1]
ir <- x[[label]]$eval[2]
data.frame(Measure = label,
RMSD = irmsd,
r = ir)
},
x = ev)
comparisons
Measure | RMSD | r |
---|---|---|
pcad | 0.85 | 0.9 |
plsd | 0.81 | 0.89 |
o_pcad | 0.8 | 0.91 |
o_plsd | 0.75 | 0.91 |
cd | 0.99 | 0.86 |
mcd | 0.92 | 0.88 |
ed | 0.82 | 0.9 |
cosd | 1.01 | 0.86 |
sinfd | 1.44 | 0.72 |
old_par <- par("mfrow")
par(mfrow = c(3, 3))
p <- sapply(names(ev),
FUN = function(x, label, labs = c("Ciso (1-NN), %", "Ciso, %")) {
xy <- x[[label]]$first_nn[,2:1]
plot(xy, xlab = labs[1], ylab = labs[2], col = "red")
title(label)
grid()
abline(0, 1)
},
x = ev)
par(old_par)
In the package, the k-NN search aims at finding in a given reference set of observations a group of spectrally similar observations for another given set of observations. For an observation, its most similar observations are known as nearest neighbors and they are usually found by using dissimilairty metrics.
In resemble
, the k-nearest neighbor search is
implemented in the function search_neighbors()
. This
function uses the dissimilarity()
function to compute the
dissimilarity matrix that serves in the identification of the neighbors.
These neighbors can be retained in two ways: i. by providing a
specific number of neighbors or ii. by setting a dissimilarity
threshold (dth).
We encourage readers to go through the section where we discuss about dissimilarity measures, which serves as the basis for the examples presented in this section.
This means that the neighboring observations are retained regardless
their dissimilarity/distance to the target observation. Each target
observation for which its neighbors are to be found ends up with the
same neighborhood size (k). A
drawback of this approach is that observations that are in fact largely
dissimilar to the target observation might end up in its neighborhood.
This is because the requirement for building the neighborhood is based
on its size and not on the similarity of the retained observations to
the target one. In the dissimilarity()
function, the
neighborhood size is controlled by the argument k
.
Here is an example that demonstrates how
search_neighbors
can be used to search in the
training
set the spectral neighbors of the
testing
set:
knn_pc <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca.nipals",
k = 50)
# matrix of neighbors
knn_pc$neighbors
# matrix of neighbor distances (dissimilarity scores)
knn_pc$neighbors_diss
# the index (in the training set) of the first two closest neighbors found in
# training for the first observation in testing:
knn_pc$neighbors[1:2, 1, drop = FALSE]
# the distances of the two closest neighbors found in
# training for the first observation in testing:
knn_pc$neighbors_diss[1:2, 1, drop = FALSE]
# the indices in training that fall in any of the
# neighborhoods of testing
knn_pc$unique_neighbors
In the above code, knn_pc$neighbors
is a matrix showing
the results of the neighbors found. This is a matrix of neighbor indices
where every column represents an observarion in the testing set while
every row represents the neighbor index (in descending order). Every
entry represents the index of the neighbor observation in the training
set. The knn_pc$neighbors_diss
matrix shows the
dissimilarity scores corresponding to the neighbors found. For example,
for the first observation in testing
its closest
observation found in training
corresponds to the one with
index 526 (knn_pc$neighbors[1]
) which has a dissimilarity
score of 3.57 (knn_pc$neighbors_diss[1]
).
Neighbor search can also be conducted with all the dissimilarity measures described in previous sections. The neighbors retrieved will then depend on the dissimilarity method used. Thus, it is recommended to evaluate carefully what dissimilarity metric to use before neighbor search.
Here are other examples of neighbor search based on other dissimilarity measures:
# using PC dissimilarity with optimal selection of components
knn_opc <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca.nipals",
Yr = training$Ciso,
k = 50,
pc_selection = list("opc", 20),
scale = TRUE)
# using PLS dissimilarity with optimal selection of components
knn_pls <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pls",
Yr = training$Ciso,
k = 50,
pc_selection = list("opc", 20),
scale = TRUE)
# using correlation dissimilarity
knn_c <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "cor",
k = 50, scale = TRUE)
# using moving window correlation dissimilarity
knn_mwc <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "cor",
k = 50,
ws = 51, scale = TRUE)
Another example with localized PC and PLS dissimilarity measures:
# using localized PC dissimilarity with optimal selection of components
knn_local_opc <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca.nipals",
Yr = training$Ciso,
k = 50,
pc_selection = list("opc", 20),
scale = TRUE,
.local = TRUE,
pre_k = 250)
# using localized PLS dissimilarity with optimal selection of components
knn_local_opc <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pls",
Yr = training$Ciso,
k = 50,
pc_selection = list("opc", 20),
scale = TRUE,
.local = TRUE,
pre_k = 250)
Here, the neighboring observations to be retained must have a
dissimilarity score less or equal to a given dissimilarity threshold
(dth).
Therefore, the neighborhood size of the target observations is not
constant. A drawback with this approach is that choosing a meaningful
dth can
be difficult, especially because its value is largely influenced by the
dissimilarity method used. Furthermore, some neighborhoods retrieved by
certain thresholds might be of a very small size or even empty, which
constraints any type of analysis within such neighborhoods. On the other
hand, some neighborhood might end up with large sizes which might
include either redundant observations or in some other cases where dth is
too large the complexity in the neighborhood might be large. In the
dissimilarity()
function, dth is
controlled by the argument k_diss
. This argument is
accompanied by the argument k_range
which is used to
control the maximum and minimum neighborhood sizes. For example, if a
neighborhood size is below the minimum size kmin
specified in k_range
, the function automatically ignores
dth and
retrieves the kmin
closest observations. Similarly, if the neighborhood size is above the
maximum size kmax
specified in k_range
the function automatically ignores
dth and
retrieves only a maximum of kmax
neighbors.
In the package, we can use search_neighbors()
to find in
the training
set the neighbors of the testing
set which dissimilarity scores are less or equal to a user-defined
threshold:
# a dissimilarity threshold
d_th <- 1
# the minimum number of observations required in each neighborhood
k_min <- 20
# the maximum number of observations allowed in each neighborhood
k_max <- 300
dnn_pc <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca.nipals",
k_diss = d_th,
k_range = c(k_min, k_max),
scale = TRUE)
# matrix of neighbors. The minimum number of indices is 20 (given by k_min)
# and the maximum number of indices is 300 (given by k_max).
# NAs indicate "not a neighbor"
dnn_pc$neighbors
# this reports how many neighbors were found for each observation in
# testing using the input distance threshold (column n_k) and how
# many were finally selected (column final_n_k)
dnn_pc$k_diss_info
# matrix of neighbor distances
dnn_pc$neighbors_diss
# the indices in training that fall in any of the
# neighborhoods of testing
dnn_pc$unique_neighbors
In the code above, the size of the neighborhoods is not constant, the
size variability can be easily visualized with a histogram on
dnn_pc$k_diss_info$n_k
. Figure @ref(fig:knnhist), shows
that many
neighborhoods were reset to a size of 20 or to a size of 300.
hist(dnn_pc$k_diss_info$final_n_k,
breaks = k_min,
xlab = "Final neighborhood size",
main = "", col = "#EFBF47CC")
In the package, spiking refers to forcing specific
observations to be included in the neighborhoods. For example, if we are
searching in the training
set the neighbors of the
testing
set, and if we want to force certain observations
in training
to be included in the neighborhood of each
observation in tetsing
, we can use the spike
argument in search_neighbors()
. For that, in this argument
we will need to pass the indices of training
that we will
be forced into the neighborhoods. The following example demonstrates how
to do that: measures:
# the indices of the observations that we want to "invite" to every neighborhood
forced_guests <- c(1, 5, 8, 9)
# using PC dissimilarity with optimal selection of components
knn_spiked <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca.nipals",
Yr = training$Ciso,
k = 50,
spike = forced_guests,
pc_selection = list("opc", 20))
# check the first 8 neighbors found in training for the
# first 2 observations in testing
knn_spiked$neighbors[1:8, 1:2]
The previous code shows that the indices specified in
forced_guests
are always selected as part of every
neighborhood.
Spiking might be useful when there is a prior knowledge of the similarity between certain observations that cannot be easily pick up by the data.
Memory-based learning (MBL) describes a family of (non-linear) machine learning methods designed to deal with complex spectral datasets (Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. 2013). In MBL, instead of deriving a general or global regression function, a specific regression model is built for each observation requiring a prediction of a response. Each model is fitted using the nearest neighbors of the target observation found in a calibration or reference set [@ref(fig:mblgif)]. While a global function may be very complex, MBL can describe the target function as a collection of less complex local (or locally stable) approximations (Mitchell 1997). For example, for predicting the response variable Y of a set of m observations from their explanatory variables X, a set of m functions are required to be fitted. This can be described as:
ŷi = f̂i(xi; θi) ∀ i = {1, ..., m} where θi represents a set of particular parameters required to fit f̂i (e.g. number of factors in a PLS model). Therefore, MBL in the above example can be described broadly as:
f̂ = {f̂1, ..., f̂m} Figure @ref(fig:mblgif) illustrates the basic steps in MBL for a set of five observations (m = 5).
There are four basic aspects behind the steps in Figure @ref(fig:mblgif) that must be defined for any MBL algorithm:
A dissimilarity metric: It is required for neighbor search. The dissimilarity metric used must be capable also to reflect the dissimilarity in terms of the response variable for which models are to be built. For example, in soil NIR spectroscopy, the spectral dissisimilarity values of soil samples must be capable of reflecting the compositional dissisimilarity between them. Dissimilarity methods that poorly reflect this general sample dissimilarity are prone to lead to MBL models with poor predictive performance.
How many neighbors to look at?: It is important to optimize the neighborhood size to be used for fitting the local models. Neighborhoods which are too small might be too sensitive to noise and outliers affecting the robustness of the models (Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. 2013). Small neighborhoods might also lack of enough variance to properly capture the relationships between the predictors and the response. On the other hand, large size neighborhoods might introduce complex non-linear relationships between predictors and response which might decrease the accuracy of the models.
How to use the dissimilarity information?: The dissimilarity information can be:
Ignored, this means is only used to retrieve neighbors (e.g. the LOCAL algorithm, Shenk, Westerhaus, and Berzaghi 1997).
Used to weight the training observations according to their dissimilarity to the target observation (e.g. as in locally weighted PLS regression, Naes, Isaksson, and Kowalski 1990).
Used as source of additional predictors (Ramirez-Lopez,
Behrens, Schmidt, Stevens, et al. 2013). In this case, the
pairwise dissimilarity matrix between all the k neighbors is also retrieved. This
matrix of k × k
dimensions is combined with the p predictor variables resulting in a
final matrix of predictors (for the neighborhood) of k × (k + p)
dimensions. To predict the target observation, the predictors used are
the p spectral variables in
combination to the vector of distances between the target observation
and its neighbors. In some cases, this approach might lead to an
increase on the predictive performance of the local models. This
combined matrix of predictors can be built as follows:
where di, j
represents the dissimilarity score between the ith neighbor and the jth neighbor.
How to fit the local points?: This is given by the regression method used which is usually a linear one, as the relationships between the explanatory variables and the response are usually assumed linear within the neighborhood.
In the literature MBL is sometimes referred to as local modeling, nevertheless local modeling comprises other approaches, for example, cluster–based modeling and geographical segmentation-based modeling, etc. Hence, MBL can be described as a type of local modeling (Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. 2013).
The mbl()
function in the resemble
package
offers the possibility to build customized memory-based learners. This
can be done by choosing from different dissimilarity metrics, different
methods for neighborhood size optimization, different ways of using the
dissimilarity information and different regression methods for fitting
the models within the neighborhoods.
We encourage readers to go through the sections corresponding to dissimilarity measures and k-Nearest Neighbors search which serve as the basis for the examples presented in this section.
The mbl()
function can be described in regards to the
four basic aspects of the MBL methods (which are described few
paragraphs above):
The dissimilarity metric: this is controlled by the
diss_method
argument of the mbl()
function.
The methods available are the same as the ones described in the dissimilarity measures
section.
How many neighbors to look at?: this can be defined
either in the k
or in the k_diss
arguments of
the mbl()
function. These arguments operate in a similar
fashion as their counterparts in the search_neighbors()
function described in the k-Nearest Neighbors search
section. However, in mbl()
a vector of different
neighborhood sizes can be passed to k
, or a vector of
different distance thresholds can be passed to k_diss
. This
allows to test different values in one run. Here, the
k_diss
argument is also accompanied by the argument
k_range
which is used to control the maximum and minimum
neighborhood sizes.
How to use the dissimilarity information: this is
controlled by the diss_usage
argument. If
"none"
is passed, the dissimilarity information is ignored,
if "weights"
is passed, the dissimilarity information is
used to weight the training observations (using a tricubic function). If
"predictors"
is passed, the dissimilarity information is
used as source of additional predictors.
How to fit the local points?: This is controlled by the
method
argument. For this, a local_fit
object
(which carries the information of the regression method and its
parameters) is used. There are three methods available: partial least
squares (PLS) regression, weighted average partial least squares
regression (WAPLS, Shenk, Westerhaus, and Berzaghi
1997) and Gaussian process regression (GPR) with dot product
covariance. The following examples show how to build such
local_fit
objects:
# creates an object with instructions to build PLS models
my_plsr <- local_fit_pls(pls_c = 15)
my_plsr
## Partial least squares (pls)
## Number of factors: 15
# creates an object with instructions to build WAPLS models
my_waplsr <- local_fit_wapls(min_pls_c = 3, max_pls_c = 20)
my_waplsr
## Weighted average partial least squares (wapls)
## Min. and max. number of factors: from 3 to 20
# creates an object with instructions to build GPR models
my_gpr <- local_fit_gpr()
my_gpr
## Gaussian process with linear kernel/dot product (gpr)
## Noise: 0.001
A function named mbl_control()
allows to build objects
that control internal validation and some optimization aspects of the
mbl()
function. In mbl_control()
two types of
validation can be specified using the validation_type
argument:
Leave-nearest-neighbor-out cross-validation ("NNv"
):
From the group of neighbors of each target observation, the nearest
observation (i.e. the most similar observation) is excluded and then a
local model is fitted using the remaining neighbors. This model is then
used to predict the value of the response variable of the nearest
observation. The set of predicted values in all the 1-NN observations
are finally cross-validated against their corresponding reference
values.
Local leave-group-out cross-validation ("local_cv"
):
The group of neighbors of each observation to be predicted is
partitioned into different equal size subsets. The model fitted with the
selected calibration samples is used to predict the response values of
the local validation samples and the local root mean square error is
computed. This process is repeated m times and the final local error is
computed as the average of the local root mean square errors obtained
for all the m iterations. In
the mbl_control()
function m is controlled by the
numbe
argument and the size of the subsets is controlled by
the p
argument which indicates the percentage of
observations to be selected from the subset of nearest neighbors. The
global error of the predictions is computed as the average of the local
root mean square errors.
Let’s see some examples on how to build objects for controlling the
validation in mbl
.
# create an object with instructions to conduct both validation types
# "NNv" "local_cv"
two_val_control <- mbl_control(validation_type = c("NNv", "local_cv"),
number = 10,
p = 0.75)
The object two_val_control
stores the instructions for
conducting both types of validations ("NNv"
and
"local_cv"
). For "local_cv"
, the number of
groups is set to 10 and the percentage of neighbors to build the local
calibration groups is set to 75%.
Now that we have explained the main components for the
mbl()
let’s see how the mbl()
function can be
used to predict response variables in the testing
set by
building models with the training
set. The following MBL
configuration reproduces the LOCAL algorithm (Shenk, Westerhaus, and Berzaghi
1997):
# define the dissimilarity method
my_diss <- "cor"
# define the neighborhood sizes to test
my_ks <- seq(80, 200, by = 40)
# define how to use the dissimilarity information (ignore it)
ignore_diss <- "none"
# define the regression method to be used at each neighborhood
my_waplsr <- local_fit_wapls(min_pls_c = 3, max_pls_c = 20)
# for the moment use only "NNv" validation (it will be faster)
nnv_val_control <- mbl_control(validation_type = "NNv")
# predict Total Carbon
# (remove missing values)
local_ciso <- mbl(
Xr = training$spc_p[!is.na(training$Ciso),],
Yr = training$Ciso[!is.na(training$Ciso)],
Xu = testing$spc_p,
k = my_ks,
method = my_waplsr,
diss_method = my_diss,
diss_usage = ignore_diss,
control = nnv_val_control,
scale = TRUE
)
Now let’s explore the local_ciso
object:
local_ciso
##
## Call:
##
## mbl(Xr = training$spc_p[!is.na(training$Ciso), ], Yr = training$Ciso[!is.na(training$Ciso)],
## Xu = testing$spc_p, k = my_ks, method = my_waplsr, diss_method = my_diss,
## diss_usage = ignore_diss, control = nnv_val_control, scale = TRUE)
##
## _______________________________________________________
##
## Total number of observations predicted: 207
## _______________________________________________________
##
## Nearest neighbor validation statistics
##
## k rmse st_rmse r2
## <num> <num> <num> <num>
## 1: 80 0.621 0.0363 0.874
## 2: 120 0.646 0.0377 0.864
## 3: 160 0.636 0.0371 0.868
## 4: 200 0.664 0.0388 0.859
## _______________________________________________________
According to the results obtained in the above example, the
neighborhood size that minimizes the root mean squared error (RMSE) in
nearest neighbor cross-validation is 80. Let’s get the predictions done
for the testing
dataset:
bki <- which.min(local_ciso$validation_results$nearest_neighbor_validation$rmse)
bk <- local_ciso$validation_results$nearest_neighbor_validation$k[bki]
# all the prediction results are stored in:
local_ciso$results
# the get_predictions function makes easier to retrieve the
# predictions from the previous object
ciso_hat <- as.matrix(get_predictions(local_ciso))[, bki]
# Plot predicted vs reference
plot(ciso_hat, testing$Ciso,
xlim = c(0, 14),
ylim = c(0, 14),
xlab = "Predicted Total Carbon, %",
ylab = "Total Carbon, %",
main = "LOCAL using argument k")
grid()
abline(0, 1, col = "red")
The prediction root mean squared error is then:
# prediction RMSE:
sqrt(mean((ciso_hat - testing$Ciso)^2, na.rm = TRUE))
## [1] 0.4506581
# squared R
cor(ciso_hat, testing$Ciso, use = "complete.obs")^2
## [1] 0.9155142
Similar results are obtained when the optimization of the neighbrhoods is based on distance thresholds:
# create a vector of dissimilarity thresholds to evaluate
# since the correlation dissimilarity will be used
# these thresholds need to be > 0 and <= 1
dths <- seq(0.025, 0.3, by = 0.025)
# indicate the minimum and maximum sizes allowed for the neighborhood
k_min <- 30
k_max <- 200
local_ciso_diss <- mbl(
Xr = training$spc_p[!is.na(training$Ciso),],
Yr = training$Ciso[!is.na(training$Ciso)],
Xu = testing$spc_p,
k_diss = dths,
k_range = c(k_min, k_max),
method = my_waplsr,
diss_method = my_diss,
diss_usage = ignore_diss,
control = nnv_val_control,
scale = TRUE
)
local_ciso_diss
##
## Call:
##
## mbl(Xr = training$spc_p[!is.na(training$Ciso), ], Yr = training$Ciso[!is.na(training$Ciso)],
## Xu = testing$spc_p, k_diss = dths, k_range = c(k_min, k_max),
## method = my_waplsr, diss_method = my_diss, diss_usage = ignore_diss,
## control = nnv_val_control, scale = TRUE)
##
## _______________________________________________________
##
## Total number of observations predicted: 207
## _______________________________________________________
##
## Nearest neighbor validation statistics
##
## k_diss p_bounded rmse st_rmse r2
## <num> <char> <num> <num> <num>
## 1: 0.025 94.686% 0.611 0.0357 0.878
## 2: 0.050 71.014% 0.613 0.0358 0.877
## 3: 0.075 62.319% 0.620 0.0362 0.874
## 4: 0.100 51.691% 0.610 0.0356 0.879
## 5: 0.125 41.546% 0.588 0.0343 0.887
## 6: 0.150 31.884% 0.537 0.0313 0.906
## 7: 0.175 21.739% 0.571 0.0333 0.894
## 8: 0.200 26.57% 0.615 0.0359 0.877
## 9: 0.225 28.986% 0.595 0.0347 0.884
## 10: 0.250 29.952% 0.628 0.0366 0.871
## 11: 0.275 33.816% 0.618 0.0361 0.875
## 12: 0.300 37.681% 0.625 0.0365 0.873
## _______________________________________________________
The best correlation dissimilarity threshold is 0.15. The column
“p_bounded” in the table of validation results, indicate the percentage
of neighborhoods for which the size was reset either to
k_min
or k_max
.
Here we provide few additional examples of some MBL configurations where we make use of another response variable available in the dataset: soil cation exchange capacity (CEC). This variable is perhaps more challenging to predict in comparison to Total Carbon. Table @ref(tab:addexamples) provides a summary of the configurations tested in the following code examples.
Abreviation | Dissimilarity method | Dissimilarity usage | Local regression |
---|---|---|---|
local_cec |
Correlation | None | Weighted average PLS |
pc_pred_cec |
optimized PC | Source of predictors | Weighted average PLS |
pls_pred_cec |
optimized PLS | None | Weighted average PLS |
local_gpr_cec |
optimized PC | Source of predictors | Gaussian process |
# Lets define some methods:
my_wapls <- local_fit_wapls(2, 25)
k_min_max <- c(80, 200)
# use the LOCAL algorithm
# specific thresholds for cor dissimilarity
dth_cor <- seq(0.01, 0.3, by = 0.03)
local_cec <- mbl(
Xr = training$spc_p[!is.na(training$CEC),],
Yr = training$CEC[!is.na(training$CEC)],
Xu = testing$spc_p,
k_diss = dth_cor,
k_range = k_min_max,
method = my_wapls,
diss_method = "cor",
diss_usage = "none",
control = nnv_val_control,
scale = TRUE
)
# use one where pca dissmilarity is used and the dissimilarity matrix
# is used as source of additional predictors
# lets define first a an appropriate vector of dissimilarity thresholds
# for the PC dissimilarity method
dth_pc <- seq(0.05, 1, by = 0.1)
pc_pred_cec <- mbl(
Xr = training$spc_p[!is.na(training$CEC),],
Yr = training$CEC[!is.na(training$CEC)],
Xu = testing$spc_p,
k_diss = dth_pc,
k_range = k_min_max,
method = my_wapls,
diss_method = "pca",
diss_usage = "predictors",
control = nnv_val_control,
scale = TRUE
)
# use one where PLS dissmilarity is used and the dissimilarity matrix
# is used as source of additional predictors
pls_pred_cec <- mbl(
Xr = training$spc_p[!is.na(training$CEC),],
Yr = training$CEC[!is.na(training$CEC)],
Xu = testing$spc_p,
Yu = testing$CEC,
k_diss = dth_pc,
k_range = k_min_max,
method = my_wapls,
diss_method = "pls",
diss_usage = "none",
control = nnv_val_control,
scale = TRUE
)
# use one where Gaussian process regression and pca dissmilarity are used
# and the dissimilarity matrix is used as source of additional predictors
local_gpr_cec <- mbl(
Xr = training$spc_p[!is.na(training$CEC),],
Yr = training$CEC[!is.na(training$CEC)],
Xu = testing$spc_p,
k_diss = dth_pc,
k_range = k_min_max,
method = local_fit_gpr(),
diss_method = "pca",
diss_usage = "predictors",
control = nnv_val_control,
scale = TRUE
)
Collect the predictions for each configuration:
# get the indices of the best results according to
# nearest neighbor validation statistics
c_val_name <- "validation_results"
c_nn_val_name <- "nearest_neighbor_validation"
bi_local <- which.min(local_cec[[c_val_name]][[c_nn_val_name]]$rmse)
bi_pc_pred <- which.min(pc_pred_cec[[c_val_name]][[c_nn_val_name]]$rmse)
bi_pls_pred <- which.min(pls_pred_cec[[c_val_name]][[c_nn_val_name]]$rmse)
bi_local_gpr <- which.min(local_gpr_cec[[c_val_name]][[c_nn_val_name]]$rmse)
preds <- cbind(get_predictions(local_cec)[, ..bi_local],
get_predictions(pc_pred_cec)[, ..bi_pc_pred],
get_predictions(pls_pred_cec)[, ..bi_pls_pred],
get_predictions(local_gpr_cec)[, ..bi_local_gpr])
colnames(preds) <- c("local_cec",
"pc_pred_cec",
"pls_pred_cec",
"local_gpr_cec")
preds <- as.matrix(preds)
# R2s
cor(testing$CEC, preds, use = "complete.obs")^2
## local_cec pc_pred_cec pls_pred_cec local_gpr_cec
## [1,] 0.7557315 0.7977637 0.77421 0.7790474
#RMSEs
colMeans((preds - testing$CEC)^2, na.rm = TRUE)^0.5
## local_cec pc_pred_cec pls_pred_cec local_gpr_cec
## 3.334444 3.202677 3.299601 3.189961
The scatter plots in @ref(fig:mblcomparisons) ilustrate the prediction results obatined for CEC with each of the MBL configurations tested.
old_par <- par("mfrow", "mar")
par(mfrow = c(2, 2))
plot(testing$CEC, preds[, 2],
xlab = "Predicted CEC, meq/100g",
ylab = "CEC, meq/100g", main = colnames(preds)[2])
abline(0, 1, col = "red")
plot(testing$CEC, preds[, 3],
xlab = "Predicted CEC, meq/100g",
ylab = "CEC, meq/100g", main = colnames(preds)[3])
abline(0, 1, col = "red")
plot(testing$CEC, preds[, 4],
xlab = "Predicted CEC, meq/100g",
ylab = "CEC, meq/100g", main = colnames(preds)[4])
abline(0, 1, col = "red")
par(old_par)
Yu
argumentIf information of the response values in the prediction set is
available, then, the Yu
argument can be used to directly
validate the predictions done by mbl()
. It is not taken
into accound for any optimization or modeling step. It can be used as
follows:
# use Yu argument to validate the predictions
pc_pred_nt_yu <- mbl(
Xr = training$spc_p[!is.na(training$Nt),],
Yr = training$Nt[!is.na(training$Nt)],
Xu = testing$spc_p,
Yu = testing$Nt,
k = seq(40, 100, by = 10),
diss_usage = "none",
control = mbl_control(validation_type = "NNv"),
scale = TRUE
)
pc_pred_nt_yu
The mbl()
function uses the foreach()
function of the package
foreach
for iterating over every row/observation passed
to the argument Xu
. In the following example, we use the package
doParallel
to set up the cores to be used.
Alternatively the package
doSNOW
can also be used. In the following example we
use parallel processing to predict Total Nitrogen:
# Running the mbl function using multiple cores
# Execute with two cores, if available, ...
n_cores <- 2
# ... if not then go with 1 core
if (parallel::detectCores() < 2) {
n_cores <- 1
}
# Set the number of cores
library(doParallel)
clust <- makeCluster(n_cores)
registerDoParallel(clust)
# Alternatively:
# library(doSNOW)
# clust <- makeCluster(n_cores, type = "SOCK")
# registerDoSNOW(clust)
# getDoParWorkers()
pc_pred_nt <- mbl(
Xr = training$spc_p[!is.na(training$Nt),],
Yr = training$Nt[!is.na(training$Nt)],
Xu = testing$spc_p,
k = seq(40, 100, by = 10),
diss_usage = "none",
control = mbl_control(validation_type = "NNv"),
scale = TRUE
)
# go back to sequential processing
registerDoSEQ()
try(stopCluster(clust))
pc_pred_nt