Title: | Miscellaneous Functions for Processing and Sample Selection of Spectroscopic Data |
---|---|
Description: | Functions to preprocess spectroscopic data and conduct (representative) sample selection/calibration sampling. |
Authors: | Antoine Stevens [aut, cre] , Leonardo Ramirez-Lopez [aut, cre] |
Maintainer: | Leonardo Ramirez-Lopez <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.7 |
Built: | 2024-11-13 06:20:05 UTC |
Source: | CRAN |
Misc functions for spectral data
This package implements a number of functions useful for pre-processing spectral data well as for selecting representative samples/spectra. The functions included here are particularly useful in Near-Infrared and Infrared Spectroscopy applications.
#' This is the version 0.2.7 – cakes of the package. The main functionality is listed here.
Currently, the following preprocessing functions are available:
For the selection of representative samples/observations for calibrating spectral models the following functions ca be used:
Other useful functions are also available:
Maintainer: Leonardo Ramirez-Lopez [email protected]
Authors:
Useful links:
Report bugs at https://github.com/l-ramirez-lopez/prospectr/issues
Fits a baseline to each spectrum in a matrix and removes it from the corresponding input spectrum. A vector can also be passed to this function.
baseline(X, wav)
baseline(X, wav)
X |
a numeric matrix or vector to process (optionally a data frame that can be coerced to a numerical matrix). |
wav |
optional. A numeric vector of band positions. |
The baseline function find points lying on the convex hull of a spectrum, connects the points by linear interpolation and subtracts the interpolated line (baseline) from the corresponding spectrum.
a matrix or vector with the baselined spectra. The resulting matrix
is output with an attribute called baselines
which contain the spectra
of the fitted baselines.
This function is similar to continuumRemoval
and it might
replace some of its functionality in the future.
Leonardo Ramirez-Lopez with contributions from Mervin Manalili
savitzkyGolay
, movav
,
gapDer
, binning
, continuumRemoval
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # plot of the 5 first absorbance spectra matplot(wav, t(NIRsoil$spc[1:5, ]), type = "l", ylim = c(0, .6), xlab = "Wavelength /nm", ylab = "Absorbance" ) bs <- baseline(NIRsoil$spc, wav) matlines(wav, t(bs[1:5, ])) fitted_baselines <- attr(bs, "baselines") matlines(wav, t(fitted_baselines[1:5, ])) title("Original spectra, baselines and baselined spectra")
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # plot of the 5 first absorbance spectra matplot(wav, t(NIRsoil$spc[1:5, ]), type = "l", ylim = c(0, .6), xlab = "Wavelength /nm", ylab = "Absorbance" ) bs <- baseline(NIRsoil$spc, wav) matlines(wav, t(bs[1:5, ])) fitted_baselines <- attr(bs, "baselines") matlines(wav, t(fitted_baselines[1:5, ])) title("Original spectra, baselines and baselined spectra")
Compute average values of a signal in pre-determined bins (col-wise subsets). The bin size can be determined either directly or by specifying the number of bins. Sometimes called boxcar transformation in signal processing
binning(X, bins, bin.size)
binning(X, bins, bin.size)
X |
a numeric matrix or vector to process (optionally a data frame that can be coerced to a numerical matrix). |
bins |
the number of bins. |
bin.size |
the desired size of the bins. |
a matrix or vector with average values per bin.
Antoine Stevens & Leonardo Ramirez-Lopez
savitzkyGolay
, movav
,
gapDer
, continuumRemoval
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # 5 first spectra matplot(wav, t(NIRsoil$spc[1:5, ]), type = "l", xlab = "Wavelength /nm", ylab = "Absorbance" ) NIRsoil$spc_binned <- binning(NIRsoil$spc, bin.size = 20) # bin means matpoints(as.numeric(colnames(NIRsoil$spc_binned)), t(NIRsoil$spc_binned[1:5, ]), pch = 1:5 ) NIRsoil$spc_binned <- binning(NIRsoil$spc, bins = 20) dim(NIRsoil$spc_binned) # 20 bins # 5 first spectra matplot(wav, t(NIRsoil$spc[1:5, ]), type = "l", xlab = "Wavelength /nm", ylab = "Absorbance" ) # bin means matpoints(as.numeric(colnames(NIRsoil$spc_binned)), t(NIRsoil$spc_binned[1:5, ]), pch = 1:5 )
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # 5 first spectra matplot(wav, t(NIRsoil$spc[1:5, ]), type = "l", xlab = "Wavelength /nm", ylab = "Absorbance" ) NIRsoil$spc_binned <- binning(NIRsoil$spc, bin.size = 20) # bin means matpoints(as.numeric(colnames(NIRsoil$spc_binned)), t(NIRsoil$spc_binned[1:5, ]), pch = 1:5 ) NIRsoil$spc_binned <- binning(NIRsoil$spc, bins = 20) dim(NIRsoil$spc_binned) # 20 bins # 5 first spectra matplot(wav, t(NIRsoil$spc[1:5, ]), type = "l", xlab = "Wavelength /nm", ylab = "Absorbance" ) # bin means matpoints(as.numeric(colnames(NIRsoil$spc_binned)), t(NIRsoil$spc_binned[1:5, ]), pch = 1:5 )
Sum of squares block weighting: allows to scale blocks of variables, but keeping the relative weights of the variables inside a block.
blockNorm(X, targetnorm = 1)
blockNorm(X, targetnorm = 1)
X |
a numeric matrix to transform (optionally a data frame that can be coerced to a numerical matrix). |
targetnorm |
desired sum of squares for a block of variables (default = 1) |
The function computes a scaling factor, which, multiplied by the input matrix, produces a matrix with a pre–determined sum of squares.
a list with components Xscaled
, the scaled matrix and f
, the
scaling factor
This is a R port of the ‘MBnorm.m’ function of the MB matlab toolbox by Fran van den Berg.
Antoine Stevens
Eriksson, L., Johansson, E., Kettaneh, N., Trygg, J., Wikstrom, C., and Wold, S., 2006. Multi- and Megavariate Data Analysis. MKS Umetrics AB.
blockScale
, standardNormalVariate
,
detrend
X <- matrix(rnorm(100), ncol = 10) # Block normalize to sum of square equals to 1 res <- blockNorm(X, targetnorm = 1) sum(res$Xscaled^2) # check
X <- matrix(rnorm(100), ncol = 10) # Block normalize to sum of square equals to 1 res <- blockNorm(X, targetnorm = 1) sum(res$Xscaled^2) # check
Hard or soft block scaling of a spectral matrix to constant group variance. In multivariate calibration, block scaling is used to down-weight variables, when one block of variables dominates other blocks. With hard block scaling, the variables in a block are scaled so that the sum of their variances equals 1. When soft block scaling is used, the variables are scaled such that the sum of variable variances is equal to the square root of the number of variables in a particular block.
blockScale(X, type = 'hard', sigma2 = 1)
blockScale(X, type = 'hard', sigma2 = 1)
X |
a numeric matrix or vector to process (optionally a data frame that can be coerced to a numerical matrix). |
type |
the type of block scaling: 'hard' or 'soft'. |
sigma2 |
the desired total variance of a block (ie sum of the variances
of all variables, default = 1), applicable when |
a list
with Xscaled
, the scaled matrix and f
, the scaling
factor.
Antoine Stevens
Eriksson, L., Johansson, E., Kettaneh, N., Trygg, J., Wikstrom, C., and Wold, S., 2006. Multi- and Megavariate Data Analysis. MKS Umetrics AB.
blockNorm
, standardNormalVariate
,
detrend
X <- matrix(rnorm(100), ncol = 10) # Hard block scaling res <- blockScale(X) # sum of column variances == 1 apply(res$Xscaled, 2, var)
X <- matrix(rnorm(100), ncol = 10) # Hard block scaling res <- blockScale(X) # sum of column variances == 1 apply(res$Xscaled, 2, var)
Detects and removes replicate outliers in data series based on the Cochran C test for homogeneity in variance.
cochranTest(X, id, fun = 'sum', alpha = 0.05)
cochranTest(X, id, fun = 'sum', alpha = 0.05)
X |
a a numeric matrix (optionally a data frame that can be coerced to a numerical matrix). |
id |
factor of the replicate identifiers. |
fun |
function to aggregate data: 'sum' (default), 'mean', 'PC1' or 'PC2'. |
alpha |
p-value of the Cochran C test. |
The Cochran C test is test whether a single estimate of variance is significantly larger than a a group of variances. It can be computed as:
\[RMSD = \sqrt{\frac{1}{n} \sum_{i=1}^n {(y_i - \ddot{y}_i)^2}}\]where \(y_i\) is the value of the side variable of the \(i\)th sample, \(\ddot{y}_i\) is the value of the side variable of the nearest neighbor of the \(i\)th sample and \(n\) is the total number of observations.
For multivariate data, the variance \(S_i^2\) can be computed on aggregated
data, using a summary function (fun
argument)
such as sum
, mean
, or first principal components ('PC1' and 'PC2').
An observation is considered to have an outlying variance if the Cochran C statistic is higher than an upper limit critical value \(C_{UL}\) which can be evaluated with ('t Lam, 2010):
\[C_{UL}(\alpha, n, N) = 1 + [\frac{N-1}{F_{c}(\alpha/N,(n-1),(N-1)(n-1))}]^{-1} \]where \(\alpha\) is the p-value of the test, \(n\) is the (average) number of replicates and \(F_c\) is the critical value of the Fisher's \(F\) ratio.
The replicates with outlying variance are removed and the test can be applied
iteratively until no outlying variance is detected under the given p-value.
Such iterative procedure is implemented in cochranTest
, allowing the user
to specify whether a set of replicates must be removed or not from the
dataset by graphical inspection of the outlying replicates. The user has then
the possibility to (i) remove all replicates at once, (ii) remove one or more
replicates by giving their indices or (iii) remove nothing.
a list with components:
'X
': input matrix from which outlying observations (rows) have
been removed
'outliers
': numeric vector giving the row indices of the input
data that have been flagged as outliers
The test assumes a balanced design (i.e. data series have the same number of replicates).
Antoine Stevens
Centner, V., Massart, D.L., and De Noord, O.E., 1996. Detection of inhomogeneities in sets of NIR spectra. Analytica Chimica Acta 330, 1-17.
R.U.E. 't Lam (2010). Scrutiny of variance results for outliers: Cochran's test optimized. Analytica Chimica Acta 659, 68-84.
https://en.wikipedia.org/wiki/Cochran's_C_test
Compute the continuum removed values of a data matrix or vector
continuumRemoval(X, wav, type = c("R", "A"), interpol = c("linear", "spline"), method = c("division", "substraction"))
continuumRemoval(X, wav, type = c("R", "A"), interpol = c("linear", "spline"), method = c("division", "substraction"))
X |
a numeric matrix or vector to process (optionally a data frame that can be coerced to a numerical matrix). |
wav |
optional. A numeric vector of band positions. |
type |
the type of data: 'R' for reflectance (default), 'A' for absorbance. |
interpol |
the interpolation method between points on the convex hull: 'linear' (default) or 'spline'. |
method |
normalization method: 'division' (default) or 'subtraction' (see details section). |
The continuum removal technique was introduced by Clark and Roush (1984) as a method to highlight energy absorption features of minerals. It can be viewed as a way to perform albedo normalization. The algorithm find points lying on the convex hull (local maxima or envelope) of a spectrum, connects the points by linear or spline interpolation and normalizes the spectrum by dividing (or subtracting) the input data by the interpolated line.
a matrix or vector with the filtered spectra.
Antoine Stevens & Leonardo Ramirez-Lopez
Clark, R.N., and Roush, T.L., 1984. Reflectance Spectroscopy: Quantitative Analysis Techniques for Remote Sensing Applications. J. Geophys. Res. 89, 6329-6340.
savitzkyGolay
, movav
,
gapDer
, binning
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # plot of the 10 first abs spectra matplot(wav, t(NIRsoil$spc[1:10, ]), type = "l", ylim = c(0, .6), xlab = "Wavelength /nm", ylab = "Abs" ) # # type = 'A' is used for absorbance spectra cr <- continuumRemoval(NIRsoil$spc, wav, type = "A") matlines(wav, t(cr[1:10, ]))
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # plot of the 10 first abs spectra matplot(wav, t(NIRsoil$spc[1:10, ]), type = "l", ylim = c(0, .6), xlab = "Wavelength /nm", ylab = "Abs" ) # # type = 'A' is used for absorbance spectra cr <- continuumRemoval(NIRsoil$spc, wav, type = "A") matlines(wav, t(cr[1:10, ]))
Normalizes each row of an input matrix by applying a SNV transformation followed by fitting a second order linear model and returning the fitted residuals.
detrend(X, wav, p = 2)
detrend(X, wav, p = 2)
X |
a numeric matrix or vector to process (optionally a data frame that can be coerced to a numerical matrix) |
wav |
the wavelengths/ band centers. |
p |
an integer larger than 1 indicating the polynomial order (default is 2, as in the original paper of Barnes et al., 1989). |
The detrend is a row-wise transformation that allows to correct for wavelength-dependent scattering effects (variations in curvilinearity). A \(p\) order polynomial is fit for each spectrum (\(x_i\)) using the vector of bands (\(\lambda\), e.g. wavelengths) as explanatory variable as follows:
\[x_i = a\lambda^p + ... + b\lambda + c + e_i\]were a, b, c are estimated by least squares, and \(e_i\) are the spectral residuals of the least square fit. The residuals of the \(i\)th correspond to the \(i\)th detrended spectrum.
a matrix or vector with the detrended data.
Antoine Stevens and Leonardo Ramirez-Lopez
Barnes RJ, Dhanoa MS, Lister SJ. 1989. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Applied spectroscopy, 43(5): 772-777.
standardNormalVariate
, blockScale
,
blockNorm
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # conversion to reflectance opar <- par(no.readonly = TRUE) par(mfrow = c(2, 1), mar = c(4, 4, 2, 2)) # plot of the 10 first spectra matplot(wav, t(NIRsoil$spc[1:10, ]), type = "l", xlab = "", ylab = "Absorbance" ) mtext("Raw spectra") det <- detrend(NIRsoil$spc, wav) matplot(wav, t(det[1:10, ]), type = "l", xlab = "Wavelength /nm", ylab = "Absorbance" ) mtext("Detrend spectra") par(opar)
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # conversion to reflectance opar <- par(no.readonly = TRUE) par(mfrow = c(2, 1), mar = c(4, 4, 2, 2)) # plot of the 10 first spectra matplot(wav, t(NIRsoil$spc[1:10, ]), type = "l", xlab = "", ylab = "Absorbance" ) mtext("Raw spectra") det <- detrend(NIRsoil$spc, wav) matplot(wav, t(det[1:10, ]), type = "l", xlab = "Wavelength /nm", ylab = "Absorbance" ) mtext("Detrend spectra") par(opar)
Select calibration samples from a large multivariate data using the DUPLEX algorithm
duplex(X, k, metric = c("mahal", "euclid"), pc, group, .center = TRUE, .scale = FALSE)
duplex(X, k, metric = c("mahal", "euclid"), pc, group, .center = TRUE, .scale = FALSE)
X |
a numeric matrix. |
k |
the number of calibration/validation samples. |
metric |
the distance metric to be used: 'euclid' (Euclidean distance) or 'mahal' (Mahalanobis distance, default). |
pc |
optional. The number of Principal Components to be used to select
the samples. If not specified, distance are computed in the Euclidean space.
Alternatively, distances are computed in the principal component space and
|
group |
An optional |
.center |
logical value indicating whether the input matrix must be
centered before projecting |
.scale |
logical value indicating whether the input matrix must be
scaled before |
The DUPLEX algorithm is similar to the Kennard-Stone algorithm (see
kenStone
) but allows to select both calibration and validation
points that are independent. Similarly to the Kennard-Stone algorithm,
it starts by selecting the pair of points that are the farthest apart. They
are assigned to the calibration sets and removed from the list of points.
Then, the next pair of points which are farthest apart are assigned to the
validation sets and removed from the list. In a third step, the procedure
assigns each remaining point alternatively to the calibration
and validation sets based on the distance to the points already selected.
Similarly to the Kennard-Stone algorithm, the default distance metric used
by the procedure is the Euclidean distance, but the Mahalanobis distance can
be used as well using the pc
argument (see kenStone
).
a list
with components:
'model
': numeric vector giving the row indices of the input data
selected for calibration
'test
': numeric vector giving the row indices of the input data
selected for validation
'pc
': if the pc
argument is specified, a numeric matrix of the
scaled pc scores
Antoine Stevens & Leonardo Ramirez-Lopez
Kennard, R.W., and Stone, L.A., 1969. Computer aided design of experiments. Technometrics 11, 137-148.
Snee, R.D., 1977. Validation of regression models: methods and examples. Technometrics 19, 415-428.
kenStone
, honigs
, shenkWest
,
naes
data(NIRsoil) sel <- duplex(NIRsoil$spc, k = 30, metric = "mahal", pc = .99) plot(sel$pc[, 1:2], xlab = "PC1", ylab = "PC2") points(sel$pc[sel$model, 1:2], pch = 19, col = 2) # points selected for calibration points(sel$pc[sel$test, 1:2], pch = 18, col = 3) # points selected for validation # Test on artificial data X <- expand.grid(1:20, 1:20) + rnorm(1e5, 0, .1) plot(X[, 1], X[, 2], xlab = "VAR1", ylab = "VAR2") sel <- duplex(X, k = 25, metric = "mahal") points(X[sel$model, ], pch = 19, col = 2) # points selected for calibration points(X[sel$test, ], pch = 15, col = 3) # points selected for validation
data(NIRsoil) sel <- duplex(NIRsoil$spc, k = 30, metric = "mahal", pc = .99) plot(sel$pc[, 1:2], xlab = "PC1", ylab = "PC2") points(sel$pc[sel$model, 1:2], pch = 19, col = 2) # points selected for calibration points(sel$pc[sel$test, 1:2], pch = 18, col = 3) # points selected for validation # Test on artificial data X <- expand.grid(1:20, 1:20) + rnorm(1e5, 0, .1) plot(X[, 1], X[, 2], xlab = "VAR1", ylab = "VAR2") sel <- duplex(X, k = 25, metric = "mahal") points(X[sel$model, ], pch = 19, col = 2) # points selected for calibration points(X[sel$test, ], pch = 15, col = 3) # points selected for validation
Gap-Segment derivatives of a data matrix or vector
gapDer(X, m = 1, w = 1, s = 1, delta.wav)
gapDer(X, m = 1, w = 1, s = 1, delta.wav)
X |
a numeric matrix or vector' to transform (optionally a data frame that can be coerced to a numerical matrix). |
m |
an integer indicating the order of the derivative, larger than 1 (default is 1). Note that this function allows for high order derivatives (e.g. m = 6). |
w |
an integer indicating the gap size (must be odd and >=1), i.e. the spacing between points over which the derivative is computed. |
s |
an integer indicating the segment size (must be odd and >=1), i.e. the range over which the points are averaged (default = 1, i.e. no smoothing corresponding to Norris-Gap Derivative). |
delta.wav |
the sampling interval (or band spacing). |
In this type of derivatives, the gap size denotes the length of the x interval that separates the two segments that are averaged. A detailed explanation of gap segment derivatives can be found in Hopkins (2001).
The sampling interval specified with the delta.wav
argument is used for
scaling and get numerically correct derivatives.
The convolution function is written in C++/Rcpp for faster computations.
a matrix or vector with the filtered signal(s)
Antoine Stevens and Leonardo Ramirez-Lopez
Hopkins, D. W. (2001). What is a Norris derivative?. NIR news, 12(3), 3-5.
savitzkyGolay
, movav
,
binning
, continuumRemoval
data(NIRsoil) opar <- par(no.readonly = TRUE) par(mfrow = c(2, 2), mar = c(4, 4, 2, 2)) # plot of the 10 first spectra matplot(as.numeric(colnames(NIRsoil$spc)), t(NIRsoil$spc[1:10, ]), type = "l", xlab = "", ylab = "Absorbance" ) mtext("Raw spectra") der <- gapDer(NIRsoil$spc, m = 1, w = 1, s = 1, delta.wav = 2) matplot(as.numeric(colnames(der)), t(der[1:10, ]), type = "l", xlab = "Wavelength /nm", ylab = "gap derivative" ) mtext("1st derivative spectra") der <- gapDer(NIRsoil$spc, m = 1, w = 11, s = 1, delta.wav = 2) matplot(as.numeric(colnames(der)), t(der[1:10, ]), type = "l", xlab = "Wavelength /nm", ylab = "gap derivative" ) mtext("1st derivative spectra with a window size = 11 nm") der <- gapDer(NIRsoil$spc, m = 1, w = 11, s = 5, delta.wav = 2) matplot(as.numeric(colnames(der)), t(der[1:10, ]), type = "l", xlab = "Wavelength /nm", ylab = "gap derivative" ) mtext("1st derivative spectra with: window size: 11 nm, smoothing: 5 nm") par(opar)
data(NIRsoil) opar <- par(no.readonly = TRUE) par(mfrow = c(2, 2), mar = c(4, 4, 2, 2)) # plot of the 10 first spectra matplot(as.numeric(colnames(NIRsoil$spc)), t(NIRsoil$spc[1:10, ]), type = "l", xlab = "", ylab = "Absorbance" ) mtext("Raw spectra") der <- gapDer(NIRsoil$spc, m = 1, w = 1, s = 1, delta.wav = 2) matplot(as.numeric(colnames(der)), t(der[1:10, ]), type = "l", xlab = "Wavelength /nm", ylab = "gap derivative" ) mtext("1st derivative spectra") der <- gapDer(NIRsoil$spc, m = 1, w = 11, s = 1, delta.wav = 2) matplot(as.numeric(colnames(der)), t(der[1:10, ]), type = "l", xlab = "Wavelength /nm", ylab = "gap derivative" ) mtext("1st derivative spectra with a window size = 11 nm") der <- gapDer(NIRsoil$spc, m = 1, w = 11, s = 5, delta.wav = 2) matplot(as.numeric(colnames(der)), t(der[1:10, ]), type = "l", xlab = "Wavelength /nm", ylab = "gap derivative" ) mtext("1st derivative spectra with: window size: 11 nm, smoothing: 5 nm") par(opar)
Select calibration samples from a data matrix using the Honings et al. (1985) method
honigs(X, k, type)
honigs(X, k, type)
X |
a numeric matrix with absorbance or continuum-removed reflectance values (optionally a data frame that can be coerced to a numerical matrix). |
k |
the number of samples to select for calibration. |
type |
type of data: 'A' for absorbance (default), 'R' for reflectance, 'CR' for continuum-removed reflectance |
The Honigs algorithm is a simple method to select calibration samples based
on their absorption features. Absorbance, reflectance and continuum-removed
reflectance values (see continuumRemoval
) can be used (type
argument).
The algorithm can be described as follows: let be a matrix of
absorbance values:
the observation (row) with the maximum absolute absorbance
() is selected and assigned to the calibration set.
a vector of weights is computed as
where
is the column of
having the maximum absolute absorbance
and
is the absorbance value corresponding to the maximum
absolute absorbance of
each row is multiplied by the corresponding weight
and the resulting vector is subtracted from the original row
.
the row of the selected observation and the column with the maximum absolute absorbance is removed from the matrix
go back to step 1 and repeat the procedure until the desired number of selected samples is reached
The observation with the maximum absorbance is considered to have an unusual composition. The algorithm selects therefore this observation and remove from other samples the selected absorption feature by subtraction. Samples with low concentration related to this absorption will then have large negative absorption after the subtraction step and hence will be likely to be selected rapidly by the selection procedure as well.
a list
with components:
'model
': numeric vector giving the row indices of the input data
selected for calibration
'test
': numeric vector giving the row indices of the remaining
observations
'bands
': indices of the columns used during the selection procedure
The selection procedure is sensitive to noisy features in the signal.
The number of samples selected k
selected by the algorithm cannot be
greater than the number of wavelengths.
Antoine Stevens
Honigs D.E., Hieftje, G.M., Mark, H.L. and Hirschfeld, T.B. 1985. Unique-sample selection via Near-Infrared spectral substraction. Analytical Chemistry, 57, 2299-2303
kenStone
, naes
, duplex
,
shenkWest
data(NIRsoil) sel <- honigs(NIRsoil$spc, k = 10, type = "A") wav <- as.numeric(colnames(NIRsoil$spc)) # spectral library matplot(wav, t(NIRsoil$spc), type = "l", xlab = "wavelength /nm", ylab = "Abs", col = "grey50" ) # plot calibration spectra matlines(wav, t(NIRsoil$spc[sel$model, ]), type = "l", xlab = "wavelength /nm", ylab = "Abs", lwd = 2, lty = 1 ) # add bands used during the selection process abline(v = wav[sel$bands])
data(NIRsoil) sel <- honigs(NIRsoil$spc, k = 10, type = "A") wav <- as.numeric(colnames(NIRsoil$spc)) # spectral library matplot(wav, t(NIRsoil$spc), type = "l", xlab = "wavelength /nm", ylab = "Abs", col = "grey50" ) # plot calibration spectra matlines(wav, t(NIRsoil$spc[sel$model, ]), type = "l", xlab = "wavelength /nm", ylab = "Abs", lwd = 2, lty = 1 ) # add bands used during the selection process abline(v = wav[sel$bands])
Select calibration samples from a large multivariate data using the Kennard-Stone algorithm
kenStone(X, k, metric = "mahal", pc, group, .center = TRUE, .scale = FALSE, init = NULL)
kenStone(X, k, metric = "mahal", pc, group, .center = TRUE, .scale = FALSE, init = NULL)
X |
a numeric matrix. |
k |
number of calibration samples to be selected. |
metric |
distance metric to be used: 'euclid' (Euclidean distance) or 'mahal' (Mahalanobis distance, default). |
pc |
optional. If not specified, distance are computed in the Euclidean
space. Alternatively, distance are computed
in the principal component score space and |
group |
An optional |
.center |
logical value indicating whether the input matrix should be
centered before Principal Component Analysis. Default set to |
.scale |
logical value indicating whether the input matrix should be
scaled before Principal Component
Analysis. Default set to |
init |
(optional) a vector of integers indicating the indices of the
observations/rows that are to be used as observations that must be included
at the first iteration of the search process. Default is |
The Kennard–Stone algorithm allows to select samples with a uniform distribution over the predictor space (Kennard and Stone, 1969). It starts by selecting the pair of points that are the farthest apart. They are assigned to the calibration set and removed from the list of points. Then, the procedure assigns remaining points to the calibration set by computing the distance between each unassigned points \(i_0\) and selected points \(i\) and finding the point for which:
\[d_{selected} = \max\limits_{i_0}(\min\limits_{i}(d_{i,i_{0}}))\]This essentially selects point \(i_0\) which is the farthest apart from its closest neighbors \(i\) in the calibration set. The algorithm uses the Euclidean distance to select the points. However, the Mahalanobis distance can also be used. This can be achieved by performing a PCA on the input data and computing the Euclidean distance on the truncated score matrix according to the following definition of the Mahalanobis \(H\) distance:
\[H_{ij}^2 = \sum_{a=1}^A (\hat t_{ia} - \hat t_{ja})^{2} / \hat \lambda_a\]where \(\hat t_{ia}\) is the \(a^{th}\) principal component score of point \(i\), \(\hat t_{ja}\) is the corresponding value for point \(j\), \(\hat \lambda_a\) is the eigenvalue of principal component \(a\) and \(A\) is the number of principal components included in the computation.
When the group
argument is used, the sampling is conducted in such a
way that at each iteration, when a single sample is selected, this sample
along with all the samples that belong to its group, are assigned to the
final calibration set. In this respect, at each iteration, the algorithm
will select one sample (in case that sample is the only one in that group)
or more to the calibration set. This also implies that the argument k
passed to the function will not necessary reflect the exact number of samples
selected. For example, if k = 2
and if the first sample identified
belongs to with group of 5 samples and the second one belongs to a group with
10 samples, then, the total amount of samples retrieved by the
function will be 15.
a list with the following components:
model
: numeric vector giving the row indices of the input data
selected for calibration
test
: numeric vector giving the row indices of the remaining
observations
pc
: if the pc
argument is specified, a numeric matrix of the
scaled pc scores
Antoine Stevens & Leonardo Ramirez-Lopez with contributions from Thorsten Behrens and Philipp Baumann
Kennard, R.W., and Stone, L.A., 1969. Computer aided design of experiments. Technometrics 11, 137-148.
duplex
, shenkWest
, naes
,
honigs
data(NIRsoil) sel <- kenStone(NIRsoil$spc, k = 30, pc = .99) plot(sel$pc[, 1:2], xlab = "PC1", ylab = "PC2") # points selected for calibration points(sel$pc[sel$model, 1:2], pch = 19, col = 2) # Test on artificial data X <- expand.grid(1:20, 1:20) + rnorm(1e5, 0, .1) plot(X, xlab = "VAR1", ylab = "VAR2") sel <- kenStone(X, k = 25, metric = "euclid") points(X[sel$model, ], pch = 19, col = 2) # Using the group argument library(prospectr) # create groups set.seed(1) my_groups <- sample(1:275, nrow(NIRsoil$spc), replace = TRUE) |> as.factor() # check the group size table(my_groups) results_group <- kenStone(X = NIRsoil$spc, k = 2, pc = 3, group = my_groups) # as the first two samples selected belong to groups # which have in total more than 2 samples (k). my_groups[results_group$model] |> factor() |> table()
data(NIRsoil) sel <- kenStone(NIRsoil$spc, k = 30, pc = .99) plot(sel$pc[, 1:2], xlab = "PC1", ylab = "PC2") # points selected for calibration points(sel$pc[sel$model, 1:2], pch = 19, col = 2) # Test on artificial data X <- expand.grid(1:20, 1:20) + rnorm(1e5, 0, .1) plot(X, xlab = "VAR1", ylab = "VAR2") sel <- kenStone(X, k = 25, metric = "euclid") points(X[sel$model, ], pch = 19, col = 2) # Using the group argument library(prospectr) # create groups set.seed(1) my_groups <- sample(1:275, nrow(NIRsoil$spc), replace = TRUE) |> as.factor() # check the group size table(my_groups) results_group <- kenStone(X = NIRsoil$spc, k = 2, pc = 3, group = my_groups) # as the first two samples selected belong to groups # which have in total more than 2 samples (k). my_groups[results_group$model] |> factor() |> table()
A simple moving average of a matrix or vector using a convolution function written in C++/Rcpp for fast computing
movav(X, w)
movav(X, w)
X |
a numeric matrix or vector to process (optionally a data frame that can be coerced to a numerical matrix). |
w |
filter length. |
a matrix or vector with the filtered signal(s)
Antoine Stevens
savitzkyGolay
, gapDer
,
binning
, continuumRemoval
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # adding some noise NIRsoil$spc_noise <- NIRsoil$spc + rnorm(length(NIRsoil$spc), 0, 0.001) matplot(wav, t(NIRsoil$spc_noise[1:10, ]), type = "l", lty = 1, xlab = "Wavelength /nm", ylab = "Absorbance", col = "grey" ) # window size of 11 bands NIRsoil$spc_mov <- movav(NIRsoil$spc_noise, w = 15) # smoothed data matlines(as.numeric(colnames(NIRsoil$spc_mov)), t(NIRsoil$spc_mov[1:10, ]), type = "l", lty = 1 )
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # adding some noise NIRsoil$spc_noise <- NIRsoil$spc + rnorm(length(NIRsoil$spc), 0, 0.001) matplot(wav, t(NIRsoil$spc_noise[1:10, ]), type = "l", lty = 1, xlab = "Wavelength /nm", ylab = "Absorbance", col = "grey" ) # window size of 11 bands NIRsoil$spc_mov <- movav(NIRsoil$spc_noise, w = 15) # smoothed data matlines(as.numeric(colnames(NIRsoil$spc_mov)), t(NIRsoil$spc_mov[1:10, ]), type = "l", lty = 1 )
This function implements the multiplicative scatter correction method which attempts to remove physical light scatter by accounting for additive and multiplicative effects (Geladi et al., 1985).
msc(X, ref_spectrum = colMeans(X))
msc(X, ref_spectrum = colMeans(X))
X |
a numeric matrix of spectral data. |
ref_spectrum |
a numeric vector corresponding to an "ideal" reference
spectrum (e.g. free of scattering effects). By default the function uses the
mean spectrum of the input |
The Multiplicative Scatter Correction (MSC) is a normalization method that attempts to account for additive and multiplicative effects by aligning each spectrum (\(x_i\)) to an ideal reference one (\(x_r\)) as follows:
\[x_i = m_i x_r + a_i\] \[MSC(x_i) = \frac{a_i - x_i}{m_i}\]where \(a_i\) and \(m_i\) are the additive and multiplicative terms respectively.
a matrix of normalized spectral data with an attribute which indicates the reference spectrum used.
Leonardo Ramirez-Lopez and Guillaume Hans
Geladi, P., MacDougall, D., and Martens, H. 1985. Linearization and Scatter-Correction for Near-Infrared Reflectance Spectra of Meat. Applied Spectroscopy, 39(3):491-500.
standardNormalVariate
, detrend
,
blockScale
, blockNorm
data(NIRsoil) NIRsoil$msc_spc <- msc(X = NIRsoil$spc) # 10 first msc spectra matplot( x = as.numeric(colnames(NIRsoil$msc_spc)), y = t(NIRsoil$msc_spc[1:10, ]), type = "l", xlab = "wavelength, nm", ylab = "msc" ) # another example spectra_a <- NIRsoil$spc[1:40, ] spectra_b <- NIRsoil$spc[-(1:40), ] spectra_a_msc <- msc(spectra_a, colMeans(spectra_a)) # correct spectra_a based on the reference spectrum used to correct # spectra_a spectra_b_msc <- msc( spectra_b, ref_spectrum = attr(spectra_a_msc, "Reference spectrum") )
data(NIRsoil) NIRsoil$msc_spc <- msc(X = NIRsoil$spc) # 10 first msc spectra matplot( x = as.numeric(colnames(NIRsoil$msc_spc)), y = t(NIRsoil$msc_spc[1:10, ]), type = "l", xlab = "wavelength, nm", ylab = "msc" ) # another example spectra_a <- NIRsoil$spc[1:40, ] spectra_b <- NIRsoil$spc[-(1:40), ] spectra_a_msc <- msc(spectra_a, colMeans(spectra_a)) # correct spectra_a based on the reference spectrum used to correct # spectra_a spectra_b_msc <- msc( spectra_b, ref_spectrum = attr(spectra_a_msc, "Reference spectrum") )
Perform a k-means sampling on a matrix for multivariate calibration
naes(X, k, pc, iter.max = 10, method = 0, .center = TRUE, .scale = FALSE)
naes(X, k, pc, iter.max = 10, method = 0, .center = TRUE, .scale = FALSE)
X |
a numeric matrix (optionally a data frame that can be coerced to a numerical matrix). |
k |
either the number of calibration samples to select or a set of cluster centres to initiate the k-means clustering. |
pc |
optional. If not specified, k-means is run directly on the variable
(Euclidean) space.
Alternatively, a PCA is performed before k-means and |
iter.max |
maximum number of iterations allowed for the k-means
clustering. Default is |
method |
the method used for selecting calibration samples within each
cluster: either samples closest to the cluster.
centers ( |
.center |
logical value indicating whether the input matrix must be
centered before Principal Component Analysis. Default set to |
.scale |
logical value indicating whether the input matrix must be
scaled before Principal Component Analysis. Default set to |
K-means sampling is a simple procedure based on cluster analysis to select calibration samples from large multivariate datasets. The method can be described in three points (Naes et al.,2001):
Perform a PCA and decide how many principal component to keep,
Carry out a k-means clustering on the principal component scores and choose the number of resulting clusters to be equal to the number of desired calibration samples,
Select one sample from each cluster.
a list with components:
'model
': numeric vector giving the row indices of the input data
selected for calibration
'test
': numeric vector giving the row indices of the remaining
observations
'pc
': if the pc
argument is specified, a numeric matrix of the
scaled pc scores
'cluster
': integer vector indicating the cluster to which each
point was assigned
'centers
': a matrix of cluster centres
Antoine Stevens & Leonardo Ramirez-Lopez
Naes, T., 1987. The design of calibration in near infra-red reflectance analysis by clustering. Journal of Chemometrics 1, 121-134.
Naes, T., Isaksson, T., Fearn, T., and Davies, T., 2002. A user friendly guide to multivariate calibration and classification. NIR Publications, Chichester, United Kingdom.
kenStone
, honigs
, duplex
,
shenkWest
data(NIRsoil) sel <- naes(NIRsoil$spc, k = 5, p = .99, method = 0) # clusters plot(sel$pc[, 1:2], col = sel$cluster + 2) # points selected for calibration with method = 0 points(sel$pc[sel$model, 1:2], col = 2, pch = 19, cex = 1 ) # pre-defined centers can also be provided sel2 <- naes(NIRsoil$spc, k = sel$centers, p = .99, method = 1 ) # points selected for calibration with method = 1 points(sel$pc[sel2$model, 1:2], col = 1, pch = 15, cex = 1 )
data(NIRsoil) sel <- naes(NIRsoil$spc, k = 5, p = .99, method = 0) # clusters plot(sel$pc[, 1:2], col = sel$cluster + 2) # points selected for calibration with method = 0 points(sel$pc[sel$model, 1:2], col = 2, pch = 19, cex = 1 ) # pre-defined centers can also be provided sel2 <- naes(NIRsoil$spc, k = sel$centers, p = .99, method = 1 ) # points selected for calibration with method = 1 points(sel$pc[sel2$model, 1:2], col = 1, pch = 15, cex = 1 )
Soil spectral library of the ‘Chimiometrie 2006’ challenge. The database contains absorbance spectra of dried and sieved soil samples measured between 1100 nm and 2498 nm at 2 nm interval. The soil samples come from agricultural fields collected from all over the Walloon region in Belgium. Three parameters are associated with the spectral library: Nt (Total Nitrogen in g/Kg of dry soil), CEC (Cation Exchange Capacity in meq/100 g of dry soil) and Ciso (Carbon in g/100 g of dry soil). Carbon content has been measured following the ISO14235 method.
data(NIRsoil)
data(NIRsoil)
A data.frame
of 825 observations and 5 variables (where the spectral
data is embedded in one variable NIRSoil$spc
).
The dataset includes 618 training and 207 test samples with 5 variables:
Nt (Total Nitrogen).
Ciso (Carbon).
CEC (Cation Exchange Capacity).
train (binary vector indicating training (1) and validation (0) samples).
and spc (a matrix of spectral NIR absorbance values, where the band/wavelength positions are stored as colnames
).
Nt, Ciso and CEC have respectively 22 \ with missing values.
Pierre Dardenne from Walloon Agricultural Research Centre, Belgium.
Fernandez Pierna, J.A., and Dardenne, P., 2008. Soil parameter quantification by NIRS as a Chemometric challenge at 'Chimiometrie 2006'. Chemometrics and Intelligent Laboratory Systems 91, 94-98.
Select calibration samples from multivariate data using the Puchwein algorithm
puchwein(X, pc = 0.95, k, min.sel, details = FALSE, .center = TRUE, .scale = FALSE)
puchwein(X, pc = 0.95, k, min.sel, details = FALSE, .center = TRUE, .scale = FALSE)
X |
a matrix from which the calibration samples are to be selected (optionally a data frame that can be coerced to a numerical matrix). |
pc |
the number of principal components retained in the computation of
the distance in the standardized Principal Component space (Mahalanobis
distance).
If |
k |
the initial limiting distance parameter, if not specified (default),
set to 0.2. According to Puchwein, a good starting value for the limiting
distance is |
min.sel |
minimum number of samples to select for calibration (default = 5). |
details |
logical value, if |
.center |
logical value indicating whether the input matrix must be centered before Principal Component. Analysis. Default set to TRUE. |
.scale |
logical value indicating whether the input matrix must be scaled before Principal Component Analysis. Default set to FALSE. |
The Puchwein algorithm select samples from a data matrix by iteratively
eliminating similar samples using the Mahalanobis distance.
It starts by performing a PCA on the input matrix and extracts the score
matrix truncated to , the number of principal components. The score
matrix is then normalized to unit variance and the Euclidean distance of each
sample to the centre of the data is computed, which is identical to the
Mahalanobis distance
. Additionally, the Mahalanobis distances between
samples are comptuted. The algorithm then proceeds as follows:
Choose a initial limiting distance
Select the sample with the highest distance to the centre
Remove all samples within the minimum distance from
the sample selected in step 2
Go back to step 2 and proceed until there are no samples/observations left in the dataset
Go back to step 1 and increase the minimum distance by multiplying the limiting distance by the loop number
It is not possible to obtain a pre-defined number of samples selected by the
method. To choose the adequate number of samples, a data frame is returned
by puchwein
function (leverage
) giving the observed and theoretical
cumulative sum of leverages of the points selected in each iteration. The
theoretical cumulative sum of leverage is computed such as each point has the
same leverage (the sum of leverages divided by the number of observations).
The loop having the largest difference between the observed and theoretical
sums is considered as producing the optimal selection of points (the subset
that best reproduces the variability of the predictor space).
a list
with components:
'model
': indices of the observations (row indices of the input
data)
selected for calibration
'test
': indices of the remaining observations (row indices of the
input data)
'pc
': a numeric matrix of the scaled pc scores
'loop.optimal
': index of the loop producing the maximum difference
between the observed and
theoretical sum of leverages of the selected samples
'leverage
': data frame giving the observed and theoretical
cumulative sums of leverage of the points selected in each loop
'details
': list with the indices of the observations kept in each
loop
The Puchwein algorithm is an iterative method and can be slow for large data matrices.
Antoine Stevens
Puchwein, G., 1988. Selection of calibration samples for near-infrared spectrometry by factor analysis of spectra. Analytical Chemystry 60, 569-573.
Shetty, N., Rinnan, A., and Gislum, R., 2012. Selection of representative calibration sample sets for near-infrared reflectance spectroscopy to predict nitrogen concentration in grasses. Chemometrics and Intelligent Laboratory Systems 111, 59-65.
kenStone
, duplex
,
shenkWest
, honigs
, naes
data(NIRsoil) sel <- puchwein(NIRsoil$spc, k = 0.2, pc = .99) plot(sel$pc[, 1:2]) # points selected for calibration points(NIRsoil$spc[sel$model, 1:2], col = 2, pch = 2) # Leverage plot opar <- par(no.readonly = TRUE) par(mar = c(4, 5, 2, 2)) plot(sel$leverage$loop, sel$leverage$diff, type = "l", xlab = "# loops", ylab = "Difference between theoretical and \n observed sum of leverages" ) par(opar)
data(NIRsoil) sel <- puchwein(NIRsoil$spc, k = 0.2, pc = .99) plot(sel$pc[, 1:2]) # points selected for calibration points(NIRsoil$spc[sel$model, 1:2], col = 2, pch = 2) # Leverage plot opar <- par(no.readonly = TRUE) par(mar = c(4, 5, 2, 2)) plot(sel$leverage$loop, sel$leverage$diff, type = "l", xlab = "# loops", ylab = "Difference between theoretical and \n observed sum of leverages" ) par(opar)
This function imports .nir files generated by BUCHI NIRCal software.
read_nircal(file, response = TRUE, spectra = TRUE, metadata = TRUE, progress = TRUE, verbose = TRUE)
read_nircal(file, response = TRUE, spectra = TRUE, metadata = TRUE, progress = TRUE, verbose = TRUE)
file |
the name of the NIRCal (.nir) file which the data are to be read from. For URLs a temporary file is first downloaded and is then read. |
response |
a logical indicating if the data of the response variables
must be returned (default is |
spectra |
a logical indicating if the spectral data must be returned
(default is |
metadata |
a logical indicating if the metadada must be returned
(default is |
progress |
a logical indicating if a progress bar must be printed
(default is |
verbose |
a logical indicating if the number of spectra and response
variables (an also the ID's of the spectra without gain and/or temperature
information) must be printed (default is |
The extension of the BUCHI NIRCal files is .nir. These files are used to store spectra generated by BUCHI N-500 and BUCHI NIRMaster FT-NIR sensors. See NIRCal technical data sheet.
a data.frame containing the metadata, response variables (if
response = TRUE
) and spectra (if spectra = TRUE
, embedded in the
data.frame
as a matrix named ...$spc
).
Read single or multiple binary and ASCII files acquired with an ASD FieldSpec Pro (ASDi, Boulder, CO) spectroradiometer
readASD(fnames, in_format, out_format)
readASD(fnames, in_format, out_format)
fnames |
a character vector of the name(s) (with absolute path) of the file(s) to read. |
in_format |
the format of the input file: |
out_format |
the format of the output: 'matrix' (default) or 'list' (see below). |
if out_format
= 'matrix'
, reflectance values of the input file(s) in a single matrix.
if out_format
= 'list'
, a list
of the input file(s) data consisting of a list with components:
Name
: name of the file imported
datetime
: date and time of acquisition in POSIXct
format
header
: list with information from the header file
radiance
: if applicable, a numeric vector of radiance values
reference
: if applicable, a numeric vector of radiance values of the white reference
reflectance
: numeric vector of reflectance values
wavelength
: numeric vector of the band positions
There is a R port of the ‘importasd.m’ function from the ‘FSFPostProcessing’ Matlab toolbox by Iain Robinson (University of Edinburgh), which is based on some Java code provided by Andreas Hunei (University of Zurich).
It seems that ASD file format has changed quite a lot with file versions. The function will possibly not work as expected for all versions. Please report any bugs to the package maintainer.
Antoine Stevens (R port), Iain Robinson (matlab function) & Leonardo Ramirez-Lopez (R port)
Robinson, I., and A. MacArthur. 2011. The Field Spectroscopy Facility Post Processing Toolbox User Guide. Post processing spectral data in MATLAB, University of Edinburgh, Edinburgh, UK.
Resample a data matrix or vector to new coordinates (e.g.
band positions) using spline or linear interpolation. This function is a
simple wrapper around approx
and splinefun
in
base.
resample(X, wav, new.wav, interpol = "spline", ...)
resample(X, wav, new.wav, interpol = "spline", ...)
X |
numeric matrix or vector to resample (optionally a data frame that can be coerced to a numerical matrix). |
wav |
a numeric vector giving the original band positions. |
new.wav |
a numeric vector giving the new band positions. |
interpol |
the interpolation method: 'linear' or 'spline' (default). |
... |
additional arguments to be passed to the |
a matrix or vector with resampled values.
Antoine Stevens and Leonardo Ramirez-Lopez
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # increase spectral resolution by 2 NIRsoil$spc_resampled <- resample(NIRsoil$spc, wav, seq(1100, 2498, 2)) dim(NIRsoil$spc) dim(NIRsoil$spc_resampled)
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # increase spectral resolution by 2 NIRsoil$spc_resampled <- resample(NIRsoil$spc, wav, seq(1100, 2498, 2)) dim(NIRsoil$spc) dim(NIRsoil$spc_resampled)
Resample a data matrix or vector to match the response of another instrument using full width half maximum (FWHM) values
resample2(X, wav, new.wav, fwhm)
resample2(X, wav, new.wav, fwhm)
X |
a numeric matrix or vector to resample (optionally a data frame that can be coerced to a numerical matrix). |
wav |
a numeric vector giving the original band positions. |
new.wav |
a numeric vector giving the new band positions. |
fwhm |
a numeric vector giving the full width half maximums of the new band positions. If no value is specified, it is assumed that the fwhm is equal to the sampling interval (i.e. band spacing). If only one value is specified, the fwhm is assumed to be constant over the spectral range. |
The function uses gaussian models defined by fwhm values to resample the high resolution data to new band positions and resolution. It assumes that band spacing and fwhm of the input data is constant over the spectral range. The interpolated values are set to 0 if input data fall outside by 3 standard deviations of the gaussian densities defined by fwhm.
a matrix or vector with resampled values
Antoine Stevens
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # Plot 10 first spectra matplot(wav, t(NIRsoil$spc[1:10, ]), type = "l", xlab = "Wavelength /nm", ylab = "Absorbance" ) # ASTER SWIR bands (nm) new_wav <- c(1650, 2165, 2205, 2260, 2330, 2395) # positions fwhm <- c(100, 40, 40, 50, 70, 70) # fwhm's # Resample NIRsoil to ASTER band positions aster <- resample2(NIRsoil$spc, wav, new_wav, fwhm) matpoints(as.numeric(colnames(aster)), t(aster[1:10, ]), pch = 1:5)
data(NIRsoil) wav <- as.numeric(colnames(NIRsoil$spc)) # Plot 10 first spectra matplot(wav, t(NIRsoil$spc[1:10, ]), type = "l", xlab = "Wavelength /nm", ylab = "Absorbance" ) # ASTER SWIR bands (nm) new_wav <- c(1650, 2165, 2205, 2260, 2330, 2395) # positions fwhm <- c(100, 40, 40, 50, 70, 70) # fwhm's # Resample NIRsoil to ASTER band positions aster <- resample2(NIRsoil$spc, wav, new_wav, fwhm) matpoints(as.numeric(colnames(aster)), t(aster[1:10, ]), pch = 1:5)
Savitzky-Golay smoothing and derivative of a data matrix or vector.
savitzkyGolay(X, m, p, w, delta.wav)
savitzkyGolay(X, m, p, w, delta.wav)
X |
a numeric matrix or vector to process (optionally a data frame that can be coerced to a numerical matrix). |
m |
an integer indcating the differentiation order. |
p |
an integer indicating the polynomial order. |
w |
an integer indicating the window size (must be odd). |
delta.wav |
(optional) sampling interval. |
The Savitzky-Golay algorithm fits a local polynomial regression on the signal. It requires evenly spaced data points. Mathematically, it operates simply as a weighted sum over a given window:
\[ x_j\ast = \frac{1}{N}\sum_{h=-k}^{k}{c_hx_{j+h}}\]where \(x_j\ast\) is the new value, \(N\) is a normalizing coefficient, \(k\) is the gap size on each side of \(j\) and \(c_h\) are pre-computed coefficients, that depends on the chosen polynomial order and degree.
The sampling interval specified with the delta.wav
argument is used for
scaling and get numerically correct derivatives.
The convolution function is written in C++/Rcpp for faster computations.
Antoine Stevens and Leonardo Ramirez-Lopez
Luo, J., Ying, K., He, P., & Bai, J. (2005). Properties of Savitzky–Golay digital differentiators. Digital Signal Processing, 15(2), 122-136.
Savitzky, A., and Golay, M.J.E., 1964. Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36, 1627-1639.
Schafer, R. W. (2011). What is a Savitzky-Golay filter? (lecture notes). IEEE Signal processing magazine, 28(4), 111-117.
Wentzell, P.D., and Brown, C.D., 2000. Signal processing in analytical chemistry. Encyclopedia of Analytical Chemistry, 9764-9800.
data(NIRsoil) opar <- par(no.readonly = TRUE) par(mfrow = c(2, 1), mar = c(4, 4, 2, 2)) # plot of the 10 first spectra matplot(as.numeric(colnames(NIRsoil$spc)), t(NIRsoil$spc[1:10, ]), type = "l", xlab = "", ylab = "Absorbance" ) mtext("Raw spectra") NIRsoil$spc_sg <- savitzkyGolay( X = NIRsoil$spc, m = 1, p = 3, w = 11, delta.wav = 2 ) matplot(as.numeric(colnames(NIRsoil$spc_sg)), t(NIRsoil$spc_sg[1:10, ]), type = "l", xlab = "Wavelength /nm", ylab = "1st derivative" ) mtext("1st derivative spectra") par(opar)
data(NIRsoil) opar <- par(no.readonly = TRUE) par(mfrow = c(2, 1), mar = c(4, 4, 2, 2)) # plot of the 10 first spectra matplot(as.numeric(colnames(NIRsoil$spc)), t(NIRsoil$spc[1:10, ]), type = "l", xlab = "", ylab = "Absorbance" ) mtext("Raw spectra") NIRsoil$spc_sg <- savitzkyGolay( X = NIRsoil$spc, m = 1, p = 3, w = 11, delta.wav = 2 ) matplot(as.numeric(colnames(NIRsoil$spc_sg)), t(NIRsoil$spc_sg[1:10, ]), type = "l", xlab = "Wavelength /nm", ylab = "1st derivative" ) mtext("1st derivative spectra") par(opar)
Select calibration samples from a large multivariate data using the SELECT algorithm as described in Shenk and Westerhaus (1991).
shenkWest(X, d.min = 0.6, pc = 0.95, rm.outlier = FALSE, .center = TRUE, .scale = FALSE)
shenkWest(X, d.min = 0.6, pc = 0.95, rm.outlier = FALSE, .center = TRUE, .scale = FALSE)
X |
a numeric matrix (optionally a data frame that can be coerced to a numerical matrix). |
d.min |
a minimum distance (default = 0.6). |
pc |
the number of principal components retained in the computation
distance in the standardized Principal Component space (Mahalanobis distance).
If |
rm.outlier |
logical. If |
.center |
logical. Indicates whether the input matrix should be centered
before Principal Component Analysis. Default set to |
.scale |
logical. Indicates whether the input matrix should be scaled
before Principal Component Analysis. Default set to |
The SELECT algorithm is an iterative procedure based on the standardized Mahalanobis distance between observations. First, the observation having the highest number of neighbours within a given minimum distance is selected and its neighbours are discarded. The procedure is repeated until there is no observation left.
If the rm.outlier
argument is set to TRUE
, outliers will be removed
before running the SELECT algorithm, using the CENTER algorithm of
Shenk and Westerhaus (1991), i.e. samples with a standardized Mahalanobis
distance >3
are removed.
a list
with components:
'model
': numeric vector giving the row indices of the input data
selected for calibration
'test
': numeric vector giving the row indices of the remaining
observations
'pc
': a numeric matrix of the scaled pc scores
Antoine Stevens
Shenk, J.S., and Westerhaus, M.O., 1991. Population Definition, Sample Selection, and Calibration Procedures for Near Infrared Reflectance Spectroscopy. Crop Science 31, 469-474.
data(NIRsoil) # reduce data size NIRsoil$spc <- binning(X = NIRsoil$spc, bin.size = 5) sel <- shenkWest(NIRsoil$spc, pc = .99, d.min = .3, rm.outlier = FALSE) plot(sel$pc[, 1:2], xlab = "PC1", ylab = "PC2") # points selected for calibration points(sel$pc[sel$model, 1:2], pch = 19, col = 2) # without outliers sel <- shenkWest(NIRsoil$spc, pc = .99, d.min = .3, rm.outlier = TRUE) plot(sel$pc[, 1:2], xlab = "PC1", ylab = "PC2") # points selected for calibration points(sel$pc[sel$model, 1:2], pch = 15, col = 3)
data(NIRsoil) # reduce data size NIRsoil$spc <- binning(X = NIRsoil$spc, bin.size = 5) sel <- shenkWest(NIRsoil$spc, pc = .99, d.min = .3, rm.outlier = FALSE) plot(sel$pc[, 1:2], xlab = "PC1", ylab = "PC2") # points selected for calibration points(sel$pc[sel$model, 1:2], pch = 19, col = 2) # without outliers sel <- shenkWest(NIRsoil$spc, pc = .99, d.min = .3, rm.outlier = TRUE) plot(sel$pc[, 1:2], xlab = "PC1", ylab = "PC2") # points selected for calibration points(sel$pc[sel$model, 1:2], pch = 15, col = 3)
Corrects steps in an input spectral matrix by linear interpolation of the values of the edges of the middle sensor
spliceCorrection(X, wav, splice = c(1000, 1830), interpol.bands = 10)
spliceCorrection(X, wav, splice = c(1000, 1830), interpol.bands = 10)
X |
a numeric matrix or vector to transform (optionally a data frame that can be coerced to a numerical matrix). |
wav |
a numeric vector with band positions. |
splice |
a numeric vector of length 1 or 2 with the positions of the
splice(s). Default:
|
interpol.bands |
the number of interpolation bands. |
This function uses by default the positions for the ASD FieldSpec Pro spectroradiometer (Malvern Panalytical) which usually exhibit steps at the splice of the three built-in detectors, positioned at 1000 nm (end of VNIR detector) and 1830 nm (end of SWIR1 detector). The data corresponding to the spectral region after the first step is used as reference for correcting the first region and the laste region (if 2 steps are supplied). Other typical examples of splice artifacts caused by concatenating data captured by different detectors inside the spectrometer:
XDS (FOSS): 1100 nm
ProxiMate (BUCHI Labortechnik): 900 nm
a matrix with the splice corrected data.
Antoine Stevens and Leonardo Ramirez-Lopez
This function normalizes each row of an input matrix by subtracting each row by its mean and dividing it by its standard deviation
standardNormalVariate(X)
standardNormalVariate(X)
X |
a numeric matrix of spectral data (optionally a data frame that can be coerced to a numerical matrix). |
SNV is simple way for normalizing spectral data that intends to correct for light scatter. It operates row-wise:
\[SNV_i = \frac{x_i - \bar{x}_i}{s_i}\]where \(x_i\) is the signal of the \(i\)th observation, \(\bar{x}_i\) is its mean and \(s_i\) its standard deviation.
a matrix of normalized spectral data.
Antoine Stevens
Barnes RJ, Dhanoa MS, Lister SJ. 1989. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Applied spectroscopy, 43(5): 772-777.
msc
, detrend
, blockScale
,
blockNorm
data(NIRsoil) NIRsoil$spc_snv <- standardNormalVariate(X = NIRsoil$spc) # 10 first snv spectra matplot( x = as.numeric(colnames(NIRsoil$spc_snv)), y = t(NIRsoil$spc_snv[1:10, ]), type = "l", xlab = "wavelength, nm", ylab = "snv" ) ## Not run: apply(NIRsoil$spc_snv, 1, sd) # check ## End(Not run)
data(NIRsoil) NIRsoil$spc_snv <- standardNormalVariate(X = NIRsoil$spc) # 10 first snv spectra matplot( x = as.numeric(colnames(NIRsoil$spc_snv)), y = t(NIRsoil$spc_snv[1:10, ]), type = "l", xlab = "wavelength, nm", ylab = "snv" ) ## Not run: apply(NIRsoil$spc_snv, 1, sd) # check ## End(Not run)