| Title: | Machine Learning Models for Soil Properties |
|---|---|
| Description: | Creates a spectroscopy guideline with a highly accurate prediction model for soil properties using machine learning or deep learning algorithms such as LASSO, Random Forest, and Cubist, and decides which algorithm generates the best model for different soil types. |
| Authors: | Pengyuan Chen [aut, cre], Christopher Clingensmith [aut], Chenglong Ye [aut], Sabine Grunwald [aut], Katsutoshi Mizuta [aut] |
| Maintainer: | Pengyuan Chen <[email protected]> |
| License: | GPL-2 |
| Version: | 0.1.0 |
| Built: | 2026-05-11 09:27:22 UTC |
| Source: | https://github.com/cran/MLSP |
This function computes the row-wise mean for consecutive groups of columns in a matrix or data frame.
colsMean(x, ncols)
| Argument | Description |
|---|---|
| x | A numeric matrix or data frame. |
| ncols | An integer specifying the number of consecutive columns to group together. |
The function splits the columns of x into consecutive groups of ncols columns
and calculates the mean of each row for each group. The number of columns in x must
be divisible by ncols.
A numeric matrix with the same number of rows as x and ncol(x) / ncols columns,
where each column is the row-wise mean of a group of ncols columns.
mat <- matrix(1:12, nrow = 3)
colsMean(mat, 2)
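The grouping rule described for colsMean() can be illustrated with a short base R sketch. This is not the package's source code; `cols_mean_sketch` is a hypothetical name used only to show one way the row-wise means over consecutive column groups could be computed.

```r
# Hypothetical re-implementation for illustration; not taken from MLSP.
cols_mean_sketch <- function(x, ncols) {
  x <- as.matrix(x)
  # The documented contract: ncol(x) must be divisible by ncols.
  stopifnot(ncol(x) %% ncols == 0)
  n_groups <- ncol(x) / ncols
  out <- vapply(seq_len(n_groups), function(g) {
    cols <- ((g - 1) * ncols + 1):(g * ncols)  # columns of group g
    rowMeans(x[, cols, drop = FALSE])
  }, numeric(nrow(x)))
  # Force a matrix even when x has a single row.
  matrix(out, nrow = nrow(x))
}

mat <- matrix(1:12, nrow = 3)
cols_mean_sketch(mat, 2)
# Column 1 averages original columns 1-2, column 2 averages columns 3-4.
```

With the 3 x 4 example matrix, the result has 3 rows and 4 / 2 = 2 columns.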
This function merges soil laboratory data with cleaned spectral (VNIR) data, performs preprocessing, and prepares inputs for calibration and model building.
merge_of_lab_and_spectrum(soil_data, data_NaturaSpec_cleaned)
| Argument | Description |
|---|---|
| soil_data | A data frame containing soil laboratory measurements (must include a column named |
| data_NaturaSpec_cleaned | A data frame containing cleaned spectral data with columns |
The function performs the following steps:
Aggregates spectral data by wavelength and computes mean reflectance values.
Merges the soil and spectral datasets by LAB_NUM.
Separates soil variables and VNIR spectral matrix.
Creates calibration sample indices using random sampling.
Defines spectral bands to remove (detector artifact areas) and indices to be used in modeling.
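The merge and sampling steps above can be sketched with base R on toy data. Everything except the LAB_NUM key is an assumption made for illustration: the `SOC` column, the band names, the 70% calibration fraction, and the object names are hypothetical and are not taken from the package.

```r
set.seed(1)

# Toy laboratory data and toy spectra keyed by LAB_NUM (values are random).
soil <- data.frame(LAB_NUM = 1:10, SOC = runif(10))
bands <- matrix(runif(10 * 5), nrow = 10,
                dimnames = list(NULL, paste0("X", 350:354)))
spectra <- data.frame(LAB_NUM = 1:10, bands)

# Merge the soil and spectral datasets by LAB_NUM.
merged <- merge(soil, spectra, by = "LAB_NUM")

# Separate the soil variables from the spectral matrix.
soil_part <- merged[, c("LAB_NUM", "SOC")]
vnir <- as.matrix(merged[, setdiff(names(merged), names(soil_part))])

# Draw four random calibration index sets (the fraction is an assumption).
cal_idx <- lapply(1:4, function(i) {
  sample(nrow(merged), size = floor(0.7 * nrow(merged)))
})

str(cal_idx)
```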
A list with the following elements:
Data frame of soil laboratory data (first 8 columns of merged dataset).
Matrix of VNIR spectral reflectance values (without metadata columns).
List of calibration sample indices for cross-validation (4 sets).
Vectors of indices corresponding to spectral bands to be removed (detector artifact regions around 1000 nm and 1800 nm).
Indices of spectral bands used for aggregation (columns 7–2146).
Indices of bands to be excluded from analysis.
Vector of spectral band names retained after removal.
merged <- merge_of_lab_and_spectrum(soil_data, data_NaturaSpec_cleaned)
str(merged)
This function applies several machine learning models (PCR, PLSR, Random Forest, LASSO, Cubist) to soil spectral data and compares their performance. Optionally, it can return the best-performing model.
ml_f(x, y, smoother_selection, type_of_soil, model_selection = TRUE)
| Argument | Description |
|---|---|
| x | A data frame or matrix containing spectral data. |
| y | A vector containing corresponding soil laboratory measurements. |
| smoother_selection | A parameter specifying the smoothing method to be applied during preprocessing. |
| type_of_soil | A character string indicating the soil type for model calibration. |
| model_selection | Logical; if 'TRUE' (default), the function returns only the best-performing model. If 'FALSE', it returns the results from all models. |
The function merges spectral and laboratory data, preprocesses the data, and evaluates the following models:
PCR (Principal Component Regression)
PLSR (Partial Least Squares Regression)
RF (Random Forest)
LASSO regression
Cubist regression
Each model's performance results are combined into a single results object. If model_selection = TRUE,
the function returns the model with the highest performance metric (based on the 11th column of the results table).
A data frame:
model_selection = FALSE
Returns results for all models.
model_selection = TRUE
Returns only the best-performing model result.
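The selection rule (keep the model with the highest value in the 11th column of the results table) can be sketched as follows. The table here is a placeholder with made-up numbers in column 11 only; its layout and row names are assumptions and do not reproduce the package's actual output.

```r
# Placeholder results table: one row per model, 11 metric columns.
res <- as.data.frame(matrix(0, nrow = 3, ncol = 11))
rownames(res) <- c("PCR", "PLSR", "RF")

# Illustrative values for the 11th column (not real results).
res[[11]] <- c(1.8, 2.4, 2.1)

# Keep only the row with the highest value in column 11.
best <- res[which.max(res[[11]]), , drop = FALSE]
rownames(best)
```

Note that `which.max()` returns the first maximum, so ties would resolve to the earlier row in this sketch.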
# Example usage:
results <- ml_f(
  x, y,
  smoother_selection = "savitzky",
  type_of_soil = "loam",
  model_selection = TRUE
)
This function computes various statistics for comparing observed values 'y' with predicted values 'yhat'. It includes correlation, regression coefficients, bias, RMSE, MSE, and predictive performance metrics like RPD and RPIQ.
msd.comp(y, yhat)
| Argument | Description |
|---|---|
| y | Numeric vector of observed values. |
| yhat | Numeric vector of predicted values (same length as 'y'). |
A named numeric vector with the following components:
Pearson correlation between 'y' and 'yhat'
Intercept of regression of 'y' on 'yhat'
Slope of regression of 'y' on 'yhat'
Coefficient of determination (R-squared)
Mean bias: mean(yhat) - mean(y)
Root mean squared error
Mean squared error
Systematic bias component of MSE
Non-unity slope component of MSE
Lack-of-correlation component of MSE
Corrected RMSE after removing bias
Corrected MSE after removing bias
Ratio of standard deviation to RMSE (RPD)
Ratio of interquartile range to RMSE (RPIQ)
y_obs <- c(1.2, 3.4, 2.5, 4.1)
y_pred <- c(1.1, 3.5, 2.4, 4.0)
msd.comp(y_obs, y_pred)
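For readers who want to see how such statistics relate to each other, here is a minimal base R sketch. It is not the source of msd.comp(): the function name, the element names, and the exact formulas are assumptions. In particular, the MSE components use one common decomposition (squared bias, non-unity slope, lack of correlation, with population variances), and RPD and RPIQ use `sd()` and `IQR()` of the observed values.

```r
# Hypothetical sketch; element names and formulas are assumptions.
msd_sketch <- function(y, yhat) {
  stopifnot(length(y) == length(yhat))
  n    <- length(y)
  fit  <- lm(y ~ yhat)                 # regression of observed on predicted
  a    <- unname(coef(fit)[1])
  b    <- unname(coef(fit)[2])
  r    <- cor(y, yhat)
  bias <- mean(yhat) - mean(y)
  mse  <- mean((yhat - y)^2)
  rmse <- sqrt(mse)
  sb   <- bias^2                                        # systematic bias
  nu   <- (1 - b)^2 * sum((yhat - mean(yhat))^2) / n    # non-unity slope
  lc   <- (1 - r^2) * sum((y - mean(y))^2) / n          # lack of correlation
  c(r = r, intercept = a, slope = b, r2 = r^2, bias = bias,
    rmse = rmse, mse = mse, sb = sb, nu = nu, lc = lc,
    rmse_c = sqrt(mse - sb), mse_c = mse - sb,
    rpd = sd(y) / rmse, rpiq = IQR(y) / rmse)
}

m <- msd_sketch(c(1.2, 3.4, 2.5, 4.1), c(1.1, 3.5, 2.4, 4.0))
# Under this decomposition the three components sum to the MSE.
all.equal(unname(m["sb"] + m["nu"] + m["lc"]), unname(m["mse"]))
```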
This function aggregates VNIR (Visible and Near-Infrared) spectral data by calculating the mean of every 10 columns while removing specific detector artifact regions (~1000nm and ~1800nm) and unwanted spectral bands.
raw(vnir.matrix)
| Argument | Description |
|---|---|
| vnir.matrix | A numeric matrix or data frame containing VNIR spectral data. Each row corresponds to a sample, and each column corresponds to a spectral band. |
The function removes columns corresponding to detector artifacts:
- rm1: bands 1–46
- rm2: bands 637–666
- rm3: bands 1437–1466
- rm4: bands 2127–2151
Additionally, columns 1:4, 64:66, 144:146, and 213:214 (after averaging) are removed.
A data frame containing the aggregated VNIR spectra with cleaned band names. The number of columns is reduced by averaging over every 10 bands and removing artifact-prone regions.
raw_spectra <- raw(vnir_matrix)
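The index arithmetic behind the averaging and removal can be checked on synthetic data. This sketch assumes a matrix with 2151 band columns and assumes that the aggregation runs over columns 7 to 2146 (the range quoted for the merge helper); both points, and all object names, are assumptions rather than the package's actual implementation.

```r
set.seed(42)

# Synthetic spectra: 3 samples x 2151 bands (random values).
vnir <- matrix(runif(3 * 2151), nrow = 3)

# Average every 10 consecutive columns over columns 7..2146 (2140 columns).
agg_cols <- 7:2146
block <- rep(seq_len(length(agg_cols) / 10), each = 10)
averaged <- t(rowsum(t(vnir[, agg_cols]), block)) / 10   # 3 x 214

# Drop the averaged columns that overlap the listed artifact regions.
drop_avg <- c(1:4, 64:66, 144:146, 213:214)
cleaned <- averaged[, -drop_avg, drop = FALSE]

dim(cleaned)
```

Under these assumptions, averaged columns 64:66 correspond to original columns 637 to 666, and 144:146 to columns 1437 to 1466, which matches the rm2 and rm3 ranges; 214 averaged columns minus 12 removed leaves 202.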
These functions fit predictive models for soil properties using VNIR spectral data. Each function applies a specific machine learning method:
pcr_preprocess() – Principal Component Regression (PCR)
plsr_preprocess() – Partial Least Squares Regression (PLSR)
lasso_preprocess() – LASSO regression
rf_preprocess() – Random Forest regression
cubist_preprocess() – Cubist regression
results() computes mean performance metrics across multiple calibration and validation sets.
It is typically used to summarize the results of soil property prediction models
generated by preprocessing functions such as pcr_preprocess(),
plsr_preprocess(), lasso_preprocess(), rf_preprocess(),
or cubist_preprocess().
pcr_preprocess(soil, vnir.matrix, j, preprocess, type_of_soil)
plsr_preprocess(soil, vnir.matrix, j, preprocess, type_of_soil)
lasso_preprocess(soil, vnir.matrix, j, preprocess, type_of_soil)
rf_preprocess(soil, vnir.matrix, j, preprocess, type_of_soil)
cubist_preprocess(soil, vnir.matrix, j, preprocess, type_of_soil)
results(metric.list, soil_type)
| Argument | Description |
|---|---|
| soil | A data frame of soil properties. Must include the target soil variable. |
| vnir.matrix | A numeric matrix of VNIR spectral data. |
| j | A list of index vectors specifying calibration sample sets (e.g., from |
| preprocess | A preprocessing function to apply to the spectral data (e.g., smoothing, normalization). |
| type_of_soil | An integer index selecting which soil property column to model. |
| metric.list | A list of MSD metric objects returned by one of the preprocessing/model functions. Each element corresponds to a model fit on a calibration/validation split. |
| soil_type | Optional, an integer or string indicating which soil property was modeled (currently not used internally but kept for consistency). |
All functions use the same workflow:
Combine the selected soil property with preprocessed spectra.
Split data into calibration and validation sets (using sample indices).
Fit the chosen model across multiple calibration/validation partitions.
Generate predictions and compute performance metrics (MSD-based).
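The workflow above can be sketched end to end in base R. To stay self-contained, this example uses principal components followed by `lm()` as a stand-in for PCR; it is not the package's implementation. The synthetic data, the 5 retained components, the 28-sample calibration sets, and all object names are assumptions, and preprocessing is applied once to the full matrix for simplicity.

```r
set.seed(7)
n <- 40

# Synthetic spectra and a target that depends on the first three bands.
spectra <- matrix(rnorm(n * 20), nrow = n,
                  dimnames = list(NULL, paste0("b", 1:20)))
target <- drop(spectra[, 1:3] %*% c(1, -0.5, 0.25)) + rnorm(n, sd = 0.1)

# Step 1: preprocess the spectra (here simply scale()).
x <- scale(spectra)

# Step 2: four calibration index sets; the rest is used for validation.
cal_sets <- lapply(1:4, function(i) sample(n, 28))

# Steps 3-4: fit on calibration rows, predict validation rows, score.
fit_one <- function(cal, k = 5) {
  pca <- prcomp(x[cal, ], center = TRUE, scale. = FALSE)
  cal_df <- data.frame(y = target[cal], pca$x[, 1:k])
  fit <- lm(y ~ ., data = cal_df)
  val_scores <- predict(pca, newdata = x[-cal, ])[, 1:k]
  pred <- predict(fit, newdata = data.frame(val_scores))
  sqrt(mean((pred - target[-cal])^2))   # validation RMSE
}

rmse_val <- vapply(cal_sets, fit_one, numeric(1))
rmse_val
```

Each element of `rmse_val` is the validation RMSE for one calibration/validation partition.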
The model functions return a list of MSD metric objects for calibration and validation sets, specific to the fitted model.
results() returns a named numeric vector of mean performance metrics across all splits:
Latent variable / model index
Cross-validated R-squared for calibration set
Bias in cross-validation for calibration set
Root mean squared error in cross-validation for calibration set
Mean squared error for calibration set
Ratio of performance to interquartile distance for calibration set
R-squared for validation set
Bias for validation set
Root mean squared error for validation set
Mean squared error for validation set
Ratio of performance to interquartile distance for validation set
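Averaging metrics across splits can be sketched as below. The real structure of the MSD metric objects is not documented here, so this example assumes each split yields a named numeric vector; the two metric names and their values are made up for illustration.

```r
# Hypothetical per-split metrics (illustrative values only).
metric_list <- list(
  c(r2 = 0.80, rmse = 0.50),
  c(r2 = 0.70, rmse = 0.60)
)

# Stack the splits as rows and average each metric column.
mean_metrics <- colMeans(do.call(rbind, metric_list))
mean_metrics
```

The result keeps the metric names, giving one mean value per metric across all splits.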
merge_of_lab_and_spectrum, ml_f
# Example with PCR
results_pcr <- pcr_preprocess(soil, vnir.matrix, j, preprocess = scale, type_of_soil = 2)

# Example with Random Forest
results_rf <- rf_preprocess(soil, vnir.matrix, j, preprocess = scale, type_of_soil = 2)

msd_list <- pcr_preprocess(soil, vnir.matrix, j, preprocess = scale, type_of_soil = 2)
results_summary <- results(msd_list)