| Title: | Wasserstein-Based Regression for Ordinal Compositional Data |
|---|---|
| Description: | Tools analyzing regression models for ordinal compositional data using Wasserstein-based distances. The package includes linear programming solvers under simplex constraints, tensor product constructions and performance metrics. |
| Authors: | Nicola Piras [aut, cre], Monica Musio [aut], Beniamino Cappelletti-Montano [aut] |
| Maintainer: | Nicola Piras <[email protected]> |
| License: | GPL-2 |
| Version: | 0.1.0 |
| Built: | 2026-06-19 18:39:33 UTC |
| Source: | https://github.com/cran/OrdinalCompositions |
A dataset describing the satisfaction of residents with the quality of life in 83 European cities.
CitySatisfCitySatisf
A data frame with 83 observations and 33 variables:
city name
age classes
health care satisfaction
green spaces
air quality
life satisfaction
job opportunities
safety
housing
The data measures subjective well-being and satisfaction with urban living conditions across multiple domains such as healthcare, environment, safety, employment opportunities, and housing.
Responses are categorized into four levels: Very satisfied (VS), Rather satisfied (RS), Rather unsatisfied (RU), Very unsatisfied (VU).
The dataset can be used to study compositional ordinal regression models relating individual perceptions of urban quality of life to demographic groups.
The variable SL is typically used as a global indicator of life satisfaction.
European urban quality-of-life survey.
Computes an R-squared measure based on Wasserstein distances for compositional ordinal data.
compute_R2(Y, X, A, weights)compute_R2(Y, X, A, weights)
Y |
Response matrix |
X |
Input matrix |
A |
Transformation matrix |
weights |
Distance weights |
The Wasserstein coefficient of determination is defined as the proportional reduction in error relative to the baseline:
where as baseline the Fréchet mean is used.
A list with R2, Dres, Dtot, Fréchet mean and predictions
Computes the Fréchet mean in the Wasserstein geometry on the simplex. Given a sample of compositions, the cumulative distribution functions (CDFs) are first computed row-wise, the component-wise median of the CDFs is then obtained, and finally transformed back to the simplex by successive differencing.
compute_wfrechet_mean(mat)compute_wfrechet_mean(mat)
mat |
A numeric matrix whose rows are compositions. |
Let be the observed compositions and
the \(k\)-th component of their cumulative distribution
functions. The Fréchet mean is defined through
and the corresponding composition is recovered from the cumulative representation by finite differences.
A numeric vector representing the Wasserstein Fréchet mean of the sample compositions.
Y <- rbind( c(0.2, 0.5, 0.3), c(0.1, 0.6, 0.3), c(0.4, 0.2, 0.4) ) compute_wfrechet_mean(Y)Y <- rbind( c(0.2, 0.5, 0.3), c(0.1, 0.6, 0.3), c(0.4, 0.2, 0.4) ) compute_wfrechet_mean(Y)
Lambda grid
default_lambda_grid()default_lambda_grid()
Numeric vector of tested values for the regularization parameter
Numeric vector of lambda values
Education level of father (F) and mother (M) in percentages of low (l), medium (m), and high (h) educational attainment across European countries.
data(educFM)data(educFM)
A data frame with 31 observations and 8 variables:
Country identifier
Percentage of females with low education level
Percentage of females with medium education level
Percentage of females with high education level
Percentage of males with low education level
Percentage of males with medium education level
Percentage of males with high education level
The dataset contains compositional information on education levels for fathers and mothers across 31 European countries.
This dataset is originally provided in the context of compositional data analysis and is also available in the package robCompositions.
The data are expressed in percentages and represent compositional parts of educational attainment distributions. They are commonly used in compositional data analysis and regression models on the simplex.
Peter Filzmoser, Matthias Templ
Eurostat, https://ec.europa.eu/eurostat/
Filzmoser, P. & Templ, M. (various works on compositional data analysis)
Computes the Spearman rank correlation between two sets of ordinal probability distributions using their centers of mass.
OCC(A, B)OCC(A, B)
A |
Numeric matrix where each row represents a probability distribution |
B |
Numeric matrix with the same number of rows as |
The center of mass of each distribution is computed as the expected value over the ordinal support. The Spearman correlation is then calculated between the resulting vectors.
A numeric value representing the Spearman correlation coefficient
A <- matrix(c(0.2,0.3,0.5, 0.1,0.4,0.5), nrow = 2, byrow = TRUE) B <- matrix(c(0.3,0.3,0.4, 0.2,0.3,0.5), nrow = 2, byrow = TRUE) OCC(A, B)A <- matrix(c(0.2,0.3,0.5, 0.1,0.4,0.5), nrow = 2, byrow = TRUE) B <- matrix(c(0.3,0.3,0.4, 0.2,0.3,0.5), nrow = 2, byrow = TRUE) OCC(A, B)
This index represents the proportion of data pairs whose ordinal rank is preserved by the model among all pairs with distinct observed responses.
opi(weights, P, Q, tol = 1e-08)opi(weights, P, Q, tol = 1e-08)
weights |
Numeric vector of weights |
P |
Numeric vector representing the first distribution |
Q |
Numeric vector representing the second distribution |
tol |
Numeric tolerance for comparison (default: 1e-8) |
The function compares distributions by evaluating their Wasserstein distances to each unit vector basis. It determines the proportion of pairs for which the ordering is preserved.
The index is defined as
,
where is the number of pairs with such that
, and is the total number
of pairs with .
The function is defined as:
if
if
otherwise
Here, denotes the total Wasserstein order.
-1 if P < Q
1 if P > Q
0 if P and Q are equivalent
Ordinal Compositional Regression on the Simplex
ordinal_regression_simplex( xdata, ydata, lambda1_grid = default_lambda_grid(), lambda2_grid = default_lambda_grid(), method = c("cv", "gcv"), K = NULL, weights = NULL, weights_product = NULL, return_tensor_index = FALSE, do_bootstrap = FALSE, B = 1000, compute_opi = FALSE, compute_R2 = FALSE, compute_OCC = FALSE, do_bootstrap_order = FALSE )ordinal_regression_simplex( xdata, ydata, lambda1_grid = default_lambda_grid(), lambda2_grid = default_lambda_grid(), method = c("cv", "gcv"), K = NULL, weights = NULL, weights_product = NULL, return_tensor_index = FALSE, do_bootstrap = FALSE, B = 1000, compute_opi = FALSE, compute_R2 = FALSE, compute_OCC = FALSE, do_bootstrap_order = FALSE )
xdata |
List of compositional predictors.
|
ydata |
List of compositional response vectors |
lambda1_grid |
Grid of tuning parameters for |
lambda2_grid |
Grid of tuning parameters for |
method |
Character string specifying the regularization selection criterion. Either |
K |
Number of folds for cross-validation. If |
weights |
Wasserstein weights (default uniform) |
weights_product |
Weights used in the construction of the tensor product (default uniform) |
return_tensor_index |
Logical, return tensor index mapping |
do_bootstrap |
Logical, bootstrap inference using Wasserstein distance between matrices |
B |
Number of bootstrap samples |
compute_opi |
Logical, OPI index |
compute_R2 |
Logical, Wasserstein R2 |
compute_OCC |
Logical, OCC index |
do_bootstrap_order |
Logical, ordering bootstrap |
The function estimates a regression operator between compositional predictors and responses using Wasserstein geometry. Regularization is selected via generalized cross-validation. Optional bootstrap procedures provide confidence regions and additional statistics. The function supports both single compositional regression and multi-compositional tensor regression. In tensor mode, a design matrix is automatically constructed via tensor product:
The user does NOT need to precompute tensor products.
A list containing:
A_hat: Estimated regression matrix
bootstrap: Bootstrap results (if requested).
In the case of bootstrap procedure via Wasserstein distance between matrices the Fréchet mean is computed.
In the case of bootstrap procedure via the Wasserstein total order the inf, sup and median for each column are computed.
OCC: Ordinal Correlation Coefficient (if requested)
R2: Wasserstein Index (if requested)
OPI: Ordinal Preservation Index (if requested)
if (requireNamespace("codalm", quietly = TRUE)) { data(educFM) # --- preprocessing --- father <- as.matrix(educFM[, 2:4]) ya <- father / rowSums(father) mother <- as.matrix(educFM[, 5:7]) xa <- mother / rowSums(mother) x <- cbind(xa[,3], xa[,2], xa[,1]) y <- cbind(ya[,3], ya[,2], ya[,1]) ydata <- split(y, seq_len(nrow(y))) xdata <- split(x, seq_len(nrow(x))) weights_orig <- rep(1, ncol(x) - 1) # --- model --- res <- ordinal_regression_simplex( xdata, ydata, weights = weights_orig,lambda2_grid=0, compute_opi=TRUE,compute_R2=TRUE,compute_OCC=TRUE ) }if (requireNamespace("codalm", quietly = TRUE)) { data(educFM) # --- preprocessing --- father <- as.matrix(educFM[, 2:4]) ya <- father / rowSums(father) mother <- as.matrix(educFM[, 5:7]) xa <- mother / rowSums(mother) x <- cbind(xa[,3], xa[,2], xa[,1]) y <- cbind(ya[,3], ya[,2], ya[,1]) ydata <- split(y, seq_len(nrow(y))) xdata <- split(x, seq_len(nrow(x))) weights_orig <- rep(1, ncol(x) - 1) # --- model --- res <- ordinal_regression_simplex( xdata, ydata, weights = weights_orig,lambda2_grid=0, compute_opi=TRUE,compute_R2=TRUE,compute_OCC=TRUE ) }
Generates a Dirichlet random vector ensuring numerical stability by enforcing a minimum threshold on parameters.
safe_dirichlet(alpha, scale = 10, eps = 1e-06)safe_dirichlet(alpha, scale = 10, eps = 1e-06)
alpha |
Numeric vector of Dirichlet parameters |
scale |
Scaling factor applied to alpha |
eps |
Minimum threshold to avoid zeros or non-finite values |
A numeric vector sampled from a Dirichlet distribution
Selects the optimal pair of regularization parameters
and for a Wasserstein-based model.
The selection can be performed either through K-fold cross-validation
(method = "cv") or generalized cross-validation
(method = "gcv").
select_lambda( P, P_prime, weights, lambda1_grid, lambda2_grid, method = c("cv", "gcv"), K = NULL, verbose = FALSE )select_lambda( P, P_prime, weights, lambda1_grid, lambda2_grid, method = c("cv", "gcv"), K = NULL, verbose = FALSE )
P |
List of input probability vectors (predictors). |
P_prime |
List of target probability vectors (responses). |
weights |
Numeric vector of weights used in the loss function. |
lambda1_grid |
Numeric vector of candidate values for the first regularization parameter. |
lambda2_grid |
Numeric vector of candidate values for the second regularization parameter. |
method |
Character string specifying the selection criterion.
Either |
K |
Number of folds for cross-validation. If |
verbose |
Logical; if |
For each pair in the supplied grids,
the function estimates the transport matrix by solving the
constrained optimization problem implemented in
solve_simplex_lp.
When method = "cv", the optimal parameters are selected by
minimizing the prediction error computed on validation folds.
The loss is based on weighted Wasserstein-type discrepancies between
cumulative distributions.
When method = "gcv", the fit is computed on the full sample and
the effective degrees of freedom are estimated from the singular values
of according to
where are the singular values of . The GCV score is
then given by
A list containing the selected regularization parameters and the corresponding validation scores.
If method = "cv", the returned list contains:
method: "cv".
best_lambda1: selected value of .
best_lambda2: selected value of .
cv_values: matrix of cross-validation errors.
lambda1_grid: grid of candidate values.
lambda2_grid: grid of candidate values.
K: number of folds.
If method = "gcv", the returned list contains:
method: "gcv".
best_lambda1: selected value of .
best_lambda2: selected value of .
score_values: GCV scores for all parameter combinations.
fit_values: fit values for all parameter combinations.
lambda1_grid: grid of candidate values.
lambda2_grid: grid of candidate values.
# Example with simple synthetic data P <- list(c(0.5, 0.5), c(0.3, 0.7)) P_prime <- list(c(0.4, 0.6), c(0.2, 0.8)) w <- c(1) lambda_grid <- c(0, 0.01, 0.1) select_lambda(P, P_prime, w, lambda_grid, lambda_grid)# Example with simple synthetic data P <- list(c(0.5, 0.5), c(0.3, 0.7)) P_prime <- list(c(0.4, 0.6), c(0.2, 0.8)) w <- c(1) lambda_grid <- c(0, 0.01, 0.1) select_lambda(P, P_prime, w, lambda_grid, lambda_grid)
Solves a linear programming problem with simplex constraints and optional regularization for Wasserstein-based fitting.
solve_simplex_lp(P, P_prime, weights, lambda1 = 0, lambda2 = 0)solve_simplex_lp(P, P_prime, weights, lambda1 = 0, lambda2 = 0)
P |
List of input probability vectors |
P_prime |
List of target probability vectors |
weights |
Numeric vector of weights |
lambda1 |
First-order Regularization parameter |
lambda2 |
Second-order Regularization parameter |
Consider a sample of N pairs . Here, denotes the predictor, while denotes the ordinal response distribution.
We assume a linear dependence structure defined by:
where the regression coefficient is a column-stochastic (CS) matrix.
The estimation of A is performed by minimizing the loss function L(A), defined as
Once the associated cumulative representation is define, the regularized objective function is:
where are tuning parameters controlling the trade-off between goodness-of-fit and the smoothness of the estimated conditional distributions.
A list containing the optimal matrix A
Computes the tensor product of multiple compositional vectors, optionally using weighted cumulative costs for ordering.
tensor_product(comps, weights = NULL, return_indices = FALSE)tensor_product(comps, weights = NULL, return_indices = FALSE)
comps |
List of compositional vectors |
weights |
Optional list of weight vectors |
return_indices |
Logical; whether to return index combinations |
A list containing the product vector (and optionally indices)
P <- c(0.2,0.3,0.5) Q <- c(0.4,0.3,0.2,0.1) a<- c(5,15) b <- c(2,4,8) tensor_product(comps=list(P,Q), weights = list(a,b), return_indices = TRUE)P <- c(0.2,0.3,0.5) Q <- c(0.4,0.3,0.2,0.1) a<- c(5,15) b <- c(2,4,8) tensor_product(comps=list(P,Q), weights = list(a,b), return_indices = TRUE)
Computes the weighted first-order Wasserstein distance between two discrete probability distributions defined on an ordered support.
wd(weights, P, Q)wd(weights, P, Q)
weights |
Numeric vector of non-negative weights of length n |
P |
Numeric vector representing the first probability distribution in the Simplex |
Q |
Numeric vector representing the second probability distribution in the Simplex |
The distance is computed as the weighted sum of absolute differences
between the cumulative distribution functions of P and Q,
excluding the last component.
where and is defined analogously.
A numeric value representing the weighted Wasserstein distance
Computes the weighted Wasserstein distance between two matrices by summing the column-wise Wasserstein distances.
wd_matrix(A, B, weights)wd_matrix(A, B, weights)
A |
Numeric matrix where each column represents a probability distribution. |
B |
Numeric matrix with the same dimensions as |
weights |
Numeric vector of weights of length m |
The distance is computed column by column using cumulative distributions and then aggregated across all columns.
A numeric value representing the total Wasserstein distance