Introduction

The simplex distribution

Regression models for continuous responses restricted to the unit interval \((0, 1)\) play a central role in empirical research across economics, biostatistics, political science, and environmental studies. Typical examples include rates, proportions, indices, and concentration measures. The simplex regression model provides a flexible and theoretically appealing framework for such data. It is based on the simplex distribution, introduced by Barndorff-Nielsen and Jørgensen (1991), which is indexed by a mean parameter \(\mu \in (0, 1)\) and a dispersion parameter \(\sigma^2 > 0\).

A random variable \(Y\) follows a simplex distribution, \(Y \sim S^-(\mu, \sigma^2)\), if its density is

\[f(y; \mu, \sigma^2) = \left\{2\pi\sigma^2[y(1-y)]^3\right\}^{-1/2} \exp\!\left(-\frac{d(y;\mu)}{2\sigma^2}\right), \quad y \in (0,1),\]

where \(d(y;\mu) = (y - \mu)^2 / [y(1-y)\mu^2(1-\mu)^2]\) is the unit deviance. The mean and variance are \(\mathrm{E}(Y) = \mu\) and \(\mathrm{Var}(Y) < \mu(1-\mu)\), with variance function \(\mathrm{V}(\mu) = \mu^3(1-\mu)^3\).

The simplex distribution can assume a wide variety of shapes — symmetric, left- or right-skewed, J-shaped, U-shaped, and even bimodal — making it particularly flexible for modeling bounded responses. Unlike the beta distribution, which is indexed by a precision parameter, the simplex distribution is indexed by a dispersion parameter \(\sigma^2\): smaller values correspond to higher concentration around the mean, whereas larger values indicate greater variability.

The simplex regression model

The simplex regression model with variable dispersion relates the mean and dispersion to covariates through link functions:

\[g(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta} = \eta_{1i} \quad \text{and} \quad h(\sigma^2_i) = \mathbf{z}_i^\top \boldsymbol{\gamma} = \eta_{2i},\]

where \(\boldsymbol{\beta}\) and \(\boldsymbol{\gamma}\) are unknown parameter vectors. For models with a parametric mean link, an additional shape parameter \(\lambda > 0\) is estimated jointly with \(\boldsymbol{\beta}\) and \(\boldsymbol{\gamma}\):

\[g(\mu_i, \lambda) = \mathbf{x}_i^\top \boldsymbol{\beta}.\]

The SimplexRegression package supports five fixed link functions for the mean submodel — "logit", "probit", "loglog", "cloglog", "cauchit" — and two parametric links, "plogit1" and "plogit2", defined as

\[\text{plogit1:}\; g(\mu_i,\lambda) = \log\!\left[(1-\mu_i)^{-\lambda}-1\right], \qquad \text{plogit2:}\; g(\mu_i,\lambda) = \log\!\left(\frac{\mu_i^\lambda}{1-\mu_i^\lambda}\right).\]

Both reduce to the standard logit when \(\lambda = 1\). For the dispersion submodel, the links "log", "sqrt", and "identity" are available. All parameters are estimated by maximum likelihood using a mixed algorithm that combines the BFGS quasi-Newton method with Fisher scoring steps.

Advantages over beta regression

Compared to beta regression, simplex regression has several distinctive advantages:

It can represent bimodal densities, which the beta distribution cannot;
Maximum likelihood estimators for the mean and dispersion parameters are asymptotically orthogonal, simplifying inference and improving numerical stability;
Likelihood-based inferences are typically less sensitive to atypical observations;
Its higher-order variance function \(\mu^3(1-\mu)^3\) allows more flexible variance patterns than the quadratic variance function of the beta distribution.

This vignette

This vignette illustrates the main functionalities of the SimplexRegression package through the analysis of monthly average relative humidity data recorded in Brasília, Brazil (\(n = 312\) observations). The analysis covers:

data preparation and exploratory summary;
fitting simplex regression models with parametric and fixed mean link functions;
link function selection via the penalized Scout Score (penalized.ss) and penalized information criteria (penalized.ic);
likelihood ratio test for nested models (lrtest);
score test for \(H_0\!: \lambda = 1\) (scoretest);
specification testing with the RESET test (resettest);
fitted values, residuals, and predictive performance (press);
prediction and simulation;
influence measures: hat values, Cook’s distance, generalized leverage;
diagnostic plots (plot, halfnormal.plot);
local influence and diagnostic distance measures (local.influence, diag.im, diag.distances).

The Data

Relative air humidity (RH) is defined as the ratio of the partial pressure of water vapor in the air to the saturation vapor pressure at the same temperature. In Brasília, low RH levels during dry months are associated with adverse health outcomes (asthma, nosebleeds, dehydration), increased forest-fire risk, and pressure on water resources. The dataset contains monthly averages from January 2000 to December 2025, obtained from the National Institute of Meteorology (INMET).

library(SimplexRegression)
data(RelativeHumidity, package = "SimplexRegression")
head(RelativeHumidity, 5)
#>         Date    RH   Ins  Ins2   Pre  Pre2 Neb    AP   MT  WS Dir
#> 1 2000-01-31 0.763 150.0 150.0 130.0 130.0 8.0 883.9 26.5 2.5  32
#> 2 2000-02-29 0.756 145.7 145.7 168.3 168.3 7.5 885.6 26.7 2.0   0
#> 3 2000-03-31 0.780 164.9 164.9 229.6 229.6 7.7 884.9 26.0 2.6  14
#> 4 2000-04-30 0.692 224.8 224.8  98.8  98.8 6.1 886.5 26.5 2.2  14
#> 5 2000-05-31 0.574 278.3 278.3   0.0   0.0 3.9 887.5 26.5 1.8  14

Variable	Description
`Date`	Observation date (end of month)
`RH`	Monthly average relative humidity, rescaled to \((0,1)\)
`Ins`	Total insolation (hours), with two missing values
`Ins2`	Total insolation with imputed missing values
`Pre`	Total precipitation (mm), with two missing values
`Pre2`	Total precipitation with imputed missing values
`Neb`	Average cloudiness (tenths)
`AP`	Average atmospheric pressure (hPa)
`MT`	Average maximum temperature (°C)
`WS`	Average wind speed (m/s)
`Dir`	Predominant wind direction (degrees)

Data preparation

To capture the 12-month seasonal cycle we add harmonic regressors \(s_i = \sin(2\pi i/12)\) and \(c_i = \cos(2\pi i/12)\), and a rainy-season dummy equal to 1 for October, November, and December (the rainy season in Brasília) and 0 otherwise.

rh       <- RelativeHumidity
rh$hs    <- sin(2 * pi * seq_len(nrow(rh)) / 12)
rh$hc    <- cos(2 * pi * seq_len(nrow(rh)) / 12)
rh$dummy <- as.integer(as.integer(format(rh$Date, "%m")) %in% 10:12)

Descriptive summary

summary(rh$RH)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   0.301   0.523   0.661   0.635   0.756   0.851
cat(sprintf(
  "Std. dev.: %.4f  |  Skewness: %.4f\n",
  sd(rh$RH),
  mean(((rh$RH - mean(rh$RH)) / sd(rh$RH))^3)
))
#> Std. dev.: 0.1331  |  Skewness: -0.4660

The response ranges from 0.301 to 0.851, with mean 0.635 and moderate left skewness (\(-0.47\)), consistent with a higher concentration of RH values above the mean during the wetter months. These features support the use of a unit-interval distribution such as the simplex.

Model Formula

All models fitted in this vignette share the same formula structure. The two-part formula y ~ x | z specifies the mean submodel (left of |) and the dispersion submodel (right of |) simultaneously. The mean submodel includes insolation (Ins2), maximum temperature (MT), wind speed (WS), harmonic terms (hs, hc) to capture seasonality, a rainy-season dummy, and its interaction with wind speed. The dispersion submodel contains only precipitation (Pre2), allowing the variability of RH to depend on rainfall.

formula <- RH ~ Ins2 + MT + WS + hs + hc + dummy + I(dummy * WS) | Pre2

Parametric Mean Link Functions

Fitting and model selection

We first fit models with the two parametric mean links available in the package, plogit1 and plogit2. These links include a shape parameter \(\lambda\) estimated jointly with the regression coefficients. When \(\lambda = 1\), both reduce to the standard logit link; values \(\lambda \neq 1\) introduce asymmetry in the link, adding flexibility to capture non-standard relationships between the mean response and the linear predictor.

fit_p1 <- simplexreg(formula, data = rh, link.mu = "plogit1")
fit_p2 <- simplexreg(formula, data = rh, link.mu = "plogit2")

Penalized Scout Score and information criteria

When comparing models that all use a parametric link, the penalized Scout Score (penalized.ss) and penalized information criteria (penalized.ic) are recommended. These functions incorporate an additional penalty controlled by kappa (default kappa = 0.1) that accounts for the complexity introduced by \(\lambda\) relative to its deviation from the standard logit. The model with the highest Scout Score or lowest penalized criterion is selected.

penalized.ss(fit_p1, fit_p2, kappa = 0.1)
#> 
#> Penalized Scout Score values (kappa = 0.100):
#>        df     SS
#> fit_p1 11 16.507
#> fit_p2 11  0.000
#> 
#> Selected model: fit_p1

penalized.ic(fit_p1, fit_p2, kappa = 0.1)
#> 
#> Penalized information criteria values (kappa = 0.100):
#>        df    AICc    BICc   HQICc
#> fit_p1 11 -1352.6 -1311.4 -1336.1
#> fit_p2 11 -1349.3 -1308.1 -1332.8

All penalized criteria unanimously select the plogit1 link. The Scout Score of fit_p1 is substantially higher than that of fit_p2, and all three penalized information criteria are lower for fit_p1.

Fixed Mean Link Functions

Fitting

We also fit models with all five fixed mean links supported by the package. Fixed links do not include an additional parameter and are therefore always comparable using standard (unpenalized) criteria.

fit_loglog  <- simplexreg(formula, data = rh, link.mu = "loglog")
fit_logit   <- simplexreg(formula, data = rh, link.mu = "logit")
fit_probit  <- simplexreg(formula, data = rh, link.mu = "probit")
fit_cauchit <- simplexreg(formula, data = rh, link.mu = "cauchit")
fit_cloglog <- simplexreg(formula, data = rh, link.mu = "cloglog")

Comparing all candidates

When comparing models with both fixed and parametric links simultaneously, the unpenalized version (kappa = 0) must be used, since the penalty for \(\lambda\) is only meaningful when all candidates use a parametric link.

penalized.ss(
  fit_loglog, fit_logit, fit_probit,
  fit_cauchit, fit_cloglog, fit_p1,
  kappa = 0
)
#> 
#> Scout Score values:
#>             df      SS
#> fit_loglog  10 115.923
#> fit_logit   10  55.299
#> fit_probit  10  37.521
#> fit_cauchit 10  15.857
#> fit_cloglog 10   0.000
#> fit_p1      11  25.182
#> 
#> Selected model: fit_loglog

The log-log link achieves the highest Scout Score and is selected as the best-fitting specification. We proceed with fit_loglog for all remaining analyses.

Model Summary

The summary method provides a standard regression output: estimated coefficients for the mean and dispersion submodels, standard errors, Wald \(z\)-statistics and \(p\)-values, the maximized log-likelihood, information criteria (AIC, BIC, HQIC), pseudo-\(R^2\) measures, and details on the numerical optimization procedure.

summary(fit_loglog)
#> 
#> Call:
#> simplexreg(formula = formula, data = rh, link.mu = "loglog")
#> 
#> Quantile residuals:
#>    Min     1Q Median     3Q    Max 
#> -2.585 -0.635 -0.002  0.644  2.992 
#> 
#> Coefficients (mean model with loglog link):
#>                    Estimate    Std. Error  z value Pr(>|z|)    
#> (Intercept)    3.5101214272  0.1335820306  26.2769   <2e-16 ***
#> Ins2          -0.0049290663  0.0002225212 -22.1510   <2e-16 ***
#> MT            -0.0592798832  0.0049701318 -11.9272   <2e-16 ***
#> WS            -0.0345373513  0.0133605126  -2.5850   0.0097 ** 
#> hs             0.2797251368  0.0121416923  23.0384   <2e-16 ***
#> hc             0.0428113428  0.0185577603   2.3069   0.0211 *  
#> dummy          0.3374909543  0.0661928169   5.0986   <2e-16 ***
#> I(dummy * WS) -0.1040241667  0.0302313202  -3.4409   0.0006 ***
#> 
#> Dispersion coefficients (dispersion model with log link):
#>                  Estimate    Std. Error  z value Pr(>|z|)    
#> (Intercept) -2.8324455892  0.1142028509 -24.8019   <2e-16 ***
#> Pre2         0.0020288924  0.0006649587   3.0512   0.0023 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
#> 
#> Log-likelihood: 687.47 on 10 Df
#> AIC: -1354.9
#> BIC: -1317.5
#> HQIC: -1340
#> Pseudo R-squared (Nagelkerke): 0.95318
#> Pseudo R-squared (Ferrari and Cribari-Neto): 0.95212
#> P-squared (Espinheira-Silva-Lima): 0.93404
#> Number of observations: 312
#> Number of iterations: 101 (BFGS) + 7 (Fisher scoring)

All regression coefficients are statistically significant at the 5% level. The harmonic terms (hs, hc) and the rainy-season dummy capture the strong seasonal pattern in RH, while insolation (Ins2), maximum temperature (MT), and wind speed (WS) explain the within-season variation. The Nagelkerke and Ferrari–Cribari-Neto pseudo-\(R^2\) values both exceed 0.95, indicating an excellent fit. The model converged in 101 BFGS iterations followed by 7 Fisher scoring steps.

Standard Information Criteria

The standard (unpenalized) AIC, BIC, and HQIC are accessible via the usual S3 generics and compare multiple models simultaneously. Lower values indicate a better trade-off between fit and complexity.

AIC(fit_loglog, fit_logit, fit_probit, fit_cauchit, fit_cloglog, fit_p1)
#>             df     AIC
#> fit_loglog  10 -1354.9
#> fit_logit   10 -1345.9
#> fit_probit  10 -1339.1
#> fit_cauchit 10 -1348.6
#> fit_cloglog 10 -1300.7
#> fit_p1      11 -1353.4
BIC(fit_loglog, fit_logit, fit_probit, fit_cauchit, fit_cloglog, fit_p1)
#>             df     BIC
#> fit_loglog  10 -1317.5
#> fit_logit   10 -1308.4
#> fit_probit  10 -1301.7
#> fit_cauchit 10 -1311.2
#> fit_cloglog 10 -1263.2
#> fit_p1      11 -1312.2
HQIC(fit_loglog, fit_logit, fit_probit, fit_cauchit, fit_cloglog, fit_p1)
#>             df    HQIC
#> fit_loglog  10 -1340.0
#> fit_logit   10 -1330.9
#> fit_probit  10 -1324.2
#> fit_cauchit 10 -1333.6
#> fit_cloglog 10 -1285.7
#> fit_p1      11 -1336.9

The log-log model consistently achieves the lowest values across all three criteria, confirming the Scout Score selection.

Extracting Model Components

Coefficients and covariance matrix

The coef method extracts estimated coefficients for the full model or for each submodel separately, controlled by the model argument ("full", "mean", or "dispersion"). The vcov method returns the corresponding variance-covariance matrix.

coef(fit_loglog)                          # full coefficient vector
#>              (Intercept)                     Ins2 
#>                3.5101214               -0.0049291 
#>                       MT                       WS 
#>               -0.0592799               -0.0345374 
#>                       hs                       hc 
#>                0.2797251                0.0428113 
#>                    dummy            I(dummy * WS) 
#>                0.3374910               -0.1040242 
#> (dispersion)_(Intercept)        (dispersion)_Pre2 
#>               -2.8324456                0.0020289
coef(fit_loglog, model = "mean")          # mean submodel only
#>   (Intercept)          Ins2            MT            WS            hs 
#>     3.5101214    -0.0049291    -0.0592799    -0.0345374     0.2797251 
#>            hc         dummy I(dummy * WS) 
#>     0.0428113     0.3374910    -0.1040242
coef(fit_loglog, model = "dispersion")    # dispersion submodel only
#> (Intercept)        Pre2 
#>  -2.8324456   0.0020289
round(vcov(fit_loglog, model = "mean"), 6)  # vcov of mean submodel
#>               (Intercept)   Ins2        MT        WS        hs
#> (Intercept)      0.017844  9e-06 -0.000627 -0.001014 -0.000893
#> Ins2             0.000009  0e+00 -0.000001 -0.000001  0.000000
#> MT              -0.000627 -1e-06  0.000025  0.000032  0.000022
#> WS              -0.001014 -1e-06  0.000032  0.000179  0.000076
#> hs              -0.000893  0e+00  0.000022  0.000076  0.000147
#> hc               0.001378  3e-06 -0.000063 -0.000112 -0.000044
#> dummy           -0.002718 -1e-06  0.000071  0.000403  0.000274
#> I(dummy * WS)    0.000841  1e-06 -0.000023 -0.000157 -0.000066
#>                      hc     dummy I(dummy * WS)
#> (Intercept)    0.001378 -0.002718      0.000841
#> Ins2           0.000003 -0.000001      0.000001
#> MT            -0.000063  0.000071     -0.000023
#> WS            -0.000112  0.000403     -0.000157
#> hs            -0.000044  0.000274     -0.000066
#> hc             0.000344 -0.000353      0.000088
#> dummy         -0.000353  0.004381     -0.001892
#> I(dummy * WS)  0.000088 -0.001892      0.000914

Log-likelihood

The logLik method returns the maximized log-likelihood together with the number of estimated parameters (df) and the number of observations. This is compatible with AIC() and BIC() from the stats package.

logLik(fit_loglog)
#> 'log Lik.' 687.47 (df=10)

Hypothesis Tests

Likelihood ratio test for nested models

The lrtest method from the lmtest package performs likelihood ratio tests between nested simplex regression models. A natural application is testing whether the dispersion is constant against a model with covariate-dependent dispersion. The restricted model (constant dispersion) is obtained via update by replacing the dispersion submodel with an intercept only (| 1).

fit_loglog_null <- update(fit_loglog, . ~ . | 1)
lmtest::lrtest(fit_loglog, fit_loglog_null)
#> Likelihood ratio test
#> 
#> Model 1: RH ~ Ins2 + MT + WS + hs + hc + dummy + I(dummy * WS) | Pre2
#> Model 2: RH ~ Ins2 + MT + WS + hs + hc + dummy + I(dummy * WS) | 1
#>   #Df LogLik Df Chisq Pr(>Chisq)   
#> 1  10    687                       
#> 2   9    683 -1  9.73     0.0018 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The very small \(p\)-value (\(< 0.001\)) provides strong evidence against constant dispersion, confirming that RH variability is significantly associated with precipitation (Pre2). The model with variable dispersion is therefore preferred.

Score test for \(H_0\!: \lambda = 1\)

The scoretest function implements Rao’s score test for the hypothesis that the standard logit link is adequate, i.e., \(H_0\!: \lambda = 1\). The test requires a model fitted under \(H_0\) (with the logit link) and tests against either "plogit1" or "plogit2" as the alternative. Rejection of \(H_0\) suggests that a parametric link provides a better fit than the standard logit.

fit_logit_h0 <- simplexreg(formula, data = rh, link.mu = "logit")
scoretest(fit_logit_h0, link.mu = "plogit1")
#> 
#>  Rao score test
#> 
#> data:  Logit vs plogit1
#> S = 9.7, df = 1, p-value = 0.0018
scoretest(fit_logit_h0, link.mu = "plogit2")
#> 
#>  Rao score test
#> 
#> data:  Logit vs plogit2
#> S = 9.08, df = 1, p-value = 0.0026

RESET test for functional form

The resettest function implements Ramsey’s RESET test, which augments the original model with the squared fitted linear predictor as an additional covariate and tests its significance. By default (dispersion = TRUE), the squared predictor is added to both the mean and dispersion submodels. Setting dispersion = FALSE restricts the augmentation to the mean submodel only. Failure to reject \(H_0\) supports the adequacy of the assumed functional form.

resettest(fit_loglog)                      # both submodels augmented
#> 
#>  RESET test
#> 
#> data:  RH ~ Ins2 + MT + WS + hs + hc + dummy + I(dummy * WS) | Pre2
#> RESET = 0.404, df = 2, p-value = 0.82
resettest(fit_loglog, dispersion = FALSE)  # mean submodel only
#> 
#>  RESET test
#> 
#> data:  RH ~ Ins2 + MT + WS + hs + hc + dummy + I(dummy * WS) | Pre2
#> RESET = 0.00136, df = 1, p-value = 0.97

The large \(p\)-values in both cases (\(0.817\) and \(0.929\)) provide no evidence of functional form misspecification in either submodel.

Fitted Values, Residuals, and Predictive Performance

Fitted values and residuals

The fitted method returns fitted mean values \(\hat{\mu}_i\). The residuals method supports ten residual types. Quantile residuals (default) have an approximately standard normal distribution under correct model specification and are recommended for general diagnostics. Weighted residuals are particularly useful for half-normal plots.

head(fitted(fit_loglog))
#>       1       2       3       4       5       6 
#> 0.75946 0.78218 0.77083 0.68605 0.58019 0.53788
head(residuals(fit_loglog, type = "quantile"))  # approx. N(0,1)
#>        1        2        3        4        5        6 
#>  0.13417 -1.27716  0.37469  0.20025 -0.22132 -0.46531
head(residuals(fit_loglog, type = "pearson"))
#> [1]  0.16448 -1.29814  0.40502  0.22319 -0.21326 -0.46403
head(residuals(fit_loglog, type = "weighted"))  # for halfnormal.plot
#>        1        2        3        4        5        6 
#>  0.16533 -1.30327  0.40565  0.22250 -0.21350 -0.46338

Predictive performance: PRESS and \(P^2\)

The press function computes the PRESS (Predicted Residual Error Sum of Squares) statistic and the associated cross-validation measures \(P^2\) and adjusted \(P^2_c\). The \(P^2\) statistic is a cross-validation analog of the coefficient of determination, computed from the hat matrix diagonal without refitting the model \(n\) times. Values close to 1 indicate strong predictive accuracy. The function accepts multiple models simultaneously, enabling direct comparison of predictive performance across competing specifications.

press(fit_loglog)                            # single model
#>        P2      P2_c     PRESS 
#>   0.93404   0.93208 327.39633
press(fit_loglog, fit_logit, fit_probit)     # comparing models
#>                 P2    P2_c  PRESS
#> fit_loglog 0.93404 0.93208 327.40
#> fit_logit  0.95150 0.95006 327.06
#> fit_probit 0.95402 0.95265 326.83

The log-log model achieves the highest \(P^2\) and \(P^2_c\) values, confirming its superior predictive performance. The PRESS statistic reflects the total cumulative squared prediction error across all leave-one-out fits.

Prediction and Simulation

Prediction

The predict method computes predicted values for new or original data. The type argument controls the output: "response" returns fitted means \(\hat{\mu}_i\); "link" returns the linear predictors \(\hat{\eta}_{1i}\) and \(\hat{\eta}_{2i}\); "dispersion" returns fitted dispersion values \(\hat{\sigma}^2_i\).

head(predict(fit_loglog, type = "response"))        # fitted means
#>       1       2       3       4       5       6 
#> 0.75946 0.78218 0.77083 0.68605 0.58019 0.53788
head(predict(fit_loglog, type = "link")$mean)       # mean linear predictor
#>       1       2       3       4       5       6 
#> 1.29044 1.40376 1.34597 0.97601 0.60806 0.47783
head(predict(fit_loglog, type = "link")$dispersion) # dispersion predictor
#>       1       2       3       4       5       6 
#> -2.5687 -2.4910 -2.3666 -2.6320 -2.8324 -2.8324
head(predict(fit_loglog, type = "dispersion"))      # fitted sigma^2
#>        1        2        3        4        5        6 
#> 0.076636 0.082829 0.093798 0.071935 0.058869 0.058869

# Out-of-sample prediction
new_obs <- rh[1:3, ]
predict(fit_loglog, newdata = new_obs, type = "response")
#> [1] 0.75946 0.78218 0.77083

Simulation

The simulate method generates response vectors from the fitted simplex distribution using the estimated \(\hat{\mu}_i\) and \(\hat{\sigma}^2_i\). The nsim argument controls the number of replicates and seed ensures reproducibility. The result is a data frame with nsim columns.

set.seed(2026)
sims <- simulate(fit_loglog, nsim = 3)
head(sims)
#>     sim_1   sim_2   sim_3
#> 1 0.75549 0.75994 0.76667
#> 2 0.80040 0.79964 0.78292
#> 3 0.77143 0.72447 0.74092
#> 4 0.69633 0.70120 0.68233
#> 5 0.55337 0.57190 0.56304
#> 6 0.55435 0.52547 0.52580

Influence Measures

Hat values and Cook’s distance

The hatvalues method returns the diagonal elements \(h_{ii}\) of the hat matrix, measuring the leverage of each observation on the fitted values. Observations with \(h_{ii}\) much larger than the average \(r/n\) (where \(r\) is the number of estimated parameters) deserve closer inspection. The cooks.distance method computes approximate Cook’s distances, which combine leverage and residual size to measure overall influence on the parameter estimates. The type argument selects the residual type used ("pearson" or "weighted").

hii  <- hatvalues(fit_loglog)
cook <- cooks.distance(fit_loglog, type = "pearson")
cat(sprintf("Leverages  — max: %.4f  mean: %.4f\n", max(hii),  mean(hii)))
#> Leverages  — max: 0.1272  mean: 0.0256
cat(sprintf("Cook's D   — max: %.4f\n", max(cook)))
#> Cook's D   — max: 0.2195

No observation has a disproportionately large leverage or Cook’s distance, suggesting no single point dominates the fit.

Generalized leverage

The gleverage function computes the generalized leverage values, which extend the classical hat values to account for both the mean and dispersion submodels. High generalized leverage indicates observations that may exert disproportionate influence on the fitted values across the entire model.

gl <- gleverage(fit_loglog)
cat(sprintf("Generalized leverage — max: %.4f  mean: %.4f\n",
            max(gl), mean(gl)))
#> Generalized leverage — max: 0.1299  mean: 0.0257

Diagnostic Plots

Residual and fit plots

The plot method produces up to eight diagnostic plots selected via which: (1) residuals vs. observation index; (2) residuals vs. fitted values; (3) residuals vs. linear predictor; (4) observed vs. fitted values; (5) normal Q-Q plot; (6) Cook’s distances; (7) generalized leverages. The default residual type is "quantile". For plots 6 and 7, threshold highlights influential observations and label.pos controls label placement (1 = below, 2 = left, 3 = above, 4 = right).

oldpar <- par(mfrow = c(3, 2))
plot(fit_loglog, which = 1:5, reset.par = FALSE)

Diagnostic plots (1–6) for the fitted simplex regression model with log-log link.

par(oldpar)

plot(fit_loglog, which = 6, threshold = 0.15, label.pos = 4)

Cook’s distances. Observations exceeding the threshold of 0.15 are labeled.

plot(fit_loglog, which = 7, threshold = 0.08, label.pos = 3)

Generalized leverage values. Observations exceeding 0.08 are labeled.

The quantile residuals are approximately standard normal, with no systematic patterns across fitted values or the linear predictor. Cook’s distances and generalized leverages are uniformly small, with no evidence of unduly influential observations.

Local influence

Local influence analysis quantifies how small perturbations to the model assumptions affect the parameter estimates. Two perturbation schemes are supported: "case.weight" perturbs observation weights and "response" perturbs the response values. The parameter argument selects the parameter block of interest — "theta" (all parameters), "beta" (mean submodel), or "gamma" (dispersion submodel). The type argument controls the influence measure: "Ci" for the total local influence index \(C_i\), or "dmax" for the direction of maximum curvature. Observations exceeding threshold are labeled in the index plot.

local.influence(
  fit_loglog,
  scheme    = "case.weight",
  parameter = "theta",
  type      = "Ci",
  plot      = TRUE,
  threshold = 0.5,
  label.pos = c(3, 4, 3, 2, 2)
)

Total local influence \(C_i\) under case-weight perturbation for all parameters.

local.influence(
  fit_loglog,
  scheme    = "response",
  parameter = "theta",
  type      = "Ci",
  plot      = TRUE,
  threshold = 0.4,
  label.pos = 2
)

Total local influence \(C_i\) under response perturbation for all parameters.

A small set of observations is identified as locally influential under both perturbation schemes, but none produces an extreme influence value.

Half-normal plot with simulated envelope

The halfnormal.plot function produces a half-normal plot of the absolute residuals together with a simulated envelope based on nsim Monte Carlo replications of the fitted model (default nsim = 100). Points outside the envelope may indicate model inadequacy. The type argument selects the residual type (default "weighted") and seed ensures reproducibility.

halfnormal.plot(fit_loglog, nsim = 19, type = "weighted", seed = 2026)

Half-normal plot of absolute weighted residuals with 95% simulated envelope (100 replications).

All diagnostic plots reveal no evidence against the adequacy of the fitted model. Quantile residuals are approximately standard normal, no observation has disproportionate influence, and nearly all points in the half-normal plot fall within the simulated envelope.

Fitted vs. Observed: Time Series

plot(rh$Date, rh$RH,
     type = "l", col = "black", lwd = 1.2,
     xlab = "Date", ylab = "Relative humidity",
     main = "Observed vs Fitted RH — Brasília (2000–2025)")
lines(rh$Date, fitted(fit_loglog), col = "red", lwd = 1.5, lty = 2)
legend("bottomleft",
       legend = c("Observed", "Fitted"),
       col    = c("black", "red"),
       lty    = c(1, 2), lwd = c(1.2, 1.5),
       bty    = "n", cex = 0.85)

Observed (solid black) and fitted (dashed red) monthly relative humidity in Brasília, January 2000 to December 2025.

The simplex regression model with log-log link accurately captures the strong seasonal pattern in relative humidity, with fitted values tracking the observed series closely throughout the entire 26-year period.

Summary

This vignette demonstrated the main functionalities of the SimplexRegression package through the analysis of monthly relative humidity data from Brasília. The selected simplex regression model uses the log-log mean link with variable dispersion driven by precipitation. It achieved excellent goodness of fit (pseudo-\(R^2 > 0.95\)) and strong cross-validated predictive performance (\(P^2 > 0.95\)), with no evidence of misspecification from the RESET test or departures from model assumptions in the diagnostic plots.

The package provides a comprehensive and self-consistent workflow — from data preparation and model fitting to inference, model selection, diagnostics, and influence analysis — following the standard conventions of formula-based regression modeling in R.

References

Barndorff-Nielsen, O. E. and Jørgensen, B. (1991). Some parametric models on the simplex. Journal of Multivariate Analysis, 39(1), 106–116.
Cribari-Neto, F., Vasconcellos, K. L. P. and Santana e Silva, J. J. (2025). New strategies for detecting atypical observations based on the information matrix equality. Journal of Applied Statistics, 52, 2873–2893.
Espinheira, P. L. and Silva, A. O. (2020). Residual and influence analysis to a general class of simplex regression. TEST, 29, 523–552.
Justino, M. E. C. and Cribari-Neto, F. (2026). Simplex regression with a flexible logit link: Inference and application to cross-country impunity data. Applied Mathematical Modelling, 154, 116713.

Session Info

sessionInfo()
#> R version 4.6.1 (2026-06-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 26.04 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] SimplexRegression_0.1.3 rmarkdown_2.31         
#> 
#> loaded via a namespace (and not attached):
#>  [1] cli_3.6.6        knitr_1.51       rlang_1.2.0      xfun_0.59       
#>  [5] Formula_1.2-5    otel_0.2.0       jsonlite_2.0.0   zoo_1.8-15      
#>  [9] buildtools_1.0.0 htmltools_0.5.9  maketools_1.3.2  pracma_2.4.6    
#> [13] sys_3.4.3        lmtest_0.9-40    sass_0.4.10      grid_4.6.1      
#> [17] evaluate_1.0.5   jquerylib_0.1.4  fastmap_1.2.0    yaml_2.3.12     
#> [21] lifecycle_1.0.5  compiler_4.6.1   sandwich_3.1-1   lattice_0.22-9  
#> [25] digest_0.6.39    R6_2.6.1         parallel_4.6.1   bslib_0.11.0    
#> [29] Matrix_1.7-5     tools_4.6.1      cachem_1.1.0

Modeling Relative Air Humidity with Simplex Regression