Cellwise Robust Multi-Group Gaussian Mixture Model

This vignette reproduces the weather example described in Puchhammer, Wilms and Filzmoser (2025). The original data is from Geosphere Austria (2022) and included in this package.

library(ssMRCD)
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Data Preparation

The original data from GeoSphere Austria (2022) is pre-cleaned and saved in the data frame object weatherAUT2021. Additional information can be found on the helping page.

# get meta data for the data set
? weatherAUT2021

# load the data
data("weatherAUT2021")

# inspect the data
head(weatherAUT2021)
#>        p     s   vv    t rsum  rel             name      lon      lat  alt
#> 1 941.35 154.5 2.15 6.15 49.5 77.5    AIGEN/ENNSTAL 14.13833 47.53278  641
#> 2 945.80 172.5 2.95 6.25 39.5 76.0      ALLENTSTEIG 15.36694 48.69083  599
#> 3 985.50 152.0 2.55 8.20 46.5 78.0        AMSTETTEN 14.89500 48.10889  266
#> 4 893.20 123.5 1.85 5.10 66.0 74.5      BAD GASTEIN 13.13333 47.11055 1092
#> 5 984.50 176.0 1.50 8.45 40.5 75.0 BAD GLEICHENBERG 15.90361 46.87222  269
#> 6 992.05 188.5 2.20 8.95 60.5 77.0  BAD RADKERSBURG 15.99333 46.69222  207

# select variables, station names and number of observations
data = weatherAUT2021 %>% select(p:rel)
stations = weatherAUT2021$name
n = dim(data)[1]

The predefined groups are based in the underlying geographical landscape consisting of Alpine mountains, hills and flatter areas in Austria.

# build 5 groups of observations based on spatial proximity and geography
cut_lon = c(min(weatherAUT2021$lon)-0.2, 12, 16, max(weatherAUT2021$lon) + 0.2)
cut_lat = c(min(weatherAUT2021$lat)-0.2, 48, max(weatherAUT2021$lat) + 0.2)
groups = ssMRCD::groups_gridbased(weatherAUT2021$lon, 
                                  weatherAUT2021$lat, 
                                  cut_lon, 
                                  cut_lat)
N = length(unique(groups))
table(groups)
#> groups
#>  1  2  3  4  5 
#> 31 80 35 16 21

# calculate MG-GMM
model = cellMGGMM(X = data, groups = groups,
                  nsteps = 100, alpha = 0.5,
                  maxcond = 100)
#> Iteration finished due to computational inaccuracies (increase of objective function in last step of size: 0.000119383117635152). Check convergence plot.

# mixture probabilities
cat("Pi (in %):\n")
#> Pi (in %):
round(model$pi_groups*100, 2)
#>        1     2     3     4     5
#> 1 100.00  0.00  0.00  0.00  0.00
#> 2  24.97 71.95  0.00  3.08  0.00
#> 3   0.00  0.00 94.02  5.98  0.00
#> 4  12.48  0.00  5.83 64.57 17.11
#> 5   0.00  0.00  9.01 15.35 75.64

# percentage of outliers
cat("% Outliers per group and variable:\n")
#> % Outliers per group and variable:
round(sapply(1:N, function(x) colMeans(1-model$W[groups == x, ]))*100, 2)
#>      [,1]  [,2] [,3] [,4] [,5]
#> p    6.45  3.75 2.86 0.00 0.00
#> s    0.00  2.50 0.00 0.00 0.00
#> vv   6.45 12.50 8.57 6.25 9.52
#> t    6.45  3.75 0.00 0.00 4.76
#> rsum 0.00  0.00 2.86 0.00 4.76
#> rel  0.00  2.50 0.00 6.25 4.76

# calculate residuals
res = residuals_mggmm(X = data, 
                groups = groups,
                Sigma = model$Sigma,
                mu = model$mu, 
                probs = model$probs,
                W = model$W)

References

GeoSphere Austria (2022): https://data.hub.geosphere.at.

Puchhammer P., Wilms I. and Filzmoser P. (2025): A smooth multi-group Gaussian Mixture Model for cellwise robust covariance estimation. https://doi.org/10.48550/arXiv.2504.02547

- Data Preparation
- References