Alexander Alexandrovich Lyubishchev (1890-1972) was a Russian biologist and entomologist who, in a 1943 manuscript titled Programma obshchey sistematiki (Program of General Systematics), set out a quantitative, multivariate approach to classification. His methods were later presented in English in Biometrics (Lubischew, 1962).
Lyubishchev’s framework operates directly on continuous measurements, using means, variances and covariances to quantify how far apart groups are and whether they overlap. It predates the binary-character similarity coefficients of Sokal and Sneath (1963) that appear in other R packages, and it addresses continuous rather than binary data. Because the original Russian manuscript was not widely cited in the Western numerical-taxonomy literature, this lineage is often overlooked.
This package implements four core functions: divergence_coefficient(),
scatter_ellipse(), transgression() and classify(). We illustrate them on
the familiar iris data set.
The divergence coefficient D measures the standardised
separation between two groups summed across features. Setosa is famously
distinct from the other two species, so we expect a large value.
setosa <- iris[iris$Species == "setosa", 1:4]
versicolor <- iris[iris$Species == "versicolor", 1:4]
divergence_coefficient(setosa, versicolor)
#> [1] 58.42465
A large D confirms the two groups are easily separable
on these features.
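To make the idea of a standardised separation summed across features concrete, here is a minimal base R sketch. It assumes D is the sum, over features, of the squared difference in group means divided by the sum of the two sample variances. That formula is an inference, not taken from the package source, though on these two groups it agrees with the value printed above. The helper name divergence_sketch is hypothetical.

```r
# Hypothetical helper, not the package's source: for each feature, take the
# squared difference in means divided by the sum of the two sample variances,
# then sum the per-feature terms.
divergence_sketch <- function(a, b) {
  mean_diff_sq <- (colMeans(a) - colMeans(b))^2
  var_sum <- apply(a, 2, var) + apply(b, 2, var)
  sum(mean_diff_sq / var_sum)
}

setosa <- iris[iris$Species == "setosa", 1:4]
versicolor <- iris[iris$Species == "versicolor", 1:4]
d_sketch <- divergence_sketch(setosa, versicolor)
round(d_sketch, 5)
```

Under this assumed formula the two petal measurements contribute most of the total, which fits the usual picture of setosa being separated mainly by its petals.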
scatter_ellipse() fits a covariance ellipse to every
class, returning the centroid, covariance and sample size for each.
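For orientation, the three per-class summaries named here can be computed in base R as in the sketch below. This is not the package's implementation. The helper name ellipse_sketch is hypothetical, and the covariance is assumed to be the usual sample covariance with an n - 1 denominator.

```r
# Hypothetical stand-in for scatter_ellipse(): per-class centroid,
# sample covariance (n - 1 denominator) and sample size.
ellipse_sketch <- function(x, classes) {
  lapply(split(x, classes), function(group) {
    list(
      mean = colMeans(group),
      cov = cov(group),
      n_samples = nrow(group)
    )
  })
}

ellipses_sketch <- ellipse_sketch(iris[, 1:4], iris$Species)
ellipses_sketch[["setosa"]]$mean
ellipses_sketch[["setosa"]]$n_samples
```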
ellipses <- scatter_ellipse(iris[, 1:4], iris$Species)
ellipses[["setosa"]]$mean
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width
#>        5.006        3.428        1.462        0.246
ellipses[["setosa"]]$cov
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Sepal.Length   0.12424898 0.099216327  0.016355102 0.010330612
#> Sepal.Width    0.09921633 0.143689796  0.011697959 0.009297959
#> Petal.Length   0.01635510 0.011697959  0.030159184 0.006069388
#> Petal.Width    0.01033061 0.009297959  0.006069388 0.011106122
ellipses[["setosa"]]$n_samples
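#> [1] 50
transgression() checks whether two ellipses overlap by
comparing the squared Mahalanobis distance between their centroids against a
chi-squared threshold. Versicolor and virginica are the hard pair: their
individual specimens are known to overlap, so we expect a separation ratio
much closer to 1 than for any pair involving setosa.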
transgression(ellipses, "versicolor", "virginica")
#> $mahalanobis_distance
#> [1] 14.21889
#>
#> $threshold
#> [1] 9.487729
#>
#> $transgression
#> [1] FALSE
#>
#> $separation_ratio
#> [1] 1.498661
Contrast this with the easy pair, setosa versus virginica:
transgression(ellipses, "setosa", "virginica")
#> $mahalanobis_distance
#> [1] 195.1855
#>
#> $threshold
#> [1] 9.487729
#>
#> $transgression
#> [1] FALSE
#>
#> $separation_ratio
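#> [1] 20.57242
A separation_ratio above 1 (with transgression = FALSE) means the squared
distance exceeds the threshold, and the further above 1 it is, the better
separated the groups are. Setosa versus virginica, at about 20.6, is far more
clearly separated than versicolor versus virginica, at about 1.5.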
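The threshold printed above, 9.487729, equals qchisq(0.95, df = 4), and each separation_ratio equals the squared distance divided by that threshold. The base R sketch below follows the same recipe under two assumptions that the text does not confirm: the two class covariances are simply averaged, and the level is 0.95 with one degree of freedom per feature. The helper name transgression_sketch is hypothetical, and its distances may differ from the package's if the package combines the covariances differently.

```r
# Hypothetical sketch, not the package's source. Assumptions: the two class
# covariances are averaged, and the threshold is the 0.95 chi-squared
# quantile with one degree of freedom per feature.
transgression_sketch <- function(a, b, level = 0.95) {
  pooled_cov <- (cov(a) + cov(b)) / 2
  # stats::mahalanobis() returns the squared distance
  d2 <- mahalanobis(colMeans(a), center = colMeans(b), cov = pooled_cov)
  threshold <- qchisq(level, df = ncol(a))
  list(
    mahalanobis_distance = d2,
    threshold = threshold,
    transgression = d2 < threshold,
    separation_ratio = d2 / threshold
  )
}

setosa <- iris[iris$Species == "setosa", 1:4]
virginica <- iris[iris$Species == "virginica", 1:4]
res_sketch <- transgression_sketch(setosa, virginica)
res_sketch$threshold
res_sketch$transgression
```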
classify() assigns posterior probabilities to a new
specimen using the multivariate Gaussian likelihood of each class. Here
is a typical setosa specimen.
specimen <- c(5.1, 3.5, 1.4, 0.2)
result <- classify(specimen, ellipses)
sapply(result, function(r) r$posterior)
#>       setosa   versicolor    virginica
#> 1.000000e+00 4.918517e-26 2.981541e-41
The posterior concentrates on setosa, as expected.
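As a rough illustration of how posteriors can be derived from per-class Gaussian likelihoods, the base R sketch below evaluates the multivariate normal log-density of the specimen under each class and normalises the results. It assumes equal priors and a separate covariance per class. Neither assumption is confirmed by the text, so its numbers need not match the package's output exactly. The helper name classify_sketch is hypothetical.

```r
# Hypothetical sketch, not the package's source. Assumptions: equal priors
# and a separate Gaussian (own mean and covariance) for each class.
classify_sketch <- function(x, data, classes) {
  log_lik <- sapply(split(data, classes), function(group) {
    mu <- colMeans(group)
    sigma <- cov(group)
    log_det <- as.numeric(determinant(sigma, logarithm = TRUE)$modulus)
    # Multivariate Gaussian log-density evaluated at x
    -0.5 * (length(x) * log(2 * pi) + log_det + mahalanobis(x, mu, sigma))
  })
  # Normalise on the log scale so tiny likelihoods do not underflow
  weights <- exp(log_lik - max(log_lik))
  weights / sum(weights)
}

specimen <- c(5.1, 3.5, 1.4, 0.2)
posterior_sketch <- classify_sketch(specimen, iris[, 1:4], iris$Species)
names(which.max(posterior_sketch))
```

Subtracting the largest log-likelihood before exponentiating keeps the winning class at weight 1 and lets the others shrink towards zero without producing NaN.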
These methods assume continuous, roughly Gaussian features. Use them for measurement data such as morphometrics, spectra or sensor readings. They are not appropriate for purely categorical or binary character data, where the Sokal-Sneath style similarity coefficients are the right tool.
Lyubishchev, A.A. (1943). Programma obshchey sistematiki [Program of General Systematics]. Manuscript, 22 November 1943. Digitized by ZIN RAS Coleoptera Laboratory. https://www.zin.ru/animalia/coleoptera/rus/lyubis05.htm
Lubischew, A.A. (1962). On the use of discriminant functions in taxonomy. Biometrics, 18(4), 455-477.