Short Intro to the Databionic Swarm (DBS)


DBS is a flexible and robust clustering framework that consists of three independent modules [Thrun/Ultsch, 2020]. The first module is the parameter-free projection method Pswarm, which exploits the concepts of self-organization and emergence, game theory, swarm intelligence and symmetry considerations [Thrun/Ultsch, 2020]. The second module is a parameter-free high-dimensional data visualization technique, which generates projected points on a topographic map with hypsometric colors, called the generalized U-matrix. The third module is a clustering method with no sensitive parameters. The clustering can be verified by the visualization and vice versa. The term DBS refers to the method as a whole [Thrun/Ultsch, 2020]. For further details, see [Thrun/Ultsch, 2020].

First Example: Automatic approach

This example uses the automatic approach, which requires no user interaction through shiny. If you want to verify your clustering result externally, you can use Heatmap or SilhouettePlot of the R package DataVisualizations on CRAN.
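
The idea behind a silhouette-based check can be sketched in base R as follows. This is a minimal illustration on synthetic data and not the implementation behind SilhouettePlot in DataVisualizations; the helper name silhouette_width and the data are assumptions made for this example.

```r
# Two well-separated synthetic groups (illustrative data, not Hepta)
set.seed(3)
X = rbind(matrix(rnorm(20, mean = 0), ncol = 2),
          matrix(rnorm(20, mean = 8), ncol = 2))
cls = rep(1:2, each = 10)
D = as.matrix(dist(X))

# Hypothetical helper: silhouette width of every case, given distances and labels
silhouette_width = function(D, cls) {
  sapply(seq_along(cls), function(i) {
    own = cls == cls[i]
    own[i] = FALSE                       # exclude the case itself
    if (!any(own)) return(0)             # singleton cluster: width defined as 0
    a = mean(D[i, own])                  # mean distance within the own cluster
    others = setdiff(unique(cls), cls[i])
    b = min(sapply(others, function(k) mean(D[i, cls == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
}

s = silhouette_width(D, cls)
mean(s)   # values near 1 indicate compact, well-separated clusters
```

Widths close to 1 support the clustering, widths near 0 indicate cases between clusters, and negative widths hint at cases that may sit in the wrong cluster.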

First Module: Projection of high-dimensional Data

First, generate a two-dimensional projection. The user has to supply the [1:n,1:n] distance matrix of the n cases.

library(DatabionicSwarm)
data('Hepta')
InputDistances = as.matrix(dist(Hepta$Data))
projection = Pswarm(InputDistances)
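
As a minimal sketch of the expected input format, the following base R snippet builds a Euclidean distance matrix from a small synthetic data set and checks the properties a [1:n,1:n] distance matrix should have. The matrix X is made up for illustration and only stands in for Hepta$Data.

```r
# Synthetic stand-in for Hepta$Data: n = 6 cases, d = 3 variables (illustrative only)
set.seed(1)
X = matrix(rnorm(6 * 3), nrow = 6, ncol = 3)

# dist() returns a compact 'dist' object; as.matrix() expands it to the full n x n matrix
D = as.matrix(dist(X))

dim(D)                 # n rows and n columns
isSymmetric(D)         # distances are symmetric
all(diag(D) == 0)      # each case has distance zero to itself
```

Any other distance measure can be used in the same way, as long as the result is a full square matrix with one row and one column per case.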

Second Module: Generalized Umatrix

Here the Generalized Umatrix is calculated using a simplified emergent self-organizing map algorithm [Thrun/Ultsch, 2020b]. The output is a list (genUmatrixList) of several elements. The Generalized Umatrix can then be visualized as a 3D landscape, called a topographic map with hypsometric tints, using the elements of genUmatrixList.

Hypsometric tints are surface colors that represent ranges of elevation. For the 3D landscape the contour lines are combined with a specific color scale. The color scale is chosen to display various valleys, ridges, and basins: blue colors indicate small distances (sea level), green and brown colors indicate middle distances (low hills), and shades of white colors indicate vast distances (high mountains covered with snow and ice).
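
The mapping from heights to elevation colors can be sketched in base R. This is only an illustration of the principle: the height matrix is random, and the palette is a hypothetical approximation, not the color scale used by plotTopographicMap.

```r
# Toy height matrix standing in for a generalized U-matrix (values are illustrative)
set.seed(2)
heights = matrix(runif(20 * 30), nrow = 20, ncol = 30)

# Hypothetical hypsometric-style palette: blue (low) -> green -> brown -> white (high)
hypso = colorRampPalette(c("blue", "darkgreen", "saddlebrown", "white"))
n_levels = 10
palette_cols = hypso(n_levels)

# Assign each height to one of n_levels equal-width elevation bands
breaks = seq(min(heights), max(heights), length.out = n_levels + 1)
band = findInterval(heights, breaks, rightmost.closed = TRUE, all.inside = TRUE)

# Draw the filled map and overlay contour lines at the band boundaries
image(heights, col = palette_cols, breaks = breaks, axes = FALSE)
contour(heights, levels = breaks, add = TRUE, drawlabels = FALSE)
```

Each band receives one color, so the contour lines coincide with the color changes, which is what makes valleys and ridges readable at a glance.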

Seven valleys are shown, resulting in seven main clusters. The resulting visualization is toroidal, meaning that the left border cyclically connects to the right border (and the bottom to the top). There are therefore no “real” borders in this visualization; instead, the visualization is “continuous”. This can be visualized using the ‘Tiled=TRUE’ option of ‘plotTopographicMap’.
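
The toroidal property can be illustrated with a small base R sketch. The grid, the tiling, and the helper toroid_dist are assumptions made for this example; how plotTopographicMap implements tiling internally is not shown here.

```r
# Toy 3 x 4 grid standing in for a toroidal map (values are illustrative)
grid = matrix(1:12, nrow = 3, ncol = 4)

# Tiling: repeat the grid 2 x 2 times so that structures crossing a border appear unbroken
tiled = rbind(cbind(grid, grid), cbind(grid, grid))

# Hypothetical helper: toroidal distance between two grid positions (row, column);
# along each axis, take the shorter way around
toroid_dist = function(p, q, lines, columns) {
  dr = abs(p[1] - q[1]); dr = min(dr, lines - dr)
  dc = abs(p[2] - q[2]); dc = min(dc, columns - dc)
  sqrt(dr^2 + dc^2)
}

# First and last column of the same row are direct neighbours on the torus
toroid_dist(c(1, 1), c(1, 4), lines = 3, columns = 4)
```

On a planar grid these two positions would be three columns apart; on the torus they are adjacent, which is why a cluster can appear split across opposite borders in the untiled view.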

Note that the ‘NoLevels’ option is only set to load this vignette faster and should normally not be set manually. It describes the number of contour lines placed relative to the hypsometric tints. All visualizations here are small and a low dpi is set in knitr in order to load the vignette faster.

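genUmatrixList = GeneratePswarmVisualization(
  Data = Hepta$Data,
  ProjectedPoints = projection$ProjectedPoints, # assumed: output of Pswarm above
  LC = projection$LC,                           # assumed: grid size from Pswarm
  Parallel = FALSE) # CRAN guidelines do not allow Parallel = TRUE in a vignette

# Assumed plotting call; NoLevels = 10 only speeds up rendering
GeneralizedUmatrix::plotTopographicMap(genUmatrixList$Umatrix,
  genUmatrixList$Bestmatches, NoLevels = 10)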

Third Module: Automatic Clustering

The number of clusters can be derived from the dendrogram (PlotIt=TRUE) or from the visualization. Here, the seven valleys of the visualization give the number of clusters. The function DBSclustering has one parameter to be set. Typically, the default setting “StructureType = TRUE” works fine. However, for density-based structures, StructureType = FALSE in the function ‘DBSclustering’ sometimes yields better results. Please verify the result with the visualization or the dendrogram. For the dendrogram, choose PlotIt=TRUE in the function ‘DBSclustering’.

In the case of “BestmatchingUnits”, the parameter “LC” defines the size of the grid in lines and columns, where the position (0,0) lies in the upper left corner. In the case of “ProjectedPoints”, the point (0,0) lies in the lower left corner. The transformation is normally done automatically. However, sometimes the user wishes to skip the visualization and use the projected points directly. Then “LC” can be changed accordingly to LC[c(2,1)]. In rare cases, a rounding error may trigger an error catch. In such a case, try LC+1.

Cls = DBSclustering(k = 7,
                    DataOrDistance = Hepta$Data,
                    BestMatches = genUmatrixList$Bestmatches,
                    LC = genUmatrixList$LC,
                    PlotIt = FALSE)
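# Assumed plotting call: topographic map colored by the clustering
GeneralizedUmatrix::plotTopographicMap(genUmatrixList$Umatrix,
                                       genUmatrixList$Bestmatches,
                                       Cls,
                                       NoLevels = 10)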
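
Reading the number of clusters off a dendrogram can be sketched with base R. The example below uses generic agglomerative clustering on synthetic data; DBSclustering builds its dendrogram differently, so this only illustrates how a dendrogram suggests k, under the assumption of Ward linkage on Euclidean distances.

```r
# Three synthetic groups (illustrative data, not Hepta)
set.seed(4)
X = rbind(matrix(rnorm(30, mean = 0), ncol = 2),
          matrix(rnorm(30, mean = 10), ncol = 2),
          matrix(rnorm(30, mean = 20), ncol = 2))

# Generic agglomerative clustering on Euclidean distances (assumption: Ward linkage)
tree = hclust(dist(X), method = "ward.D2")

plot(tree, labels = FALSE)      # large vertical gaps suggest the number of clusters
cls = cutree(tree, k = 3)       # cut the tree into k = 3 clusters
table(cls)                      # cluster sizes
```

The long branches at the top of the dendrogram separate the three groups, so cutting at k = 3 is the natural choice here, in the same way the seven valleys suggest k = 7 for Hepta.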