Les miserables

library(igraph)
library(gsbm)
library(missSBM)
library(RColorBrewer)

Les Misérables character network

Les Misérables characters network, encoding interactions between characters of Victor Hugo’s novel, was first created by Donald Knuth as part of the Stanford Graph Base (https://people.sc.fsu.edu/~jburkardt/datasets/sgb/sgb.html). It contains 77 nodes corresponding to characters of the novel, and 254 vertices connecting two characters whenever they appear in the same chapter.

data(les_miserables)
A<- les_miserables$A
names <- les_miserables$names
net <- graph_from_adjacency_matrix(A, mode = "undirected")
V(net)$name <- names
V(net)$color <- "gray80"
deg <- degree(net, mode="all")
V(net)$size <- deg
plot(net, vertex.label.cex = 0.4)

We fit a classical SBM to the graph and represent the graph with nodes proportional to their degrees and colored by community assignment. The number of communities has been selected so as to minimize the ICL criterion.

vBlocks <- 1:10
collection_sbm <- missSBM::estimateMissSBM(A, vBlocks, "node")
#> 
#> 
#>  Adjusting Variational EM for Stochastic Block Model
#> 
#>  Imputation assumes a 'node' network-sampling process
#> 
#>  Initialization of 10 model(s). 
#>  Performing VEM inference
#>      Model with 8 blocks.    Model with 5 blocks.    Model with 4 blocks.    Model with 3 blocks.    Model with 7 blocks.    Model with 2 blocks.    Model with 9 blocks.    Model with 1 blocks.    Model with 6 blocks.    Model with 10 blocks.
#>  Looking for better solutions
#>  Pass 1   Going forward +++++++++                                                                                                     Pass 1   Going backward +++++++++                                                                                                    
colo <- round(collection_sbm$bestModel$fittedSBM$probMemberships)
colo <- sapply(1:nrow(A), function(i) which.max(colo[i,]))
pal3 <- brewer.pal(10, "Set3")
V(net)$color <- pal3[colo]
V(net)$label <- NA
V(net)$size <- deg
plot(net)

We observe that the main character Jean Valjean is alone in his community, and one of the clusters groups important characters (Thénardier, Éponine, Javert).

The Generalized stochastic Block Model accounts for outlier profiles (hubs, mixed memberships). In this model, nodes are divided into two sets: the inliers which follow a classical SBM, and the outliers, for which we make no assumptions on the connectivity model. These two sets are unknown a priori and are learned automatically by our procedure. Below we represent the result of the clustering, with the detected outliers indicated in red. They correspond to hubs (large center node, Jean Valjean) and nodes with mixed memberships (e.g. smaller central nodes with connections to several clusters).

lambda1 <- 4
lambda2 <- 5
res <- gsbm_mcgd(A, lambda1 = lambda1, lambda2 = lambda2)
outliers <- names[which(colSums(res$S)>0)]
sv <- svd(res$L)
pc <- sv$u[,1:4]
rownames(pc) <- names
pc <- pc[setdiff(names, outliers),]
com <- kmeans(pc, centers=4, nstart=50)
com$cluster
#>         Napoleon   MlleBaptistine      MmeMagloire     CountessDeLo 
#>                2                2                2                2 
#>         Geborand     Champtercier         Cravatte            Count 
#>                2                2                2                2 
#>           OldMan          Labarre       Marguerite           MmeDeR 
#>                2                2                2                2 
#>          Isabeau          Gervais        Tholomyes        Listolier 
#>                2                2                1                1 
#>          Fameuil      Blacheville        Favourite           Dahlia 
#>                1                1                1                1 
#>          Zephine    MmeThenardier     Fauchelevent       Bamatabois 
#>                1                2                2                3 
#>         Perpetue         Simplice      Scaufflaire           Woman1 
#>                2                2                2                2 
#>            Judge     Champmathieu           Brevet       Chenildieu 
#>                3                3                3                3 
#>      Cochepaille        Pontmercy     Boulatruelle          Eponine 
#>                3                2                2                2 
#>          Anzelma           Woman2   MotherInnocent          Gribier 
#>                2                2                2                2 
#>        Jondrette        MmeBurgon     Gillenormand           Magnon 
#>                2                2                2                2 
#> MlleGillenormand     MmePontmercy      MlleVaubois   LtGillenormand 
#>                2                2                2                2 
#>        BaronessT           Mabeuf         Enjolras       Combeferre 
#>                2                4                4                4 
#>        Prouvaire          Feuilly       Courfeyrac          Bahorel 
#>                4                4                4                4 
#>          Bossuet             Joly        Grantaire   MotherPlutarch 
#>                4                4                4                2 
#>        Gueulemer            Babet       Claquesous     Montparnasse 
#>                2                2                2                2 
#>        Toussaint           Child1           Child2           Brujon 
#>                2                2                2                2 
#>     MmeHucheloup 
#>                4
colo2 <- 1:nrow(A)
names(colo2) <- names
comu <- com$cluster
comu[which(comu==4)] <- 6
colo2[setdiff(names, outliers)] <- pal3[comu]
colo2[outliers] <- "red"
labels <- names(A)
names(labels) <- names(A)
labels[setdiff(names, outliers)] <- NA
V(net)$label <- NA
V(net)$color <- colo2
V(net)$size <- deg
E(net)$arrow.size <- 5
plot(net, vertex.label.dist=20)