Multidimensional Scaling of Discrete Probability Distributions

library(dad)

Introduction: example and objective of the method

The dataset dspg of the dad package is a list of T = 7 matrices. For each of the T years 1968, 1975, 1982, 1990, 1999, 2010 and 2015, we have the contingency table of Diploma × Socioprofessional group in France. Each table has:

  • 4 rows, one per level of diploma (diplome):
    • bepc: brevet
    • cap: NCQ (CAP)
    • bac: baccalaureate
    • sup: higher education (supérieur)
  • 6 columns corresponding to socio-professional groups (csp):
    • agri: farmer (agriculteur)
    • arti: craftsman (artisan)
    • cadr: senior manager (cadre supérieur)
    • pint: middle manager (profession intermédiaire)
    • empl: employee (employé)
    • ouvr: worker (ouvrier)

data("dspg")
print(dspg)
## $`1968`
##        csp
## diplome    agri   arti   cadr   pint    empl    ouvr
##    bepc 1316116 952960 165616 682144 1927408 3978916
##    cap    39104 183004  71440 400456  430988  565404
##    bac    23920  96648 127140 483900  169804   75924
##    sup     4172  27368 426684 193872   36560   19184
## 
## $`1975`
##        csp
## diplome    agri   arti   cadr   pint    empl    ouvr
##    bepc 1040335 804175 212965 802950 2349515 4085075
##    cap   106645 288680 105385 515125  745730  986965
##    bac    16700 123020 217155 685995  289835  112740
##    sup     6565  33780 712780 512515  107645   28710
## 
## $`1982`
##        csp
## diplome   agri   arti   cadr   pint    empl    ouvr
##    bepc 732716 770188 242356 879564 2502620 3947964
##    cap  150132 385052 118612 613640 1067144 1290200
##    bac   46256 174204 293440 770516  474504  135888
##    sup   12616  62816 940980 924172  136608   17392
## 
## $`1990`
##        csp
## diplome   agri   arti    cadr    pint    empl    ouvr
##    bepc 409172 617841  252302  842414 2459709 3461490
##    cap  165573 504027  150024  816798 1651050 1991441
##    bac   79492 206386  352212 1105251  733300  182040
##    sup   22760 130374 1591719 1225666  243767   41267
## 
## $`1999`
##        csp
## diplome   agri   arti    cadr    pint    empl    ouvr
##    bepc 187003 391635  182496  676012 2314729 2628094
##    cap  206249 564047  180554 1050683 2309559 2543681
##    bac   85088 206085  293472 1088315 1076836  358332
##    sup   39690 205774 2110963 2189503  681935  155885
## 
## $`2010`
##        csp
## diplome   agri   arti    cadr    pint    empl    ouvr
##    bepc  60768 250307  141650  477459 1524048 1754585
##    cap  129022 472352  187644  940102 1987379 2241519
##    bac   98110 275878  342985 1211857 1540768  739900
##    sup   60565 304784 2957218 3115222 1266854  331672
## 
## $`2015`
##        csp
## diplome  agri   arti    cadr    pint    empl    ouvr
##    bepc 32957 213686  114517  365359 1173116 1406264
##    cap  95495 462867  169914  815957 1920849 2118735
##    bac  91093 298719  297826 1168208 1645780  844749
##    sup  67977 395035 3250543 3468127 1492026  415197

The method first computes the distances or divergences (δts) between the distributions of each pair of occasions t and s. MDS then looks for a representation of the T distributions as T points in a low-dimensional space, such that the distances between these points are as close as possible to the (δts).
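As a rough illustration of this two-step idea, the base R sketch below converts three of the tables printed above (1968, 1990 and 2015) into joint probability distributions, computes their pairwise distances, and passes the distance matrix to cmdscale. The choice of the L1 distance and the helper names make_table and l1_dist are assumptions made for illustration; this is not the internal code of mdsdd.

```r
# Hypothetical helper: build a 4 x 6 contingency table from counts given row by row.
make_table <- function(counts) {
  matrix(counts, nrow = 4, byrow = TRUE,
         dimnames = list(diplome = c("bepc", "cap", "bac", "sup"),
                         csp = c("agri", "arti", "cadr", "pint", "empl", "ouvr")))
}

# Counts copied from the printed dspg tables for three of the seven years.
tabs <- list(
  "1968" = make_table(c(1316116, 952960, 165616, 682144, 1927408, 3978916,
                        39104, 183004, 71440, 400456, 430988, 565404,
                        23920, 96648, 127140, 483900, 169804, 75924,
                        4172, 27368, 426684, 193872, 36560, 19184)),
  "1990" = make_table(c(409172, 617841, 252302, 842414, 2459709, 3461490,
                        165573, 504027, 150024, 816798, 1651050, 1991441,
                        79492, 206386, 352212, 1105251, 733300, 182040,
                        22760, 130374, 1591719, 1225666, 243767, 41267)),
  "2015" = make_table(c(32957, 213686, 114517, 365359, 1173116, 1406264,
                        95495, 462867, 169914, 815957, 1920849, 2118735,
                        91093, 298719, 297826, 1168208, 1645780, 844749,
                        67977, 395035, 3250543, 3468127, 1492026, 415197))
)

# Counts -> joint probability distributions (each table sums to 1).
probs <- lapply(tabs, prop.table)

# Step 1: pairwise distances between the distributions
# (L1 distance, chosen here as an assumption).
l1_dist <- function(p, q) sum(abs(p - q))
years <- names(probs)
delta <- matrix(0, length(years), length(years), dimnames = list(years, years))
for (yt in years) {
  for (ys in years) {
    delta[yt, ys] <- l1_dist(probs[[yt]], probs[[ys]])
  }
}

# Step 2: classical MDS on the distance matrix, keeping one dimension.
coords <- cmdscale(as.dist(delta), k = 1)

print(round(delta, 3))
print(round(coords, 3))
```

The sign of the coordinates returned by cmdscale is arbitrary, so only the relative positions of the years are meaningful.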

The dad package includes functions for all the calculations required to implement such a method and to interpret its outputs:

  • The mdsdd function performs MDS and generates scores;
  • The plot function generates graphics representing the probability distributions on the factorial axes;
  • The interpret function returns other aids to interpretation based on the marginal distributions.

The mdsdd function

MDS of discrete probability distributions can be carried out using the mdsdd function. This function applies to

  • an object of class "folder", in which case it is used in the same way as fmdsd (see its help page), except that the columns of each data frame in the folder are factors rather than numeric variables;
  • or a list of arrays (or a list of tables).

The following example applies mdsdd to a list of arrays. The function is built on the cmdscale function of R and is run on the dspg dataset as follows:

resultmds <- mdsdd(dspg)

In addition to the add argument of cmdscale, the mdsdd function has two sets of optional arguments:

  • The first consists of the distance argument, which controls the method used to compute the distances between the distributions.
  • The second set consists of optional arguments which control the function outputs.
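To give a feel for what changing the distance means, the sketch below compares two common dissimilarities, the L1 distance and the Hellinger distance, on the distribution of socio-professional groups among higher-education graduates (row sup) in 1968 and 2015, taken from the tables printed above. The function names and the 1/sqrt(2) normalisation of the Hellinger distance are assumptions for illustration; the definitions actually used by mdsdd are given in its help page and may differ by a constant factor.

```r
csp <- c("agri", "arti", "cadr", "pint", "empl", "ouvr")

# Row "sup" of the 1968 and 2015 tables printed above.
sup_1968 <- c(4172, 27368, 426684, 193872, 36560, 19184)
sup_2015 <- c(67977, 395035, 3250543, 3468127, 1492026, 415197)

# Counts -> probability distributions over the six groups.
p <- setNames(sup_1968 / sum(sup_1968), csp)
q <- setNames(sup_2015 / sum(sup_2015), csp)

# L1 distance: sum of absolute differences (ranges from 0 to 2).
l1_dist <- function(p, q) sum(abs(p - q))

# Hellinger distance, normalised here so that it ranges from 0 to 1
# (this normalisation is an assumption, one convention among several).
hellinger_dist <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)

print(round(rbind("1968" = p, "2015" = q), 3))
cat("L1 distance:       ", round(l1_dist(p, q), 3), "\n")
cat("Hellinger distance:", round(hellinger_dist(p, q), 3), "\n")
```

Different distances weight the same changes differently, so the resulting MDS configuration can depend on this choice.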

Interpretation of mdsdd outputs

The mdsdd function returns an object of S3 class "mdsdd", consisting of a list of 9 elements, including the scores, also called principal coordinates, and the marginal and joint distributions of the variables per occasion.

names(resultmds)
## [1] "call"         "group"        "variables"    "d"            "inertia"     
## [6] "scores"       "jointp"       "margins"      "associations"
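As an illustration of what joint and marginal distributions per occasion mean here, the base R sketch below derives them from the 1968 table printed earlier. It does not claim to reproduce the exact layout of the jointp and margins elements; the object names are chosen for the example only.

```r
# 1968 contingency table, copied from the printed dspg output.
tab_1968 <- matrix(
  c(1316116, 952960, 165616, 682144, 1927408, 3978916,
    39104, 183004, 71440, 400456, 430988, 565404,
    23920, 96648, 127140, 483900, 169804, 75924,
    4172, 27368, 426684, 193872, 36560, 19184),
  nrow = 4, byrow = TRUE,
  dimnames = list(diplome = c("bepc", "cap", "bac", "sup"),
                  csp = c("agri", "arti", "cadr", "pint", "empl", "ouvr"))
)

# Joint distribution of (diplome, csp): each cell divided by the grand total.
joint_1968 <- prop.table(tab_1968)

# Marginal distributions of each variable.
marg_diplome <- rowSums(joint_1968)
marg_csp <- colSums(joint_1968)

print(round(joint_1968, 3))
print(round(marg_diplome, 3))
print(round(marg_csp, 3))
```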

The outputs are displayed with the print function:

print(resultmds)
## group variable:  group 
## variables:  diplome csp 
## ---------------------------------------------------------------
## inertia
##   eigenvalue inertia
## 1   1.23e+00    92.8
## 2   6.63e-02     5.0
## 3   1.59e-02     1.2
## 4   8.16e-03     0.6
## 5   4.65e-03     0.4
## 6   7.03e-17     0.0
## ---------------------------------------------------------------
## coordinates
##      group        PC.1        PC.2         PC.3
## 1968  1968 -0.61455123  0.08277634  0.051535799
## 1975  1975 -0.41112536  0.03027496 -0.006955327
## 1982  1982 -0.25617158 -0.01895616 -0.048278844
## 1990  1990 -0.01123542 -0.10771438 -0.066369785
## 1999  1999  0.24131168 -0.16253594  0.077859986
## 2010  2010  0.48611906  0.03985732 -0.016564831
## 2015  2015  0.56565286  0.13629786  0.008773002

Graphical representations on the principal planes are generated with the plot function:

plot(resultmds, fontsize.points = 1)

In this example, a single axis is enough to explain the general trends: the first principal coordinate accounts for 92.8% of the inertia.
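The inertia percentages can be checked by hand from the eigenvalues shown in the printed output, assuming that the inertia column is each eigenvalue's share of their sum, expressed in percent. The eigenvalues below are the rounded values as printed, so the result is only accurate to that rounding.

```r
# Eigenvalues as printed by print(resultmds), rounded to three significant digits.
eig <- c(1.23e+00, 6.63e-02, 1.59e-02, 8.16e-03, 4.65e-03, 7.03e-17)

# Share of each eigenvalue in the total, in percent.
share <- 100 * eig / sum(eig)
pct <- round(share, 1)

# Cumulative share, to see how many axes are worth keeping.
cum_pct <- round(cumsum(share), 1)

print(data.frame(eigenvalue = eig, inertia = pct, cumulative = cum_pct))
```

Under this assumption the computed shares match the printed inertia column, and the first two axes together account for close to 98% of the inertia.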

The graph shows that the first principal score increases steadily over time, from about -0.61 in 1968 to about 0.57 in 2015.

The interpretation of the outputs is based on the relationships between the principal scores and the marginal or joint frequencies. These relationships are quantified by correlation coefficients and represented graphically by plotting the scores against the frequencies. These interpretation tools are provided by the interpret function, which has two optional arguments: nscore, the indices of the score columns to be interpreted, and mma, whose default value is "marg1" (the probability distributions of each variable).

interpret(resultmds, nscore = 1)

## Pearson correlations between scores and probability distributions of each variable
##               PC.1
## diplome.bepc -1.00
## diplome.cap   0.79
## diplome.bac   0.99
## diplome.sup   0.98
## csp.agri     -0.96
## csp.arti     -0.96
## csp.cadr      0.99
## csp.pint      1.00
## csp.empl      0.91
## csp.ouvr     -1.00
## Spearman correlations between scores and probability distributions of each variable
##               PC.1
## diplome.bepc -1.00
## diplome.cap   0.68
## diplome.bac   1.00
## diplome.sup   1.00
## csp.agri     -1.00
## csp.arti     -0.96
## csp.cadr      1.00
## csp.pint      1.00
## csp.empl      0.86
## csp.ouvr     -1.00
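The diploma rows of these correlation tables can be approximated in base R, assuming that each coefficient is computed across the seven years between the PC.1 scores and the marginal frequency of the corresponding level. The sketch below uses the counts and PC.1 scores printed above; the object names are chosen for the example only, and it is not the code of interpret.

```r
# Counts copied from the printed dspg tables, stored as csp x diplome x year.
counts <- array(c(
  # 1968
  1316116, 952960, 165616, 682144, 1927408, 3978916,
  39104, 183004, 71440, 400456, 430988, 565404,
  23920, 96648, 127140, 483900, 169804, 75924,
  4172, 27368, 426684, 193872, 36560, 19184,
  # 1975
  1040335, 804175, 212965, 802950, 2349515, 4085075,
  106645, 288680, 105385, 515125, 745730, 986965,
  16700, 123020, 217155, 685995, 289835, 112740,
  6565, 33780, 712780, 512515, 107645, 28710,
  # 1982
  732716, 770188, 242356, 879564, 2502620, 3947964,
  150132, 385052, 118612, 613640, 1067144, 1290200,
  46256, 174204, 293440, 770516, 474504, 135888,
  12616, 62816, 940980, 924172, 136608, 17392,
  # 1990
  409172, 617841, 252302, 842414, 2459709, 3461490,
  165573, 504027, 150024, 816798, 1651050, 1991441,
  79492, 206386, 352212, 1105251, 733300, 182040,
  22760, 130374, 1591719, 1225666, 243767, 41267,
  # 1999
  187003, 391635, 182496, 676012, 2314729, 2628094,
  206249, 564047, 180554, 1050683, 2309559, 2543681,
  85088, 206085, 293472, 1088315, 1076836, 358332,
  39690, 205774, 2110963, 2189503, 681935, 155885,
  # 2010
  60768, 250307, 141650, 477459, 1524048, 1754585,
  129022, 472352, 187644, 940102, 1987379, 2241519,
  98110, 275878, 342985, 1211857, 1540768, 739900,
  60565, 304784, 2957218, 3115222, 1266854, 331672,
  # 2015
  32957, 213686, 114517, 365359, 1173116, 1406264,
  95495, 462867, 169914, 815957, 1920849, 2118735,
  91093, 298719, 297826, 1168208, 1645780, 844749,
  67977, 395035, 3250543, 3468127, 1492026, 415197),
  dim = c(6, 4, 7),
  dimnames = list(csp = c("agri", "arti", "cadr", "pint", "empl", "ouvr"),
                  diplome = c("bepc", "cap", "bac", "sup"),
                  year = c("1968", "1975", "1982", "1990", "1999", "2010", "2015")))

# PC.1 scores as printed by print(resultmds), in the same year order.
pc1 <- c(-0.61455123, -0.41112536, -0.25617158, -0.01123542,
         0.24131168, 0.48611906, 0.56565286)

# Marginal frequency of each diploma level per year (each column sums to 1).
diplome_freq <- prop.table(apply(counts, c(2, 3), sum), margin = 2)

# Correlation of each level's frequency with PC.1 across the seven years.
pearson <- apply(diplome_freq, 1, cor, y = pc1)
spearman <- apply(diplome_freq, 1, cor, y = pc1, method = "spearman")

print(round(cbind(pearson, spearman), 2))
```

The same approach applies to the csp rows by summing over the diploma dimension instead, that is, with apply(counts, c(1, 3), sum).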

From the correlations between the principal coordinates (PC) and the distributions of the variables, we deduce that:

  • The higher PC1, the higher the frequencies of the diplomas "diplome.bac" and "diplome.sup", the higher "diplome.cap" tends to be, and the lower the frequencies of "diplome.bepc".
  • The higher PC1, the higher the frequencies of the socio-professional groups "csp.cadr", "csp.pint" and "csp.empl", and the lower the frequencies of "csp.agri", "csp.arti" and "csp.ouvr".

So, recalling that PC1 gets higher for recent years, these results highlight that in France, since 1968:

  • the proportion of brevet holders has decreased and the proportion of higher diplomas has increased,
  • the proportion of farmers, craftsmen and workers has decreased and the proportion of employees, middle managers and senior managers has increased.