Probabilistic Coherence

Coherence measures seek to emulate human judgment. They tend to rely on the degree to which words co-occur. Yet words may co-occur in ways that are unhelpful. For example, consider the words “the” and “this”. They co-occur very frequently in English-language documents, yet they are statistically independent: knowing the relative frequency of “the” in a document gives you no additional information about the relative frequency of “this” in that document.
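
As a quick numerical illustration, the sketch below uses made-up document counts (the numbers are purely illustrative, not from any real corpus) to show what independence looks like: the share of documents containing “this” barely changes once we condition on documents containing “the”.

```r
# Hypothetical document counts -- illustrative numbers, not from a real corpus
n_docs     <- 1000   # total documents
n_the      <- 980    # documents containing "the"
n_this     <- 610    # documents containing "this"
n_the_this <- 598    # documents containing both "the" and "this"

p_this           <- n_this / n_docs      # P(this)       = 0.610
p_this_given_the <- n_the_this / n_the   # P(this | the) ~= 0.610

p_this_given_the - p_this                # ~0: frequent co-occurrence, yet independent
```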

Probabilistic coherence uses the concepts of co-occurrence and statistical independence to rank topics. Note that probabilistic coherence has not yet been rigorously evaluated for its correlation with human judgment; anecdotally, those who have used it tend to find it helpful. Probabilistic coherence is included in both the textmineR and tidylda packages for the R language.
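
For reference, here is a minimal usage sketch with textmineR's CalcProbCoherence(). It assumes you already have a fitted topic model whose phi matrix holds topic-term probabilities and the document-term matrix dtm the model was fit on; the object names model and dtm are placeholders.

```r
library(textmineR)

# model$phi: topics-by-terms probability matrix from a fitted topic model
# dtm:       the document-term matrix the model was fit on
# (Both objects are assumed to already exist in your session.)
coherence <- CalcProbCoherence(phi = model$phi, dtm = dtm, M = 5)

# One coherence score per topic; higher values suggest more coherent topics
summary(coherence)
```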

Probabilistic coherence averages differences in conditional probabilities across the top $M$ most-probable words in a topic. Let $x_{i,k}$ correspond to the $i$-th most probable word in topic $k$, such that $x_{1,k}$ is the most probable word, $x_{2,k}$ is the second most probable, and so on. Further, let $x_{i,k}$ act as an indicator for whether that word appears in a given context (e.g., a document):

$$
x_{i,k} =
\begin{cases}
1 & \text{if the } i\text{-th most probable word in topic } k \text{ appears in the context} \\
0 & \text{otherwise}
\end{cases}
$$
Then probabilistic coherence for the top $M$ terms in topic $k$ is calculated as follows:

$$
\text{coherence}(k, M) = \frac{1}{\binom{M}{2}} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \Big[ P(x_{j,k} = 1 \mid x_{i,k} = 1) - P(x_{j,k} = 1) \Big]
$$
where $P(x_{j,k} = 1 \mid x_{i,k} = 1)$ is the fraction of contexts containing word $i$ that also contain word $j$, and $P(x_{j,k} = 1)$ is the fraction of all contexts containing word $j$.
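
To make the formula concrete, here is a from-scratch sketch of the calculation for a single topic. It assumes a binary (0/1) document-term matrix and a character vector of the topic's top $M$ terms; the function name prob_coherence is illustrative and not part of textmineR or tidylda.

```r
# Probabilistic coherence for one topic, computed directly from the definition.
#   dtm_bin:   documents-by-terms matrix of 0/1 occurrence indicators
#   top_terms: the top M terms of the topic, ordered most to least probable
prob_coherence <- function(dtm_bin, top_terms) {
  x      <- dtm_bin[, top_terms, drop = FALSE]
  n_docs <- nrow(x)
  M      <- length(top_terms)

  diffs <- c()
  for (i in 1:(M - 1)) {
    for (j in (i + 1):M) {
      p_j         <- sum(x[, j]) / n_docs                # P(x_j = 1)
      p_j_given_i <- sum(x[, i] * x[, j]) / sum(x[, i])  # P(x_j = 1 | x_i = 1)
      diffs       <- c(diffs, p_j_given_i - p_j)
    }
  }

  mean(diffs)  # average over all M * (M - 1) / 2 word pairs
}
```

This is the plain pairwise average written above; the packaged implementations may differ in small details, such as how the pairwise differences are grouped before averaging.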

This brings us to interpretation:

  1. If $P(x_{j,k} = 1 \mid x_{i,k} = 1) - P(x_{j,k} = 1)$ is zero, then $P(x_{j,k} = 1 \mid x_{i,k} = 1) = P(x_{j,k} = 1)$, which is the definition of statistical independence.
  2. If $P(x_{j,k} = 1 \mid x_{i,k} = 1) > P(x_{j,k} = 1)$, then word $j$ appears more often than average in contexts that also contain word $i$.
  3. If $P(x_{j,k} = 1 \mid x_{i,k} = 1) < P(x_{j,k} = 1)$, then word $j$ appears less often than average in contexts that contain word $i$. LDA is unlikely to find strong negative co-occurrences, so in practice negative values tend to be near zero. The small example after this list illustrates the first two cases on a toy corpus.
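
To see cases 1 and 2 numerically, consider a tiny made-up corpus of five contexts; the words and occurrence pattern below are purely illustrative.

```r
# Five toy contexts as a binary term-occurrence matrix.
# "dog" and "bark" co-occur; "the" appears everywhere, independent of "dog".
x <- matrix(
  c(1, 1, 1, 0, 0,    # column "dog"
    1, 1, 0, 0, 0,    # column "bark"
    1, 1, 1, 1, 1),   # column "the"
  ncol = 3,
  dimnames = list(NULL, c("dog", "bark", "the"))
)

p_bark           <- mean(x[, "bark"])                                # P(bark) = 0.4
p_bark_given_dog <- sum(x[, "dog"] * x[, "bark"]) / sum(x[, "dog"])  # = 2/3
p_bark_given_dog - p_bark  # positive: "bark" co-occurs with "dog" (case 2)

p_the            <- mean(x[, "the"])                                 # P(the) = 1
p_the_given_dog  <- sum(x[, "dog"] * x[, "the"]) / sum(x[, "dog"])   # = 1
p_the_given_dog - p_the    # zero: "the" is independent of "dog" (case 1)
```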