
Pointwise Mutual Information

Pointwise Mutual Information (PMI) quantifies the degree of association between two events beyond what would be expected by chance, serving as the foundational weighting scheme for co-occurrence matrices in distributional semantics.

PMI(w, c) = log₂[P(w, c) / (P(w) · P(c))]

Pointwise Mutual Information (PMI) is an information-theoretic association measure that quantifies how much more (or less) two events co-occur than would be expected if they were independent. In computational linguistics, PMI is applied to word-context co-occurrence data, transforming raw frequency counts into association scores that highlight informative co-occurrences while discounting the effect of base frequencies. PMI-weighted co-occurrence matrices are the foundation of many distributional semantic models, and PMI has been shown to be implicitly optimized by neural embedding models like Word2Vec.

Definition and Properties

PMI and Variants

PMI(w, c) = log₂[P(w, c) / (P(w) · P(c))]

Positive PMI: PPMI(w, c) = max(0, PMI(w, c))

Shifted PMI: SPMI_k(w, c) = PMI(w, c) − log k

Context distribution smoothing:
PMI_α(w, c) = log₂[P(w, c) / (P(w) · P_α(c))]
where P_α(c) = count(c)^α / Σ_c′ count(c′)^α

PMI is positive when the co-occurrence of w and c is more likely than chance, zero when they are independent, and negative when they co-occur less than expected. In practice, negative PMI values are unreliable (they require enormous corpora to estimate accurately), so the Positive PMI (PPMI) variant, which replaces negative values with zero, is preferred. PPMI-weighted matrices with SVD dimensionality reduction constitute a strong baseline for distributional semantics, often competitive with neural embedding methods.
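The PPMI transformation above can be sketched directly from a word-by-context count matrix. The following is a minimal NumPy illustration (the function name and matrix layout are illustrative, not from a specific library): joint and marginal probabilities are estimated from counts, PMI is taken in base 2 as in the definition above, and negative or undefined entries are clipped to zero.

```python
import numpy as np

def ppmi_matrix(counts):
    """Compute the PPMI matrix from a word-by-context count matrix.

    counts[w, c] is the co-occurrence count of word w with context c.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total                     # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)     # marginal P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)     # marginal P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0              # zero counts give -inf; map to 0
    return np.maximum(pmi, 0.0)               # PPMI: clip negative associations
```

For a matrix where each word occurs with exactly one context, the diagonal entries get PPMI log₂(0.5/0.25) = 1 and the rest are clipped to zero.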

Connection to Neural Embeddings

A landmark result by Levy and Goldberg (2014) showed that Word2Vec's Skip-gram model with negative sampling (SGNS) implicitly factorizes a matrix whose entries are PMI values shifted by log k, where k is the number of negative samples. Specifically, the optimal solution satisfies w · c = PMI(w, c) − log k. This theoretical result unified the count-based and prediction-based paradigms of distributional semantics, showing that neural embeddings are performing a form of weighted, low-rank PMI matrix factorization.
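The count-based analogue of this factorization is often sketched as a truncated SVD of the shifted, clipped PMI matrix (sometimes called SPPMI). The sketch below assumes base-2 logs throughout for consistency with the formulas above (Levy and Goldberg's derivation uses natural logs; the base only rescales the matrix), and the symmetric square-root weighting of singular values is one common design choice, not the only one.

```python
import numpy as np

def sppmi_embeddings(counts, k=5, dim=2):
    """Sketch: word embeddings via truncated SVD of the shifted PPMI matrix.

    SGNS with k negative samples implicitly factorizes PMI(w, c) - log k;
    a practical count-based analogue clips that matrix at zero and applies
    truncated SVD, splitting the singular values between the two factors.
    """
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)
    p_c = p_wc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    sppmi = np.maximum(pmi - np.log2(k), 0.0)  # shift by log k, then clip
    sppmi[~np.isfinite(sppmi)] = 0.0           # clean any remaining NaNs
    U, S, Vt = np.linalg.svd(sppmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])       # symmetric singular-value split
```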

PMI in Collocation Extraction

Beyond distributional semantics, PMI is widely used for collocation extraction and terminology mining. Word pairs with high PMI scores tend to be genuine collocations (e.g., "New York," "machine learning") rather than coincidental co-occurrences of frequent words. However, raw PMI has a bias toward low-frequency pairs, as rare co-occurrences can yield high PMI scores. Various normalizations have been proposed, including normalized PMI (NPMI = PMI / −log P(w,c)), which bounds the measure between −1 and +1 and reduces the frequency bias.
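NPMI for a single candidate pair is simple to compute from raw counts; a scalar sketch (function name and argument layout are illustrative) is below. Note that the log base cancels in the ratio, so any base gives the same NPMI score.

```python
import math

def npmi(count_xy, count_x, count_y, total):
    """Normalized PMI for a word pair, bounded in [-1, +1] (Bouma, 2009)."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    pmi = math.log2(p_xy / (p_x * p_y))
    return pmi / -math.log2(p_xy)
```

A pair that only ever occurs together (count_xy = count_x = count_y) scores exactly +1, and an independent pair scores 0, regardless of how rare the words are; this is the sense in which NPMI tames raw PMI's frequency bias.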

Practical Considerations

Several practical enhancements improve PMI-based representations. Context distribution smoothing, which raises context frequencies to a power α (typically 0.75) before computing PMI, addresses the bias toward rare contexts and yields substantial improvements on word similarity and analogy tasks. This smoothing was shown by Levy et al. (2015) to be one of the key hyperparameters explaining the performance gap between count-based and neural embedding methods. When count-based methods are tuned with the same design choices (context distribution smoothing, subsampling of frequent words), they largely close the gap with neural models.
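Context distribution smoothing amounts to one change in the PMI computation: the context marginal P(c) is replaced by the smoothed P_α(c) defined earlier. A minimal NumPy sketch (illustrative function name) is below; because smoothing inflates the probability mass assigned to rare contexts, their PMI with every word is pushed down.

```python
import numpy as np

def pmi_with_cds(counts, alpha=0.75):
    """PMI with context distribution smoothing (Levy et al., 2015).

    Context probabilities come from counts raised to the power alpha,
    which raises the relative probability of rare contexts and thus
    lowers their PMI scores; alpha=1.0 recovers plain PMI.
    """
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)
    c_counts = counts.sum(axis=0)
    p_c_alpha = (c_counts ** alpha) / (c_counts ** alpha).sum()
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c_alpha[np.newaxis, :]))
    return pmi
```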

PMI can be extended to higher-order associations. Trigram PMI measures the association strength of three-word combinations, useful for multiword expression detection. Conditional PMI, which measures association after conditioning on a third variable, can be used to test whether a co-occurrence pattern is mediated by a confound. These extensions make PMI a versatile tool throughout computational linguistics, from lexicography and corpus analysis to the foundations of vector-space semantics.
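One common way to write the trigram extension compares the joint probability of the three-word combination against the fully independent baseline P(x) · P(y) · P(z); other baselines exist, so treat this as one illustrative formulation rather than the only definition.

```python
import math

def trigram_pmi(count_xyz, count_x, count_y, count_z, total):
    """Trigram PMI: log-ratio of the joint probability of a three-word
    combination to the independence baseline P(x) * P(y) * P(z)."""
    p_xyz = count_xyz / total
    p_x, p_y, p_z = count_x / total, count_y / total, count_z / total
    return math.log2(p_xyz / (p_x * p_y * p_z))
```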

References

  1. Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
  2. Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27 (pp. 2177–2185).
  3. Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225. doi:10.1162/tacl_a_00134
  4. Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. In Proceedings of GSCL (pp. 31–40).