
Pointwise Mutual Information

Pointwise Mutual Information (PMI) quantifies the degree of association between two events beyond what would be expected by chance, serving as the foundational weighting scheme for co-occurrence matrices in distributional semantics.

PMI(w, c) = log₂[P(w, c) / (P(w) · P(c))]

Pointwise Mutual Information (PMI) is an information-theoretic association measure that quantifies how much more (or less) two events co-occur than would be expected if they were independent. In computational linguistics, PMI is applied to word-context co-occurrence data, transforming raw frequency counts into association scores that highlight informative co-occurrences while discounting the effect of base frequencies. PMI-weighted co-occurrence matrices are the foundation of many distributional semantic models, and PMI has been shown to be implicitly optimized by neural embedding models like Word2Vec.

Definition and Properties

PMI and Variants

PMI(w, c) = log₂[P(w, c) / (P(w) · P(c))]

Positive PMI: PPMI(w, c) = max(0, PMI(w, c))

Shifted PMI: SPMI_k(w, c) = PMI(w, c) − log k

Context distribution smoothing:
PMI_α(w, c) = log₂[P(w, c) / (P(w) · P_α(c))]
where P_α(c) = count(c)^α / Σ_c′ count(c′)^α

PMI is positive when the co-occurrence of w and c is more likely than chance, zero when they are independent, and negative when they co-occur less than expected. In practice, negative PMI values are unreliable (they require enormous corpora to estimate accurately), so the Positive PMI (PPMI) variant, which replaces negative values with zero, is preferred. PPMI-weighted matrices with SVD dimensionality reduction constitute a strong baseline for distributional semantics, often competitive with neural embedding methods.
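The PPMI transformation above can be sketched directly from a word-by-context count matrix. The following is a minimal NumPy illustration (the function name and matrix layout are illustrative, not from a specific library): joint and marginal probabilities are estimated from counts, PMI is taken in base 2 as in the definition above, and negative or undefined entries are clipped to zero.

```python
import numpy as np

def ppmi_matrix(counts):
    """Compute the PPMI matrix from a word-by-context count matrix.

    counts[w, c] is the co-occurrence count of word w with context c.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total                     # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)     # marginal P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)     # marginal P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0              # zero counts give -inf; map to 0
    return np.maximum(pmi, 0.0)               # PPMI: clip negative associations
```

For a matrix where each word occurs with exactly one context, the diagonal entries get PPMI log₂(0.5/0.25) = 1 and the rest are clipped to zero.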

Connection to Neural Embeddings

A landmark result by Levy and Goldberg (2014) showed that Word2Vec's Skip-gram model with negative sampling (SGNS) implicitly factorizes a matrix whose entries are PMI values shifted by log k, where k is the number of negative samples. Specifically, the optimal solution satisfies w · c = PMI(w, c) − log k. This theoretical result unified the count-based and prediction-based paradigms of distributional semantics, showing that neural embeddings are performing a form of weighted, low-rank PMI matrix factorization.
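The count-based analogue of this factorization is often sketched as a truncated SVD of the shifted, clipped PMI matrix (sometimes called SPPMI). The sketch below assumes base-2 logs throughout for consistency with the formulas above (Levy and Goldberg's derivation uses natural logs; the base only rescales the matrix), and the symmetric square-root weighting of singular values is one common design choice, not the only one.

```python
import numpy as np

def sppmi_embeddings(counts, k=5, dim=2):
    """Sketch: word embeddings via truncated SVD of the shifted PPMI matrix.

    SGNS with k negative samples implicitly factorizes PMI(w, c) - log k;
    a practical count-based analogue clips that matrix at zero and applies
    truncated SVD, splitting the singular values between the two factors.
    """
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)
    p_c = p_wc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    sppmi = np.maximum(pmi - np.log2(k), 0.0)  # shift by log k, then clip
    sppmi[~np.isfinite(sppmi)] = 0.0           # clean any remaining NaNs
    U, S, Vt = np.linalg.svd(sppmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])       # symmetric singular-value split
```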

PMI in Collocation Extraction

Beyond distributional semantics, PMI is widely used for collocation extraction and terminology mining. Word pairs with high PMI scores tend to be genuine collocations (e.g., "New York," "machine learning") rather than coincidental co-occurrences of frequent words. However, raw PMI has a bias toward low-frequency pairs, as rare co-occurrences can yield high PMI scores. Various normalizations have been proposed, including normalized PMI (NPMI = PMI / −log P(w,c)), which bounds the measure between −1 and +1 and reduces the frequency bias.
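NPMI for a single candidate pair is simple to compute from raw counts; a scalar sketch (function name and argument layout are illustrative) is below. Note that the log base cancels in the ratio, so any base gives the same NPMI score.

```python
import math

def npmi(count_xy, count_x, count_y, total):
    """Normalized PMI for a word pair, bounded in [-1, +1] (Bouma, 2009)."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    pmi = math.log2(p_xy / (p_x * p_y))
    return pmi / -math.log2(p_xy)
```

A pair that only ever occurs together (count_xy = count_x = count_y) scores exactly +1, and an independent pair scores 0, regardless of how rare the words are; this is the sense in which NPMI tames raw PMI's frequency bias.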

Practical Considerations

Several practical enhancements improve PMI-based representations. Context distribution smoothing, which raises context frequencies to a power α (typically 0.75) before computing PMI, addresses the bias toward rare contexts and yields substantial improvements on word similarity and analogy tasks. This smoothing was shown by Levy et al. (2015) to be one of the key hyperparameters explaining the performance gap between count-based and neural embedding methods. When count-based methods are tuned with the same design choices (context distribution smoothing, subsampling of frequent words), they largely close the gap with neural models.
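Context distribution smoothing amounts to one change in the PMI computation: the context marginal P(c) is replaced by the smoothed P_α(c) defined earlier. A minimal NumPy sketch (illustrative function name) is below; because smoothing inflates the probability mass assigned to rare contexts, their PMI with every word is pushed down.

```python
import numpy as np

def pmi_with_cds(counts, alpha=0.75):
    """PMI with context distribution smoothing (Levy et al., 2015).

    Context probabilities come from counts raised to the power alpha,
    which raises the relative probability of rare contexts and thus
    lowers their PMI scores; alpha=1.0 recovers plain PMI.
    """
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)
    c_counts = counts.sum(axis=0)
    p_c_alpha = (c_counts ** alpha) / (c_counts ** alpha).sum()
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c_alpha[np.newaxis, :]))
    return pmi
```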

PMI can be extended to higher-order associations. Trigram PMI measures the association strength of three-word combinations, useful for multiword expression detection. Conditional PMI, which measures association after conditioning on a third variable, can be used to test whether a co-occurrence pattern is mediated by a confound. These extensions make PMI a versatile tool throughout computational linguistics, from lexicography and corpus analysis to the foundations of vector-space semantics.
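One common way to write the trigram extension compares the joint probability of the three-word combination against the fully independent baseline P(x) · P(y) · P(z); other baselines exist, so treat this as one illustrative formulation rather than the only definition.

```python
import math

def trigram_pmi(count_xyz, count_x, count_y, count_z, total):
    """Trigram PMI: log-ratio of the joint probability of a three-word
    combination to the independence baseline P(x) * P(y) * P(z)."""
    p_xyz = count_xyz / total
    p_x, p_y, p_z = count_x / total, count_y / total, count_z / total
    return math.log2(p_xyz / (p_x * p_y * p_z))
```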

References

  1. Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
  2. Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27 (pp. 2177–2185).
  3. Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225. doi:10.1162/tacl_a_00134
  4. Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. In Proceedings of GSCL (pp. 31–40).