
Kneser-Ney Smoothing

Kneser-Ney smoothing combines absolute discounting with a modified lower-order distribution based on continuation probability. Empirical evaluations have consistently found it to be the most effective smoothing method for n-gram language models.

P_KN(wᵢ | wᵢ₋₁) = max(C(wᵢ₋₁,wᵢ) − d, 0) / C(wᵢ₋₁) + λ(wᵢ₋₁) · P_cont(wᵢ)

Kneser-Ney smoothing, introduced by Reinhard Kneser and Hermann Ney in 1995, is widely regarded as the most effective smoothing method for n-gram language models. Its key innovation is the use of a modified lower-order distribution, called the continuation probability, which estimates how likely a word is to appear in a novel context rather than simply how frequent it is overall. This distinction allows Kneser-Ney smoothing to make better predictions when backing off from higher-order to lower-order n-grams.

Continuation Probability

Continuation probability:
P_cont(wᵢ) = |{w : C(w, wᵢ) > 0}| / |{(w, w') : C(w, w') > 0}|

λ(wᵢ₋₁) = (d / C(wᵢ₋₁)) · |{w : C(wᵢ₋₁, w) > 0}|
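
The three formulas above can be sketched directly in Python. This is a minimal illustration, not a production implementation: the function names and the fixed discount d = 0.75 are choices made for this sketch, and counts are stored in a plain `Counter` over bigram pairs.

```python
from collections import Counter

def continuation_prob(word, bigram_counts):
    # P_cont(w) = |{w' : C(w', w) > 0}| / |{(w', w'') : C(w', w'') > 0}|
    left_contexts = len({a for (a, b) in bigram_counts if b == word})
    return left_contexts / len(bigram_counts)

def kn_bigram_prob(history, word, bigram_counts, d=0.75):
    # Interpolated Kneser-Ney: discounted bigram estimate plus
    # lambda(history) times the continuation probability.
    hist_count = sum(c for (a, _), c in bigram_counts.items() if a == history)
    if hist_count == 0:
        # Unseen history: fall back to the continuation distribution alone.
        return continuation_prob(word, bigram_counts)
    # max(C(history, word) - d, 0) / C(history)
    discounted = max(bigram_counts[(history, word)] - d, 0) / hist_count
    # lambda(history) = (d / C(history)) * |{w : C(history, w) > 0}|
    followers = len({b for (a, b) in bigram_counts if a == history})
    lam = (d / hist_count) * followers
    return discounted + lam * continuation_prob(word, bigram_counts)

# Toy corpus for illustration.
tokens = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
```

A useful sanity check on this formulation: for any observed history, the discounted bigram mass plus λ times the continuation distribution sums to one over the vocabulary of words that appear as continuations in the corpus.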

The continuation probability P_cont(wᵢ) is proportional to the number of distinct word types that precede wᵢ in the training corpus, rather than the raw frequency of wᵢ. This distinction matters because some words are frequent but appear in few contexts. For example, "Francisco" is frequent in text about San Francisco but almost always follows "San." Its unigram probability is high, but its continuation probability is low, correctly reflecting its limited ability to appear in novel bigram contexts. Conversely, common function words like "the" appear after many different words and have high continuation probability.
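
The "Francisco" effect can be checked on a small assumed corpus. In the toy text below, "francisco" and "the" occur equally often, so their unigram probabilities are identical, yet their continuation probabilities differ sharply because only one word ever precedes "francisco":

```python
from collections import Counter

# Assumed toy corpus: "francisco" always follows "san", while "the"
# follows several different words.
tokens = ("san francisco is great . i love san francisco . "
          "visit san francisco soon . see the bridge . cross the bay . "
          "walk the hills .").split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def p_unigram(w):
    # Raw relative frequency of w.
    return unigrams[w] / len(tokens)

def p_cont(w):
    # Distinct left contexts of w, over all distinct bigram types.
    return len({a for (a, b) in bigrams if b == w}) / len(bigrams)
```

Here `p_unigram("francisco")` equals `p_unigram("the")`, but `p_cont("francisco")` is much smaller than `p_cont("the")`, which is exactly the distinction the backoff distribution needs.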

Modified Kneser-Ney

Chen and Goodman (1999) proposed modified Kneser-Ney smoothing, which uses three separate discount values d₁, d₂, and d₃₊ for n-grams with counts of one, two, and three or more, respectively. The optimal discount values are estimated from the training data using a formula derived from leave-one-out analysis:

d_c = c − (c+1) · (N₁ / (N₁ + 2N₂)) · (N_{c+1} / N_c)

where N_c denotes the number of distinct n-grams occurring exactly c times. This modification captures the empirical finding that the appropriate discount varies with count magnitude, with singletons requiring larger discounts.
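
The discount formula can be sketched as follows. The function name is hypothetical, and the sketch assumes the counts-of-counts N₁ through N₄ are all nonzero (true for any realistically sized corpus):

```python
from collections import Counter

def modified_kn_discounts(ngram_counts):
    """Estimate the modified Kneser-Ney discounts d1, d2, d3+ from
    counts-of-counts via d_c = c - (c+1) * Y * N_{c+1} / N_c,
    with Y = N1 / (N1 + 2*N2).  Assumes N1..N4 are all nonzero."""
    # N_c = number of distinct n-grams seen exactly c times
    n = Counter(ngram_counts.values())
    y = n[1] / (n[1] + 2 * n[2])
    d1 = 1 - 2 * y * n[2] / n[1]
    d2 = 2 - 3 * y * n[3] / n[2]
    d3 = 3 - 4 * y * n[4] / n[3]
    return d1, d2, d3
```

For example, with N₁ = 10, N₂ = 5, N₃ = 3, and N₄ = 2, we get Y = 0.5 and discounts d₁ = 0.5, d₂ = 1.1, d₃₊ = 5/3, increasing with count as the modification intends.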

Why Kneser-Ney Dominates

The superiority of Kneser-Ney smoothing arises from two insights working together. First, absolute discounting is a better approximation to the Good-Turing adjustments than additive methods. Second, the continuation probability provides a semantically appropriate backoff distribution that reflects how words actually behave in novel contexts. No other smoothing method combines these two properties as effectively, which explains why Kneser-Ney has remained the gold standard for n-gram smoothing for over two decades.

Implementation and Impact

Kneser-Ney smoothing is implemented in widely used language modeling toolkits including SRILM, KenLM, and IRSTLM. KenLM, developed by Kenneth Heafield, provides a particularly efficient implementation that uses sorted arrays and memory-mapped files to enable querying of very large models. These implementations have been critical infrastructure for statistical machine translation systems and speech recognizers, where Kneser-Ney smoothed n-gram models served as the standard language model component for many years.

The principles underlying Kneser-Ney smoothing have also influenced neural language modeling. The idea that a backoff distribution should reflect a word's versatility rather than its frequency resonates with the way neural models learn distributed representations that capture contextual flexibility. Recent work has shown connections between Kneser-Ney smoothing and certain neural architectures, suggesting that effective language models, whether count-based or neural, must capture similar statistical regularities.

References

  1. Kneser, R., & Ney, H. (1995). Improved backing-off for m-gram language modeling. Proceedings of ICASSP, 181–184. doi:10.1109/ICASSP.1995.479394
  2. Chen, S. F., & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4), 359–394. doi:10.1006/csla.1999.0128
  3. Heafield, K. (2011). KenLM: Faster and smaller language model queries. Proceedings of the Sixth Workshop on Statistical Machine Translation, 187–197.
  4. Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1), 1–38. doi:10.1006/csla.1994.1001
