Kneser-Ney smoothing, introduced by Reinhard Kneser and Hermann Ney in 1995, is widely regarded as the most effective smoothing method for n-gram language models. Its key innovation is the use of a modified lower-order distribution, called the continuation probability, which estimates how likely a word is to appear in a novel context rather than simply how frequent it is overall. This distinction allows Kneser-Ney smoothing to make better predictions when backing off from higher-order to lower-order n-grams.
Continuation Probability
In interpolated Kneser-Ney, the bigram probability combines a discounted count with the continuation probability, weighted by a normalization factor λ:

P_KN(wᵢ | wᵢ₋₁) = max(C(wᵢ₋₁, wᵢ) − d, 0) / C(wᵢ₋₁) + λ(wᵢ₋₁) · P_cont(wᵢ)

P_cont(wᵢ) = |{w : C(w, wᵢ) > 0}| / |{(w, w') : C(w, w') > 0}|

λ(wᵢ₋₁) = (d / C(wᵢ₋₁)) · |{w : C(wᵢ₋₁, w) > 0}|

Here d is the absolute discount subtracted from each observed bigram count, and λ(wᵢ₋₁) rescales the subtracted mass so that the distribution sums to one.
The continuation probability P_cont(wᵢ) is proportional to the number of distinct word types that precede wᵢ in the training corpus, rather than the raw frequency of wᵢ. This distinction matters because some words are frequent but appear in few contexts. For example, "Francisco" is frequent in text about San Francisco but almost always follows "San." Its unigram probability is high, but its continuation probability is low, correctly reflecting its limited ability to appear in novel bigram contexts. Conversely, common function words like "the" appear after many different words and have high continuation probability.
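The definitions above can be sketched directly from bigram counts. The following is a minimal illustration on a toy corpus, using a single fixed discount D = 0.75 (a common default for the simple, non-modified variant); the corpus, the discount value, and the function names are illustrative choices, not part of any reference implementation.

```python
from collections import Counter

# Toy corpus; a real model would be trained on far more text.
tokens = ("san francisco is in california . the cat sat on the mat . "
          "the dog ran to the park").split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
D = 0.75  # single absolute discount, assumed here for simplicity

def p_continuation(w):
    """P_cont(w): share of distinct bigram types whose second word is w."""
    return sum(1 for (_, w2) in bigram_counts if w2 == w) / len(bigram_counts)

def p_kn(w, prev):
    """Interpolated Kneser-Ney bigram probability P(w | prev)."""
    discounted = max(bigram_counts[(prev, w)] - D, 0) / unigram_counts[prev]
    # lambda(prev) redistributes the discounted mass over P_cont
    distinct_followers = sum(1 for (w1, _) in bigram_counts if w1 == prev)
    lam = (D / unigram_counts[prev]) * distinct_followers
    return discounted + lam * p_continuation(w)

# "the" follows several distinct words; "francisco" only ever follows "san",
# so its continuation probability is low despite its raw frequency.
assert p_continuation("the") > p_continuation("francisco")
```

Note that even in this toy example, P_KN(· | "the") sums to one over the vocabulary, because λ("the") exactly compensates for the mass removed by the discount.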
Modified Kneser-Ney
Chen and Goodman (1999) proposed modified Kneser-Ney smoothing, which uses three separate discount values d₁, d₂, and d₃₊ for n-grams with counts of one, two, and three or more, respectively. The discounts are estimated from the training data via a leave-one-out analysis, yielding

d_c = c − (c + 1) · Y · (N_{c+1} / N_c),  where Y = N₁ / (N₁ + 2N₂)

and N_c denotes the number of n-gram types that occur exactly c times. This modification captures the empirical finding that the appropriate discount varies with count magnitude: in natural-language count distributions, singletons are typically discounted by a proportionally larger share of their count than higher-count n-grams.
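The discount estimation reduces to a few count-of-count statistics. A minimal sketch, assuming bigram counts are already available as a dict (the example counts below are invented purely so that N₁ through N₄ are all nonzero):

```python
from collections import Counter

def modified_kn_discounts(ngram_counts):
    """Estimate d1, d2, d3+ from count-of-counts (Chen & Goodman, 1999)."""
    # n[c] = number of n-gram types observed exactly c times
    n = Counter(ngram_counts.values())
    y = n[1] / (n[1] + 2 * n[2])
    d1 = 1 - 2 * y * n[2] / n[1]
    d2 = 2 - 3 * y * n[3] / n[2]
    d3_plus = 3 - 4 * y * n[4] / n[3]
    return d1, d2, d3_plus

# Hypothetical bigram counts, chosen only to make N1..N4 nonzero.
counts = {
    ("the", "cat"): 1, ("the", "dog"): 1, ("a", "cat"): 1,
    ("on", "the"): 2, ("in", "the"): 2,
    ("san", "francisco"): 3,
    ("of", "the"): 4,
}
d1, d2, d3_plus = modified_kn_discounts(counts)
```

Here N₁ = 3, N₂ = 2, N₃ = N₄ = 1, so Y = 3/7 and d₁ = 3/7, d₂ = 19/14, d₃₊ = 9/7. A production implementation would also need to guard against zero denominators when a count-of-count bucket is empty, which this sketch omits.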
The superiority of Kneser-Ney smoothing arises from two insights working together. First, absolute discounting is a better approximation to the Good-Turing adjustments than additive methods. Second, the continuation probability provides a semantically appropriate backoff distribution that reflects how words actually behave in novel contexts. No other smoothing method combines these two properties as effectively, which explains why Kneser-Ney has remained the gold standard for n-gram smoothing for over two decades.
Implementation and Impact
Kneser-Ney smoothing is implemented in widely used language modeling toolkits including SRILM, KenLM, and IRSTLM. KenLM, developed by Kenneth Heafield, provides a particularly efficient implementation that uses sorted arrays and memory-mapped files to enable querying of very large models. These implementations have been critical infrastructure for statistical machine translation systems and speech recognizers, where Kneser-Ney smoothed n-gram models served as the standard language model component for many years.
The principles underlying Kneser-Ney smoothing have also influenced neural language modeling. The idea that a backoff distribution should reflect a word's versatility rather than its frequency resonates with the way neural models learn distributed representations that capture contextual flexibility. Recent work has shown connections between Kneser-Ney smoothing and certain neural architectures, suggesting that effective language models, whether count-based or neural, must capture similar statistical regularities.