In n-gram language modeling, maximum likelihood estimation assigns zero probability to any n-gram not observed in the training corpus. Since natural language is productive and any sufficiently long sentence is likely novel, these zero probabilities are catastrophic: a single unseen bigram drives the probability of an entire sentence to zero. Smoothing techniques address this problem by redistributing some probability mass from observed n-grams to unobserved ones, ensuring that no word sequence receives zero probability. The choice of smoothing method has a substantial impact on language model performance.
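The collapse to zero is easy to reproduce. The sketch below builds a maximum-likelihood bigram model over a hypothetical toy corpus (not from the text) and queries an unseen pair:

```python
from collections import Counter

# Toy corpus (hypothetical) and its maximum-likelihood bigram model.
tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

def p_mle(prev, word):
    """Relative-frequency estimate P(word | prev); zero for unseen bigrams."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("the", "cat"))  # 0.5: "the" occurs twice, once followed by "cat"
print(p_mle("the", "sat"))  # 0.0: this bigram never occurs, so any sentence
                            # containing it gets probability zero
```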
Add-k Smoothing and Lidstone Estimation
Lidstone: P(wᵢ | wᵢ₋₁) = (C(wᵢ₋₁, wᵢ) + α) / (C(wᵢ₋₁) + α|V|)
where α ∈ (0, 1] and |V| is the vocabulary size; setting α = 1 recovers Laplace (add-one) smoothing
The simplest smoothing method, Laplace (add-one) smoothing, adds one to every n-gram count before computing relative frequencies. While conceptually elegant, add-one smoothing is too aggressive for language modeling: it reassigns far too much probability mass from frequent n-grams to unseen ones. Lidstone smoothing generalizes this approach by adding a fractional count α, which can be tuned on held-out data. Even Lidstone smoothing, however, makes the unrealistic assumption that all unseen n-grams are equally likely.
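A minimal sketch of the Lidstone estimate above, on a hypothetical toy corpus with α fixed by hand rather than tuned on held-out data:

```python
from collections import Counter

# Hypothetical toy corpus; |V| = 5 word types.
tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
vocab = set(tokens)

def p_lidstone(prev, word, alpha=0.1):
    """Lidstone estimate: add a fractional count alpha to every bigram."""
    return (bigram_counts[(prev, word)] + alpha) / (
        unigram_counts[prev] + alpha * len(vocab))

# Seen and unseen bigrams both receive nonzero probability...
print(p_lidstone("the", "cat"))  # (1 + 0.1) / (2 + 0.5) ≈ 0.44
print(p_lidstone("the", "sat"))  # (0 + 0.1) / (2 + 0.5) ≈ 0.04
# ...and the conditional distribution still sums to one.
print(sum(p_lidstone("the", w) for w in vocab))  # ≈ 1.0
```

Note that every unseen bigram with the same context gets the same 0.04, illustrating the equal-likelihood assumption criticized above.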
Good-Turing Estimation
Good-Turing estimation, developed by Alan Turing and I. J. Good during World War II for cryptanalysis, uses the frequency of frequencies to estimate the probability of unseen events. The adjusted count for an n-gram with count c is c* = (c + 1) · N_{c+1} / N_c, where N_c is the number of n-grams that appear exactly c times. The total probability mass assigned to unseen n-grams equals N₁ / N, the fraction of singleton observations. Good-Turing estimation is more principled than add-k smoothing but requires additional techniques (such as Simple Good-Turing) to handle the noisy frequency-of-frequency counts at higher values of c.
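As a sketch, the frequency-of-frequencies table and adjusted counts can be computed directly. The counts below are hypothetical; a real implementation would smooth the N_c values (e.g. with Simple Good-Turing) before applying the formula:

```python
from collections import Counter

# Hypothetical bigram counts: three singletons, two doubletons, one tripleton.
counts = {"a b": 1, "c d": 1, "e f": 1, "g h": 2, "i j": 2, "k l": 3}
N = sum(counts.values())        # total observations = 10
N_c = Counter(counts.values())  # frequency of frequencies: {1: 3, 2: 2, 3: 1}

def adjusted_count(c):
    """Good-Turing adjusted count c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * N_c[c + 1] / N_c[c]

print(adjusted_count(1))  # 2 * 2 / 3 ≈ 1.33: singletons are discounted
print(N_c[1] / N)         # 0.3: mass reserved for unseen events (N_1 / N)
# adjusted_count(3) comes out 0 because N_4 = 0 -- the noisy high-count
# problem that Simple Good-Turing addresses by smoothing the N_c curve.
```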
The landmark 1999 study by Stanley Chen and Joshua Goodman systematically compared smoothing methods across multiple corpora and n-gram orders. They found that modified Kneser-Ney smoothing consistently outperformed all other methods, including Good-Turing, Witten-Bell, and various interpolation schemes. This comprehensive empirical evaluation established modified Kneser-Ney as the gold standard for n-gram smoothing and remains one of the most cited papers in language modeling.
Discounting Methods
Absolute discounting subtracts a fixed discount d (typically between 0 and 1) from each nonzero n-gram count and redistributes the freed probability mass through a lower-order distribution: P_abs(wᵢ | wᵢ₋₁) = max(C(wᵢ₋₁, wᵢ) − d, 0) / C(wᵢ₋₁) + λ(wᵢ₋₁) · P_lower(wᵢ). The interpolation weight λ(wᵢ₋₁) = d · |{w : C(wᵢ₋₁, w) > 0}| / C(wᵢ₋₁) collects exactly the discounted mass, so the distribution sums to one. Absolute discounting can be seen as a simplification of Good-Turing in which the adjustment is independent of the count magnitude, a reasonable approximation since Good-Turing discounts are roughly constant for counts above one.
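A minimal sketch of interpolated absolute discounting on a hypothetical toy corpus, taking the unigram MLE as the lower-order model. The weight λ(wᵢ₋₁) is d times the number of distinct continuations of the context, divided by the context count, which makes the conditional distribution sum to one:

```python
from collections import Counter

# Hypothetical toy corpus.
tokens = "the cat sat on the mat the cat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # times each word appears as a context
unigram_counts = Counter(tokens)
N = len(tokens)

def p_abs(prev, word, d=0.5):
    """Interpolated absolute discounting with a unigram lower-order model."""
    followers = sum(1 for (w1, _w2) in bigram_counts if w1 == prev)
    lam = d * followers / context_counts[prev]  # freed mass, renormalized
    discounted = max(bigram_counts[(prev, word)] - d, 0) / context_counts[prev]
    return discounted + lam * unigram_counts[word] / N

print(p_abs("the", "cat"))                        # ≈ 0.583
print(sum(p_abs("the", w) for w in set(tokens)))  # ≈ 1.0
```

Kneser-Ney keeps exactly this structure but replaces the unigram MLE with a continuation distribution, as discussed below.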
All smoothing methods face the same fundamental tension: they must reserve enough probability mass for unseen events to avoid zero probabilities while not degrading the estimates for observed events. The most successful methods, particularly Kneser-Ney smoothing, combine discounting with interpolation and use lower-order distributions that are specifically designed to predict words in novel contexts rather than simply backing off to cruder estimates.