When a higher-order n-gram has insufficient data for reliable estimation, the language model must fall back on information from shorter contexts. Two principal strategies accomplish this: backoff, which consults a lower-order model only when the higher-order count is zero, and interpolation, which always combines estimates from multiple orders. Both strategies are essential to practical language modeling because they allow the model to exploit longer contexts when available while maintaining robustness for rare and unseen n-grams.
Interpolation

In linear interpolation, the final probability estimate is a weighted sum of the estimates from each n-gram order. For a trigram model:

P̂(wᵢ | wᵢ₋₂, wᵢ₋₁) = λ₁ P(wᵢ | wᵢ₋₂, wᵢ₋₁) + λ₂ P(wᵢ | wᵢ₋₁) + λ₃ P(wᵢ)

subject to the constraint λ₁ + λ₂ + λ₃ = 1 with λᵢ ≥ 0, which guarantees that the mixture is itself a proper probability distribution. The interpolation weights λ can be fixed constants optimized on held-out data using the EM algorithm, or they can be context-dependent functions, λᵢ = λᵢ(wᵢ₋₂, wᵢ₋₁), that assign more weight to higher-order models when the context is well attested and more weight to lower-order models when the context is rare. Jelinek and Mercer (1980) introduced deleted interpolation, in which the weights are optimized by iteratively holding out portions of the training data.
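The weighted sum above can be sketched in a few lines. This is a minimal illustration with fixed weights and maximum-likelihood component estimates (the helper names and the default λ values are assumptions for the example, not a reference implementation):

```python
from collections import Counter

def train_counts(tokens):
    """Collect unigram, bigram, and trigram counts from a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def interp_prob(w, u, v, uni, bi, tri, lambdas=(0.5, 0.3, 0.2)):
    """P(w | u, v) as a weighted sum of trigram, bigram, and unigram MLEs.
    lambdas = (λ for trigram, λ for bigram, λ for unigram); they must sum to 1."""
    l3, l2, l1 = lambdas
    n = sum(uni.values())
    p1 = uni[w] / n if n else 0.0
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0       # MLE bigram estimate
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0  # MLE trigram estimate
    return l3 * p3 + l2 * p2 + l1 * p1
```

Because each component distribution normalizes over the vocabulary and the weights sum to one, the mixture does too; in practice the λs would be tuned on held-out data rather than fixed by hand.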
Backoff Models
In a backoff model, the higher-order n-gram estimate is used whenever it is available (i.e., the n-gram was observed in training); only when the count is zero does the model back off to the next lower order. The Katz backoff model (Katz, 1987) is the classical formulation. For bigrams:

P_bo(wᵢ | wᵢ₋₁) = d · C(wᵢ₋₁, wᵢ) / C(wᵢ₋₁)    if C(wᵢ₋₁, wᵢ) > 0
P_bo(wᵢ | wᵢ₋₁) = α(wᵢ₋₁) · P_bo(wᵢ)           otherwise

The discount d is typically derived from Good-Turing counts, and α(wᵢ₋₁) is a normalization factor ensuring the distribution sums to one: it redistributes exactly the probability mass freed by discounting to the unseen successors of wᵢ₋₁.
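A Katz-style bigram backoff can be sketched as follows. To keep the example short it uses a single fixed absolute discount d in place of Good-Turing discounting, which is a simplification of Katz's actual scheme:

```python
from collections import Counter

def katz_style_bigram(tokens, d=0.5):
    """Simplified Katz-style backoff bigram model.
    Assumption: a fixed absolute discount d stands in for Good-Turing counts."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n = sum(uni.values())
    p_uni = {w: c / n for w, c in uni.items()}

    def prob(w, v):
        if bi[(v, w)] > 0:
            return (bi[(v, w)] - d) / uni[v]  # discounted ML estimate
        # Mass freed by discounting the observed successors of v ...
        seen = {x for (a, x) in bi if a == v}
        left = d * len(seen) / uni[v] if uni[v] else 1.0
        # ... is redistributed over unseen successors in proportion to P_bo(w).
        unseen_mass = sum(p_uni[x] for x in p_uni if bi[(v, x)] == 0)
        alpha = left / unseen_mass if unseen_mass else 0.0
        return alpha * p_uni.get(w, 0.0)

    return prob
```

Note that the α computation is what makes backoff a proper distribution: summing over seen and unseen successors of a context recovers exactly 1 (up to the edge case of a word that only occurs corpus-finally).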
Chen and Goodman (1999) demonstrated that interpolated models generally outperform backoff models because interpolation always uses information from all available orders rather than discarding lower-order evidence when higher-order counts are nonzero. Even a single occurrence of a trigram provides a noisy estimate; blending it with the more reliable bigram and unigram estimates via interpolation yields better predictions. This finding led to the widespread adoption of interpolated modified Kneser-Ney as the default smoothing method.
Advanced Approaches
Beyond simple linear interpolation, more sophisticated combination strategies have been developed. Log-linear interpolation assigns weights in log-space, which can be interpreted as a product of experts model. Bayesian interpolation treats the mixture weights as random variables and integrates over their uncertainty. Generalized backoff allows the model to back off to different contexts rather than simply dropping the leftmost word, for example backing off from P(w₃ | w₁, w₂) to P(w₃ | w₁) rather than P(w₃ | w₂) when the skip-bigram is more informative.
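Log-linear interpolation differs from the linear case in that the combination happens in log-space and must be renormalized. A minimal sketch (the function name and the flooring constant for zero probabilities are illustrative assumptions):

```python
import math

def loglinear_interp(dists, weights):
    """Product-of-experts combination: p(w) ∝ Π_k p_k(w)^λ_k,
    renormalized over the union vocabulary. Zero probabilities are
    floored at 1e-12 so the logarithm is defined."""
    vocab = set().union(*(d.keys() for d in dists))
    scores = {w: math.exp(sum(l * math.log(d.get(w, 1e-12))
                              for d, l in zip(dists, weights)))
              for w in vocab}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}
```

Because the combination is multiplicative, any expert assigning near-zero probability vetoes a word; this is the characteristic behavior of a product of experts, in contrast to the additive "any expert can rescue a word" behavior of linear interpolation.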
The interpolation framework extends naturally to the combination of heterogeneous models. In modern NLP systems, it is common to interpolate n-gram models with neural language models, combining the reliability of count-based estimates for frequent patterns with the generalization capability of neural models for rare contexts. The interpolation weights can be estimated on development data, and the resulting mixture typically outperforms either component alone, demonstrating the complementary strengths of count-based and distributed approaches to language modeling.
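Estimating the mixture weight on development data can be as simple as a one-dimensional grid search minimizing perplexity. In this sketch, p_ngram[i] and p_neural[i] are hypothetical per-token probabilities that each model assigns to the i-th development token (both assumed nonzero):

```python
import math

def tune_lambda(p_ngram, p_neural, grid=20):
    """Grid-search the weight λ in λ·p_ngram + (1-λ)·p_neural that
    minimizes perplexity on a development set of aligned token probabilities."""
    best_lam, best_ppl = None, float("inf")
    for k in range(grid + 1):
        lam = k / grid
        log_lik = sum(math.log(lam * pn + (1 - lam) * pm)
                      for pn, pm in zip(p_ngram, p_neural))
        ppl = math.exp(-log_lik / len(p_ngram))
        if ppl < best_ppl:
            best_lam, best_ppl = lam, ppl
    return best_lam, best_ppl
```

Since both pure models (λ = 0 and λ = 1) lie on the grid, the tuned mixture can never be worse than either component on the development set, which is one way to see why the interpolated system typically wins.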