Computational Linguistics

Statistical Language Models

Statistical language models assign probabilities to sequences of words, forming the mathematical backbone of speech recognition, machine translation, and text generation by quantifying how likely a given string of words is in a language.

P(w₁, w₂, ..., wₙ) = ∏ᵢ₌₁ⁿ P(wᵢ | w₁, ..., wᵢ₋₁)

A statistical language model defines a probability distribution over sequences of words drawn from a vocabulary V. Given a sentence w₁, w₂, ..., wₙ, the model estimates the joint probability P(w₁, w₂, ..., wₙ) by decomposing it via the chain rule of probability into a product of conditional probabilities. This decomposition is exact but intractable in its full form, since the conditioning context grows with each word. Practical statistical language models therefore make simplifying assumptions about the extent of the conditioning history, trading off expressiveness for computational and statistical feasibility.

Chain Rule Decomposition

Chain Rule of Probability for Language:

P(w₁, w₂, ..., wₙ) = ∏ᵢ₌₁ⁿ P(wᵢ | w₁, ..., wᵢ₋₁)

In practice, limited history (Markov assumption):
P(wᵢ | w₁, ..., wᵢ₋₁) ≈ P(wᵢ | wᵢ₋ₖ, ..., wᵢ₋₁)

where k is the order of the Markov assumption: the model conditions on only the k most recent words, so an n-gram model corresponds to k = n − 1

The chain rule decomposition is the theoretical foundation of all autoregressive language models, from simple n-gram models to modern neural architectures such as GPT. Each conditional probability P(wᵢ | w₁, ..., wᵢ₋₁) represents the model's prediction for the next word given the preceding context. The quality of this prediction, measured by metrics such as perplexity, determines how well the model captures the statistical regularities of natural language.
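The chain-rule product under a first-order Markov (bigram) approximation can be sketched as follows; the probability values and the `<s>`/`</s>` boundary markers are illustrative assumptions, not estimates from any corpus:

```python
# Hypothetical bigram conditional probabilities P(word | previous word).
# <s> and </s> mark sentence start and end (a common convention).
bigram_probs = {
    ("<s>", "the"): 0.6,
    ("the", "cat"): 0.3,
    ("cat", "sat"): 0.5,
    ("sat", "</s>"): 0.4,
}

def sentence_probability(words, probs):
    """Chain rule under a first-order Markov assumption:
    P(w1, ..., wn) ≈ ∏ P(wi | w(i-1))."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        # Unseen bigrams get probability 0 here; smoothing addresses this.
        p *= probs.get((prev, cur), 0.0)
    return p

print(sentence_probability(["the", "cat", "sat"], bigram_probs))
# 0.6 * 0.3 * 0.5 * 0.4 = 0.036
```

Note that a single unseen bigram zeroes out the whole product, which is exactly the sparsity problem that motivates the smoothing techniques discussed below.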

Maximum Likelihood Estimation

The parameters of a statistical language model are typically estimated from a training corpus using maximum likelihood estimation (MLE). For a bigram model, the MLE estimate of a conditional probability is the relative frequency of the bigram in the corpus: P_MLE(wᵢ | wᵢ₋₁) = C(wᵢ₋₁, wᵢ) / C(wᵢ₋₁), where C denotes the count function. MLE is statistically consistent, but it assigns zero probability to any n-gram not observed in the training data, necessitating smoothing techniques such as add-1 (Laplace) smoothing or the interpolated methods studied empirically by Chen and Goodman (1999).
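A minimal sketch of bigram MLE with an add-1 (Laplace) variant for unseen events; the toy corpus, boundary markers, and function names are assumptions for illustration:

```python
from collections import Counter

def train_bigram(corpus):
    """Collect unigram context counts C(w_prev) and bigram counts C(w_prev, w)
    from a list of tokenized sentences, padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        padded = ["<s>"] + sent + ["</s>"]
        unigrams.update(padded[:-1])             # contexts C(w_prev)
        bigrams.update(zip(padded, padded[1:]))  # pairs C(w_prev, w)
    return unigrams, bigrams

def p_mle(w_prev, w, unigrams, bigrams):
    """P_MLE(w | w_prev) = C(w_prev, w) / C(w_prev); zero for unseen bigrams."""
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

def p_laplace(w_prev, w, unigrams, bigrams, vocab_size):
    """Add-1 (Laplace) smoothing: (C(w_prev, w) + 1) / (C(w_prev) + |V|)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram(corpus)
print(p_mle("the", "cat", uni, bi))                   # 1/2 = 0.5
print(p_mle("the", "fish", uni, bi))                  # 0.0: unseen bigram
print(p_laplace("the", "fish", uni, bi, vocab_size=6))  # (0+1)/(2+6) = 0.125
```

Laplace smoothing reserves probability mass for unseen events by pretending every possible bigram was observed once more than it actually was.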

Shannon's Foundation

Claude Shannon (1951) introduced the idea of modeling English text as a stochastic process and used n-gram statistics to approximate the entropy of English. His experiments with human subjects guessing the next character in text estimated the entropy rate of English at roughly 1 bit per character, establishing a fundamental benchmark that language models still strive to approach.
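As a small illustration of the n-gram approach to entropy estimation (not Shannon's full method, which used higher-order statistics and human guessing experiments), a zeroth-order character-frequency estimate can be sketched as:

```python
import math
from collections import Counter

def unigram_entropy_bits(text):
    """Zeroth-order entropy estimate in bits per character, based only on
    single-character frequencies: H = -∑ p(c) log₂ p(c). Higher-order
    n-gram models give progressively lower (tighter) estimates."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(unigram_entropy_bits("abab"))  # 1.0: two equiprobable symbols
```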

Evaluation and Significance

Statistical language models are evaluated primarily through perplexity, which measures how surprised the model is by held-out test data; lower perplexity indicates better predictive performance. The perplexity of a model on a test set of N words is PP = 2^H, where H = −(1/N) log₂ P(w₁, ..., w_N) is the per-word cross-entropy of the model on the test set. Because cross-entropy upper-bounds the true entropy rate of the language, perplexity connects language modeling directly to information theory.
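The perplexity computation follows directly from this definition; the helper name and the uniform toy model below are illustrative assumptions:

```python
import math

def perplexity(log2_probs):
    """PP = 2^H, where H = -(1/N) ∑ log₂ P(wᵢ | context) is the per-word
    cross-entropy over the held-out tokens."""
    h = -sum(log2_probs) / len(log2_probs)  # cross-entropy in bits per word
    return 2 ** h

# Sanity check: a uniform model over a 4-word vocabulary assigns every
# token probability 1/4, so its perplexity is exactly 4.
probs = [0.25] * 10
print(perplexity([math.log2(p) for p in probs]))  # 4.0
```

The sanity check illustrates a useful intuition: perplexity is the effective number of equally likely choices the model faces at each step.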

The development of statistical language models has been one of the central threads in computational linguistics since the 1980s. From their origins in speech recognition at IBM, where they were combined with acoustic models in the noisy channel framework, statistical language models have become essential components of virtually every NLP system. The progression from n-gram models to neural language models represents a shift from count-based to distributed representations, but the fundamental goal remains the same: accurately estimating the probability of the next word.



References

  1. Shannon, C. E. (1951). Prediction and entropy of printed English. The Bell System Technical Journal, 30(1), 50–64. doi:10.1002/j.1538-7305.1951.tb01366.x
  2. Jelinek, F. (1997). Statistical Methods for Speech Recognition. MIT Press.
  3. Chen, S. F., & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4), 359–394. doi:10.1006/csla.1999.0128
  4. Rosenfeld, R. (2000). Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8), 1270–1278. doi:10.1109/5.880083
