Computational Linguistics

Statistical Machine Translation

Statistical machine translation uses probabilistic models trained on large parallel corpora to automatically learn translation correspondences, replacing hand-crafted rules with data-driven parameter estimation.

e* = argmax_e Σ_a P(f, a | e) · P(e)

Statistical machine translation (SMT), pioneered by researchers at IBM in the late 1980s, treats translation as a statistical inference problem. Given a source sentence f in a foreign language, the goal is to find the target sentence e that maximizes the posterior probability P(e|f). The approach is fundamentally empirical: translation rules and probabilities are automatically learned from parallel corpora — collections of texts and their translations — rather than specified by human linguists. SMT dominated the machine translation landscape for over two decades before being supplanted by neural approaches.

The Noisy Channel Model

SMT decoding objective:
e* = argmax_e P(e|f)
   = argmax_e P(f|e) · P(e)    (Bayes' rule; P(f) is constant with respect to e)

P(f|e) = Σ_a P(f, a|e) (translation model, marginalized over alignments)
P(e) = language model (n-gram model of target language)

Decoding is NP-hard; a beam search approximation is used in practice

The noisy channel formulation decomposes the translation problem into two independent models. The translation model P(f|e) captures bilingual correspondences — which source words translate to which target words, and how they are reordered. The language model P(e) captures target-language fluency. This modularity allows each component to be trained independently on different data: the translation model on parallel data and the language model on monolingual target-language data, which is typically far more abundant.
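As a toy illustration of this decomposition, the sketch below scores candidate translations as log P(f|e) + log P(e) and takes the argmax. All probabilities are invented for illustration, and the translation model is simplified to a one-to-one monotone word alignment rather than a sum over alignments:

```python
import math

# Toy noisy-channel scorer (illustrative numbers, not a trained system).
# Translation model: P(f_word | e_word) for French -> English.
tm = {
    ("maison", "house"): 0.8, ("maison", "home"): 0.2,
    ("la", "the"): 0.9, ("la", "it"): 0.1,
}
# Bigram language model over English: P(w_i | w_{i-1}); "<s>" marks sentence start.
lm = {
    ("<s>", "the"): 0.5, ("the", "house"): 0.3, ("the", "home"): 0.1,
    ("<s>", "it"): 0.2, ("it", "house"): 0.01, ("it", "home"): 0.01,
}

def score(f_words, e_words):
    """log P(f|e) + log P(e), assuming a monotone one-to-one alignment."""
    logp = 0.0
    for f, e in zip(f_words, e_words):   # translation model term
        logp += math.log(tm.get((f, e), 1e-9))
    for prev, cur in zip(["<s>"] + e_words, e_words):  # language model term
        logp += math.log(lm.get((prev, cur), 1e-9))
    return logp

f = ["la", "maison"]
candidates = [["the", "house"], ["the", "home"], ["it", "house"]]
best = max(candidates, key=lambda e: score(f, e))  # argmax over candidates
```

Note how the two terms interact: the translation model alone cannot distinguish "house" from "home" as fluent continuations, but the language model's bigram probabilities break the tie in favor of the more probable target sequence.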

Training Pipeline

The standard SMT training pipeline involves several stages: sentence alignment of the parallel corpus, word alignment using IBM models or HMM-based models, phrase extraction (for phrase-based systems), feature computation, and language model estimation. Each stage introduces approximations and errors that can propagate through the pipeline. The GIZA++ toolkit for word alignment and the Moses toolkit for phrase-based SMT became standard open-source implementations that enabled widespread experimentation and deployment.
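The word-alignment stage can be illustrated with IBM Model 1, which fits a lexical translation table t(f|e) by expectation-maximization. The following is a minimal sketch on an invented two-sentence corpus, not a substitute for GIZA++:

```python
from collections import defaultdict

# Minimal IBM Model 1 EM training on a toy corpus (a sketch, not GIZA++).
corpus = [("la maison".split(), "the house".split()),
          ("la fleur".split(),  "the flower".split())]

# Uniform initialization of t(f|e) over the source vocabulary.
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):                      # EM iterations
    count = defaultdict(float)           # expected counts c(f, e)
    total = defaultdict(float)           # expected counts c(e)
    for fs, es in corpus:
        for f in fs:                     # E-step: posterior alignment probabilities
            z = sum(t[(f, e)] for e in es)
            for e in es:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():      # M-step: renormalize counts
        t[(f, e)] = c / total[e]
```

Because "la" co-occurs with "the" in both sentence pairs while "maison" and "fleur" each appear only once, EM concentrates probability mass so that t("la"|"the") approaches 1 — the same co-occurrence signal that drives alignment in real corpora.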

The Role of Parallel Corpora

The performance of SMT systems is heavily dependent on the quantity and quality of available parallel data. The European Parliament proceedings (Europarl), United Nations documents, and Canadian parliamentary Hansards became standard training corpora. For well-resourced language pairs like French-English, billions of words of parallel data are available; for most of the world's language pairs, parallel data remains scarce, motivating research on low-resource and unsupervised translation methods.

Decoding and Search

The decoding problem in SMT — finding the highest-scoring translation — is computationally intractable (NP-hard for most formulations). Practical decoders use heuristic search algorithms, primarily beam search, which maintains a fixed-size set of partial translation hypotheses and extends them incrementally. Stack decoding, A* search, and cube pruning have been developed to make the search process more efficient while maintaining translation quality. The tradeoff between search accuracy and computational cost is a central concern in SMT system design.
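A minimal beam search decoder can be sketched as follows. This toy version is word-level and monotone (source words translated left to right, no reordering, no phrase segmentation, no future-cost estimation), with invented model probabilities; real decoders such as Moses handle all of those:

```python
import math
from heapq import nlargest

# Toy monotone beam-search decoder with invented word-translation and
# bigram language-model probabilities (real decoders also reorder,
# segment into phrases, and estimate future costs).
tm = {"la": {"the": 0.9, "it": 0.1}, "maison": {"house": 0.8, "home": 0.2}}
lm = {("<s>", "the"): 0.5, ("the", "house"): 0.3, ("the", "home"): 0.1,
      ("<s>", "it"): 0.2, ("it", "house"): 0.01, ("it", "home"): 0.01}

def decode(f_words, beam_size=2):
    # Each hypothesis is (log-score, target words emitted so far).
    beam = [(0.0, [])]
    for f in f_words:                           # extend hypotheses left to right
        extended = []
        for score, e_words in beam:
            prev = e_words[-1] if e_words else "<s>"
            for e, p_tm in tm[f].items():
                p_lm = lm.get((prev, e), 1e-9)
                extended.append((score + math.log(p_tm) + math.log(p_lm),
                                 e_words + [e]))
        beam = nlargest(beam_size, extended)    # prune to the beam width
    return max(beam)[1]                         # best complete hypothesis
```

The pruning step is where the accuracy-cost tradeoff lives: a larger beam explores more of the hypothesis space and reduces search errors, at linearly higher cost per source word.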

The legacy of statistical machine translation extends well beyond its direct use. SMT established the evaluation methodology (BLEU and related metrics), the experimental methodology (shared tasks and standard test sets), and the fundamental decomposition of translation into alignment, reordering, and generation that continues to inform neural approaches. Many concepts from SMT — beam search, ensembling, length normalization — were directly adopted by neural MT systems.
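The core of the BLEU metric mentioned above is modified n-gram precision with count clipping, combined with a brevity penalty. A minimal single-reference sketch (real BLEU averages n = 1..4 over a whole test set and typically applies smoothing):

```python
import math
from collections import Counter

# Sketch of BLEU's core: clipped n-gram precision plus brevity penalty
# (single reference, no smoothing; real BLEU uses n = 1..4 corpus-level).
def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=2):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipping: a candidate n-gram counts at most as often as in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(1, sum(cand.values())))
    if 0 in precisions:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty discourages short, high-precision translations.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)
```

Clipping is the key idea: a degenerate candidate like "the the the" gets credit for "the" only as many times as it appears in the reference, so precision cannot be gamed by repetition.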


