
IBM Models

The IBM Models are a series of five increasingly sophisticated generative models for word-level translation and alignment, forming the mathematical foundation of statistical machine translation.

P(f, a | e) = ε · ∏_{j=1}^{m} t(f_j | e_{a_j}) · d(a_j | j, l, m)

The IBM translation models, described in the landmark paper by Brown et al. (1993), define a family of generative probabilistic models that specify how a target-language sentence e generates a source-language sentence f through a latent alignment a. The five models — numbered IBM Model 1 through Model 5 — form a progression of increasing complexity, each introducing additional parameters to capture aspects of the translation process that simpler models ignore. These models provided the first rigorous mathematical framework for machine translation and remain foundational to the field.

Model 1 and Model 2

IBM Model 1 P(f, a | e) = ε / (l + 1)^m · ∏_{j=1}^{m} t(f_j | e_{a_j})

IBM Model 2 P(f, a | e) = ε · ∏_{j=1}^{m} t(f_j | e_{a_j}) · d(a_j | j, l, m)

t(f | e) = lexical translation probability
d(i | j, l, m) = alignment probability that position j of f aligns to position i of e (l = length of e, m = length of f)
ε = normalization constant

IBM Model 1 is the simplest model, assuming that all alignments are equally likely. Its only parameters are the lexical translation probabilities t(f|e), which are estimated with the EM algorithm; the uniform alignment assumption makes the E-step exact, and the likelihood is convex in the t parameters, so EM converges to the global optimum regardless of initialization. Despite its simplicity, Model 1 learns reasonable bilingual lexicons and serves as initialization for the more complex models. IBM Model 2 adds position-dependent alignment probabilities d(i|j,l,m), allowing the model to learn, for example, that words near the beginning of the source sentence tend to align with words near the beginning of the target sentence.
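The exact E-step for Model 1 can be seen in a short sketch: because alignments are independent and uniform, the posterior over which e-word generated each f-word factorizes per position, so expected counts are cheap to accumulate. This is a minimal illustration (sentence pairs, token lists, and the NULL-token convention are the only assumptions), not a production trainer.

```python
from collections import defaultdict

def train_ibm1(bitext, iterations=10):
    """EM training of IBM Model 1 lexical probabilities t(f|e).

    bitext: list of (f_sentence, e_sentence) pairs of token lists.
    A NULL token is prepended to each e sentence, following Brown et al.
    """
    NULL = "<NULL>"
    f_vocab = {f for fs, _ in bitext for f in fs}
    # Uniform initialization of t(f|e); convexity means the start
    # point does not matter for Model 1.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of e
        for fs, es in bitext:
            es = [NULL] + es
            for f in fs:
                # Exact E-step: the alignment posterior for each f word
                # is t(f|e_i) / sum_i' t(f|e_i').
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalize expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)

# Toy French-English bitext: co-occurrence alone pins down the lexicon.
bitext = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
    (["une", "maison"], ["a", "house"]),
]
t = train_ibm1(bitext)
```

After a few iterations t("maison"|"house") dominates the other candidates for "house", illustrating how EM resolves ambiguity from overlapping sentence pairs.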

Models 3, 4, and 5

Model 3 introduces the concept of fertility — the number of source-language words that each target-language word generates. A word e_i generates φ_i words of f, where the fertility φ_i is drawn from a distribution n(φ|e); earlier models could capture such one-to-many correspondences only implicitly through the alignment. This allows the model to capture phenomena like the English word "not" generating French "ne...pas" (fertility 2). Model 4 refines the distortion model by conditioning on word classes and relative positions rather than absolute positions. Model 5 addresses the deficiency of Models 3 and 4 — both assign probability mass to impossible configurations in which several generated words land on the same position — by ensuring that the distortion probabilities define a proper distribution over the positions still vacant.
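Given an alignment, the fertilities are simply occupancy counts over e positions. The helper below is an illustrative sketch (not a function from Brown et al.), using the convention that a[j] is the 1-based e position aligned to f-word j, with 0 reserved for NULL.

```python
from collections import Counter

def fertilities(alignment, e_len):
    """phi_i = number of f positions aligned to e position i.

    alignment: list where alignment[j] is the 1-based e position
    generating f word j (0 = NULL). Returns [phi_1, ..., phi_e_len].
    """
    counts = Counter(alignment)
    return [counts.get(i, 0) for i in range(1, e_len + 1)]

# Toy example: e = "he does not go", f = "il ne va pas",
# with il->he(1), ne->not(3), va->go(4), pas->not(3).
phi = fertilities([1, 3, 4, 3], e_len=4)
# phi == [1, 0, 2, 1]: "not" has fertility 2 ("ne" and "pas"),
# "does" has fertility 0.
```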

The HMM Alignment Model

The HMM alignment model (Vogel et al., 1996), while not part of the original IBM series, is widely used in practice as a bridge between Model 2 and Model 3. It models alignment as a first-order Markov process, where the alignment position of word j depends on the alignment position of word j-1. This captures the tendency for alignments to follow a roughly monotonic pattern, which the absolute position model of IBM Model 2 handles poorly. The standard training pipeline uses Model 1, HMM, Model 3, and Model 4 in sequence.
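The first-order dependence makes the HMM tractable with standard dynamic programming. The sketch below finds the Viterbi alignment under an assumed jump-width parameterization p(a_j | a_{j-1}) = jump(a_j − a_{j-1}); the lexical table t, the jump distribution, the uniform initial-position probability, and the omission of a NULL state are all simplifying assumptions of this illustration.

```python
import math

def viterbi_align(fs, es, t, jump):
    """Most likely alignment under a first-order HMM alignment model.

    t[(f, e)]: lexical translation probability (assumed given).
    jump[d]: probability of a jump d = a_j - a_{j-1} (assumed given).
    Returns the best alignment as 0-based e positions, one per f word.
    """
    l, m = len(es), len(fs)
    # delta[i] = best log-probability of a prefix ending with a_j = i;
    # initial position assumed uniform over the l target words.
    delta = [math.log(1.0 / l) + math.log(t[(fs[0], es[i])])
             for i in range(l)]
    back = []
    for j in range(1, m):
        new = [float("-inf")] * l
        ptr = [0] * l
        for i in range(l):
            for ip in range(l):
                s = (delta[ip]
                     + math.log(jump.get(i - ip, 1e-12))  # unseen jump: tiny floor
                     + math.log(t[(fs[j], es[i])]))
                if s > new[i]:
                    new[i], ptr[i] = s, ip
        delta, back = new, back + [ptr]
    # Backtrace from the best final state.
    a = [max(range(l), key=lambda i: delta[i])]
    for ptr in reversed(back):
        a.append(ptr[a[-1]])
    return list(reversed(a))

# Toy parameters: lexical evidence plus a preference for +1 jumps
# (the roughly monotonic pattern the HMM is designed to capture).
t_toy = {("il", "he"): 0.9, ("il", "goes"): 0.1,
         ("va", "he"): 0.1, ("va", "goes"): 0.9}
jump_toy = {-1: 0.2, 0: 0.2, 1: 0.6}
a = viterbi_align(["il", "va"], ["he", "goes"], t_toy, jump_toy)
# a == [0, 1]: "il" aligns to "he", "va" to "goes"
```

In real trainers the same lattice supports forward-backward, giving exact expected counts for the HMM's E-step.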

Training and Legacy

All IBM models are trained using the Expectation-Maximization (EM) algorithm, which alternates between computing expected alignment counts (E-step) and updating parameters to maximize the expected log-likelihood (M-step). For Models 1 and 2 the E-step is exact, as it is for the HMM model via the forward-backward algorithm; for Models 3–5 it requires approximation via sampling or hill-climbing over alignments, because the fertility variables make exact marginalization intractable. The GIZA++ implementation (Och and Ney, 2003) became the standard tool for IBM model training.

The IBM models' influence extends far beyond their direct use. They established the expectation-maximization approach to learning latent linguistic structure from data, anticipating similar methods in grammar induction and unsupervised parsing. The concept of alignment has been adopted in speech recognition, protein sequence analysis, and other fields. Even in the neural era, the word alignment concept pioneered by the IBM models underlies attention mechanisms and cross-lingual representation learning.

References

  1. Brown, P. F., Della Pietra, V. J., Della Pietra, S. A., & Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311. aclanthology.org/J93-2003
  2. Vogel, S., Ney, H., & Tillmann, C. (1996). HMM-based word alignment in statistical translation. Proceedings of COLING 1996, 836–841. doi:10.3115/993268.993313
  3. Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51. doi:10.1162/089120103321337421
