Word alignment establishes the correspondence between words in a source sentence and words in a target sentence within a parallel corpus. Given a sentence pair (e, f), an alignment a is a mapping that specifies, for each target word f_j, which source word e_i (or the NULL token) generated it. Word alignment was originally developed as a latent variable in the IBM translation models, but it has become a fundamental tool in its own right, used for phrase extraction, syntactic projection, and bilingual lexicon induction.
IBM Model Alignments
t(f | e) = lexical translation probability
d(i | j, l, m) = alignment (distortion) probability
l = source sentence length, m = target sentence length
Trained via Expectation-Maximization (EM)
The IBM alignment models (Brown et al., 1993) introduced a hierarchy of increasingly sophisticated models for word alignment. Model 1 assumes uniform alignment probabilities and learns only lexical translation probabilities via EM. Model 2 adds position-dependent distortion probabilities. Models 3, 4, and 5 introduce fertility (the number of target words generated by each source word) and more refined distortion models. In practice, the standard training pipeline proceeds through Models 1, 2, the HMM alignment model, and Models 3–4 in sequence, using the parameters of simpler models to initialize more complex ones.
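Model 1's EM procedure can be made concrete on a toy corpus. The following is a minimal sketch (function and variable names are illustrative, not from any particular toolkit): the E-step computes a posterior over which source word generated each target word, and the M-step re-estimates t(f | e) from the expected counts.

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    """EM training of IBM Model 1 lexical translation probabilities.

    corpus: list of (source_words, target_words) pairs; a NULL token is
    prepended to each source sentence to absorb unaligned target words.
    """
    corpus = [(["NULL"] + e, f) for e, f in corpus]
    t = defaultdict(lambda: 1.0)  # uniform initialization of t(f | e)
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(e, f)
        total = defaultdict(float)  # expected counts c(e)
        for e_sent, f_sent in corpus:
            for f in f_sent:
                # E-step: posterior over which source word generated f.
                z = sum(t[(e, f)] for e in e_sent)
                for e in e_sent:
                    p = t[(e, f)] / z
                    count[(e, f)] += p
                    total[e] += p
        # M-step: re-estimate t(f | e) from expected counts.
        for (e, f) in count:
            t[(e, f)] = count[(e, f)] / total[e]
    return dict(t)

corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"]),
          (["ein", "buch"], ["a", "book"])]
t = train_ibm_model1(corpus)
```

Even on three sentence pairs, the expected counts concentrate on the consistent pairings (das↔the, buch↔book), illustrating why Model 1 makes a good initializer for the more complex models.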
Symmetrization and GIZA++
Because IBM models produce asymmetric alignments (each target word aligns to exactly one source word), standard practice involves training models in both directions (source-to-target and target-to-source) and combining the resulting alignments using heuristics such as intersection, union, or grow-diag-final-and. The intersection yields high-precision alignments (only points agreed upon by both directions), while the union yields high-recall alignments. The grow-diag-final-and heuristic starts from the intersection, iteratively adds alignment points from the union that are adjacent (including diagonally) to existing points, and in a final step adds remaining union points whose source and target words are both still unaligned.
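Working on sets of alignment points, intersection and union are plain set operations, and the growing step can be sketched as below. This is a deliberately simplified version of grow-diag (it omits the final-and step and the per-word unaligned check); names are illustrative.

```python
def symmetrize(src2tgt, tgt2src, grow=True):
    """Combine two directional alignments given as sets of (i, j) points.

    src2tgt / tgt2src: alignment points as (source_index, target_index),
    with tgt2src already flipped into the same (i, j) orientation.
    """
    inter = src2tgt & tgt2src
    union = src2tgt | tgt2src
    if not grow:
        return inter
    # Simplified grow-diag: repeatedly adopt union points that neighbor
    # an already accepted point (including diagonal neighbors).
    alignment = set(inter)
    neighbors = [(-1, 0), (1, 0), (0, -1), (0, 1),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - alignment):
            if any((i + di, j + dj) in alignment for di, dj in neighbors):
                alignment.add((i, j))
                added = True
    return alignment

s2t = {(0, 0), (1, 1), (2, 2)}
t2s = {(0, 0), (1, 1), (1, 2), (2, 2)}
```

Here the point (1, 2) appears only in the target-to-source direction, so the intersection drops it, but growing recovers it because it is adjacent to the agreed-upon point (1, 1).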
GIZA++ (Och and Ney, 2003) became the standard tool for training IBM alignment models and remains widely used. It implements the full IBM model cascade plus the HMM alignment model. For large-scale applications, fast_align (Dyer et al., 2013) provides a much faster alternative based on a reparameterized Model 2 whose position model is governed by a single global parameter controlling how strongly alignments favor the diagonal, achieving competitive alignment quality at a fraction of the computational cost.
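The reparameterization at the heart of fast_align replaces Model 2's full position table with a prior that decays with distance from the diagonal. A sketch of that prior, with λ (lam) as the tension parameter (the exact normalization details of the released tool are not reproduced here):

```python
import math

def diagonal_prior(i, j, m, n, lam):
    """Unnormalized score for target position j (1..m) aligning to
    source position i (1..n), peaked along the diagonal j/m ~ i/n.
    lam controls how sharply the prior favors the diagonal."""
    return math.exp(-lam * abs(i / n - j / m))

def alignment_probs(j, m, n, lam):
    """Normalized p(a_j = i | j, m, n) over source positions i = 1..n."""
    scores = [diagonal_prior(i, j, m, n, lam) for i in range(1, n + 1)]
    z = sum(scores)
    return [s / z for s in scores]

probs = alignment_probs(2, 4, 4, 4.0)
```

With j = 2 in a 4-word target and 4-word source, the distribution peaks at source position i = 2, which is exactly the diagonal bias that lets a single parameter replace the full table d(i | j, l, m).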
Modern Approaches
Discriminative alignment models, such as those based on conditional random fields or neural networks, can incorporate arbitrary features and have shown improvements over generative IBM models. Neural alignment methods extract alignment information from the attention weights of neural machine translation systems, though attention weights do not always correspond to linguistically meaningful alignments. Dedicated neural alignment models, such as those using BERT-based cross-lingual representations, have achieved state-of-the-art results on standard alignment benchmarks.
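Extracting a hard alignment from an attention matrix usually reduces to an argmax per target position. A minimal sketch (the thresholding step is an illustrative assumption, not part of any specific system):

```python
def attention_to_alignment(attn, threshold=0.0):
    """attn: list of rows, one per target word; attn[j][i] is the
    attention weight from target word j to source word i.
    Returns alignment points (i, j) where source word i received
    the most attention from target word j."""
    points = set()
    for j, row in enumerate(attn):
        i = max(range(len(row)), key=lambda k: row[k])
        if row[i] > threshold:
            points.add((i, j))
    return points

attn = [[0.9, 0.1],
        [0.2, 0.8]]
```

The caveat in the text applies directly here: attention rows need not be peaked on the generating source word, so the argmax can pick a linguistically implausible point; dedicated neural aligners address this rather than relying on raw attention.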
Word alignment continues to play important roles beyond MT. In annotation projection, alignments transfer linguistic labels (POS tags, named entities, semantic roles) from a resource-rich language to a low-resource language. In bilingual lexicon induction, alignment statistics identify translation equivalents. Even in the era of end-to-end neural MT, explicit alignment information is valuable for terminology-constrained translation, interactive MT, and interpretability analysis.
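Annotation projection through an alignment is essentially a lookup: each target word receives the label of the source word it aligns to. A minimal sketch (the tag inventory and the placeholder tag are illustrative):

```python
def project_tags(src_tags, alignment, tgt_len, default="X"):
    """Project per-word labels from source to target via alignment
    points (i, j): target word j receives the tag of source word i.
    Unaligned target words keep the placeholder `default` tag."""
    tgt_tags = [default] * tgt_len
    for i, j in sorted(alignment):
        if 0 <= j < tgt_len:
            tgt_tags[j] = src_tags[i]
    return tgt_tags
```

In practice, projected labels inherit alignment errors, so projection pipelines typically filter by alignment confidence or aggregate over many sentence pairs before training on the projected annotations.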