Computational Linguistics

Neural Machine Translation

Neural machine translation uses end-to-end deep learning models to directly map source-language sentences to target-language sentences, replacing the modular pipeline of statistical MT with a single neural network trained to maximize translation probability.

P(y|x) = ∏_{t=1}^{T} P(y_t | y_{<t}, x; θ)

Neural machine translation (NMT) represents a paradigm shift from the modular, feature-engineered approach of statistical MT to end-to-end learning. Instead of separate translation models, language models, and reordering models combined through a log-linear framework, NMT uses a single neural network that reads the entire source sentence and generates the target sentence one token at a time. The approach was first demonstrated by Kalchbrenner and Blunsom (2013) and Sutskever et al. (2014), and within three years it had displaced statistical MT as the state of the art for virtually all language pairs.

Sequence-to-Sequence Framework

NMT Conditional Probability P(y₁, ..., y_T | x₁, ..., x_S; θ) = ∏_{t=1}^{T} P(y_t | y₁, ..., y_{t-1}, x; θ)

Encoder: h = f_enc(x₁, ..., x_S)
Decoder: P(y_t | y_{<t}, x; θ) = f_dec(y_{<t}, h)
Training: θ* = argmax_θ Σ_{(x,y)} log P(y | x; θ)

NMT systems are trained end-to-end using maximum likelihood estimation on parallel corpora. The encoder reads the source sentence and produces a sequence of continuous representations. The decoder generates the target sentence autoregressively, conditioning on the source representations and the previously generated target tokens. The entire model is trained by backpropagation to maximize the log-likelihood of the training data. This joint optimization allows all components of the model to adapt to each other, eliminating the error propagation problems of the SMT pipeline.
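The factorized objective above can be made concrete with a small sketch. This is plain NumPy with a toy vocabulary, not any particular NMT toolkit: given the decoder's per-step distributions (already conditioned on x and the previous tokens), the sequence log-likelihood is just the sum of per-token log-probabilities, which MLE training maximizes over the parallel corpus.

```python
import numpy as np

def sequence_log_likelihood(step_probs, target_ids):
    """Sum of log P(y_t | y_<t, x) over the target sequence.

    step_probs: (T, V) array; row t is the decoder's distribution over
                the vocabulary at step t (conditioned on x and y_<t).
    target_ids: length-T sequence of gold token indices.
    """
    return sum(np.log(step_probs[t, y]) for t, y in enumerate(target_ids))

# Toy example: vocabulary of 4 tokens, reference sequence [2, 0, 3].
probs = np.array([
    [0.10, 0.20, 0.60, 0.10],   # step 1: P(y_1 | x)
    [0.70, 0.10, 0.10, 0.10],   # step 2: P(y_2 | y_1, x)
    [0.05, 0.05, 0.10, 0.80],   # step 3: P(y_3 | y_<3, x)
])
ll = sequence_log_likelihood(probs, [2, 0, 3])
# Maximizing this quantity over all (x, y) pairs in the training data
# is exactly the θ* = argmax Σ log P(y | x; θ) objective above.
```

Because the objective decomposes over time steps, the gradient of each token's log-probability flows back through both the decoder and the encoder, which is what allows all components to adapt jointly.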

Key Advantages Over SMT

NMT offers several fundamental advantages. First, the continuous representations learned by the encoder capture semantic similarity, so words with similar meanings receive similar representations even if they are rare. Second, the end-to-end training optimizes translation quality directly, rather than independently optimizing sub-components. Third, NMT naturally handles long-range dependencies through attention mechanisms, which can relate any source position to any target position regardless of distance. Fourth, NMT requires minimal feature engineering — the model learns its own representations from data.
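The third point, attention relating any source position to any target position, can be sketched with simple dot-product attention. This is a simplified stand-in for the additive attention of Bahdanau et al. (2015); the shapes and variable names are illustrative only.

```python
import numpy as np

def dot_product_attention(query, keys, values):
    """Weight every source position by its relevance to the current
    decoder state, regardless of distance.

    query:  (d,)    current decoder state s_t
    keys:   (S, d)  encoder states h_1..h_S
    values: (S, d)  encoder states (identical to keys in this sketch)
    Returns the context vector and the attention weights.
    """
    scores = keys @ query                 # (S,) similarity per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over source positions
    context = weights @ values            # (d,) weighted sum of encoder states
    return context, weights

# Toy example: 5 source positions, 4-dimensional states.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))               # encoder states
s_t = rng.normal(size=4)                  # decoder state at step t
context, weights = dot_product_attention(s_t, h, h)
```

Note that the softmax is taken over source positions, so a source word at distance 1 and a source word at distance 50 compete on relevance alone, not position, which is why attention sidesteps the long-range dependency problem.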

The NMT Revolution: 2016

The year 2016 marked the tipping point for NMT. Google replaced its production phrase-based system with a neural system (Wu et al., 2016), reporting that NMT reduced translation errors by roughly 60% relative to phrase-based SMT for several language pairs in side-by-side human evaluations. The WMT 2016 shared tasks likewise showed NMT systems outperforming statistical systems for most language pairs. This rapid transition from research prototype to production deployment was remarkable, compressing a technology transition that typically takes a decade into just two to three years.

Challenges and Ongoing Research

Despite its successes, NMT faces persistent challenges. Hallucination — the generation of fluent but unfaithful content — is more common in NMT than in SMT because the model's fluency can mask adequacy problems. NMT struggles with rare words, domain mismatch, and very long sentences. The models require large amounts of parallel data, making low-resource translation particularly challenging. Training is computationally expensive, requiring powerful GPUs and large memory. The opacity of neural models also complicates error analysis and system debugging.

Active research directions include improving robustness to noise and domain shift, reducing the data requirements through transfer learning and data augmentation, incorporating linguistic knowledge as inductive biases, and developing better training objectives that go beyond maximum likelihood. Document-level NMT, which translates entire documents rather than isolated sentences, aims to improve discourse coherence. Multimodal translation incorporates visual information alongside text, and simultaneous translation generates output before the full source sentence is available.

Interactive Calculator

Enter reference and hypothesis translation pairs as CSV (one pair per line): reference sentence,hypothesis sentence. The calculator tokenizes each pair, computes n-gram precisions (1-4), brevity penalty, and the final BLEU score.

Click Calculate to see results, or Animate to watch the statistics update one record at a time.
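For readers without access to the interactive page, the same computation can be sketched offline. This is a simplified sentence-level BLEU with whitespace tokenization; production implementations (e.g. sacrebleu) differ in tokenization, smoothing, and corpus-level aggregation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    ref = reference.split()
    hyp = hypothesis.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        ref_ngrams = ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if clipped == 0:
            return 0.0   # any zero precision zeroes the geometric mean
        log_prec_sum += math.log(clipped / total)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec_sum / max_n)

score = bleu("the cat sat on the mat", "the cat sat on the mat")
# identical sentences score 1.0
```

The clipping step prevents a hypothesis from being rewarded for repeating a reference word more often than the reference contains it, and the brevity penalty keeps trivially short hypotheses from attaining high precision.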

References

  1. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27, 3104–3112. doi:10.48550/arXiv.1409.3215
  2. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. Proceedings of ICLR 2015. doi:10.48550/arXiv.1409.0473
  3. Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ... & Dean, J. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144. doi:10.48550/arXiv.1609.08144
