
METEOR Score

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is an automatic MT evaluation metric that extends beyond exact n-gram matching to incorporate stemming, synonymy, and word order, achieving higher correlation with human judgments than BLEU.

METEOR = F_mean · (1 − Penalty) where F_mean = (10 · P · R) / (R + 9 · P)

METEOR, developed by Banerjee and Lavie (2005) at Carnegie Mellon University, was designed to address the known weaknesses of BLEU. While BLEU relies solely on exact n-gram matches, METEOR computes an alignment between candidate and reference words using exact matching, stemming, synonymy (via WordNet), and paraphrase matching. It computes both precision and recall at the unigram level, combines them using a weighted harmonic mean that emphasizes recall, and applies a fragmentation penalty that penalizes translations where matched words are not in contiguous chunks. These design choices yield consistently higher correlation with human judgments than BLEU.

The METEOR Computation

Unigram precision: P = m / w_c
Unigram recall: R = m / w_r
F_mean = (10 · P · R) / (R + 9 · P)

Fragmentation penalty:
Penalty = γ · (chunks / m)^β

METEOR = F_mean · (1 − Penalty)

m = matched unigrams, w_c = candidate length, w_r = reference length
chunks = minimum contiguous groups of matched unigrams
γ = 0.5, β = 3 (default parameters)
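The formulas above can be combined directly once the unigram statistics are known. The following sketch (a hypothetical helper, not the reference implementation) computes a METEOR score from precomputed match counts:

```python
def meteor_from_counts(m, w_c, w_r, chunks, gamma=0.5, beta=3):
    """Combine matched-unigram statistics into a METEOR score.

    m      -- number of matched unigrams
    w_c    -- candidate (hypothesis) length in tokens
    w_r    -- reference length in tokens
    chunks -- number of contiguous groups of matched unigrams
    """
    if m == 0:
        return 0.0
    precision = m / w_c                                  # P = m / w_c
    recall = m / w_r                                     # R = m / w_r
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = gamma * (chunks / m) ** beta               # fragmentation penalty
    return f_mean * (1 - penalty)

# e.g. 6 matched unigrams, candidate length 7, reference length 8,
# matches falling in 2 contiguous chunks:
score = meteor_from_counts(m=6, w_c=7, w_r=8, chunks=2)  # ≈ 0.745
```

With only 2 chunks out of 6 matches the penalty is small (0.5 · (2/6)³ ≈ 0.019), so the score stays close to F_mean; a fully scrambled candidate with 6 chunks would instead lose half its F_mean.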

METEOR first computes a word alignment between candidate and reference using a greedy algorithm that prioritizes exact matches, then stem matches, then synonym matches. The alignment produces the number of matched unigrams m, from which precision and recall are computed. The harmonic mean heavily weights recall (by a factor of 9:1 over precision), reflecting the finding that recall is more important for human translation quality judgments. The fragmentation penalty captures word order: a candidate with all words matching but in a scrambled order will have many chunks and receive a high penalty.
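The two quantities the alignment stage produces, m and chunks, can be illustrated with a simplified sketch. This is an assumption-laden toy (exact matching only, first-come greedy pairing; real METEOR also tries stems and synonyms and chooses the alignment that minimizes chunks):

```python
def align_exact(candidate, reference):
    """Greedily pair each candidate token with an unused, identical
    reference token. Returns (candidate index, reference index) pairs."""
    used = [False] * len(reference)
    alignment = []
    for i, tok in enumerate(candidate):
        for j, ref_tok in enumerate(reference):
            if not used[j] and tok == ref_tok:
                used[j] = True
                alignment.append((i, j))
                break
    return alignment

def count_chunks(alignment):
    """A chunk is a maximal run of matches that is contiguous in BOTH
    the candidate and the reference."""
    if not alignment:
        return 0
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    return chunks

cand = "the cat sat on the mat".split()
ref = "on the mat sat the cat".split()
matches = align_exact(cand, ref)
m = len(matches)  # all 6 words match, but the order is scrambled
```

Here every candidate word finds a match (m = 6, so P = R = 1), yet no two matches are contiguous in both strings, so the alignment fragments into 6 chunks and the penalty reaches its maximum of γ — exactly the scrambled-order case described above.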

Advantages Over BLEU

METEOR addresses several specific BLEU shortcomings. By incorporating stemming, it gives credit for morphological variants ("running" matching "runs"). Synonym matching rewards semantically equivalent word choices. The recall component ensures that omitting content is penalized, not just inserting incorrect content. The fragmentation penalty provides a soft measure of word order that is more nuanced than BLEU's reliance on higher-order n-grams. These features make METEOR particularly effective for evaluating translations into morphologically rich languages and for neural MT systems that produce diverse paraphrases.
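The morphological-variant credit can be made concrete with a toy stemmer. This naive suffix stripper is purely illustrative (real METEOR uses the Porter stemmer, and a WordNet lookup for the synonym stage):

```python
def naive_stem(word):
    """Toy suffix stripping: NOT the Porter stemmer METEOR actually uses."""
    for suffix in ("ning", "ing", "s"):
        # guard against stripping short words down to nothing
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# "running" and "runs" fail to match exactly, but share a stem,
# so METEOR's stemming stage would align them:
matched = naive_stem("running") == naive_stem("runs")
```

BLEU would score this word pair as a complete miss; METEOR's tiered matching (exact, then stem, then synonym) awards partial credit instead.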

METEOR Universal

The original METEOR relied on language-specific resources (stemmers and WordNet synsets) available primarily for English. METEOR Universal (Denkowski and Lavie, 2014) extends the metric to any language by using character n-gram matching as a proxy for morphological and orthographic similarity, along with paraphrase tables extracted from parallel corpora. This language-independent approach makes METEOR applicable to the full range of languages addressed by modern MT systems, though performance still varies depending on the availability of paraphrase resources.
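The character n-gram idea can be sketched as a Dice-style overlap of character trigram sets. This is an assumption about the flavor of the technique, not METEOR Universal's actual matcher, which uses trained parameters and paraphrase tables:

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, padded with '#' boundary markers."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def char_similarity(a, b, n=3):
    """Dice coefficient over character n-gram sets: a language-independent
    proxy for morphological/orthographic relatedness."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# Morphological variants overlap heavily without any stemmer or lexicon:
sim = char_similarity("organize", "organization")  # 0.6
```

Because it needs no stemmer, lexicon, or WordNet, this kind of surface similarity works for any language with an orthography, which is the point of the "universal" variant.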

Role in MT Evaluation

METEOR has consistently shown higher correlation with human judgments than BLEU across multiple WMT metrics shared tasks. It serves as an important complementary metric to BLEU, and the two are frequently reported together. METEOR's explicit modeling of linguistic phenomena (stemming, synonymy, paraphrase) represents a middle ground between purely string-based metrics like BLEU and learned neural metrics like COMET and BLEURT. Understanding the tradeoffs between these approaches — interpretability, speed, correlation with human judgments, language coverage — remains an active area of MT evaluation research.

The principles underlying METEOR have influenced the design of subsequent evaluation metrics. The emphasis on recall, the incorporation of linguistic knowledge, and the soft word order penalty have been adopted and extended by metrics such as TER, BEER, and character-level metrics. More recently, neural evaluation metrics that use pre-trained language models have achieved even higher correlation with human judgments, but METEOR remains valuable for its interpretability and its ability to provide fine-grained diagnostic information about translation quality.

Interactive Calculator

Enter reference and hypothesis translation pairs as CSV (one pair per line): reference sentence,hypothesis sentence. The calculator tokenizes each pair, computes unigram precision and recall, the weighted harmonic mean F_mean, the fragmentation penalty, and the final METEOR score.

Click Calculate to see results, or Animate to watch the statistics update one record at a time.


References

  1. Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT, 65–72. aclanthology.org/W05-0909
  2. Denkowski, M., & Lavie, A. (2014). METEOR Universal: Language specific translation evaluation for any target language. Proceedings of the Ninth Workshop on Statistical Machine Translation (WMT 2014), 376–380. doi:10.3115/v1/W14-3348
  3. Lavie, A., & Agarwal, A. (2007). METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. Proceedings of the Second Workshop on Statistical Machine Translation, 228–231. aclanthology.org/W07-0734
