
BLEU Score

BLEU (Bilingual Evaluation Understudy) is the most widely used automatic metric for machine translation evaluation, measuring the precision of n-gram matches between a candidate translation and one or more reference translations.

BLEU = BP · exp(Σ_{n=1}^{N} w_n · log p_n)

BLEU, introduced by Papineni et al. (2002), transformed machine translation evaluation by providing an automatic metric that correlates reasonably well with human judgments of translation quality. Before BLEU, MT evaluation relied almost exclusively on expensive and time-consuming human assessments. BLEU enabled rapid, reproducible evaluation that accelerated the pace of MT research. Despite well-known limitations, BLEU remains the de facto standard metric in MT publications and shared tasks, providing a common basis for comparing systems across studies.

The BLEU Formula

BLEU = BP · exp(Σ_{n=1}^{N} w_n · log p_n)

Modified n-gram precision:
p_n = Σ_{C∈Candidates} Σ_{ngram∈C} Count_clip(ngram) / Σ_{C∈Candidates} Σ_{ngram∈C} Count(ngram)

Brevity Penalty:
BP = 1 if c > r
BP = exp(1 − r/c) if c ≤ r

c = total candidate length; r = effective reference length (with multiple references, the length of the reference closest in length to the candidate)
N = 4, w_n = 1/N (uniform weights)

BLEU computes modified n-gram precision for n = 1 to 4 and combines them using a geometric mean. "Modified" precision means that each n-gram in the candidate is counted at most as many times as it appears in the reference, preventing a degenerate candidate that repeats a single word from achieving perfect unigram precision. The brevity penalty (BP) penalizes translations that are shorter than the reference, since short translations can achieve artificially high precision by omitting difficult content. The standard configuration uses uniform weights (w_n = 1/4) across all n-gram orders.
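The computation above can be sketched in plain Python. This is a stdlib-only illustration, not a reference implementation: the function names and the tie-breaking rule for the effective reference length are our own choices, not fixed by the formula.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(candidates, references, max_n=4):
    """candidates: list of token lists; references: list of lists of token
    lists (one or more references per candidate). Clipped counts are pooled
    over the whole corpus before the geometric mean is taken."""
    clipped = Counter()  # numerator of p_n, keyed by n
    total = Counter()    # denominator of p_n, keyed by n
    c_len = r_len = 0
    for cand, refs in zip(candidates, references):
        c_len += len(cand)
        # Effective reference length: the reference closest in length to the
        # candidate (ties broken toward the shorter reference -- a choice).
        r_len += min((len(r) for r in refs),
                     key=lambda l: (abs(l - len(cand)), l))
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            # Clip each n-gram count at its maximum count in any reference.
            max_ref = Counter()
            for ref in refs:
                for g, k in Counter(ngrams(ref, n)).items():
                    max_ref[g] = max(max_ref[g], k)
            total[n] += sum(cand_counts.values())
            clipped[n] += sum(min(k, max_ref[g]) for g, k in cand_counts.items())
    if any(clipped[n] == 0 for n in range(1, max_n + 1)):
        return 0.0  # geometric mean collapses if any p_n is zero
    log_precisions = [math.log(clipped[n] / total[n]) for n in range(1, max_n + 1)]
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(sum(log_precisions) / max_n)
```

Note how clipping handles the degenerate case from the text: a candidate that repeats a single reference word many times has its unigram count capped at that word's count in the reference, and its higher-order precisions fall to zero.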

Properties and Interpretation

BLEU scores range from 0 to 1 (often reported as 0 to 100). Scores are not easily interpretable in absolute terms — a BLEU score of 30 might represent acceptable quality for one language pair but poor quality for another. BLEU is a corpus-level metric, computed over an entire test set rather than individual sentences; sentence-level BLEU is unreliable due to the sparsity of higher-order n-gram matches. The metric is precision-oriented: it measures how much of the candidate appears in the reference, not how much of the reference is covered by the candidate.
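The precision-oriented nature of the metric is exactly why the brevity penalty matters; a short numeric sketch with toy numbers of our own:

```python
import math

# A 4-token candidate whose every word appears in a 10-token reference
# achieves unigram precision 1.0, but the brevity penalty (c <= r case)
# sharply discounts the score.
c, r = 4, 10
bp = 1.0 if c > r else math.exp(1 - r / c)
print(round(bp, 4))  # exp(1 - 10/4) = exp(-1.5) ≈ 0.2231
```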

Criticisms of BLEU

BLEU has been extensively criticized. It does not account for meaning: a translation that conveys the correct meaning using different words receives a low BLEU score. It treats all n-grams equally, ignoring the distinction between content words and function words. It cannot reward valid translation choices that differ from the reference. Callison-Burch et al. (2006) demonstrated cases where BLEU improvements did not correspond to quality improvements. Neural MT has made these limitations more acute, as NMT systems produce more fluent, diverse translations that diverge more from references than SMT outputs.

Variants and Alternatives

Numerous BLEU variants and alternative metrics have been proposed. SacreBLEU (Post, 2018) standardizes BLEU computation by fixing tokenization and other implementation details, addressing the problem that different BLEU implementations can yield significantly different scores for the same translations. Sentence-level smoothed BLEU (Chen and Cherry, 2014) adds smoothing to avoid zero n-gram counts at the sentence level. The related chrF metric computes character n-gram F-scores rather than word n-gram precisions, which makes it more robust for morphologically rich languages.
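To make the smoothing idea concrete, here is one simple scheme (add-one smoothing on the n > 1 precisions) in the spirit of Chen and Cherry (2014). The exact scheme and function name are illustrative assumptions, not their recommended method:

```python
import math
from collections import Counter

def smoothed_sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing on the n > 1 precisions.
    candidate and reference are token lists; a single reference is assumed."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        match = sum(min(k, ref[g]) for g, k in cand.items())
        total = sum(cand.values())
        if total == 0:
            return 0.0  # candidate has fewer than n tokens
        if n > 1:  # add-one smoothing keeps sparse higher orders off zero
            match, total = match + 1, total + 1
        if match == 0:
            return 0.0  # no unigram overlap with the reference at all
        log_precisions.append(math.log(match / total))
    bp = (1.0 if len(candidate) > len(reference)
          else math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Without smoothing, a sentence missing any single 4-gram match would score exactly zero; with it, near-misses receive small but nonzero scores, which is what makes sentence-level comparison usable at all.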

Despite its limitations, BLEU's longevity reflects its practical value: it is fast to compute, requires only reference translations (not source sentences), and provides a standardized benchmark that enables cross-study comparison. The MT community increasingly uses BLEU alongside other metrics — METEOR, TER, COMET, and human evaluation — to provide a more complete picture of translation quality. The development of learned metrics that better correlate with human judgments is an active and important area of research.



References

  1. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. Proceedings of ACL 2002, 311–318. doi:10.3115/1073083.1073135
  2. Callison-Burch, C., Osborne, M., & Koehn, P. (2006). Re-evaluating the role of BLEU in machine translation research. Proceedings of EACL 2006, 249–256. aclanthology.org/E06-1032
  3. Post, M. (2018). A call for clarity in reporting BLEU scores. Proceedings of WMT 2018, 186–191. doi:10.18653/v1/W18-6319
