BLEU-4 is a standard automatic metric for evaluating the quality of generated text, such as machine translations or image captions. It measures the precision of n-gram overlaps (from single words up to four-word sequences) between a candidate text and one or more human-written reference texts, and applies a brevity penalty to discourage overly short outputs. The result is a single score that reflects both fluency and adequacy.
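The computation described above can be sketched in pure Python. This is a minimal illustrative implementation (function and variable names are my own, not from any particular library): for each n from 1 to 4 it computes the clipped n-gram precision, takes the geometric mean, and multiplies by the brevity penalty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, references):
    """Sentence-level BLEU-4: geometric mean of clipped 1- to 4-gram
    precisions, multiplied by a brevity penalty (a simplified sketch)."""
    log_precisions = []
    for n in range(1, 5):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its maximum count
        # in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            # Any zero precision drives the geometric mean to zero.
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the
    # closest-length reference.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / 4)
```

For example, a candidate identical to its reference scores 1.0, while a candidate sharing only some n-grams with the reference scores strictly between 0 and 1. Production use typically relies on established implementations (e.g. sacreBLEU or NLTK), which add smoothing and standardized tokenization that this sketch omits.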
Also known as: BLEU, BLEU score, BLEU-N (general), n-gram overlap metric