Text summarisation is the task of automatically generating a shorter version of a text that retains the most important information. Summarisation can be single-document (condensing one document) or multi-document (synthesising information from multiple sources on the same topic). It can also be generic (capturing the overall gist) or query-focused (emphasising information relevant to a specific question). The task requires identifying salient content, removing redundancy, and producing coherent output — a combination of skills that touches nearly every aspect of natural language understanding and generation.
Extractive versus Abstractive Summarisation
Extractive: S = select(D), selecting a subset of source sentences
Abstractive: S = generate(D), generating new text that may paraphrase or fuse source content
Evaluation: ROUGE-N = (∑ Count_match(n-gram)) / (∑ Count(n-gram)), measuring n-gram overlap between system and reference summaries
Summarisation approaches fall into two broad categories. Extractive summarisation selects sentences (or smaller units) from the source text and concatenates them to form the summary. Abstractive summarisation generates new text that may paraphrase, compress, or fuse information from the source, producing summaries that read more naturally but are harder to generate faithfully. In practice, many modern systems are hybrid, using extractive methods to select salient content and abstractive methods to rephrase it. The distinction parallels the difference between highlighting passages in a textbook versus writing a summary in one's own words.
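The extractive approach can be illustrated with a minimal sketch: score each sentence by how frequent its words are in the document as a whole, then emit the top-scoring sentences in their original order. The function name and the naive regex-based sentence splitter are illustrative choices, not a reference implementation.

```python
from collections import Counter
import re

def extractive_summary(document, num_sentences=2):
    """Frequency-based extractive summariser: score sentences by the
    document-level frequency of their words, keep the top scorers.
    A minimal sketch, not a production system."""
    # Naive sentence split on ., !, ? (assumes no abbreviations).
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", document.strip())
                 if s.strip()]
    # Document-wide word frequencies act as a crude salience signal.
    freq = Counter(re.findall(r"[a-z']+", document.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Length-normalised so long sentences are not favoured by default.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank by score, but emit selected sentences in source order
    # to preserve coherence.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because the output is copied verbatim from the source, such a system cannot hallucinate, which is precisely the property abstractive models give up in exchange for fluency.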
Evaluation Challenges
Evaluating summarisation quality is notoriously difficult because there is no single correct summary for a given document. The ROUGE metrics (Lin, 2004) measure n-gram overlap between system-generated and human-written reference summaries, with ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L (longest common subsequence) being the most widely reported. While ROUGE correlates with human judgments at the system level, it has well-known limitations: it rewards lexical overlap without considering semantic equivalence, fluency, or factual correctness. BERTScore and other embedding-based metrics offer more semantically aware evaluation but remain imperfect proxies for human judgment.
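The recall-oriented form of ROUGE-N can be sketched directly from the definition above: count the n-grams shared between system and reference (clipping each n-gram at its reference count) and divide by the total number of reference n-grams. This is a simplified single-reference version; full ROUGE also handles multiple references, stemming, and F-measure variants.

```python
from collections import Counter

def rouge_n(system, reference, n=1):
    """ROUGE-N recall: clipped n-gram overlap between a system summary
    and one reference summary, divided by the reference n-gram count.
    A minimal sketch of the metric from Lin (2004)."""
    def ngrams(text, n):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    sys_ngrams = ngrams(system, n)
    ref_ngrams = ngrams(reference, n)
    # Count_match clips each system n-gram at its reference count,
    # so repeating a word cannot inflate the score.
    overlap = sum(min(count, sys_ngrams[g])
                  for g, count in ref_ngrams.items())
    total = sum(ref_ngrams.values())
    return overlap / total if total else 0.0
```

For example, with reference "the cat sat on the mat" and system output "the cat sat", ROUGE-1 recall is 3/6 = 0.5 and ROUGE-2 recall is 2/5 = 0.4, which makes the metric's insensitivity to paraphrase easy to see: a semantically equivalent rewording would score near zero.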
The Document Understanding Conference (DUC, 2001–2007) and its successor, the Text Analysis Conference (TAC, 2008–2014), established the foundational evaluation framework for summarisation research. These shared tasks provided standard datasets, evaluation protocols, and human evaluation studies that enabled systematic comparison of summarisation systems. DUC/TAC evaluations revealed that extractive methods could achieve reasonable quality but consistently lagged behind human summaries, motivating the development of abstractive approaches that could bridge this gap.
A critical challenge for abstractive summarisation is faithfulness: generated summaries may contain information not present in the source document (hallucination) or contradict the source (factual inconsistency). Kryscinski et al. (2020) found that approximately 30% of summaries generated by state-of-the-art models contain factual errors. Addressing faithfulness requires methods for verifying generated content against source documents, constrained decoding that prevents hallucination, and evaluation metrics specifically designed to measure factual consistency. This challenge highlights the tension between fluency and faithfulness in natural language generation.
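One crude surface-level instance of verifying generated content against the source is to flag entity-like tokens (capitalized words, numbers) in the summary that never appear in the source document. Real factual-consistency metrics use entailment models or question answering; this heuristic, with its illustrative function name, only shows the shape of the problem.

```python
import re

def unsupported_tokens(summary, source):
    """Flag entity-like tokens (capitalized words and numbers) in a
    summary that never occur in the source document. A crude surface
    heuristic for hallucination, not a real consistency metric:
    it misses paraphrased errors and over-flags novel but correct words."""
    source_tokens = set(re.findall(r"\w+", source.lower()))
    # Candidate "facts": capitalized words and (possibly decimal) numbers.
    candidates = re.findall(r"\b(?:[A-Z][a-z]+|\d+(?:\.\d+)?)\b", summary)
    return [c for c in candidates if c.lower() not in source_tokens]
```

A summary that changes a figure or introduces a new name is caught immediately, while a summary that negates or misattributes a claim using only source vocabulary passes undetected, illustrating why faithfulness evaluation remains an open problem.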