T5, introduced by Raffel et al. (2020) at Google, reconceives all NLP tasks as text-to-text problems: the model receives a text input (possibly with a task-specific prefix) and generates a text output. Sentiment classification becomes generating the word "positive" or "negative"; translation becomes generating the target sentence; summarization becomes generating a condensed version of the input. This unified framework allows the same model, loss function, and training procedure to be applied to every task, simplifying the multi-task learning pipeline and enabling a comprehensive empirical study of pre-training objectives, architectures, and data.
Architecture and Training
Input: text string, optionally with a task-specific prefix
Output: generated text string

Examples:
"translate English to German: That is good" → "Das ist gut"
"summarize: long article text..." → "summary text"
"stsb sentence1: ... sentence2: ..." → "3.8"

Model sizes: T5-Small (60M params) | T5-Base (220M) | T5-Large (770M) | T5-3B (3B) | T5-11B (11B)
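The examples above can be sketched as a tiny formatting helper. This is a minimal illustration of the text-to-text convention, not an API from any T5 library; the function name `format_example` is invented for this sketch.

```python
# Minimal sketch of T5's text-to-text task formatting.
# The prefixes follow the examples above; `format_example` is an
# illustrative helper, not part of any official T5 codebase.

def format_example(task_prefix: str, text: str) -> str:
    """Prepend a task-specific prefix so every task shares one string-in, string-out format."""
    return f"{task_prefix}: {text}"

# Every task becomes a string-to-string mapping, so one model handles all of them.
translation_input = format_example("translate English to German", "That is good")
summarization_input = format_example("summarize", "long article text...")

print(translation_input)  # translate English to German: That is good
```

Because even regression tasks like STS-B are cast this way, the target is simply the string form of the score (e.g. "3.8"), which the model generates token by token.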
T5 uses the original encoder-decoder transformer architecture. The encoder processes the input sequence bidirectionally, and the decoder generates the output autoregressively. Pre-training uses a span corruption objective: random contiguous spans of tokens are replaced with sentinel tokens, and the model is trained to generate the missing spans. This objective is a generalization of BERT's masked language modeling that requires generating multi-token spans rather than predicting individual masked tokens, and it naturally fits the text-to-text framework since both input and output are text sequences.
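The span corruption objective can be sketched in a few lines. This is a simplified illustration under the assumption that span positions are chosen elsewhere (the real objective samples them randomly); the sentinel-token naming `<extra_id_N>` matches T5's vocabulary, but `span_corrupt` itself is an invented helper.

```python
# Hedged sketch of T5-style span corruption, not the official implementation.
# Contiguous spans in the input are replaced by sentinel tokens; the target
# lists each sentinel followed by the tokens it replaced, ending with a
# final sentinel that marks the end of the targets.

def span_corrupt(tokens, spans):
    """tokens: list of strings. spans: sorted, non-overlapping (start, end) pairs."""
    inp, tgt, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[cursor:start])   # keep uncorrupted tokens in the input
        inp.append(sentinel)               # replace the span with one sentinel
        tgt.append(sentinel)               # target: sentinel, then the missing span
        tgt.extend(tokens[start:end])
        cursor = end
    inp.extend(tokens[cursor:])
    tgt.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(1, 3), (6, 7)])
print(" ".join(inp))  # Thank <extra_id_0> inviting me to <extra_id_1> party last week
print(" ".join(tgt))  # <extra_id_0> you for <extra_id_1> your <extra_id_2>
```

Note how the decoder must generate multi-token spans ("you for"), which is what distinguishes this objective from BERT's single-token mask prediction.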
The Colossal Clean Crawled Corpus (C4)
A major contribution of the T5 paper was the creation of C4, a roughly 750 GB cleaned version of Common Crawl data. The cleaning process removed duplicate content, incomplete sentences, offensive language, and boilerplate text. The paper's systematic comparison showed that training on C4 outperformed training on unfiltered Common Crawl, Wikipedia alone, or other existing corpora, demonstrating that data quality significantly impacts downstream performance and highlighting the importance of large-scale, high-quality pre-training data.
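A few of the line-level cleaning heuristics can be sketched as follows. This is a simplified, illustrative subset under stated assumptions: the real C4 pipeline also applied language identification, a bad-words filter, and deduplication of repeated three-sentence spans across documents, none of which is reproduced here.

```python
# Illustrative subset of C4-style line filtering (not the full pipeline).
# Keeps lines that look like complete sentences and drops obvious boilerplate.

def clean_lines(lines, min_words=3):
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        if len(line.split()) < min_words:             # drop very short fragments
            continue
        if not line.endswith((".", "!", "?", '"')):   # keep only lines ending in terminal punctuation
            continue
        if "javascript" in line.lower() or "lorem ipsum" in line.lower():
            continue                                  # common boilerplate markers
        if line in seen:                              # exact-duplicate removal
            continue
        seen.add(line)
        kept.append(line)
    return kept

raw = [
    "Home | About | Contact",
    "This is a complete sentence.",
    "This is a complete sentence.",
    "Please enable javascript to view this page.",
    "Another well formed sentence here!",
]
cleaned = clean_lines(raw)
```

Even this crude filter illustrates the principle the paper quantified: discarding navigation menus, duplicates, and script warnings leaves a corpus of mostly natural sentences.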
The T5 paper is notable for its thorough empirical comparison of design choices. It compared encoder-decoder, decoder-only, and prefix language model architectures; compared denoising objectives including BERT-style, replace-span, and drop-token variants; varied corruption rates from 10% to 50%; and tested span lengths from individual tokens to entire sentences. The finding that an encoder-decoder architecture with span corruption at 15% rate and mean span length of 3 tokens was optimal provided practical guidance for the field and demonstrated the value of systematic experimentation at scale.
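The chosen hyperparameters have a simple back-of-the-envelope consequence, sketched below: with a 15% corruption rate and a mean span length of 3 tokens, the expected number of corrupted spans in a sequence of n tokens is n × 0.15 / 3. The function name is illustrative.

```python
# Arithmetic sketch of the optimal hyperparameters found in the paper:
# corruption rate 15%, mean span length 3 tokens.

def expected_spans(n_tokens, corruption_rate=0.15, mean_span_length=3):
    corrupted_tokens = n_tokens * corruption_rate   # expected corrupted tokens
    return corrupted_tokens / mean_span_length      # expected number of spans

print(expected_spans(512))  # ~25.6 spans in a 512-token sequence
```

Longer mean spans at the same corruption rate mean fewer sentinels and shorter targets, which is one reason span corruption trains faster than token-level masking.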
Multi-Task Learning and Adaptation
T5 explored multiple strategies for multi-task learning, including mixing tasks at different ratios during pre-training and sequential fine-tuning. The paper found that multi-task pre-training followed by task-specific fine-tuning achieved the best results, and that the proportion of each task in the training mixture affects final performance. These findings informed subsequent work on instruction tuning (FLAN-T5), where the model is fine-tuned on a diverse set of tasks described in natural language, dramatically improving zero-shot performance on unseen tasks.
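One of the mixing strategies the paper studied, examples-proportional mixing with an artificial dataset-size limit K, can be sketched as follows: each task is sampled in proportion to min(dataset_size, K), so very large datasets cannot dominate the mixture. The dataset sizes below are made up for illustration, and `mixing_rates` is an invented helper.

```python
# Sketch of examples-proportional mixing with a size cap K, one of the
# multi-task mixing strategies compared in the T5 paper. Dataset sizes
# here are illustrative, not from the paper.

def mixing_rates(sizes, K):
    """sizes: dict of task name -> number of examples. Returns sampling rates."""
    capped = {name: min(n, K) for name, n in sizes.items()}  # cap large datasets
    total = sum(capped.values())
    return {name: c / total for name, c in capped.items()}

sizes = {"translation": 1_000_000, "summarization": 300_000, "stsb": 6_000}
rates = mixing_rates(sizes, K=100_000)
```

With K = 100,000, translation and summarization are both capped and receive equal sampling rates, while the tiny STS-B dataset still contributes in proportion to its true size.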
T5's text-to-text paradigm has proven remarkably influential. It demonstrated that a single model architecture and training framework can unify the entire landscape of NLP tasks, from low-level token classification to open-ended generation. The approach has been extended to multilingual settings (mT5), code generation (CodeT5), and scientific text (SciFive). By reducing all NLP to sequence-to-sequence mapping, T5 clarified the fundamental flexibility of the transformer architecture and established a conceptual framework that continues to shape how researchers think about multi-task NLP.