Text generation, also known as natural language generation (NLG), is the task of producing fluent, coherent, and contextually appropriate natural language text. The input to a generation system may take various forms: a prompt or topic (open-ended generation), a structured database or knowledge base (data-to-text generation), a meaning representation (semantic generation), an image (caption generation), or another text (summarisation, translation, paraphrase). Text generation is one of the oldest problems in NLP, with roots in early work on template-based systems, but has been transformed by neural language models that can produce remarkably fluent and diverse text.
Decoding Strategies
Beam search: maintain the b highest-scoring partial hypotheses (beam width b)
Sampling: yₜ ~ P(w | y₁, ..., yₜ₋₁, x)
Top-k sampling: sample from the k most probable tokens
Nucleus (top-p) sampling: sample from the smallest set V′ such that ∑_{w ∈ V′} P(w) ≥ p
Decoding strategies determine how text is generated from a language model's probability distribution. Greedy decoding selects the most probable token at each step, producing deterministic but often repetitive and unnatural text. Beam search maintains multiple hypotheses, exploring a broader range of outputs, but tends to produce generic, high-probability text that lacks diversity. Stochastic decoding methods — including pure sampling, top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2020) — introduce randomness that produces more diverse and natural-sounding text but may sacrifice coherence. The choice of decoding strategy depends on the application: beam search is preferred for tasks requiring accuracy (translation), while sampling-based methods are better for creative generation.
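The two truncation rules above can be written down directly. The following is a minimal NumPy sketch over a toy next-token distribution; the function names are illustrative, and a real decoder would apply one of these filters to the model's distribution at every step before sampling.

```python
import numpy as np

def top_k_filter(probs, k=2):
    """Zero out all but the k most probable tokens, then renormalise."""
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p=0.9):
    """Nucleus sampling's filter: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalise over that set."""
    order = np.argsort(probs)[::-1]              # most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix covering p
    nucleus = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[nucleus] = probs[nucleus]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
# With p=0.7 the nucleus is the first two tokens; both filters then
# renormalise their mass to [0.625, 0.375, 0, 0].
sample = np.random.choice(len(probs), p=top_p_filter(probs, p=0.7))
```

Note that the nucleus size adapts to the shape of the distribution: a peaked distribution yields a small nucleus, a flat one a large nucleus, which is why top-p tends to behave better than a fixed k across contexts.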
Controllable and Grounded Generation
Controllable generation aims to produce text with specified attributes such as topic, style, sentiment, or formality. CTRL (Keskar et al., 2019) uses control codes prepended to the input to guide generation toward specific domains or styles. PPLM (Plug and Play Language Models; Dathathri et al., 2020) uses small attribute classifiers to steer generation without retraining the language model. Instruction-tuned models such as InstructGPT and ChatGPT learn to follow natural language instructions, enabling fine-grained control through prompting rather than architectural modifications.
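The simplest form of decode-time steering is a bag-of-words attribute model, which boosts the probability of tokens associated with the desired topic. This is a toy sketch in that spirit (PPLM itself steers hidden states via classifier gradients, which is more involved); `topic_token_ids` and `strength` are illustrative names, not part of any published API.

```python
import numpy as np

def steer_logits(logits, topic_token_ids, strength=2.0):
    """Add a bonus to the logits of topic-related tokens, then softmax.
    Larger `strength` pushes generation harder toward the topic,
    at the cost of fluency."""
    steered = logits.copy()
    steered[topic_token_ids] += strength
    exp = np.exp(steered - steered.max())   # stable softmax
    return exp / exp.sum()
```

Applied at every decoding step, this shifts probability mass toward the attribute vocabulary while leaving the base model's weights untouched, which is the key appeal of plug-and-play control.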
Evaluating text generation quality remains one of the most challenging problems in NLP. Automatic metrics such as BLEU, ROUGE, and METEOR measure surface similarity to reference texts but correlate poorly with human judgments of quality for open-ended generation. Human evaluation remains the gold standard but is expensive, slow, and difficult to standardise. More recent metrics such as BERTScore, MAUVE, and model-based evaluators (using large language models as judges) offer improved correlation with human preferences, but the fundamental challenge persists: text quality is multidimensional (fluency, coherence, informativeness, faithfulness) and context-dependent.
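The weakness of surface-overlap metrics is easy to demonstrate. Below is a minimal unigram-overlap F1 in the style of ROUGE-1 (real implementations add stemming, multi-reference handling, and longest-common-subsequence variants); a perfect paraphrase with no shared words scores zero.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """ROUGE-1-style unigram F1: harmonic mean of unigram precision
    (overlap / candidate length) and recall (overlap / reference length)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

rouge1_f("the cat sat", "the cat sat")      # identical text scores 1.0
rouge1_f("a feline rested", "the cat sat")  # same meaning, zero lexical overlap: 0.0
```

Embedding-based metrics such as BERTScore address exactly this failure mode by matching tokens in contextual embedding space rather than by exact string identity.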
Grounded generation produces text that is faithful to external knowledge sources, addressing the hallucination problem that plagues purely model-based generation. Retrieval-augmented generation (RAG; Lewis et al., 2020) retrieves relevant documents from a knowledge store and conditions generation on the retrieved context, improving factual accuracy for knowledge-intensive tasks. Data-to-text generation produces natural language descriptions of structured data such as tables, weather reports, or sports statistics, requiring the generated text to accurately reflect the input data without fabricating information. These approaches balance the fluency of neural generation with the accuracy demanded by real-world applications.
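The retrieve-then-condition pattern can be sketched end to end. In this toy version the retriever ranks documents by term overlap with the query (production systems use dense bi-encoder embeddings) and the "conditioning" is simply prepending the retrieved context to the prompt handed to a language model; all names here are illustrative.

```python
def retrieve(query, documents, k=1):
    """Toy lexical retriever: rank documents by the number of query
    terms they contain and return the top k."""
    terms = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def rag_prompt(query, documents, k=1):
    """Condition generation on retrieved evidence by prepending it to the
    query; the resulting prompt goes to any language model."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Because the generator sees the retrieved passages verbatim, its output can be checked against them, which is what makes retrieval-augmented systems more auditable than closed-book generation.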