Causal language modeling (CLM), also called autoregressive language modeling, trains a model to predict each token in a sequence given only the tokens that precede it. The "causal" designation reflects the unidirectional dependency structure: the prediction at position t depends only on positions 1 through t-1, never on future positions. This left-to-right factorization makes CLM naturally suited for text generation, since the model can produce text one token at a time by sampling from its predicted distribution. CLM is the training objective of the GPT family, LLaMA, PaLM, and virtually all modern large language models used for generation.
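The one-token-at-a-time generation loop described above can be sketched in a few lines. This is an illustrative skeleton, not any particular library's API: `toy_model` is a stand-in for a real network and simply returns a fixed logits vector over a 4-token vocabulary.

```python
import numpy as np

def sample_next_token(logits, rng, temperature=1.0):
    """Sample a token id from a logits vector via a temperature softmax."""
    scaled = logits / temperature
    scaled -= scaled.max()                          # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

def generate(model, prompt_ids, max_new_tokens, rng):
    """Autoregressive decoding: append each sampled token and feed the
    growing sequence back into the model as the next step's context."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                         # distribution over the next token
        ids.append(sample_next_token(logits, rng))
    return ids

# Hypothetical toy "model": strongly prefers token 2 regardless of context.
toy_model = lambda ids: np.array([0.0, 0.0, 5.0, 0.0])
rng = np.random.default_rng(0)
out = generate(toy_model, prompt_ids=[1], max_new_tokens=3, rng=rng)
```

A real model's `logits` would of course depend on the context `ids`; the loop structure is the same.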
Formal Definition
Training loss: L = -(1/T) Σₜ₌₁ᵀ log P(wₜ | w₁, ..., wₜ₋₁; Θ)
Implemented via causal attention mask:
mask(i, j) = { 0 if j ≤ i, −∞ if j > i }
Attention: softmax(QKᵀ / √d_k + mask) · V
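The attention formula above can be realized directly in NumPy. This is a minimal single-head sketch (no batching, no learned projections): the additive mask places −∞ on future positions so they receive zero weight after the softmax.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.
    Q, K, V: arrays of shape (T, d_k)."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                 # (T, T) raw scores
    # Additive causal mask: 0 where j <= i, -inf where j > i (the future).
    mask = np.where(np.tril(np.ones((T, T), dtype=bool)), 0.0, -np.inf)
    scores = scores + mask
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, weights = causal_attention(Q, K, V)
```

Note that `weights` is strictly lower-triangular-plus-diagonal: position 0 can attend only to itself, so its output equals `V[0]` exactly.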
In a transformer-based CLM, the causal constraint is enforced through an attention mask that prevents each position from attending to subsequent positions. The upper-triangular portion of the attention matrix is set to negative infinity before the softmax, effectively zeroing out attention to future tokens. This masking ensures that the representation at position t is a function of only tokens 1 through t, preserving the autoregressive factorization while allowing the model to be trained efficiently on all positions in parallel through teacher forcing.
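Teacher forcing makes the loss computable at every position in one pass: the model's prediction at position t is scored against the ground-truth token at position t+1. A minimal sketch of that shift-by-one cross-entropy, assuming `logits` holds per-position next-token scores:

```python
import numpy as np

def clm_loss(logits, token_ids):
    """Average next-token cross-entropy under teacher forcing.
    logits: (T, V) model predictions at each input position.
    token_ids: (T,) ground-truth sequence.
    Position t is scored against token t+1, so the final position
    has no target and is dropped."""
    preds, targets = logits[:-1], token_ids[1:]     # shift by one
    preds = preds - preds.max(axis=-1, keepdims=True)
    log_probs = preds - np.log(np.exp(preds).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```

Sanity check: with uniform logits over a vocabulary of size V, the loss is exactly log V, the entropy of a uniform guess.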
CLM versus MLM
The fundamental tradeoff between causal and masked language modeling is bidirectional context versus generation capability. CLM produces unidirectional representations because each position can attend only to previous positions, which limits its performance on understanding tasks where bidirectional context is valuable. MLM produces bidirectional representations but cannot naturally generate text, because it does not define a sequential factorization of the joint probability distribution. In practice, CLM models (GPT) excel at generation while MLM models (BERT) excel at classification and extraction, though sufficiently large CLM models have shown strong performance on understanding tasks as well.
CLM training uses teacher forcing: at each position, the model receives the ground-truth previous tokens as input, even though at generation time it would receive its own predictions. This discrepancy creates exposure bias — the model never learns to recover from its own errors during training. Scheduled sampling (Bengio et al., 2015) partially addresses this by gradually replacing ground-truth tokens with model predictions during training. However, large-scale CLM models appear to be robust to exposure bias in practice, possibly because their very low per-token error rates make cascading errors rare.
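The core mechanism of scheduled sampling can be sketched as follows. This is a simplified illustration of the idea in Bengio et al. (2015), not their exact procedure: per position, the ground-truth token is kept with probability `teacher_prob`, otherwise the model's own prediction is substituted, and `teacher_prob` is decayed toward zero over training (the linear schedule here is one of several the paper considers).

```python
import numpy as np

def scheduled_sampling_inputs(gt_tokens, model_preds, teacher_prob, rng):
    """Mix ground-truth tokens with the model's own predictions.
    Each position independently keeps the ground-truth token with
    probability `teacher_prob`, else takes the model's prediction."""
    use_gt = rng.random(len(gt_tokens)) < teacher_prob
    return np.where(use_gt, gt_tokens, model_preds)

def linear_decay(step, total_steps, floor=0.0):
    """Linearly decay the teacher-forcing probability from 1 toward `floor`."""
    return max(floor, 1.0 - step / total_steps)
```

At `teacher_prob = 1.0` this reduces to ordinary teacher forcing; at `0.0` the model conditions entirely on its own predictions, matching inference-time behavior.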
Scaling and Emergent Capabilities
The CLM objective has proven remarkably scalable. Kaplan et al. (2020) showed that CLM loss decreases predictably as a power law of model size, dataset size, and compute, enabling researchers to predict the performance of larger models before training them. The Chinchilla scaling laws (Hoffmann et al., 2022) refined these findings, demonstrating that models and datasets should be scaled proportionally and that many existing models were undertrained relative to their size. These scaling laws have guided the development of models from GPT-3 through LLaMA and beyond.
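The Chinchilla analysis can be made concrete with its parametric loss fit. The sketch below uses the approximate fitted constants reported by Hoffmann et al. (2022) for the form L(N, D) = E + A/N^α + B/D^β; treat the exact values as illustrative, and the 20-tokens-per-parameter rule as the commonly cited rough consequence of their compute-optimal analysis.

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric CLM loss as a function of parameter count N and
    training tokens D, per the fit in Hoffmann et al. (2022).
    Constants are approximate published values, used here for illustration."""
    return E + A / N**alpha + B / D**beta

def compute_optimal_tokens(N, tokens_per_param=20):
    """Rule-of-thumb reading of the Chinchilla result: scale data in
    proportion to parameters, at roughly 20 tokens per parameter."""
    return tokens_per_param * N
```

The additive form makes the "undertrained" diagnosis visible: if D is held fixed while N grows, the B/D^β term becomes the floor on achievable loss, so extra parameters buy little.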
A striking finding from scaling CLM is the emergence of capabilities that were absent in smaller models. Few-shot in-context learning, chain-of-thought reasoning, and instruction following all appear as model scale increases, suggesting that the CLM objective implicitly learns increasingly sophisticated representations of language and knowledge. Whether these capabilities emerge gradually or abruptly remains debated, but their existence has cemented CLM as the dominant pre-training objective for general-purpose language models and the foundation of modern AI assistants.