XLNet, introduced by Yang et al. (2019), addresses two key limitations of BERT's masked language modeling: the independence assumption among masked tokens, which are predicted independently of one another, and the pre-train/fine-tune discrepancy caused by the [MASK] token, which never appears in downstream data. XLNet proposes permutation language modeling, which considers all possible orderings of the input sequence and trains the model to predict each token given any subset of the other tokens as context. This approach captures bidirectional dependencies like BERT while maintaining the autoregressive formulation of GPT, yielding a model that outperformed BERT on 20 NLP benchmarks upon its release.
Permutation Language Modeling
The pre-training objective maximizes the expected log-likelihood over factorization orders:

\max_\theta \; \mathbb{E}_{z \sim Z_T} \Big[ \sum_{t=1}^{T} \log p_\theta(x_{z_t} \mid x_{z_{<t}}) \Big]

where:
- Z_T is the set of all permutations of [1, 2, ..., T]
- z_t is the t-th element of permutation z
- z_{<t} denotes the first t-1 elements of z

In practice, only the last c tokens in each permutation are predicted (partial prediction, which reduces optimization difficulty).
In permutation language modeling, the model is trained to predict tokens in a random order rather than left-to-right. For a given permutation, each token is predicted given all tokens that precede it in that permutation — which may include tokens that are to its right in the original sequence. By averaging over all permutations, every token eventually serves as context for every other token, achieving bidirectional conditioning. Critically, the actual input sequence is not shuffled; instead, the attention mask is modified to implement different factorization orders while preserving the original positional information.
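The attention-mask construction can be illustrated concretely. The following is a minimal NumPy sketch (not XLNet's actual implementation, which fuses this with two-stream attention): given a factorization order, it builds a mask in which each position may attend only to positions that precede it in the permutation, regardless of where they sit in the original sequence.

```python
import numpy as np

def permutation_mask(perm):
    """Build a (T, T) attention mask implementing factorization order `perm`.

    mask[i, j] = 1 means position i may attend to position j.
    Position i sees exactly the positions that come before it in the
    permutation, not in the original left-to-right order.
    """
    T = len(perm)
    # rank[p] = step at which original position p is predicted
    rank = np.empty(T, dtype=int)
    rank[perm] = np.arange(T)
    # i may attend to j iff j is predicted earlier in this factorization order
    return (rank[None, :] < rank[:, None]).astype(int)

# Example: 4 tokens, factorization order 3 -> 1 -> 0 -> 2.
perm = np.array([3, 1, 0, 2])
mask = permutation_mask(perm)
# Position 0 is predicted third, so it may attend to positions 3 and 1 --
# including position 3, which lies to its right in the original sequence.
```

Note that the inputs themselves are never reordered; only this mask changes between sampled permutations, so positional information stays intact.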
Two-Stream Self-Attention
A technical challenge in permutation language modeling is that the standard transformer representation at a position cannot simultaneously encode the position's content (needed when that position serves as context for other predictions) and hide it (needed when predicting the token at that position). XLNet solves this with two-stream self-attention: a content stream that has access to the token's embedding (like standard self-attention) and a query stream that has access only to the position and the context, but not to the token itself. The query stream is used for prediction, while the content stream provides rich context representations.
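The two streams differ only in whether a position may attend to its own content. The sketch below is a hypothetical single-head layer (no residuals, layer norm, or relative positions, all of which the real model has): keys and values always come from the content stream, the content stream's mask includes self-attention, and the query stream's mask excludes it.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_attention(h, g, rank, Wq, Wk, Wv):
    """One simplified layer of two-stream self-attention (single head).

    h: (T, d) content stream -- encodes token identity plus position
    g: (T, d) query stream  -- encodes position (and context) only
    rank[i]: step at which position i appears in the factorization order
    """
    # content stream: position i attends to j if rank[j] <= rank[i] (self included)
    content_mask = rank[None, :] <= rank[:, None]
    # query stream: strictly earlier positions only -- must not see its own token
    query_mask = rank[None, :] < rank[:, None]

    k, v = h @ Wk, h @ Wv  # keys/values always come from the content stream

    def attend(q, mask):
        scores = (q @ k.T) / np.sqrt(k.shape[1])
        scores = np.where(mask, scores, -1e9)
        return softmax(scores) @ v

    # (the first position in the order has no visible context; real XLNet
    # sidesteps this by only predicting the last tokens of each permutation)
    return attend(h @ Wq, content_mask), attend(g @ Wq, query_mask)

rng = np.random.default_rng(0)
T, d = 4, 8
h0 = rng.normal(size=(T, d))            # content init: token emb + position
g0 = rng.normal(size=(T, d))            # query init: shared vector + position
W = [0.1 * rng.normal(size=(d, d)) for _ in range(3)]
rank = np.array([2, 1, 3, 0])           # factorization order 3 -> 1 -> 0 -> 2
h1, g1 = two_stream_attention(h0, g0, rank, *W)
```

Because the query stream's mask excludes the diagonal, changing the token at a position leaves that position's query-stream output unchanged, which is exactly the property needed to predict a token without leaking its identity.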
Integration with Transformer-XL
XLNet builds on Transformer-XL's segment-level recurrence mechanism, which allows the model to capture dependencies beyond the fixed context window by caching hidden states from previous segments. This gives XLNet an effective context length far exceeding its window size, which is particularly beneficial for tasks requiring long-range reasoning such as document-level question answering. The relative positional encoding scheme from Transformer-XL also replaces the absolute positional encodings used in BERT, providing better generalization to different sequence lengths.
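The recurrence mechanism can be sketched as follows: each segment attends over its own (causally masked) positions plus a cache of hidden states from earlier segments. This is a minimal single-head NumPy illustration with hypothetical names; the real model stops gradients through the cache and uses relative positional encodings.

```python
import numpy as np

def attend_with_memory(h, mem, Wq, Wk, Wv):
    """Self-attention over the current segment plus cached memory.

    h:   (T, d) hidden states for the current segment
    mem: (M, d) cached hidden states from previous segments
    """
    T, M = h.shape[0], mem.shape[0]
    ctx = np.concatenate([mem, h], axis=0)   # keys/values span memory + segment
    q, k, v = h @ Wq, ctx @ Wk, ctx @ Wv
    scores = (q @ k.T) / np.sqrt(k.shape[1])
    # causal mask: position t sees all of memory plus segment positions <= t
    mask = np.arange(M + T)[None, :] <= (np.arange(T)[:, None] + M)
    scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

def process_segments(segments, mem_len, Wq, Wk, Wv):
    """Run segments sequentially, caching the most recent `mem_len` states."""
    mem = np.zeros((0, segments[0].shape[1]))
    outs = []
    for seg in segments:
        out = attend_with_memory(seg, mem, Wq, Wk, Wv)
        outs.append(out)
        mem = np.concatenate([mem, out], axis=0)[-mem_len:]  # update cache
    return outs

rng = np.random.default_rng(1)
d = 8
W = [0.1 * rng.normal(size=(d, d)) for _ in range(3)]
segments = [rng.normal(size=(4, d)) for _ in range(3)]
outs = process_segments(segments, mem_len=6, Wq=W[0], Wk=W[1], Wv=W[2])
```

Each segment here sees up to 6 cached positions in addition to its own 4, so information propagates across segment boundaries without ever materializing attention over the full concatenated sequence.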
Results and Significance
XLNet achieved state-of-the-art results on 18 out of 20 benchmarks tested, including GLUE, SQuAD, and RACE reading comprehension. Particularly notable were its improvements on tasks requiring long-range context, where the Transformer-XL backbone provided an advantage. However, XLNet requires significantly more computation than BERT: it was pre-trained on 512 TPU v3 chips for 2.5 days, compared to BERT's training on 64 TPU chips for 4 days, raising questions about whether the improvements justify the additional cost.
XLNet's contribution is primarily conceptual: it demonstrated that the autoregressive and bidirectional approaches to pre-training can be unified through permutation, and that the [MASK] token in BERT is not necessary for capturing bidirectional context. However, subsequent work (RoBERTa, ELECTRA) showed that many of XLNet's improvements could be matched by simply training BERT longer with better hyperparameters, suggesting that data and compute scaling may be more important than the specific pre-training objective. Nevertheless, permutation language modeling remains an influential idea in the design of pre-training objectives.