ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), proposed by Clark et al. (2020), introduces a novel pre-training approach inspired by generative adversarial networks. Instead of masking tokens and predicting them (as in BERT), ELECTRA trains a small generator to replace tokens and a larger discriminator to detect which tokens have been replaced. Because the discriminator performs a binary classification at every token position (real vs. replaced) rather than predicting masked tokens at only 15% of positions, ELECTRA receives a training signal from every input token, making it dramatically more sample-efficient than MLM-based methods.
Architecture and Training
Generator: a small masked language model that fills in masked positions
L_G = -Σ_{i∈masked} log p_G(xᵢ | x^masked)
Discriminator: classifies each token of the corrupted input x̂ as original or replaced
L_D = -Σᵢ₌₁ⁿ [𝟙(xᵢ = x̂ᵢ) log D(x̂, i) + 𝟙(xᵢ ≠ x̂ᵢ) log(1 - D(x̂, i))]
Combined objective: min_{θ_G, θ_D} L_G + λ · L_D (λ = 50 in the paper)
Generator size: 1/4 to 1/3 of the discriminator
After pre-training, only the discriminator is used for downstream tasks
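The discriminator loss L_D can be checked numerically. A minimal sketch with toy token ids and made-up discriminator probabilities (none of these values come from a real model or tokenizer):

```python
import math

def rtd_loss(original, corrupted, d_probs):
    """Replaced token detection loss L_D, averaged over positions.

    original  -- the true token ids x
    corrupted -- the generator's output x̂ (some tokens replaced)
    d_probs   -- D(x̂, i): discriminator's probability that token i is original
    """
    loss = 0.0
    for x_i, xh_i, p in zip(original, corrupted, d_probs):
        if x_i == xh_i:          # token is original: reward high D(x̂, i)
            loss -= math.log(p)
        else:                    # token was replaced: reward low D(x̂, i)
            loss -= math.log(1.0 - p)
    return loss / len(original)

original  = [12, 7, 99, 3, 55]
corrupted = [12, 7, 41, 3, 55]   # position 2 was replaced by the generator
d_probs   = [0.9, 0.8, 0.2, 0.95, 0.85]
print(round(rtd_loss(original, corrupted, d_probs), 4))
```

Unlike MLM, every position contributes a term, which is exactly why the training signal is denser.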
The generator is a small masked language model that proposes plausible replacements for masked positions. The discriminator, which is the main model used for downstream tasks, receives the generator's output (with some tokens replaced) and must classify each token as original or replaced. This replaced token detection (RTD) task is easier than MLM — the discriminator only needs to make a binary decision — but it provides a training signal at every position, not just the 15% that are masked. The generator and discriminator are trained jointly, with the generator's parameters discarded after pre-training.
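The data flow of one pre-training step can be sketched without any real networks. In the sketch below, a uniform random sampler is a hypothetical stand-in for the trained generator; the point is how the discriminator's labels are built. Note that, as in Clark et al. (2020), a masked position where the generator happens to reproduce the correct token is labeled original, not replaced:

```python
import random

random.seed(0)
VOCAB = list(range(100))

def electra_step(tokens, mask_rate=0.15):
    """Build one (corrupted input, RTD labels) pair for the discriminator.

    1. Pick ~15% of positions to mask.
    2. Sample replacement tokens for those positions
       (a random stand-in for a trained MLM generator).
    3. Label every position: 1 = replaced, 0 = original.
    """
    n = len(tokens)
    masked_idx = random.sample(range(n), max(1, int(mask_rate * n)))
    corrupted = list(tokens)
    for i in masked_idx:
        corrupted[i] = random.choice(VOCAB)   # generator sample (stand-in)
    # A sampled token equal to the original counts as original.
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels

tokens = [random.choice(VOCAB) for _ in range(20)]
corrupted, labels = electra_step(tokens)
print(sum(labels), "of", len(tokens), "positions labeled replaced")
```

The discriminator is then trained with the binary loss L_D over all 20 positions, while the generator is trained with its MLM loss on the masked positions only.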
Efficiency Gains
ELECTRA's primary advantage is sample efficiency. An ELECTRA-Small model trained on a single GPU for 4 days outperforms GPT on the GLUE benchmark. ELECTRA-Base, using the same compute budget as BERT-Base, substantially outperforms it. When matched for compute, ELECTRA-Large outperforms RoBERTa and XLNet on most GLUE tasks. These efficiency gains come from the fact that the discriminator learns from every token rather than just the masked 15%, effectively amplifying the training signal by roughly 7x per sequence.
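The "roughly 7x" figure is simply the ratio of supervised positions: ELECTRA's discriminator receives a loss term at every position, versus MLM's 15%:

```python
mlm_fraction = 0.15   # fraction of positions BERT predicts (masked tokens)
rtd_fraction = 1.00   # fraction of positions ELECTRA's discriminator classifies
print(round(rtd_fraction / mlm_fraction, 1))   # prints 6.7
```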
Relationship to GANs
Although ELECTRA's generator-discriminator framework resembles a generative adversarial network (GAN), there are critical differences. The generator is trained with maximum likelihood rather than an adversarial loss, because adversarial training of text generators is difficult: sampling discrete tokens is non-differentiable, so the discriminator's gradient cannot flow back to the generator. The generator and discriminator do not compete; instead, the generator simply provides a curriculum of replacement tokens that becomes more challenging as it improves. Additionally, the generator is intentionally kept smaller than the discriminator so that the replacement task remains informative without becoming trivially easy or impossibly hard.
Results and Impact
ELECTRA-Large achieved a GLUE score of 89.4, competitive with RoBERTa (88.5) and XLNet (88.4), while using roughly one quarter of the compute. On SQuAD 2.0, ELECTRA achieved 88.7 F1, matching state-of-the-art models. These results demonstrated that the MLM pre-training objective, while effective, is not the most efficient way to leverage unlabeled text, and that alternative objectives can achieve comparable quality with dramatically reduced computational cost.
ELECTRA's efficiency makes it particularly attractive in settings where computational resources are limited, such as training domain-specific models or adapting to low-resource languages. The replaced token detection paradigm has influenced subsequent work on efficient pre-training, including MC-BERT, which uses multi-choice replacements, and COCO-LM, which combines corrective language modeling with contrastive learning. ELECTRA demonstrated that rethinking the pre-training objective itself, rather than simply scaling existing approaches, can yield substantial improvements in the compute-performance tradeoff.