Computational Linguistics
BERT

BERT (Bidirectional Encoder Representations from Transformers) introduced the masked language modeling pre-training objective, enabling deep bidirectional representations that dramatically advanced the state of the art across virtually all NLP benchmarks upon its release.

P(wᵢ | w₁, ..., wᵢ₋₁, [MASK], wᵢ₊₁, ..., wₙ) = softmax(Hᵢ W + b), for a masked position i

BERT, introduced by Devlin et al. (2019) at Google, marked a watershed moment in NLP by demonstrating that deep bidirectional pre-training on unlabeled text produces representations that transfer effectively to a wide range of downstream tasks with minimal task-specific architecture. BERT uses a transformer encoder trained with two objectives: masked language modeling (MLM), which randomly masks tokens and trains the model to predict them from bidirectional context, and next sentence prediction (NSP), which trains the model to determine whether two sentences are consecutive. The resulting representations achieved state-of-the-art results on eleven NLP benchmarks simultaneously.

Architecture and Pre-Training

BERT Pre-Training Objectives

Masked Language Modeling:
L_MLM = -Σ_{i∈M} log P(wᵢ | w̃)

where M is the set of masked positions and w̃ is the corrupted input sequence.

Next Sentence Prediction:
L_NSP = -[y·log P(IsNext) + (1-y)·log P(NotNext)]

Total: L = L_MLM + L_NSP
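As a toy numeric illustration of how the two losses combine (the probabilities below are hypothetical, not real model outputs):

```python
import math

# Hypothetical model probabilities of the correct token at two masked positions
p_masked = [0.7, 0.4]
l_mlm = -sum(math.log(p) for p in p_masked)

# Hypothetical NSP output for a genuinely consecutive sentence pair (y = 1)
p_is_next = 0.9
y = 1
l_nsp = -(y * math.log(p_is_next) + (1 - y) * math.log(1 - p_is_next))

# The two objectives are simply summed, as in the total loss above
total = l_mlm + l_nsp
print(f"L_MLM = {l_mlm:.4f}, L_NSP = {l_nsp:.4f}, L = {total:.4f}")
```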

BERT-Base: 12 layers, 768 hidden, 12 heads, 110M params
BERT-Large: 24 layers, 1024 hidden, 16 heads, 340M params
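The parameter counts above can be roughly sanity-checked from the layer dimensions. The sketch below (`bert_param_count` is a hypothetical helper) assumes the standard 30,522-token WordPiece vocabulary, 512 positions, and the 4x feed-forward expansion, and deliberately omits a few small terms such as the embedding LayerNorm and the MLM output head:

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, ffn_mult=4):
    """Approximate parameter count for a BERT-style transformer encoder."""
    emb = (vocab + max_pos + 2) * hidden           # token + position + segment embeddings
    attn = 4 * (hidden * hidden + hidden)          # Q, K, V, output projections (+ biases)
    ffn = 2 * ffn_mult * hidden * hidden + (ffn_mult + 1) * hidden  # two FFN layers (+ biases)
    norms = 2 * 2 * hidden                         # two LayerNorms per layer (scale + shift)
    pooler = hidden * hidden + hidden              # [CLS] pooler head
    return emb + layers * (attn + ffn + norms) + pooler

base = bert_param_count(12, 768)    # close to the reported 110M
large = bert_param_count(24, 1024)  # close to the reported 340M
print(f"BERT-Base ≈ {base / 1e6:.0f}M, BERT-Large ≈ {large / 1e6:.0f}M")
```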

In masked language modeling, 15% of input tokens are selected for prediction. Of these, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. This strategy prevents the model from simply learning to detect the [MASK] token. The model must learn rich bidirectional representations to predict the masked tokens from their full context. The input representation sums three embeddings: token embeddings, segment embeddings (indicating sentence A or B), and positional embeddings.
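The 80/10/10 corruption scheme can be sketched as follows. `mask_tokens` is a hypothetical helper operating on word strings for readability; a real implementation works on subword IDs:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, rng=None):
    """BERT-style MLM corruption: each position is selected with probability
    mask_rate; of the selected positions, 80% become [MASK], 10% become a
    random token, and 10% are left unchanged. Returns the corrupted sequence
    and per-position labels (None means the position is not predicted)."""
    rng = rng or random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok  # the model must recover the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: leave the token unchanged (but still predict it)
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, vocab=["dog", "ran", "blue"],
                                rng=random.Random(1))
```

Because unchanged and randomly replaced tokens still carry prediction labels, the model cannot rely on the [MASK] token alone to locate the positions it must predict.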

Fine-Tuning Paradigm

BERT's most significant contribution was establishing the pre-train/fine-tune paradigm for NLP. After pre-training on large unlabeled corpora (BooksCorpus and English Wikipedia, totaling 3.3 billion words), BERT is fine-tuned on task-specific labeled data by adding a simple task-specific output layer. For sentence classification, the [CLS] token representation is passed through a linear classifier. For token-level tasks like NER, each token's representation is classified independently. For question answering, start and end token positions are predicted. This approach requires minimal architectural modification across tasks.
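For sentence classification, the added task head is nothing more than a linear layer plus softmax over the [CLS] vector. A dependency-free sketch, with the encoder stubbed out and `h_cls`, `W`, and `b` as hypothetical stand-in values:

```python
import math
import random

def classify_cls(h_cls, W, b):
    """Linear classifier over the [CLS] representation: softmax(W h_cls + b)."""
    logits = [sum(h * w for h, w in zip(h_cls, row)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(42)
hidden, n_classes = 768, 2
h_cls = [rng.gauss(0, 1) for _ in range(hidden)]  # stand-in for BERT's [CLS] output
W = [[rng.gauss(0, 0.02) for _ in range(hidden)] for _ in range(n_classes)]
b = [0.0] * n_classes
probs = classify_cls(h_cls, W, b)  # class probabilities, summing to 1
```

During fine-tuning, W and b are trained jointly with all of BERT's pre-trained weights, which is what distinguishes fine-tuning from feature extraction.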

Impact on NLP Benchmarks

BERT's initial release improved the state of the art on the GLUE benchmark by 7.7 absolute points, on SQuAD 1.1 question answering by 1.5 F1 points (surpassing human performance), and on SQuAD 2.0 by 5.1 F1 points. These gains were unprecedented in scale and breadth. The model demonstrated that a single pre-trained architecture could excel across classification, entailment, question answering, and named entity recognition, validating the hypothesis that language understanding requires deep bidirectional context and that this context can be effectively learned from unlabeled text.

Limitations and Legacy

Despite its transformative impact, BERT has notable limitations. The [MASK] token used during pre-training never appears during fine-tuning, creating a pre-train/fine-tune mismatch. The independence assumption in predicting multiple masked tokens ignores correlations between them. The fixed input length of 512 tokens limits processing of longer documents. The next sentence prediction objective was later shown to be of limited value and was dropped by subsequent models like RoBERTa. Additionally, BERT's encoder-only architecture is not naturally suited to generation tasks.

BERT's legacy extends far beyond its direct performance improvements. It established the pre-train/fine-tune paradigm that became the standard approach in NLP, inspired a family of successor models (RoBERTa, ALBERT, ELECTRA, DeBERTa), and catalyzed the development of multilingual models (mBERT, XLM) that brought pre-training benefits to over 100 languages. BERT also democratized access to powerful NLP through the Hugging Face Transformers library, which made it straightforward for practitioners to fine-tune pre-trained models for specific applications.

References

  1. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT, 4171–4186. doi:10.18653/v1/N19-1423
  2. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). GLUE: A multi-task benchmark and analysis platform for natural language understanding. Proceedings of ICLR.
  3. Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the ACL, 8, 842–866. doi:10.1162/tacl_a_00349
  4. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
