
BPE Tokenization

Byte Pair Encoding tokenization iteratively merges the most frequent pair of adjacent symbols to build a subword vocabulary that balances character-level flexibility against word-level efficiency, enabling open-vocabulary language modeling.

V* = argmin_{V:|V|=k} Σ_{w∈corpus} -log P(w | V)

Byte Pair Encoding (BPE) tokenization, adapted for NLP by Sennrich et al. (2016), is a subword segmentation algorithm that has become the standard tokenization method for modern language models. Starting from a character-level vocabulary, BPE iteratively identifies the most frequent pair of adjacent tokens in the training corpus and merges them into a single new token. This process continues for a predetermined number of merge operations, producing a vocabulary that ranges from individual characters to complete words. BPE elegantly addresses the open-vocabulary problem: any word can be represented as a sequence of subword tokens, eliminating the need for an unknown-word token.

Algorithm

Byte Pair Encoding Algorithm
Input: corpus, desired vocabulary size k
Initialize: vocabulary = set of all characters in corpus

Repeat until |vocabulary| = k:
  1. Count all adjacent token pairs in corpus
  2. Find most frequent pair (a, b)
  3. Merge all occurrences of (a, b) → ab
  4. Add ab to vocabulary

Encoding new text: apply merges greedily in learned order
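
The training loop above can be sketched in a few lines of Python. This is an illustrative implementation, not a production tokenizer: the `learn_bpe` function name and the toy word-frequency corpus are ours, and words are represented as tuples of symbols for simplicity.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a word-frequency corpus (illustrative sketch).

    corpus: dict mapping words to their counts; each word starts as a
    sequence of characters, and merges are recorded in the order learned.
    """
    vocab = {tuple(word): count for word, count in corpus.items()}
    merges = []
    for _ in range(num_merges):
        # Step 1: count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Step 2: find the most frequent pair.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Steps 3-4: merge every occurrence of the pair into one new symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

# Toy corpus in the style of the Sennrich et al. (2016) example.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
```

On this corpus the first merges pick up the frequent suffix pieces ("es", then "est") before the frequent stem pieces, illustrating how subword units emerge from frequency alone.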

The BPE algorithm produces a deterministic segmentation: given a word and the learned merge table, the segmentation is uniquely determined by applying merges in the order they were learned. Common words are typically represented as single tokens, while rare words are split into subword units that often correspond to morphological components. For instance, "unhappiness" might be segmented as ["un", "happiness"] or ["un", "happ", "iness"], capturing meaningful morphological structure without explicit morphological analysis. The vocabulary size, typically ranging from 30,000 to 50,000 tokens, is a key hyperparameter that trades off between representation compactness and vocabulary coverage.
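
Segmenting a new word with a learned merge table can be sketched as follows. The merge table here is hypothetical, chosen only to reproduce an ["un", "happ", ...]-style segmentation like the one discussed above; symbols not covered by any merge remain as single characters.

```python
def bpe_encode(word, merges):
    """Segment a word by applying learned merges in order (illustrative sketch)."""
    symbols = list(word)
    for a, b in merges:  # merges are applied in the order they were learned
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge table for illustration only.
merges = [('u', 'n'), ('h', 'a'), ('ha', 'p'), ('hap', 'p')]
tokens = bpe_encode("unhappiness", merges)
```

Because the merge table and its ordering fully determine the output, the same word always segments the same way under a given tokenizer.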

Variants: WordPiece and Unigram

WordPiece, used by BERT and its derivatives, is a variant of BPE that selects merges to maximize the likelihood of the training data under a unigram language model rather than simply maximizing pair frequency. This results in slightly different vocabularies that tend to prefer merges producing tokens that are more likely as independent units. SentencePiece (Kudo and Richardson, 2018) implements both BPE and unigram tokenization in a language-independent framework that operates on raw text without pre-tokenization, treating whitespace as a regular character. The unigram model (Kudo, 2018) takes the opposite approach to BPE: starting from a large vocabulary, it iteratively removes tokens that least reduce the corpus likelihood.
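
The difference between the two merge criteria can be made concrete. In this sketch (the `best_pair` helper and the counts are illustrative), BPE ranks candidate pairs by raw frequency, while WordPiece ranks them by count(a,b) / (count(a) · count(b)), the gain in unigram likelihood from treating the merged pair as one unit:

```python
def best_pair(pair_counts, symbol_counts, criterion="bpe"):
    """Pick the next merge under BPE's or WordPiece's criterion (sketch)."""
    if criterion == "bpe":
        # BPE: most frequent adjacent pair wins.
        return max(pair_counts, key=pair_counts.get)
    # WordPiece: pair with the highest likelihood gain under a unigram model.
    return max(pair_counts,
               key=lambda p: pair_counts[p] / (symbol_counts[p[0]] * symbol_counts[p[1]]))

# Illustrative statistics: ("e","s") is frequent but both symbols are common;
# ("q","u") is rarer but its parts almost always co-occur.
pair_counts = {("e", "s"): 9, ("q", "u"): 4}
symbol_counts = {"e": 30, "s": 20, "q": 4, "u": 10}

best_pair(pair_counts, symbol_counts, "bpe")        # frequency favors ("e", "s")
best_pair(pair_counts, symbol_counts, "wordpiece")  # likelihood favors ("q", "u")
```

This is why WordPiece vocabularies tend to prefer merges whose parts rarely appear independently, even when an alternative pair is more frequent in absolute terms.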

Tokenization Affects Model Behavior

The choice of tokenization has significant downstream effects. Different tokenizations of the same text produce different sequence lengths, affecting the model's effective context window. Subword tokenization can introduce artifacts: the same word may be tokenized differently depending on surrounding context (in some implementations), and multi-token words require the model to compose meaning across subword boundaries. Recent work has shown that tokenizer quality significantly affects model performance on morphologically rich languages, mathematical reasoning, and code generation, leading to interest in tokenizer-free approaches that operate on bytes or characters.

Impact and Modern Usage

BPE tokenization is used in virtually all modern language models. GPT-2 introduced byte-level BPE, which operates on byte sequences rather than Unicode characters, ensuring that any text can be tokenized without unknown tokens or preprocessing. GPT-3, LLaMA, and other large language models use variants of this approach with vocabulary sizes ranging from 32,000 to 100,000 tokens. The tiktoken library provides an efficient implementation used by OpenAI's models, while SentencePiece is widely used in the academic community.
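
The byte-level starting point can be illustrated in a few lines (a sketch; the function name is ours): any Unicode string is first decomposed into its UTF-8 bytes, so a base vocabulary of just 256 symbols covers every possible input and no unknown token is needed.

```python
def byte_level_symbols(text):
    """Return the UTF-8 byte sequence that byte-level BPE starts from (sketch)."""
    # Every byte value is in 0..255, so the 256-symbol base vocabulary
    # covers any text; merges are then learned over byte sequences.
    return list(text.encode("utf-8"))

# 'ï' encodes as two UTF-8 bytes, so "naïve" yields 6 base symbols.
symbols = byte_level_symbols("naïve")
```

Characters outside the Basic Latin range simply expand to multiple byte symbols, which learned merges can later recombine into larger units.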

Despite its ubiquity, BPE tokenization has known limitations. It can produce inconsistent segmentations for morphologically related words, it is sensitive to the training corpus, and it adds a layer of indirection between the text that humans read and the tokens that models process. These concerns have motivated research into tokenizer-free models that operate directly on characters or bytes, such as ByT5 and CANINE. However, the compression that subword tokenization provides — reducing sequence lengths by roughly 4x compared to character-level processing — makes it practically essential for efficient training of large models, and BPE remains the dominant approach.

References

  1. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. Proceedings of ACL, 1715–1725. doi:10.18653/v1/P16-1162
  2. Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. Proceedings of EMNLP: System Demonstrations, 66–71. doi:10.18653/v1/D18-2012
  3. Kudo, T. (2018). Subword regularization: Improving neural network translation models with multiple subword candidates. Proceedings of ACL, 66–75. doi:10.18653/v1/P18-1007
  4. Gage, P. (1994). A new algorithm for data compression. The C Users Journal, 12(2), 23–38.
