WordPiece is a subword segmentation algorithm developed at Google, first described by Schuster and Nakajima (2012) for Japanese and Korean speech recognition and later adopted as the tokenizer for BERT (Devlin et al., 2019) and related models. Like BPE, WordPiece starts with a character-level vocabulary and iteratively merges tokens. However, where BPE selects the most frequent pair, WordPiece selects the pair whose merge most increases the likelihood of the training corpus under a unigram language model. This likelihood-based criterion produces vocabularies that differ slightly from BPE's, favoring merges that most reduce the model's overall surprisal on the corpus.
The WordPiece Algorithm
score(a, b) = log P(ab) − log P(a) − log P(b)
            = log [freq(ab) / (freq(a) × freq(b))] + const

Select merge: argmax_{(a, b)} score(a, b)

This is equivalent to maximizing the mutual information of the bigram (a, b).
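Since the constant term does not affect the ranking, the score can be computed directly from pair and unit counts. The sketch below is a minimal illustration; the helper name `best_wordpiece_merge` and the corpus representation (a list of per-word token sequences) are assumptions for this example, not any library's API.

```python
import math
from collections import Counter

def best_wordpiece_merge(corpus_tokens):
    """Return the pair with the highest likelihood gain.

    corpus_tokens: list of token sequences, one per word,
    e.g. [["u", "n"], ["a", "b"], ...] for a character-level start.
    """
    # Count individual tokens and adjacent pairs across the corpus.
    unigrams = Counter(tok for word in corpus_tokens for tok in word)
    bigrams = Counter(
        (word[i], word[i + 1])
        for word in corpus_tokens
        for i in range(len(word) - 1)
    )

    # score(a, b) = log freq(ab) - log freq(a) - log freq(b),
    # dropping the constant, which is the same for every pair.
    def score(pair):
        a, b = pair
        return (
            math.log(bigrams[pair])
            - math.log(unigrams[a])
            - math.log(unigrams[b])
        )

    return max(bigrams, key=score)
```

For instance, on a corpus where "x" and "y" occur only together while "a" pairs with several different successors, the scorer selects ("x", "y") even if other pairs are equally frequent in raw counts.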
The merge criterion in WordPiece is equivalent to selecting the bigram with the highest pointwise mutual information (PMI). A pair is merged not just because it is frequent, but because it is more frequent than would be expected given the frequencies of its constituents. This means WordPiece preferentially merges pairs that co-occur more than chance would predict — capturing genuine collocations and morphological patterns rather than simply frequent character sequences. In practice, the difference from BPE is often small, but WordPiece tends to produce slightly more linguistically motivated segments.
WordPiece in BERT
BERT uses a WordPiece vocabulary of 30,522 tokens for English. Words that appear in the vocabulary are tokenized as single tokens; others are split into subword pieces, with continuation pieces marked by a "##" prefix. For example, "unaffordable" might be tokenized as ["un", "##afford", "##able"]. The "##" marker distinguishes word-initial subwords from word-internal ones, preserving some word boundary information. This representation allows BERT to handle any English text without encountering unknown tokens, while keeping the vocabulary size manageable for softmax computation.
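At inference time, this segmentation is produced by a greedy longest-match-first search over the vocabulary. Below is a minimal sketch, assuming the vocabulary is a plain Python set of pieces; `wordpiece_tokenize` is a hypothetical helper, not BERT's actual implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until
        # a vocabulary entry is found.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # word-internal pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            # No segmentation exists: the whole word maps to the unknown token.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens
```

With a vocabulary containing "un", "##afford", and "##able", this reproduces the segmentation from the example above: ["un", "##afford", "##able"].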
Research has shown that WordPiece tokens partially recover morphological structure. Bostrom and Durrett (2020) found that BERT's WordPiece tokenizer aligns with English morpheme boundaries about 60% of the time. However, the alignment is imperfect: high-frequency morphologically complex words (like "unfortunately") may be kept as single tokens, while rare simple words may be split arbitrarily. This has implications for how BERT represents morphologically complex words and whether its internal representations encode compositional morphological semantics.
Comparison with Other Subword Methods
WordPiece differs from BPE primarily in its merge criterion (likelihood-based vs. frequency-based) and from the Unigram Language Model approach (which starts with a large vocabulary and prunes rather than building up). In practice, all three methods produce broadly similar tokenizations for well-resourced languages, with differences emerging mainly at the margins — rare words, technical terms, and morphologically complex forms. The choice between methods is often driven more by implementation convenience than by principled linguistic or performance considerations.
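The difference in merge criteria can be made concrete on toy counts. In the sketch below (illustrative words and variable names, not any library's API), BPE's frequency criterion and WordPiece's likelihood criterion select different pairs from the same corpus:

```python
import math
from collections import Counter

# Toy corpus: ("a", "b") is the most frequent pair, but "c" and "d"
# occur only together. (Illustrative counts, not from real data.)
words = ["ab", "ab", "ab", "cd", "ax", "xb"]
segs = [list(w) for w in words]
uni = Counter(t for s in segs for t in s)
big = Counter((s[i], s[i + 1]) for s in segs for i in range(len(s) - 1))

# BPE criterion: raw pair frequency.
bpe_pick = max(big, key=lambda p: big[p])

# WordPiece criterion: likelihood gain (pointwise mutual information).
wp_pick = max(
    big,
    key=lambda p: math.log(big[p])
    - math.log(uni[p[0]])
    - math.log(uni[p[1]]),
)
```

Here BPE picks ("a", "b"), the most frequent pair, while WordPiece picks ("c", "d"): it is rarer, but its members occur only together, so its likelihood gain is maximal.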
For multilingual models like mBERT and XLM-RoBERTa, a shared subword vocabulary (WordPiece for mBERT; a SentencePiece model for XLM-RoBERTa) must cover around 100 languages with typically 100,000-250,000 tokens. The allocation of vocabulary capacity across languages is determined by the training corpus composition and affects per-language performance: languages overrepresented in training receive more dedicated tokens and thus shorter tokenizations, while underrepresented languages are split into longer sequences, increasing both computational cost and the difficulty of learning good representations.