Grapheme-to-phoneme (G2P) conversion is the task of predicting the pronunciation of a word from its written form. For a word like "knight," G2P must produce /naɪt/, recognizing that "kn" maps to /n/, "igh" maps to /aɪ/, and "t" maps to /t/. This mapping is straightforward in languages with transparent orthographies (Finnish, Spanish) where spelling closely mirrors pronunciation, but is notoriously complex in languages like English and French where historical sound changes have left the orthography divorced from modern pronunciation. G2P is a core component of text-to-speech systems and is used to generate pronunciation dictionaries for speech recognition.
Approaches to G2P Conversion
Input: g₁ g₂ ... gₘ (grapheme sequence)
Output: p₁ p₂ ... pₙ (phoneme sequence)
Example: t-h-r-o-u-g-h → /θ-r-uː/
Alignment: th:θ r:r ou:uː gh:ε (ε = empty; the "gh" is silent)
Joint n-gram model:
P(g₁:p₁, g₂:p₂, ...) = ∏ᵢ P(gᵢ:pᵢ | gᵢ₋₁:pᵢ₋₁, ..., gᵢ₋ₙ₊₁:pᵢ₋ₙ₊₁)
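To make the joint n-gram formula concrete, the following minimal sketch estimates bigram probabilities over aligned graphone units ("grapheme-chunk : phoneme-chunk" pairs) and scores an aligned word. The two-word training corpus is invented purely for illustration, and "_" stands in for the empty phoneme ε:

```python
from collections import defaultdict

# Toy aligned training data: each word is a sequence of graphones,
# i.e. (grapheme chunk, phoneme chunk) pairs. "_" marks the silent mapping (ε).
corpus = [
    [("th", "θ"), ("r", "r"), ("ou", "uː"), ("gh", "_")],  # through
    [("th", "θ"), ("i", "ɪ"), ("n", "n")],                 # thin
]

START = ("<s>", "<s>")              # start-of-word graphone
bigram = defaultdict(int)           # counts of (previous graphone, graphone)
unigram = defaultdict(int)          # counts of the conditioning graphone

for word in corpus:
    prev = START
    for graphone in word:
        bigram[(prev, graphone)] += 1
        unigram[prev] += 1
        prev = graphone

def score(aligned_word):
    """Joint probability of an aligned word under the bigram graphone model."""
    p, prev = 1.0, START
    for graphone in aligned_word:
        p *= bigram[(prev, graphone)] / unigram[prev]
        prev = graphone
    return p

print(score([("th", "θ"), ("r", "r"), ("ou", "uː"), ("gh", "_")]))  # 0.5
```

A real implementation (e.g. in the style of Bisani and Ney) would also smooth the counts, include an end-of-word symbol, and search over alignments rather than assume them given; the loop above only shows how the product in the formula is computed once an alignment is fixed.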
Early G2P systems used hand-crafted rules supplemented by exception dictionaries. The NETtalk system (Sejnowski and Rosenberg, 1987) was a landmark demonstration that neural networks could learn grapheme-to-phoneme mappings from examples. Modern approaches include joint sequence models (Bisani and Ney, 2008), which model aligned grapheme-phoneme pairs as an n-gram language model; weighted finite-state transducers, which compose spelling-to-pronunciation mappings; and neural encoder-decoder models, which treat G2P as a sequence-to-sequence translation problem.
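A hand-crafted rule system of the early kind can be sketched as an exception dictionary consulted first, followed by ordered longest-match rewrite rules. The rules and dictionary entries below are a tiny illustrative fragment, not a real rule set:

```python
# Exception dictionary: irregular words bypass the rules entirely.
EXCEPTIONS = {"colonel": "kɜːrnəl", "one": "wʌn"}

# Ordered rewrite rules: longer grapheme patterns are tried before shorter
# ones, so "igh" wins over "i" and "kn" over "k".
RULES = [
    ("igh", "aɪ"), ("kn", "n"), ("th", "θ"),
    ("t", "t"), ("n", "n"), ("r", "r"), ("i", "ɪ"),
]

def g2p(word):
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    phonemes, i = [], 0
    while i < len(word):
        for graph, phon in RULES:
            if word.startswith(graph, i):
                phonemes.append(phon)
                i += len(graph)
                break
        else:
            i += 1  # no rule matched: skip (a real system would back off)
    return "".join(phonemes)

print(g2p("knight"))  # naɪt  (kn→n, igh→aɪ, t→t)
```

The rule ordering is doing real work here: if single-letter rules were tried first, "knight" would never reach the "kn" and "igh" patterns, which is exactly why hand-crafted systems grew hard to maintain as rule sets expanded.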
Neural G2P Models
Neural sequence-to-sequence models with attention have become the dominant approach to G2P conversion. Character-level encoder-decoder architectures process the grapheme sequence character by character and generate phonemes autoregressively. Transformer-based models have achieved the best results on standard benchmarks, with word-level accuracy exceeding 95% for English and higher for languages with more regular orthographies. These models implicitly learn the complex many-to-many alignments between graphemes and phonemes without requiring explicit alignment as a preprocessing step.
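The autoregressive generation loop these decoders share can be shown in isolation. In the sketch below a hand-built lookup table stands in for the trained network (which would produce a softmax over phonemes, conditioned via attention on the encoded grapheme sequence); only the greedy decoding structure is the point:

```python
# Stand-in for a trained decoder: maps (grapheme string, phonemes so far)
# to the next phoneme. "</s>" is the end-of-sequence symbol.
NEXT_PHONEME = {
    ("knight", ()): "n",
    ("knight", ("n",)): "aɪ",
    ("knight", ("n", "aɪ")): "t",
    ("knight", ("n", "aɪ", "t")): "</s>",
}

def greedy_decode(graphemes, max_len=10):
    phonemes = []
    for _ in range(max_len):
        # In a real model this would be the argmax of the decoder's output
        # distribution; here it is a table lookup.
        next_p = NEXT_PHONEME.get((graphemes, tuple(phonemes)), "</s>")
        if next_p == "</s>":
            break
        phonemes.append(next_p)
    return phonemes

print(greedy_decode("knight"))  # ['n', 'aɪ', 't']
```

Note that nothing in the loop refers to an alignment: the model conditions on the whole grapheme sequence at every step, which is how seq2seq G2P learns many-to-many correspondences implicitly.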
Some words have different pronunciations depending on their meaning or grammatical role — these are called heteronyms. The word "read" is pronounced /riːd/ in present tense and /rɛd/ in past tense; "lead" is /liːd/ (the verb) or /lɛd/ (the metal). Standard G2P models, which operate on isolated words, cannot disambiguate these cases. Context-dependent G2P systems use the surrounding sentence to resolve heteronyms, employing part-of-speech tagging or neural language models. Sun et al. (2019) showed that BERT-based contextual G2P systems substantially improve heteronym disambiguation over context-free baselines.
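The POS-based strategy for heteronyms amounts to a pronunciation table keyed on (word, tag), falling back to context-free G2P otherwise. The tags below follow the Penn Treebank convention, and the tiny table is illustrative rather than a real lexicon:

```python
# Heteronym table keyed on (word, POS tag). Penn Treebank tags:
# VB = base-form verb, VBD = past-tense verb, NN = singular noun.
HETERONYMS = {
    ("read", "VB"): "riːd",   # present tense
    ("read", "VBD"): "rɛd",   # past tense
    ("lead", "VB"): "liːd",   # the verb
    ("lead", "NN"): "lɛd",    # the metal
}

def pronounce(word, pos, fallback_g2p=lambda w: "?"):
    """Resolve a heteronym by POS tag; otherwise use context-free G2P."""
    return HETERONYMS.get((word, pos)) or fallback_g2p(word)

print(pronounce("read", "VBD"))  # rɛd
```

POS tags do not settle every case (the noun "lead" can also be /liːd/, as in "take the lead"), which is why BERT-style contextual models that see the full sentence outperform tag-based disambiguation.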
Cross-Lingual and Low-Resource G2P
G2P conversion varies dramatically in difficulty across languages. For Finnish, where orthography is nearly phonemic, G2P is almost trivially a one-to-one mapping. For English, where historical layers of borrowing and sound change have created an opaque orthography, G2P requires extensive pattern learning. For languages like Chinese, where the writing system is logographic rather than alphabetic, G2P requires character-to-reading lookup tables supplemented by disambiguation of polyphones (characters with multiple possible readings).
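The contrast between these writing systems can be sketched side by side. Both the Finnish letter map and the Mandarin polyphone table below are small illustrative fragments, not complete resources:

```python
# Finnish: orthography is nearly phonemic, so G2P is close to an identity
# mapping at the letter level (only a few letters shown).
FINNISH = {"t": "t", "a": "ɑ", "l": "l", "o": "o"}

def finnish_g2p(word):
    return "".join(FINNISH.get(ch, ch) for ch in word)

# Mandarin: characters are looked up in a reading table; polyphones need
# context. 行 reads xíng ("to walk") or háng ("row"), e.g. in 银行 ("bank").
POLYPHONE = {"行": {"default": "xíng", "银行": "háng"}}

def mandarin_reading(char, context):
    readings = POLYPHONE.get(char, {})
    for pattern, reading in readings.items():
        if pattern != "default" and pattern in context:
            return reading
    return readings.get("default", "?")

print(finnish_g2p("talo"))            # tɑlo ("house")
print(mandarin_reading("行", "银行"))  # háng
```

The two functions make the asymmetry visible: the Finnish case needs no context at all, while the Chinese case is essentially a disambiguation problem from the start.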
Low-resource G2P — building pronunciation models for languages with little or no pronunciation data — is an active research area motivated by the need for speech technology in underserved languages. Transfer learning from related high-resource languages, phonological universals expressed as features, and Wiktionary-derived pronunciation data have all been used to bootstrap G2P models for low-resource languages. Lexicon-based approaches leverage small pronunciation dictionaries, while unsupervised methods discover grapheme-phoneme correspondences from unpaired text and audio data.