Concatenative speech synthesis generates spoken output by selecting and concatenating segments of natural recorded speech from a database. Because the output is composed of real human speech, concatenative systems can achieve high naturalness for well-covered synthesis targets. The approach dominated commercial TTS from the late 1990s through the mid-2010s, powering systems from AT&T, Nuance, and early versions of Apple's Siri. The core challenge lies in selecting segments that match the target specification while minimizing audible discontinuities at concatenation boundaries.
Unit Selection Synthesis
C_t(t_i, u_i): target cost (mismatch between the desired target unit t_i and candidate unit u_i)
C_j(u_{i-1}, u_i): join cost (discontinuity between consecutive candidate units)
Total cost: sum_i C_t(t_i, u_i) + sum_{i>1} C_j(u_{i-1}, u_i)
Search: Viterbi algorithm over the lattice of candidate units, minimizing total cost
Unit selection synthesis maintains a large database of speech recorded from a single speaker, segmented into units (typically diphones or half-phones). For each target unit in the synthesis specification, the system identifies candidate units from the database and scores them using a target cost (how well the candidate matches the desired phonetic and prosodic context) and a join cost (how smooth the transition to the previous unit will be). A Viterbi search finds the globally optimal sequence of units that minimizes the total cost.
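The search described above can be sketched as a standard Viterbi pass over the candidate lattice. This is a minimal illustration, not a production synthesizer: the cost functions are passed in as placeholders, standing in for real phonetic/prosodic mismatch and spectral-discontinuity measures.

```python
def viterbi_unit_selection(targets, candidates, target_cost, join_cost):
    """Return the candidate sequence minimizing total target + join cost.

    targets     : list of target unit specifications
    candidates  : candidates[i] is the list of database units for targets[i]
    target_cost : target_cost(t, u) -> float
    join_cost   : join_cost(u_prev, u) -> float
    """
    n = len(targets)
    # best[i][j] = (cumulative cost, backpointer) for choosing candidates[i][j]
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            # Cheapest way to reach u from any candidate at position i-1
            cost, back = min(
                (best[i - 1][k][0] + join_cost(candidates[i - 1][k], u), k)
                for k in range(len(candidates[i - 1]))
            )
            row.append((cost + tc, back))
        best.append(row)
    # Backtrace from the cheapest final state
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path
```

With D candidates per position and n targets, the search runs in O(n · D²) time, which is why real systems prune candidate lists aggressively before the Viterbi pass.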
Database Design and Recording
The quality of a concatenative synthesizer depends critically on the design of the recording corpus. A well-designed corpus covers all diphone combinations in the target language, with each diphone appearing in multiple prosodic contexts (different pitch levels, durations, and stress patterns). Recording scripts are carefully constructed to maximize phonetic and prosodic coverage while remaining natural enough for the voice talent to read fluently. Typical databases range from 5 to 50 hours of speech, with larger databases enabling higher quality through better unit coverage.
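Script construction is often framed as a greedy set-cover problem: repeatedly pick the sentence that adds the most uncovered diphones. The sketch below illustrates that loop under a deliberately crude assumption — "diphones" are approximated as adjacent character pairs; a real pipeline would derive them from a phonetic lexicon.

```python
def diphones(text):
    """Approximate 'diphones' as adjacent character pairs (illustration only)."""
    s = text.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def select_script(pool, max_sentences):
    """Greedily pick sentences that each cover the most new diphones."""
    covered, script = set(), []
    for _ in range(max_sentences):
        # Sentence contributing the largest number of uncovered diphones
        best = max(pool, key=lambda s: len(diphones(s) - covered), default=None)
        if best is None or not (diphones(best) - covered):
            break  # no remaining sentence adds new coverage
        script.append(best)
        covered |= diphones(best)
        pool = [s for s in pool if s != best]
    return script, covered
```

In practice the objective also weights prosodic context (pitch, stress, position in phrase), not just diphone identity, and sentences are screened for readability by the voice talent.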
Concatenative synthesis has inherent limitations: it cannot produce speech that differs significantly from what was recorded (different emotions, speaking rates, or styles), it requires massive storage for the unit database, and join artifacts are audible when the system is forced to concatenate units from dissimilar contexts. These limitations motivated the development of parametric and neural synthesis approaches that generate speech from compact models rather than databases, offering far greater flexibility at the cost of requiring sophisticated generation algorithms.
Signal processing at concatenation boundaries is essential for minimizing audible discontinuities. Techniques include pitch-synchronous overlap-add (PSOLA) for smoothing pitch and timing differences, spectral smoothing across boundaries, and energy normalization. Despite these techniques, concatenation artifacts remain the primary source of quality degradation, particularly in prosodically demanding contexts such as questions, emphatic speech, or long sentences.
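The simplest of these boundary techniques is overlap-add blending. The sketch below is a plain linear crossfade between two unit waveforms — far simpler than PSOLA, which additionally aligns the overlap pitch-synchronously — but it shows why blending across the join beats hard concatenation.

```python
def crossfade_concat(a, b, overlap):
    """Concatenate sample lists a and b, crossfading over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    out = a[:len(a) - overlap]
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # fade-in weight for b, fade-out for a
        out.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    out.extend(b[overlap:])
    return out
```

A fixed-length crossfade like this can still smear pitch periods when the two units differ in F0, which is exactly the problem pitch-synchronous alignment in PSOLA addresses.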
While concatenative synthesis has largely been superseded by neural approaches for general-purpose TTS, the fundamental insight that real speech segments carry naturalness that is difficult to replicate synthetically continues to influence hybrid approaches. Some modern systems use neural models for prosody prediction and segment selection while preserving concatenative elements for specific voice qualities or limited-domain applications.