Statistical parametric speech synthesis (SPSS) generates speech by training a model to predict acoustic parameters — such as spectral envelope, fundamental frequency, and aperiodicity — from linguistic features derived from the input text. Unlike concatenative synthesis, which stores and retrieves actual speech segments, parametric synthesis generates every aspect of the speech signal from a compact statistical model. This approach, pioneered by Tokuda and colleagues using HMMs in the early 2000s, dominated TTS research for over a decade before being succeeded by neural methods.
HMM-Based Parametric Synthesis
Synthesis: derive the state sequence q = (q(1), ..., q(T)) from the input text
o_t ~ N(μ_{q(t)}, Σ_{q(t)}) for spectral observation o_t in state q(t)
F0_t ~ N(μ^{F0}_{q(t)}, (σ^{F0}_{q(t)})²) with a per-state voiced/unvoiced decision
Parameter generation: o* = argmax_o P(o | q, λ) subject to dynamic (Δ, ΔΔ) feature constraints
In HMM-based synthesis, the speech signal is parameterized as a sequence of vocoder features: mel-cepstral coefficients for the spectral envelope, log F0 for pitch, band aperiodicities for noise characteristics, and a voiced/unvoiced flag. These parameters, along with their delta and delta-delta derivatives, are modeled by context-dependent HMM states. At synthesis time, a state sequence is derived from the input text, and the maximum-likelihood parameter generation algorithm produces a smooth trajectory of vocoder parameters that respects the dynamic feature constraints.
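The parameter generation step above can be sketched concretely. Under Gaussian state output distributions, the maximum-likelihood trajectory has a closed-form solution: with W the matrix that maps a static trajectory to its static-plus-delta observation vector, the optimum is c* = (WᵀΣ⁻¹W)⁻¹WᵀΣ⁻¹μ. This is a minimal one-dimensional sketch (delta-delta features omitted for brevity; the window coefficients and boundary handling are illustrative choices, not those of any specific toolkit):

```python
import numpy as np

def mlpg(means, variances):
    """Maximum-likelihood parameter generation for one feature dimension.

    means, variances: (T, 2) per-frame Gaussian statistics for the static
    and delta features (delta-delta omitted for brevity). Returns the smooth
    static trajectory c* that maximises the likelihood subject to the
    dynamic-feature constraint delta_c[t] = 0.5 * (c[t+1] - c[t-1]).
    """
    T = means.shape[0]
    # W stacks the static identity rows on top of the delta regression rows.
    W = np.zeros((2 * T, T))
    W[:T] = np.eye(T)                 # static rows: o_t = c_t
    for t in range(T):                # delta rows: 0.5 * (c[t+1] - c[t-1])
        if t > 0:
            W[T + t, t - 1] = -0.5
        if t < T - 1:
            W[T + t, t + 1] = 0.5     # boundaries use one-sided differences
    mu = np.concatenate([means[:, 0], means[:, 1]])
    prec = 1.0 / np.concatenate([variances[:, 0], variances[:, 1]])
    # Weighted least squares: c* = (W' P W)^-1 W' P mu, with P = diag(prec).
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```

Because the delta means near a boundary between two context-dependent states differ from zero, the solved trajectory transitions smoothly across state boundaries instead of jumping between state means — this is what removes the "staircase" artifact of naive per-state generation.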
DNN-Based Parametric Synthesis
Deep neural networks replaced HMMs as the mapping function from linguistic features to acoustic parameters around 2013, yielding significant quality improvements. DNN-based parametric synthesis frames the problem as regression: given a vector of linguistic features (phoneme identity, position in syllable/word/phrase, stress, POS tag), predict the corresponding acoustic parameters. LSTMs and bidirectional RNNs further improved quality by modeling the temporal dependencies that frame-independent DNNs miss.
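The frame-level regression can be sketched as a plain feed-forward network; the layer sizes below are illustrative assumptions, not the configuration of any particular published system:

```python
import numpy as np

# Illustrative sizes (assumptions for this sketch): a few hundred
# binary/numeric linguistic features in, vocoder parameters plus deltas out.
N_LING, N_HIDDEN, N_ACOUSTIC = 425, 256, 187

def init_mlp(sizes, rng):
    """Small random weights and zero biases for each layer."""
    return [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def dnn_acoustic_model(ling_feats, layers):
    """Frame-level regression: linguistic feature vectors -> acoustic parameters."""
    h = ling_feats
    for W, b in layers[:-1]:
        h = np.tanh(h @ W + b)        # hidden layers
    W, b = layers[-1]
    return h @ W + b                  # linear output, as usual for regression

rng = np.random.default_rng(0)
layers = init_mlp([N_LING, N_HIDDEN, N_HIDDEN, N_ACOUSTIC], rng)
frames = rng.standard_normal((100, N_LING))   # 100 frames of linguistic features
acoustic = dnn_acoustic_model(frames, layers)
```

Training would minimise mean squared error against extracted vocoder parameters; the predicted static-plus-delta outputs are then typically passed through the same maximum-likelihood parameter generation step used in HMM synthesis to obtain smooth trajectories.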
A fundamental limitation of parametric synthesis is the vocoder, which reconstructs the waveform from predicted parameters. Traditional vocoders like STRAIGHT and WORLD produce speech that sounds "buzzy" or "muffled" compared to natural speech, even when the predicted parameters are accurate. This vocoder degradation accounts for much of the quality gap between parametric and concatenative synthesis. Neural vocoders such as WaveNet and WaveRNN dramatically narrowed this gap, leading to the neural TTS paradigm where the distinction between parametric and waveform-level synthesis blurs.
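The source-filter structure behind traditional vocoders can be illustrated by the excitation stage alone. This sketch builds a mixed excitation signal from frame-level F0 (hop size and amplitudes are arbitrary choices for illustration); a real vocoder would then pass it through the spectral-envelope filter and mix in band-aperiodicity noise, which is omitted here:

```python
import numpy as np

def make_excitation(f0_per_frame, sr=16000, hop=80):
    """Vocoder excitation: an impulse train at the pitch period for voiced
    frames (f0 > 0), white noise for unvoiced frames (f0 == 0)."""
    rng = np.random.default_rng(0)
    exc = np.zeros(len(f0_per_frame) * hop)
    phase = 0.0
    for i, f0 in enumerate(f0_per_frame):
        start = i * hop
        if f0 > 0:
            for n in range(hop):
                phase += f0 / sr       # advance by one fraction of a pitch cycle
                if phase >= 1.0:       # a full pitch period elapsed: emit a pulse
                    phase -= 1.0
                    exc[start + n] = 1.0
        else:
            phase = 0.0
            exc[start:start + hop] = 0.1 * rng.standard_normal(hop)
    return exc
```

The hard switch between periodic pulses and noise is one source of the characteristic "buzzy" quality: natural speech mixes periodic and aperiodic energy within each frame, which is why STRAIGHT and WORLD model band aperiodicities rather than a binary voicing decision alone.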
Parametric synthesis offers several advantages over concatenative approaches: the model is compact (megabytes versus gigabytes), it generalizes to unseen phonetic and prosodic contexts, and it enables flexible control over speaking style, emotion, and speaker identity through model adaptation or interpolation. Speaker adaptation techniques allow a model trained on one speaker to be adapted to a new speaker using as little as a few minutes of data, a capability that is much harder to achieve with concatenative systems.
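The adaptation idea can be sketched in miniature. MLLR-style adaptation estimates an affine transform that maps the source model's Gaussian means toward the new speaker; the sketch below simplifies the maximum-likelihood estimation to ordinary least squares over paired mean vectors (the function names and the single global transform are illustrative assumptions — real systems estimate transforms per regression class from frame statistics):

```python
import numpy as np

def estimate_transform(source_means, target_means):
    """Least-squares affine transform mapping source-speaker Gaussian means
    onto target-speaker means (a simplified stand-in for MLLR estimation)."""
    X = np.hstack([source_means, np.ones((len(source_means), 1))])  # append bias
    Wb, *_ = np.linalg.lstsq(X, target_means, rcond=None)
    return Wb  # shape (d+1, d): rows of A stacked over the offset b

def adapt_means(source_means, Wb):
    """Apply the estimated transform: mu' = A @ mu + b for every mean."""
    X = np.hstack([source_means, np.ones((len(source_means), 1))])
    return X @ Wb
```

Because one low-dimensional transform is shared across many Gaussians, a few minutes of target-speaker data suffice to estimate it — the same tying that makes parametric adaptation far more data-efficient than rebuilding a concatenative unit database.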
While pure parametric synthesis has been largely superseded by end-to-end neural TTS, its legacy persists in the overall system design. Modern neural TTS systems still decompose the problem into text-to-spectrogram and spectrogram-to-waveform stages, echoing the parametric synthesis philosophy of separating linguistic and acoustic modeling from waveform generation.