Neural text-to-speech represents the current state of the art in speech synthesis, using deep neural networks for every stage of the synthesis pipeline. The paradigm was catalyzed by two breakthroughs: WaveNet (2016), which demonstrated that a neural network could generate raw audio waveforms with unprecedented quality, and Tacotron (2017), which showed that a sequence-to-sequence model could learn to map text directly to spectrograms without hand-crafted linguistic features. Together, these innovations launched a new era of TTS quality that rivals human speech.
Tacotron and Sequence-to-Sequence TTS
Encoder: character/phoneme embedding → 3 conv layers → bidirectional LSTM
Attention: location-sensitive attention mechanism
Decoder: autoregressive LSTM → linear projection → mel spectrogram
+ PostNet: 5-layer CNN for residual spectrogram refinement
Loss = MSE(mel_pred, mel_target) + MSE(postnet_mel, mel_target)
Tacotron 2, introduced by Shen et al. in 2018, established the dominant neural TTS paradigm: an encoder-decoder model with attention that converts a phoneme or character sequence into a mel spectrogram, followed by a neural vocoder that converts the spectrogram to audio. The encoder processes the input text into a sequence of hidden representations, the attention mechanism learns the alignment between text and acoustic frames, and the decoder autoregressively generates mel spectrogram frames. A post-processing network refines the spectrogram before vocoding.
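The two-term training objective above, regressing both the decoder output and the PostNet-refined output against the ground-truth mel spectrogram, can be sketched numerically. This is a minimal NumPy illustration; the array shapes and noise levels are toy assumptions, not taken from any TTS codebase:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two spectrograms."""
    return float(np.mean((a - b) ** 2))

def tacotron2_loss(mel_pred, postnet_mel, mel_target):
    """Combined loss: both the raw decoder output and the
    PostNet-refined output are regressed against the target."""
    return mse(mel_pred, mel_target) + mse(postnet_mel, mel_target)

# Toy example: 80 mel bins x 100 frames.
rng = np.random.default_rng(0)
mel_target = rng.normal(size=(80, 100))
mel_pred = mel_target + 0.1 * rng.normal(size=(80, 100))      # decoder output
postnet_mel = mel_target + 0.05 * rng.normal(size=(80, 100))  # refined output

loss = tacotron2_loss(mel_pred, postnet_mel, mel_target)
print(loss)
```

Note that the PostNet term drives the refinement network to keep improving the decoder's prediction: here the refined output sits closer to the target, so its MSE term is smaller.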
Non-Autoregressive Models
Autoregressive models like Tacotron generate speech one frame at a time, making synthesis slow and susceptible to attention failures (skipping, repeating, or babbling). FastSpeech and its successor FastSpeech 2 address these issues by predicting all mel frames in parallel using a feed-forward Transformer architecture with explicit duration, pitch, and energy predictors. A duration model predicts how many acoustic frames each phoneme should span, a pitch predictor generates the F0 contour, and the decoder generates the full spectrogram in a single forward pass, achieving synthesis speeds orders of magnitude faster than real-time.
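The duration-based expansion at the heart of FastSpeech, often called the length regulator, can be sketched in a few lines. This is an illustrative NumPy version under toy shapes, not the paper's implementation:

```python
import numpy as np

def length_regulate(phoneme_hidden, durations):
    """Expand each phoneme's hidden vector to span its predicted
    number of acoustic frames, so the decoder can then emit all
    mel frames in one parallel forward pass.

    phoneme_hidden: (num_phonemes, hidden_dim)
    durations:      (num_phonemes,) integer frame counts
    returns:        (sum(durations), hidden_dim)
    """
    return np.repeat(phoneme_hidden, durations, axis=0)

# Toy example: 3 phonemes with hidden_dim 4, spanning 2, 3, and 1 frames.
h = np.arange(12, dtype=float).reshape(3, 4)
d = np.array([2, 3, 1])
frames = length_regulate(h, d)
print(frames.shape)  # (6, 4): one row per acoustic frame
```

Because the total frame count is fixed before decoding begins, there is no attention alignment to fail at inference time, which is what eliminates the skipping and repeating errors of autoregressive models.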
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) unifies the acoustic model and vocoder into a single model trained end-to-end with a combination of variational inference and adversarial training. By eliminating the two-stage pipeline, VITS avoids the mismatch between predicted and ground-truth mel spectrograms that degrades vocoder quality. The model achieves naturalness comparable to ground-truth recordings while enabling parallel synthesis, and has become a widely adopted baseline for TTS research.
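The variational-inference half of VITS builds on the standard VAE machinery: a posterior is sampled via the reparameterization trick and regularized toward a prior with a KL term. The sketch below uses a plain diagonal Gaussian and the closed-form KL to a standard normal, which is a simplification; VITS itself uses a normalizing-flow-enhanced prior, and the adversarial half is omitted entirely:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps, so gradients can flow through mu and sigma."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

# A standard-normal posterior has zero KL to the prior; shifting the mean does not.
mu, log_var = np.zeros(16), np.zeros(16)
z = reparameterize(mu, log_var)
print(kl_to_standard_normal(mu, log_var))                  # 0.0
print(kl_to_standard_normal(np.ones(16), np.zeros(16)))    # 8.0
```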
Controllability is a major focus of neural TTS research. Style tokens, reference encoders, and variational autoencoders allow users to control speaking style, emotion, and prosody, either by providing a reference audio clip or by manipulating learned latent variables. Multi-speaker models share a single architecture across speakers, with a speaker embedding conditioning the synthesis to produce any of the voices seen during training. Zero-shot voice cloning extends this capability to speakers never seen during training.
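A common way to realize the speaker conditioning described above is to broadcast a learned per-speaker embedding across the encoder outputs. The sketch below is a minimal NumPy illustration; the table sizes, the additive conditioning, and the name `speaker_table` are illustrative assumptions (real systems may instead concatenate the embedding or feed it to multiple layers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding table: one learned vector per training speaker.
num_speakers, hidden_dim = 4, 8
speaker_table = rng.normal(size=(num_speakers, hidden_dim))

def condition_on_speaker(encoder_out, speaker_id):
    """Add the chosen speaker's embedding to every encoder timestep,
    steering the decoder toward that speaker's voice."""
    return encoder_out + speaker_table[speaker_id]  # broadcasts over time

encoder_out = rng.normal(size=(20, hidden_dim))  # 20 phoneme timesteps
conditioned = condition_on_speaker(encoder_out, speaker_id=2)
print(conditioned.shape)  # (20, 8)
```

In a zero-shot setting, the lookup table is replaced by a speaker encoder that computes the embedding from a short reference clip, which is what allows cloning voices absent from the training set.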
The rapid progress in neural TTS has also benefited from advances in neural vocoders. WaveRNN, LPCNet, HiFi-GAN, and UnivNet each offer different tradeoffs between synthesis quality, speed, and model size, with HiFi-GAN currently providing the best combination of quality and efficiency for most applications. The entire field continues to advance rapidly, with diffusion-based models and codec-based approaches representing the latest frontiers.