Neural TTS

Neural TTS uses deep learning for both spectrogram prediction and waveform generation, achieving unprecedented naturalness through end-to-end training of models like Tacotron, FastSpeech, and VITS.

mel = Decoder(Encoder(phonemes)); wav = Vocoder(mel)

Neural text-to-speech represents the current state of the art in speech synthesis, using deep neural networks for every stage of the synthesis pipeline. The paradigm was catalyzed by two breakthroughs: WaveNet (2016), which demonstrated that a neural network could generate raw audio waveforms with unprecedented quality, and Tacotron (2017), which showed that a sequence-to-sequence model could learn to map text directly to spectrograms without hand-crafted linguistic features. Together, these innovations launched a new era of TTS quality that rivals human speech.

Tacotron and Sequence-to-Sequence TTS

Tacotron 2 Architecture
  Encoder:   character/phoneme embeddings → convolution layers → bi-LSTM
  Attention: location-sensitive attention mechanism
  Decoder:   autoregressive LSTM → linear projection → mel spectrogram
  PostNet:   5-layer CNN for spectrogram refinement

Loss = MSE(mel_pred, mel_target) + MSE(postnet_mel, mel_target)

Tacotron 2, introduced by Shen et al. in 2018, established the dominant neural TTS paradigm: an encoder-decoder model with attention that converts a phoneme or character sequence into a mel spectrogram, followed by a neural vocoder that converts the spectrogram to audio. The encoder processes the input text into a sequence of hidden representations, the attention mechanism learns the alignment between text and acoustic frames, and the decoder autoregressively generates mel spectrogram frames. A post-processing network refines the spectrogram before vocoding.
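The encoder-attention-decoder flow can be sketched with toy numpy operations. This is a minimal illustration, not Tacotron 2 itself: the linear "encoder" stands in for the convolution + bi-LSTM stack, the attention here is purely content-based (the real model adds a location-sensitive term), and all weight names (`W_enc`, `W_att`, `W_out`) are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def encode(phoneme_ids, embed, W_enc):
    """Toy encoder: embedding lookup + one linear layer
    (stands in for Tacotron 2's conv layers + bi-LSTM)."""
    h = embed[phoneme_ids]          # (T_in, d)
    return np.tanh(h @ W_enc)       # (T_in, d)

def decode(enc_out, W_att, W_out, n_frames, n_mels):
    """Toy autoregressive decoder: each step attends over encoder
    states, then projects the context vector to one mel frame."""
    mel = np.zeros((n_frames, n_mels))
    prev = np.zeros(enc_out.shape[1])
    for t in range(n_frames):
        scores = enc_out @ (W_att @ prev)   # content-based scores only
        align = softmax(scores)             # alignment over input positions
        context = align @ enc_out           # weighted sum of encoder states
        mel[t] = context @ W_out            # project context to a mel frame
        prev = context                      # feed back (stands in for LSTM state)
    return mel

d, n_mels, vocab = 8, 4, 10
embed = rng.normal(size=(vocab, d))
W_enc = rng.normal(size=(d, d))
W_att = rng.normal(size=(d, d))
W_out = rng.normal(size=(d, n_mels))

phonemes = np.array([1, 4, 2, 7])
enc = encode(phonemes, embed, W_enc)
mel = decode(enc, W_att, W_out, n_frames=6, n_mels=n_mels)
print(mel.shape)  # (6, 4): 6 mel frames from 4 input phonemes
```

The key structural point survives the simplification: the number of output frames is decoupled from the number of input phonemes, with attention learning the alignment between them.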

Non-Autoregressive Models

Autoregressive models like Tacotron generate speech one frame at a time, making synthesis slow and susceptible to attention failures (skipping, repeating, or babbling). FastSpeech and its successor FastSpeech 2 address these issues by predicting all mel frames in parallel using a feed-forward Transformer architecture with explicit duration, pitch, and energy predictors. A duration model predicts how many acoustic frames each phoneme should span, a pitch predictor generates the F0 contour, and the decoder generates the full spectrogram in a single forward pass, achieving synthesis speeds orders of magnitude faster than real-time.
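The core mechanism that makes parallel decoding possible is the length regulator: once a duration model predicts a frame count per phoneme, each phoneme's hidden vector is simply copied that many times to build the frame-level sequence. A minimal sketch (function name is illustrative, not the FastSpeech API):

```python
import numpy as np

def length_regulator(phoneme_hidden, durations):
    """Expand each phoneme's hidden vector by its predicted duration.

    phoneme_hidden: (T_phonemes, d) array of phoneme representations
    durations:      (T_phonemes,) integer frame counts from the duration predictor
    returns:        (sum(durations), d) frame-level sequence for the decoder
    """
    return np.repeat(phoneme_hidden, durations, axis=0)

h = np.arange(6.0).reshape(3, 2)   # 3 phonemes, hidden size 2
d = np.array([2, 1, 3])            # predicted durations in frames
frames = length_regulator(h, d)
print(frames.shape)  # (6, 2): 2 + 1 + 3 frames
```

Because the output length is fixed up front by the durations, the decoder can generate every frame in one forward pass instead of one frame per step.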

End-to-End Models: VITS and Beyond

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) unifies the acoustic model and vocoder into a single model trained end-to-end with a combination of variational inference and adversarial training. By eliminating the two-stage pipeline, VITS avoids the mismatch between predicted and ground-truth mel spectrograms that degrades vocoder quality. The model achieves naturalness comparable to ground-truth recordings while enabling parallel synthesis, and has become a widely adopted baseline for TTS research.
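The shape of the VITS training objective can be illustrated with its three loss families: a spectrogram reconstruction term, a KL term from variational inference, and an adversarial term. The sketch below is a toy scalar computation under assumed names and weightings (`lam_mel=45` mirrors a common mel-loss weight, the adversarial term uses least-squares GAN form); it omits the flow-based prior, duration loss, and feature-matching terms of the real model.

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal
    Gaussians, summed over dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def vits_style_generator_loss(mel_pred, mel_true, disc_fake,
                              mu_q, logvar_q, mu_p, logvar_p,
                              lam_mel=45.0, lam_kl=1.0):
    """Toy combination of the three loss families trained jointly:
    mel reconstruction (L1), variational KL, and a least-squares
    adversarial term pushing the discriminator score toward 1."""
    l_recon = np.mean(np.abs(mel_pred - mel_true))
    l_kl = kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p)
    l_adv = np.mean((disc_fake - 1.0) ** 2)
    return lam_mel * l_recon + lam_kl * l_kl + l_adv

mu = np.zeros(3); lv = np.zeros(3)
mel = np.ones((4, 2))
print(vits_style_generator_loss(mel, mel, np.ones(5), mu, lv, mu, lv))  # 0.0
```

Training all three terms on waveform-level outputs is what removes the acoustic-model/vocoder mismatch: the generator is never asked to reproduce a ground-truth mel spectrogram that the vocoder was trained on separately.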

Controllability is a major focus of neural TTS research. Style tokens, reference encoders, and variational autoencoders allow users to control speaking style, emotion, and prosody either by providing a reference audio clip or by manipulating learned latent variables. Multi-speaker models share a single architecture across speakers, with speaker embeddings conditioning the synthesis to produce any target voice. Zero-shot voice cloning extends this capability to speakers never seen during training.
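The simplest form of the speaker conditioning described above is a broadcast-add of a learned per-speaker embedding onto the encoder output; a toy sketch (names and the additive scheme are illustrative, and real systems also use concatenation or FiLM-style modulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def condition_on_speaker(enc_out, speaker_embed):
    """Add a learned speaker embedding to every encoder frame, so one
    shared model synthesizes different voices from the same text."""
    return enc_out + speaker_embed[None, :]

d = 8
speaker_table = rng.normal(size=(4, d))   # embeddings for 4 training speakers
enc_out = rng.normal(size=(12, d))        # 12 encoder frames for one utterance

voiced_as_2 = condition_on_speaker(enc_out, speaker_table[2])
print(voiced_as_2.shape)  # (12, 8)
```

Zero-shot voice cloning replaces the lookup table with a speaker encoder that maps a short reference clip to an embedding, so unseen voices land in the same conditioning space.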

The rapid progress in neural TTS has also benefited from advances in neural vocoders. WaveRNN, LPCNet, HiFi-GAN, and UnivNet each offer different tradeoffs among synthesis quality, speed, and model size, with HiFi-GAN widely adopted for its strong balance of quality and efficiency. The field continues to advance rapidly, with diffusion-based models and codec-based approaches representing the latest frontiers.

References

  1. Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., ... & Wu, Y. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. Proc. ICASSP, 4779–4783. doi:10.1109/ICASSP.2018.8461368
  2. Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. (2021). FastSpeech 2: Fast and high-quality end-to-end text to speech. Proc. ICLR.
  3. Kim, J., Kong, J., & Son, J. (2021). Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. Proc. ICML, 5530–5540.
  4. Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Proc. NeurIPS, 33, 17022–17033.