End-to-end automatic speech recognition replaces the traditional modular pipeline of acoustic model, pronunciation dictionary, and language model with a single neural network that is trained to directly produce text from audio. This paradigm shift, enabled by advances in deep learning, dramatically simplifies the ASR system architecture and eliminates the need for expert-crafted pronunciation lexicons and phoneme inventories. The three dominant end-to-end approaches are connectionist temporal classification (CTC), attention-based encoder-decoder models, and RNN-Transducers.
Attention-Based Encoder-Decoder
At each output step u, the decoder attends over the encoder outputs h_1, ..., h_T to form a context vector, then predicts the next token conditioned on its previous state s_{u-1}, the previous token y_{u-1}, and that context:
c_u = Attention(s_{u-1}, h_1, ..., h_T)
P(y_u | y_{<u}, X) = Decoder(s_{u-1}, y_{u-1}, c_u)
Typical component choices are:
Encoder: bidirectional LSTM or Conformer
Decoder: autoregressive LSTM or Transformer
Attention: content-based or location-aware
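The attention step above can be sketched in a few lines of numpy. This is a minimal illustration of content-based (dot-product) attention, assuming the decoder state and encoder outputs share a dimension; the function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(s_prev, H):
    """Content-based attention: score each encoder output h_t against the
    previous decoder state s_prev, then form the context vector c_u as the
    attention-weighted sum of encoder outputs."""
    scores = H @ s_prev        # (T,) dot-product scores
    alpha = softmax(scores)    # soft alignment over acoustic frames
    return alpha @ H, alpha    # context c_u and attention weights

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 8))   # T=50 encoder outputs of dimension 8
s = rng.normal(size=8)         # previous decoder state s_{u-1}
c, alpha = attend(s, H)
print(c.shape)                 # (8,) -- weights alpha sum to 1
```

Location-aware variants additionally feed the previous step's alpha into the scoring function so the alignment tends to move monotonically forward, which matters for long utterances.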
The Listen, Attend and Spell (LAS) model, introduced by Chan et al. in 2016, uses a pyramidal bidirectional LSTM encoder to compress the input spectrogram into a sequence of high-level representations, and an attention-based decoder that generates output tokens one at a time while attending to relevant parts of the encoder output. The attention mechanism learns a soft alignment between acoustic frames and output characters or subword tokens, replacing the hard alignment of HMM-based systems.
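The pyramidal encoder's time reduction can be shown concretely: each level concatenates adjacent frame pairs before the next BLSTM layer, so three levels shorten the sequence 8x. A minimal numpy sketch of that reshaping step (the BLSTM layers themselves are omitted):

```python
import numpy as np

def pyramid_reduce(h):
    """One pyramidal level: concatenate each pair of adjacent frames,
    halving the time dimension and doubling the feature dimension."""
    T, d = h.shape
    if T % 2:                  # drop a trailing odd frame
        h = h[:-1]
    return h.reshape(T // 2, 2 * d)

x = np.zeros((400, 40))        # e.g. 400 spectrogram frames, 40 mel bins
h = pyramid_reduce(pyramid_reduce(pyramid_reduce(x)))
print(h.shape)                 # (50, 320): 8x fewer steps to attend over
```

The shorter encoder output is what makes attention tractable: the decoder attends over 50 states rather than 400 raw frames.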
Conformer and Modern Architectures
The Conformer architecture, proposed by Gulati et al. in 2020, combines convolution and self-attention in a single encoder block, capturing both local acoustic patterns (through convolution) and global dependencies (through self-attention). Conformer-based models have achieved state-of-the-art results on major benchmarks, outperforming both pure Transformer and pure convolution approaches. The architecture typically pairs a Conformer encoder with either a CTC, attention, or transducer decoder.
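The ordering of modules inside a Conformer block can be sketched structurally. In the published design, two half-step feed-forward modules sandwich the self-attention and convolution modules, each wrapped in a residual connection; the toy placeholders below stand in for the learned sub-networks.

```python
import numpy as np

def conformer_block(x, ffn1, mhsa, conv, ffn2, layer_norm):
    """Structural sketch of a Conformer encoder block: macaron-style
    feed-forward halves around self-attention (global dependencies)
    and convolution (local acoustic patterns), with residuals."""
    x = x + 0.5 * ffn1(x)   # first feed-forward module, half residual
    x = x + mhsa(x)         # multi-head self-attention
    x = x + conv(x)         # convolution module
    x = x + 0.5 * ffn2(x)   # second feed-forward module, half residual
    return layer_norm(x)

# Toy placeholders so the sketch runs; real modules are learned networks.
identity = lambda x: x
norm = lambda x: (x - x.mean()) / (x.std() + 1e-6)
y = conformer_block(np.ones((10, 4)), identity, identity,
                    identity, identity, norm)
print(y.shape)              # (10, 4): shape is preserved by each block
```

Because every module is residual and shape-preserving, blocks stack deeply without changing the sequence length the decoder sees.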
Models such as wav2vec 2.0 and HuBERT have demonstrated that pre-training speech encoders on large amounts of unlabeled audio dramatically improves end-to-end ASR, especially in low-resource settings. These models learn robust speech representations by predicting targets for masked spans of the input (quantized latent units in wav2vec 2.0, clustered acoustic units in HuBERT), analogous to masked language modeling in NLP. Fine-tuning a pre-trained encoder with CTC on as little as 10 minutes of labeled speech can yield competitive recognition accuracy, democratizing ASR for under-resourced languages.
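At inference time, a CTC-fine-tuned encoder emits per-frame label posteriors that must be collapsed into a transcript. A minimal sketch of greedy CTC decoding, with an illustrative toy label set (the collapse rule itself is standard CTC):

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """Greedy CTC decoding: take the argmax label per frame, then
    collapse consecutive repeats and remove blanks -- the inverse of
    the many-to-one alignment CTC sums over during training."""
    best = log_probs.argmax(axis=-1)      # (T,) frame-wise argmax
    out, prev = [], blank
    for t in best:
        if t != blank and t != prev:
            out.append(int(t))
        prev = t
    return out

# Toy posteriors over {blank, 'a', 'b'} for 6 frames.
probs = np.array([
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.8, 0.1],   # a
    [0.1, 0.8, 0.1],   # a (repeat, collapsed)
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.1, 0.8],   # b
    [0.8, 0.1, 0.1],   # blank
])
print(ctc_greedy_decode(np.log(probs)))   # [1, 2], i.e. "ab"
```

Beam search with an external language model improves on this greedy rule, but the greedy path is what makes CTC decoding fast and alignment-free.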
End-to-end models offer compelling advantages: they optimize a single objective directly tied to the final task, eliminate error propagation between pipeline components, and require far less expert knowledge to build. However, they face challenges including the need for large amounts of paired audio-text training data, difficulty incorporating external language models and domain knowledge, and the latency constraints of autoregressive decoding for streaming applications.
The RNN-Transducer (RNN-T) addresses the streaming requirement by factoring the model into an encoder that processes audio causally and a prediction network that models output history, with a joint network combining both to produce output probabilities. RNN-T has become the dominant architecture for on-device streaming ASR in commercial products from Google, Apple, and Amazon.
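The joint network's combination of the two streams can be sketched in numpy. This is a schematic of the standard additive joint, with randomly initialized weights standing in for trained parameters and an illustrative label-set size:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def joint(f_t, g_u, W_enc, W_pred, W_out):
    """RNN-T joint network: combine the encoder output f_t (acoustics up
    to frame t) and the prediction-network output g_u (label history up
    to token u) into a distribution over output labels plus blank."""
    h = np.tanh(W_enc @ f_t + W_pred @ g_u)
    return softmax(W_out @ h)

rng = np.random.default_rng(0)
f_t = rng.normal(size=16)           # causal encoder output at frame t
g_u = rng.normal(size=16)           # prediction net output after u tokens
W_enc = rng.normal(size=(32, 16))
W_pred = rng.normal(size=(32, 16))
W_out = rng.normal(size=(101, 32))  # e.g. 100 labels + blank
p = joint(f_t, g_u, W_enc, W_pred, W_out)
print(p.shape)                      # (101,) probabilities summing to 1
```

Because the encoder is causal and the joint is evaluated frame by frame, the model can emit tokens (or blank, meaning "advance to the next frame") as audio arrives, which is what makes RNN-T suitable for streaming.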