Automatic Speech Recognition (ASR) is the task of converting a continuous acoustic signal into a sequence of words. The dominant framework for ASR, established in the 1970s at IBM and refined over decades, applies Bayes' theorem to decompose the problem into two components: an acoustic model P(O|W) that scores how well an observation sequence O matches a hypothesized word sequence W, and a language model P(W) that captures the prior probability of the word sequence. The decoder searches over possible word sequences to find the one that maximizes the posterior probability.
The Statistical Framework
W* = argmax_W P(W|O)
   = argmax_W P(O|W) · P(W) / P(O)
   = argmax_W P(O|W) · P(W)
P(O|W): acoustic model (GMM-HMM or neural network)
P(W): language model (n-gram or neural)
Search: Viterbi beam search over WFST
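The noisy-channel argmax above can be sketched in a few lines. The candidate transcripts and their acoustic/language-model log scores below are invented toy numbers, and `lm_weight` reflects the common practice of scaling the LM score; neither comes from this text.

```python
import math

# Hypothetical log-domain scores for three candidate transcripts of one
# utterance (all numbers invented for illustration).
# log_am ~ log P(O|W), log_lm ~ log P(W).
candidates = {
    "recognize speech":   {"log_am": -120.0, "log_lm": -8.5},
    "wreck a nice beach": {"log_am": -118.0, "log_lm": -14.2},
    "recognise peach":    {"log_am": -125.0, "log_lm": -11.0},
}

def posterior_score(scores, lm_weight=1.0):
    # log P(O|W) + lm_weight * log P(W).  P(O) is the same for every
    # hypothesis, so it drops out of the argmax entirely.
    return scores["log_am"] + lm_weight * scores["log_lm"]

best = max(candidates, key=lambda w: posterior_score(candidates[w]))
print(best)  # the hypothesis with the highest combined score
```

Note how the acoustically best hypothesis ("wreck a nice beach" here) can lose once the language-model prior is added in, which is exactly the point of the decomposition.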
The acoustic model traditionally pairs Hidden Markov Models (HMMs) with Gaussian Mixture Models (GMMs) to capture the temporal dynamics and spectral variability of speech. Each word is decomposed into a sequence of phonemes, each phoneme modeled by a multi-state HMM, and each state's emission distribution represented as a mixture of Gaussians over acoustic feature vectors such as MFCCs. Context-dependent triphone models account for coarticulation effects, and decision-tree state tying manages the combinatorial explosion of context-dependent states.
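A GMM state emission score as described is a weighted sum of Gaussian densities evaluated on an acoustic feature vector. The sketch below uses diagonal covariances (the usual choice with decorrelated features like MFCCs) and invented mixture parameters; a real system would have thousands of tied triphone states, each with its own mixture.

```python
import math

def diag_gauss_logpdf(x, mean, var):
    # Log density of a diagonal-covariance Gaussian at feature vector x.
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, weights, means, variances):
    # log sum_k w_k * N(x; mu_k, Sigma_k), computed with log-sum-exp
    # so small densities do not underflow.
    terms = [
        math.log(w) + diag_gauss_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    mx = max(terms)
    return mx + math.log(sum(math.exp(t - mx) for t in terms))

# Toy 2-component mixture over a 3-dimensional "MFCC" frame
# (all parameters invented for illustration).
weights = [0.6, 0.4]
means = [[0.0, 1.0, -1.0], [2.0, 0.0, 0.5]]
variances = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
frame = [0.2, 0.8, -0.5]
print(gmm_loglik(frame, weights, means, variances))
```

In decoding, this log-likelihood is what each HMM state contributes per frame to the Viterbi path score.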
From GMM-HMM to Deep Neural Networks
The introduction of deep neural networks (DNNs) to replace GMMs for acoustic modeling, developed by Hinton and colleagues between 2009 and 2012 and summarized in their influential 2012 overview paper, produced the largest single improvement in ASR accuracy in over a decade. DNN-HMM hybrid systems use a neural network to compute the posterior probability of each HMM state given a window of acoustic features, then convert these posteriors to likelihoods for use in the standard Viterbi decoding framework. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) further improved performance by exploiting local spectral structure and long-range temporal dependencies respectively.
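The posterior-to-likelihood conversion in the hybrid recipe is a one-liner in the log domain: by Bayes' rule, p(o|s) ∝ p(s|o) / p(s), where the state priors p(s) are estimated from training alignments and the per-frame p(o) cancels in the Viterbi argmax. The numbers below are invented toy values, not real network outputs.

```python
import math

def posterior_to_loglik(log_posteriors, log_priors):
    # Hybrid DNN-HMM scaling: log p(o|s) = log p(s|o) - log p(s) + const.
    # The constant (log p(o)) is the same for every state in a frame,
    # so it can be dropped.
    return [lp - pr for lp, pr in zip(log_posteriors, log_priors)]

# Toy example: 3 HMM states, one frame (all values invented).
log_post = [math.log(0.7), math.log(0.2), math.log(0.1)]    # DNN softmax outputs
log_prior = [math.log(0.5), math.log(0.3), math.log(0.2)]   # state frequencies in alignments
scaled = posterior_to_loglik(log_post, log_prior)
best_state = max(range(3), key=lambda s: scaled[s])
print(best_state)
```

The division by the prior matters: a state that the network favors only because it is frequent in training data gets its score discounted accordingly.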
Modern ASR systems compose the acoustic model, pronunciation lexicon, and language model into a single weighted finite-state transducer (WFST). This elegant framework, developed by Mohri, Pereira, and Riley, enables efficient search through determinization, minimization, and weight pushing operations. The WFST approach unifies the search problem and allows optimizations that would be impossible if the components were treated separately, forming the backbone of production systems built with toolkits such as Kaldi.
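The core composition operation can be sketched minimally: states of the composed machine are pairs of component states, an arc exists where the first machine's output label matches the second's input label, and weights add along paths (the tropical semiring over negative log probabilities). The two toy transducers below are invented; real toolkits such as OpenFst additionally handle epsilon transitions, determinization, and minimization.

```python
# Transducer representation (a sketch, not any toolkit's API):
#   arcs:   {state: [(in_label, out_label, weight, next_state), ...]}
#   finals: set of final states
def compose(arcs1, finals1, arcs2, finals2, start1=0, start2=0):
    # Epsilon-free WFST composition in the tropical semiring.
    arcs, finals = {}, set()
    stack, seen = [(start1, start2)], {(start1, start2)}
    while stack:
        s1, s2 = stack.pop()
        out = arcs.setdefault((s1, s2), [])
        if s1 in finals1 and s2 in finals2:
            finals.add((s1, s2))
        for i1, o1, w1, n1 in arcs1.get(s1, []):
            for i2, o2, w2, n2 in arcs2.get(s2, []):
                if o1 == i2:  # match output of T1 against input of T2
                    out.append((i1, o2, w1 + w2, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
    return arcs, finals

# Toy machines (labels and weights invented): T1 rewrites a->x, b->y;
# T2 rewrites x->ONE, y->TWO.  T1 o T2 rewrites a->ONE, b->TWO.
t1 = {0: [("a", "x", 0.5, 1), ("b", "y", 0.7, 1)]}
t2 = {0: [("x", "ONE", 0.2, 1), ("y", "TWO", 0.1, 1)]}
arcs, finals = compose(t1, {1}, t2, {1})
print(arcs[(0, 0)], finals)
```

In the ASR pipeline this same operation, applied as H ∘ C ∘ L ∘ G, yields the single search graph the decoder walks.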
The decoding process explores a vast hypothesis space efficiently using beam search, pruning unlikely hypotheses at each frame. Lattice generation preserves multiple competing hypotheses for downstream reranking or confidence estimation. Word error rate (WER), computed as the edit distance between the hypothesis and reference transcript normalized by reference length, remains the standard evaluation metric, though slot error rate and semantic accuracy metrics are increasingly used for task-oriented applications.
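Word error rate as defined above is a standard dynamic-programming edit distance over words; the example transcripts below are invented.

```python
def wer(reference, hypothesis):
    # Levenshtein distance over words (substitutions, insertions,
    # deletions), normalized by the reference length.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One deletion against a 6-word reference -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count against the hypothesis, WER can exceed 100% when the recognizer emits more words than the reference contains.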
ASR has advanced from isolated word recognition systems of the 1970s to today's systems that approach human parity on clean conversational speech. The transition from GMM-HMM to DNN-HMM hybrids, and subsequently to end-to-end models, represents a fundamental shift in how the field approaches the speech recognition problem, trading explicit modularity for the power of learned representations.