Computational Linguistics

Connectionist Temporal Classification

Connectionist Temporal Classification is a training criterion and decoding algorithm that enables neural networks to produce label sequences from unsegmented input by marginalizing over all valid alignments.

P(Y|X) = ∑_{π ∈ B^{-1}(Y)} ∏_t P(π_t | X)

Connectionist Temporal Classification (CTC), introduced by Alex Graves and colleagues in 2006, solved a fundamental problem in sequence-to-sequence learning: how to train a neural network to produce a label sequence when the alignment between input frames and output labels is unknown. CTC introduces a blank symbol that represents "no output" and defines the probability of an output sequence as the sum over all possible frame-level alignments that collapse to that sequence after removing blanks and repeated labels.

The CTC Loss Function

CTC Formulation P(Y|X) = ∑_{π ∈ B^{-1}(Y)} ∏_{t=1}^{T} P(π_t | X)

B: collapsing function (remove blanks, merge repeats)
B^{-1}(Y): set of all valid alignments for label sequence Y
π_t ∈ L ∪ {blank}: label at time t

Loss: L_CTC = −log P(Y|X)
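
The collapsing function B defined above is straightforward to state in code. A minimal sketch, representing labels as integers with 0 as the blank index (an illustrative convention, not mandated by CTC itself):

```python
def collapse(path, blank=0):
    """Apply B: merge consecutive repeats, then drop blanks (one pass)."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# With blank = 0 and a, b, c = 1, 2, 3, the path "—aab—bc—" becomes:
print(collapse([0, 1, 1, 2, 0, 2, 3, 0]))  # [1, 2, 2, 3], i.e. "abbc"
```

Note that the blank between the two b's is what keeps the repeated label from being merged away.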

The key insight of CTC is the collapsing function B, which maps a frame-level alignment path to an output label sequence by first merging consecutive identical labels and then removing all blank tokens. For example, the alignment "—aab—bc—" (where — denotes blank) collapses to "abbc": the blank between the two b's is what allows the repeated label to survive, whereas "—aab——c—" would collapse to "abc." The CTC loss is computed as the negative log-probability of the target sequence, which requires summing over the exponentially many valid alignment paths. This summation is performed efficiently using a forward-backward dynamic programming algorithm analogous to the HMM forward algorithm.
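
The forward pass of that dynamic program can be sketched as follows. This is a didactic log-space implementation assuming log_probs is a T×V nested list of per-frame log label probabilities and 0 is the blank index; production systems use vectorized implementations such as torch.nn.CTCLoss:

```python
import math

def ctc_loss(log_probs, target, blank=0):
    """Negative log P(Y|X) via the CTC forward algorithm."""
    # Extended sequence Y': a blank between every label and at both ends.
    ext = [blank]
    for y in target:
        ext += [y, blank]
    T, S = len(log_probs), len(ext)
    NEG_INF = float("-inf")

    def logadd(a, b):  # log(exp(a) + exp(b)), numerically stable
        if a == NEG_INF: return b
        if b == NEG_INF: return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    # alpha[s] = log-probability of all paths for ext[:s+1] up to frame t.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                       # stay on the same symbol
            if s > 0:
                a = logadd(a, alpha[s - 1])    # advance one symbol
            # Skip over a blank only when the neighboring labels differ.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new
    # Valid paths end on the final label or the final blank.
    return -logadd(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)
```

With two frames of uniform probabilities over {blank, a}, the three paths (a, blank), (blank, a), and (a, a) all collapse to "a", so P("a") = 0.75 and the loss is −log 0.75.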

Conditional Independence and Decoding

CTC makes a strong conditional independence assumption: given the input, the output at each time step is independent of the outputs at all other time steps. This means CTC relies entirely on the encoder to capture temporal dependencies and cannot model output label dependencies directly. Despite this limitation, CTC works remarkably well when paired with powerful encoders (deep bidirectional LSTMs or Conformers) that provide rich contextual representations at each frame.
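
One practical consequence of this factorization is that the simplest decoder can operate frame by frame. A minimal sketch of greedy decoding (assuming probs is a T×V nested list of per-frame label probabilities with blank at index 0), including a classic case where it fails:

```python
def greedy_decode(probs, blank=0):
    """Framewise argmax, then apply the collapsing function B."""
    path = [max(range(len(p)), key=p.__getitem__) for p in probs]
    out, prev = [], None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out

# Two frames with P(blank) = 0.6 and P(a) = 0.4 each: the greedy path is
# (blank, blank), so greedy outputs "" with probability 0.36, even though
# P("a") = 0.24 + 0.24 + 0.16 = 0.64 once the paths (a, blank), (blank, a),
# and (a, a) are summed.
print(greedy_decode([[0.6, 0.4], [0.6, 0.4]]))  # []
```

This blind spot, that many alignment paths can merge into a single more probable labeling, is what motivates the prefix beam search described next.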

CTC Prefix Beam Search

Greedy CTC decoding simply takes the most probable label at each frame and applies the collapsing function, but this ignores the fact that multiple alignment paths can contribute to the same output sequence. CTC prefix beam search maintains a set of active output prefixes and accumulates their probabilities across alignment paths, tracking separate probabilities for prefixes ending in blank versus non-blank. This algorithm yields significantly better results than greedy decoding, especially when combined with an external language model that scores partial hypotheses.
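
A minimal sketch of the algorithm without an external language model, assuming probs is a T×V nested list with blank at index 0 and prefixes represented as tuples of label ids (the naming here is illustrative, not taken from a particular toolkit):

```python
import collections

def prefix_beam_search(probs, beam_size=8, blank=0):
    """CTC prefix beam search without a language model.

    Each prefix carries two probabilities: p_b (mass of paths ending in
    blank) and p_nb (mass of paths ending in the prefix's last label).
    """
    beam = {(): (1.0, 0.0)}  # empty prefix, all mass "ends in blank"
    for frame in probs:
        next_beam = collections.defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beam.items():
            for c, p in enumerate(frame):
                if c == blank:
                    # Blank never extends the prefix; it only ends it.
                    nb_b, nb_nb = next_beam[prefix]
                    next_beam[prefix] = (nb_b + (p_b + p_nb) * p, nb_nb)
                elif prefix and c == prefix[-1]:
                    # Repeating the last label extends the prefix only
                    # from paths ending in blank; otherwise the path
                    # merges back into the same prefix.
                    ext = prefix + (c,)
                    e_b, e_nb = next_beam[ext]
                    next_beam[ext] = (e_b, e_nb + p_b * p)
                    s_b, s_nb = next_beam[prefix]
                    next_beam[prefix] = (s_b, s_nb + p_nb * p)
                else:
                    ext = prefix + (c,)
                    e_b, e_nb = next_beam[ext]
                    next_beam[ext] = (e_b, e_nb + (p_b + p_nb) * p)
        ranked = sorted(next_beam.items(),
                        key=lambda kv: -(kv[1][0] + kv[1][1]))
        beam = dict(ranked[:beam_size])
    return max(beam.items(), key=lambda kv: kv[1][0] + kv[1][1])
```

On the two-frame distribution P(blank) = 0.6, P(a) = 0.4 per frame, prefix_beam_search([[0.6, 0.4], [0.6, 0.4]]) returns the prefix (1,) with accumulated probability 0.24 + 0.40 = 0.64, whereas the greedy path collapses to the empty sequence with probability 0.36. An external language model would be folded in at the points where a prefix is extended by a new label.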

CTC is often combined with attention-based decoding in hybrid CTC/attention architectures, where a CTC loss applied to the encoder output provides a monotonic alignment constraint that stabilizes attention training, while the attention decoder models output dependencies. This joint training, implemented in frameworks such as ESPnet, has become a standard recipe for end-to-end ASR and achieves competitive results on benchmarks from LibriSpeech to multilingual tasks.
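
The joint objective in such hybrid systems is commonly an interpolation of the two losses, with the weight λ as a tunable hyperparameter (often around 0.3 in practice):

L_joint = λ · L_CTC + (1 − λ) · L_att,   λ ∈ [0, 1]

At decoding time the CTC and attention scores can be interpolated in the same way to rescore partial hypotheses.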

Beyond speech recognition, CTC has found applications in handwriting recognition, action detection in video, and any sequence labeling task where the alignment between input and output is unknown. Its elegant solution to the alignment problem remains one of the most influential contributions to sequence-to-sequence learning.

References

  1. Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proc. ICML, 369–376. doi:10.1145/1143844.1143891
  2. Graves, A., & Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent neural networks. Proc. ICML, 1764–1772.
  3. Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., ... & Ochiai, T. (2018). ESPnet: End-to-end speech processing toolkit. Proc. Interspeech, 2207–2211. doi:10.21437/Interspeech.2018-1456
  4. Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., ... & Ng, A. Y. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
