Spoken Language Understanding

Spoken Language Understanding extracts structured semantic representations from speech, combining speech recognition with natural language understanding to identify user intents and extract relevant entities from spoken utterances.

P(intent, slots | audio) = Σ_W P(intent, slots | W) · P(W | audio)
                         ≈ P(intent, slots | W*) · P(W* | audio),  where W* = argmax_W P(W | audio)

Spoken Language Understanding (SLU) is the task of extracting meaning from spoken utterances in a form suitable for downstream action. In task-oriented dialogue systems — virtual assistants, customer service bots, command-and-control interfaces — SLU typically involves two subtasks: intent detection (classifying the user's goal, such as "book a flight" or "check the weather") and slot filling (extracting relevant entities, such as the departure city, destination, and date). SLU bridges the gap between raw speech input and structured semantic representations that the dialogue manager can act upon.
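The two subtasks can be made concrete with a small worked example. The utterance, intent label, and slot names below are illustrative, loosely in the style of the ATIS domain; slot spans are marked with the standard BIO scheme (B- begins a slot, I- continues it, O means no slot):

```python
# Illustrative SLU output for one utterance (labels are hypothetical).
utterance = "book a flight from boston to denver on friday"
tokens = utterance.split()

# Intent: a single label for the whole utterance.
intent = "book_flight"

# Slots: one BIO tag per token.
bio_tags = ["O", "O", "O", "O",
            "B-from_city", "O", "B-to_city", "O", "B-date"]

# Collapse the BIO tags into a structured semantic frame.
slots = {}
for token, tag in zip(tokens, bio_tags):
    if tag.startswith("B-"):
        current = tag[2:]
        slots[current] = token
    elif tag.startswith("I-"):
        slots[current] += " " + token

print(intent)  # book_flight
print(slots)   # {'from_city': 'boston', 'to_city': 'denver', 'date': 'friday'}
```

The resulting frame — an intent plus a slot dictionary — is the structured representation the dialogue manager acts on.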

Pipeline vs. End-to-End SLU

SLU Architectures

Pipeline: Audio → ASR → Text → NLU → (Intent, Slots)

End-to-End: Audio → SLU Model → (Intent, Slots)

Joint intent & slot model:
Intent: P(intent | h_1, ..., h_T) via pooled representation
Slots: P(slot_t | h_t) via sequence labeling (BIO tagging)
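The joint formulation above can be sketched numerically. The toy "encoder" below is just random hidden states, and all dimensions and weight matrices are illustrative; a real model would produce h_1, ..., h_T with a trained speech or text encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 9, 16                 # sequence length and hidden size (toy values)
n_intents, n_slots = 5, 7

# Stand-in for a shared encoder: hidden states h_1..h_T for one utterance.
H = rng.standard_normal((T, D))

# Two task-specific heads on top of the shared representation.
W_intent = rng.standard_normal((D, n_intents))
W_slot = rng.standard_normal((D, n_slots))

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Intent: one distribution from the mean-pooled utterance representation.
intent_probs = softmax(H.mean(axis=0) @ W_intent)   # shape (n_intents,)

# Slots: one BIO-tag distribution per time step (sequence labeling).
slot_probs = softmax(H @ W_slot, axis=-1)           # shape (T, n_slots)

intent_pred = int(intent_probs.argmax())
slot_preds = slot_probs.argmax(axis=-1)             # one tag index per token
```

Mean pooling is one simple choice for the utterance-level representation; attention pooling or a [CLS]-style token are common alternatives.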

The traditional SLU pipeline cascades an ASR system and a text-based NLU module. The ASR system transcribes the audio, and the NLU module processes the transcript using intent classification and named entity recognition. This modular approach benefits from mature ASR and NLU technologies, but suffers from error propagation: ASR errors (misrecognized words) cause downstream NLU failures, and information present in the audio but absent from the transcript (prosody, hesitation, speaker state) is lost.
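Error propagation is easy to demonstrate with a toy cascade. The keyword "NLU" and the transcripts below are purely illustrative:

```python
# Toy pipeline stage, illustrating error propagation from ASR to NLU.
def nlu(transcript: str) -> str:
    """A trivial rule-based intent classifier (illustrative only)."""
    if "flight" in transcript:
        return "book_flight"
    if "weather" in transcript:
        return "check_weather"
    return "unknown"

clean = "book a flight to denver"
asr_output = "book a fright to denver"   # one misrecognized word

print(nlu(clean))       # book_flight
print(nlu(asr_output))  # unknown: the ASR error propagated to the NLU
```

A single substitution in the transcript is enough to flip the predicted intent, even though the audio itself was unambiguous.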

Joint and End-to-End Models

Joint models for intent detection and slot filling share a common encoder and are trained with a combined loss, allowing the two tasks to benefit from shared representations. BERT-based joint models have achieved strong results on benchmark datasets like ATIS and SNIPS. End-to-end SLU models go further by operating directly on speech features, bypassing the ASR transcript entirely. These models can potentially leverage acoustic cues — a rising intonation suggesting a question, emphasis on a particular word — that a text pipeline would miss.
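The combined loss mentioned above is typically a weighted sum of an utterance-level intent cross-entropy and a per-token slot cross-entropy. A minimal sketch, with made-up probabilities and a hypothetical weighting hyperparameter lam:

```python
import math

# Toy combined loss for joint training (all numbers are illustrative).
intent_probs = [0.7, 0.2, 0.1]           # model's intent distribution
intent_gold = 0                          # index of the true intent

slot_probs = [[0.6, 0.3, 0.1],           # per-token slot-tag distributions
              [0.2, 0.7, 0.1],
              [0.1, 0.1, 0.8]]
slot_gold = [0, 1, 2]                    # true tag index per token

loss_intent = -math.log(intent_probs[intent_gold])
loss_slots = -sum(math.log(p[g])
                  for p, g in zip(slot_probs, slot_gold)) / len(slot_gold)

lam = 1.0                                # task-weighting hyperparameter
loss = loss_intent + lam * loss_slots
```

Because both heads backpropagate through the same encoder, gradients from each task shape the shared representation.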

Handling ASR Errors

In pipeline SLU, robustness to ASR errors is critical. Techniques include training NLU models on ASR output (with errors) rather than clean text, using n-best lists or lattices from the ASR to provide alternative hypotheses, incorporating ASR confidence scores as features, and data augmentation with simulated ASR errors. Word confusion networks, which compactly represent the uncertainty in ASR output, can be processed by specialized NLU models that reason over the full distribution of possible transcriptions rather than a single best hypothesis.
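Augmentation with simulated ASR errors can be sketched as random word-level substitutions, deletions, and repeats. The confusion pairs and error rate below are invented for illustration, not drawn from a real ASR system:

```python
import random

# Hypothetical acoustic confusion pairs for substitution errors.
CONFUSIONS = {"flight": ["fright", "light"], "boston": ["austin"]}

def corrupt(transcript: str, error_rate: float = 0.3, seed: int = 0) -> str:
    """Inject simulated ASR errors into a clean transcript (sketch)."""
    rng = random.Random(seed)
    out = []
    for word in transcript.split():
        if rng.random() >= error_rate:
            out.append(word)                 # keep the word unchanged
        else:
            op = rng.choice(["sub", "del", "dup"])
            if op == "sub":                  # substitute a confusable word
                out.append(rng.choice(CONFUSIONS.get(word, [word])))
            elif op == "dup":                # hesitation-style repetition
                out.extend([word, word])
            # "del": drop the word entirely
    return " ".join(out)

noisy = corrupt("book a flight from boston to denver")
```

Training the NLU model on a mix of clean and corrupted transcripts makes it less brittle when real ASR errors appear at test time.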

Evaluation of SLU systems uses intent accuracy (the fraction of utterances with correctly predicted intent), slot F1 score (the harmonic mean of precision and recall for slot entities), and sentence accuracy (the fraction of utterances with both correct intent and all correct slots). The sentence-level metric is the most stringent and practically relevant, as a single slot error can cause the system to take the wrong action. On benchmark datasets, state-of-the-art models achieve over 97% intent accuracy and 96% slot F1, but performance degrades significantly on noisy speech, accented speakers, and out-of-domain utterances.
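The three metrics can be computed directly from gold and predicted frames. The three toy utterances below are invented; slot F1 is computed here at the entity level over (slot_type, value) pairs:

```python
# Toy evaluation set: (gold_intent, pred_intent, gold_slots, pred_slots).
examples = [
    ("book_flight", "book_flight",
     {("to_city", "denver"), ("date", "friday")},
     {("to_city", "denver"), ("date", "friday")}),
    ("book_flight", "book_flight",
     {("to_city", "boston")},
     {("to_city", "austin")}),          # one wrong slot value
    ("check_weather", "book_flight", set(), set()),  # wrong intent
]

# Intent accuracy: fraction of utterances with the correct intent.
intent_accuracy = sum(g == p for g, p, _, _ in examples) / len(examples)

# Slot F1: harmonic mean of entity-level precision and recall.
tp = sum(len(gs & ps) for _, _, gs, ps in examples)
precision = tp / sum(len(ps) for _, _, _, ps in examples)
recall = tp / sum(len(gs) for _, _, gs, _ in examples)
slot_f1 = 2 * precision * recall / (precision + recall)

# Sentence accuracy: correct intent AND all slots exactly right.
sentence_accuracy = sum(g == p and gs == ps
                        for g, p, gs, ps in examples) / len(examples)
```

On this toy set, intent accuracy is 2/3, slot F1 is 2/3, and sentence accuracy is only 1/3: the second utterance gets the intent right but fails the sentence-level metric on a single slot error, which is exactly why that metric is the most stringent.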

The evolution of SLU reflects a broader trend toward integration in speech and language processing. From early rule-based systems, through statistical classifiers on ASR output, to modern pre-trained models that jointly process speech and language, the field continually seeks tighter coupling between acoustic and linguistic processing to minimize information loss and error propagation.

References

  1. Tur, G., & De Mori, R. (Eds.). (2011). Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. John Wiley & Sons. doi:10.1002/9781119992691
  2. Chen, Q., Zhuo, Z., & Wang, W. (2019). BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909.
  3. Lugosch, L., Ravanelli, M., Ignoto, P., Tomar, V. S., & Bengio, Y. (2019). Speech model pre-training for end-to-end spoken language understanding. Proc. Interspeech, 814–818. doi:10.21437/Interspeech.2019-2396
  4. Hemphill, C. T., Godfrey, J. J., & Doddington, G. R. (1990). The ATIS spoken language systems pilot corpus. Proc. DARPA Speech and Natural Language Workshop, 96–101.
