Disfluencies are interruptions in the smooth flow of speech that occur naturally in spontaneous conversation. Unlike written text, which is typically edited before consumption, spoken language is produced in real time and is replete with phenomena that reflect the speaker's planning, monitoring, and repair processes. Disfluency detection is the task of automatically identifying these phenomena in speech transcripts or audio. This capability is essential for downstream NLP tasks (parsing, machine translation, information extraction) that expect fluent input, and it also provides insights into psycholinguistic models of speech production.
Disfluency Taxonomy
Types:
Filled pauses: "uh", "um", "er"
Repetitions: "I I want to go"
Revisions: "go to Bos- uh to Denver" (reparandum: "to Bos-", repair: "to Denver")
Restarts: "Can you- I'd like a flight to Denver"
Discourse markers: "you know", "I mean", "like"
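The simplest of these types can be spotted with surface cues alone. The sketch below (illustrative, not a production detector; the lexicon and label names are ours) tags filled pauses from a small word list and flags immediate word repetitions:

```python
# Minimal rule-based tagger for two of the disfluency types above:
# filled pauses (via a small lexicon) and immediate repetitions.
FILLED_PAUSES = {"uh", "um", "er"}

def tag_tokens(tokens):
    """Label each token as FILLED_PAUSE, REPETITION, or FLUENT."""
    labels = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in FILLED_PAUSES:
            labels.append("FILLED_PAUSE")
        elif i + 1 < len(tokens) and low == tokens[i + 1].lower():
            labels.append("REPETITION")  # word repeated immediately after
        else:
            labels.append("FLUENT")
    return labels

print(tag_tokens("I I want uh to go".split()))
# ['REPETITION', 'FLUENT', 'FLUENT', 'FILLED_PAUSE', 'FLUENT', 'FLUENT']
```

Revisions and restarts resist this kind of lexicon lookup, which is why they require the structural analysis and learned models discussed below.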
The standard analysis of disfluencies, formalized by Shriberg, decomposes them into a reparandum (the part to be replaced), an optional editing term (typically a filled pause like "uh" or "um"), and a repair (the corrected version). Detecting the reparandum is the most challenging subtask because it requires identifying which words the speaker intended to retract. In the Switchboard corpus of conversational telephone speech, approximately 5-10% of words fall within disfluent regions, making disfluency detection essential for accurate processing of spontaneous speech.
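Shriberg's reparandum/editing-term/repair decomposition can be made concrete with a small data structure. In this sketch (field and function names are ours), "cleaning up" an utterance amounts to dropping the reparandum and editing term and keeping the repair:

```python
from dataclasses import dataclass

# Shriberg-style decomposition of a single disfluency (illustrative names).
@dataclass
class Disfluency:
    reparandum: list    # words the speaker retracts, e.g. ["to", "Bos-"]
    editing_term: list  # optional filler, e.g. ["uh"]
    repair: list        # the corrected material, e.g. ["to", "Denver"]

def apply_repair(prefix, disfluency, suffix):
    """Reconstruct the intended fluent utterance: keep only the repair."""
    return prefix + disfluency.repair + suffix

d = Disfluency(reparandum=["to", "Bos-"], editing_term=["uh"], repair=["to", "Denver"])
print(" ".join(apply_repair(["go"], d, [])))  # → "go to Denver"
```

Note that the hard part of detection is locating the reparandum span in the first place; given the span boundaries, producing the fluent string is trivial, as above.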
Detection Approaches
Rule-based approaches exploit the observation that reparanda and repairs often share lexical material (the speaker repeats or corrects similar words). Statistical approaches use sequence labeling models — CRFs, BiLSTMs, and Transformers — trained on annotated corpora to classify each word as fluent or belonging to a disfluent region. The state of the art uses pre-trained language models fine-tuned on disfluency-annotated data, achieving F1 scores above 90% on the Switchboard test set. These models leverage the fact that disfluencies create detectable anomalies in the local language model probability.
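The "rough copy" observation that reparanda and repairs share lexical material translates directly into features for a sequence labeler. The following sketch shows the kind of per-token feature dictionary a CRF baseline might consume (feature names are illustrative, not from any particular system):

```python
# Illustrative per-token features for a CRF-style sequence labeler.
FILLED_PAUSES = {"uh", "um", "er"}

def token_features(tokens, i):
    """Return a feature dict for token i of a tokenized utterance."""
    tok = tokens[i].lower()
    return {
        "word": tok,
        "is_filled_pause": tok in FILLED_PAUSES,
        # Rough-copy cues: reparandum and repair often share words.
        "repeats_next": i + 1 < len(tokens) and tok == tokens[i + 1].lower(),
        "repeats_prev": i > 0 and tok == tokens[i - 1].lower(),
    }

tokens = "I I want to go".split()
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[0]["repeats_next"])  # True: "I" is immediately repeated
```

Neural approaches replace these hand-built features with learned representations, but the underlying signal they exploit is the same local anomaly.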
While disfluency detection is often motivated by the need to "clean up" speech for downstream processing, research has shown that disfluencies carry useful information. Filled pauses like "uh" and "um" signal upcoming difficulty (complex or infrequent words) and may help listeners predict what comes next. Repairs indicate that the speaker detected an error in their own speech, providing information about their monitoring process. In dialogue systems, preserving and interpreting disfluencies can improve understanding of user uncertainty and turn-taking behavior.
Incremental disfluency detection processes speech word by word in real time, a requirement for interactive dialogue systems that must understand partial utterances as they unfold. This setting is more challenging because the model cannot use future context (the repair) to detect the reparandum. Transition-based parsers and incremental neural models have been adapted for this task, trading some detection accuracy for the ability to flag disfluencies with minimal latency.
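The left-to-right constraint can be illustrated with a detector that labels each word the moment it arrives, using only past context. In this sketch (heuristics and label names are ours), the reparandum can only be recognized retroactively, once the repair begins:

```python
# Incremental detection sketch: each word is labeled on arrival using only
# past context. A reparandum is confirmed only once its repair begins.
class IncrementalDetector:
    FILLED_PAUSES = {"uh", "um", "er"}

    def __init__(self):
        self.prev = None  # most recent word seen so far

    def consume(self, token):
        """Label one incoming token immediately."""
        low = token.lower()
        if low in self.FILLED_PAUSES:
            label = "EDITING_TERM"
        elif low == self.prev:
            # The new word restarts the previous one: only now do we know
            # the previous word was a reparandum and this word begins a repair.
            label = "REPAIR_ONSET"
        else:
            label = "FLUENT"
        self.prev = low
        return label

det = IncrementalDetector()
print([det.consume(t) for t in "I I want uh to go".split()])
# ['FLUENT', 'REPAIR_ONSET', 'FLUENT', 'EDITING_TERM', 'FLUENT', 'FLUENT']
```

The first "I" is labeled FLUENT when it arrives and is only implicated as a reparandum one word later, which is exactly the latency/accuracy tension the paragraph above describes.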
Cross-lingual disfluency detection is an emerging research direction, as disfluency patterns vary across languages while sharing universal properties. Languages differ in their filled pause inventories (English "uh/um" vs. Japanese "eto/ano"), preferred repair strategies, and disfluency rates. Multilingual models trained on annotated data from multiple languages have shown the ability to transfer disfluency detection capabilities to new languages, though language-specific patterns still require adaptation.
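The language-specific filled-pause inventories mentioned above are one place where adaptation is mechanical. A minimal sketch, using only the inventories cited in the text (the lists are illustrative, not exhaustive):

```python
# Per-language filled-pause inventories (examples from the text only).
FILLED_PAUSES = {
    "en": {"uh", "um", "er"},
    "ja": {"eto", "ano"},
}

def is_filled_pause(token, lang):
    """Check a token against the filled-pause lexicon for one language."""
    return token.lower() in FILLED_PAUSES.get(lang, set())

print(is_filled_pause("eto", "ja"))  # True
print(is_filled_pause("uh", "ja"))  # False
```

Repair strategies and disfluency rates, by contrast, are structural properties that a lexicon swap cannot capture, which is why multilingual models still benefit from language-specific fine-tuning.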