A finite-state automaton (FSA) is an abstract machine that processes input strings symbol by symbol, transitioning between a finite number of states according to a transition function. FSAs come in two flavors: deterministic (DFA), where each state has exactly one transition per input symbol, and nondeterministic (NFA), where multiple transitions are allowed. Despite this apparent difference, DFAs and NFAs recognize exactly the same class of languages — the regular languages. Finite-state automata are the computational backbone of lexical analysis in NLP, powering everything from spell checkers to morphological analyzers.
Formal Definition
A deterministic finite automaton is defined as a 5-tuple M = (Q, Σ, δ, q₀, F), where Q is a finite set of states, Σ is the input alphabet, δ: Q × Σ → Q is the transition function, q₀ ∈ Q is the start state, and F ⊆ Q is the set of accepting states. The automaton reads an input string one symbol at a time, starting in q₀, and accepts the string if it ends in a state belonging to F.
δ*(q, ε) = q
δ*(q, wa) = δ(δ*(q, w), a)
L(M) = { w ∈ Σ* | δ*(q₀, w) ∈ F }
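The definitions above can be sketched directly in code. The following is a minimal illustration, assuming a toy DFA over Σ = {a, b} that accepts strings with an even number of a's; the names `delta` and `accepts` are illustrative, not from any particular library:

```python
# Toy DFA: Q = {0, 1}, Sigma = {'a', 'b'}, start state q0 = 0,
# accepting states F = {0}. Accepts strings with an even number of 'a's.
delta = {
    (0, 'a'): 1, (0, 'b'): 0,
    (1, 'a'): 0, (1, 'b'): 1,
}

def accepts(w: str, start: int = 0, accepting=frozenset({0})) -> bool:
    """Compute delta*(q0, w) by iterated application of delta,
    then test membership in F."""
    q = start
    for a in w:
        q = delta[(q, a)]   # one application of delta per input symbol
    return q in accepting

print(accepts("abba"))  # even number of 'a's -> True
print(accepts("ab"))    # odd number of 'a's  -> False
```

The loop is exactly the inductive definition of δ*: the base case is the initial value `q = start`, and each iteration applies δ once.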
Nondeterministic finite automata extend this by allowing δ to map to sets of states and by permitting ε-transitions (transitions that consume no input). The subset construction converts any NFA into an equivalent DFA whose states are sets of NFA states, though the result may have up to 2ⁿ states for an n-state NFA. In practice, NFAs are often more natural to design, while DFAs are more efficient to execute.
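The subset construction can be sketched as follows. This is a minimal version that omits ε-transitions (a full version would take ε-closures at each step); the example NFA and the name `nfa_to_dfa` are illustrative:

```python
from collections import deque

def nfa_to_dfa(nfa_delta, start, accepting, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    dfa_start = frozenset({start})
    dfa_delta, dfa_accepting = {}, set()
    queue, seen = deque([dfa_start]), {dfa_start}
    while queue:
        S = queue.popleft()
        if S & accepting:          # accept if any NFA state in S accepts
            dfa_accepting.add(S)
        for a in alphabet:
            # Union of the moves from every NFA state in S on symbol a.
            T = frozenset(q2 for q in S for q2 in nfa_delta.get((q, a), ()))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    return dfa_delta, dfa_start, dfa_accepting

# NFA for strings over {a, b} whose second-to-last symbol is 'a'.
nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'a'): {2}, (1, 'b'): {2}}
delta, start, acc = nfa_to_dfa(nfa, 0, {2}, "ab")
print(len({S for (S, _) in delta}))  # 4 subset states reached
```

For this 3-state NFA only 4 of the 2³ = 8 possible subsets are reachable; the exponential blow-up is a worst case, not the typical case.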
Minimization and Equivalence
Every regular language has a unique minimal DFA (up to isomorphism), which can be computed in O(n log n) time using Hopcroft's algorithm. Minimal automata matter in practice because they use the fewest states, and hence the least memory, of any DFA for the language. The Myhill-Nerode theorem provides the theoretical foundation: the number of states in the minimal DFA equals the number of equivalence classes of the right-invariant relation x ≡_L y, which holds iff xz ∈ L ⇔ yz ∈ L for every string z.
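Minimization can be illustrated with partition refinement. The sketch below implements the simpler Moore-style O(n²) refinement rather than Hopcroft's O(n log n) algorithm, which needs a more careful worklist; the DFA and the name `minimize` are illustrative:

```python
def minimize(states, alphabet, delta, accepting):
    """Moore-style partition refinement: start from the accepting /
    non-accepting split and split blocks until no symbol distinguishes
    two states in the same block. Returns the equivalence classes."""
    partition = [p for p in (set(accepting), states - set(accepting)) if p]
    changed = True
    while changed:
        changed = False
        new_partition = []
        for block in partition:
            # Group states by which block each symbol sends them to.
            sig = {}
            for q in block:
                key = tuple(
                    next(i for i, b in enumerate(partition) if delta[(q, a)] in b)
                    for a in alphabet
                )
                sig.setdefault(key, set()).add(q)
            new_partition.extend(sig.values())
            if len(sig) > 1:
                changed = True
        partition = new_partition
    return partition

# DFA with a redundant state: states 1 and 2 behave identically.
delta = {(0, 'a'): 1, (0, 'b'): 2, (1, 'a'): 3, (1, 'b'): 3,
         (2, 'a'): 3, (2, 'b'): 3, (3, 'a'): 3, (3, 'b'): 3}
blocks = minimize({0, 1, 2, 3}, "ab", delta, {3})
print(sorted(sorted(b) for b in blocks))  # [[0], [1, 2], [3]]
```

Each resulting block is one Myhill-Nerode equivalence class, so the number of blocks is the state count of the minimal DFA (here three).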
Finite-State Morphology
Kimmo Koskenniemi's two-level morphology (1983) and the Xerox finite-state tools developed by Beesley and Karttunen pioneered the use of finite-state automata for morphological analysis. In these systems, the mapping between surface forms and underlying lexical representations is modeled as a regular relation implemented by finite-state transducers. A single transducer can encode the entire morphology of a language, handling inflection, derivation, and compounding in a unified framework.
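A transducer of this kind can be illustrated in miniature. The toy machine below maps a lexical form with a morpheme boundary to a surface form, implementing English e-insertion ("fox+s" → "foxes"); it is a hand-built illustration, not the Xerox lexc/xfst formalism:

```python
# Toy finite-state transducer: transitions map
# (state, input_char) -> (next_state, output_string).
trans = {
    ('start', 'x'):    ('after_x', 'x'),
    ('after_x', '+'):  ('boundary', ''),   # morpheme boundary: no output
    ('boundary', 's'): ('start', 'es'),    # e-insertion after x
    ('after_x', 's'):  ('start', 's'),
}

def transduce(w: str) -> str:
    state, out = 'start', []
    for c in w:
        if (state, c) in trans:
            state, piece = trans[(state, c)]
            out.append(piece)
        else:                       # default: copy the character through
            state = 'start'
            out.append('' if c == '+' else c)
    return ''.join(out)

print(transduce('fox+s'))   # 'foxes'
print(transduce('cat+s'))   # 'cats'
```

Real systems compose many such rule transducers with a lexicon transducer into one machine, which is what lets a single transducer cover a whole morphology.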
Applications in NLP
Finite-state automata pervade computational linguistics. Tokenizers use DFAs to segment text into words and punctuation. Named entity recognizers employ finite-state patterns to identify dates, numbers, and proper names. Phonological rule systems are compiled into cascades of finite-state transducers. Even in the era of neural NLP, finite-state methods remain essential for preprocessing pipelines and for domains where interpretability, speed, and guaranteed correctness are paramount.
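The tokenization case can be made concrete with regular patterns. The sketch below uses Python's `re` module for brevity; conceptually each alternative compiles to a finite-state pattern, though CPython's engine is a backtracking matcher rather than a true DFA. The pattern names and token classes are illustrative:

```python
import re

# One alternative per token class; named groups label the match.
TOKEN = re.compile(r"(?P<WORD>[A-Za-z]+)|(?P<NUM>\d+)|(?P<PUNCT>[^\w\s])")

def tokenize(text):
    """Segment text into (class, surface) pairs, skipping whitespace."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

print(tokenize("Call me at 5pm."))
# [('WORD', 'Call'), ('WORD', 'me'), ('WORD', 'at'),
#  ('NUM', '5'), ('WORD', 'pm'), ('PUNCT', '.')]
```

Production lexers (e.g. those generated by flex) compile exactly this kind of specification into a single DFA that scans the input in one left-to-right pass.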
DFA recognition runs in O(n) time and O(1) extra space, an efficiency unmatched by more powerful models. Combined with closure under all Boolean operations (union, intersection, complement) and the decidability of the standard decision problems (membership, emptiness, equivalence), this makes finite-state technology an indispensable tool in the computational linguist's toolkit.
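Closure under intersection, for instance, follows from the product construction, which can be sketched as follows (the DFAs and the name `intersect` are illustrative):

```python
from itertools import product

def intersect(d1, s1, f1, d2, s2, f2, alphabet):
    """Product construction: the intersection DFA runs both machines in
    lockstep; a pair state is accepting iff both components accept."""
    states1 = {q for (q, _) in d1}
    states2 = {q for (q, _) in d2}
    delta = {((q1, q2), a): (d1[(q1, a)], d2[(q2, a)])
             for (q1, q2) in product(states1, states2)
             for a in alphabet}
    accepting = {(q1, q2) for q1 in f1 for q2 in f2}
    return delta, (s1, s2), accepting

# d1 accepts an even number of 'a's; d2 accepts strings ending in 'b'.
d1 = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1}
d2 = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}

def run(delta, q, acc, w):
    for c in w:
        q = delta[(q, c)]
    return q in acc

delta, start, acc = intersect(d1, 0, {0}, d2, 0, {1}, "ab")
print(run(delta, start, acc, "aab"))  # even a's AND ends in b -> True
print(run(delta, start, acc, "ab"))   # odd a's -> False
```

Union and complement work analogously: complement just swaps accepting and non-accepting states of a total DFA, and union uses the same product with a disjunctive acceptance condition.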