Formal language identification, also known as grammatical inference or inductive inference of formal languages, studies the problem of learning a grammar from data. The field asks: given a stream of examples from an unknown language drawn from a specified class, can a learning algorithm converge to a correct grammar? This question, first systematically investigated by E. Mark Gold in 1967, has developed into a rich research area with multiple learning paradigms, positive and negative results for different language classes, and direct applications to grammar induction, language acquisition modeling, and NLP.
Learning Paradigms
The field encompasses several distinct learning frameworks, each modeling different assumptions about the data available to the learner and the criterion for success. These paradigms differ in the type of input (positive examples only, positive and negative examples, membership queries, equivalence queries), the success criterion (exact identification in the limit, approximate identification, polynomial-time convergence), and the complexity measure (e.g., number of examples or queries, or computation time).
Identification in the limit from text: learn from positive examples only (Gold, 1967)
Identification from informant: learn from both positive and negative examples
Query learning: learn from membership and equivalence queries (Angluin, 1987)
PAC learning: approximately correct with high probability (Valiant, 1984)
Stochastic identification: learn from examples drawn according to a probability distribution, with convergence required with probability one
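The identification-in-the-limit criterion can be made concrete with a minimal sketch. Below, a learner enumerates a fixed hypothesis class and, after each example, conjectures the first hypothesis consistent with everything seen so far; the learner succeeds if its conjectures eventually stabilize on a correct grammar. The toy hypothesis class and names are illustrative assumptions, not a standard benchmark:

```python
# Sketch of Gold-style identification in the limit from text (positive data).
# After each example, conjecture the first language in a fixed enumeration
# that contains every string seen so far.

def limit_learner(hypotheses, text):
    """hypotheses: list of (name, membership_fn) in a fixed enumeration;
    text: iterable of positive examples. Yields one conjecture per example."""
    seen = set()
    for w in text:
        seen.add(w)
        for name, member in hypotheses:
            if all(member(s) for s in seen):
                yield name
                break

# Toy class over the alphabet {a}: bounded languages plus a*.
hyps = [
    ("a<=1", lambda w: w in {"", "a"}),
    ("a<=2", lambda w: set(w) <= {"a"} and len(w) <= 2),
    ("a*",   lambda w: set(w) <= {"a"}),
]
conjectures = list(limit_learner(hyps, ["a", "aa", "aa", "aa"]))
print(conjectures)  # ['a<=1', 'a<=2', 'a<=2', 'a<=2']
```

After seeing "aa" the learner settles on a correct hypothesis and never changes its mind again, which is exactly the convergence Gold's criterion demands.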
Key Results
The landscape of learnability results is rich and nuanced. Gold's theorem shows that no superfinite class (one containing every finite language plus at least one infinite language) is identifiable in the limit from text; since every level of the Chomsky hierarchy is superfinite, none of these classes is learnable from positive data alone. However, many practically relevant subclasses are learnable. Angluin's L* algorithm identifies any regular language exactly using membership and equivalence queries. The class of k-reversible languages is identifiable from text for each fixed k, as are locally testable languages and pattern languages. For context-free grammars, Clark's distributional learning algorithms can identify certain subclasses from positive data using substitutability principles.
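The flavor of learning from text can be seen in the zero-reversible (k = 0) case. The sketch below, loosely after Angluin's construction and with an invented sample, builds a prefix tree acceptor from positive examples, merges all final states into one, and then repeatedly merges states that violate forward or backward determinism; the resulting automaton accepts the smallest zero-reversible language containing the sample:

```python
# Sketch of zero-reversible inference from positive data: prefix tree
# acceptor + state merging until the automaton is deterministic in both
# directions with a single final state.

def learn_zero_reversible(sample):
    prefixes = {""}
    for w in sample:
        for i in range(1, len(w) + 1):
            prefixes.add(w[:i])
    states = sorted(prefixes)
    idx = {p: i for i, p in enumerate(states)}
    parent = list(range(len(states)))       # union-find over states

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    edges = [(idx[p[:-1]], p[-1], idx[p]) for p in states if p]
    finals = [idx[w] for w in sample]
    for f in finals[1:]:                    # one final state overall
        union(finals[0], f)

    while True:                             # close under 0-reversibility
        merged, fwd, bwd = False, {}, {}
        for s0, a, d0 in edges:
            s, d = find(s0), find(d0)
            if (s, a) in fwd and find(fwd[(s, a)]) != d:
                union(fwd[(s, a)], d)       # forward nondeterminism
                merged = True
            fwd[(s, a)] = find(d0)
            s, d = find(s0), find(d0)
            if (a, d) in bwd and find(bwd[(a, d)]) != s:
                union(bwd[(a, d)], s)       # reverse nondeterminism
                merged = True
            bwd[(a, d)] = find(s0)
        if not merged:
            break

    delta = {(find(s), a): find(d) for s, a, d in edges}
    start, accept = find(idx[""]), find(finals[0])

    def accepts(w):
        q = start
        for c in w:
            if (q, c) not in delta:
                return False
            q = delta[(q, c)]
        return q == accept
    return accepts

accepts = learn_zero_reversible(["a", "aa", "aaa"])
print(accepts(""), accepts("aaaaa"), accepts("b"))  # True True False
```

On the sample {a, aa, aaa} all states collapse into a single accepting state with a self-loop, so the learner generalizes to a*, the smallest zero-reversible language containing the data.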
Alexander Clark's work on distributional learning draws on Zellig Harris's distributional hypothesis to learn context-free grammars from positive data. The key idea is that syntactic categories can be identified by their distribution — the set of contexts in which they can appear. By identifying substitutable strings (strings that appear in the same set of contexts), the learner can infer non-terminal categories and production rules. This approach has yielded provably correct algorithms for learning several classes of context-free languages from positive data alone, partially circumventing Gold's negative result.
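The substitutability heuristic can be illustrated with a toy example. The function below, with an invented name and corpus, records every context (l, r) in which each substring occurs and groups substrings whose context sets coincide; these groups are the candidate syntactic categories:

```python
# Toy illustration of substitutability: group substrings that occur in
# exactly the same set of contexts (l, r) across a small corpus.
from collections import defaultdict

def context_classes(corpus):
    contexts = defaultdict(set)             # substring -> set of (l, r)
    for sent in corpus:
        words = sent.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                seg = " ".join(words[i:j])
                ctx = (" ".join(words[:i]), " ".join(words[j:]))
                contexts[seg].add(ctx)
    by_context = defaultdict(set)           # context set -> substrings
    for seg, ctxs in contexts.items():
        by_context[frozenset(ctxs)].add(seg)
    return [c for c in by_context.values() if len(c) > 1]

corpus = ["the cat sleeps", "the dog sleeps", "the cat eats", "the dog eats"]
classes = context_classes(corpus)
# Among the classes: {'cat', 'dog'} and {'sleeps', 'eats'}, candidate
# noun and verb categories respectively.
```

Each recovered class corresponds to a candidate non-terminal; Clark's algorithms go further by turning such congruence classes into production rules with correctness guarantees for restricted language classes.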
Applications and Modern Developments
Formal language identification has direct applications in computational linguistics. The ALERGIA algorithm and its descendants learn probabilistic finite automata from positive data, applicable to language modeling and phonological rule induction. The RPNI (Regular Positive and Negative Inference) algorithm learns DFAs from positive and negative examples. Bayesian approaches to grammar induction use prior distributions that encode MDL-like preferences for simpler grammars. These algorithms have been applied to morphological paradigm learning, phonotactic constraint discovery, and the induction of syntactic grammars from raw text.
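The core idea of RPNI-style inference can be sketched in a few lines. The simplified version below (not the full red-blue algorithm; the merge order and sample are illustrative) builds a prefix tree acceptor from the positive sample, then greedily tries to merge state pairs, keeping a merge only if the determinized ("folded") automaton still accepts every positive and rejects every negative example:

```python
# Simplified RPNI-style DFA inference from positive and negative examples.

def rpni(positive, negative):
    prefixes = {""}
    for w in positive:
        for i in range(1, len(w) + 1):
            prefixes.add(w[:i])
    states = sorted(prefixes, key=lambda p: (len(p), p))
    idx = {p: i for i, p in enumerate(states)}
    edges = [(idx[p[:-1]], p[-1], idx[p]) for p in states if p]
    finals = {idx[w] for w in positive}

    def find(par, x):
        while par[x] != x:
            x = par[x]
        return x

    def fold(par):
        # Merge transition targets until the quotient is deterministic.
        while True:
            delta, merged = {}, False
            for s, a, d in edges:
                s, d = find(par, s), find(par, d)
                t = delta.get((s, a))
                if t is not None and find(par, t) != d:
                    par[find(par, t)] = d
                    merged = True
                delta[(s, a)] = d
            if not merged:
                return delta

    def consistent(par, delta):
        acc = {find(par, f) for f in finals}
        def run(w):
            q = find(par, 0)
            for c in w:
                q = delta.get((q, c))
                if q is None:
                    return None
                q = find(par, q)
            return q
        return (all(run(w) in acc for w in positive)
                and all(run(w) not in acc for w in negative))

    parent = list(range(len(states)))
    for i in range(1, len(states)):          # try merging each state ...
        if find(parent, i) != i:
            continue
        for j in range(i):                   # ... into an earlier one
            if find(parent, j) != j:
                continue
            trial = parent[:]
            trial[i] = j
            delta = fold(trial)
            if consistent(trial, delta):     # keep only safe merges
                parent = trial
                break

    delta, acc = fold(parent), {find(parent, f) for f in finals}
    def accepts(w):
        q = find(parent, 0)
        for c in w:
            q = delta.get((q, c))
            if q is None:
                return False
            q = find(parent, q)
        return q in acc
    return accepts

# Infer "odd number of a's" from a few labeled strings:
accepts = rpni(positive=["a", "aaa"], negative=["aa", "aaaa"])
print(accepts("aaaaa"), accepts("aaaa"))  # True False
```

The negative examples are what license generalization here: without them every merge would be consistent and the automaton would collapse to accept everything, which is precisely the failure mode Gold's theorem predicts for positive-only data.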
The field continues to develop, with recent work connecting formal learnability to neural network training. Questions about whether neural language models implicitly perform grammatical inference, what formal language classes they can learn in practice, and how their inductive biases compare to those of classical grammatical inference algorithms are active areas of research. Formal language identification provides the theoretical framework for addressing these questions rigorously, connecting classical computability theory to contemporary machine learning.