
Formal Language Identification

Formal language identification studies the conditions under which a learning algorithm can converge to the correct grammar from a stream of linguistic data, encompassing identification in the limit, query learning, and probabilistic identification.

∀L ∈ C, ∀ text T for L: ∃n₀ ∀n≥n₀ : φ(T[n]) = G_L

That is: for every language L in the class C and every text T for L (an enumeration of exactly the strings of L), there is a point n₀ after which the learner φ, given any finite prefix T[n], always outputs the same correct grammar G_L.

Formal language identification, also known as grammatical inference or inductive inference of formal languages, studies the problem of learning a grammar from data. The field asks: given a stream of examples from an unknown language drawn from a specified class, can a learning algorithm converge to a correct grammar? This question, first systematically investigated by E. Mark Gold in 1967, has developed into a rich research area with multiple learning paradigms, positive and negative results for different language classes, and direct applications to grammar induction, language acquisition modeling, and NLP.

Learning Paradigms

The field encompasses several distinct learning frameworks, each modeling different assumptions about the data and success criteria available to the learner. These paradigms differ in the type of input (positive examples, positive and negative examples, membership queries, equivalence queries), the success criterion (exact identification in the limit, approximate identification, polynomial-time convergence), and the measure of complexity.

Major Learning Paradigms

Identification in the limit from text: learn from positive examples only (Gold, 1967)
Identification from informant: learn from positive and negative examples
Query learning: learn from membership and equivalence queries (Angluin, 1987)
PAC learning: approximately correct with high probability (Valiant, 1984)
Identification with probability: learn from stochastic examples
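A concrete instance of Gold's framework, shared by the paradigms above: the class of all finite languages is identifiable in the limit from text by the conservative learner that conjectures exactly the strings seen so far. A minimal Python sketch (the target language and text prefix are illustrative):

```python
# For any finite target language, every string eventually appears in the
# text, so this learner's conjecture converges and never changes again.

def learner(prefix):
    """Conjecture a grammar (here: the language itself) from a finite text prefix."""
    return frozenset(prefix)

def identifies_in_the_limit(target, text):
    """Return the index of the first conjecture that equals `target`
    and is never revised afterwards, or None if there is no such point."""
    conjectures = [learner(text[:n]) for n in range(1, len(text) + 1)]
    for n0, g in enumerate(conjectures):
        if g == target and all(c == g for c in conjectures[n0:]):
            return n0
    return None

target = frozenset({"a", "ab", "abb"})
# A text for L is an infinite enumeration of L (repetitions allowed);
# here a finite prefix long enough to exhibit convergence.
text = ["a", "a", "ab", "a", "abb", "ab", "a"]
print(identifies_in_the_limit(target, text))  # → 4 (converges once "abb" appears)
```

The same learner fails on any superfinite class: on a text for an infinite language its conjectures never stabilize, which is the intuition behind Gold's negative result below.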

Key Results

The landscape of learnability results is rich and nuanced. Gold's theorem shows that superfinite classes (those containing every finite language and at least one infinite language, a property of every level of the Chomsky hierarchy) are not identifiable in the limit from text. However, many practically relevant subclasses are learnable. Angluin's L* algorithm identifies any regular language exactly using membership and equivalence queries. The class of k-reversible automata is identifiable from text for each k ≥ 0; locally testable languages and pattern languages are likewise learnable from positive data. For context-free grammars, Clark's distributional learning algorithms can identify certain subclasses from positive data using substitutability principles.
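As an illustration of state-merging inference, the following is a minimal Python sketch in the spirit of Angluin's zero-reversible learner: build a prefix-tree acceptor from a positive sample, merge all final states, and repair determinism violations in both directions until a fixpoint. The sample and alphabet are illustrative, and the sketch omits the efficiency refinements of the published algorithm:

```python
from collections import defaultdict

def learn_zero_reversible(sample):
    """State merging for zero-reversible languages: a zero-reversible DFA is
    deterministic, has a single final state, and its reverse is deterministic."""
    # Prefix-tree acceptor: states are prefixes of the sample strings.
    states = {""}
    trans = {}
    for w in sample:
        for i in range(len(w)):
            states.add(w[:i + 1])
            trans[(w[:i], w[i])] = w[:i + 1]

    # Union-find over states; merged states form the learned automaton's blocks.
    parent = {s: s for s in states}
    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s
    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
        return True

    # Merge all final states into one.
    finals = list(sample)
    for f in finals[1:]:
        union(finals[0], f)

    # Repair violations until no block has two distinct same-symbol
    # successors (determinism) or predecessors (reverse determinism).
    changed = True
    while changed:
        changed = False
        succ, pred = defaultdict(set), defaultdict(set)
        for (p, a), q in trans.items():
            succ[(find(p), a)].add(find(q))
            pred[(find(q), a)].add(find(p))
        for group in list(succ.values()) + list(pred.values()):
            g = sorted(group)
            for other in g[1:]:
                if union(g[0], other):
                    changed = True

    delta = {(find(p), a): find(q) for (p, a), q in trans.items()}
    return find(""), {find(f) for f in finals}, delta

def accepts(start, final_blocks, delta, w):
    q = start
    for a in w:
        q = delta.get((q, a))
        if q is None:
            return False
    return q in final_blocks

# Positive sample from (ab)*; the merged automaton generalizes beyond it.
start, final_blocks, delta = learn_zero_reversible(["", "ab", "abab"])
print(accepts(start, final_blocks, delta, "ababab"))  # True
print(accepts(start, final_blocks, delta, "aba"))     # False
```

On this sample the learner merges the prefix tree down to a two-state automaton accepting exactly (ab)*, showing how a state-merging learner generalizes from finitely many positive examples.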

Distributional Learning

Alexander Clark's work on distributional learning draws on Zellig Harris's distributional hypothesis to learn context-free grammars from positive data. The key idea is that syntactic categories can be identified by their distribution — the set of contexts in which they can appear. By identifying substitutable strings (strings that appear in the same set of contexts), the learner can infer non-terminal categories and production rules. This approach has yielded provably correct algorithms for learning several classes of context-free languages from positive data alone, partially circumventing Gold's negative result.
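The core computation can be sketched directly: collect each substring's set of contexts from a corpus, and treat two strings as (weakly) substitutable when they share at least one context. A toy Python sketch, assuming a hypothetical word-tokenized corpus:

```python
from collections import defaultdict

def contexts(corpus):
    """Map each non-empty contiguous substring to its set of contexts (l, r),
    where l + substring + r is a sentence of the corpus."""
    ctx = defaultdict(set)
    for sent in corpus:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, n + 1):
                ctx[tuple(sent[i:j])].add((tuple(sent[:i]), tuple(sent[j:])))
    return ctx

# Hypothetical toy corpus of word-tokenized sentences.
corpus = [("the", "dog", "runs"), ("the", "cat", "runs"), ("the", "dog", "sleeps")]
ctx = contexts(corpus)

# Weak substitutability: two strings sharing at least one context are
# conjectured to belong to the same syntactic congruence class.
def substitutable(u, v, ctx):
    return bool(ctx[tuple(u)] & ctx[tuple(v)])

print(substitutable(["dog"], ["cat"], ctx))   # True: both occur in ("the", _, "runs")
print(substitutable(["dog"], ["runs"], ctx))  # False: no shared context
```

In Clark and Eyraud's substitutable context-free languages, weak substitutability is guaranteed to coincide with the true syntactic congruence, which is what makes the inferred classes safe to use as non-terminals.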

Applications and Modern Developments

Formal language identification has direct applications in computational linguistics. The ALERGIA algorithm and its descendants learn probabilistic finite automata from positive data, applicable to language modeling and phonological rule induction. The RPNI (Regular Positive and Negative Inference) algorithm learns DFAs from positive and negative examples. Bayesian approaches to grammar induction use prior distributions that encode MDL-like preferences for simpler grammars. These algorithms have been applied to morphological paradigm learning, phonotactic constraint discovery, and the induction of syntactic grammars from raw text.
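ALERGIA-style algorithms start from a frequency prefix-tree acceptor, whose counts supply the empirical probabilities that the subsequent merging phase compares. A Python sketch of that initial construction (the sample is illustrative; the compatibility testing and merging phases are omitted):

```python
from collections import defaultdict

def frequency_pta(sample):
    """Build the frequency prefix-tree acceptor that stochastic state-merging
    algorithms such as ALERGIA start from: each state is a prefix, annotated
    with how often strings pass through it and how often they end there."""
    through = defaultdict(int)  # times a sample string reaches the prefix
    ends = defaultdict(int)     # times a sample string ends at the prefix
    trans = defaultdict(int)    # (prefix, symbol) -> traversal count
    for w in sample:
        for i in range(len(w)):
            through[w[:i]] += 1
            trans[(w[:i], w[i])] += 1
        through[w] += 1
        ends[w] += 1
    return through, ends, trans

sample = ["ab", "ab", "abab", "", "ab"]
through, ends, trans = frequency_pta(sample)
# Empirical probability of stopping at state "ab":
print(ends["ab"] / through["ab"])  # 3 of the 4 strings reaching "ab" end there → 0.75
```

ALERGIA then merges states whose empirical stopping and transition frequencies are statistically compatible (via a Hoeffding-style bound), yielding a probabilistic automaton rather than a bare DFA.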

The field continues to develop, with recent work connecting formal learnability to neural network training. Questions about whether neural language models implicitly perform grammatical inference, what formal language classes they can learn in practice, and how their inductive biases compare to those of classical grammatical inference algorithms are active areas of research. Formal language identification provides the theoretical framework for addressing these questions rigorously, connecting classical computability theory to contemporary machine learning.

References

  1. Gold, E. M. (1967). Language identification in the limit. Information and Control, 10(5), 447–474. https://doi.org/10.1016/S0019-9958(67)91165-5
  2. Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation, 75(2), 87–106. https://doi.org/10.1016/0890-5401(87)90052-6
  3. de la Higuera, C. (2010). Grammatical Inference: Learning Automata and Grammars. Cambridge University Press. https://doi.org/10.1017/CBO9781139194655
  4. Clark, A., & Eyraud, R. (2007). Polynomial identification in the limit of substitutable context-free languages. Journal of Machine Learning Research, 8, 1725–1745.
