Grammar induction, also known as grammar learning or grammatical inference, is the problem of automatically inferring a grammar from data without explicit syntactic annotation. In its strongest form (unsupervised grammar induction), the learner receives only raw sentences and must discover the syntactic rules governing the language. In semi-supervised settings, limited annotation or linguistic knowledge guides the learning process. Grammar induction is both a practical goal (reducing the need for expensive treebank annotation) and a scientific question (understanding how children acquire syntactic knowledge).
Approaches
The classical approach to unsupervised PCFG induction is the inside-outside algorithm (Baker, 1979), an instance of expectation-maximization (EM) applied to PCFGs. Starting from a random or uniform initialization, the algorithm alternates between computing expected rule counts and re-estimating rule probabilities:

E-step: compute expected rule counts using inside and outside probabilities
  β(A, i, j) = P(w_i … w_j | A)  (inside probability)
  α(A, i, j) = P(w_1 … w_{i−1}, A, w_{j+1} … w_n)  (outside probability)
M-step: re-estimate rule probabilities
  q(A → BC) = E[count(A → BC)] / E[count(A)]
Iterate until convergence.

However, this approach is known to converge to poor local optima and does not recover linguistically meaningful grammars from raw text without additional inductive biases.
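The inside probabilities above can be computed by dynamic programming over spans. A minimal sketch, assuming a toy hand-written PCFG in Chomsky normal form (the grammar and its probabilities are illustrative, not from the text):

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form; rules and probabilities are illustrative.
# Binary rules A -> B C and lexical rules A -> w.
binary = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 1.0,
    ("VP", ("V", "NP")): 1.0,
}
lexical = {
    ("Det", "the"): 1.0,
    ("N", "dog"): 0.5,
    ("N", "cat"): 0.5,
    ("V", "saw"): 1.0,
}

def inside(sentence):
    """Compute beta(A, i, j) = P(w_i .. w_j | A) bottom-up over spans."""
    n = len(sentence)
    beta = defaultdict(float)            # keys: (A, i, j), 0-based, j inclusive
    for i, w in enumerate(sentence):     # length-1 spans: lexical rules
        for (A, word), p in lexical.items():
            if word == w:
                beta[(A, i, i)] += p
    for span in range(2, n + 1):         # longer spans: binary rules
        for i in range(n - span + 1):
            j = i + span - 1
            for (A, (B, C)), p in binary.items():
                for k in range(i, j):    # split point between B and C
                    beta[(A, i, j)] += p * beta[(B, i, k)] * beta[(C, k + 1, j)]
    return beta

sent = "the dog saw the cat".split()
beta = inside(sent)
print(beta[("S", 0, len(sent) - 1)])    # P(sentence | S) = 0.25 for this toy grammar
```

In full inside-outside, the outside pass reuses this chart to accumulate the expected rule counts that the M-step normalizes.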
Inductive Biases and Constraints
Successful grammar induction requires strong inductive biases. Important biases include: structural bias (preferring shorter derivations or specific branching patterns, e.g. via Bayesian priors), distributional bias (words in similar contexts should have similar syntactic behavior), and typological bias (exploiting universal tendencies such as head-directionality consistency). The Dependency Model with Valence (DMV) of Klein and Manning (2004) achieved the first successful unsupervised dependency induction for English by modeling valence: the decision to stop generating dependents in each direction is conditioned on whether the head has already generated any dependents in that direction.
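The valence mechanism can be illustrated as a generative sampler. A minimal sketch with hypothetical parameter values (the stop and child probabilities below are invented for illustration, not DMV estimates):

```python
import random

# Hypothetical DMV-style parameters: the stop probability is conditioned on
# direction and on valence, i.e. whether the head already has a dependent
# in that direction.
p_stop = {
    ("V", "left", False): 0.3,   # no dependent generated yet in this direction
    ("V", "left", True): 0.8,    # at least one dependent already generated
    ("V", "right", False): 0.2,
    ("V", "right", True): 0.7,
}
p_child = {("V", "left"): {"N": 1.0}, ("V", "right"): {"N": 1.0}}

def sample_dependents(head, direction, rng):
    """Generate dependents of `head` in one direction; each continue/stop
    decision consults the valence-conditioned stop probability."""
    deps, has_child = [], False
    while rng.random() > p_stop[(head, direction, has_child)]:
        r, acc = rng.random(), 0.0       # sample a child tag by its probability
        for tag, p in p_child[(head, direction)].items():
            acc += p
            if r < acc:
                deps.append(tag)
                break
        has_child = True
    return deps

print(sample_dependents("V", "right", random.Random(0)))
```

Conditioning on valence lets the model prefer, for example, verbs that take exactly one object over verbs that generate dependents indefinitely.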
Neural Grammar Induction
Recent work applies neural networks to grammar induction. Neural PCFG models (Kim et al., 2019) parameterize rule probabilities with neural networks over symbol embeddings rather than a discrete table, enabling parameter sharing across related rules. Compound PCFGs add a global sentence-level latent variable that conditions all rule choices, capturing inter-rule correlations that standard PCFGs miss. These models achieve among the strongest unsupervised constituency parsing results to date, with unlabeled F1 scores of roughly 55–65% on the Penn Treebank — still far below supervised parsers, but a significant advance over classical EM-based approaches.
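The core idea of the neural parameterization — scoring rules from symbol embeddings and normalizing per parent — can be sketched in plain NumPy. This is a simplified illustration, not the architecture of Kim et al. (2019); the nonterminal inventory, embedding size, and bilinear scoring function are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
NT = ["S", "NP", "VP"]       # toy nonterminal inventory (illustrative)
d = 8                         # embedding dimension (illustrative)

# Instead of a table of rule probabilities, score each rule A -> B C
# from embeddings of the symbols involved.
emb = rng.normal(size=(len(NT), d))          # one embedding per nonterminal
W = rng.normal(size=(d, d)) / np.sqrt(d)     # bilinear scoring weights

def rule_probs(parent):
    """Softmax-normalized distribution over all (B, C) child pairs for one parent."""
    a = emb[NT.index(parent)]
    scores = np.array([[a @ W @ (emb[b] + emb[c]) for c in range(len(NT))]
                       for b in range(len(NT))])
    exp = np.exp(scores - scores.max())      # stable softmax over all pairs
    return exp / exp.sum()

p = rule_probs("S")
print(p.shape, p.sum())      # (3, 3), sums to ~1.0
```

Because rules that share symbols share embeddings, probability mass generalizes across related rules — the property the text credits for the improvement over table-based PCFGs.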