Statistical parsing encompasses a broad family of approaches that use probabilistic models, estimated from annotated treebanks, to assign the most likely syntactic structure to a sentence. The fundamental insight is that syntactic ambiguity, which an unweighted grammar cannot resolve because it licenses exponentially many analyses for a typical sentence, can be handled by learning from data which structures are more likely. Statistical parsers select the parse tree that maximizes the probability (or score) assigned by a learned model.
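The selection of a highest-probability tree can be made concrete with a Viterbi variant of the CKY algorithm. The grammar, lexicon, and all probabilities below are invented toy values for illustration, not drawn from any treebank; the point is only to show how rule probabilities resolve an attachment ambiguity.

```python
import math

# Toy PCFG in Chomsky normal form (hypothetical rules and probabilities).
BINARY = {  # (B, C) -> list of (A, prob), meaning A -> B C with that prob
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 0.7)],
    ("VP", "PP"): [("VP", 0.3)],
    ("Det", "N"): [("NP", 0.6)],
    ("NP", "PP"): [("NP", 0.4)],
    ("P", "NP"): [("PP", 1.0)],
}
LEXICON = {  # word -> list of (preterminal, prob)
    "the": [("Det", 1.0)],
    "dog": [("N", 0.5)],
    "telescope": [("N", 0.5)],
    "saw": [("V", 1.0)],
    "with": [("P", 1.0)],
}

def viterbi_cky(words):
    """Fill a CKY chart keeping only the best (log-prob) analysis per label."""
    n = len(words)
    # best[i][j] maps nonterminal -> (log-prob, backpointer)
    best = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for tag, p in LEXICON[w]:
            best[i][i + 1][tag] = (math.log(p), w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for B, (pb, _) in best[i][k].items():
                    for C, (pc, _) in best[k][j].items():
                        for A, p in BINARY.get((B, C), []):
                            score = math.log(p) + pb + pc
                            if A not in best[i][j] or score > best[i][j][A][0]:
                                best[i][j][A] = (score, (k, B, C))
    return best

def backtrace(best, sym, i, j):
    """Recover the bracketed tree for the best analysis of sym over [i, j)."""
    _, bp = best[i][j][sym]
    if isinstance(bp, str):  # lexical entry: bp is the word itself
        return f"({sym} {bp})"
    k, B, C = bp
    return f"({sym} {backtrace(best, B, i, k)} {backtrace(best, C, k, j)})"

words = "the dog saw the dog with the telescope".split()
chart = viterbi_cky(words)
print(backtrace(chart, "S", 0, len(words)))
```

Under these probabilities, NP attachment of the PP wins (0.7 × 0.4 = 0.28) over VP attachment (0.3 × 0.7 = 0.21), so the parser returns the reading in which the dog has the telescope; flipping the two attachment probabilities would flip the decision.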
Lexicalized Parsing Models
P(Pi → L1…Lm Hi R1…Rn) = P(Hi | Pi, hi, ti)
    × ∏j P(Lj, lj, tlj | Pi, Hi, hi, ti)
    × ∏k P(Rk, rk, trk | Pi, Hi, hi, ti)

where
    Pi = parent nonterminal, Hi = head child
    hi = head word, ti = head tag
    Lj, Rk = left/right modifier nonterminals, with head words lj, rk and head tags tlj, trk
The most influential statistical parsing models are the lexicalized models of Collins (1997, 2003) and Charniak (1997, 2000). These extend PCFGs by conditioning rule probabilities on the head word of each constituent, capturing lexical selectional preferences (e.g., that "eat" prefers food-related objects). Collins' models generate each modifier conditioned on the head, using smoothed backoff estimation to handle data sparsity. Charniak's model uses a generative process with maximum-entropy-inspired features.
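The sparsity problem and the backoff remedy can be sketched as follows. Fully lexicalized counts like P(modifier | parent, head word, head tag) are almost always too sparse to estimate directly, so the estimate interpolates toward progressively less-conditioned backoff levels. The Witten-Bell-style weighting, the count tables, and the event names below are illustrative assumptions, not Collins' exact backoff levels or formulas.

```python
from collections import Counter

def interpolated_prob(event, contexts, counts, vocab_size=10000):
    """Interpolate estimates from most-specific to least-specific context.

    contexts: list of context keys, most specific first.
    counts[ctx]: Counter of observed events in that context.
    vocab_size: size of a hypothetical event space for the final uniform backoff.
    """
    prob, remaining = 0.0, 1.0
    for ctx in contexts:
        c = counts.get(ctx, Counter())
        total = sum(c.values())
        # Witten-Bell-style weight: trust a context more when it is well observed
        # relative to the number of distinct events seen in it.
        lam = total / (total + len(c)) if total else 0.0
        level_p = c[event] / total if total else 0.0
        prob += remaining * lam * level_p
        remaining *= 1.0 - lam
    # whatever mass remains goes to a uniform distribution over the event space
    return prob + remaining / vocab_size

# Hypothetical counts: modifier events conditioned on (parent, head word, head
# tag), backing off to (parent, head tag), then (parent,) alone.
counts = {
    ("VP", "eat", "VB"): Counter({"NP(food)": 3}),
    ("VP", "VB"): Counter({"NP(food)": 30, "PP(with)": 10}),
    ("VP",): Counter({"NP(food)": 100, "PP(with)": 80, "ADVP": 20}),
}
contexts = [("VP", "eat", "VB"), ("VP", "VB"), ("VP",)]
print(interpolated_prob("NP(food)", contexts, counts))
```

An event seen at every backoff level ("NP(food)" after "eat") gets a high probability, while one seen only in the least-specific context ("ADVP") still receives some mass rather than zero, which is exactly what the smoothing is for.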
Latent-Variable Parsing
An alternative to lexicalization is to automatically refine nonterminal categories using latent variables. The Berkeley Parser (Petrov et al., 2006) starts with the bare treebank grammar and iteratively splits each nonterminal into subcategories using EM, then merges back the splits that contribute least to the training likelihood. This split-merge procedure learns fine-grained distinctions (e.g., splitting NP into subcategories for pronouns, proper nouns, and common noun phrases) without hand-engineering. The resulting latent-variable PCFG achieves over 90% F1, rivaling lexicalized models with a simpler and faster architecture.
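The split step can be sketched in isolation. Each nonterminal is duplicated, every rule is expanded over the split variants with its probability shared among them plus a small random perturbation to break symmetry (otherwise EM could never drive the two subcategories apart), and the result is renormalized. The grammar encoding and the perturbation scheme are illustrative assumptions; the EM re-estimation and merge steps are omitted.

```python
import random
from collections import defaultdict

def split_symbol(sym):
    """Split one nonterminal into two subcategory variants."""
    return [f"{sym}-0", f"{sym}-1"]

def split_grammar(binary, lexical, noise=0.01, seed=0):
    """One split step of split-merge refinement (Petrov et al., 2006, sketched).

    binary:  {(A, B, C): p} for rules A -> B C
    lexical: {(A, word): p} for rules A -> word
    Returns the split rule tables, renormalized per split parent.
    """
    rng = random.Random(seed)
    new_binary, new_lexical = {}, {}
    for (A, B, C), p in binary.items():
        for A2 in split_symbol(A):
            for B2 in split_symbol(B):
                for C2 in split_symbol(C):
                    # each parent variant spreads its mass over the four
                    # child-variant combinations, slightly perturbed
                    new_binary[(A2, B2, C2)] = p / 4 * (1 + rng.uniform(-noise, noise))
    for (A, w), p in lexical.items():
        for A2 in split_symbol(A):
            new_lexical[(A2, w)] = p * (1 + rng.uniform(-noise, noise))
    # renormalize so the rules of each split parent again sum to one
    totals = defaultdict(float)
    for (A2, _, _), p in new_binary.items():
        totals[A2] += p
    for (A2, _), p in new_lexical.items():
        totals[A2] += p
    for k in new_binary:
        new_binary[k] /= totals[k[0]]
    for k in new_lexical:
        new_lexical[k] /= totals[k[0]]
    return new_binary, new_lexical
```

In the full procedure this split is followed by EM over the treebank (summing over latent subcategory assignments) and by a merge pass that undoes the least useful splits, keeping the grammar compact across the five or six split-merge rounds the Berkeley Parser typically runs.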
Evaluation and Impact
Statistical parsers are evaluated on the Penn Treebank WSJ section 23 using labeled precision, recall, and F1 of bracketed constituents. Key milestones include Collins (1997) at 88.1% F1, Charniak (2000) at 89.5%, the Berkeley Parser at 90.1%, and Charniak and Johnson (2005) with discriminative reranking at 91.0%. These models dominated parsing research for over a decade and established the methodology of treebank-trained probabilistic parsing that continues to underpin modern neural approaches.
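The bracket-scoring metric itself is simple to state in code: collect the labeled spans (label, start, end) of each tree and compare the sets. This is a simplified PARSEVAL-style sketch; the standard evalb tool additionally applies normalizations (e.g., ignoring punctuation and certain label distinctions) that are omitted here, and the tree encoding is an illustrative assumption.

```python
def spans(tree, start=0):
    """Collect labeled constituent spans from a tree.

    tree: (label, children), where each child is a subtree or a word string.
    Returns (set of (label, i, j) spans, span length in words).
    """
    label, children = tree
    out, i = set(), start
    for child in children:
        if isinstance(child, str):
            i += 1  # a word occupies one position
        else:
            sub, length = spans(child, i)
            out |= sub
            i += length
    out.add((label, start, i))
    return out, i - start

def parseval(gold, pred):
    """Labeled precision, recall, and F1 over matched bracket spans."""
    g, _ = spans(gold)
    p, _ = spans(pred)
    match = len(g & p)
    prec, rec = match / len(p), match / len(g)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Hypothetical example: the prediction flattens the object NP, so one gold
# bracket goes unmatched (precision 1.0, recall 0.75).
gold = ("S", [("NP", ["she"]), ("VP", ["ate", ("NP", ["the", "cake"])])])
pred = ("S", [("NP", ["she"]), ("VP", ["ate", "the", "cake"])])
print(parseval(gold, pred))
```

Reported treebank-parsing numbers are this F1 aggregated over all sentences of the test section, with matches and bracket counts summed across the corpus before computing precision and recall.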