Statistical parsing encompasses a broad family of approaches that use probabilistic models, estimated from annotated treebanks, to assign the most likely syntactic structure to a sentence. The fundamental insight is that syntactic ambiguity, which an unweighted grammar cannot resolve because it licenses exponentially many analyses for a typical sentence, can be handled by learning from data which structures are more likely. Statistical parsers select the parse tree that maximizes the probability (or score) assigned by a learned model.
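The selection of a highest-probability tree can be made concrete with a Viterbi variant of the CKY algorithm. The grammar, lexicon, and all probabilities below are invented toy values for illustration, not drawn from any treebank; the point is only to show how rule probabilities resolve an attachment ambiguity.

```python
import math

# Toy PCFG in Chomsky normal form (hypothetical rules and probabilities).
BINARY = {  # (B, C) -> list of (A, prob), meaning A -> B C with that prob
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 0.7)],
    ("VP", "PP"): [("VP", 0.3)],
    ("Det", "N"): [("NP", 0.6)],
    ("NP", "PP"): [("NP", 0.4)],
    ("P", "NP"): [("PP", 1.0)],
}
LEXICON = {  # word -> list of (preterminal, prob)
    "the": [("Det", 1.0)],
    "dog": [("N", 0.5)],
    "telescope": [("N", 0.5)],
    "saw": [("V", 1.0)],
    "with": [("P", 1.0)],
}

def viterbi_cky(words):
    """Fill a CKY chart keeping only the best (log-prob) analysis per label."""
    n = len(words)
    # best[i][j] maps nonterminal -> (log-prob, backpointer)
    best = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for tag, p in LEXICON[w]:
            best[i][i + 1][tag] = (math.log(p), w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for B, (pb, _) in best[i][k].items():
                    for C, (pc, _) in best[k][j].items():
                        for A, p in BINARY.get((B, C), []):
                            score = math.log(p) + pb + pc
                            if A not in best[i][j] or score > best[i][j][A][0]:
                                best[i][j][A] = (score, (k, B, C))
    return best

def backtrace(best, sym, i, j):
    """Recover the bracketed tree for the best analysis of sym over [i, j)."""
    _, bp = best[i][j][sym]
    if isinstance(bp, str):  # lexical entry: bp is the word itself
        return f"({sym} {bp})"
    k, B, C = bp
    return f"({sym} {backtrace(best, B, i, k)} {backtrace(best, C, k, j)})"

words = "the dog saw the dog with the telescope".split()
chart = viterbi_cky(words)
print(backtrace(chart, "S", 0, len(words)))
```

Under these probabilities, NP attachment of the PP wins (0.7 × 0.4 = 0.28) over VP attachment (0.3 × 0.7 = 0.21), so the parser returns the reading in which the dog has the telescope; flipping the two attachment probabilities would flip the decision.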
Lexicalized Parsing Models
P(Pi → L1…Lm Hi R1…Rn) = P(Hi | Pi, hi, ti)
    × ∏j P(Lj, lj, tlj | Pi, Hi, hi, ti)
    × ∏k P(Rk, rk, trk | Pi, Hi, hi, ti)

where
    Pi = parent nonterminal, Hi = head child
    hi = head word, ti = head tag
    Lj, Rk = left/right modifier nonterminals, with head words lj, rk and head tags tlj, trk
The most influential statistical parsing models are the lexicalized models of Collins (1997, 2003) and Charniak (1997, 2000). These extend PCFGs by conditioning rule probabilities on the head word of each constituent, capturing lexical selectional preferences (e.g., that "eat" prefers food-related objects). Collins' models generate each modifier conditioned on the head, using smoothed backoff estimation to handle data sparsity. Charniak's model uses a generative process with maximum-entropy-inspired features.
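The sparsity problem and the backoff remedy can be sketched as follows. Fully lexicalized counts like P(modifier | parent, head word, head tag) are almost always too sparse to estimate directly, so the estimate interpolates toward progressively less-conditioned backoff levels. The Witten-Bell-style weighting, the count tables, and the event names below are illustrative assumptions, not Collins' exact backoff levels or formulas.

```python
from collections import Counter

def interpolated_prob(event, contexts, counts, vocab_size=10000):
    """Interpolate estimates from most-specific to least-specific context.

    contexts: list of context keys, most specific first.
    counts[ctx]: Counter of observed events in that context.
    vocab_size: size of a hypothetical event space for the final uniform backoff.
    """
    prob, remaining = 0.0, 1.0
    for ctx in contexts:
        c = counts.get(ctx, Counter())
        total = sum(c.values())
        # Witten-Bell-style weight: trust a context more when it is well observed
        # relative to the number of distinct events seen in it.
        lam = total / (total + len(c)) if total else 0.0
        level_p = c[event] / total if total else 0.0
        prob += remaining * lam * level_p
        remaining *= 1.0 - lam
    # whatever mass remains goes to a uniform distribution over the event space
    return prob + remaining / vocab_size

# Hypothetical counts: modifier events conditioned on (parent, head word, head
# tag), backing off to (parent, head tag), then (parent,) alone.
counts = {
    ("VP", "eat", "VB"): Counter({"NP(food)": 3}),
    ("VP", "VB"): Counter({"NP(food)": 30, "PP(with)": 10}),
    ("VP",): Counter({"NP(food)": 100, "PP(with)": 80, "ADVP": 20}),
}
contexts = [("VP", "eat", "VB"), ("VP", "VB"), ("VP",)]
print(interpolated_prob("NP(food)", contexts, counts))
```

An event seen at every backoff level ("NP(food)" after "eat") gets a high probability, while one seen only in the least-specific context ("ADVP") still receives some mass rather than zero, which is exactly what the smoothing is for.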
Latent-Variable Parsing
An alternative to lexicalization is to automatically refine nonterminal categories using latent variables. The Berkeley Parser (Petrov et al., 2006) starts with the bare treebank grammar and iteratively splits each nonterminal into subcategories using EM, then merges back the splits that contribute least to the training likelihood. This split-merge procedure learns fine-grained distinctions (e.g., splitting NP into subcategories for pronouns, proper nouns, and common noun phrases) without hand-engineering. The resulting latent-variable PCFG achieves over 90% F1, rivaling lexicalized models with a simpler and faster architecture.
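The split step can be sketched in isolation. Each nonterminal is duplicated, every rule is expanded over the split variants with its probability shared among them plus a small random perturbation to break symmetry (otherwise EM could never drive the two subcategories apart), and the result is renormalized. The grammar encoding and the perturbation scheme are illustrative assumptions; the EM re-estimation and merge steps are omitted.

```python
import random
from collections import defaultdict

def split_symbol(sym):
    """Split one nonterminal into two subcategory variants."""
    return [f"{sym}-0", f"{sym}-1"]

def split_grammar(binary, lexical, noise=0.01, seed=0):
    """One split step of split-merge refinement (Petrov et al., 2006, sketched).

    binary:  {(A, B, C): p} for rules A -> B C
    lexical: {(A, word): p} for rules A -> word
    Returns the split rule tables, renormalized per split parent.
    """
    rng = random.Random(seed)
    new_binary, new_lexical = {}, {}
    for (A, B, C), p in binary.items():
        for A2 in split_symbol(A):
            for B2 in split_symbol(B):
                for C2 in split_symbol(C):
                    # each parent variant spreads its mass over the four
                    # child-variant combinations, slightly perturbed
                    new_binary[(A2, B2, C2)] = p / 4 * (1 + rng.uniform(-noise, noise))
    for (A, w), p in lexical.items():
        for A2 in split_symbol(A):
            new_lexical[(A2, w)] = p * (1 + rng.uniform(-noise, noise))
    # renormalize so the rules of each split parent again sum to one
    totals = defaultdict(float)
    for (A2, _, _), p in new_binary.items():
        totals[A2] += p
    for (A2, _), p in new_lexical.items():
        totals[A2] += p
    for k in new_binary:
        new_binary[k] /= totals[k[0]]
    for k in new_lexical:
        new_lexical[k] /= totals[k[0]]
    return new_binary, new_lexical
```

In the full procedure this split is followed by EM over the treebank (summing over latent subcategory assignments) and by a merge pass that undoes the least useful splits, keeping the grammar compact across the five or six split-merge rounds the Berkeley Parser typically runs.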
Evaluation and Impact
Statistical parsers are evaluated on the Penn Treebank WSJ section 23 using labeled precision, recall, and F1 of bracketed constituents. Key milestones include Collins (1997) at 88.1% F1, Charniak (2000) at 89.5%, the Berkeley Parser at 90.1%, and Charniak and Johnson (2005) with discriminative reranking at 91.0%. These models dominated parsing research for over a decade and established the methodology of treebank-trained probabilistic parsing that continues to underpin modern neural approaches.
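The bracket-scoring metric itself is simple to state in code: collect the labeled spans (label, start, end) of each tree and compare the sets. This is a simplified PARSEVAL-style sketch; the standard evalb tool additionally applies normalizations (e.g., ignoring punctuation and certain label distinctions) that are omitted here, and the tree encoding is an illustrative assumption.

```python
def spans(tree, start=0):
    """Collect labeled constituent spans from a tree.

    tree: (label, children), where each child is a subtree or a word string.
    Returns (set of (label, i, j) spans, span length in words).
    """
    label, children = tree
    out, i = set(), start
    for child in children:
        if isinstance(child, str):
            i += 1  # a word occupies one position
        else:
            sub, length = spans(child, i)
            out |= sub
            i += length
    out.add((label, start, i))
    return out, i - start

def parseval(gold, pred):
    """Labeled precision, recall, and F1 over matched bracket spans."""
    g, _ = spans(gold)
    p, _ = spans(pred)
    match = len(g & p)
    prec, rec = match / len(p), match / len(g)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Hypothetical example: the prediction flattens the object NP, so one gold
# bracket goes unmatched (precision 1.0, recall 0.75).
gold = ("S", [("NP", ["she"]), ("VP", ["ate", ("NP", ["the", "cake"])])])
pred = ("S", [("NP", ["she"]), ("VP", ["ate", "the", "cake"])])
print(parseval(gold, pred))
```

Reported treebank-parsing numbers are this F1 aggregated over all sentences of the test section, with matches and bracket counts summed across the corpus before computing precision and recall.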