Constituency parsing, also called phrase-structure parsing, analyzes a sentence by recursively decomposing it into nested constituents according to a context-free grammar (or extensions thereof). The output is a parse tree whose internal nodes are phrasal categories (S, NP, VP, PP, etc.) and whose leaves are the words of the sentence. This representation captures the hierarchical grouping of words into phrases and the syntactic role each phrase plays within the larger sentence.
Formal Foundations
A context-free grammar is formally a 4-tuple G = (N, T, R, S), where:

N = set of nonterminal symbols (S, NP, VP, ...)
T = set of terminal symbols (words)
R = set of production rules A → α, with A ∈ N and α a sequence over N ∪ T
S ∈ N = the start symbol
A context-free grammar defines the set of valid parse trees. Each production rule specifies how a nonterminal can be expanded into a sequence of terminals and nonterminals. Parsing is the inverse problem: given a string of words and a grammar, find one or more derivations that produce the string. The fundamental challenge is ambiguity — natural language sentences routinely admit dozens or even thousands of valid parse trees under broad-coverage grammars.
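Ambiguity can be made concrete with the classic PP-attachment sentence "I saw the man with the telescope": the prepositional phrase can modify the verb phrase (seeing with a telescope) or the noun phrase (the man who has a telescope). A minimal Python sketch — the nested-tuple tree encoding and the toy analysis are illustrative choices, not a standard format — represents the two competing trees and checks that both derive the same word string:

```python
# Two parse trees (toy analysis, for illustration) of the ambiguous sentence
# "I saw the man with the telescope", encoded as nested (label, children...) tuples.
vp_attach = ("S", ("NP", "I"),
             ("VP", ("VP", ("V", "saw"), ("NP", ("Det", "the"), ("N", "man"))),
              ("PP", ("P", "with"), ("NP", ("Det", "the"), ("N", "telescope")))))
np_attach = ("S", ("NP", "I"),
             ("VP", ("V", "saw"),
              ("NP", ("NP", ("Det", "the"), ("N", "man")),
               ("PP", ("P", "with"), ("NP", ("Det", "the"), ("N", "telescope"))))))

def leaves(tree):
    """Read off the terminal yield of a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    _label, *children = tree
    return [w for child in children for w in leaves(child)]

# Same surface string, two distinct hierarchical structures.
assert leaves(vp_attach) == leaves(np_attach)
```

Both trees are valid derivations under a plausible grammar; nothing in the string itself selects between them, which is exactly why broad-coverage parsing must rank candidate analyses rather than merely enumerate them.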
Parsing Strategies
Constituency parsers employ top-down strategies (expanding from the start symbol S toward the words), bottom-up strategies (combining words into successively larger constituents), or chart-based strategies that combine both directions. Classical algorithms include the CYK algorithm, which works bottom-up over a grammar in Chomsky normal form; Earley's algorithm, which interleaves top-down prediction with bottom-up completion; and generalized chart parsing. Modern approaches use probabilistic grammars or neural networks to select the most likely parse among the exponentially many candidates.
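The bottom-up strategy can be sketched with a small CYK chart. The recognizer below runs over a toy grammar in Chomsky normal form (the rule inventory is invented for illustration) and records, for every span and nonterminal, how many distinct parses cover that span — so the top cell directly exposes the ambiguity of the PP-attachment sentence:

```python
from collections import defaultdict

# Toy CNF grammar (hypothetical, for illustration): binary rules A -> B C
# and lexical rules A -> word.
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
LEXICAL = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def cyk_parse_counts(words):
    """Bottom-up CYK: chart[(i, j)] maps nonterminal -> number of distinct
    parses of the span words[i:j]."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    # Width-1 spans: apply lexical rules.
    for i, w in enumerate(words):
        for a in LEXICAL.get(w, []):
            chart[(i, i + 1)][a] += 1
    # Wider spans: combine two adjacent sub-spans with a binary rule.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for a, b, c in BINARY:
                    nb = chart[(i, k)].get(b, 0)
                    nc = chart[(k, j)].get(c, 0)
                    if nb and nc:
                        chart[(i, j)][a] += nb * nc
    return chart

sent = "I saw the man with the telescope".split()
chart = cyk_parse_counts(sent)
print(chart[(0, len(sent))]["S"])  # → 2 (PP attaches to VP or to NP)
```

Replacing the integer counts with max-probability scores (and backpointers) turns this recognizer into a probabilistic CYK parser that returns the single most likely tree; that is the standard dynamic-programming route through the exponentially many candidates.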
Evaluation
Constituency parsers are evaluated using labeled precision (fraction of predicted constituents that are correct), labeled recall (fraction of gold constituents that are predicted), and their harmonic mean, the F1 score. The evalb tool implements the PARSEVAL metric, ignoring punctuation and certain trivial unary chains. State-of-the-art parsers achieve F1 scores above 95% on the Penn Treebank Wall Street Journal test set, though performance drops on out-of-domain text and morphologically rich languages.
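The core PARSEVAL computation is straightforward to sketch. The toy function below — a simplification relative to evalb, which additionally handles punctuation, unary chains, and duplicate spans — scores predicted against gold constituents, each encoded as an assumed (label, start, end) triple:

```python
def parseval_f1(gold, pred):
    """Labeled precision, recall, and F1 over sets of constituents.

    Each constituent is a (label, start, end) triple over word positions.
    Simplified relative to evalb: no punctuation or unary-chain handling.
    """
    gold, pred = set(gold), set(pred)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Gold tree vs. a prediction that mis-attached one PP (so one NP span differs).
gold = {("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("NP", 2, 4), ("PP", 4, 7)}
pred = {("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("NP", 2, 7), ("PP", 4, 7)}
p, r, f1 = parseval_f1(gold, pred)
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.8 0.8 0.8
```

Note that a single attachment error costs both a precision and a recall point, since the wrong span is predicted and the right one is missed; this is why PARSEVAL penalizes structural mistakes symmetrically.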
The recovered phrase-structure trees serve as input to downstream tasks including semantic role labeling, information extraction, and machine translation. They also provide the structural backbone for linguistic theories of syntax, making constituency parsing a central task in both computational and theoretical linguistics.