Constituency parsing, also called phrase-structure parsing, analyzes a sentence by recursively decomposing it into nested constituents according to a context-free grammar (or extensions thereof). The output is a parse tree whose internal nodes are phrasal categories (S, NP, VP, PP, etc.) and whose leaves are the words of the sentence. This representation captures the hierarchical grouping of words into phrases and the syntactic role each phrase plays within the larger sentence.
Formal Foundations
A context-free grammar is formally a 4-tuple G = (N, T, R, S), where:

N = set of nonterminal symbols (S, NP, VP, ...)
T = set of terminal symbols (words)
R = set of production rules A → α, with A ∈ N and α a sequence over N ∪ T
S ∈ N = the start symbol
A context-free grammar defines the set of valid parse trees. Each production rule specifies how a nonterminal can be expanded into a sequence of terminals and nonterminals. Parsing is the inverse problem: given a string of words and a grammar, find one or more derivations that produce the string. The fundamental challenge is ambiguity — natural language sentences routinely admit dozens or even thousands of valid parse trees under broad-coverage grammars.
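Ambiguity can be made concrete with the classic PP-attachment sentence "I saw the man with the telescope": the prepositional phrase can modify the verb phrase (seeing with a telescope) or the noun phrase (the man who has a telescope). A minimal Python sketch — the nested-tuple tree encoding and the toy analysis are illustrative choices, not a standard format — represents the two competing trees and checks that both derive the same word string:

```python
# Two parse trees (toy analysis, for illustration) of the ambiguous sentence
# "I saw the man with the telescope", encoded as nested (label, children...) tuples.
vp_attach = ("S", ("NP", "I"),
             ("VP", ("VP", ("V", "saw"), ("NP", ("Det", "the"), ("N", "man"))),
              ("PP", ("P", "with"), ("NP", ("Det", "the"), ("N", "telescope")))))
np_attach = ("S", ("NP", "I"),
             ("VP", ("V", "saw"),
              ("NP", ("NP", ("Det", "the"), ("N", "man")),
               ("PP", ("P", "with"), ("NP", ("Det", "the"), ("N", "telescope"))))))

def leaves(tree):
    """Read off the terminal yield of a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    _label, *children = tree
    return [w for child in children for w in leaves(child)]

# Same surface string, two distinct hierarchical structures.
assert leaves(vp_attach) == leaves(np_attach)
```

Both trees are valid derivations under a plausible grammar; nothing in the string itself selects between them, which is exactly why broad-coverage parsing must rank candidate analyses rather than merely enumerate them.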
Parsing Strategies
Constituency parsers employ top-down strategies (expanding from the start symbol S toward the words), bottom-up strategies (combining words into successively larger constituents), or chart-based strategies that combine both directions. Classical algorithms include the CYK algorithm, which works bottom-up over a grammar in Chomsky normal form; Earley's algorithm, which interleaves top-down prediction with bottom-up completion; and generalized chart parsing. Modern approaches use probabilistic grammars or neural networks to select the most likely parse among the exponentially many candidates.
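The bottom-up strategy can be sketched with a small CYK chart. The recognizer below runs over a toy grammar in Chomsky normal form (the rule inventory is invented for illustration) and records, for every span and nonterminal, how many distinct parses cover that span — so the top cell directly exposes the ambiguity of the PP-attachment sentence:

```python
from collections import defaultdict

# Toy CNF grammar (hypothetical, for illustration): binary rules A -> B C
# and lexical rules A -> word.
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
LEXICAL = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def cyk_parse_counts(words):
    """Bottom-up CYK: chart[(i, j)] maps nonterminal -> number of distinct
    parses of the span words[i:j]."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    # Width-1 spans: apply lexical rules.
    for i, w in enumerate(words):
        for a in LEXICAL.get(w, []):
            chart[(i, i + 1)][a] += 1
    # Wider spans: combine two adjacent sub-spans with a binary rule.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for a, b, c in BINARY:
                    nb = chart[(i, k)].get(b, 0)
                    nc = chart[(k, j)].get(c, 0)
                    if nb and nc:
                        chart[(i, j)][a] += nb * nc
    return chart

sent = "I saw the man with the telescope".split()
chart = cyk_parse_counts(sent)
print(chart[(0, len(sent))]["S"])  # → 2 (PP attaches to VP or to NP)
```

Replacing the integer counts with max-probability scores (and backpointers) turns this recognizer into a probabilistic CYK parser that returns the single most likely tree; that is the standard dynamic-programming route through the exponentially many candidates.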
Evaluation
Constituency parsers are evaluated using labeled precision (fraction of predicted constituents that are correct), labeled recall (fraction of gold constituents that are predicted), and their harmonic mean, the F1 score. The evalb tool implements the PARSEVAL metric, ignoring punctuation and certain trivial unary chains. State-of-the-art parsers achieve F1 scores above 95% on the Penn Treebank Wall Street Journal test set, though performance drops on out-of-domain text and morphologically rich languages.
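The core PARSEVAL computation is straightforward to sketch. The toy function below — a simplification relative to evalb, which additionally handles punctuation, unary chains, and duplicate spans — scores predicted against gold constituents, each encoded as an assumed (label, start, end) triple:

```python
def parseval_f1(gold, pred):
    """Labeled precision, recall, and F1 over sets of constituents.

    Each constituent is a (label, start, end) triple over word positions.
    Simplified relative to evalb: no punctuation or unary-chain handling.
    """
    gold, pred = set(gold), set(pred)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Gold tree vs. a prediction that mis-attached one PP (so one NP span differs).
gold = {("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("NP", 2, 4), ("PP", 4, 7)}
pred = {("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("NP", 2, 7), ("PP", 4, 7)}
p, r, f1 = parseval_f1(gold, pred)
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.8 0.8 0.8
```

Note that a single attachment error costs both a precision and a recall point, since the wrong span is predicted and the right one is missed; this is why PARSEVAL penalizes structural mistakes symmetrically.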
The recovered phrase-structure trees serve as input to downstream tasks including semantic role labeling, information extraction, and machine translation. They also provide the structural backbone for linguistic theories of syntax, making constituency parsing a central task in both computational and theoretical linguistics.