A probabilistic context-free grammar (PCFG) extends a standard CFG by assigning a conditional probability to each production rule. The probability of a parse tree is the product of the probabilities of all rules used in the derivation. For an ambiguous sentence, a PCFG-based parser selects the parse tree with the highest probability, providing a principled approach to disambiguation. PCFGs also define a probability distribution over strings, making them generative models of language.
Formal Definition
q(A → α) = P(α | A) for each rule A → α ∈ R
Constraint: ∑_α q(A → α) = 1 for each A ∈ N
P(t) = ∏_{r ∈ t} q(r) for a parse tree t
P(w) = ∑_{t : yield(t) = w} P(t) for a sentence w
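The definitions above can be sketched directly in code. The following is a minimal illustration with a hypothetical toy grammar (the rules and probabilities are invented for the example, not taken from the text): rules are stored per left-hand side, the normalization constraint is checked, and a tree's probability is the product of its rule probabilities.

```python
from math import isclose

# Toy PCFG (hypothetical example grammar).  Rules are stored as
# {lhs: {rhs_tuple: probability}}; terminals appear as lowercase strings.
pcfg = {
    "S":  {("NP", "VP"): 1.0},
    "NP": {("DT", "NN"): 0.7, ("NN",): 0.3},
    "VP": {("VB", "NP"): 0.6, ("VB",): 0.4},
    "DT": {("the",): 1.0},
    "NN": {("dog",): 0.5, ("cat",): 0.5},
    "VB": {("saw",): 1.0},
}

# Normalization constraint: sum over alpha of q(A -> alpha) = 1 for each A.
for lhs, expansions in pcfg.items():
    assert isclose(sum(expansions.values()), 1.0)

def tree_prob(tree):
    """P(t) = product of q(r) over all rules r used in the derivation.
    Trees are nested tuples ("A", child, ...); leaves are terminal strings."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = pcfg[label][rhs]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

example = ("S", ("NP", ("NN", "cat")),
           ("VP", ("VB", "saw"), ("NP", ("DT", "the"), ("NN", "dog"))))
print(tree_prob(example))  # ≈ 0.0315 = 1.0 * 0.3 * 0.5 * 0.6 * 1.0 * 0.7 * 1.0 * 0.5
```

Note that the recursion visits every internal node exactly once, so each rule occurrence in the derivation contributes one factor, as the product formula requires.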
The probability of a sentence is the sum of the probabilities of all its parse trees. The most likely parse for a sentence w is argmax_t P(t | w) = argmax_{t : yield(t) = w} P(t), since P(t | w) = P(t) / P(w) and P(w) is constant over the candidate parses. This Viterbi parse can be found efficiently with the probabilistic CYK algorithm; the inside algorithm computes the total string probability P(w), and the inside-outside algorithm computes marginal probabilities of constituents.
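The probabilistic CYK algorithm can be sketched as follows, assuming a grammar in Chomsky normal form (binary rules A → B C and lexical rules A → word). The toy grammar and sentence here are hypothetical, chosen so the chart stays small; the chart cell (i, j, A) stores the best probability of A spanning words i..j.

```python
from collections import defaultdict

# Hypothetical toy CNF grammar: lexical rules A -> word and binary
# rules A -> B C, each with probability q(r).
lexical = {
    "the": {"DT": 1.0},
    "dog": {"NN": 0.5},
    "cat": {"NN": 0.5},
    "saw": {"VB": 1.0},
}
binary = [  # (A, B, C, q(A -> B C))
    ("S", "NP", "VP", 1.0),
    ("NP", "DT", "NN", 1.0),
    ("VP", "VB", "NP", 1.0),
]

def viterbi_cyk(words):
    """Return the probability of the best parse rooted in S.
    chart[i, j, A] = max probability of A spanning words[i:j]."""
    n = len(words)
    chart = defaultdict(float)          # missing entries default to 0.0
    for i, w in enumerate(words):       # width-1 spans: lexical rules
        for A, q in lexical.get(w, {}).items():
            chart[i, i + 1, A] = q
    for width in range(2, n + 1):       # wider spans: binary rules
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):   # try every split point
                for A, B, C, q in binary:
                    p = q * chart[i, k, B] * chart[k, j, C]
                    if p > chart[i, j, A]:
                        chart[i, j, A] = p
    return chart[0, n, "S"]

print(viterbi_cyk("the dog saw the cat".split()))  # 0.25
```

Replacing the max with a sum over split points and rules turns this into the inside algorithm, which computes P(w) instead of the Viterbi parse probability.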
Parameter Estimation
PCFG rule probabilities are typically estimated from a treebank using maximum likelihood estimation: q(A → α) = count(A → α) / count(A), where count(A) is the number of times A appears as the left-hand side of any rule in the treebank. When treebank data is unavailable, the inside-outside algorithm (a special case of EM) can estimate parameters from unannotated text. However, vanilla PCFGs have significant limitations: they assume context-free independence between rules, which means they cannot capture lexical preferences, structural dependencies, or the tendency of certain constructions to co-occur.
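The maximum likelihood estimate above amounts to counting rule occurrences in the treebank trees and normalizing per left-hand side. A sketch, using a tiny hypothetical two-tree "treebank" (trees are nested tuples whose leaves are words):

```python
from collections import Counter

# Hypothetical miniature treebank (invented for illustration).
treebank = [
    ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VB", "barked"))),
    ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VB", "slept"))),
]

rule_counts = Counter()   # count(A -> alpha)
lhs_counts = Counter()    # count(A): occurrences of A as a left-hand side

def count_rules(tree):
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[label, rhs] += 1
    lhs_counts[label] += 1
    for c in children:
        if not isinstance(c, str):
            count_rules(c)

for t in treebank:
    count_rules(t)

# MLE: q(A -> alpha) = count(A -> alpha) / count(A)
q = {(A, rhs): c / lhs_counts[A] for (A, rhs), c in rule_counts.items()}
print(q["NN", ("dog",)])  # 0.5: "dog" is one of two NN expansions
```

Because each rule's count is divided by its left-hand side's total, the probabilities for every nonterminal automatically sum to 1, satisfying the normalization constraint.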
Extensions
Major extensions include lexicalized PCFGs (Collins, 1997; Charniak, 1997) that condition rule probabilities on head words, parent-annotated PCFGs (Johnson, 1998) that refine nonterminals with parent information, and latent-variable PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006) that automatically learn fine-grained nonterminal subcategories. These extensions dramatically improve parsing accuracy, with the Berkeley Parser achieving over 90% F1 using split-merge latent-variable training. PCFGs remain foundational in computational linguistics even as neural approaches have become dominant.