Computational Linguistics
About

CYK Algorithm

The Cocke-Younger-Kasami (CYK) algorithm is a bottom-up dynamic programming parser for context-free grammars in Chomsky Normal Form, running in O(n^3 |G|) time.

T[i,j,A] = ∨_{B,C,k} T[i,k,B] ∧ T[k+1,j,C] ∧ (A→BC ∈ R)

The Cocke-Younger-Kasami (CYK) algorithm, independently discovered by Cocke, Younger, and Kasami in the 1960s, is a tabular bottom-up parsing algorithm for context-free grammars. It requires the grammar to be in Chomsky Normal Form (CNF), where every production is either A → BC (two nonterminals) or A → a (a single terminal). Any CFG can be converted to CNF without changing the generated language, making CYK a general-purpose CFG parser.

Algorithm

CYK Recognition Table Input: w = w1 w2 ... wn, grammar G = (N, T, R, S)

Base case: T[i, i, A] = true iff (A → wi) ∈ R
Recursive case: T[i, j, A] = true iff ∃ B, C ∈ N, k ∈ [i, j−1]:
  T[i, k, B] = true ∧ T[k+1, j, C] = true ∧ (A → BC) ∈ R

Accept iff T[1, n, S] = true

The algorithm fills an n × n upper-triangular table, where cell T[i, j] records all nonterminals that can derive the substring wi...wj. Cells are filled in order of increasing span length: first all spans of length 1 (individual words), then length 2, and so on up to the full sentence. For each cell T[i, j], the algorithm considers all possible split points k and all binary rules A → BC such that B derives wi...wk and C derives wk+1...wj.

Complexity and Extensions

The time complexity is O(n3 |G|), where n is the sentence length and |G| is the grammar size. Space complexity is O(n2 |N|). While cubic time is practical for short sentences and small grammars, broad-coverage grammars can have thousands of nonterminals after binarization, making the constant factor significant. Optimizations include agenda-driven parsing, coarse-to-fine pruning, and A* search.

Probabilistic CYK
The probabilistic extension replaces Boolean values with probabilities: T[i, j, A] stores the highest probability of any subtree rooted at A spanning positions i to j. The recursion uses multiplication (of rule and subtree probabilities) and maximization (over split points and rules), yielding the Viterbi parse in the same O(n³) time.

To recover the actual parse tree rather than just recognize the string, each cell stores back-pointers recording which rule and split point were used. These back-pointers are followed from T[1, n, S] to reconstruct the complete derivation. The CYK algorithm remains foundational in computational linguistics and serves as the basis for probabilistic parsing with PCFGs.

Interactive Calculator

Enter CNF grammar rules (one per line, format A -> B C or A -> a) followed by a blank line and a sentence (words separated by spaces). The CYK algorithm builds a parse table and determines whether the sentence is in the language.

Click Calculate to see results, or Animate to watch the statistics update one record at a time.

Related Topics

References

  1. Younger, D. H. (1967). Recognition and parsing of context-free languages in time n³. Information and Control, 10(2), 189–208. https://doi.org/10.1016/S0019-9958(67)80007-X
  2. Kasami, T. (1966). An efficient recognition and syntax-analysis algorithm for context-free languages. Scientific Report AFCRL-65-758, Air Force Cambridge Research Lab.
  3. Jurafsky, D., & Martin, J. H. (2024). Speech and language processing (3rd ed. draft). Pearson. https://web.stanford.edu/~jurafsky/slp3/
  4. Aho, A. V., & Ullman, J. D. (1972). The theory of parsing, translation, and compiling. Prentice-Hall.

External Links