Parsing and Syntax Trees: Where Compilers and NLP Share Roots
From constituency parsing to dependency parsing and the CYK algorithm, exploring how machines discover the grammatical structure of sentences using the same ideas that power programming language compilers.
Terminology
| Term | Definition |
|---|---|
| Parsing | The process of analyzing a string of symbols (words or tokens) according to a formal grammar to produce a structured representation |
| Constituency Parsing | Breaking a sentence into nested phrases (noun phrase, verb phrase, etc.) organized as a tree where leaves are words and internal nodes are phrase labels |
| Dependency Parsing | Identifying directed relationships between words where each word depends on exactly one head word, forming a tree rooted at the main verb |
| Context-Free Grammar (CFG) | A formal grammar with rules of the form $A \rightarrow \alpha$ where $A$ is a non-terminal and $\alpha$ is a sequence of terminals and non-terminals |
| Chomsky Normal Form (CNF) | A restricted CFG form where every rule is either $A \rightarrow BC$ (two non-terminals) or $A \rightarrow a$ (one terminal), required by the CYK algorithm |
| CYK Algorithm | Cocke-Younger-Kasami: a dynamic programming algorithm that parses a string under a CNF grammar in $O(n^3 \cdot |G|)$ time |
| PCFG | Probabilistic Context-Free Grammar: a CFG where each rule has an associated probability, enabling the parser to select the most likely parse tree |
| Parse Tree | A tree structure representing the syntactic analysis of a sentence, with the root as the start symbol and leaves as words |
| Ambiguity | When a sentence has multiple valid parse trees under the grammar, e.g., "I saw the man with the telescope" has two readings |
What & Why
Parsing reveals the hidden structure of sentences. "The cat sat on the mat" is not just a flat sequence of words; it has a hierarchical structure where "the cat" forms a noun phrase, "on the mat" forms a prepositional phrase, and "sat on the mat" forms a verb phrase. This structure determines meaning.
The connection to compilers is direct. Programming language compilers parse source code into abstract syntax trees (ASTs) using context-free grammars, then traverse those trees to generate machine code. NLP parsers do the same thing with natural language, except natural language is ambiguous (one sentence can have multiple valid parse trees) while programming languages are designed to be unambiguous.
Understanding parsing matters because it is the bridge between surface text and structured meaning. Information extraction, question answering, and machine translation all benefit from knowing which words modify which. And the algorithms (CYK, Earley, shift-reduce) are shared between compiler construction and computational linguistics, making this one of the deepest connections between programming and language.
How It Works
Constituency Parsing
A constituency parse tree groups words into nested phrases. Given the grammar:
- S -> NP VP
- NP -> DET NN
- VP -> VBD PP
- PP -> IN NP
- DET -> "the"
- NN -> "cat" | "mat"
- VBD -> "sat"
- IN -> "on"
The sentence "the cat sat on the mat" produces:
(S (NP (DET the) (NN cat))
   (VP (VBD sat)
       (PP (IN on)
           (NP (DET the) (NN mat)))))
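As a minimal sketch, a constituency tree like this can be represented in Python as nested tuples of the form `(label, children...)` — an illustrative encoding, not any particular library's format. Reading off the leaves recovers the original sentence (the tree's yield):

```python
def leaves(tree):
    """Read off the words at the leaves of a parse tree, left to right."""
    if isinstance(tree, str):          # a leaf is just the word itself
        return [tree]
    label, *children = tree
    return [word for child in children for word in leaves(child)]

# The parse of "the cat sat on the mat" under the grammar above.
parse = ("S",
         ("NP", ("DET", "the"), ("NN", "cat")),
         ("VP", ("VBD", "sat"),
                ("PP", ("IN", "on"),
                       ("NP", ("DET", "the"), ("NN", "mat")))))

print(leaves(parse))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```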
Dependency Parsing
Instead of phrase structure, dependency parsing identifies head-modifier relationships. Each word has exactly one head (except the root). "sat" is the root; "cat" depends on "sat" as its subject; "mat" depends on "on" as its object.
Dependency trees are more compact and directly encode grammatical relations (subject, object, modifier), making them popular for information extraction.
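A dependency parse can be stored even more compactly than a phrase-structure tree: one head index per word. The sketch below encodes the text's analysis of "the cat sat on the mat" ("sat" as root, "mat" attached to "on") using a dependent-to-head map; the index scheme with a conventional ROOT index of 0 is an illustrative choice.

```python
# Dependency parse of "the cat sat on the mat" as a head map.
# Each word has exactly one head; the root's head is the ROOT index 0.
ROOT = 0
words = {1: "the", 2: "cat", 3: "sat", 4: "on", 5: "the", 6: "mat"}
heads = {1: 2, 2: 3, 3: ROOT, 4: 3, 5: 6, 6: 4}  # dependent -> head

def root_of(heads):
    """Return the index of the single word whose head is ROOT."""
    [r] = [dep for dep, head in heads.items() if head == ROOT]
    return r

def dependents(heads, head):
    """All words directly attached to the given head."""
    return sorted(dep for dep, h in heads.items() if h == head)

print(words[root_of(heads)])                      # 'sat' -- the main verb
print([words[d] for d in dependents(heads, 3)])   # direct dependents of 'sat'
```

The single-head constraint is what makes this map representation lossless: the whole tree is exactly `n` integers.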
The CYK Algorithm
CYK is a bottom-up dynamic programming parser for grammars in Chomsky Normal Form. It fills a triangular table where cell $[i][j]$ contains all non-terminals that can derive the substring from position $i$ to $j$ (inclusive). For each span, it tries every possible split point $k$ between $i$ and $j$, checking whether any rule $A \rightarrow BC$ applies with $B$ in cell $[i][k]$ and $C$ in cell $[k+1][j]$.
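The table-filling idea fits in a few lines of Python. This is a minimal recognizer (it answers yes/no, without probabilities or backpointers); the encoding of the grammar as two rule sets is an assumption made for the sketch.

```python
def cyk_recognize(words, binary, lexical, start="S"):
    """Return True iff `words` is derivable from `start` under a CNF grammar.

    binary : set of (A, B, C) triples for rules A -> B C
    lexical: set of (A, word) pairs for rules A -> word
    """
    n = len(words)
    # table[i][j] = set of non-terminals deriving words[i..j] inclusive
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                  # diagonal: lexical rules
        table[i][i] = {A for (A, word) in lexical if word == w}
    for span in range(2, n + 1):                   # widening spans, bottom-up
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                  # every split point
                for (A, B, C) in binary:
                    if B in table[i][k] and C in table[k + 1][j]:
                        table[i][j].add(A)
    return start in table[0][n - 1]

# The toy grammar from the constituency parsing section, already in CNF.
binary = {("S", "NP", "VP"), ("NP", "DET", "NN"),
          ("VP", "VBD", "PP"), ("PP", "IN", "NP")}
lexical = {("DET", "the"), ("NN", "cat"), ("NN", "mat"),
           ("VBD", "sat"), ("IN", "on")}

print(cyk_recognize("the cat sat on the mat".split(), binary, lexical))  # True
```

The four nested loops make the $O(n^3 \cdot |G|)$ cost visible: spans × start positions × split points × rules.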
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| CYK parsing | $O(n^3 \cdot |G|)$ | $O(n^2 \cdot |N|)$ | $n$ = sentence length, $|G|$ = grammar rules, $|N|$ = non-terminals |
| Earley parsing | $O(n^3)$ worst, $O(n^2)$ unambiguous | $O(n^2)$ | Handles any CFG, not just CNF |
| Shift-reduce (dependency) | $O(n)$ | $O(n)$ | Linear time with greedy or beam search decisions |
| CNF conversion | $O(|G|)$ | $O(|G|)$ | One-time preprocessing of the grammar |
For a typical English sentence of length $n = 25$ with a grammar of $|G| = 10{,}000$ rules, CYK performs roughly $25^3 \times 10{,}000 \approx 1.6 \times 10^8$ operations. This is feasible but motivates the use of pruning and beam search in practice.
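The CNF conversion row above deserves a concrete illustration. The core step is binarization: any rule with more than two right-hand-side symbols is split by introducing fresh non-terminals. The sketch below shows only this step (a full CNF conversion also eliminates unit and empty rules, omitted here); the `X1, X2, ...` naming is an illustrative convention.

```python
def binarize(rules):
    """Binarize CFG rules: (lhs, rhs_tuple) -> equivalent list of rules
    whose right-hand sides have at most two symbols."""
    out, fresh = [], 0
    for lhs, rhs in rules:
        while len(rhs) > 2:
            fresh += 1
            new = f"X{fresh}"                  # fresh intermediate non-terminal
            out.append((lhs, (rhs[0], new)))   # A -> B X1
            lhs, rhs = new, rhs[1:]            # continue with X1 -> C D ...
        out.append((lhs, rhs))
    return out

# VP -> VBD NP PP becomes VP -> VBD X1 and X1 -> NP PP.
print(binarize([("VP", ("VBD", "NP", "PP"))]))
```

Each original rule of length $r$ becomes $r - 1$ binary rules, so the conversion stays $O(|G|)$ as the complexity table states.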
Implementation
ALGORITHM CYK(words, grammar)
INPUT: words: list of n tokens
grammar: set of CNF rules, each rule is (A -> B C) or (A -> a) with probability
OUTPUT: parseTable: 2D table where parseTable[i][j] = set of non-terminals deriving words[i..j]
bestTree: most probable parse tree
BEGIN
n <- LENGTH(words)
table <- CREATE 2D array [n x n], each cell is an empty map of {nonTerminal -> (prob, backpointer)}
// Fill diagonal: lexical rules (A -> a)
FOR i FROM 0 TO n - 1 DO
FOR EACH rule (A -> words[i]) WITH probability p IN grammar DO
table[i][i][A] <- (p, leaf: words[i])
END FOR
END FOR
// Fill upper triangle: binary rules (A -> B C)
FOR span FROM 2 TO n DO
FOR i FROM 0 TO n - span DO
j <- i + span - 1
FOR k FROM i TO j - 1 DO
FOR EACH rule (A -> B C) WITH probability p IN grammar DO
IF B IN table[i][k] AND C IN table[k+1][j] THEN
totalProb <- p * table[i][k][B].prob * table[k+1][j][C].prob
IF A NOT IN table[i][j] OR totalProb > table[i][j][A].prob THEN
table[i][j][A] <- (totalProb, split: k, left: B, right: C)
END IF
END IF
END FOR
END FOR
END FOR
END FOR
// Check if start symbol S spans the entire sentence
IF "S" IN table[0][n-1] THEN
bestTree <- RECONSTRUCT tree from backpointers starting at table[0][n-1]["S"]
RETURN table, bestTree
ELSE
RETURN table, NO_PARSE
END IF
END
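The CYK pseudocode above can be turned into runnable Python as follows. The rule probabilities and the nested-tuple tree format are illustrative assumptions; with a grammar this small there is only one parse, so the probabilities just multiply through.

```python
def pcfg_cyk(words, binary, lexical, start="S"):
    """Viterbi CYK: return (probability, tree) of the best parse, or None.

    binary : dict (A, B, C) -> prob  for rules A -> B C
    lexical: dict (A, word) -> prob  for rules A -> word
    """
    n = len(words)
    # chart[i][j] maps non-terminal -> (best prob, backpointer)
    chart = [[{} for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                      # diagonal: lexical rules
        for (A, word), p in lexical.items():
            if word == w:
                chart[i][i][A] = (p, w)                # leaf backpointer
    for span in range(2, n + 1):                       # binary rules, bottom-up
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for (A, B, C), p in binary.items():
                    if B in chart[i][k] and C in chart[k + 1][j]:
                        prob = p * chart[i][k][B][0] * chart[k + 1][j][C][0]
                        if A not in chart[i][j] or prob > chart[i][j][A][0]:
                            chart[i][j][A] = (prob, (k, B, C))
    if start not in chart[0][n - 1]:
        return None                                    # NO_PARSE

    def build(i, j, A):                                # follow backpointers
        prob, back = chart[i][j][A]
        if isinstance(back, str):
            return (A, back)
        k, B, C = back
        return (A, build(i, k, B), build(k + 1, j, C))

    return chart[0][n - 1][start][0], build(0, n - 1, start)

binary = {("S", "NP", "VP"): 1.0, ("NP", "DET", "NN"): 1.0,
          ("VP", "VBD", "PP"): 1.0, ("PP", "IN", "NP"): 1.0}
lexical = {("DET", "the"): 1.0, ("NN", "cat"): 0.5, ("NN", "mat"): 0.5,
           ("VBD", "sat"): 1.0, ("IN", "on"): 1.0}

prob, tree = pcfg_cyk("the cat sat on the mat".split(), binary, lexical)
print(prob)   # 0.25: the two NN choices contribute 0.5 each
print(tree)
```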
ALGORITHM ShiftReduceDependencyParse(words, classifier)
INPUT: words: list of n tokens, classifier: trained model that predicts actions
OUTPUT: arcs: set of (head, dependent, relation) triples
BEGIN
stack <- [ROOT]
buffer <- COPY of words
arcs <- empty set
WHILE buffer is not empty OR LENGTH(stack) > 1 DO
features <- EXTRACT features from top of stack and front of buffer
action <- classifier.PREDICT(features)
IF action = SHIFT AND buffer is not empty THEN
PUSH buffer[0] onto stack
REMOVE buffer[0]
ELSE IF action = LEFT_ARC AND LENGTH(stack) >= 2 THEN
// stack[-2] depends on stack[-1]
dependent <- stack[-2]
head <- stack[-1]
ADD (head, dependent, action.relation) TO arcs
REMOVE stack[-2]
ELSE IF action = RIGHT_ARC AND LENGTH(stack) >= 2 THEN
// stack[-1] depends on stack[-2]
dependent <- stack[-1]
head <- stack[-2]
ADD (head, dependent, action.relation) TO arcs
POP stack
END IF
END WHILE
RETURN arcs
END
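A runnable sketch of the arc-standard transitions above: in place of a trained classifier, a hand-written static oracle consults a gold head map — an assumption made so the example is self-contained; a real parser learns this decision from treebank data.

```python
def oracle(stack, buffer, heads, arcs):
    """Static oracle: pick the next action given gold heads."""
    if len(stack) >= 2:
        s1, s2 = stack[-1], stack[-2]
        if heads.get(s2) == s1:
            return "LEFT_ARC"
        # RIGHT_ARC only once s1 has collected all of its own dependents
        if heads.get(s1) == s2 and all(
                heads.get(w) != s1 or (s1, w) in arcs for w in heads):
            return "RIGHT_ARC"
    return "SHIFT"

def parse(n_words, heads):
    """Arc-standard parse of word indices 1..n_words (0 is ROOT).

    heads maps dependent -> gold head; returns the set of (head, dep) arcs.
    """
    stack, buffer, arcs = [0], list(range(1, n_words + 1)), set()
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer, heads, arcs)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action == "LEFT_ARC":
            dependent = stack.pop(-2)          # stack[-2] depends on stack[-1]
            arcs.add((stack[-1], dependent))
        elif action == "RIGHT_ARC":
            dependent = stack.pop()            # stack[-1] depends on stack[-2]
            arcs.add((stack[-1], dependent))
        else:
            break                              # no valid action: give up
    return arcs

# "the cat sat on the mat": 1=the 2=cat 3=sat 4=on 5=the 6=mat
gold = {1: 2, 2: 3, 3: 0, 4: 3, 5: 6, 6: 4}
print(sorted(parse(6, gold)))
```

Because each word is shifted once and reduced once, the loop runs at most $2n$ iterations — the linear time claimed in the complexity table.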
Real-World Applications
- Compilers and interpreters: Programming language parsers use CFGs (often LL or LR variants) to build ASTs from source code, the same theoretical foundation as NLP constituency parsing
- Question answering: Dependency parses identify the subject and object of a question, helping systems locate the answer span in a passage
- Grammar checkers: Tools like Grammarly use parsing to detect structural errors (dangling modifiers, subject-verb disagreement) that surface-level rules miss
- Information extraction: Dependency paths between entities in a sentence serve as features for relation extraction ("Obama" -nsubj-> "visited" -dobj-> "Paris" implies a visit relation)
- Machine translation: Syntax-based MT systems reorder parse tree nodes to handle structural differences between languages
- Sentiment analysis: Recursive neural networks operate on parse trees to compose phrase-level sentiment from word-level sentiment, handling negation and intensifiers
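The dependency-path feature mentioned under information extraction can be sketched concretely: given labeled arcs, recover the chain of relations linking two entities. The words and labels follow the "Obama visited Paris" example above; the breadth-first search over undirected arcs is one common, simple way to find the path.

```python
from collections import deque

def dependency_path(arcs, source, target):
    """Shortest path of relation labels between two words (BFS, undirected)."""
    neighbors = {}
    for head, dep, rel in arcs:
        neighbors.setdefault(dep, []).append((head, f"{rel}->"))   # up to head
        neighbors.setdefault(head, []).append((dep, f"<-{rel}"))   # down to dep
    queue, seen = deque([(source, [])]), {source}
    while queue:
        word, path = queue.popleft()
        if word == target:
            return path
        for nxt, label in neighbors.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label]))
    return None                                # no path between the words

arcs = {("visited", "Obama", "nsubj"), ("visited", "Paris", "dobj")}
print(dependency_path(arcs, "Obama", "Paris"))  # ['nsubj->', '<-dobj']
```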
Key Takeaways
- Constituency parsing groups words into nested phrases (NP, VP, PP) forming a tree, while dependency parsing identifies head-modifier relationships between individual words
- The CYK algorithm parses sentences under a CNF grammar in $O(n^3 \cdot |G|)$ time using dynamic programming, filling a triangular table bottom-up
- Natural language is inherently ambiguous: "I saw the man with the telescope" has two valid parse trees, and resolving ambiguity requires probabilistic grammars (PCFGs) or learned models
- Compilers and NLP parsers share the same theoretical foundation (context-free grammars, parse trees), making parsing one of the deepest connections between programming languages and natural language
- Modern dependency parsers use neural shift-reduce models that run in linear time, making them fast enough for real-time applications