Context-Free Grammars and Parsing
Context-free grammars, pushdown automata, why programming languages are context-free, the Chomsky hierarchy, and how parsers turn source code into syntax trees.
Terminology
| Term | Definition |
|---|---|
| Context-free grammar (CFG) | A formal grammar where every production rule has a single non-terminal on the left side, e.g. $S \to aSb \mid \epsilon$ |
| Non-terminal | A symbol that can be replaced by a sequence of terminals and non-terminals via production rules (e.g. $S$, $A$, $\text{Expr}$) |
| Terminal | A symbol from the input alphabet that appears in the final string (e.g. $a$, $b$, $+$, digits) |
| Production rule | A rewriting rule that replaces a non-terminal with a sequence of symbols, e.g. $\text{Expr} \to \text{Expr} + \text{Term}$ |
| Pushdown automaton (PDA) | A finite automaton augmented with a stack, giving it the power to recognize context-free languages |
| Parse tree (syntax tree) | A tree showing how a string is derived from the start symbol using production rules |
| Ambiguous grammar | A grammar where some string has more than one valid parse tree, causing parsing ambiguity |
| Chomsky Normal Form (CNF) | A restricted CFG form where every rule is either $A \to BC$ or $A \to a$; any CFG can be converted to CNF |
| Chomsky hierarchy | A classification of formal languages into four levels: regular (Type 3), context-free (Type 2), context-sensitive (Type 1), and recursively enumerable (Type 0) |
What & Why
Context-free grammars are the formal foundation of programming language syntax. When you write if (x > 0) { return x; }, a parser uses a CFG to understand the structure: the if keyword, the condition in parentheses, the body in braces. Without CFGs, compilers could not turn source code into executable programs.
CFGs are more powerful than regular expressions. Regular expressions cannot match balanced parentheses, nested HTML tags, or recursive structures. CFGs can, because they are backed by pushdown automata, which have a stack for tracking nesting depth.
The Chomsky hierarchy places CFGs at Type 2, above regular languages (Type 3) but below context-sensitive languages (Type 1). This hierarchy tells you what each formalism can and cannot express, guiding the choice of parsing technology for different tasks.
How It Works
CFG Definition
A CFG is a 4-tuple (V, \Sigma, R, S): V is the set of non-terminals, \Sigma is the set of terminals, R is the set of production rules, and S \in V is the start symbol.
Example: Grammar for balanced parentheses:
S \to (S) \mid SS \mid \epsilon
This generates: \epsilon, (\ ), (\ )(\ ), ((\ )), (\ )((\ )), etc.
Example: Grammar for arithmetic expressions:
E \to E + T \mid T
T \to T * F \mid F
F \to (E) \mid \text{id}
This grammar encodes operator precedence: multiplication binds tighter than addition because T (terms with *) is lower in the derivation tree than E (expressions with +).
The Chomsky Hierarchy
Pushdown Automata
A pushdown automaton (PDA) is a finite automaton with a stack. The stack gives it the ability to count and match nested structures. A PDA can push symbols onto the stack, pop them off, and make transitions based on the current state, input symbol, and top of the stack.
Key equivalence: A language is context-free if and only if some PDA recognizes it.
For balanced parentheses: push on (, pop on ). If the stack is empty at the end, accept.
Parsing Strategies
Complexity Analysis
| Parsing Algorithm | Time | Grammar Restriction |
|---|---|---|
| CYK (general CFG) | $O(n^3 \cdot |G|)$ | Any CFG in CNF |
| Earley parser | $O(n^3)$ general, $O(n^2)$ unambiguous, $O(n)$ for LR grammars | Any CFG |
| LL(1) recursive descent | $O(n)$ | LL(1) grammars only |
| LR(1) / LALR(1) | $O(n)$ | LR(1) / LALR(1) grammars |
| PEG / Packrat | $O(n)$ with memoization | PEG (no ambiguity by design) |
Most programming language parsers use $O(n)$ algorithms (LL or LR) because programming language grammars are designed to be parseable in linear time.
Implementation
ALGORITHM RecursiveDescentParse(tokens)
INPUT: tokens: list of terminal symbols
OUTPUT: parse tree or ERROR
BEGIN
pos <- 0
FUNCTION ParseExpr()
node <- ParseTerm()
WHILE tokens[pos] = '+' DO
pos <- pos + 1
right <- ParseTerm()
node <- MakeNode('+', node, right)
END WHILE
RETURN node
END FUNCTION
FUNCTION ParseTerm()
node <- ParseFactor()
WHILE tokens[pos] = '*' DO
pos <- pos + 1
right <- ParseFactor()
node <- MakeNode('*', node, right)
END WHILE
RETURN node
END FUNCTION
FUNCTION ParseFactor()
IF tokens[pos] = '(' THEN
pos <- pos + 1
node <- ParseExpr()
EXPECT tokens[pos] = ')'
pos <- pos + 1
RETURN node
ELSE
node <- MakeLeaf(tokens[pos])
pos <- pos + 1
RETURN node
END IF
END FUNCTION
RETURN ParseExpr()
END
ALGORITHM CYKParse(grammar, input)
INPUT: grammar: CFG in Chomsky Normal Form, input: string of length n
OUTPUT: TRUE if input is in the language
BEGIN
table <- n x n matrix of empty sets
FOR i <- 1 TO n DO
FOR EACH rule A -> a WHERE a = input[i] DO
ADD A TO table[i][i]
END FOR
END FOR
FOR length <- 2 TO n DO
FOR i <- 1 TO n - length + 1 DO
j <- i + length - 1
FOR k <- i TO j - 1 DO
FOR EACH rule A -> BC DO
IF B IN table[i][k] AND C IN table[k+1][j] THEN
ADD A TO table[i][j]
END IF
END FOR
END FOR
END FOR
END FOR
RETURN start_symbol IN table[1][n]
END
Real-World Applications
- Compilers and interpreters: every programming language has a CFG defining its syntax; the parser (front end) uses this grammar to build an abstract syntax tree from source code
- JSON and XML parsing: data interchange formats are context-free languages; their parsers validate structure (matched braces, nested tags) using CFG-based techniques
- SQL query parsing: database engines parse SQL statements using CFGs to build query execution plans
- Natural language processing: probabilistic CFGs model sentence structure in human languages, enabling syntactic parsing for machine translation and information extraction
- IDE features: syntax highlighting, code folding, and auto-indentation rely on grammar-aware parsing to understand code structure in real time
Key Takeaways
- Context-free grammars are strictly more powerful than regular expressions: they can express nested and recursive structures like balanced parentheses and programming language syntax
- Pushdown automata (finite automata + stack) are the machine model equivalent to CFGs
- The Chomsky hierarchy classifies languages into four levels: regular, context-free, context-sensitive, and recursively enumerable
- Practical parsers (LL, LR) run in $O(n)$ time by restricting the grammar to avoid ambiguity; general CFG parsing (CYK, Earley) takes $O(n^3)$
- Every compiler, interpreter, and structured data parser you use is built on context-free grammar theory