Notes/Context-Free Grammars and Parsing

Context-Free Grammars and Parsing

Context-free grammars, pushdown automata, why programming languages are context-free, the Chomsky hierarchy, and how parsers turn source code into syntax trees.

2025-10-21AI-Synthesized from Personal Notes

Theory of ComputationContext Free GrammarsParsingChomsky

Terminology

Term	Definition
Context-free grammar (CFG)	A formal grammar where every production rule has a single non-terminal on the left side, e.g. $S \to aSb \mid \epsilon$
Non-terminal	A symbol that can be replaced by a sequence of terminals and non-terminals via production rules (e.g. $S$, $A$, $\text{Expr}$)
Terminal	A symbol from the input alphabet that appears in the final string (e.g. $a$, $b$, $+$, digits)
Production rule	A rewriting rule that replaces a non-terminal with a sequence of symbols, e.g. $\text{Expr} \to \text{Expr} + \text{Term}$
Pushdown automaton (PDA)	A finite automaton augmented with a stack, giving it the power to recognize context-free languages
Parse tree (syntax tree)	A tree showing how a string is derived from the start symbol using production rules
Ambiguous grammar	A grammar where some string has more than one valid parse tree, causing parsing ambiguity
Chomsky Normal Form (CNF)	A restricted CFG form where every rule is either $A \to BC$ or $A \to a$; any CFG can be converted to CNF
Chomsky hierarchy	A classification of formal languages into four levels: regular (Type 3), context-free (Type 2), context-sensitive (Type 1), and recursively enumerable (Type 0)

What & Why

Context-free grammars are the formal foundation of programming language syntax. When you write if (x > 0) { return x; }, a parser uses a CFG to understand the structure: the if keyword, the condition in parentheses, the body in braces. Without CFGs, compilers could not turn source code into executable programs.

CFGs are more powerful than regular expressions. Regular expressions cannot match balanced parentheses, nested HTML tags, or recursive structures. CFGs can, because they are backed by pushdown automata, which have a stack for tracking nesting depth.

The Chomsky hierarchy places CFGs at Type 2, above regular languages (Type 3) but below context-sensitive languages (Type 1). This hierarchy tells you what each formalism can and cannot express, guiding the choice of parsing technology for different tasks.

How It Works

CFG Definition

A CFG is a 4-tuple $(V, \Sigma, R, S)$: $V$ is the set of non-terminals, $\Sigma$ is the set of terminals, $R$ is the set of production rules, and $S \in V$ is the start symbol.

Example: Grammar for balanced parentheses:

$S \to (S) \mid SS \mid \epsilon$

This generates: $\epsilon$, $(\ )$, $(\ )(\ )$, $((\ ))$, $(\ )((\ ))$, etc.

Example: Grammar for arithmetic expressions:

$E \to E + T \mid T$ $T \to T * F \mid F$ $F \to (E) \mid \text{id}$

This grammar encodes operator precedence: multiplication binds tighter than addition because $T$ (terms with $*$) is lower in the derivation tree than $E$ (expressions with $+$).

The Chomsky Hierarchy

Pushdown Automata

A pushdown automaton (PDA) is a finite automaton with a stack. The stack gives it the ability to count and match nested structures. A PDA can push symbols onto the stack, pop them off, and make transitions based on the current state, input symbol, and top of the stack.

Key equivalence: A language is context-free if and only if some PDA recognizes it.

For balanced parentheses: push on (, pop on ). If the stack is empty at the end, accept.

Parsing Strategies

Complexity Analysis

Parsing Algorithm	Time	Grammar Restriction
CYK (general CFG)	$O(n^3 \cdot \|G\|)$	Any CFG in CNF
Earley parser	$O(n^3)$ general, $O(n^2)$ unambiguous, $O(n)$ for LR grammars	Any CFG
LL(1) recursive descent	$O(n)$	LL(1) grammars only
LR(1) / LALR(1)	$O(n)$	LR(1) / LALR(1) grammars
PEG / Packrat	$O(n)$ with memoization	PEG (no ambiguity by design)

Most programming language parsers use $O(n)$ algorithms (LL or LR) because programming language grammars are designed to be parseable in linear time.

Implementation

ALGORITHM RecursiveDescentParse(tokens)
INPUT: tokens: list of terminal symbols
OUTPUT: parse tree or ERROR
BEGIN
  pos <- 0

  FUNCTION ParseExpr()
    node <- ParseTerm()
    WHILE tokens[pos] = '+' DO
      pos <- pos + 1
      right <- ParseTerm()
      node <- MakeNode('+', node, right)
    END WHILE
    RETURN node
  END FUNCTION

  FUNCTION ParseTerm()
    node <- ParseFactor()
    WHILE tokens[pos] = '*' DO
      pos <- pos + 1
      right <- ParseFactor()
      node <- MakeNode('*', node, right)
    END WHILE
    RETURN node
  END FUNCTION

  FUNCTION ParseFactor()
    IF tokens[pos] = '(' THEN
      pos <- pos + 1
      node <- ParseExpr()
      EXPECT tokens[pos] = ')'
      pos <- pos + 1
      RETURN node
    ELSE
      node <- MakeLeaf(tokens[pos])
      pos <- pos + 1
      RETURN node
    END IF
  END FUNCTION

  RETURN ParseExpr()
END

ALGORITHM CYKParse(grammar, input)
INPUT: grammar: CFG in Chomsky Normal Form, input: string of length n
OUTPUT: TRUE if input is in the language
BEGIN
  table <- n x n matrix of empty sets

  FOR i <- 1 TO n DO
    FOR EACH rule A -> a WHERE a = input[i] DO
      ADD A TO table[i][i]
    END FOR
  END FOR

  FOR length <- 2 TO n DO
    FOR i <- 1 TO n - length + 1 DO
      j <- i + length - 1
      FOR k <- i TO j - 1 DO
        FOR EACH rule A -> BC DO
          IF B IN table[i][k] AND C IN table[k+1][j] THEN
            ADD A TO table[i][j]
          END IF
        END FOR
      END FOR
    END FOR
  END FOR

  RETURN start_symbol IN table[1][n]
END

Real-World Applications

Compilers and interpreters: every programming language has a CFG defining its syntax; the parser (front end) uses this grammar to build an abstract syntax tree from source code
JSON and XML parsing: data interchange formats are context-free languages; their parsers validate structure (matched braces, nested tags) using CFG-based techniques
SQL query parsing: database engines parse SQL statements using CFGs to build query execution plans
Natural language processing: probabilistic CFGs model sentence structure in human languages, enabling syntactic parsing for machine translation and information extraction
IDE features: syntax highlighting, code folding, and auto-indentation rely on grammar-aware parsing to understand code structure in real time

Key Takeaways

Context-free grammars are strictly more powerful than regular expressions: they can express nested and recursive structures like balanced parentheses and programming language syntax
Pushdown automata (finite automata + stack) are the machine model equivalent to CFGs
The Chomsky hierarchy classifies languages into four levels: regular, context-free, context-sensitive, and recursively enumerable
Practical parsers (LL, LR) run in $O(n)$ time by restricting the grammar to avoid ambiguity; general CFG parsing (CYK, Earley) takes $O(n^3)$
Every compiler, interpreter, and structured data parser you use is built on context-free grammar theory

2025-10-22

Turing Machines and Computability

The universal Turing machine, the Church-Turing thesis, the Halting Problem, undecidability, and why some problems can never be solved by any computer.

2025-10-23

Amortized Analysis: The True Cost of Operations

Understanding amortized analysis through the aggregate method, accounting method, and potential method, with dynamic array and splay tree examples showing why worst-case per-operation analysis can be misleading.

2025-10-24

Randomized Algorithms and Probabilistic Analysis

Las Vegas vs Monte Carlo algorithms, expected running time, randomized quicksort, skip list analysis, and why adding randomness can make algorithms simpler and faster.