
Context-Free Grammars and Parsing

Context-free grammars, pushdown automata, why programming languages are context-free, the Chomsky hierarchy, and how parsers turn source code into syntax trees.

2025-10-21

Terminology

| Term | Definition |
| --- | --- |
| Context-free grammar (CFG) | A formal grammar where every production rule has a single non-terminal on the left side, e.g. $S \to aSb \mid \epsilon$ |
| Non-terminal | A symbol that can be replaced by a sequence of terminals and non-terminals via production rules (e.g. $S$, $A$, $\text{Expr}$) |
| Terminal | A symbol from the input alphabet that appears in the final string (e.g. $a$, $b$, $+$, digits) |
| Production rule | A rewriting rule that replaces a non-terminal with a sequence of symbols, e.g. $\text{Expr} \to \text{Expr} + \text{Term}$ |
| Pushdown automaton (PDA) | A finite automaton augmented with a stack, giving it the power to recognize context-free languages |
| Parse tree (syntax tree) | A tree showing how a string is derived from the start symbol using production rules |
| Ambiguous grammar | A grammar where some string has more than one valid parse tree, causing parsing ambiguity |
| Chomsky Normal Form (CNF) | A restricted CFG form where every rule is either $A \to BC$ or $A \to a$; any CFG can be converted to CNF |
| Chomsky hierarchy | A classification of formal languages into four levels: regular (Type 3), context-free (Type 2), context-sensitive (Type 1), and recursively enumerable (Type 0) |

What & Why

Context-free grammars are the formal foundation of programming language syntax. When you write if (x > 0) { return x; }, a parser uses a CFG to understand the structure: the if keyword, the condition in parentheses, the body in braces. Without CFGs, compilers could not turn source code into executable programs.

CFGs are more powerful than regular expressions. Regular expressions cannot match balanced parentheses, nested HTML tags, or recursive structures. CFGs can, because they are backed by pushdown automata, which have a stack for tracking nesting depth.
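To see that extra power concretely, here is a minimal Python sketch (my illustration, not from the post) that checks balanced parentheses with a single counter, i.e. a one-symbol stack. No regular expression can do this for unbounded nesting depth, because a finite automaton has only finitely many states to count with:

```python
def balanced(s: str) -> bool:
    """Check balanced parentheses with a counter (a one-symbol stack)."""
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1          # push
        elif ch == ')':
            depth -= 1          # pop
            if depth < 0:       # a ')' with nothing to match
                return False
    return depth == 0           # accept only if the stack is empty
```

Matching HTML-style tags needs a full stack of tag names rather than a counter, but the principle is the same.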

The Chomsky hierarchy places CFGs at Type 2, above regular languages (Type 3) but below context-sensitive languages (Type 1). This hierarchy tells you what each formalism can and cannot express, guiding the choice of parsing technology for different tasks.

How It Works

CFG Definition

A CFG is a 4-tuple $(V, \Sigma, R, S)$: $V$ is the set of non-terminals, $\Sigma$ is the set of terminals, $R$ is the set of production rules, and $S \in V$ is the start symbol.

Example: Grammar for balanced parentheses:

$S \to (S) \mid SS \mid \epsilon$

This generates: $\epsilon$, $()$, $()()$, $(())$, $()(())$, etc.

Example: Grammar for arithmetic expressions:

$E \to E + T \mid T$
$T \to T * F \mid F$
$F \to (E) \mid \text{id}$

This grammar encodes operator precedence: multiplication binds tighter than addition because T (terms with *) is lower in the derivation tree than E (expressions with +).
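To make the precedence claim concrete, here is a leftmost derivation of $\text{id} + \text{id} * \text{id}$ in this grammar. The $+$ is introduced at the root, while the $*$ only appears later, inside a $T$, so it sits deeper in the parse tree:

```latex
E \Rightarrow E + T \Rightarrow T + T \Rightarrow F + T \Rightarrow \text{id} + T
  \Rightarrow \text{id} + T * F \Rightarrow \text{id} + F * F
  \Rightarrow \text{id} + \text{id} * F \Rightarrow \text{id} + \text{id} * \text{id}
```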

The Chomsky Hierarchy

Each level of the hierarchy strictly contains the ones below it:

  • Type 0: Recursively enumerable (Turing machines)
  • Type 1: Context-sensitive
  • Type 2: Context-free (CFGs, PDAs): programming languages, balanced parens, $a^n b^n$
  • Type 3: Regular (DFAs, regex): identifiers, keywords, patterns like $a^*b^*$, $(ab)^+$

Pushdown Automata

A pushdown automaton (PDA) is a finite automaton with a stack. The stack gives it the ability to count and match nested structures. A PDA can push symbols onto the stack, pop them off, and make transitions based on the current state, input symbol, and top of the stack.

Key equivalence: A language is context-free if and only if some PDA recognizes it.

For balanced parentheses: push on (, pop on ). If the stack is empty at the end, accept.
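As a sketch of how such a machine runs, here is a hypothetical Python simulation of a one-state PDA for balanced parentheses. The transition table `delta` maps (state, input symbol, stack top) to (next state, replacement for the stack top); `Z` is the initial stack symbol, and the machine accepts when only `Z` remains at end of input:

```python
def run_pda(s: str) -> bool:
    """Simulate a single-state PDA for balanced parentheses."""
    delta = {
        ('q', '(', 'Z'): ('q', ['(', 'Z']),  # push '(' on the bottom symbol
        ('q', '(', '('): ('q', ['(', '(']),  # push '(' on another '('
        ('q', ')', '('): ('q', []),          # pop a matching '('
    }
    state, stack = 'q', ['Z']                # stack top is stack[0]
    for ch in s:
        key = (state, ch, stack[0])
        if key not in delta:
            return False                     # no transition: reject
        state, replacement = delta[key]
        stack = replacement + stack[1:]      # replace the top with new symbols
    return stack == ['Z']                    # accept: every paren was matched
```

Note the machine rejects `)(` immediately: there is no transition for `)` when the stack top is `Z`.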

Parsing Strategies

Top-Down Parsing

  • Start from $S$, expand rules
  • LL(1), recursive descent
  • Predictive, no backtracking
  • Used in hand-written parsers

Bottom-Up Parsing

  • Start from the input, reduce to $S$
  • LR(1), LALR(1), SLR
  • Shift-reduce, more powerful
  • Used in parser generators (yacc)

Complexity Analysis

| Parsing Algorithm | Time | Grammar Restriction |
| --- | --- | --- |
| CYK (general CFG) | $O(n^3 \cdot \lvert G \rvert)$ | Any CFG in CNF |
| Earley parser | $O(n^3)$ general, $O(n^2)$ unambiguous, $O(n)$ for LR grammars | Any CFG |
| LL(1) recursive descent | $O(n)$ | LL(1) grammars only |
| LR(1) / LALR(1) | $O(n)$ | LR(1) / LALR(1) grammars |
| PEG / Packrat | $O(n)$ with memoization | PEGs (no ambiguity by design) |

Most programming language parsers use $O(n)$ algorithms (LL or LR) because programming language grammars are designed to be parseable in linear time.

Implementation

ALGORITHM RecursiveDescentParse(tokens)
INPUT: tokens: list of terminal symbols
OUTPUT: parse tree or ERROR
BEGIN
  pos <- 0

  FUNCTION ParseExpr()
    node <- ParseTerm()
    WHILE pos < LENGTH(tokens) AND tokens[pos] = '+' DO
      pos <- pos + 1
      right <- ParseTerm()
      node <- MakeNode('+', node, right)
    END WHILE
    RETURN node
  END FUNCTION

  FUNCTION ParseTerm()
    node <- ParseFactor()
    WHILE pos < LENGTH(tokens) AND tokens[pos] = '*' DO
      pos <- pos + 1
      right <- ParseFactor()
      node <- MakeNode('*', node, right)
    END WHILE
    RETURN node
  END FUNCTION

  FUNCTION ParseFactor()
    IF pos < LENGTH(tokens) AND tokens[pos] = '(' THEN
      pos <- pos + 1
      node <- ParseExpr()
      EXPECT tokens[pos] = ')'
      pos <- pos + 1
      RETURN node
    ELSE
      node <- MakeLeaf(tokens[pos])
      pos <- pos + 1
      RETURN node
    END IF
  END FUNCTION

  tree <- ParseExpr()
  EXPECT pos = LENGTH(tokens)
  RETURN tree
END
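The pseudocode translates almost line-for-line into a runnable sketch. This Python version is my own illustration: parse-tree nodes are nested tuples, and the end-of-input checks are made explicit. The while loops both eliminate the grammar's left recursion and make `+` and `*` left-associative:

```python
def parse(tokens):
    """Recursive-descent parser for E -> E+T | T, T -> T*F | F, F -> (E) | id."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_expr():
        nonlocal pos
        node = parse_term()
        while peek() == '+':
            pos += 1
            node = ('+', node, parse_term())
        return node

    def parse_term():
        nonlocal pos
        node = parse_factor()
        while peek() == '*':
            pos += 1
            node = ('*', node, parse_factor())
        return node

    def parse_factor():
        nonlocal pos
        if peek() == '(':
            pos += 1
            node = parse_expr()
            if peek() != ')':
                raise SyntaxError("expected ')'")
            pos += 1
            return node
        if peek() is None:
            raise SyntaxError('unexpected end of input')
        node = tokens[pos]          # leaf: an identifier or number
        pos += 1
        return node

    tree = parse_expr()
    if pos != len(tokens):          # reject trailing garbage
        raise SyntaxError(f'unexpected token {tokens[pos]!r}')
    return tree
```

Running it on `['id', '+', 'id', '*', 'id']` yields `('+', 'id', ('*', 'id', 'id'))`: the multiplication is nested under the addition, exactly the precedence the grammar encodes.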

ALGORITHM CYKParse(grammar, input)
INPUT: grammar: CFG in Chomsky Normal Form, input: string of length n
OUTPUT: TRUE if input is in the language
BEGIN
  table <- n x n matrix of empty sets

  FOR i <- 1 TO n DO
    FOR EACH rule A -> a WHERE a = input[i] DO
      ADD A TO table[i][i]
    END FOR
  END FOR

  FOR length <- 2 TO n DO
    FOR i <- 1 TO n - length + 1 DO
      j <- i + length - 1
      FOR k <- i TO j - 1 DO
        FOR EACH rule A -> BC DO
          IF B IN table[i][k] AND C IN table[k+1][j] THEN
            ADD A TO table[i][j]
          END IF
        END FOR
      END FOR
    END FOR
  END FOR

  RETURN start_symbol IN table[1][n]
END
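CYK also transcribes directly into Python. In this sketch (the grammar encoding, a dict from non-terminal to tuple bodies, and the CNF grammar for nonempty balanced parentheses used below are my own illustration), table cells are 0-indexed, so `table[i][j]` holds the non-terminals deriving `s[i..j]`:

```python
def cyk(grammar, start, s):
    """CYK membership test for a CFG in Chomsky Normal Form.

    grammar maps a non-terminal to a list of bodies; each body is
    a 1-tuple (terminal,) or a 2-tuple (B, C) of non-terminals.
    """
    n = len(s)
    if n == 0:
        return False  # the empty string needs a separate check in CNF
    table = [[set() for _ in range(n)] for _ in range(n)]

    # Length-1 spans: rules A -> a
    for i, ch in enumerate(s):
        for head, bodies in grammar.items():
            if (ch,) in bodies:
                table[i][i].add(head)

    # Longer spans: try every rule A -> B C at every split point k
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for head, bodies in grammar.items():
                    for body in bodies:
                        if len(body) == 2:
                            B, C = body
                            if B in table[i][k] and C in table[k + 1][j]:
                                table[i][j].add(head)

    return start in table[0][n - 1]

# A CNF grammar for nonempty balanced parentheses (illustrative):
# S -> LR | LT | SS,  T -> SR,  L -> '(',  R -> ')'
parens = {
    'S': [('L', 'R'), ('L', 'T'), ('S', 'S')],
    'T': [('S', 'R')],
    'L': [('(',)],
    'R': [(')',)],
}
```

For example, `cyk(parens, 'S', '(())')` returns `True` while `cyk(parens, 'S', '(()')` returns `False`.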

Real-World Applications

  • Compilers and interpreters: every programming language has a CFG defining its syntax; the parser (front end) uses this grammar to build an abstract syntax tree from source code
  • JSON and XML parsing: JSON's grammar is context-free, and the nested element structure of XML is handled with the same CFG-based techniques; parsers use them to validate matched braces and nested tags
  • SQL query parsing: database engines parse SQL statements using CFGs to build query execution plans
  • Natural language processing: probabilistic CFGs model sentence structure in human languages, enabling syntactic parsing for machine translation and information extraction
  • IDE features: syntax highlighting, code folding, and auto-indentation rely on grammar-aware parsing to understand code structure in real time

Key Takeaways

  • Context-free grammars are strictly more powerful than regular expressions: they can express nested and recursive structures like balanced parentheses and programming language syntax
  • Pushdown automata (finite automata + stack) are the machine model equivalent to CFGs
  • The Chomsky hierarchy classifies languages into four levels: regular, context-free, context-sensitive, and recursively enumerable
  • Practical parsers (LL, LR) run in $O(n)$ time by restricting the grammar to avoid ambiguity; general CFG parsing (CYK, Earley) takes $O(n^3)$
  • Every compiler, interpreter, and structured data parser you use is built on context-free grammar theory