Finite Automata and Regular Languages
DFAs, NFAs, regular expressions under the hood, the equivalence between them, and the Pumping Lemma for proving what finite automata cannot do.
Terminology
| Term | Definition |
|---|---|
| Finite automaton (FA) | A machine with a finite number of states that reads input one symbol at a time and either accepts or rejects the input string |
| DFA (Deterministic Finite Automaton) | An FA where each state has exactly one transition for each input symbol; no ambiguity in which state to move to |
| NFA (Nondeterministic Finite Automaton) | An FA where a state may have zero, one, or multiple transitions for the same symbol, plus epsilon transitions (moves without consuming input) |
| Regular language | A language (set of strings) that can be recognized by a finite automaton, equivalently described by a regular expression |
| Regular expression (regex) | A pattern notation using concatenation, union ($|$), and Kleene star ($*$) to describe regular languages |
| Epsilon transition ($\epsilon$) | A transition in an NFA that moves between states without consuming any input symbol |
| Pumping Lemma | A theorem stating that all sufficiently long strings in a regular language can be "pumped" (a middle section repeated) and remain in the language; used to prove a language is not regular |
| Alphabet ($\Sigma$) | The finite set of symbols from which input strings are formed, e.g. $\Sigma = \{0, 1\}$ |
| Accept state (final state) | A state that, if the machine is in it after reading the entire input, causes the machine to accept the string |
What & Why
Every time you use a regular expression to validate an email, match a URL pattern, or search text, you are using a finite automaton under the hood. Regex engines compile your pattern into a state machine that processes input one character at a time.
Finite automata are the simplest model of computation. They have no memory beyond their current state (no stack, no tape). This limitation is also their strength: they are fast ($O(n)$ for input of length $n$), predictable, and easy to reason about. They define the class of regular languages, which includes patterns like "strings ending in 01," "strings with an even number of a's," and "valid identifiers."
Understanding finite automata matters because they draw the boundary between what pattern matching can and cannot do. The Pumping Lemma gives you a tool to prove that certain languages (like balanced parentheses or $\{a^n b^n\}$) are beyond the reach of any regex or finite automaton.
How It Works
DFA: Deterministic Finite Automaton
A DFA is defined by five components: $(Q, \Sigma, \delta, q_0, F)$ where $Q$ is a finite set of states, $\Sigma$ is the input alphabet, $\delta: Q \times \Sigma \to Q$ is the transition function, $q_0$ is the start state, and $F \subseteq Q$ is the set of accept states.
Example: a DFA that accepts binary strings ending in "01". Use three states: $q_0$ (no useful suffix yet), $q_1$ (last symbol was 0), and $q_2$ (last two symbols were "01"; accept). On input 0, every state moves to $q_1$; on input 1, $q_1$ moves to $q_2$, while $q_0$ and $q_2$ move to $q_0$.
On input "1101": $q_0 \xrightarrow{1} q_0 \xrightarrow{1} q_0 \xrightarrow{0} q_1 \xrightarrow{1} q_2$ (accept).
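This DFA is small enough to encode directly. A minimal sketch in Python (the dict-based encoding and the state names `q0`/`q1`/`q2` are choices of this example, not part of any standard API):

```python
# DFA accepting binary strings that end in "01".
# States: "q0" (no progress), "q1" (last symbol was 0), "q2" (accept: just saw "01").
DFA_END_01 = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q1", ("q2", "1"): "q0",
}

def accepts(transitions, start, accept, s):
    """Run the DFA on s and report whether it halts in an accept state."""
    state = start
    for symbol in s:
        state = transitions[(state, symbol)]
    return state in accept

print(accepts(DFA_END_01, "q0", {"q2"}, "1101"))  # True
print(accepts(DFA_END_01, "q0", {"q2"}, "10"))    # False
```

Note the loop touches each input symbol exactly once and keeps only one variable, matching the $O(n)$ time / $O(1)$ space entries in the complexity table below.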
NFA: Nondeterministic Finite Automaton
An NFA allows multiple transitions from a state on the same symbol and epsilon transitions. It accepts if any possible path through the machine reaches an accept state.
NFAs are often easier to design but can be converted to an equivalent DFA using the subset construction algorithm. The resulting DFA may have up to $2^{|Q|}$ states in the worst case.
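The subset construction can be sketched in a few lines: each DFA state is a *set* of NFA states, and we explore only the sets actually reachable from the start (epsilon moves are omitted here for brevity; they would be handled by taking epsilon closures, as in the NFA simulation later in this section):

```python
from collections import deque

def subset_construction(nfa_delta, alphabet, start, accept):
    """Convert an NFA (without epsilon moves) into an equivalent DFA.

    nfa_delta: dict mapping (state, symbol) -> set of successor states.
    Each DFA state is a frozenset of NFA states."""
    start_set = frozenset({start})
    dfa_delta = {}
    dfa_accept = set()
    queue = deque([start_set])
    seen = {start_set}
    while queue:
        current = queue.popleft()
        # A subset-state accepts if it contains any NFA accept state.
        if current & accept:
            dfa_accept.add(current)
        for symbol in alphabet:
            nxt = frozenset(t for q in current
                            for t in nfa_delta.get((q, symbol), ()))
            dfa_delta[(current, symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return dfa_delta, start_set, dfa_accept
```

Because only reachable subsets are enumerated, the output is often far smaller than the $2^{|Q|}$ worst case.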
The Equivalence Triangle
DFAs, NFAs, and regular expressions are all equivalent in power. They all recognize exactly the class of regular languages.
The Pumping Lemma
The Pumping Lemma is a proof tool for showing a language is NOT regular. It states: if $L$ is regular, then there exists a pumping length $p$ such that any string $s \in L$ with $|s| \ge p$ can be split as $s = xyz$ where:
- $|y| > 0$ (the pumped part is non-empty)
- $|xy| \le p$ (the pump is near the start)
- For all $i \ge 0$: $xy^iz \in L$ (pumping $y$ any number of times stays in $L$)
Example proof: $L = \{a^n b^n \mid n \ge 0\}$ is not regular.
Assume $L$ is regular with pumping length $p$. Take $s = a^p b^p$. By condition 2, $|xy| \le p$, and the first $p$ symbols of $s$ are all a's, so $y$ consists only of a's. Pumping $y$ twice gives $a^{p+|y|}b^p$, which has more a's than b's, so it is not in $L$. Contradiction. Therefore $L$ is not regular.
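The argument can be checked mechanically for a concrete $p$: enumerate every legal split of $a^p b^p$ and confirm that pumping breaks membership. This brute-force check is an illustration of the proof, not a substitute for it (the function names are this example's own):

```python
def in_L(s):
    """Membership test for L = { a^n b^n : n >= 0 }."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

def no_split_survives(s, p):
    """True if EVERY split s = xyz with |xy| <= p and |y| > 0
    has some i >= 0 with x + y*i + z not in L."""
    for xy_len in range(1, p + 1):          # |xy| <= p
        for y_len in range(1, xy_len + 1):  # |y| > 0
            x = s[:xy_len - y_len]
            y = s[xy_len - y_len:xy_len]
            z = s[xy_len:]
            # If pumping with i = 0 and i = 2 both stay in L,
            # this split survives and the check fails.
            if all(in_L(x + y * i + z) for i in (0, 2)):
                return False
    return True

p = 5
print(no_split_survives("a" * p + "b" * p, p))  # True: no split survives
```

Every split places $y$ inside the block of a's, so pumping unbalances the string, exactly as the proof argues.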
Complexity Analysis
| Operation | Time | Space |
|---|---|---|
| DFA simulation on input of length $n$ | $O(n)$ | $O(1)$ (just current state) |
| NFA simulation on input of length $n$ | $O(n \cdot |Q|)$ | $O(|Q|)$ (set of active states) |
| NFA to DFA conversion (subset construction) | $O(2^{|Q|} \cdot |\Sigma|)$ worst case | $O(2^{|Q|})$ states |
| Regex to NFA (Thompson's construction) | $O(m)$ where $m$ is regex length | $O(m)$ states |
| DFA minimization | $O(|Q| \cdot |\Sigma| \cdot \log |Q|)$ | $O(|Q|)$ |
Implementation
STRUCTURE DFA
states: set of state identifiers
alphabet: set of symbols
transition: map from (state, symbol) to state
start: initial state
accept: set of accepting states
END STRUCTURE
ALGORITHM SimulateDFA(dfa, input)
INPUT: dfa: DFA structure, input: string of symbols
OUTPUT: ACCEPT or REJECT
BEGIN
current <- dfa.start
FOR EACH symbol IN input DO
current <- dfa.transition(current, symbol)
END FOR
IF current IN dfa.accept THEN
RETURN ACCEPT
ELSE
RETURN REJECT
END IF
END
ALGORITHM SimulateNFA(nfa, input)
INPUT: nfa: NFA structure, input: string of symbols
OUTPUT: ACCEPT or REJECT
BEGIN
currentStates <- EpsilonClosure({nfa.start})
FOR EACH symbol IN input DO
nextStates <- empty set
FOR EACH state IN currentStates DO
nextStates <- nextStates UNION nfa.transition(state, symbol)
END FOR
currentStates <- EpsilonClosure(nextStates)
END FOR
IF currentStates INTERSECT nfa.accept is not empty THEN
RETURN ACCEPT
ELSE
RETURN REJECT
END IF
END
ALGORITHM EpsilonClosure(states)
INPUT: states: set of NFA states
OUTPUT: set of all states reachable via epsilon transitions
BEGIN
closure <- copy of states
stack <- list of all states in states
WHILE stack is not empty DO
s <- POP(stack)
FOR EACH state t reachable from s via epsilon transition DO
IF t NOT IN closure THEN
closure <- closure UNION {t}
PUSH(stack, t)
END IF
END FOR
END WHILE
RETURN closure
END
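The pseudocode above translates directly into Python. A minimal sketch, using dicts of sets to encode the transition and epsilon relations (that encoding is this example's choice):

```python
def epsilon_closure(states, eps):
    """All states reachable from `states` via zero or more epsilon moves.
    eps: dict mapping state -> set of epsilon-successors."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def simulate_nfa(delta, eps, start, accept, s):
    """Track the set of active states; accept if any accept state is
    active after the whole input is consumed.
    delta: dict mapping (state, symbol) -> set of successor states."""
    current = epsilon_closure({start}, eps)
    for symbol in s:
        moved = set()
        for q in current:
            moved |= delta.get((q, symbol), set())
        current = epsilon_closure(moved, eps)
    return bool(current & accept)

# NFA for binary strings ending in "01": state 0 nondeterministically
# guesses when the final "01" begins.
delta = {(0, "0"): {0, 1}, (0, "1"): {0}, (1, "1"): {2}}
print(simulate_nfa(delta, {}, 0, {2}, "1101"))  # True
```

Per input symbol the work is bounded by the number of active states, which matches the $O(n \cdot |Q|)$ time / $O(|Q|)$ space row of the complexity table above.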
Real-World Applications
- Regular expression engines: programming languages compile regex patterns into NFAs or DFAs for string matching; the choice between NFA simulation and DFA compilation affects performance characteristics
- Lexical analysis: compilers use DFAs to tokenize source code, recognizing keywords, identifiers, numbers, and operators in a single pass over the input
- Network packet filtering: firewalls and intrusion detection systems use DFA-based pattern matching to inspect packet payloads at wire speed
- Protocol validation: network protocols (TCP state machine, HTTP request parsing) are modeled as finite automata to verify correct behavior
- Hardware design: digital circuits like traffic light controllers, vending machines, and elevator controllers are designed as finite state machines
Key Takeaways
- DFAs, NFAs, and regular expressions are equivalent in power: they all recognize exactly the regular languages
- DFAs process input in $O(n)$ time with $O(1)$ space, making them ideal for high-performance pattern matching
- NFAs are easier to construct but may require exponential blowup when converted to DFAs via subset construction
- The Pumping Lemma proves that certain languages (like $\{a^n b^n\}$) are not regular, establishing the limits of finite automata
- Every regex you write is secretly a finite automaton, and understanding this connection helps you write more efficient patterns