Finite Automata and Regular Languages
DFAs, NFAs, regular expressions under the hood, the equivalence between them, and the Pumping Lemma for proving what finite automata cannot do.
Terminology
| Term | Definition |
|---|---|
| Finite automaton (FA) | A machine with a finite number of states that reads input one symbol at a time and either accepts or rejects the input string |
| DFA (Deterministic Finite Automaton) | An FA where each state has exactly one transition for each input symbol; no ambiguity in which state to move to |
| NFA (Nondeterministic Finite Automaton) | An FA where a state may have zero, one, or multiple transitions for the same symbol, plus epsilon transitions (moves without consuming input) |
| Regular language | A language (set of strings) that can be recognized by a finite automaton, equivalently described by a regular expression |
| Regular expression (regex) | A pattern notation using concatenation, union ($|$), and Kleene star ($*$) to describe regular languages |
| Epsilon transition ($\epsilon$) | A transition in an NFA that moves between states without consuming any input symbol |
| Pumping Lemma | A theorem stating that all sufficiently long strings in a regular language can be "pumped" (a middle section repeated) and remain in the language; used to prove a language is not regular |
| Alphabet ($\Sigma$) | The finite set of symbols from which input strings are formed, e.g. $\Sigma = \{0, 1\}$ |
| Accept state (final state) | A state that, if the machine is in it after reading the entire input, causes the machine to accept the string |
What & Why
Every time you use a regular expression to validate an email, match a URL pattern, or search text, you are using a finite automaton under the hood. Regex engines compile your pattern into a state machine that processes input one character at a time.
Finite automata are the simplest model of computation. They have no memory beyond their current state (no stack, no tape). This limitation is also their strength: they are fast ($O(n)$ for input of length $n$), predictable, and easy to reason about. They define the class of regular languages, which includes patterns like "strings ending in 01," "strings with an even number of a's," and "valid identifiers."
Understanding finite automata matters because they draw the boundary between what pattern matching can and cannot do. The Pumping Lemma gives you a tool to prove that certain languages (like balanced parentheses or $\{a^n b^n\}$) are beyond the reach of any regex or finite automaton.
How It Works
DFA: Deterministic Finite Automaton
A DFA is defined by five components: $(Q, \Sigma, \delta, q_0, F)$ where $Q$ is a finite set of states, $\Sigma$ is the input alphabet, $\delta: Q \times \Sigma \to Q$ is the transition function, $q_0$ is the start state, and $F \subseteq Q$ is the set of accept states.
Example: a DFA that accepts binary strings ending in "01". Use three states: $q_0$ (no useful suffix yet), $q_1$ (last symbol was 0), and $q_2$ (last two symbols were "01"; accept). On input 0, every state moves to $q_1$; on input 1, $q_1$ moves to $q_2$, while $q_0$ and $q_2$ move to $q_0$.
On input "1101": $q_0 \xrightarrow{1} q_0 \xrightarrow{1} q_0 \xrightarrow{0} q_1 \xrightarrow{1} q_2$ (accept).
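This DFA is small enough to encode directly. A minimal sketch in Python (the dict-based encoding and the state names `q0`/`q1`/`q2` are choices of this example, not part of any standard API):

```python
# DFA accepting binary strings that end in "01".
# States: "q0" (no progress), "q1" (last symbol was 0), "q2" (accept: just saw "01").
DFA_END_01 = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q1", ("q2", "1"): "q0",
}

def accepts(transitions, start, accept, s):
    """Run the DFA on s and report whether it halts in an accept state."""
    state = start
    for symbol in s:
        state = transitions[(state, symbol)]
    return state in accept

print(accepts(DFA_END_01, "q0", {"q2"}, "1101"))  # True
print(accepts(DFA_END_01, "q0", {"q2"}, "10"))    # False
```

Note the loop touches each input symbol exactly once and keeps only one variable, matching the $O(n)$ time / $O(1)$ space entries in the complexity table below.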
NFA: Nondeterministic Finite Automaton
An NFA allows multiple transitions from a state on the same symbol and epsilon transitions. It accepts if any possible path through the machine reaches an accept state.
NFAs are often easier to design but can be converted to an equivalent DFA using the subset construction algorithm. The resulting DFA may have up to $2^{|Q|}$ states in the worst case.
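The subset construction can be sketched in a few lines: each DFA state is a *set* of NFA states, and we explore only the sets actually reachable from the start (epsilon moves are omitted here for brevity; they would be handled by taking epsilon closures, as in the NFA simulation later in this section):

```python
from collections import deque

def subset_construction(nfa_delta, alphabet, start, accept):
    """Convert an NFA (without epsilon moves) into an equivalent DFA.

    nfa_delta: dict mapping (state, symbol) -> set of successor states.
    Each DFA state is a frozenset of NFA states."""
    start_set = frozenset({start})
    dfa_delta = {}
    dfa_accept = set()
    queue = deque([start_set])
    seen = {start_set}
    while queue:
        current = queue.popleft()
        # A subset-state accepts if it contains any NFA accept state.
        if current & accept:
            dfa_accept.add(current)
        for symbol in alphabet:
            nxt = frozenset(t for q in current
                            for t in nfa_delta.get((q, symbol), ()))
            dfa_delta[(current, symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return dfa_delta, start_set, dfa_accept
```

Because only reachable subsets are enumerated, the output is often far smaller than the $2^{|Q|}$ worst case.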
The Equivalence Triangle
DFAs, NFAs, and regular expressions are all equivalent in power. They all recognize exactly the class of regular languages.
The Pumping Lemma
The Pumping Lemma is a proof tool for showing a language is NOT regular. It states: if $L$ is regular, then there exists a pumping length $p$ such that any string $s \in L$ with $|s| \ge p$ can be split as $s = xyz$ where:
- $|y| > 0$ (the pumped part is non-empty)
- $|xy| \le p$ (the pump is near the start)
- For all $i \ge 0$: $xy^iz \in L$ (pumping $y$ any number of times stays in $L$)
Example proof: $L = \{a^n b^n \mid n \ge 0\}$ is not regular.
Assume $L$ is regular with pumping length $p$. Take $s = a^p b^p$. By condition 2, $|xy| \le p$, and the first $p$ symbols of $s$ are all a's, so $y$ consists only of a's. Pumping $y$ twice gives $a^{p+|y|}b^p$, which has more a's than b's, so it is not in $L$. Contradiction. Therefore $L$ is not regular.
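The argument can be checked mechanically for a concrete $p$: enumerate every legal split of $a^p b^p$ and confirm that pumping breaks membership. This brute-force check is an illustration of the proof, not a substitute for it (the function names are this example's own):

```python
def in_L(s):
    """Membership test for L = { a^n b^n : n >= 0 }."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

def no_split_survives(s, p):
    """True if EVERY split s = xyz with |xy| <= p and |y| > 0
    has some i >= 0 with x + y*i + z not in L."""
    for xy_len in range(1, p + 1):          # |xy| <= p
        for y_len in range(1, xy_len + 1):  # |y| > 0
            x = s[:xy_len - y_len]
            y = s[xy_len - y_len:xy_len]
            z = s[xy_len:]
            # If pumping with i = 0 and i = 2 both stay in L,
            # this split survives and the check fails.
            if all(in_L(x + y * i + z) for i in (0, 2)):
                return False
    return True

p = 5
print(no_split_survives("a" * p + "b" * p, p))  # True: no split survives
```

Every split places $y$ inside the block of a's, so pumping unbalances the string, exactly as the proof argues.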
Complexity Analysis
| Operation | Time | Space |
|---|---|---|
| DFA simulation on input of length $n$ | $O(n)$ | $O(1)$ (just current state) |
| NFA simulation on input of length $n$ | $O(n \cdot |Q|)$ | $O(|Q|)$ (set of active states) |
| NFA to DFA conversion (subset construction) | $O(2^{|Q|} \cdot |\Sigma|)$ worst case | $O(2^{|Q|})$ states |
| Regex to NFA (Thompson's construction) | $O(m)$ where $m$ is regex length | $O(m)$ states |
| DFA minimization | $O(|Q| \cdot |\Sigma| \cdot \log |Q|)$ | $O(|Q|)$ |
Implementation
STRUCTURE DFA
states: set of state identifiers
alphabet: set of symbols
transition: map from (state, symbol) to state
start: initial state
accept: set of accepting states
END STRUCTURE
ALGORITHM SimulateDFA(dfa, input)
INPUT: dfa: DFA structure, input: string of symbols
OUTPUT: ACCEPT or REJECT
BEGIN
current <- dfa.start
FOR EACH symbol IN input DO
current <- dfa.transition(current, symbol)
END FOR
IF current IN dfa.accept THEN
RETURN ACCEPT
ELSE
RETURN REJECT
END IF
END
ALGORITHM SimulateNFA(nfa, input)
INPUT: nfa: NFA structure, input: string of symbols
OUTPUT: ACCEPT or REJECT
BEGIN
currentStates <- EpsilonClosure({nfa.start})
FOR EACH symbol IN input DO
nextStates <- empty set
FOR EACH state IN currentStates DO
nextStates <- nextStates UNION nfa.transition(state, symbol)
END FOR
currentStates <- EpsilonClosure(nextStates)
END FOR
IF currentStates INTERSECT nfa.accept is not empty THEN
RETURN ACCEPT
ELSE
RETURN REJECT
END IF
END
ALGORITHM EpsilonClosure(states)
INPUT: states: set of NFA states
OUTPUT: set of all states reachable via epsilon transitions
BEGIN
closure <- copy of states
stack <- list of all states in states
WHILE stack is not empty DO
s <- POP(stack)
FOR EACH state t reachable from s via epsilon transition DO
IF t NOT IN closure THEN
closure <- closure UNION {t}
PUSH(stack, t)
END IF
END FOR
END WHILE
RETURN closure
END
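The pseudocode above translates directly into Python. A minimal sketch, using dicts of sets to encode the transition and epsilon relations (that encoding is this example's choice):

```python
def epsilon_closure(states, eps):
    """All states reachable from `states` via zero or more epsilon moves.
    eps: dict mapping state -> set of epsilon-successors."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def simulate_nfa(delta, eps, start, accept, s):
    """Track the set of active states; accept if any accept state is
    active after the whole input is consumed.
    delta: dict mapping (state, symbol) -> set of successor states."""
    current = epsilon_closure({start}, eps)
    for symbol in s:
        moved = set()
        for q in current:
            moved |= delta.get((q, symbol), set())
        current = epsilon_closure(moved, eps)
    return bool(current & accept)

# NFA for binary strings ending in "01": state 0 nondeterministically
# guesses when the final "01" begins.
delta = {(0, "0"): {0, 1}, (0, "1"): {0}, (1, "1"): {2}}
print(simulate_nfa(delta, {}, 0, {2}, "1101"))  # True
```

Per input symbol the work is bounded by the number of active states, which matches the $O(n \cdot |Q|)$ time / $O(|Q|)$ space row of the complexity table above.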
Real-World Applications
- Regular expression engines: programming languages compile regex patterns into NFAs or DFAs for string matching; the choice between NFA simulation and DFA compilation affects performance characteristics
- Lexical analysis: compilers use DFAs to tokenize source code, recognizing keywords, identifiers, numbers, and operators in a single pass over the input
- Network packet filtering: firewalls and intrusion detection systems use DFA-based pattern matching to inspect packet payloads at wire speed
- Protocol validation: network protocols (TCP state machine, HTTP request parsing) are modeled as finite automata to verify correct behavior
- Hardware design: digital circuits like traffic light controllers, vending machines, and elevator controllers are designed as finite state machines
Key Takeaways
- DFAs, NFAs, and regular expressions are equivalent in power: they all recognize exactly the regular languages
- DFAs process input in $O(n)$ time with $O(1)$ space, making them ideal for high-performance pattern matching
- NFAs are easier to construct but may require exponential blowup when converted to DFAs via subset construction
- The Pumping Lemma proves that certain languages (like $\{a^n b^n\}$) are not regular, establishing the limits of finite automata
- Every regex you write is secretly a finite automaton, and understanding this connection helps you write more efficient patterns