Notes/Topological Sort, Strongly Connected Components, and Tarjan's Algorithm

Topological Sort, Strongly Connected Components, and Tarjan's Algorithm

Advanced graph algorithms for directed graphs: ordering dependencies with topological sort and finding strongly connected components with Tarjan's algorithm.

2021-04-09AI-Synthesized from Personal Notes

AlgorithmsGraphsTopological SortScc

Terminology

Directed Acyclic Graph (DAG): a directed graph with no cycles. Topological sort is only defined for DAGs.
Topological sort: a linear ordering of vertices in a DAG such that for every directed edge $(u, v)$, vertex $u$ appears before $v$ in the ordering.
In-degree: the number of incoming edges to a vertex. In Kahn's algorithm, vertices with in-degree 0 have no unprocessed dependencies.
Strongly connected component (SCC): a maximal set of vertices in a directed graph where every vertex is reachable from every other vertex in the set.
Component graph (condensation): the DAG formed by contracting each SCC into a single vertex. This graph is always acyclic.
DFS discovery time: the timestamp when DFS first visits a vertex. Used by Tarjan's algorithm to identify SCCs.
Low-link value: the smallest discovery time reachable from a vertex through the DFS subtree and back edges. In Tarjan's algorithm, a vertex is the root of an SCC if its low-link equals its discovery time.
Back edge: an edge from a vertex to one of its ancestors in the DFS tree. The presence of a back edge indicates a cycle in a directed graph.
Cross edge: an edge between vertices in different DFS subtrees. Tarjan's algorithm must distinguish these from back edges.

What & Why

Directed graphs model dependencies, workflows, and relationships where direction matters. Two fundamental questions arise:

What order should I process things? If tasks have dependencies (task A must finish before task B), topological sort gives a valid execution order.
Which parts of the graph are mutually reachable? Strongly connected components identify groups of vertices where every vertex can reach every other, revealing tightly coupled subsystems.

Topological sort is essential for:

Build systems (compile dependencies in the right order)
Course scheduling (prerequisites before advanced courses)
Package managers (install dependencies before dependents)

SCC decomposition is essential for:

Compiler optimization (identifying loops in control flow graphs)
Social network analysis (finding tightly connected communities)
2-SAT solving (a classic reduction from boolean satisfiability to SCC)

How It Works

Topological Sort (Kahn's Algorithm, BFS-based)

Kahn's algorithm uses in-degrees:

Compute the in-degree of every vertex
Add all vertices with in-degree 0 to a queue
While the queue is not empty: remove a vertex, add it to the result, and decrement the in-degree of all its neighbors. If any neighbor's in-degree becomes 0, add it to the queue.
If the result contains all vertices, the graph is a DAG. Otherwise, a cycle exists.

Topological Sort (DFS-based)

The DFS approach runs a depth-first search and adds each vertex to the front of the result list when its DFS call finishes (post-order). The reverse post-order is a valid topological sort.

Tarjan's Algorithm for SCCs

Tarjan's algorithm finds all SCCs in a single DFS pass:

Maintain a stack of vertices in the current DFS path
For each vertex, track its discovery time and low-link value
When DFS finishes a vertex: if its low-link equals its discovery time, it is the root of an SCC. Pop all vertices from the stack down to (and including) this vertex; they form one SCC.

The low-link value for a vertex $v$ is the smallest discovery time reachable from $v$ through the DFS subtree and back edges to vertices still on the stack.

Complexity Analysis

Algorithm	Time	Space	Notes
Topological Sort (Kahn's)	$O(V + E)$	$O(V + E)$	BFS-based, detects cycles
Topological Sort (DFS)	$O(V + E)$	$O(V + E)$	DFS-based, simpler code
Tarjan's SCC	$O(V + E)$	$O(V)$	Single DFS pass
Kosaraju's SCC	$O(V + E)$	$O(V + E)$	Two DFS passes, needs transpose graph

Why $O(V + E)$?

All four algorithms are based on DFS or BFS, which visits every vertex once and traverses every edge once:

$T = O(V + E)$

This is optimal: you must examine every vertex and edge at least once to determine the structure of the graph.

SCC Count Bounds

A directed graph with $V$ vertices has between 1 and $V$ strongly connected components. If the graph is strongly connected (one big SCC), there is 1 component. If the graph is a DAG (no cycles at all), there are $V$ components (each vertex is its own SCC).

Implementation

Topological Sort (Kahn's Algorithm)

ALGORITHM KahnTopologicalSort(graph, V)
  INPUT: directed graph as adjacency list, V vertices
  OUTPUT: topological ordering, or report cycle

  CREATE inDegree array of size V, initialized to 0

  FOR EACH vertex u DO
    FOR EACH neighbor v of u DO
      inDegree[v] = inDegree[v] + 1
    END FOR
  END FOR

  CREATE queue Q
  FOR EACH vertex u DO
    IF inDegree[u] = 0 THEN
      ENQUEUE u into Q
    END IF
  END FOR

  result = empty list
  count = 0

  WHILE Q is not empty DO
    u = DEQUEUE from Q
    APPEND u to result
    count = count + 1

    FOR EACH neighbor v of u DO
      inDegree[v] = inDegree[v] - 1
      IF inDegree[v] = 0 THEN
        ENQUEUE v into Q
      END IF
    END FOR
  END WHILE

  IF count != V THEN
    REPORT "Graph contains a cycle"
  END IF

  RETURN result
END

Topological Sort (DFS-based)

ALGORITHM DFSTopologicalSort(graph, V)
  INPUT: directed graph as adjacency list, V vertices
  OUTPUT: topological ordering

  CREATE visited array of size V, set all to FALSE
  CREATE stack S (for result)

  FOR EACH vertex u DO
    IF NOT visited[u] THEN
      DFSVisit(graph, u, visited, S)
    END IF
  END FOR

  RETURN S (reversed, or read from top to bottom)
END

ALGORITHM DFSVisit(graph, u, visited, S)
  visited[u] = TRUE

  FOR EACH neighbor v of u DO
    IF NOT visited[v] THEN
      DFSVisit(graph, v, visited, S)
    END IF
  END FOR

  PUSH u onto S   // add to result after all descendants are processed
END

Tarjan's SCC Algorithm

ALGORITHM TarjanSCC(graph, V)
  INPUT: directed graph as adjacency list, V vertices
  OUTPUT: list of strongly connected components

  CREATE disc array of size V, set all to -1
  CREATE low array of size V, set all to -1
  CREATE onStack array of size V, set all to FALSE
  CREATE stack S
  timer = 0
  sccs = empty list

  FOR EACH vertex u DO
    IF disc[u] = -1 THEN
      TarjanDFS(graph, u, disc, low, onStack, S, timer, sccs)
    END IF
  END FOR

  RETURN sccs
END

ALGORITHM TarjanDFS(graph, u, disc, low, onStack, S, timer, sccs)
  disc[u] = timer
  low[u] = timer
  timer = timer + 1
  PUSH u onto S
  onStack[u] = TRUE

  FOR EACH neighbor v of u DO
    IF disc[v] = -1 THEN
      // v not yet visited
      TarjanDFS(graph, v, disc, low, onStack, S, timer, sccs)
      low[u] = MIN(low[u], low[v])
    ELSE IF onStack[v] THEN
      // v is on the stack, so it is in the current SCC
      low[u] = MIN(low[u], disc[v])
    END IF
  END FOR

  // If u is the root of an SCC
  IF low[u] = disc[u] THEN
    component = empty list
    REPEAT
      w = POP from S
      onStack[w] = FALSE
      APPEND w to component
    UNTIL w = u
    APPEND component to sccs
  END IF
END

Real-World Applications

Build systems: Make, Gradle, and Bazel use topological sort to determine the correct compilation order based on file dependencies.
Package managers: npm, pip, and apt resolve dependency graphs using topological sort to install packages in the right order.
Course scheduling: universities use topological sort to ensure prerequisite courses are taken before advanced ones.
Compiler optimization: SCC detection in control flow graphs identifies natural loops, enabling loop optimizations like unrolling and vectorization.
2-SAT solving: the 2-SAT problem (a restricted form of boolean satisfiability) is solvable in linear time by reducing it to SCC detection on an implication graph.
Web crawling: search engines use SCC analysis to identify clusters of tightly interlinked pages, which can indicate link farms or topical communities.

Key Takeaways

Topological sort produces a valid dependency ordering for DAGs in $O(V + E)$ time. If the graph has a cycle, no topological sort exists.
Kahn's algorithm (BFS with in-degrees) naturally detects cycles; the DFS approach is more concise but requires separate cycle detection
Strongly connected components partition a directed graph into maximal groups of mutually reachable vertices
Tarjan's algorithm finds all SCCs in a single DFS pass using discovery times and low-link values
The condensation (component graph) of any directed graph is always a DAG, enabling topological sort on the SCC level

2021-04-01

Comparison-Based Sorting: Merge Sort, Quick Sort, and Heap Sort

A deep dive into the three most important comparison-based sorting algorithms, their mechanics, and when to reach for each one.

2021-04-02

Non-Comparison Sorting: Counting Sort, Radix Sort, and Bucket Sort

Breaking the n log n barrier by exploiting structure in the data, with Counting Sort, Radix Sort, and Bucket Sort.

2021-04-03

Binary Search and Its Variants

Mastering the classic binary search algorithm and its powerful variants: lower bound, upper bound, and searching rotated arrays.

Topological Sort, Strongly Connected Components, and Tarjan's Algorithm

Advanced graph algorithms for directed graphs: ordering dependencies with topological sort and finding strongly connected components with Tarjan's algorithm.

Terminology

What & Why

How It Works

Topological Sort (Kahn's Algorithm, BFS-based)

Topological Sort (DFS-based)

Tarjan's Algorithm for SCCs

Complexity Analysis

Why $O(V + E)$?

SCC Count Bounds

Implementation

Topological Sort (Kahn's Algorithm)

Topological Sort (DFS-based)

Tarjan's SCC Algorithm

Real-World Applications

Key Takeaways

Read More