Iterative DFS: Mark-on-Pop vs. Mark-on-Push — A Comparative Analysis

Published 2026-04-03 · Markdown source

Introduction

Depth-First Search (DFS) is one of the most fundamental graph traversal algorithms in computer science. While its recursive form is elegant and intuitive, practical applications often demand an iterative version using an explicit stack to avoid stack overflow on large inputs. However, there is a subtle but important design decision in iterative DFS that is rarely discussed in textbooks: when should a node be marked as visited — when it is pushed onto the stack, or when it is popped off?

This essay presents two iterative DFS strategies, analyzes their behavioral and performance differences, and reports benchmark results that reveal a surprising finding: the optimal strategy depends on the cost of the visited-set operations, not on the algorithm's asymptotic complexity.

The Two Strategies

Strategy V1: Mark on Pop

In this approach, a node is pushed onto the stack without being marked as visited. The visited check and marking happen only when the node is popped:

stack.push(start);
while (!stack.empty()) {
    int node = stack.pop();
    if (visited[node]) continue;   // check on pop
    visited[node] = true;          // mark on pop
    for (int neighbor : adj[node]) {
        if (!visited[neighbor]) {
            stack.push(neighbor);  // push without marking
        }
    }
}

Characteristics:

A node may be pushed onto the stack multiple times by different neighbors before it is first popped and marked visited.
When a duplicate is eventually popped, the visited check causes it to be skipped, but the pop itself still costs time.
The stack can grow larger than the number of unvisited nodes due to duplicates.

Strategy V2: Mark on Push

In this approach, a node is marked as visited immediately before it is pushed onto the stack. Since each node is marked before pushing, no node can be pushed twice, so no duplicate check is needed on pop:

visited[start] = true;             // mark before push
stack.push(start);
while (!stack.empty()) {
    int node = stack.pop();
    // no check needed — each node is pushed exactly once
    for (int neighbor : adj[node]) {
        if (!visited[neighbor]) {
            visited[neighbor] = true;   // mark before push
            stack.push(neighbor);
        }
    }
}

Characteristics:

Each node appears on the stack at most once.
No wasted pops or redundant neighbor scans.
The stack size is bounded by the number of unvisited nodes.
When a node is popped from the stack, it can be processed unconditionally.

Behavioral Differences

Duplicate Pushes in V1

Consider a simple diamond graph:

    A
   / \\
  B   C
   \\ /
    D

Starting DFS from A (stack is LIFO, so the last pushed node is popped first):

Push A. Stack: [A]
Pop A, mark visited. Push B, C. Stack: [B, C]
Pop C (top), mark visited. C's neighbor D is unvisited, push D. Stack: [B, D]
Pop D (top), mark visited. D's neighbor B is unvisited, push B. Stack: [B, B]
Pop B (top), mark visited. All neighbors visited. Stack: [B]
Pop B (bottom), skip — already visited.

In V1, node B is pushed twice and popped twice. The second pop is wasted work: B is already visited, but we still had to pop it and check. In V2, B would have been marked visited when first pushed (by A), so D would see B as visited and never push it again.

Redundant Neighbor Scanning

There is a worse variant of V1 found in some textbook implementations (notably the GeeksforGeeks iterative DFS example) where the neighbor-scanning loop runs unconditionally even for already-visited nodes:

s = stack.pop();
if (!visited[s]) {
    visited[s] = true;
}
// Always scans neighbors, even if s was already visited!
for (auto neighbor : adj[s]) {
    if (!visited[neighbor])
        stack.push(neighbor);
}

This version not only pops duplicate nodes but also re-scans their entire adjacency list each time, compounding the wasted work.

Impact on Stack Size

In dense graphs, V1 can accumulate significant stack bloat. In the worst case (a complete graph K_n), a single node can be pushed by every other node that discovers it, leading to O(V^2) stack entries instead of O(V). V2 guarantees the stack never exceeds O(V) entries.

For grid-based problems (like Number of Islands on a 1000x1000 grid), this difference is meaningful: a cell with 4 neighbors can be pushed up to 4 times in V1 before any of those copies are popped.

Asymptotic Complexity

Despite the behavioral differences, both strategies have the same asymptotic time complexity: O(V + E).

In V1, each edge causes at most one push (the if (!visited[neighbor]) guard prevents pushing already-visited nodes). The duplicate pushes come from multiple edges targeting the same unvisited node before it is first popped, but this is bounded by the node's in-degree. Summing over all nodes, the total number of pushes is O(E), and each node is processed (visited check + neighbor scan) at most once after its first pop. The extra duplicate pops are bounded by O(E) as well.

In V2, each node is pushed and popped exactly once, and each edge is examined exactly once during neighbor scanning, giving a clean O(V + E).

So while the asymptotic complexity is identical, the constant factors differ, and this is where our benchmarks become interesting.

Benchmark Setup

We benchmarked both strategies on two different problem types:

Test 1: General Graph with Hash-Set Visited

Using our dfs_basic.cpp template with std::unordered_set<int> for the visited set:

Graph: 100,000 nodes, ~500,000 edges (chain backbone + random edges)
Visited operations: unordered_set::find() and unordered_set::insert() — both amortized O(1) but with significant constant overhead (hashing, bucket lookup, memory allocation for new entries)
5 runs averaged, compiled with O2

Test 2: Grid Problem with Array Visited

Using our num_islands.cpp template where the grid itself serves as the visited array (grid[i][j] = '0' to mark visited):

Grid: 1000 x 1000 (1,000,000 cells), 60% land
Visited operations: direct array read (grid[i][j] != '1') and write (grid[i][j] = '0') — true O(1) with minimal constant overhead
10 runs averaged, compiled with O2

Benchmark Results

Test 1: General Graph (Hash-Set Visited)

V1 (check/mark on pop):  51.5 ms avg
V2 (check/mark on push): 68.1 ms avg
V2 speedup: 0.76x  (V1 wins by ~24%)

V1 wins. This was unexpected.

Test 2: Grid Problem (Array Visited)

V1 (check/mark on pop):  29.5 ms avg
V2 (check/mark on push): 26.9 ms avg
V2 speedup: 1.10x  (V2 wins by ~10%)

V2 wins. This matched our expectations.

Analysis: Why the Winner Depends on the Visited-Set Cost

The key insight is that V1 and V2 distribute their visited-set operations differently:

Operation	V1	V2
`visited.find()` per neighbor	1 (before push)	1 (before push)
`visited.insert()` per node	1 (on pop)	1 (on push)
Extra `visited.find()` on pop	1 per pop (incl. dupes)	0 (no check needed on pop)
Extra `visited.insert()` calls	0	0

At first glance, V2 should be faster: it avoids duplicate pops and their associated visited checks. Both strategies call insert exactly V times and find at least E times (once per edge during neighbor scanning). The difference is not in the total count of operations but in where they happen in the code.

V1 calls insert in the outer loop (once per pop, only for newly visited nodes). The neighbor-scanning inner loop only calls find. This means the hot inner loop — which iterates over adjacency lists — does lightweight read-only hash lookups without modifying the hash set. The adjacency data stays cache-hot during scanning.
V2 calls insert in the inner loop (for each newly discovered neighbor). This means the hot neighbor-scanning loop alternates between reading adjacency data and modifying the hash set. Each insert may allocate memory and potentially trigger a rehash, evicting adjacency data from the CPU cache.

The extra work V1 does — popping duplicates and checking find for already-visited nodes — costs only a cheap hash lookup per duplicate (fast path: the bucket exists and matches immediately). This overhead is less than the cache disruption V2 causes by interleaving insert calls with adjacency iteration.

With a hash set, the cache and allocation overhead of insert inside the inner loop dominates, and V1's approach of deferring insert to the outer loop wins.

With an array/grid, both find (array read) and insert (array write) are trivially cheap — a single memory access with no allocation or rehashing. The cache disruption disappears because array writes are in-place. The dominant cost becomes the redundant work of popping duplicates and re-scanning their neighbors, which V2 avoids entirely. Hence V2 wins.

What the Literature Says

This performance trade-off is largely undiscussed in existing literature. Most discussions focus on correctness and memory:

A notable blog post, "Why what you have been taught about DFS is wrong (at least partially)" (banterly.net, 2020), argues that the common V1 approach is misleadingly taught because it causes stack bloat that doesn't match recursive DFS's memory behavior. The author focuses on memory, not runtime.
The GeeksforGeeks iterative DFS tutorial uses V1 but doesn't discuss V2 as an alternative.
A GitHub issue on the graphology library (#481) flagged the library's DFS as incorrect due to mark-on-push vs. mark-on-pop behavior affecting traversal order.
TechNetExperts provides a comparison focused on correctness: V1 is recommended for pathfinding and backtracking; V2 is recommended for simple reachability and memory efficiency.

None of these sources benchmark the two approaches or identify the visited-set cost as the determining factor.

Practical Recommendations

Based on our analysis and benchmarks:

For grid/matrix problems (flood fill, island counting, maze solving): Use V2 (mark on push). The visited check is a trivial array access, and V2 avoids all duplicate work. This is the approach used in our num_islands.cpp.
For general graphs with hash-based visited sets: Use V1 (mark on pop). The hash set's insert cost makes V2's eager marking more expensive than V1's lazy duplicate handling. Alternatively, if node IDs are contiguous integers, use a vector<bool> instead of unordered_set to get array-like performance, which would likely flip the result in V2's favor.
For backtracking problems (permutations, N-queens): Neither V1 nor V2 applies directly, since backtracking requires temporary marking (mark on enter, unmark on backtrack). Use recursive DFS or an iterative approach with explicit state management.
When stack memory is a concern: Use V2 (mark on push). It guarantees no duplicates on the stack, bounding stack size to O(V) in all cases.

Conclusion

The choice between mark-on-pop and mark-on-push in iterative DFS is not merely a stylistic preference — it has measurable performance implications that depend on the data structure used for the visited set. Our benchmarks demonstrate a ~24% advantage for mark-on-pop with hash sets and a ~10% advantage for mark-on-push with array-based visited tracking.

This finding highlights a general principle in algorithm engineering: asymptotic complexity tells you the shape of the curve, but constant factors — driven by data structure choice, cache behavior, and branch prediction — determine where you actually land on it. When optimizing DFS for a specific problem, the right strategy depends not just on the graph, but on how you represent the "visited" state.