Pagefy

Pagefy

Back to Data Structures and Algorithms

Data Structures for Disjoint Sets

Introduction to Algorithms

Chapter 21: Data Structures for Disjoint Sets

Some applications involve grouping n distinct elements into a collection of disjoint sets. Two operations dominate: finding the unique set that contains a given element, and uniting two sets. This chapter explores increasingly efficient data structures for these operations — from linked lists to disjoint-set forests with union by rank and path compression — culminating in a nearly-linear time bound of O(m α(n)).


21.1 Disjoint-Set Operations

A disjoint-set data structure maintains a collection S = {S₁, S₂, …, Sₖ} of disjoint dynamic sets. Each set is identified by a representative — some member of the set. If the set is not modified between two queries, the representative returned must be the same both times.

Supported Operations

  • MAKE-SET(x): Creates a new set whose only member (and thus representative) is x. Requires that x is not already in some other set.
  • UNION(x, y): Unites the sets containing x and y (say Sₓ and Sᵧ) into a new set Sₓ ∪ Sᵧ. The representative of the resulting set is any member of Sₓ ∪ Sᵧ. Conceptually destroys Sₓ and Sᵧ, removing them from the collection.
  • FIND-SET(x): Returns a pointer to the representative of the unique set containing x.

Analysis Parameters

Running times are expressed in terms of two parameters:

  • n — the number of MAKE-SET operations
  • m — the total number of MAKE-SET, UNION, and FIND-SET operations

Key observations:

  • Each UNION reduces the number of sets by one, so there are at most n − 1 UNION operations.
  • Since MAKE-SET operations are included in m, we have m ≥ n.
  • We assume the n MAKE-SET operations are the first n operations performed.

Application: Connected Components

A classic application is computing the connected components of an undirected graph.

CONNECTED-COMPONENTS(G)
1  for each vertex v ∈ G.V
2      MAKE-SET(v)
3  for each edge (u, v) ∈ G.E
4      if FIND-SET(u) ≠ FIND-SET(v)
5          UNION(u, v)
SAME-COMPONENT(u, v)
1  if FIND-SET(u) == FIND-SET(v)
2      return TRUE
3  else return FALSE
  • Initially each vertex is placed in its own set.
  • For each edge (u, v), the sets containing u and v are united.
  • After processing all edges, two vertices are in the same connected component if and only if they are in the same set.

This approach is especially useful when edges are added dynamically and we need to maintain connected components incrementally, rather than re-running DFS for each new edge.


21.2 Linked-List Representation of Disjoint Sets

Structure

Each set is represented by its own linked list:

  • The set object has attributes head (pointer to first object) and tail (pointer to last object).
  • Each list object contains:
    • A set member
    • A pointer to the next object in the list
    • A pointer back to the set object
  • The representative is the set member in the first object of the list.

Operation Costs

  • MAKE-SET(x): Create a new linked list with one object → O(1)
  • FIND-SET(x): Follow the pointer from x back to its set object, return the member at headO(1)
  • UNION(x, y): Append y's list onto the end of x's list. Must update the back-pointer to the set object for every element originally in y's list → O(length of y's list)

Worst-Case Analysis of Simple Union

A sequence of n MAKE-SET operations followed by n − 1 UNION operations (total m = 2n − 1) can require Θ(n²) time:

OperationObjects Updated
MAKE-SET(x₁) … MAKE-SET(xₙ)1 each
UNION(x₂, x₁)1
UNION(x₃, x₂)2
UNION(x₄, x₃)3
UNION(xₙ, xₙ₋₁)n − 1

Total objects updated by UNIONs: ∑(i=1 to n−1) i = Θ(n²)

Amortized cost per operation: Θ(n).

Weighted-Union Heuristic

Maintain the length of each list. Always append the shorter list onto the longer (breaking ties arbitrarily).

  • A single UNION can still take Θ(n) in the worst case.
  • But a sequence of m operations (with n MAKE-SETs) takes O(m + n lg n) total.

Theorem 21.1

Using the linked-list representation with the weighted-union heuristic, a sequence of m MAKE-SET, UNION, and FIND-SET operations, n of which are MAKE-SET operations, takes O(m + n lg n) time.

Proof sketch: Consider any object x. Each time x's back-pointer is updated, x started in the smaller set, so the resulting set has at least twice as many members. Since the largest set has at most n members, each object's pointer is updated at most ⌈lg n⌉ times. Total pointer updates across all UNIONs: O(n lg n). Adding O(m) for MAKE-SET and FIND-SET operations gives O(m + n lg n).


21.3 Disjoint-Set Forests

A faster implementation represents sets as rooted trees:

  • Each member points only to its parent.
  • The root of each tree is the representative and is its own parent.
  • Each tree represents one set.

Basic Operations

  • MAKE-SET: Create a tree with a single node.
  • FIND-SET: Follow parent pointers up to the root. The nodes visited form the find path.
  • UNION: Make the root of one tree point to the root of the other.

Without heuristics, a sequence of n − 1 UNIONs can produce a linear chain of n nodes — no better than linked lists.

Heuristic 1: Union by Rank

Similar to the weighted-union heuristic but uses rank (an upper bound on node height) instead of subtree size:

  • Each node maintains an integer rank.
  • During UNION, make the root with smaller rank point to the root with larger rank.
  • If ranks are equal, choose one root as parent and increment its rank by 1.

Heuristic 2: Path Compression

Used during FIND-SET operations:

  • Make every node on the find path point directly to the root.
  • Does not change any ranks.

This is a two-pass method: one pass up to find the root, one pass back down (via recursion unwinding) to update parent pointers.

Pseudocode

MAKE-SET(x)
1  x.p = x
2  x.rank = 0
UNION(x, y)
1  LINK(FIND-SET(x), FIND-SET(y))
LINK(x, y)
1  if x.rank > y.rank
2      y.p = x
3  else x.p = y
4      if x.rank == y.rank
5          y.rank = y.rank + 1
FIND-SET(x)
1  if x ≠ x.p
2      x.p = FIND-SET(x.p)
3  return x.p

Effect on Running Time

Heuristic(s) UsedRunning Time
Union by rank aloneO(m lg n)
Path compression aloneΘ(n + f · (1 + log₂₊f/n n)) for f FIND-SETs
Both combinedO(m α(n))

Where α(n) is the inverse Ackermann function — so slowly growing that α(n) ≤ 4 for all practical values of n. The running time is effectively linear in m.


21.4 Analysis of Union by Rank with Path Compression

The Ackermann Function and Its Inverse

For integers k ≥ 0 and j ≥ 1, define:

         ⎧ j + 1                  if k = 0
Aₖ(j) = ⎨
         ⎩ Aₖ₋₁^(j+1)(j)         if k ≥ 1

Where Aₖ₋₁^(i)(j) denotes functional iteration: apply Aₖ₋₁ to j a total of i times.

Closed forms:

  • A₁(j) = 2j + 1 (Lemma 21.2)
  • A₂(j) = 2^(j+1) · (j + 1) − 1 (Lemma 21.3)

Growth at j = 1:

kAₖ(1)
02
13
27
32047
4≫ 10⁸⁰ (more than atoms in the observable universe)

Inverse Ackermann Function

α(n) = min{ k : Aₖ(1) ≥ n }
α(n)Range of n
00 ≤ n ≤ 2
1n = 3
24 ≤ n ≤ 7
38 ≤ n ≤ 2047
42048 ≤ n ≤ A₄(1)

For all practical purposes, α(n) ≤ 4.

Properties of Ranks

  • Lemma 21.4: For all nodes x: x.rank ≤ x.p.rank, with strict inequality if x ≠ x.p. Once x becomes a non-root, x.rank never changes. The value x.p.rank monotonically increases over time.
  • Corollary 21.5: Along any path from a node toward the root, ranks strictly increase.
  • Lemma 21.6: Every node has rank at most n − 1 (in fact, at most ⌊lg n⌋).

Amortized Analysis via Potential Method

The analysis converts UNION operations into LINK + two FIND-SET operations (Lemma 21.7 shows this doesn't change the asymptotic bound).

Potential function: Each node x is assigned a potential φ(x):

         ⎧ α(n) · x.rank                              if x is a root or x.rank = 0
φ(x) =  ⎨
         ⎩ (α(n) − level(x)) · x.rank − iter(x)       if x is not a root and x.rank ≥ 1

Where the auxiliary functions are:

  • level(x) = max{ k : x.p.rank ≥ Aₖ(x.rank) } — the greatest level k for which Aₖ applied to x's rank doesn't exceed the parent's rank.
    • Satisfies: 0 ≤ level(x) < α(n)
  • iter(x) = max{ i : x.p.rank ≥ Aₗₑᵥₑₗ₍ₓ₎^(i)(x.rank) } — how many times we can iteratively apply A at x's level before exceeding the parent's rank.
    • Satisfies: 1 ≤ iter(x) ≤ x.rank

Key property (Lemma 21.8): For every node x:

0 ≤ φ(x) ≤ α(n) · x.rank

Amortized Costs

  • MAKE-SET (Lemma 21.11): Amortized cost = O(1)

    • Creates a node with rank 0 and potential 0.
  • LINK (Lemma 21.12): Amortized cost = O(α(n))

    • Only the new root can have its potential increase, by at most α(n).
  • FIND-SET (Lemma 21.13): Amortized cost = O(α(n))

    • Actual cost is O(s) for a find path of s nodes.
    • At least max(0, s − (α(n) + 2)) nodes on the find path have their potential decrease by at least 1 (because path compression causes either iter(x) or level(x) to increase).
    • The potential drop offsets the actual cost, leaving O(α(n)).

Theorem 21.14

A sequence of m MAKE-SET, UNION, and FIND-SET operations, n of which are MAKE-SET operations, can be performed on a disjoint-set forest with union by rank and path compression in worst-case time O(m α(n)).


Summary

RepresentationHeuristicTime Bound
Linked listSimple unionO(m · n) — Θ(n) amortized per op
Linked listWeighted unionO(m + n lg n)
ForestUnion by rank onlyO(m lg n)
ForestPath compression onlyΘ(n + f · (1 + log₂₊f/n n))
ForestUnion by rank + path compressionO(m α(n)) ≈ effectively O(m)

Key takeaways:

  • The disjoint-set forest with both heuristics is the optimal implementation.
  • α(n) ≤ 4 for any conceivable input size, making the bound effectively linear.
  • The amortized analysis uses a sophisticated potential function based on the inverse Ackermann function.
  • The data structure is widely used in algorithms for connected components, Kruskal's MST, and other graph problems.