Introduction to Algorithms

Chapter 27: Multithreaded Algorithms

This chapter extends the algorithmic model to parallel algorithms running on multiprocessor computers. We explore dynamic multithreaded algorithms, which are amenable to algorithmic design, analysis, and efficient implementation.

Key idea: Programmers specify only the logical parallelism; the underlying concurrency platform handles scheduling and load balancing automatically.

Three concurrency keywords added to pseudocode:

  • spawn — allows a subroutine to run in parallel with its caller
  • sync — waits for all spawned children to complete
  • parallel — indicates loop iterations may execute concurrently

The serialization of a multithreaded algorithm is the serial algorithm obtained by deleting spawn, sync, and parallel keywords.


27.1 The Basics of Dynamic Multithreading

Fibonacci Example

The classic recursive Fibonacci serves as a motivating example:

FIB(n)
1  if n ≤ 1
2      return n
3  else x = FIB(n - 1)
4       y = FIB(n - 2)
5       return x + y
  • Running time: T(n) = T(n−1) + T(n−2) + Θ(1) = Θ(φⁿ), where φ = (1+√5)/2
  • The two recursive calls are independent → can run in parallel

Multithreaded version:

P-FIB(n)
1  if n ≤ 1
2      return n
3  else x = spawn P-FIB(n - 1)
4       y = P-FIB(n - 2)
5       sync
6       return x + y
  • spawn in line 3: the parent may continue executing line 4 in parallel with the spawned child
  • sync in line 5: waits for all spawned children before summing x + y
  • Every procedure executes an implicit sync before returning
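
The spawn/sync pattern of P-FIB can be sketched in Python by mapping spawn to starting a thread and sync to joining it. This is only an illustration of the logical parallelism (the name p_fib is ours): one OS thread per spawn is far too heavy for real use, and CPython's GIL prevents true parallel execution; real concurrency platforms use lightweight work-stealing schedulers instead.

```python
import threading

def p_fib(n):
    # Sketch of P-FIB: spawn -> Thread.start, sync -> Thread.join.
    if n <= 1:
        return n
    box = [None]

    def run_child():
        box[0] = p_fib(n - 1)   # the spawned child's work (line 3)

    child = threading.Thread(target=run_child)
    child.start()               # "spawn": child runs concurrently with parent
    y = p_fib(n - 2)            # parent keeps working, as in line 4
    child.join()                # "sync": wait for the spawned child (line 5)
    return box[0] + y

print(p_fib(10))  # -> 55
```

Deleting the thread machinery (run the child inline, drop start/join) recovers exactly the serialization FIB.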

Computation DAG Model

A multithreaded computation is modeled as a directed acyclic graph (DAG) G = (V, E):

  • Vertices represent strands (maximal sequences of instructions with no parallel control)
  • Edges represent dependencies between strands

Edge types:

  • Continuation edge — connects a strand to its successor within the same procedure (horizontal)
  • Spawn edge — connects a strand to a spawned child (downward)
  • Call edge — connects a strand to a normally called child (downward)
  • Return edge — connects a returning strand back to the caller's strand after sync (upward)

Two strands are in series if there is a directed path between them; otherwise they are in parallel.

Performance Measures

Let Tₚ = running time on P processors.

Metric | Definition                                           | Notation
Work   | Total time on 1 processor (sum of all strand times)  | T₁
Span   | Longest path in the DAG (critical path)              | T∞

Fundamental laws:

  1. Work Law: Tₚ ≥ T₁ / P
  2. Span Law: Tₚ ≥ T∞

Derived metrics:

  • Speedup on P processors: T₁ / Tₚ (at most P)
  • Linear speedup: T₁/Tₚ = Θ(P)
  • Perfect linear speedup: T₁/Tₚ = P
  • Parallelism: T₁ / T∞ — the average work per step along the critical path; upper bound on useful speedup
  • Slackness: (T₁/T∞) / P — factor by which parallelism exceeds processor count

Rule of thumb: A slackness of at least 10 generally suffices for good speedup.

Analysis of P-FIB

Metric       | Value
Work T₁(n)   | Θ(φⁿ) — same as serial FIB
Span T∞(n)   | T∞(n−1) + Θ(1) = Θ(n)
Parallelism  | Θ(φⁿ / n) — grows dramatically with n

For the span: the two recursive calls run in parallel, so we take the max of their spans:

  • T∞(n) = max(T∞(n−1), T∞(n−2)) + Θ(1) = T∞(n−1) + Θ(1) = Θ(n)

Composition rules for work and span:

  • Series composition: work adds, span adds
  • Parallel composition: work adds, span takes the max
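
The two composition rules can be written as two one-line helpers on (work, span) pairs; the names series and parallel are ours, chosen to match the rules above.

```python
def series(a, b):
    # series composition: work adds, span adds
    return (a[0] + b[0], a[1] + b[1])

def parallel(a, b):
    # parallel composition: work adds, span takes the max
    return (a[0] + b[0], max(a[1], b[1]))

# two subcomputations of (work=10, span=10) each:
print(series((10, 10), (10, 10)))    # (20, 20)
print(parallel((10, 10), (10, 10)))  # (20, 10)
```

Composing a whole computation DAG bottom-up with these two rules yields its total work T₁ and span T∞.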

Scheduling

A greedy scheduler assigns as many strands to processors as possible each time step:

  • Complete step: ≥ P strands ready → assign P of them
  • Incomplete step: < P strands ready → assign all ready strands

Theorem 27.1 (Greedy Scheduler Bound):

Tₚ ≤ T₁/P + T∞

Corollary 27.2: A greedy scheduler is within a factor of 2 of optimal:

Tₚ ≤ 2 · T*ₚ, where T*ₚ is the running time achieved by an optimal scheduler on P processors

Corollary 27.3: If P ≪ T₁/T∞ (high slackness), then Tₚ ≈ T₁/P (near-perfect linear speedup).

Proof sketch of Theorem 27.1:

  • Number of complete steps ≤ ⌊T₁/P⌋ (otherwise total work exceeds T₁)
  • Each incomplete step reduces the span of the remaining DAG by 1 → at most T∞ incomplete steps
  • Total: Tₚ ≤ ⌊T₁/P⌋ + T∞ ≤ T₁/P + T∞

Parallel Loops

A parallel for loop is implemented as a divide-and-conquer binary tree of recursive spawns:

MAT-VEC(A, x)
1  n = A.rows
2  let y be a new vector of length n
3  parallel for i = 1 to n
4      yᵢ = 0
5  parallel for i = 1 to n
6      for j = 1 to n
7          yᵢ = yᵢ + aᵢⱼxⱼ
8  return y

The compiler translates a parallel for with n iterations into a recursive procedure with depth Θ(lg n).

Span of a parallel loop with n iterations, where iteration i has span iter∞(i):

T∞ = Θ(lg n) + max₁≤ᵢ≤ₙ iter∞(i)

MAT-VEC analysis:

  • Work: T₁(n) = Θ(n²)
  • Span: T∞(n) = Θ(n) — dominated by the inner serial loop of n iterations
  • Parallelism: Θ(n)
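
The divide-and-conquer lowering of the parallel for can be seen in a serial Python transcription (function names mat_vec_main_loop and mat_vec are ours, after the compiler-generated procedure described above); the first recursive call on each level would be spawned, which is what gives the Θ(lg n) tree-depth term in the span.

```python
def mat_vec_main_loop(A, x, y, n, i, i2):
    # recursive halving a compiler would generate for "parallel for i = i..i2"
    if i == i2:
        for j in range(n):            # inner serial loop: span Θ(n)
            y[i] += A[i][j] * x[j]
    else:
        mid = (i + i2) // 2
        mat_vec_main_loop(A, x, y, n, i, mid)        # would be spawned
        mat_vec_main_loop(A, x, y, n, mid + 1, i2)   # runs "in parallel"
        # implicit sync before returning

def mat_vec(A, x):
    n = len(A)
    y = [0] * n                        # zero-initialization loop of MAT-VEC
    mat_vec_main_loop(A, x, y, n, 0, n - 1)
    return y

print(mat_vec([[1, 2], [3, 4]], [1, 1]))  # -> [3, 7]
```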

Race Conditions

A determinacy race occurs when two logically parallel instructions access the same memory location and at least one performs a write.

RACE-EXAMPLE()
1  x = 0
2  parallel for i = 1 to 2
3      x = x + 1
4  print x
  • May print 1 instead of 2 because x = x + 1 is not atomic (read → increment → write)
  • If two processors interleave these sub-operations, one update can be lost

Avoiding races: Ensure strands operating in parallel are independent (no shared writes):

  • All iterations of a parallel for should be independent
  • Between spawn and sync, parent and child code should be independent
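
One conventional repair for a shared counter is a mutex around the read-modify-write, sketched below with Python's threading module (note that CPython's GIL often masks this particular race in practice, so treat this as an illustration). Locking removes the race but serializes the strands; the cleaner fix in the multithreaded model is to restructure the computation so parallel strands are independent.

```python
import threading

counter = 0
lock = threading.Lock()

def safe_increment(times):
    global counter
    for _ in range(times):
        with lock:        # makes read -> increment -> write atomic
            counter += 1

threads = [threading.Thread(target=safe_increment, args=(10_000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 20000: no update is lost when the lock is held
```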

Chess Lesson (Work/Span Extrapolation)

A practical example showing why work and span are better for predicting performance than measured runtimes:

  • Original: T₁ = 2048 seconds, T∞ = 1 second → T₃₂ = 65 s, T₅₁₂ = 5 s
  • "Optimized": T₁' = 1024 seconds, T∞' = 8 seconds → T₃₂' = 40 s, T₅₁₂' = 10 s

The optimization helped on 32 processors but hurt on 512 because the increased span became the dominant term.
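
The chess-program runtimes follow directly from the greedy-scheduler bound of Theorem 27.1, treating the bound as an estimate of the actual runtime (the helper name greedy_bound is ours):

```python
def greedy_bound(T1, Tinf, P):
    # Theorem 27.1: a greedy scheduler achieves T_P <= T1/P + Tinf
    return T1 / P + Tinf

# original program: work 2048 s, span 1 s
print(greedy_bound(2048, 1, 32))   # 65.0
print(greedy_bound(2048, 1, 512))  # 5.0
# "optimized" program: half the work, eight times the span
print(greedy_bound(1024, 8, 32))   # 40.0
print(greedy_bound(1024, 8, 512))  # 10.0
```

At P = 512 the span term dominates, so halving the work cannot compensate for the eightfold span increase.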


27.2 Multithreaded Matrix Multiplication

Parallelizing the Standard Algorithm

P-SQUARE-MATRIX-MULTIPLY(A, B)
1  n = A.rows
2  let C be a new n × n matrix
3  parallel for i = 1 to n
4      parallel for j = 1 to n
5          cᵢⱼ = 0
6          for k = 1 to n
7              cᵢⱼ = cᵢⱼ + aᵢₖ · bₖⱼ
8  return C

Metric      | Value
Work        | Θ(n³) — same as serial triple loop
Span        | Θ(lg n) + Θ(lg n) + Θ(n) = Θ(n)
Parallelism | Θ(n²)

  • The two parallel for loops contribute Θ(lg n) each to the span
  • The inner serial for k loop contributes Θ(n), which dominates

Divide-and-Conquer Multithreaded Matrix Multiplication

Partition each n×n matrix into four (n/2)×(n/2) submatrices:

C = A · B  where  C₁₁ = A₁₁B₁₁ + A₁₂B₂₁,  etc.
P-MATRIX-MULTIPLY-RECURSIVE(C, A, B)
1   n = A.rows
2   if n == 1
3       c₁₁ = a₁₁ · b₁₁
4   else let T be a new n × n matrix
5       partition A, B, C, and T into (n/2)×(n/2) submatrices
6       spawn P-MATRIX-MULTIPLY-RECURSIVE(C₁₁, A₁₁, B₁₁)
7       spawn P-MATRIX-MULTIPLY-RECURSIVE(C₁₂, A₁₁, B₁₂)
8       spawn P-MATRIX-MULTIPLY-RECURSIVE(C₂₁, A₂₁, B₁₁)
9       spawn P-MATRIX-MULTIPLY-RECURSIVE(C₂₂, A₂₁, B₁₂)
10      spawn P-MATRIX-MULTIPLY-RECURSIVE(T₁₁, A₁₂, B₂₁)
11      spawn P-MATRIX-MULTIPLY-RECURSIVE(T₁₂, A₁₂, B₂₂)
12      spawn P-MATRIX-MULTIPLY-RECURSIVE(T₂₁, A₂₂, B₂₁)
13      P-MATRIX-MULTIPLY-RECURSIVE(T₂₂, A₂₂, B₂₂)
14      sync
15      parallel for i = 1 to n
16          parallel for j = 1 to n
17              cᵢⱼ = cᵢⱼ + tᵢⱼ

Strategy:

  • Lines 6–9: Compute the first term of each Cᵢⱼ sum (into C)
  • Lines 10–13: Compute the second term of each Cᵢⱼ sum (into temp matrix T)
  • 7 spawned + 1 called = 8 recursive multiplications in parallel
  • Lines 15–17: Add T into C using parallel loops

Analysis:

Metric       | Recurrence          | Solution
Work M₁(n)   | 8M₁(n/2) + Θ(n²)    | Θ(n³) (master theorem case 1)
Span M∞(n)   | M∞(n/2) + Θ(lg n)   | Θ(lg² n)
Parallelism  | M₁(n) / M∞(n)       | Θ(n³ / lg² n)

For the span: all 8 recursive calls execute in parallel on same-size matrices, so only one of them contributes to the span; the Θ(lg n) term comes from the parallel addition loops of lines 15–17.
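
A serial transcription of the recursion makes the structure concrete (the name pmm is ours; n is assumed to be a power of 2, and matrices are lists of lists). In the multithreaded version, seven of the eight recursive products are spawned and the eighth is called, so all eight run in parallel before the sync.

```python
def pmm(A, B):
    # serialization sketch of P-MATRIX-MULTIPLY-RECURSIVE, n a power of 2
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    quad = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    add = lambda X, Y: [[a + b for a, b in zip(rx, ry)]
                        for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A,0,0), quad(A,0,h), quad(A,h,0), quad(A,h,h)
    B11, B12, B21, B22 = quad(B,0,0), quad(B,0,h), quad(B,h,0), quad(B,h,h)
    # the second product of each sum plays the role of the temp matrix T;
    # add() corresponds to the parallel addition loops after the sync
    C11 = add(pmm(A11, B11), pmm(A12, B21))
    C12 = add(pmm(A11, B12), pmm(A12, B22))
    C21 = add(pmm(A21, B11), pmm(A22, B21))
    C22 = add(pmm(A21, B12), pmm(A22, B22))
    top = [l + r for l, r in zip(C11, C12)]
    bot = [l + r for l, r in zip(C21, C22)]
    return top + bot

print(pmm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # -> [[19, 22], [43, 50]]
```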

Multithreading Strassen's Method

Apply the same parallelization strategy to Strassen's algorithm:

  1. Divide A, B, C into (n/2)×(n/2) submatrices — Θ(1) work and span
  2. Create 10 sum/difference matrices S₁…S₁₀ — Θ(n²) work, Θ(lg n) span (parallel loops)
  3. Recursively spawn 7 matrix products P₁…P₇ in parallel
  4. Combine results into C₁₁, C₁₂, C₂₁, C₂₂ — Θ(n²) work, Θ(lg n) span

Metric      | Value
Work        | Θ(n^lg 7) ≈ Θ(n^2.81) — same as serial Strassen
Span        | M∞(n/2) + Θ(lg n) = Θ(lg² n) — same recurrence as above
Parallelism | Θ(n^lg 7 / lg² n)

27.3 Multithreaded Merge Sort

Naive Approach: Spawn the Recursive Calls

MERGE-SORT'(A, p, r)
1  if p < r
2      q = ⌊(p + r) / 2⌋
3      spawn MERGE-SORT'(A, p, q)
4      MERGE-SORT'(A, q + 1, r)
5      sync
6      MERGE(A, p, q, r)

Metric        | Recurrence           | Solution
Work MS'₁(n)  | 2·MS'₁(n/2) + Θ(n)   | Θ(n lg n)
Span MS'∞(n)  | MS'∞(n/2) + Θ(n)     | Θ(n)
Parallelism   | MS'₁(n) / MS'∞(n)    | Θ(lg n) — very low!

Bottleneck: The serial MERGE procedure has Θ(n) span, dominating the span recurrence.

Multithreaded Merge (P-MERGE)

Key idea: Divide-and-conquer merging using the median of the larger subarray and binary search.

Strategy (merging sorted T[p₁..r₁] and T[p₂..r₂] into A[p₃..r₃]):

  1. Ensure n₁ ≥ n₂ (swap if needed)
  2. Find median x = T[q₁] of the larger subarray, where q₁ = ⌊(p₁+r₁)/2⌋
  3. Binary search for position q₂ in T[p₂..r₂] where x would be inserted
  4. Compute output position q₃ = p₃ + (q₁ − p₁) + (q₂ − p₂)
  5. Place x into A[q₃]
  6. Recursively merge the two halves in parallel
BINARY-SEARCH(x, T, p, r)
1  low = p
2  high = max(p, r + 1)
3  while low < high
4      mid = ⌊(low + high) / 2⌋
5      if x ≤ T[mid]
6          high = mid
7      else low = mid + 1
8  return high
P-MERGE(T, p₁, r₁, p₂, r₂, A, p₃)
1   n₁ = r₁ - p₁ + 1
2   n₂ = r₂ - p₂ + 1
3   if n₁ < n₂              // ensure n₁ ≥ n₂
4       exchange p₁ with p₂
5       exchange r₁ with r₂
6       exchange n₁ with n₂
7   if n₁ == 0              // both empty?
8       return
9   else q₁ = ⌊(p₁ + r₁) / 2⌋
10      q₂ = BINARY-SEARCH(T[q₁], T, p₂, r₂)
11      q₃ = p₃ + (q₁ - p₁) + (q₂ - p₂)
12      A[q₃] = T[q₁]
13      spawn P-MERGE(T, p₁, q₁-1, p₂, q₂-1, A, p₃)
14      P-MERGE(T, q₁+1, r₁, q₂, r₂, A, q₃+1)
15      sync

Worst-case analysis: Each recursive call handles at most 3n/4 elements:

  • Since n₁ ≥ n₂ and n₁ + n₂ = n, we have n₂ ≤ n/2
  • One recursive call merges at most ⌊n₁/2⌋ + n₂ ≤ (n₁/2 + n₂/2) + n₂/2 ≤ n/2 + n/4 = 3n/4 elements

P-MERGE Analysis:


Metric        | Recurrence                                            | Solution
Span PM∞(n)   | PM∞(3n/4) + Θ(lg n)                                   | Θ(lg² n)
Work PM₁(n)   | PM₁(αn) + PM₁((1−α)n) + O(lg n), where 1/4 ≤ α ≤ 3/4  | Θ(n)
Parallelism   | PM₁(n) / PM∞(n)                                       | Θ(n / lg² n)

Work proof (substitution method): Assume PM₁(n) ≤ c₁n − c₂ lg n. The two recursive calls together process at most n elements, and the binary search costs O(lg n). Substituting confirms PM₁(n) = O(n). Combined with the Ω(n) lower bound (every element must be copied), we get Θ(n).
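
The recursion can be checked with a direct serial transcription of BINARY-SEARCH and P-MERGE (zero-based Python indices replace the 1-based pseudocode; the first recursive call, here an ordinary call, would be spawned):

```python
def binary_search(x, T, p, r):
    # first position in T[p..r] at which x could be inserted
    low, high = p, max(p, r + 1)
    while low < high:
        mid = (low + high) // 2
        if x <= T[mid]:
            high = mid
        else:
            low = mid + 1
    return high

def p_merge(T, p1, r1, p2, r2, A, p3):
    # serialization of P-MERGE: merge sorted T[p1..r1] and T[p2..r2]
    # into A starting at index p3
    n1, n2 = r1 - p1 + 1, r2 - p2 + 1
    if n1 < n2:                               # ensure n1 >= n2
        p1, r1, p2, r2, n1, n2 = p2, r2, p1, r1, n2, n1
    if n1 == 0:                               # both subarrays empty
        return
    q1 = (p1 + r1) // 2                       # median of the larger subarray
    q2 = binary_search(T[q1], T, p2, r2)
    q3 = p3 + (q1 - p1) + (q2 - p2)           # output slot for the median
    A[q3] = T[q1]
    p_merge(T, p1, q1 - 1, p2, q2 - 1, A, p3)      # would be spawned
    p_merge(T, q1 + 1, r1, q2, r2, A, q3 + 1)
    # implicit sync on return
```

For example, merging the two sorted halves of T = [1, 3, 5, 7, 2, 4, 6, 8] into an output array A of length 8 yields [1, 2, ..., 8].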

Full Multithreaded Merge Sort (P-MERGE-SORT)

P-MERGE-SORT(A, p, r, B, s)
1   n = r - p + 1
2   if n == 1
3       B[s] = A[p]
4   else let T[1..n] be a new array
5       q = ⌊(p + r) / 2⌋
6       q' = q - p + 1
7       spawn P-MERGE-SORT(A, p, q, T, 1)
8       P-MERGE-SORT(A, q+1, r, T, q'+1)
9       sync
10      P-MERGE(T, 1, q', q'+1, n, B, s)

How it works:

  1. Recursively sort both halves into temporary array T (in parallel via spawn)
  2. After sync, merge the two sorted halves from T into output array B using P-MERGE

P-MERGE-SORT Analysis:


Metric         | Recurrence             | Solution
Work PMS₁(n)   | 2·PMS₁(n/2) + Θ(n)     | Θ(n lg n) (master theorem case 2)
Span PMS∞(n)   | PMS∞(n/2) + Θ(lg² n)   | Θ(lg³ n)
Parallelism    | PMS₁(n) / PMS∞(n)      | Θ(n lg n / lg³ n) = Θ(n / lg² n)

Comparison of merge sort variants:

Algorithm                      | Work       | Span      | Parallelism
MERGE-SORT' (serial merge)     | Θ(n lg n)  | Θ(n)      | Θ(lg n)
P-MERGE-SORT (parallel merge)  | Θ(n lg n)  | Θ(lg³ n)  | Θ(n / lg² n)

The parallel merge dramatically improves parallelism from Θ(lg n) to Θ(n / lg² n).

Practical note: Coarsen the base case by switching to a serial sort (e.g., quicksort) for small subarrays to reduce overhead constants while maintaining high parallelism.
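
Coarsening can be sketched as follows (the name coarse_merge_sort and the cutoff value are ours; the merge here is a serial one from heapq for brevity, whereas a real implementation would use P-MERGE to keep the span polylogarithmic):

```python
import heapq

CUTOFF = 32  # tuning knob: below this size, a serial sort wins

def coarse_merge_sort(A):
    # coarsened serialization sketch of multithreaded merge sort
    if len(A) <= CUTOFF:
        return sorted(A)          # serial base case absorbs spawn overhead
    mid = len(A) // 2
    left = coarse_merge_sort(A[:mid])    # would be spawned
    right = coarse_merge_sort(A[mid:])
    # sync, then merge the sorted halves
    return list(heapq.merge(left, right))
```

Raising CUTOFF trades a small constant factor of span for much lower scheduling overhead, which is usually a good deal when n/CUTOFF still far exceeds the processor count.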


Summary of Key Results

Algorithm                    | Work       | Span      | Parallelism
P-FIB(n)                     | Θ(φⁿ)      | Θ(n)      | Θ(φⁿ / n)
MAT-VEC (matrix-vector)      | Θ(n²)      | Θ(n)      | Θ(n)
P-SQUARE-MATRIX-MULTIPLY     | Θ(n³)      | Θ(n)      | Θ(n²)
P-MATRIX-MULTIPLY-RECURSIVE  | Θ(n³)      | Θ(lg² n)  | Θ(n³ / lg² n)
Multithreaded Strassen       | Θ(n^lg 7)  | Θ(lg² n)  | Θ(n^lg 7 / lg² n)
MERGE-SORT' (serial merge)   | Θ(n lg n)  | Θ(n)      | Θ(lg n)
P-MERGE                      | Θ(n)       | Θ(lg² n)  | Θ(n / lg² n)
P-MERGE-SORT                 | Θ(n lg n)  | Θ(lg³ n)  | Θ(n / lg² n)

Core principles:

  • Work = serial running time of the serialization
  • Span = critical path length; use max for parallel composition, sum for series
  • Greedy bound: Tₚ ≤ T₁/P + T∞
  • Slackness ≥ 10 is a practical target for near-linear speedup
  • Avoid determinacy races by ensuring parallel strands are independent
