Multithreaded Algorithms
Introduction to Algorithms, Chapter 27
This chapter extends the algorithmic model to parallel algorithms running on multiprocessor computers. We explore dynamic multithreaded algorithms, which are amenable to algorithmic design, analysis, and efficient implementation.
Key idea: Programmers specify only the logical parallelism; the underlying concurrency platform handles scheduling and load balancing automatically.
Three concurrency keywords added to pseudocode:
- spawn — allows a subroutine to run in parallel with its caller
- sync — waits for all spawned children to complete
- parallel — indicates loop iterations may execute concurrently
The serialization of a multithreaded algorithm is the serial algorithm obtained by deleting spawn, sync, and parallel keywords.
27.1 The Basics of Dynamic Multithreading
Fibonacci Example
The classic recursive Fibonacci serves as a motivating example:
FIB(n)
1 if n ≤ 1
2 return n
3 else x = FIB(n - 1)
4 y = FIB(n - 2)
5 return x + y
- Running time: T(n) = T(n−1) + T(n−2) + Θ(1) = Θ(φⁿ), where φ = (1+√5)/2
- The two recursive calls are independent → can run in parallel
Multithreaded version:
P-FIB(n)
1 if n ≤ 1
2 return n
3 else x = spawn P-FIB(n - 1)
4 y = P-FIB(n - 2)
5 sync
6 return x + y
- spawn in line 3: the parent may continue executing line 4 in parallel with the spawned child
- sync in line 5: waits for all spawned children before summing x + y
- Every procedure executes an implicit sync before returning
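The spawn/sync structure of P-FIB can be sketched in Python (a hypothetical illustration, not the book's code; CPython's GIL limits real CPU parallelism, so this only shows how spawn maps to submitting a task and sync to waiting on its future):

```python
from concurrent.futures import ThreadPoolExecutor

def fib(n):
    # Serial FIB, used for the subproblems.
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

def p_fib(n):
    # Sketch of P-FIB's control structure: spawn one recursive call,
    # run the other in the caller, then sync before combining.
    if n <= 1:
        return n
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fib, n - 1)   # spawn P-FIB(n - 1)
        y = fib(n - 2)                     # caller continues (line 4)
        x = future.result()                # sync: wait for the spawned child
    return x + y

print(p_fib(10))  # 55
```

Deleting the executor (calling `fib` directly for both subproblems) yields exactly the serialization described above.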
Computation DAG Model
A multithreaded computation is modeled as a directed acyclic graph (DAG) G = (V, E):
- Vertices represent strands (maximal sequences of instructions with no parallel control)
- Edges represent dependencies between strands
Edge types:
- Continuation edge — connects a strand to its successor within the same procedure (horizontal)
- Spawn edge — connects a strand to a spawned child (downward)
- Call edge — connects a strand to a normally called child (downward)
- Return edge — connects a returning strand back to the caller's strand after sync (upward)
Two strands are in series if there is a directed path between them; otherwise they are in parallel.
Performance Measures
Let Tₚ = running time on P processors.
| Metric | Definition | Notation |
|---|---|---|
| Work | Total time on 1 processor (sum of all strand times) | T₁ |
| Span | Longest path in the DAG (critical path) | T∞ |
Fundamental laws:
- Work Law: Tₚ ≥ T₁ / P
- Span Law: Tₚ ≥ T∞
Derived metrics:
- Speedup on P processors: T₁ / Tₚ (at most P)
- Linear speedup: T₁/Tₚ = Θ(P)
- Perfect linear speedup: T₁/Tₚ = P
- Parallelism: T₁ / T∞ — the average work per step along the critical path; upper bound on useful speedup
- Slackness: (T₁/T∞) / P — factor by which parallelism exceeds processor count
Rule of thumb: A slackness of at least 10 generally suffices for good speedup.
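The derived metrics are simple ratios, as this small sketch shows (the numbers are made up for illustration):

```python
def parallelism(work, span):
    # T1 / Tinf: average work per step along the critical path,
    # and an upper bound on achievable speedup.
    return work / span

def slackness(work, span, P):
    # (T1 / Tinf) / P: factor by which parallelism exceeds processor count.
    return parallelism(work, span) / P

# Hypothetical computation: work 10^6, span 10^3, run on 64 processors.
print(parallelism(1e6, 1e3))    # 1000.0
print(slackness(1e6, 1e3, 64))  # 15.625 -> above the rule-of-thumb 10
```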
Analysis of P-FIB
| Metric | Value |
|---|---|
| Work T₁(n) | Θ(φⁿ) — same as serial FIB |
| Span T∞(n) | T∞(n) = T∞(n−1) + Θ(1) = Θ(n) |
| Parallelism | Θ(φⁿ / n) — grows dramatically with n |
For the span: the two recursive calls run in parallel, so we take the max of their spans:
- T∞(n) = max(T∞(n−1), T∞(n−2)) + Θ(1) = T∞(n−1) + Θ(1) = Θ(n)
Composition rules for work and span:
- Series composition: work adds, span adds
- Parallel composition: work adds, span takes the max
Scheduling
A greedy scheduler assigns as many strands to processors as possible each time step:
- Complete step: ≥ P strands ready → assign P of them
- Incomplete step: < P strands ready → assign all ready strands
Theorem 27.1 (Greedy Scheduler Bound):
Tₚ ≤ T₁/P + T∞
Corollary 27.2: A greedy scheduler is within a factor of 2 of optimal:
Tₚ ≤ 2 · T*ₚ, where T*ₚ is the optimal running time on P processors
Corollary 27.3: If P ≪ T₁/T∞ (high slackness), then Tₚ ≈ T₁/P (near-perfect linear speedup).
Proof sketch of Theorem 27.1:
- Number of complete steps ≤ ⌊T₁/P⌋ (otherwise total work exceeds T₁)
- Each incomplete step reduces the span of the remaining DAG by 1 → at most T∞ incomplete steps
- Total: Tₚ ≤ ⌊T₁/P⌋ + T∞ ≤ T₁/P + T∞
Parallel Loops
A parallel for loop is implemented as a divide-and-conquer binary tree of recursive spawns:
MAT-VEC(A, x)
1 n = A.rows
2 let y be a new vector of length n
3 parallel for i = 1 to n
4 yᵢ = 0
5 parallel for i = 1 to n
6 for j = 1 to n
7 yᵢ = yᵢ + aᵢⱼxⱼ
8 return y
The compiler translates a parallel for with n iterations into a recursive procedure with depth Θ(lg n).
Span of a parallel loop with n iterations, where iteration i has span iter∞(i):
T∞ = Θ(lg n) + max₁≤ᵢ≤ₙ iter∞(i)
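The divide-and-conquer expansion of a parallel for can be sketched as a serialization in Python (function names are my own; the recursion halves the index range, which is where the Θ(lg n) span term comes from):

```python
def main_loop(A, x, y, n, lo, hi):
    # Serialization of the compiler's expansion of "parallel for i = lo to hi"
    # (0-based, inclusive): split the range in half and recurse on each half.
    # In the parallel version the first half is spawned and a sync follows;
    # the recursion tree has depth Theta(lg n).
    if lo == hi:
        for j in range(n):                 # body of one iteration (serial)
            y[lo] += A[lo][j] * x[j]
    else:
        mid = (lo + hi) // 2
        main_loop(A, x, y, n, lo, mid)     # spawned in the parallel version
        main_loop(A, x, y, n, mid + 1, hi)
        # implicit sync here in the parallel version

def mat_vec(A, x):
    n = len(A)
    y = [0] * n
    main_loop(A, x, y, n, 0, n - 1)
    return y

print(mat_vec([[1, 2], [3, 4]], [1, 1]))  # [3, 7]
```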
MAT-VEC analysis:
- Work: T₁(n) = Θ(n²)
- Span: T∞(n) = Θ(n) — dominated by the inner serial loop of n iterations
- Parallelism: Θ(n)
Race Conditions
A determinacy race occurs when two logically parallel instructions access the same memory location and at least one performs a write.
RACE-EXAMPLE()
1 x = 0
2 parallel for i = 1 to 2
3 x = x + 1
4 print x
- May print 1 instead of 2 because x = x + 1 is not atomic (read → increment → write)
- If two processors interleave these sub-operations, one update can be lost
Avoiding races: Ensure strands operating in parallel are independent (no shared writes):
- All iterations of a parallel for should be independent
- Between spawn and sync, parent and child code should be independent
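The lost update can be replayed deterministically by simulating the bad interleaving by hand (a sketch, not real concurrent code):

```python
# Deterministic replay of the bad interleaving in RACE-EXAMPLE.
# Each strand performs x = x + 1 as three sub-operations:
# read x, increment a private copy, write back.
x = 0

a_local = x          # strand A: read  (a_local = 0)
# ... strand B runs its entire update before A writes back:
b_local = x          # strand B: read  (b_local = 0)
x = b_local + 1      # strand B: write (x = 1)
# strand A resumes with its stale value:
x = a_local + 1      # strand A: write (x = 1) -- B's update is lost

print(x)  # 1, not 2
```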
Chess Lesson (Work/Span Extrapolation)
A practical example showing why work and span are better for predicting performance than measured runtimes:
- Original version: T₁ = 2048 s, T∞ = 1 s → T₃₂ = 65 s, T₅₁₂ = 5 s
- "Optimized" version: T₁' = 1024 s, T∞' = 8 s → T₃₂' = 40 s, T₅₁₂' = 10 s
The optimization helped on 32 processors but hurt on 512 because the increased span became the dominant term.
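The four running times above follow directly from the greedy bound of Theorem 27.1:

```python
def greedy_time(work, span, P):
    # Estimated running time under a greedy scheduler: T_P ~ T_1/P + T_inf.
    return work / P + span

# Original version: T1 = 2048 s, Tinf = 1 s.
# "Optimized" version: T1 = 1024 s, Tinf = 8 s.
for P in (32, 512):
    print(P, greedy_time(2048, 1, P), greedy_time(1024, 8, P))
# 32 65.0 40.0   -> optimization wins on 32 processors
# 512 5.0 10.0   -> optimization loses on 512: the span term dominates
```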
27.2 Multithreaded Matrix Multiplication
Parallelizing the Standard Algorithm
P-SQUARE-MATRIX-MULTIPLY(A, B)
1 n = A.rows
2 let C be a new n × n matrix
3 parallel for i = 1 to n
4 parallel for j = 1 to n
5 cᵢⱼ = 0
6 for k = 1 to n
7 cᵢⱼ = cᵢⱼ + aᵢₖ · bₖⱼ
8 return C
| Metric | Value |
|---|---|
| Work | Θ(n³) — same as serial triple loop |
| Span | Θ(lg n) + Θ(lg n) + Θ(n) = Θ(n) |
| Parallelism | Θ(n²) |
- The two parallel for loops contribute Θ(lg n) each to the span
- The inner serial for k loop contributes Θ(n), which dominates
Divide-and-Conquer Multithreaded Matrix Multiplication
Partition each n×n matrix into four (n/2)×(n/2) submatrices:
C = A · B where C₁₁ = A₁₁B₁₁ + A₁₂B₂₁, etc.
P-MATRIX-MULTIPLY-RECURSIVE(C, A, B)
1 n = A.rows
2 if n == 1
3 c₁₁ = a₁₁ · b₁₁
4 else let T be a new n × n matrix
5 partition A, B, C, and T into (n/2)×(n/2) submatrices
6 spawn P-MATRIX-MULTIPLY-RECURSIVE(C₁₁, A₁₁, B₁₁)
7 spawn P-MATRIX-MULTIPLY-RECURSIVE(C₁₂, A₁₁, B₁₂)
8 spawn P-MATRIX-MULTIPLY-RECURSIVE(C₂₁, A₂₁, B₁₁)
9 spawn P-MATRIX-MULTIPLY-RECURSIVE(C₂₂, A₂₁, B₁₂)
10 spawn P-MATRIX-MULTIPLY-RECURSIVE(T₁₁, A₁₂, B₂₁)
11 spawn P-MATRIX-MULTIPLY-RECURSIVE(T₁₂, A₁₂, B₂₂)
12 spawn P-MATRIX-MULTIPLY-RECURSIVE(T₂₁, A₂₂, B₂₁)
13 P-MATRIX-MULTIPLY-RECURSIVE(T₂₂, A₂₂, B₂₂)
14 sync
15 parallel for i = 1 to n
16 parallel for j = 1 to n
17 cᵢⱼ = cᵢⱼ + tᵢⱼ
Strategy:
- Lines 6–9: Compute the first term of each Cᵢⱼ sum (into C)
- Lines 10–13: Compute the second term of each Cᵢⱼ sum (into temp matrix T)
- 7 spawned + 1 called = 8 recursive multiplications in parallel
- Lines 15–17: Add T into C using parallel loops
Analysis:
| Metric | Recurrence | Solution |
|---|---|---|
| Work M₁(n) | 8M₁(n/2) + Θ(n²) | Θ(n³) (master theorem case 1) |
| Span M∞(n) | M∞(n/2) + Θ(lg n) | Θ(lg² n) |
| Parallelism | M₁/M∞ | Θ(n³ / lg² n) |
For the span: all 8 recursive calls execute in parallel on same-size matrices, so the span counts just one of them; the Θ(lg n) term comes from the parallel addition loops in lines 15–17.
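A serialization of the algorithm can be sketched in Python on plain nested lists (for n a power of 2; a returning-style variant rather than the in-place pseudocode, so submatrix partitioning becomes slicing):

```python
def p_mmr(A, B):
    # Serialization of P-MATRIX-MULTIPLY-RECURSIVE: 8 recursive products
    # (spawned in parallel in the multithreaded version), then C = C + T
    # via (parallel) loops.  Assumes n x n inputs with n a power of 2.
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, r, c):  # extract an (n/2) x (n/2) submatrix
        return [row[c:c + h] for row in M[r:r + h]]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    # The 8 products below correspond to the spawned/called lines 6-13.
    C11, C12 = p_mmr(A11, B11), p_mmr(A11, B12)
    C21, C22 = p_mmr(A21, B11), p_mmr(A21, B12)
    T11, T12 = p_mmr(A12, B21), p_mmr(A12, B22)
    T21, T22 = p_mmr(A22, B21), p_mmr(A22, B22)
    # sync, then add T into C (lines 15-17, doubly parallel loops).
    def add(C, T):
        return [[c + t for c, t in zip(cr, tr)] for cr, tr in zip(C, T)]
    C11, C12 = add(C11, T11), add(C12, T12)
    C21, C22 = add(C21, T21), add(C22, T22)
    top = [l + r for l, r in zip(C11, C12)]
    bot = [l + r for l, r in zip(C21, C22)]
    return top + bot

print(p_mmr([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```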
Multithreading Strassen's Method
Apply the same parallelization strategy to Strassen's algorithm:
- Divide A, B, C into (n/2)×(n/2) submatrices — Θ(1) work and span
- Create 10 sum/difference matrices S₁…S₁₀ — Θ(n²) work, Θ(lg n) span (parallel loops)
- Recursively spawn 7 matrix products P₁…P₇ in parallel
- Combine results into C₁₁, C₁₂, C₂₁, C₂₂ — Θ(n²) work, Θ(lg n) span
| Metric | Value |
|---|---|
| Work | Θ(n^lg7) ≈ Θ(n^2.81) — same as serial Strassen |
| Span | M∞(n/2) + Θ(lg n) = Θ(lg² n) — same recurrence as above |
| Parallelism | Θ(n^lg7 / lg² n) |
27.3 Multithreaded Merge Sort
Naive Approach: Spawn the Recursive Calls
MERGE-SORT'(A, p, r)
1 if p < r
2 q = ⌊(p + r) / 2⌋
3 spawn MERGE-SORT'(A, p, q)
4 MERGE-SORT'(A, q + 1, r)
5 sync
6 MERGE(A, p, q, r)
| Metric | Recurrence | Solution |
|---|---|---|
| Work | 2·MS'₁(n/2) + Θ(n) | Θ(n lg n) |
| Span | MS'∞(n/2) + Θ(n) | Θ(n) |
| Parallelism | MS'₁/MS'∞ | Θ(lg n) — very low! |
Bottleneck: The serial MERGE procedure has Θ(n) span, dominating the span recurrence.
Multithreaded Merge (P-MERGE)
Key idea: Divide-and-conquer merging using the median of the larger subarray and binary search.
Strategy (merging sorted T[p₁..r₁] and T[p₂..r₂] into A[p₃..r₃]):
- Ensure n₁ ≥ n₂ (swap if needed)
- Find median x = T[q₁] of the larger subarray, where q₁ = ⌊(p₁+r₁)/2⌋
- Binary search for position q₂ in T[p₂..r₂] where x would be inserted
- Compute output position q₃ = p₃ + (q₁ − p₁) + (q₂ − p₂)
- Place x into A[q₃]
- Recursively merge the two halves in parallel
BINARY-SEARCH(x, T, p, r)
1 low = p
2 high = max(p, r + 1)
3 while low < high
4 mid = ⌊(low + high) / 2⌋
5 if x ≤ T[mid]
6 high = mid
7 else low = mid + 1
8 return high
P-MERGE(T, p₁, r₁, p₂, r₂, A, p₃)
1 n₁ = r₁ - p₁ + 1
2 n₂ = r₂ - p₂ + 1
3 if n₁ < n₂ // ensure n₁ ≥ n₂
4 exchange p₁ with p₂
5 exchange r₁ with r₂
6 exchange n₁ with n₂
7 if n₁ == 0 // both empty?
8 return
9 else q₁ = ⌊(p₁ + r₁) / 2⌋
10 q₂ = BINARY-SEARCH(T[q₁], T, p₂, r₂)
11 q₃ = p₃ + (q₁ - p₁) + (q₂ - p₂)
12 A[q₃] = T[q₁]
13 spawn P-MERGE(T, p₁, q₁-1, p₂, q₂-1, A, p₃)
14 P-MERGE(T, q₁+1, r₁, q₂, r₂, A, q₃+1)
15 sync
Worst-case analysis: Each recursive call handles at most 3n/4 elements:
- Since n₁ ≥ n₂, we have n₂ ≤ n/2
- One recursive call merges at most ⌊n₁/2⌋ + n₂ ≤ n₁/2 + n₂/2 + n₂/2 = n/2 + n₂/2 ≤ n/2 + n/4 = 3n/4 (using n₁ + n₂ = n and n₂ ≤ n/2)
P-MERGE Analysis:
| Metric | Recurrence | Solution |
|---|---|---|
| Span PM∞(n) | PM∞(3n/4) + Θ(lg n) | Θ(lg² n) |
| Work PM₁(n) | PM₁(αn) + PM₁((1−α)n) + O(lg n), where 1/4 ≤ α ≤ 3/4 | Θ(n) |
| Parallelism | PM₁/PM∞ | Θ(n / lg² n) |
Work proof (substitution method): Assume PM₁(n) ≤ c₁n − c₂ lg n. The two recursive calls together process at most n elements, and the binary search costs O(lg n). Substituting confirms PM₁(n) = O(n). Combined with the Ω(n) lower bound (every element must be copied), we get Θ(n).
Full Multithreaded Merge Sort (P-MERGE-SORT)
P-MERGE-SORT(A, p, r, B, s)
1 n = r - p + 1
2 if n == 1
3 B[s] = A[p]
4 else let T[1..n] be a new array
5 q = ⌊(p + r) / 2⌋
6 q' = q - p + 1
7 spawn P-MERGE-SORT(A, p, q, T, 1)
8 P-MERGE-SORT(A, q+1, r, T, q'+1)
9 sync
10 P-MERGE(T, 1, q', q'+1, n, B, s)
How it works:
- Recursively sort both halves into temporary array T (in parallel via spawn)
- After sync, merge the two sorted halves from T into output array B using P-MERGE
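The whole pipeline can be sketched as a 0-based Python serialization (the spawned calls simply run in sequence here; in the multithreaded version each spawn runs in parallel with the following call and a sync precedes the merge):

```python
def binary_search(x, T, p, r):
    # Index of the first element of T[p..r] (inclusive) that is >= x,
    # or r + 1 if every element is smaller.
    low, high = p, max(p, r + 1)
    while low < high:
        mid = (low + high) // 2
        if x <= T[mid]:
            high = mid
        else:
            low = mid + 1
    return high

def p_merge(T, p1, r1, p2, r2, A, p3):
    # Serialization of P-MERGE: place the median of the larger subarray,
    # then merge the two halves (in parallel in the multithreaded version).
    n1, n2 = r1 - p1 + 1, r2 - p2 + 1
    if n1 < n2:                                  # ensure n1 >= n2
        p1, p2, r1, r2, n1, n2 = p2, p1, r2, r1, n2, n1
    if n1 == 0:                                  # both subarrays empty
        return
    q1 = (p1 + r1) // 2
    q2 = binary_search(T[q1], T, p2, r2)
    q3 = p3 + (q1 - p1) + (q2 - p2)
    A[q3] = T[q1]
    p_merge(T, p1, q1 - 1, p2, q2 - 1, A, p3)    # spawned
    p_merge(T, q1 + 1, r1, q2, r2, A, q3 + 1)
    # implicit sync

def p_merge_sort(A, p, r, B, s):
    # Serialization of P-MERGE-SORT: sort both halves of A[p..r] into a
    # temporary array T, then P-MERGE them into B[s..s + n - 1].
    n = r - p + 1
    if n == 1:
        B[s] = A[p]
        return
    T = [0] * n
    q = (p + r) // 2
    q2 = q - p + 1
    p_merge_sort(A, p, q, T, 0)                  # spawned
    p_merge_sort(A, q + 1, r, T, q2)
    # sync
    p_merge(T, 0, q2 - 1, q2, n - 1, B, s)

A = [5, 2, 9, 1, 7, 3]
B = [0] * len(A)
p_merge_sort(A, 0, len(A) - 1, B, 0)
print(B)  # [1, 2, 3, 5, 7, 9]
```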
P-MERGE-SORT Analysis:
| Metric | Recurrence | Solution |
|---|---|---|
| Work PMS₁(n) | 2·PMS₁(n/2) + Θ(n) | Θ(n lg n) (master theorem case 2) |
| Span PMS∞(n) | PMS∞(n/2) + Θ(lg² n) | Θ(lg³ n) |
| Parallelism | PMS₁/PMS∞ | Θ(n lg n / lg³ n) = Θ(n / lg² n) |
Comparison of merge sort variants:
| Algorithm | Work | Span | Parallelism |
|---|---|---|---|
| MERGE-SORT' (serial merge) | Θ(n lg n) | Θ(n) | Θ(lg n) |
| P-MERGE-SORT (parallel merge) | Θ(n lg n) | Θ(lg³ n) | Θ(n / lg² n) |
The parallel merge dramatically improves parallelism from Θ(lg n) to Θ(n / lg² n).
Practical note: Coarsen the base case by switching to a serial sort (e.g., quicksort) for small subarrays to reduce overhead constants while maintaining high parallelism.
Summary of Key Results
| Algorithm | Work | Span | Parallelism |
|---|---|---|---|
| P-FIB(n) | Θ(φⁿ) | Θ(n) | Θ(φⁿ/n) |
| MAT-VEC (matrix-vector) | Θ(n²) | Θ(n) | Θ(n) |
| P-SQUARE-MATRIX-MULTIPLY | Θ(n³) | Θ(n) | Θ(n²) |
| P-MATRIX-MULTIPLY-RECURSIVE | Θ(n³) | Θ(lg² n) | Θ(n³/lg² n) |
| Multithreaded Strassen | Θ(n^lg7) | Θ(lg² n) | Θ(n^lg7/lg² n) |
| MERGE-SORT' (serial merge) | Θ(n lg n) | Θ(n) | Θ(lg n) |
| P-MERGE | Θ(n) | Θ(lg² n) | Θ(n/lg² n) |
| P-MERGE-SORT | Θ(n lg n) | Θ(lg³ n) | Θ(n/lg² n) |
Core principles:
- Work = serial running time of the serialization
- Span = critical path length; use max for parallel composition, sum for series
- Greedy bound: Tₚ ≤ T₁/P + T∞
- Slackness ≥ 10 is a practical target for near-linear speedup
- Avoid determinacy races by ensuring parallel strands are independent