Multithreaded Algorithms
Introduction to Algorithms, Chapter 27
This chapter extends the algorithmic model to parallel algorithms running on multiprocessor computers. We explore dynamic multithreaded algorithms, which are amenable to algorithmic design, analysis, and efficient implementation.
Key idea: Programmers specify only the logical parallelism; the underlying concurrency platform handles scheduling and load balancing automatically.
Three concurrency keywords added to pseudocode:
- spawn — allows a subroutine to run in parallel with its caller
- sync — waits for all spawned children to complete
- parallel — indicates loop iterations may execute concurrently
The serialization of a multithreaded algorithm is the serial algorithm obtained by deleting spawn, sync, and parallel keywords.
27.1 The Basics of Dynamic Multithreading
Fibonacci Example
The classic recursive Fibonacci serves as a motivating example:
FIB(n)
1 if n ≤ 1
2 return n
3 else x = FIB(n - 1)
4 y = FIB(n - 2)
5 return x + y
- Running time: T(n) = T(n−1) + T(n−2) + Θ(1) = Θ(φⁿ), where φ = (1+√5)/2
- The two recursive calls are independent → can run in parallel
Multithreaded version:
P-FIB(n)
1 if n ≤ 1
2 return n
3 else x = spawn P-FIB(n - 1)
4 y = P-FIB(n - 2)
5 sync
6 return x + y
- spawn in line 3: the parent may continue executing line 4 in parallel with the spawned child
- sync in line 5: waits for all spawned children before summing x + y
- Every procedure executes an implicit sync before returning
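The spawn/sync structure of P-FIB can be sketched in Python (a hypothetical illustration, not the book's code; CPython's GIL limits real CPU parallelism, so this only shows how spawn maps to submitting a task and sync to waiting on its future):

```python
from concurrent.futures import ThreadPoolExecutor

def fib(n):
    # Serial FIB, used for the subproblems.
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

def p_fib(n):
    # Sketch of P-FIB's control structure: spawn one recursive call,
    # run the other in the caller, then sync before combining.
    if n <= 1:
        return n
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fib, n - 1)   # spawn P-FIB(n - 1)
        y = fib(n - 2)                     # caller continues (line 4)
        x = future.result()                # sync: wait for the spawned child
    return x + y

print(p_fib(10))  # 55
```

Deleting the executor (calling `fib` directly for both subproblems) yields exactly the serialization described above.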
Computation DAG Model
A multithreaded computation is modeled as a directed acyclic graph (DAG) G = (V, E):
- Vertices represent strands (maximal sequences of instructions with no parallel control)
- Edges represent dependencies between strands
Edge types:
- Continuation edge — connects a strand to its successor within the same procedure (horizontal)
- Spawn edge — connects a strand to a spawned child (downward)
- Call edge — connects a strand to a normally called child (downward)
- Return edge — connects a returning strand back to the caller's strand after sync (upward)
Two strands are in series if there is a directed path between them; otherwise they are in parallel.
Performance Measures
Let Tₚ = running time on P processors.
| Metric | Definition | Notation |
|---|---|---|
| Work | Total time on 1 processor (sum of all strand times) | T₁ |
| Span | Longest path in the DAG (critical path) | T∞ |
Fundamental laws:
- Work Law: Tₚ ≥ T₁ / P
- Span Law: Tₚ ≥ T∞
Derived metrics:
- Speedup on P processors: T₁ / Tₚ (at most P)
- Linear speedup: T₁/Tₚ = Θ(P)
- Perfect linear speedup: T₁/Tₚ = P
- Parallelism: T₁ / T∞ — the average work per step along the critical path; upper bound on useful speedup
- Slackness: (T₁/T∞) / P — factor by which parallelism exceeds processor count
Rule of thumb: A slackness of at least 10 generally suffices for good speedup.
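The derived metrics are simple ratios, as this small sketch shows (the numbers are made up for illustration):

```python
def parallelism(work, span):
    # T1 / Tinf: average work per step along the critical path,
    # and an upper bound on achievable speedup.
    return work / span

def slackness(work, span, P):
    # (T1 / Tinf) / P: factor by which parallelism exceeds processor count.
    return parallelism(work, span) / P

# Hypothetical computation: work 10^6, span 10^3, run on 64 processors.
print(parallelism(1e6, 1e3))    # 1000.0
print(slackness(1e6, 1e3, 64))  # 15.625 -> above the rule-of-thumb 10
```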
Analysis of P-FIB
| Metric | Value |
|---|---|
| Work T₁(n) | Θ(φⁿ) — same as serial FIB |
| Span T∞(n) | T∞(n) = T∞(n−1) + Θ(1) = Θ(n) |
| Parallelism | Θ(φⁿ / n) — grows dramatically with n |
For the span: the two recursive calls run in parallel, so we take the max of their spans:
- T∞(n) = max(T∞(n−1), T∞(n−2)) + Θ(1) = T∞(n−1) + Θ(1) = Θ(n)
Composition rules for work and span:
- Series composition: work adds, span adds
- Parallel composition: work adds, span takes the max
Scheduling
A greedy scheduler assigns as many strands to processors as possible each time step:
- Complete step: ≥ P strands ready → assign P of them
- Incomplete step: < P strands ready → assign all ready strands
Theorem 27.1 (Greedy Scheduler Bound):
Tₚ ≤ T₁/P + T∞
Corollary 27.2: A greedy scheduler is within a factor of 2 of optimal:
Tₚ ≤ 2 · T*ₚ, where T*ₚ is the optimal running time on P processors
Corollary 27.3: If P ≪ T₁/T∞ (high slackness), then Tₚ ≈ T₁/P (near-perfect linear speedup).
Proof sketch of Theorem 27.1:
- Number of complete steps ≤ ⌊T₁/P⌋ (otherwise total work exceeds T₁)
- Each incomplete step reduces the span of the remaining DAG by 1 → at most T∞ incomplete steps
- Total: Tₚ ≤ ⌊T₁/P⌋ + T∞ ≤ T₁/P + T∞
Parallel Loops
A parallel for loop is implemented as a divide-and-conquer binary tree of recursive spawns:
MAT-VEC(A, x)
1 n = A.rows
2 let y be a new vector of length n
3 parallel for i = 1 to n
4 yᵢ = 0
5 parallel for i = 1 to n
6 for j = 1 to n
7 yᵢ = yᵢ + aᵢⱼxⱼ
8 return y
The compiler translates a parallel for with n iterations into a recursive procedure with depth Θ(lg n).
Span of a parallel loop with n iterations, where iteration i has span iter∞(i):
T∞ = Θ(lg n) + max₁≤ᵢ≤ₙ iter∞(i)
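The divide-and-conquer expansion of a parallel for can be sketched as a serialization in Python (function names are my own; the recursion halves the index range, which is where the Θ(lg n) span term comes from):

```python
def main_loop(A, x, y, n, lo, hi):
    # Serialization of the compiler's expansion of "parallel for i = lo to hi"
    # (0-based, inclusive): split the range in half and recurse on each half.
    # In the parallel version the first half is spawned and a sync follows;
    # the recursion tree has depth Theta(lg n).
    if lo == hi:
        for j in range(n):                 # body of one iteration (serial)
            y[lo] += A[lo][j] * x[j]
    else:
        mid = (lo + hi) // 2
        main_loop(A, x, y, n, lo, mid)     # spawned in the parallel version
        main_loop(A, x, y, n, mid + 1, hi)
        # implicit sync here in the parallel version

def mat_vec(A, x):
    n = len(A)
    y = [0] * n
    main_loop(A, x, y, n, 0, n - 1)
    return y

print(mat_vec([[1, 2], [3, 4]], [1, 1]))  # [3, 7]
```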
MAT-VEC analysis:
- Work: T₁(n) = Θ(n²)
- Span: T∞(n) = Θ(n) — dominated by the inner serial loop of n iterations
- Parallelism: Θ(n)
Race Conditions
A determinacy race occurs when two logically parallel instructions access the same memory location and at least one performs a write.
RACE-EXAMPLE()
1 x = 0
2 parallel for i = 1 to 2
3 x = x + 1
4 print x
- May print 1 instead of 2 because x = x + 1 is not atomic (read → increment → write)
- If two processors interleave these sub-operations, one update can be lost
Avoiding races: Ensure strands operating in parallel are independent (no shared writes):
- All iterations of a parallel for should be independent
- Between spawn and sync, parent and child code should be independent
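The lost update can be replayed deterministically by simulating the bad interleaving by hand (a sketch, not real concurrent code):

```python
# Deterministic replay of the bad interleaving in RACE-EXAMPLE.
# Each strand performs x = x + 1 as three sub-operations:
# read x, increment a private copy, write back.
x = 0

a_local = x          # strand A: read  (a_local = 0)
# ... strand B runs its entire update before A writes back:
b_local = x          # strand B: read  (b_local = 0)
x = b_local + 1      # strand B: write (x = 1)
# strand A resumes with its stale value:
x = a_local + 1      # strand A: write (x = 1) -- B's update is lost

print(x)  # 1, not 2
```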
Chess Lesson (Work/Span Extrapolation)
A practical example showing why work and span are better for predicting performance than measured runtimes:
- Original version: T₁ = 2048 s, T∞ = 1 s → T₃₂ = 65 s, T₅₁₂ = 5 s
- "Optimized" version: T₁' = 1024 s, T∞' = 8 s → T₃₂' = 40 s, T₅₁₂' = 10 s
The optimization helped on 32 processors but hurt on 512 because the increased span became the dominant term.
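The four running times above follow directly from the greedy bound of Theorem 27.1:

```python
def greedy_time(work, span, P):
    # Estimated running time under a greedy scheduler: T_P ~ T_1/P + T_inf.
    return work / P + span

# Original version: T1 = 2048 s, Tinf = 1 s.
# "Optimized" version: T1 = 1024 s, Tinf = 8 s.
for P in (32, 512):
    print(P, greedy_time(2048, 1, P), greedy_time(1024, 8, P))
# 32 65.0 40.0   -> optimization wins on 32 processors
# 512 5.0 10.0   -> optimization loses on 512: the span term dominates
```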
27.2 Multithreaded Matrix Multiplication
Parallelizing the Standard Algorithm
P-SQUARE-MATRIX-MULTIPLY(A, B)
1 n = A.rows
2 let C be a new n × n matrix
3 parallel for i = 1 to n
4 parallel for j = 1 to n
5 cᵢⱼ = 0
6 for k = 1 to n
7 cᵢⱼ = cᵢⱼ + aᵢₖ · bₖⱼ
8 return C
| Metric | Value |
|---|---|
| Work | Θ(n³) — same as serial triple loop |
| Span | Θ(lg n) + Θ(lg n) + Θ(n) = Θ(n) |
| Parallelism | Θ(n²) |
- The two parallel for loops contribute Θ(lg n) each to the span
- The inner serial for k loop contributes Θ(n), which dominates
Divide-and-Conquer Multithreaded Matrix Multiplication
Partition each n×n matrix into four (n/2)×(n/2) submatrices:
C = A · B where C₁₁ = A₁₁B₁₁ + A₁₂B₂₁, etc.
P-MATRIX-MULTIPLY-RECURSIVE(C, A, B)
1 n = A.rows
2 if n == 1
3 c₁₁ = a₁₁ · b₁₁
4 else let T be a new n × n matrix
5 partition A, B, C, and T into (n/2)×(n/2) submatrices
6 spawn P-MATRIX-MULTIPLY-RECURSIVE(C₁₁, A₁₁, B₁₁)
7 spawn P-MATRIX-MULTIPLY-RECURSIVE(C₁₂, A₁₁, B₁₂)
8 spawn P-MATRIX-MULTIPLY-RECURSIVE(C₂₁, A₂₁, B₁₁)
9 spawn P-MATRIX-MULTIPLY-RECURSIVE(C₂₂, A₂₁, B₁₂)
10 spawn P-MATRIX-MULTIPLY-RECURSIVE(T₁₁, A₁₂, B₂₁)
11 spawn P-MATRIX-MULTIPLY-RECURSIVE(T₁₂, A₁₂, B₂₂)
12 spawn P-MATRIX-MULTIPLY-RECURSIVE(T₂₁, A₂₂, B₂₁)
13 P-MATRIX-MULTIPLY-RECURSIVE(T₂₂, A₂₂, B₂₂)
14 sync
15 parallel for i = 1 to n
16 parallel for j = 1 to n
17 cᵢⱼ = cᵢⱼ + tᵢⱼ
Strategy:
- Lines 6–9: Compute the first term of each Cᵢⱼ sum (into C)
- Lines 10–13: Compute the second term of each Cᵢⱼ sum (into temp matrix T)
- 7 spawned + 1 called = 8 recursive multiplications in parallel
- Lines 15–17: Add T into C using parallel loops
Analysis:
| Metric | Recurrence | Solution |
|---|---|---|
| Work M₁(n) | 8M₁(n/2) + Θ(n²) | Θ(n³) (master theorem case 1) |
| Span M∞(n) | M∞(n/2) + Θ(lg n) | Θ(lg² n) |
| Parallelism | M₁/M∞ | Θ(n³ / lg² n) |
For the span: all 8 recursive calls execute in parallel on same-size matrices, so the span counts just one of them; the Θ(lg n) term comes from the parallel addition loops in lines 15–17.
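A serialization of the algorithm can be sketched in Python on plain nested lists (for n a power of 2; a returning-style variant rather than the in-place pseudocode, so submatrix partitioning becomes slicing):

```python
def p_mmr(A, B):
    # Serialization of P-MATRIX-MULTIPLY-RECURSIVE: 8 recursive products
    # (spawned in parallel in the multithreaded version), then C = C + T
    # via (parallel) loops.  Assumes n x n inputs with n a power of 2.
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, r, c):  # extract an (n/2) x (n/2) submatrix
        return [row[c:c + h] for row in M[r:r + h]]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    # The 8 products below correspond to the spawned/called lines 6-13.
    C11, C12 = p_mmr(A11, B11), p_mmr(A11, B12)
    C21, C22 = p_mmr(A21, B11), p_mmr(A21, B12)
    T11, T12 = p_mmr(A12, B21), p_mmr(A12, B22)
    T21, T22 = p_mmr(A22, B21), p_mmr(A22, B22)
    # sync, then add T into C (lines 15-17, doubly parallel loops).
    def add(C, T):
        return [[c + t for c, t in zip(cr, tr)] for cr, tr in zip(C, T)]
    C11, C12 = add(C11, T11), add(C12, T12)
    C21, C22 = add(C21, T21), add(C22, T22)
    top = [l + r for l, r in zip(C11, C12)]
    bot = [l + r for l, r in zip(C21, C22)]
    return top + bot

print(p_mmr([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```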
Multithreading Strassen's Method
Apply the same parallelization strategy to Strassen's algorithm:
- Divide A, B, C into (n/2)×(n/2) submatrices — Θ(1) work and span
- Create 10 sum/difference matrices S₁…S₁₀ — Θ(n²) work, Θ(lg n) span (parallel loops)
- Recursively spawn 7 matrix products P₁…P₇ in parallel
- Combine results into C₁₁, C₁₂, C₂₁, C₂₂ — Θ(n²) work, Θ(lg n) span
| Metric | Value |
|---|---|
| Work | Θ(n^lg7) ≈ Θ(n^2.81) — same as serial Strassen |
| Span | M∞(n/2) + Θ(lg n) = Θ(lg² n) — same recurrence as above |
| Parallelism | Θ(n^lg7 / lg² n) |
27.3 Multithreaded Merge Sort
Naive Approach: Spawn the Recursive Calls
MERGE-SORT'(A, p, r)
1 if p < r
2 q = ⌊(p + r) / 2⌋
3 spawn MERGE-SORT'(A, p, q)
4 MERGE-SORT'(A, q + 1, r)
5 sync
6 MERGE(A, p, q, r)
| Metric | Recurrence | Solution |
|---|---|---|
| Work | 2·MS'₁(n/2) + Θ(n) | Θ(n lg n) |
| Span | MS'∞(n/2) + Θ(n) | Θ(n) |
| Parallelism | MS'₁/MS'∞ | Θ(lg n) — very low! |
Bottleneck: The serial MERGE procedure has Θ(n) span, dominating the span recurrence.
Multithreaded Merge (P-MERGE)
Key idea: Divide-and-conquer merging using the median of the larger subarray and binary search.
Strategy (merging sorted T[p₁..r₁] and T[p₂..r₂] into A[p₃..r₃]):
- Ensure n₁ ≥ n₂ (swap if needed)
- Find median x = T[q₁] of the larger subarray, where q₁ = ⌊(p₁+r₁)/2⌋
- Binary search for position q₂ in T[p₂..r₂] where x would be inserted
- Compute output position q₃ = p₃ + (q₁ − p₁) + (q₂ − p₂)
- Place x into A[q₃]
- Recursively merge the two halves in parallel
BINARY-SEARCH(x, T, p, r)
1 low = p
2 high = max(p, r + 1)
3 while low < high
4 mid = ⌊(low + high) / 2⌋
5 if x ≤ T[mid]
6 high = mid
7 else low = mid + 1
8 return high
P-MERGE(T, p₁, r₁, p₂, r₂, A, p₃)
1 n₁ = r₁ - p₁ + 1
2 n₂ = r₂ - p₂ + 1
3 if n₁ < n₂ // ensure n₁ ≥ n₂
4 exchange p₁ with p₂
5 exchange r₁ with r₂
6 exchange n₁ with n₂
7 if n₁ == 0 // both empty?
8 return
9 else q₁ = ⌊(p₁ + r₁) / 2⌋
10 q₂ = BINARY-SEARCH(T[q₁], T, p₂, r₂)
11 q₃ = p₃ + (q₁ - p₁) + (q₂ - p₂)
12 A[q₃] = T[q₁]
13 spawn P-MERGE(T, p₁, q₁-1, p₂, q₂-1, A, p₃)
14 P-MERGE(T, q₁+1, r₁, q₂, r₂, A, q₃+1)
15 sync
Worst-case analysis: Each recursive call handles at most 3n/4 elements:
- Since n₁ ≥ n₂, we have n₂ ≤ n/2
- One recursive call merges at most ⌊n₁/2⌋ + n₂ ≤ n₁/2 + n₂/2 + n₂/2 = n/2 + n₂/2 ≤ n/2 + n/4 = 3n/4 (using n₁ + n₂ = n and n₂ ≤ n/2)
P-MERGE Analysis:
| Metric | Recurrence | Solution |
|---|---|---|
| Span PM∞(n) | PM∞(3n/4) + Θ(lg n) | Θ(lg² n) |
| Work PM₁(n) | PM₁(αn) + PM₁((1−α)n) + O(lg n), where 1/4 ≤ α ≤ 3/4 | Θ(n) |
| Parallelism | PM₁/PM∞ | Θ(n / lg² n) |
Work proof (substitution method): Assume PM₁(n) ≤ c₁n − c₂ lg n. The two recursive calls together process at most n elements, and the binary search costs O(lg n). Substituting confirms PM₁(n) = O(n). Combined with the Ω(n) lower bound (every element must be copied), we get Θ(n).
Full Multithreaded Merge Sort (P-MERGE-SORT)
P-MERGE-SORT(A, p, r, B, s)
1 n = r - p + 1
2 if n == 1
3 B[s] = A[p]
4 else let T[1..n] be a new array
5 q = ⌊(p + r) / 2⌋
6 q' = q - p + 1
7 spawn P-MERGE-SORT(A, p, q, T, 1)
8 P-MERGE-SORT(A, q+1, r, T, q'+1)
9 sync
10 P-MERGE(T, 1, q', q'+1, n, B, s)
How it works:
- Recursively sort both halves into temporary array T (in parallel via spawn)
- After sync, merge the two sorted halves from T into output array B using P-MERGE
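The whole pipeline can be sketched as a 0-based Python serialization (the spawned calls simply run in sequence here; in the multithreaded version each spawn runs in parallel with the following call and a sync precedes the merge):

```python
def binary_search(x, T, p, r):
    # Index of the first element of T[p..r] (inclusive) that is >= x,
    # or r + 1 if every element is smaller.
    low, high = p, max(p, r + 1)
    while low < high:
        mid = (low + high) // 2
        if x <= T[mid]:
            high = mid
        else:
            low = mid + 1
    return high

def p_merge(T, p1, r1, p2, r2, A, p3):
    # Serialization of P-MERGE: place the median of the larger subarray,
    # then merge the two halves (in parallel in the multithreaded version).
    n1, n2 = r1 - p1 + 1, r2 - p2 + 1
    if n1 < n2:                                  # ensure n1 >= n2
        p1, p2, r1, r2, n1, n2 = p2, p1, r2, r1, n2, n1
    if n1 == 0:                                  # both subarrays empty
        return
    q1 = (p1 + r1) // 2
    q2 = binary_search(T[q1], T, p2, r2)
    q3 = p3 + (q1 - p1) + (q2 - p2)
    A[q3] = T[q1]
    p_merge(T, p1, q1 - 1, p2, q2 - 1, A, p3)    # spawned
    p_merge(T, q1 + 1, r1, q2, r2, A, q3 + 1)
    # implicit sync

def p_merge_sort(A, p, r, B, s):
    # Serialization of P-MERGE-SORT: sort both halves of A[p..r] into a
    # temporary array T, then P-MERGE them into B[s..s + n - 1].
    n = r - p + 1
    if n == 1:
        B[s] = A[p]
        return
    T = [0] * n
    q = (p + r) // 2
    q2 = q - p + 1
    p_merge_sort(A, p, q, T, 0)                  # spawned
    p_merge_sort(A, q + 1, r, T, q2)
    # sync
    p_merge(T, 0, q2 - 1, q2, n - 1, B, s)

A = [5, 2, 9, 1, 7, 3]
B = [0] * len(A)
p_merge_sort(A, 0, len(A) - 1, B, 0)
print(B)  # [1, 2, 3, 5, 7, 9]
```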
P-MERGE-SORT Analysis:
| Metric | Recurrence | Solution |
|---|---|---|
| Work PMS₁(n) | 2·PMS₁(n/2) + Θ(n) | Θ(n lg n) (master theorem case 2) |
| Span PMS∞(n) | PMS∞(n/2) + Θ(lg² n) | Θ(lg³ n) |
| Parallelism | PMS₁/PMS∞ | Θ(n lg n / lg³ n) = Θ(n / lg² n) |
Comparison of merge sort variants:
| Algorithm | Work | Span | Parallelism |
|---|---|---|---|
| MERGE-SORT' (serial merge) | Θ(n lg n) | Θ(n) | Θ(lg n) |
| P-MERGE-SORT (parallel merge) | Θ(n lg n) | Θ(lg³ n) | Θ(n / lg² n) |
The parallel merge dramatically improves parallelism from Θ(lg n) to Θ(n / lg² n).
Practical note: Coarsen the base case by switching to a serial sort (e.g., quicksort) for small subarrays to reduce overhead constants while maintaining high parallelism.
Summary of Key Results
| Algorithm | Work | Span | Parallelism |
|---|---|---|---|
| P-FIB(n) | Θ(φⁿ) | Θ(n) | Θ(φⁿ/n) |
| MAT-VEC (matrix-vector) | Θ(n²) | Θ(n) | Θ(n) |
| P-SQUARE-MATRIX-MULTIPLY | Θ(n³) | Θ(n) | Θ(n²) |
| P-MATRIX-MULTIPLY-RECURSIVE | Θ(n³) | Θ(lg² n) | Θ(n³/lg² n) |
| Multithreaded Strassen | Θ(n^lg7) | Θ(lg² n) | Θ(n^lg7/lg² n) |
| MERGE-SORT' (serial merge) | Θ(n lg n) | Θ(n) | Θ(lg n) |
| P-MERGE | Θ(n) | Θ(lg² n) | Θ(n/lg² n) |
| P-MERGE-SORT | Θ(n lg n) | Θ(lg³ n) | Θ(n/lg² n) |
Core principles:
- Work = serial running time of the serialization
- Span = critical path length; use max for parallel composition, sum for series
- Greedy bound: Tₚ ≤ T₁/P + T∞
- Slackness ≥ 10 is a practical target for near-linear speedup
- Avoid determinacy races by ensuring parallel strands are independent