Medians and Order Statistics
Introduction to Algorithms, Chapter 9
The ith order statistic of a set of n elements is the ith smallest element.
- Minimum = 1st order statistic (i = 1)
- Maximum = nth order statistic (i = n)
- Median = the "halfway point" of the set
- If n is odd: median is unique at i = (n + 1)/2
- If n is even: two medians at i = n/2 (lower median) and i = n/2 + 1 (upper median)
- By convention, "the median" refers to the lower median: i = ⌊(n + 1)/2⌋
The Selection Problem:
- Input: A set A of n distinct numbers and an integer i, with 1 ≤ i ≤ n
- Output: The element x ∈ A that is larger than exactly i − 1 other elements of A
A naive approach is to sort in O(n lg n) time and index the ith element. This chapter presents faster O(n) algorithms.
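As a baseline, the sort-and-index approach is a one-liner (a minimal sketch; the function name is ours):

```python
def select_by_sorting(A, i):
    """Return the ith smallest element of A (1-indexed) in O(n lg n) time."""
    return sorted(A)[i - 1]

assert select_by_sorting([7, 2, 9, 4], 1) == 2   # minimum
assert select_by_sorting([7, 2, 9, 4], 4) == 9   # maximum
```

The rest of the chapter shows how to beat this bound.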
9.1 Minimum and Maximum
Finding the Minimum
We need n − 1 comparisons to find the minimum — examine each element and track the smallest seen so far.
MINIMUM(A)
1 min = A[1]
2 for i = 2 to A.length
3 if min > A[i]
4 min = A[i]
5 return min
- Finding the maximum also requires n − 1 comparisons
- This is optimal: think of it as a tournament — every element except the winner must lose at least one match, so n − 1 comparisons are necessary
Simultaneous Minimum and Maximum
Finding both independently costs 2n − 2 comparisons (n − 1 each). We can do better.
Key idea: Process elements in pairs.
- Compare pairs of elements with each other first
- Compare the smaller of the pair with the current minimum
- Compare the larger of the pair with the current maximum
- Cost: 3 comparisons per 2 elements
Initialization:
- If n is odd: set both min and max to the first element, process the rest in pairs
- If n is even: compare the first 2 elements to set initial min and max, then process the rest in pairs
Total comparisons: at most 3⌊n/2⌋
- If n is odd: 3⌊n/2⌋
- If n is even: 1 + 3(n − 2)/2 = 3n/2 − 2
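The pairing scheme above can be sketched as follows (a sketch under the chapter's assumptions; the comparison counter is added just to check the bound):

```python
def min_max(A):
    """Find min and max of A using at most 3*floor(n/2) comparisons."""
    n = len(A)
    if n % 2 == 1:                 # odd: first element seeds both min and max
        lo = hi = A[0]
        comparisons, start = 0, 1
    else:                          # even: one comparison seeds min and max
        lo, hi = (A[0], A[1]) if A[0] < A[1] else (A[1], A[0])
        comparisons, start = 1, 2
    for j in range(start, n, 2):   # 3 comparisons per remaining pair
        small, large = (A[j], A[j + 1]) if A[j] < A[j + 1] else (A[j + 1], A[j])
        if small < lo:
            lo = small
        if large > hi:
            hi = large
        comparisons += 3
    return lo, hi, comparisons

assert min_max([5, 1, 9, 3, 7]) == (1, 9, 6)   # n = 5 odd: 3*floor(5/2) = 6
assert min_max([4, 2, 8, 6]) == (2, 8, 4)      # n = 4 even: 3*4/2 - 2 = 4
```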
9.2 Selection in Expected Linear Time
The general selection problem has the same asymptotic running time as finding a minimum: Θ(n).
RANDOMIZED-SELECT is a divide-and-conquer algorithm modeled after quicksort:
- Recursively partitions the input array
- Unlike quicksort, it only recurses on one side of the partition
- This is why it achieves Θ(n) expected time instead of quicksort's Θ(n lg n)
The Algorithm
Uses RANDOMIZED-PARTITION from Section 7.3. Returns the ith smallest element of A[p..r].
RANDOMIZED-SELECT(A, p, r, i)
1 if p == r
2 return A[p]
3 q = RANDOMIZED-PARTITION(A, p, r)
4 k = q - p + 1
5 if i == k // the pivot value is the answer
6 return A[q]
7 elseif i < k
8 return RANDOMIZED-SELECT(A, p, q - 1, i)
9 else return RANDOMIZED-SELECT(A, q + 1, r, i - k)
How It Works
- Base case (line 1): subarray has one element — return it
- Partition (line 3): splits A[p..r] into:
- A[p..q−1] — elements ≤ A[q] (the pivot)
- A[q+1..r] — elements > A[q]
- k (line 4): number of elements in the low side + pivot
- Decision:
- If i == k → pivot is the answer (line 6)
- If i < k → recurse on the low side (line 8)
- If i > k → recurse on the high side, looking for the (i − k)th smallest (line 9)
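A direct Python transcription of the pseudocode (a sketch assuming distinct elements; `randomized_partition` follows the Lomuto scheme of Section 7.3):

```python
import random

def randomized_partition(A, p, r):
    """Partition A[p..r] around a uniformly random pivot; return its index."""
    k = random.randint(p, r)           # pick pivot position uniformly
    A[k], A[r] = A[r], A[k]
    x = A[r]
    i = p - 1
    for j in range(p, r):
        if A[j] <= x:
            i += 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[r] = A[r], A[i + 1]
    return i + 1

def randomized_select(A, p, r, i):
    """Return the ith smallest element of A[p..r] (i is 1-indexed)."""
    if p == r:
        return A[p]
    q = randomized_partition(A, p, r)
    k = q - p + 1                      # size of low side including the pivot
    if i == k:
        return A[q]
    elif i < k:
        return randomized_select(A, p, q - 1, i)
    else:
        return randomized_select(A, q + 1, r, i - k)

assert randomized_select([3, 7, 1, 9, 5], 0, 4, 3) == 5   # median of 5 elements
```

Note that only one of the two recursive calls on lines 8–9 of the pseudocode ever executes per level, which is what drives the expected cost down to Θ(n).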
Running Time Analysis
- Worst case: Θ(n²) — if we always partition around the largest remaining element
- Expected case: Θ(n) — assuming distinct elements
- Since it's randomized, no particular input elicits worst-case behavior
Proof that E[T(n)] = O(n):
Let T(n) be the running time on n elements. Define indicator random variables:
- X_k = I{subarray A[p..q] has exactly k elements}, so E[X_k] = 1/n
Assuming T(n) is monotonically increasing, we upper-bound by always recursing on the larger side:
$$T(n) \leq \sum_{k=1}^{n} X_k \cdot T(\max(k-1, n-k)) + O(n)$$
Evaluating max(k − 1, n − k):
- It equals k − 1 when k > ⌈n/2⌉ and n − k otherwise, so each term from T(⌈n/2⌉) to T(n − 1) appears at most twice in the summation
This gives:
$$E[T(n)] \leq \frac{2}{n} \sum_{k=\lfloor n/2 \rfloor}^{n-1} E[T(k)] + O(n)$$
By substitution (assume E[T(n)] ≤ cn), we can show:
$$E[T(n)] \leq cn - \left(\frac{cn}{4} - \frac{c}{2} - an\right)$$
Choosing c > 4a ensures cn/4 − c/2 − an ≥ 0 for sufficiently large n, proving E[T(n)] = O(n).
Conclusion: Any order statistic (including the median) can be found in expected linear time.
9.3 Selection in Worst-Case Linear Time
The SELECT algorithm achieves O(n) worst-case running time by guaranteeing a good split when partitioning.
- Uses the deterministic PARTITION from quicksort (modified to take the pivot as input)
- Key idea: choose the median of medians as the pivot
The Algorithm (5 Steps)
For an input array of n > 1 distinct elements, to find the ith smallest:
1. Divide the n elements into ⌊n/5⌋ groups of 5, plus at most one group of the remaining n mod 5 elements
2. Find the median of each of the ⌈n/5⌉ groups by insertion-sorting each group (at most 5 elements) and picking its median
3. Recursively use SELECT to find the median x of the ⌈n/5⌉ medians from step 2 (the median of medians)
4. Partition the input array around x. Let k = number of elements on the low side + 1, so x is the kth smallest
5. Recurse:
   - If i = k → return x
   - If i < k → SELECT recursively on the low side for the ith smallest
   - If i > k → SELECT recursively on the high side for the (i − k)th smallest
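The five steps can be sketched in Python (a simplified sketch: it partitions into new lists rather than in place, and assumes distinct elements as the chapter does):

```python
def select(A, i):
    """Worst-case O(n): return the ith smallest (1-indexed) of distinct A."""
    if len(A) == 1:
        return A[0]
    # Steps 1-2: groups of at most 5; median of each via sorting (constant work)
    medians = [sorted(A[j:j + 5])[(len(A[j:j + 5]) - 1) // 2]
               for j in range(0, len(A), 5)]
    # Step 3: recursively find the median of medians
    x = select(medians, (len(medians) + 1) // 2)
    # Step 4: partition around x
    low = [a for a in A if a < x]
    high = [a for a in A if a > x]
    k = len(low) + 1                   # x is the kth smallest
    # Step 5: recurse on one side only
    if i == k:
        return x
    elif i < k:
        return select(low, i)
    else:
        return select(high, i - k)

assert select(list(range(100, 0, -1)), 37) == 37
```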
Analysis: Why It's Linear
At least 3n/10 − 6 elements are greater than x:
- At least half of the ⌈n/5⌉ group medians are ≥ x
- Each such group contributes at least 3 elements ≥ x (the median and the two above it)
- Discounting the group containing x and a possible partial group:
- Number of elements greater than x ≥ 3(⌈⌈n/5⌉/2⌉ − 2) ≥ 3n/10 − 6
- Similarly, at least 3n/10 − 6 elements are less than x
Therefore, step 5 recurses on at most 7n/10 + 6 elements.
Recurrence
$$T(n) \leq \begin{cases} O(1) & \text{if } n < 140 \\ T(\lceil n/5 \rceil) + T(7n/10 + 6) + O(n) & \text{if } n \geq 140 \end{cases}$$
Where:
- T(⌈n/5⌉): step 3 — finding median of medians
- T(7n/10 + 6): step 5 — recursing on the larger side
- O(n): steps 1, 2, and 4 (partitioning, grouping, sorting groups of 5)
Proof of Linearity (by Substitution)
Assume T(n) ≤ cn. Substituting:
$$\begin{aligned} T(n) &\leq c\lceil n/5 \rceil + c(7n/10 + 6) + an \\ &\leq cn/5 + c + 7cn/10 + 6c + an \\ &= 9cn/10 + 7c + an \\ &= cn + (-cn/10 + 7c + an) \end{aligned}$$
This is ≤ cn when: −cn/10 + 7c + an ≤ 0
Rearranging: c ≥ 10a · n/(n − 70) when n > 70.
Since n ≥ 140, we have n/(n − 70) ≤ 2, so choosing c ≥ 20a suffices.
The constant 140 is not special — any integer strictly greater than 70 works; we then choose c accordingly.
Key Takeaway
- Both SELECT and RANDOMIZED-SELECT determine relative order only by comparing elements
- Sorting requires Ω(n lg n) in the comparison model
- Selection algorithms achieve O(n) without sorting — they are not subject to the Ω(n lg n) lower bound
- Solving selection by sorting and indexing is asymptotically inefficient
Summary
| Algorithm | Expected Time | Worst-Case Time | Notes |
|---|---|---|---|
| Minimum/Maximum | Θ(n) | Θ(n) | n − 1 comparisons (optimal) |
| Simultaneous Min & Max | — | 3⌊n/2⌋ comparisons | Process elements in pairs |
| RANDOMIZED-SELECT | Θ(n) | Θ(n²) | Practical, randomized |
| SELECT (median of medians) | Θ(n) | Θ(n) | Deterministic, more theoretical |
Chapter Notes:
- The worst-case linear-time algorithm was devised by Blum, Floyd, Pratt, Rivest, and Tarjan (1973)
- The randomized version is due to Hoare
- The exact number of comparisons needed to determine the median is still unknown — best bounds are approximately between 2n and 2.95n