Pagefy

Pagefy

Back to Data Structures and Algorithms

Medians and Order Statistics

Introduction to Algorithms

Chapter 9: Medians and Order Statistics

The ith order statistic of a set of n elements is the ith smallest element.

  • Minimum = 1st order statistic (i = 1)
  • Maximum = nth order statistic (i = n)
  • Median = the "halfway point" of the set
    • If n is odd: median is unique at i = (n + 1)/2
    • If n is even: two medians at i = n/2 (lower median) and i = n/2 + 1 (upper median)
    • By convention, "the median" refers to the lower median: i = ⌊(n + 1)/2⌋

The Selection Problem:

  • Input: A set A of n distinct numbers and an integer i, with 1 ≤ i ≤ n
  • Output: The element x ∈ A that is larger than exactly i − 1 other elements of A

A naive approach is to sort in O(n lg n) time and index the ith element. This chapter presents faster O(n) algorithms.


9.1 Minimum and Maximum

Finding the Minimum

We need n − 1 comparisons to find the minimum — examine each element and track the smallest seen so far.

MINIMUM(A)
1  min = A[1]
2  for i = 2 to A.length
3      if min > A[i]
4          min = A[i]
5  return min
  • Finding the maximum also requires n − 1 comparisons
  • This is optimal: think of it as a tournament — every element except the winner must lose at least one match, so n − 1 comparisons are necessary

Simultaneous Minimum and Maximum

Finding both independently costs 2n − 2 comparisons (n − 1 each). We can do better.

Key idea: Process elements in pairs.

  1. Compare pairs of elements with each other first
  2. Compare the smaller of the pair with the current minimum
  3. Compare the larger of the pair with the current maximum
  4. Cost: 3 comparisons per 2 elements

Initialization:

  • If n is odd: set both min and max to the first element, process the rest in pairs
  • If n is even: compare the first 2 elements to set initial min and max, then process the rest in pairs

Total comparisons: at most 3⌊n/2⌋

  • If n is odd: 3⌊n/2⌋
  • If n is even: 1 + 3(n − 2)/2 = 3n/2 − 2

9.2 Selection in Expected Linear Time

The general selection problem has the same asymptotic running time as finding a minimum: Θ(n).

RANDOMIZED-SELECT is a divide-and-conquer algorithm modeled after quicksort:

  • Recursively partitions the input array
  • Unlike quicksort, it only recurses on one side of the partition
  • This is why it achieves Θ(n) expected time instead of quicksort's Θ(n lg n)

The Algorithm

Uses RANDOMIZED-PARTITION from Section 7.3. Returns the ith smallest element of A[p..r].

RANDOMIZED-SELECT(A, p, r, i)
1  if p == r
2      return A[p]
3  q = RANDOMIZED-PARTITION(A, p, r)
4  k = q - p + 1
5  if i == k          // the pivot value is the answer
6      return A[q]
7  elseif i < k
8      return RANDOMIZED-SELECT(A, p, q - 1, i)
9  else return RANDOMIZED-SELECT(A, q + 1, r, i - k)

How It Works

  1. Base case (line 1): subarray has one element — return it
  2. Partition (line 3): splits A[p..r] into:
    • A[p..q−1] — elements ≤ A[q] (the pivot)
    • A[q+1..r] — elements > A[q]
  3. k (line 4): number of elements in the low side + pivot
  4. Decision:
    • If i == k → pivot is the answer (line 6)
    • If i < k → recurse on the low side (line 8)
    • If i > k → recurse on the high side, looking for the (i − k)th smallest (line 9)

Running Time Analysis

  • Worst case: Θ(n²) — if we always partition around the largest remaining element
  • Expected case: Θ(n) — assuming distinct elements
  • Since it's randomized, no particular input elicits worst-case behavior

Proof that E[T(n)] = O(n):

Let T(n) be the running time on n elements. Define indicator random variables:

  • X_k = I{subarray A[p..q] has exactly k elements}, so E[X_k] = 1/n

Assuming T(n) is monotonically increasing, we upper-bound by always recursing on the larger side:

$$T(n) \leq \sum_{k=1}^{n} X_k \cdot T(\max(k-1, n-k)) + O(n)$$

Evaluating max(k−1, n−k):

  • Each term from T(⌈n/2⌉) to T(n−1) appears at most twice in the summation

This gives:

$$E[T(n)] \leq \frac{2}{n} \sum_{k=\lfloor n/2 \rfloor}^{n-1} E[T(k)] + O(n)$$

By substitution (assume E[T(n)] ≤ cn), we can show:

$$E[T(n)] \leq cn - \left(\frac{cn}{4} - \frac{c}{2} - an\right)$$

Choosing c > 4a ensures cn/4 − c/2 − an ≥ 0 for sufficiently large n, proving E[T(n)] = O(n).

Conclusion: Any order statistic (including the median) can be found in expected linear time.


9.3 Selection in Worst-Case Linear Time

The SELECT algorithm achieves O(n) worst-case running time by guaranteeing a good split when partitioning.

  • Uses the deterministic PARTITION from quicksort (modified to take the pivot as input)
  • Key idea: choose the median of medians as the pivot

The Algorithm (5 Steps)

For an input array of n > 1 distinct elements, to find the ith smallest:

  1. Divide the n elements into ⌊n/5⌋ groups of 5, plus at most one group of the remaining n mod 5 elements

  2. Find the median of each of the ⌈n/5⌉ groups by insertion sorting each group (at most 5 elements) and picking the median

  3. Recursively use SELECT to find the median x of the ⌈n/5⌉ medians from step 2 (the median of medians)

  4. Partition the input array around x. Let k = number of elements on the low side + 1, so x is the kth smallest

  5. Recurse:

    • If i = k → return x
    • If i < k → SELECT recursively on the low side for the ith smallest
    • If i > k → SELECT recursively on the high side for the (i − k)th smallest

Analysis: Why It's Linear

At least 3n/10 − 6 elements are greater than x:

  • At least half of the ⌈n/5⌉ group medians are ≥ x
  • Each such group contributes at least 3 elements ≥ x (the median and the two above it)
  • Discounting the group containing x and a possible partial group:
    • Elements greater than x ≥ 3(⌈n/5⌉/2 − 2) ≥ 3n/10 − 6
  • Similarly, at least 3n/10 − 6 elements are less than x

Therefore, step 5 recurses on at most 7n/10 + 6 elements.

Recurrence

$$T(n) = \begin{cases} O(1) & \text{if } n < 140 \ T(\lceil n/5 \rceil) + T(7n/10 + 6) + O(n) & \text{if } n \geq 140 \end{cases}$$

Where:

  • T(⌈n/5⌉): step 3 — finding median of medians
  • T(7n/10 + 6): step 5 — recursing on the larger side
  • O(n): steps 1, 2, and 4 (partitioning, grouping, sorting groups of 5)

Proof of Linearity (by Substitution)

Assume T(n) ≤ cn. Substituting:

$$T(n) \leq c\lceil n/5 \rceil + c(7n/10 + 6) + an$$ $$\leq cn/5 + c + 7cn/10 + 6c + an$$ $$= 9cn/10 + 7c + an$$ $$= cn + (-cn/10 + 7c + an)$$

This is ≤ cn when: −cn/10 + 7c + an ≤ 0

Rearranging: c ≥ 10a · n/(n − 70) when n > 70.

Since n ≥ 140, we have n/(n − 70) ≤ 2, so choosing c ≥ 20a suffices.

The constant 140 is not special — any integer strictly greater than 70 works; we then choose c accordingly.

Key Takeaway

  • Both SELECT and RANDOMIZED-SELECT determine relative order only by comparing elements
  • Sorting requires Ω(n lg n) in the comparison model
  • Selection algorithms achieve O(n) without sorting — they are not subject to the Ω(n lg n) lower bound
  • Solving selection by sorting and indexing is asymptotically inefficient

Summary

AlgorithmExpected TimeWorst-Case TimeNotes
Minimum/MaximumΘ(n)Θ(n)n − 1 comparisons (optimal)
Simultaneous Min & Max3⌊n/2⌋ comparisonsProcess elements in pairs
RANDOMIZED-SELECTΘ(n)Θ(n²)Practical, randomized
SELECT (median of medians)Θ(n)Θ(n)Deterministic, more theoretical

Chapter Notes:

  • The worst-case linear-time algorithm was devised by Blum, Floyd, Pratt, Rivest, and Tarjan (1973)
  • The randomized version is due to Hoare
  • The exact number of comparisons needed to determine the median is still unknown — best bounds are approximately between 2n and 2.95n