Medians and Order Statistics
Introduction to Algorithms, Chapter 9
The ith order statistic of a set of n elements is the ith smallest element.
- Minimum = 1st order statistic (i = 1)
- Maximum = nth order statistic (i = n)
- Median = the "halfway point" of the set
- If n is odd: median is unique at i = (n + 1)/2
- If n is even: two medians at i = n/2 (lower median) and i = n/2 + 1 (upper median)
- By convention, "the median" refers to the lower median: i = ⌊(n + 1)/2⌋
The Selection Problem:
- Input: A set A of n distinct numbers and an integer i, with 1 ≤ i ≤ n
- Output: The element x ∈ A that is larger than exactly i − 1 other elements of A
A naive approach is to sort in O(n lg n) time and index the ith element. This chapter presents faster O(n) algorithms.
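As a baseline, the sort-and-index approach is a one-liner (a minimal sketch; the function name is ours):

```python
def select_by_sorting(A, i):
    """Return the ith smallest element of A (1-indexed) in O(n lg n) time."""
    return sorted(A)[i - 1]

assert select_by_sorting([7, 2, 9, 4], 1) == 2   # minimum
assert select_by_sorting([7, 2, 9, 4], 4) == 9   # maximum
```

The rest of the chapter shows how to beat this bound.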
9.1 Minimum and Maximum
Finding the Minimum
We need n − 1 comparisons to find the minimum — examine each element and track the smallest seen so far.
MINIMUM(A)
1 min = A[1]
2 for i = 2 to A.length
3 if min > A[i]
4 min = A[i]
5 return min
- Finding the maximum also requires n − 1 comparisons
- This is optimal: think of it as a tournament — every element except the winner must lose at least one match, so n − 1 comparisons are necessary
Simultaneous Minimum and Maximum
Finding both independently costs 2n − 2 comparisons (n − 1 each). We can do better.
Key idea: Process elements in pairs.
- Compare pairs of elements with each other first
- Compare the smaller of the pair with the current minimum
- Compare the larger of the pair with the current maximum
- Cost: 3 comparisons per 2 elements
Initialization:
- If n is odd: set both min and max to the first element, process the rest in pairs
- If n is even: compare the first 2 elements to set initial min and max, then process the rest in pairs
Total comparisons: at most 3⌊n/2⌋
- If n is odd: 3⌊n/2⌋
- If n is even: 1 + 3(n − 2)/2 = 3n/2 − 2
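The pairing scheme above can be sketched as follows (a sketch under the chapter's assumptions; the comparison counter is added just to check the bound):

```python
def min_max(A):
    """Find min and max of A using at most 3*floor(n/2) comparisons."""
    n = len(A)
    if n % 2 == 1:                 # odd: first element seeds both min and max
        lo = hi = A[0]
        comparisons, start = 0, 1
    else:                          # even: one comparison seeds min and max
        lo, hi = (A[0], A[1]) if A[0] < A[1] else (A[1], A[0])
        comparisons, start = 1, 2
    for j in range(start, n, 2):   # 3 comparisons per remaining pair
        small, large = (A[j], A[j + 1]) if A[j] < A[j + 1] else (A[j + 1], A[j])
        if small < lo:
            lo = small
        if large > hi:
            hi = large
        comparisons += 3
    return lo, hi, comparisons

assert min_max([5, 1, 9, 3, 7]) == (1, 9, 6)   # n = 5 odd: 3*floor(5/2) = 6
assert min_max([4, 2, 8, 6]) == (2, 8, 4)      # n = 4 even: 3*4/2 - 2 = 4
```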
9.2 Selection in Expected Linear Time
The general selection problem has the same asymptotic running time as finding a minimum: Θ(n).
RANDOMIZED-SELECT is a divide-and-conquer algorithm modeled after quicksort:
- Recursively partitions the input array
- Unlike quicksort, it only recurses on one side of the partition
- This is why it achieves Θ(n) expected time instead of quicksort's Θ(n lg n)
The Algorithm
Uses RANDOMIZED-PARTITION from Section 7.3. Returns the ith smallest element of A[p..r].
RANDOMIZED-SELECT(A, p, r, i)
1 if p == r
2 return A[p]
3 q = RANDOMIZED-PARTITION(A, p, r)
4 k = q - p + 1
5 if i == k // the pivot value is the answer
6 return A[q]
7 elseif i < k
8 return RANDOMIZED-SELECT(A, p, q - 1, i)
9 else return RANDOMIZED-SELECT(A, q + 1, r, i - k)
How It Works
- Base case (line 1): subarray has one element — return it
- Partition (line 3): splits A[p..r] into:
- A[p..q−1] — elements ≤ A[q] (the pivot)
- A[q+1..r] — elements > A[q]
- k (line 4): number of elements in the low side + pivot
- Decision:
- If i == k → pivot is the answer (line 6)
- If i < k → recurse on the low side (line 8)
- If i > k → recurse on the high side, looking for the (i − k)th smallest (line 9)
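A direct Python transcription of the pseudocode (a sketch assuming distinct elements; `randomized_partition` follows the Lomuto scheme of Section 7.3):

```python
import random

def randomized_partition(A, p, r):
    """Partition A[p..r] around a uniformly random pivot; return its index."""
    k = random.randint(p, r)           # pick pivot position uniformly
    A[k], A[r] = A[r], A[k]
    x = A[r]
    i = p - 1
    for j in range(p, r):
        if A[j] <= x:
            i += 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[r] = A[r], A[i + 1]
    return i + 1

def randomized_select(A, p, r, i):
    """Return the ith smallest element of A[p..r] (i is 1-indexed)."""
    if p == r:
        return A[p]
    q = randomized_partition(A, p, r)
    k = q - p + 1                      # size of low side including the pivot
    if i == k:
        return A[q]
    elif i < k:
        return randomized_select(A, p, q - 1, i)
    else:
        return randomized_select(A, q + 1, r, i - k)

assert randomized_select([3, 7, 1, 9, 5], 0, 4, 3) == 5   # median of 5 elements
```

Note that only one of the two recursive calls on lines 8–9 of the pseudocode ever executes per level, which is what drives the expected cost down to Θ(n).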
Running Time Analysis
- Worst case: Θ(n²) — if we always partition around the largest remaining element
- Expected case: Θ(n) — assuming distinct elements
- Since it's randomized, no particular input elicits worst-case behavior
Proof that E[T(n)] = O(n):
Let T(n) be the running time on n elements. Define indicator random variables:
- X_k = I{subarray A[p..q] has exactly k elements}, so E[X_k] = 1/n
Assuming T(n) is monotonically increasing, we upper-bound by always recursing on the larger side:
$$T(n) \leq \sum_{k=1}^{n} X_k \cdot T(\max(k-1, n-k)) + O(n)$$
Evaluating max(k − 1, n − k):
- It equals k − 1 when k > ⌈n/2⌉ and n − k otherwise, so each term from T(⌈n/2⌉) to T(n − 1) appears at most twice in the summation
This gives:
$$E[T(n)] \leq \frac{2}{n} \sum_{k=\lfloor n/2 \rfloor}^{n-1} E[T(k)] + O(n)$$
By substitution (assume E[T(n)] ≤ cn), we can show:
$$E[T(n)] \leq cn - \left(\frac{cn}{4} - \frac{c}{2} - an\right)$$
Choosing c > 4a ensures cn/4 − c/2 − an ≥ 0 for sufficiently large n, proving E[T(n)] = O(n).
Conclusion: Any order statistic (including the median) can be found in expected linear time.
9.3 Selection in Worst-Case Linear Time
The SELECT algorithm achieves O(n) worst-case running time by guaranteeing a good split when partitioning.
- Uses the deterministic PARTITION from quicksort (modified to take the pivot as input)
- Key idea: choose the median of medians as the pivot
The Algorithm (5 Steps)
For an input array of n > 1 distinct elements, to find the ith smallest:
1. Divide the n elements into ⌊n/5⌋ groups of 5, plus at most one group of the remaining n mod 5 elements
2. Find the median of each of the ⌈n/5⌉ groups by insertion-sorting each group (at most 5 elements) and picking its median
3. Recursively use SELECT to find the median x of the ⌈n/5⌉ medians from step 2 (the median of medians)
4. Partition the input array around x. Let k = number of elements on the low side + 1, so x is the kth smallest
5. Recurse:
   - If i = k → return x
   - If i < k → SELECT recursively on the low side for the ith smallest
   - If i > k → SELECT recursively on the high side for the (i − k)th smallest
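The five steps can be sketched in Python (a simplified sketch: it partitions into new lists rather than in place, and assumes distinct elements as the chapter does):

```python
def select(A, i):
    """Worst-case O(n): return the ith smallest (1-indexed) of distinct A."""
    if len(A) == 1:
        return A[0]
    # Steps 1-2: groups of at most 5; median of each via sorting (constant work)
    medians = [sorted(A[j:j + 5])[(len(A[j:j + 5]) - 1) // 2]
               for j in range(0, len(A), 5)]
    # Step 3: recursively find the median of medians
    x = select(medians, (len(medians) + 1) // 2)
    # Step 4: partition around x
    low = [a for a in A if a < x]
    high = [a for a in A if a > x]
    k = len(low) + 1                   # x is the kth smallest
    # Step 5: recurse on one side only
    if i == k:
        return x
    elif i < k:
        return select(low, i)
    else:
        return select(high, i - k)

assert select(list(range(100, 0, -1)), 37) == 37
```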
Analysis: Why It's Linear
At least 3n/10 − 6 elements are greater than x:
- At least half of the ⌈n/5⌉ group medians are ≥ x
- Each such group contributes at least 3 elements ≥ x (the median and the two above it)
- Discounting the group containing x and a possible partial group:
- Number of elements greater than x ≥ 3(⌈⌈n/5⌉/2⌉ − 2) ≥ 3n/10 − 6
- Similarly, at least 3n/10 − 6 elements are less than x
Therefore, step 5 recurses on at most 7n/10 + 6 elements.
Recurrence
$$T(n) \leq \begin{cases} O(1) & \text{if } n < 140 \\ T(\lceil n/5 \rceil) + T(7n/10 + 6) + O(n) & \text{if } n \geq 140 \end{cases}$$
Where:
- T(⌈n/5⌉): step 3 — finding median of medians
- T(7n/10 + 6): step 5 — recursing on the larger side
- O(n): steps 1, 2, and 4 (partitioning, grouping, sorting groups of 5)
Proof of Linearity (by Substitution)
Assume T(n) ≤ cn. Substituting:
$$\begin{aligned} T(n) &\leq c\lceil n/5 \rceil + c(7n/10 + 6) + an \\ &\leq cn/5 + c + 7cn/10 + 6c + an \\ &= 9cn/10 + 7c + an \\ &= cn + (-cn/10 + 7c + an) \end{aligned}$$
This is ≤ cn when: −cn/10 + 7c + an ≤ 0
Rearranging: c ≥ 10a · n/(n − 70) when n > 70.
Since n ≥ 140, we have n/(n − 70) ≤ 2, so choosing c ≥ 20a suffices.
The constant 140 is not special — any integer strictly greater than 70 works; we then choose c accordingly.
Key Takeaway
- Both SELECT and RANDOMIZED-SELECT determine relative order only by comparing elements
- Sorting requires Ω(n lg n) in the comparison model
- Selection algorithms achieve O(n) without sorting — they are not subject to the Ω(n lg n) lower bound
- Solving selection by sorting and indexing is asymptotically inefficient
Summary
| Algorithm | Expected Time | Worst-Case Time | Notes |
|---|---|---|---|
| Minimum/Maximum | Θ(n) | Θ(n) | n − 1 comparisons (optimal) |
| Simultaneous Min & Max | — | 3⌊n/2⌋ comparisons | Process elements in pairs |
| RANDOMIZED-SELECT | Θ(n) | Θ(n²) | Practical, randomized |
| SELECT (median of medians) | Θ(n) | Θ(n) | Deterministic, more theoretical |
Chapter Notes:
- The worst-case linear-time algorithm was devised by Blum, Floyd, Pratt, Rivest, and Tarjan (1973)
- The randomized version is due to Hoare
- The exact number of comparisons needed to determine the median is still unknown — best bounds are approximately between 2n and 2.95n