Hash Tables
*Introduction to Algorithms*, Chapter 11: Hash Tables
Many applications require a dynamic set supporting only INSERT, SEARCH, and DELETE — the dictionary operations. A hash table is an effective data structure for implementing dictionaries. Under reasonable assumptions, the average time to search for an element in a hash table is O(1).
A hash table generalizes the simpler notion of an ordinary array. Instead of using the key as an array index directly, the array index is computed from the key using a hash function. This reduces the required array size from |U| (universe size) to m (proportional to the number of keys actually stored).
Direct-Address Tables
Direct addressing works well when the universe U of keys is reasonably small.
- Suppose each element has a key drawn from U = {0, 1, ..., m − 1}, where m is not too large
- No two elements have the same key
- We use an array (the direct-address table) T[0 .. m − 1], where each slot corresponds to a key in U
- Slot k points to the element with key k, or contains NIL if no such element exists
Operations
All operations run in O(1) time:
DIRECT-ADDRESS-SEARCH(T, k)
return T[k]
DIRECT-ADDRESS-INSERT(T, x)
T[x.key] = x
DIRECT-ADDRESS-DELETE(T, x)
T[x.key] = NIL
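The three operations above can be sketched in Python (the class and `Element` tuple are our own names, not from the text; `None` plays the role of NIL):

```python
from collections import namedtuple

# Hypothetical element type: a key in {0, ..., m-1} plus satellite data
Element = namedtuple("Element", ["key", "value"])

class DirectAddressTable:
    def __init__(self, m):
        self.slots = [None] * m      # one slot per key in U = {0, ..., m-1}

    def search(self, k):             # DIRECT-ADDRESS-SEARCH: return T[k]
        return self.slots[k]

    def insert(self, x):             # DIRECT-ADDRESS-INSERT: T[x.key] = x
        self.slots[x.key] = x

    def delete(self, x):             # DIRECT-ADDRESS-DELETE: T[x.key] = NIL
        self.slots[x.key] = None
```

Each operation is a single array access, hence O(1).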
Space Optimization
- The table itself can hold elements directly (instead of pointers to external objects), saving space
- If stored in the slot, the key can often be omitted (the index implies the key)
- A special key or flag is needed to indicate empty slots
Hash Tables
Motivation
- If U is large, storing a table of size |U| may be impractical or impossible
- If the set K of actual keys is much smaller than U, most space in a direct-address table is wasted
- A hash table reduces storage to Θ(|K|) while maintaining O(1) average-case search time
Hash Functions and Hashing
- With hashing, an element with key k is stored in slot h(k), where h is a hash function:
- h : U → {0, 1, ..., m − 1}
- The size m of the hash table is typically much less than |U|
- h(k) is called the hash value of key k
Collisions
- A collision occurs when two keys hash to the same slot: h(k₁) = h(k₂)
- Since |U| > m, at least two keys must map to the same slot (pigeonhole principle), so collisions are unavoidable
- A well-designed hash function minimizes collisions but cannot eliminate them
- Two main resolution strategies: chaining and open addressing
Collision Resolution by Chaining
All elements that hash to the same slot are placed into a linked list:
- Slot j contains a pointer to the head of the list of all elements with hash value j
- If no elements hash to j, the slot contains NIL
CHAINED-HASH-INSERT(T, x)
insert x at the head of list T[h(x.key)]
CHAINED-HASH-SEARCH(T, k)
search for an element with key k in list T[h(k)]
CHAINED-HASH-DELETE(T, x)
delete x from the list T[h(x.key)]
Running times:
| Operation | Time |
|---|---|
| Insert | O(1) worst-case (assumes element not already present) |
| Search | O(length of list) |
| Delete | O(1) with doubly-linked lists (takes element, not key) |
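A minimal chaining sketch in Python (class and method names are ours; Python lists stand in for the linked lists, so this `delete` takes a key and runs in O(chain length) rather than the O(1) of a doubly-linked list given a pointer to the element):

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # slot j holds the chain for hash value j

    def _h(self, k):
        return hash(k) % self.m               # placeholder hash function

    def insert(self, key, value):             # CHAINED-HASH-INSERT: prepend to the chain
        self.table[self._h(key)].insert(0, (key, value))

    def search(self, key):                    # CHAINED-HASH-SEARCH: scan the chain
        for k, v in self.table[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):                    # CHAINED-HASH-DELETE (by key, for simplicity)
        chain = self.table[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
```

Two keys that collide (e.g. 10 and 17 with m = 7) simply share a chain.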
Analysis of Hashing with Chaining
- Load factor: α = n/m (n = number of elements, m = number of slots)
- α is the average number of elements per chain
- Can be < 1, = 1, or > 1
- Worst case: all n keys hash to the same slot → Θ(n) search time
- Average case relies on the assumption of simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of other keys
Theorem 11.1 — An unsuccessful search takes average-case time Θ(1 + α) under simple uniform hashing.
- The search examines the entire list T[h(k)], which has expected length α
Theorem 11.2 — A successful search takes average-case time Θ(1 + α) under simple uniform hashing.
- Expected elements examined: 1 + α/2 − α/(2n)
Key insight: If n = O(m), then α = O(1), and all dictionary operations take O(1) time on average.
Hash Functions
A good hash function satisfies (approximately) the assumption of simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of where other keys have hashed.
Interpreting Keys as Natural Numbers
- Most hash functions assume keys are natural numbers
- Non-numeric keys are converted: e.g., a character string can be interpreted as an integer in a suitable radix
- Example: "pt" → (112 × 128) + 116 = 14452 (using ASCII values, radix 128)
The Division Method
The hash function is:
h(k) = k mod m
- Fast: requires only a single division
- Choosing m:
- Avoid powers of 2: if m = 2ᵖ, then h(k) is just the p lowest-order bits of k
- Avoid m = 2ᵖ − 1 when keys are character strings in radix 2ᵖ (permuting characters doesn't change hash value)
- A prime not too close to an exact power of 2 is often a good choice
- Example: for n ≈ 2000 with desired α ≈ 3, choose m = 701 (a prime near 2000/3 ≈ 667 and not close to a power of 2)
The Multiplication Method
The hash function is:
h(k) = ⌊m (kA mod 1)⌋
where "kA mod 1" means the fractional part of kA, and 0 < A < 1.
Steps:
- Multiply key k by constant A (0 < A < 1)
- Extract the fractional part of kA
- Multiply by m and take the floor
Advantages:
- The value of m is not critical — typically choose m = 2ᵖ
- Easy to implement on machines with word size w:
- Choose A of the form s/2ʷ, i.e., let s = ⌊A · 2ʷ⌋ (an integer with 0 < s < 2ʷ)
- Multiply k by s to get a 2w-bit result: r₁ · 2ʷ + r₀
- The desired hash value is the p most significant bits of r₀
Suggested constant (Knuth):
A = (√5 − 1)/2 ≈ 0.6180339887...
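The word-size implementation above can be sketched in Python (the function name and the choice w = 32 are our own; with Knuth's constant, s comes out to 2654435769):

```python
def multiplication_hash(k, p, w=32):
    """Multiplication-method hash for a table of size m = 2**p.

    Returns the p most significant bits of r0, the low-order w bits
    of the 2w-bit product k * s.
    """
    A = (5 ** 0.5 - 1) / 2        # Knuth's suggested constant (sqrt(5) - 1) / 2
    s = int(A * 2 ** w)           # s = floor(A * 2^w), an integer
    r0 = (k * s) % (2 ** w)       # low-order w bits of k * s
    return r0 >> (w - p)          # the p most significant bits of r0
```

For example, with k = 123456, p = 14 (so m = 16384), and w = 32, this yields h(k) = 67.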
Universal Hashing
Any fixed hash function is vulnerable to adversarial inputs that force all keys to the same slot → Θ(n) retrieval time.
Universal hashing selects the hash function randomly at runtime from a carefully designed class, guaranteeing good average-case performance for any input.
Definition: A finite collection H of hash functions mapping U → {0, 1, ..., m − 1} is universal if for each pair of distinct keys k, l ∈ U:
The number of hash functions h ∈ H for which h(k) = h(l) is at most |H|/m
Equivalently, the collision probability for any distinct pair is at most 1/m.
Theorem 11.3 — With universal hashing and chaining:
- If key k is not in the table: E[n_{h(k)}] ≤ α = n/m
- If key k is in the table: E[n_{h(k)}] ≤ 1 + α
Corollary 11.4 — Using universal hashing with chaining in a table with m slots, handling any sequence of n INSERT, SEARCH, DELETE operations (with O(m) inserts) takes expected Θ(n) total time.
Designing a Universal Class
Choose a prime p > |U| (every key k is in range 0 to p − 1). Define:
h_{ab}(k) = ((ak + b) mod p) mod m
where a ∈ {1, 2, ..., p − 1} and b ∈ {0, 1, ..., p − 1}.
The family:
H_{pm} = { h_{ab} : a ∈ Z*_p and b ∈ Z_p }
- Contains p(p − 1) hash functions
- Output range m is arbitrary (need not be prime)
Theorem 11.5 — The class H_{pm} is universal.
- Proof relies on the fact that distinct keys k, l map to distinct intermediate values r, s modulo p (no collisions at the "mod p level"), and the probability that r ≡ s (mod m) is at most 1/m.
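Drawing a random member of H_{pm} is straightforward; a minimal sketch (the function name is ours, and p must be a prime larger than any key):

```python
import random

def make_universal_hash(p, m):
    """Pick h_ab from H_pm: a uniform in {1, ..., p-1}, b uniform in {0, ..., p-1}."""
    a = random.randrange(1, p)    # a in Z*_p
    b = random.randrange(0, p)    # b in Z_p
    return lambda k: ((a * k + b) % p) % m
```

The function is chosen once per table, at table-creation time; the same key then always hashes to the same slot.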
Open Addressing
In open addressing, all elements are stored directly in the hash table — no linked lists or external storage. Each table entry contains either an element or NIL.
- The table can "fill up" — the load factor α can never exceed 1
- Avoids pointers entirely, freeing memory for more slots → potentially fewer collisions
Probe Sequences
The hash function is extended to include the probe number:
h : U × {0, 1, ..., m − 1} → {0, 1, ..., m − 1}
For every key k, the probe sequence ⟨h(k,0), h(k,1), ..., h(k,m−1)⟩ must be a permutation of ⟨0, 1, ..., m − 1⟩.
Insertion and Search
HASH-INSERT(T, k)
i = 0
repeat
j = h(k, i)
if T[j] == NIL
T[j] = k
return j
else i = i + 1
until i == m
error "hash table overflow"
HASH-SEARCH(T, k)
i = 0
repeat
j = h(k, i)
if T[j] == k
return j
i = i + 1
until T[j] == NIL or i == m
return NIL
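The two procedures translate directly to Python (function names are ours; `None` plays the role of NIL, and the probe function h(k, i) is passed in as a parameter):

```python
def hash_insert(T, k, h):
    """HASH-INSERT: probe until an empty slot is found, or report overflow."""
    m = len(T)
    for i in range(m):
        j = h(k, i)
        if T[j] is None:
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

def hash_search(T, k, h):
    """HASH-SEARCH: stop at the key, at an empty slot, or after m probes."""
    m = len(T)
    for i in range(m):
        j = h(k, i)
        if T[j] == k:
            return j
        if T[j] is None:      # k would have been inserted here, so it is absent
            return None
    return None
```

Any probe scheme fits this interface; for instance `h = lambda k, i: (k + i) % m` gives linear probing with h′(k) = k.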
Deletion
- Cannot simply store NIL in a deleted slot (would break search for keys whose insertion probed that slot)
- Solution: mark deleted slots with a special DELETED sentinel
- INSERT treats DELETED as empty
- SEARCH passes over DELETED values
- Drawback: search times no longer depend solely on α → chaining is preferred when deletions are frequent
Probing Techniques
Linear Probing
h(k, i) = (h′(k) + i) mod m
- h′ is an auxiliary hash function
- Only m distinct probe sequences (determined entirely by h′(k))
- Suffers from primary clustering: long runs of occupied slots build up, increasing average search time
- An empty slot preceded by i full slots gets filled next with probability (i + 1)/m
Quadratic Probing
h(k, i) = (h′(k) + c₁i + c₂i²) mod m
- c₁, c₂ are positive constants; values of c₁, c₂, and m must be chosen carefully to use the full table
- Only m distinct probe sequences
- Suffers from secondary clustering: if h(k₁, 0) = h(k₂, 0), then the entire probe sequences are identical
- Milder than primary clustering
Double Hashing
h(k, i) = (h₁(k) + i · h₂(k)) mod m
- Uses two auxiliary hash functions h₁ and h₂
- The probe sequence depends on the key in two ways (initial position and offset)
- Produces Θ(m²) distinct probe sequences — much closer to uniform hashing
- h₂(k) must be relatively prime to m for the full table to be searched
- If m is a power of 2: design h₂ to always produce an odd number
- If m is prime: design h₂ to return a positive integer less than m
- Common choice: m prime, h₁(k) = k mod m, h₂(k) = 1 + (k mod m′) where m′ = m − 1
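The common choice above can be sketched as a probe function (the function name is ours; m is assumed prime, so h₂(k) ∈ {1, ..., m − 1} is relatively prime to m and the probe sequence visits every slot):

```python
def double_hash_probe(k, i, m):
    """h(k, i) = (h1(k) + i * h2(k)) mod m with h1(k) = k mod m,
    h2(k) = 1 + (k mod (m - 1)); assumes m is prime."""
    h1 = k % m
    h2 = 1 + (k % (m - 1))
    return (h1 + i * h2) % m
```

Because gcd(h₂(k), m) = 1, the probes for a fixed key form a permutation of the slots, as required.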
Uniform Hashing Assumption
The probe sequence of each key is equally likely to be any of the m! permutations of ⟨0, 1, ..., m − 1⟩. This is an idealization — none of the three techniques above achieves it (they produce at most m² sequences vs. m! required).
Analysis of Open-Address Hashing
All results assume uniform hashing with load factor α = n/m < 1.
Theorem 11.6 — Expected probes in an unsuccessful search: at most 1/(1 − α)
| α (load factor) | Expected probes (unsuccessful) |
|---|---|
| 0.50 | 2 |
| 0.75 | 4 |
| 0.90 | 10 |
Corollary 11.7 — Insertion requires at most 1/(1 − α) probes on average (insertion = unsuccessful search + placing the key).
Theorem 11.8 — Expected probes in a successful search: at most (1/α) ln(1/(1 − α))
| α (load factor) | Expected probes (successful) |
|---|---|
| 0.50 | < 1.387 |
| 0.75 | < 1.848 |
| 0.90 | < 2.559 |
Perfect Hashing
Perfect hashing provides O(1) worst-case search time when the set of keys is static (never changes once stored). Examples: reserved words in a programming language, file names on a CD-ROM.
Two-Level Hashing Scheme
Uses universal hashing at both levels:
- First level: hash n keys into m = n slots using a hash function h from a universal family
- Second level: for each slot j, use a secondary hash table S_j with its own hash function h_j
- The size of S_j is m_j = n_j² (square of the number of keys hashing to slot j)
- By choosing h_j carefully, we guarantee no collisions at the secondary level
No Collisions at the Secondary Level
Theorem 11.9 — If we store n keys in a table of size m = n² using a randomly chosen universal hash function, the probability of any collision is less than 1/2.
- Expected number of collisions: C(n,2) · 1/n² = (n² − n)/(2n²) < 1/2
- By Markov's inequality: Pr{collisions ≥ 1} < 1/2
- Therefore, a collision-free hash function can be found with a few random trials
Since m = n² is too much space for the outer table, we use it only at the secondary level (where n_j is small for each slot j).
Space Analysis
Theorem 11.10 — With m = n and a universal hash function at the first level:
E[Σ n_j²] < 2n
Corollary 11.11 — The expected total storage for all secondary hash tables is less than 2n → overall space is O(n).
Corollary 11.12 — The probability that total secondary storage ≥ 4n is less than 1/2.
- By testing a few random first-level hash functions, we quickly find one with reasonable total storage
Summary of Perfect Hashing
- Static key set only
- Two-level universal hashing
- O(1) worst-case search time
- O(n) expected space
- Both levels use hash functions from the universal class H_{pm}: h_{ab}(k) = ((ak + b) mod p) mod m