Hash Tables
*Introduction to Algorithms*, Chapter 11: Hash Tables
Many applications require a dynamic set supporting only INSERT, SEARCH, and DELETE — the dictionary operations. A hash table is an effective data structure for implementing dictionaries. Under reasonable assumptions, the average time to search for an element in a hash table is O(1).
A hash table generalizes the simpler notion of an ordinary array. Instead of using the key as an array index directly, the array index is computed from the key using a hash function. This reduces the required array size from |U| (universe size) to m (proportional to the number of keys actually stored).
Direct-Address Tables
Direct addressing works well when the universe U of keys is reasonably small.
- Suppose each element has a key drawn from U = {0, 1, ..., m − 1}, where m is not too large
- No two elements have the same key
- We use an array (the direct-address table) T[0 .. m − 1], where each slot corresponds to a key in U
- Slot k points to the element with key k, or contains NIL if no such element exists
Operations
All operations run in O(1) time:
DIRECT-ADDRESS-SEARCH(T, k)
return T[k]
DIRECT-ADDRESS-INSERT(T, x)
T[x.key] = x
DIRECT-ADDRESS-DELETE(T, x)
T[x.key] = NIL
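The three operations above can be sketched in Python (the class and `Element` tuple are our own names, not from the text; `None` plays the role of NIL):

```python
from collections import namedtuple

# Hypothetical element type: a key in {0, ..., m-1} plus satellite data
Element = namedtuple("Element", ["key", "value"])

class DirectAddressTable:
    def __init__(self, m):
        self.slots = [None] * m      # one slot per key in U = {0, ..., m-1}

    def search(self, k):             # DIRECT-ADDRESS-SEARCH: return T[k]
        return self.slots[k]

    def insert(self, x):             # DIRECT-ADDRESS-INSERT: T[x.key] = x
        self.slots[x.key] = x

    def delete(self, x):             # DIRECT-ADDRESS-DELETE: T[x.key] = NIL
        self.slots[x.key] = None
```

Each operation is a single array access, hence O(1).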
Space Optimization
- The table itself can hold elements directly (instead of pointers to external objects), saving space
- If stored in the slot, the key can often be omitted (the index implies the key)
- A special key or flag is needed to indicate empty slots
Hash Tables
Motivation
- If U is large, storing a table of size |U| may be impractical or impossible
- If the set K of actual keys is much smaller than U, most space in a direct-address table is wasted
- A hash table reduces storage to Θ(|K|) while maintaining O(1) average-case search time
Hash Functions and Hashing
- With hashing, an element with key k is stored in slot h(k), where h is a hash function:
- h : U → {0, 1, ..., m − 1}
- The size m of the hash table is typically much less than |U|
- h(k) is called the hash value of key k
Collisions
- A collision occurs when two keys hash to the same slot: h(k₁) = h(k₂)
- Since |U| > m, at least two keys must map to the same slot (pigeonhole principle), so collisions are unavoidable
- A well-designed hash function minimizes collisions but cannot eliminate them
- Two main resolution strategies: chaining and open addressing
Collision Resolution by Chaining
All elements that hash to the same slot are placed into a linked list:
- Slot j contains a pointer to the head of the list of all elements with hash value j
- If no elements hash to j, the slot contains NIL
CHAINED-HASH-INSERT(T, x)
insert x at the head of list T[h(x.key)]
CHAINED-HASH-SEARCH(T, k)
search for an element with key k in list T[h(k)]
CHAINED-HASH-DELETE(T, x)
delete x from the list T[h(x.key)]
Running times:
| Operation | Time |
|---|---|
| Insert | O(1) worst-case (assumes element not already present) |
| Search | O(length of list) |
| Delete | O(1) with doubly-linked lists (takes element, not key) |
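A minimal chaining sketch in Python (class and method names are ours; Python lists stand in for the linked lists, so this `delete` takes a key and runs in O(chain length) rather than the O(1) of a doubly-linked list given a pointer to the element):

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # slot j holds the chain for hash value j

    def _h(self, k):
        return hash(k) % self.m               # placeholder hash function

    def insert(self, key, value):             # CHAINED-HASH-INSERT: prepend to the chain
        self.table[self._h(key)].insert(0, (key, value))

    def search(self, key):                    # CHAINED-HASH-SEARCH: scan the chain
        for k, v in self.table[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):                    # CHAINED-HASH-DELETE (by key, for simplicity)
        chain = self.table[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
```

Two keys that collide (e.g. 10 and 17 with m = 7) simply share a chain.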
Analysis of Hashing with Chaining
- Load factor: α = n/m (n = number of elements, m = number of slots)
- α is the average number of elements per chain
- Can be < 1, = 1, or > 1
- Worst case: all n keys hash to the same slot → Θ(n) search time
- Average case relies on the assumption of simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of other keys
Theorem 11.1 — An unsuccessful search takes average-case time Θ(1 + α) under simple uniform hashing.
- The search examines the entire list T[h(k)], which has expected length α
Theorem 11.2 — A successful search takes average-case time Θ(1 + α) under simple uniform hashing.
- Expected elements examined: 1 + α/2 − α/(2n)
Key insight: If n = O(m), then α = O(1), and all dictionary operations take O(1) time on average.
Hash Functions
A good hash function satisfies (approximately) the assumption of simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of where other keys have hashed.
Interpreting Keys as Natural Numbers
- Most hash functions assume keys are natural numbers
- Non-numeric keys are converted: e.g., a character string can be interpreted as an integer in a suitable radix
- Example: "pt" → (112 × 128) + 116 = 14452 (using ASCII values, radix 128)
The Division Method
The hash function is:
h(k) = k mod m
- Fast: requires only a single division
- Choosing m:
- Avoid powers of 2: if m = 2ᵖ, then h(k) is just the p lowest-order bits of k
- Avoid m = 2ᵖ − 1 when keys are character strings in radix 2ᵖ (permuting characters doesn't change hash value)
- A prime not too close to an exact power of 2 is often a good choice
- Example: for n ≈ 2000 with desired α ≈ 3, choose m = 701 (a prime near 2000/3 ≈ 667 and not close to a power of 2)
The Multiplication Method
The hash function is:
h(k) = ⌊m (kA mod 1)⌋
where "kA mod 1" means the fractional part of kA, and 0 < A < 1.
Steps:
- Multiply key k by constant A (0 < A < 1)
- Extract the fractional part of kA
- Multiply by m and take the floor
Advantages:
- The value of m is not critical — typically choose m = 2ᵖ
- Easy to implement on machines with word size w:
- Choose A of the form s/2ʷ, i.e., let s = ⌊A · 2ʷ⌋ (an integer with 0 < s < 2ʷ)
- Multiply k by s to get a 2w-bit result: r₁ · 2ʷ + r₀
- The desired hash value is the p most significant bits of r₀
Suggested constant (Knuth):
A = (√5 − 1)/2 ≈ 0.6180339887...
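The word-size implementation above can be sketched in Python (the function name and the choice w = 32 are our own; with Knuth's constant, s comes out to 2654435769):

```python
def multiplication_hash(k, p, w=32):
    """Multiplication-method hash for a table of size m = 2**p.

    Returns the p most significant bits of r0, the low-order w bits
    of the 2w-bit product k * s.
    """
    A = (5 ** 0.5 - 1) / 2        # Knuth's suggested constant (sqrt(5) - 1) / 2
    s = int(A * 2 ** w)           # s = floor(A * 2^w), an integer
    r0 = (k * s) % (2 ** w)       # low-order w bits of k * s
    return r0 >> (w - p)          # the p most significant bits of r0
```

For example, with k = 123456, p = 14 (so m = 16384), and w = 32, this yields h(k) = 67.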
Universal Hashing
Any fixed hash function is vulnerable to adversarial inputs that force all keys to the same slot → Θ(n) retrieval time.
Universal hashing selects the hash function randomly at runtime from a carefully designed class, guaranteeing good average-case performance for any input.
Definition: A finite collection H of hash functions mapping U → {0, 1, ..., m − 1} is universal if for each pair of distinct keys k, l ∈ U:
The number of hash functions h ∈ H for which h(k) = h(l) is at most |H|/m
Equivalently, the collision probability for any distinct pair is at most 1/m.
Theorem 11.3 — With universal hashing and chaining:
- If key k is not in the table: E[n_{h(k)}] ≤ α = n/m
- If key k is in the table: E[n_{h(k)}] ≤ 1 + α
Corollary 11.4 — Using universal hashing with chaining in a table with m slots, handling any sequence of n INSERT, SEARCH, DELETE operations (with O(m) inserts) takes expected Θ(n) total time.
Designing a Universal Class
Choose a prime p > |U| (every key k is in range 0 to p − 1). Define:
h_{ab}(k) = ((ak + b) mod p) mod m
where a ∈ {1, 2, ..., p − 1} and b ∈ {0, 1, ..., p − 1}.
The family:
H_{pm} = { h_{ab} : a ∈ Z*_p and b ∈ Z_p }
- Contains p(p − 1) hash functions
- Output range m is arbitrary (need not be prime)
Theorem 11.5 — The class H_{pm} is universal.
- Proof relies on the fact that distinct keys k, l map to distinct intermediate values r, s modulo p (no collisions at the "mod p level"), and the probability that r ≡ s (mod m) is at most 1/m.
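Drawing a random member of H_{pm} is straightforward; a minimal sketch (the function name is ours, and p must be a prime larger than any key):

```python
import random

def make_universal_hash(p, m):
    """Pick h_ab from H_pm: a uniform in {1, ..., p-1}, b uniform in {0, ..., p-1}."""
    a = random.randrange(1, p)    # a in Z*_p
    b = random.randrange(0, p)    # b in Z_p
    return lambda k: ((a * k + b) % p) % m
```

The function is chosen once per table, at table-creation time; the same key then always hashes to the same slot.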
Open Addressing
In open addressing, all elements are stored directly in the hash table — no linked lists or external storage. Each table entry contains either an element or NIL.
- The table can "fill up" — the load factor α can never exceed 1
- Avoids pointers entirely, freeing memory for more slots → potentially fewer collisions
Probe Sequences
The hash function is extended to include the probe number:
h : U × {0, 1, ..., m − 1} → {0, 1, ..., m − 1}
For every key k, the probe sequence ⟨h(k,0), h(k,1), ..., h(k,m−1)⟩ must be a permutation of ⟨0, 1, ..., m − 1⟩.
Insertion and Search
HASH-INSERT(T, k)
i = 0
repeat
j = h(k, i)
if T[j] == NIL
T[j] = k
return j
else i = i + 1
until i == m
error "hash table overflow"
HASH-SEARCH(T, k)
i = 0
repeat
j = h(k, i)
if T[j] == k
return j
i = i + 1
until T[j] == NIL or i == m
return NIL
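The two procedures translate directly to Python (function names are ours; `None` plays the role of NIL, and the probe function h(k, i) is passed in as a parameter):

```python
def hash_insert(T, k, h):
    """HASH-INSERT: probe until an empty slot is found, or report overflow."""
    m = len(T)
    for i in range(m):
        j = h(k, i)
        if T[j] is None:
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

def hash_search(T, k, h):
    """HASH-SEARCH: stop at the key, at an empty slot, or after m probes."""
    m = len(T)
    for i in range(m):
        j = h(k, i)
        if T[j] == k:
            return j
        if T[j] is None:      # k would have been inserted here, so it is absent
            return None
    return None
```

Any probe scheme fits this interface; for instance `h = lambda k, i: (k + i) % m` gives linear probing with h′(k) = k.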
Deletion
- Cannot simply store NIL in a deleted slot (would break search for keys whose insertion probed that slot)
- Solution: mark deleted slots with a special DELETED sentinel
- INSERT treats DELETED as empty
- SEARCH passes over DELETED values
- Drawback: search times no longer depend solely on α → chaining is preferred when deletions are frequent
Probing Techniques
Linear Probing
h(k, i) = (h′(k) + i) mod m
- h′ is an auxiliary hash function
- Only m distinct probe sequences (determined entirely by h′(k))
- Suffers from primary clustering: long runs of occupied slots build up, increasing average search time
- An empty slot preceded by i full slots gets filled next with probability (i + 1)/m
Quadratic Probing
h(k, i) = (h′(k) + c₁i + c₂i²) mod m
- c₁, c₂ are positive constants; values of c₁, c₂, and m must be chosen carefully to use the full table
- Only m distinct probe sequences
- Suffers from secondary clustering: if h(k₁, 0) = h(k₂, 0), then the entire probe sequences are identical
- Milder than primary clustering
Double Hashing
h(k, i) = (h₁(k) + i · h₂(k)) mod m
- Uses two auxiliary hash functions h₁ and h₂
- The probe sequence depends on the key in two ways (initial position and offset)
- Produces Θ(m²) distinct probe sequences — much closer to uniform hashing
- h₂(k) must be relatively prime to m for the full table to be searched
- If m is a power of 2: design h₂ to always produce an odd number
- If m is prime: design h₂ to return a positive integer less than m
- Common choice: m prime, h₁(k) = k mod m, h₂(k) = 1 + (k mod m′) where m′ = m − 1
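The common choice above can be sketched as a probe function (the function name is ours; m is assumed prime, so h₂(k) ∈ {1, ..., m − 1} is relatively prime to m and the probe sequence visits every slot):

```python
def double_hash_probe(k, i, m):
    """h(k, i) = (h1(k) + i * h2(k)) mod m with h1(k) = k mod m,
    h2(k) = 1 + (k mod (m - 1)); assumes m is prime."""
    h1 = k % m
    h2 = 1 + (k % (m - 1))
    return (h1 + i * h2) % m
```

Because gcd(h₂(k), m) = 1, the probes for a fixed key form a permutation of the slots, as required.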
Uniform Hashing Assumption
The probe sequence of each key is equally likely to be any of the m! permutations of ⟨0, 1, ..., m − 1⟩. This is an idealization — none of the three techniques above achieves it (they produce at most m² sequences vs. m! required).
Analysis of Open-Address Hashing
All results assume uniform hashing with load factor α = n/m < 1.
Theorem 11.6 — Expected probes in an unsuccessful search: at most 1/(1 − α)
| α (load factor) | Expected probes (unsuccessful) |
|---|---|
| 0.50 | 2 |
| 0.75 | 4 |
| 0.90 | 10 |
Corollary 11.7 — Insertion requires at most 1/(1 − α) probes on average (insertion = unsuccessful search + placing the key).
Theorem 11.8 — Expected probes in a successful search: at most (1/α) ln(1/(1 − α))
| α (load factor) | Expected probes (successful) |
|---|---|
| 0.50 | < 1.387 |
| 0.75 | < 1.848 |
| 0.90 | < 2.559 |
Perfect Hashing
Perfect hashing provides O(1) worst-case search time when the set of keys is static (never changes once stored). Examples: reserved words in a programming language, file names on a CD-ROM.
Two-Level Hashing Scheme
Uses universal hashing at both levels:
- First level: hash n keys into m = n slots using a hash function h from a universal family
- Second level: for each slot j, use a secondary hash table S_j with its own hash function h_j
- The size of S_j is m_j = n_j² (square of the number of keys hashing to slot j)
- By choosing h_j carefully, we guarantee no collisions at the secondary level
No Collisions at the Secondary Level
Theorem 11.9 — If we store n keys in a table of size m = n² using a randomly chosen universal hash function, the probability of any collision is less than 1/2.
- Expected number of collisions: C(n,2) · 1/n² = (n² − n)/(2n²) < 1/2
- By Markov's inequality: Pr{collisions ≥ 1} < 1/2
- Therefore, a collision-free hash function can be found with a few random trials
Since m = n² is too much space for the outer table, we use it only at the secondary level (where n_j is small for each slot j).
Space Analysis
Theorem 11.10 — With m = n and a universal hash function at the first level:
E[Σ n_j²] < 2n
Corollary 11.11 — The expected total storage for all secondary hash tables is less than 2n → overall space is O(n).
Corollary 11.12 — The probability that total secondary storage ≥ 4n is less than 1/2.
- By testing a few random first-level hash functions, we quickly find one with reasonable total storage
Summary of Perfect Hashing
- Static key set only
- Two-level universal hashing
- O(1) worst-case search time
- O(n) expected space
- Both levels use hash functions from the universal class H_{pm}: h_{ab}(k) = ((ak + b) mod p) mod m