Hash Tables

Introduction to Algorithms

Chapter 11: Hash Tables

Many applications require a dynamic set supporting only INSERT, SEARCH, and DELETE — the dictionary operations. A hash table is an effective data structure for implementing dictionaries. Under reasonable assumptions, the average time to search for an element in a hash table is O(1).

A hash table generalizes the simpler notion of an ordinary array. Instead of using the key as an array index directly, the array index is computed from the key using a hash function. This reduces the required array size from |U| (universe size) to m (proportional to the number of keys actually stored).


Direct-Address Tables

Direct addressing works well when the universe U of keys is reasonably small.

  • Suppose each element has a key drawn from U = {0, 1, ..., m − 1}, where m is not too large
  • No two elements have the same key
  • We use an array (the direct-address table) T[0 .. m − 1], where each slot corresponds to a key in U
  • Slot k points to the element with key k, or contains NIL if no such element exists

Operations

All operations run in O(1) time:

DIRECT-ADDRESS-SEARCH(T, k)
    return T[k]

DIRECT-ADDRESS-INSERT(T, x)
    T[x.key] = x

DIRECT-ADDRESS-DELETE(T, x)
    T[x.key] = NIL
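The three operations can be sketched in Python (a minimal version; the `Element` class and the table size are illustrative):

```python
# Direct-address table: a list of size m, one slot per possible key in U.
class DirectAddressTable:
    def __init__(self, m):
        self.table = [None] * m   # NIL is represented by None

    def search(self, k):          # O(1)
        return self.table[k]

    def insert(self, x):          # O(1)
        self.table[x.key] = x

    def delete(self, x):          # O(1)
        self.table[x.key] = None


class Element:
    def __init__(self, key, value):
        self.key, self.value = key, value


t = DirectAddressTable(10)
t.insert(Element(3, "three"))
print(t.search(3).value)   # -> three
print(t.search(7))         # -> None
```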

Space Optimization

  • The table itself can hold elements directly (instead of pointers to external objects), saving space
  • If stored in the slot, the key can often be omitted (the index implies the key)
  • A special key or flag is needed to indicate empty slots

Hash Tables

Motivation

  • If U is large, storing a table of size |U| may be impractical or impossible
  • If the set K of actual keys is much smaller than U, most space in a direct-address table is wasted
  • A hash table reduces storage to Θ(|K|) while maintaining O(1) average-case search time

Hash Functions and Hashing

  • With hashing, an element with key k is stored in slot h(k), where h is a hash function:
    • h : U → {0, 1, ..., m − 1}
  • The size m of the hash table is typically much less than |U|
  • h(k) is called the hash value of key k

Collisions

  • A collision occurs when two keys hash to the same slot: h(k₁) = h(k₂)
  • Since |U| > m, at least two keys in U must hash to the same slot, so collisions cannot be avoided altogether
  • A well-designed hash function minimizes collisions but cannot eliminate them
  • Two main resolution strategies: chaining and open addressing

Collision Resolution by Chaining

All elements that hash to the same slot are placed into a linked list:

  • Slot j contains a pointer to the head of the list of all elements with hash value j
  • If no elements hash to j, the slot contains NIL

CHAINED-HASH-INSERT(T, x)
    insert x at the head of list T[h(x.key)]

CHAINED-HASH-SEARCH(T, k)
    search for an element with key k in list T[h(k)]

CHAINED-HASH-DELETE(T, x)
    delete x from the list T[h(x.key)]
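A minimal Python sketch of chaining, using ordinary Python lists as the chains and the division method as an illustrative hash function. (The O(1) DELETE in the table below assumes a doubly linked list and a pointer to the element; this list-based version scans the chain instead.)

```python
# Hashing with chaining: each slot holds a Python list acting as the chain.
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _h(self, k):               # division-method hash (illustrative choice)
        return k % self.m

    def insert(self, key, value):  # O(1): insert at the head of the chain
        self.slots[self._h(key)].insert(0, (key, value))

    def search(self, key):         # O(length of chain)
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):         # O(length of chain) without a node pointer
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return


t = ChainedHashTable(7)
t.insert(10, "a")
t.insert(17, "b")   # 10 and 17 collide: both hash to slot 3
```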

Running times:

  • INSERT: O(1) worst case (assumes the element is not already present)
  • SEARCH: O(length of the list)
  • DELETE: O(1) with doubly linked lists (takes an element, not a key)

Analysis of Hashing with Chaining

  • Load factor: α = n/m (n = number of elements, m = number of slots)
    • α is the average number of elements per chain
    • Can be < 1, = 1, or > 1
  • Worst case: all n keys hash to the same slot → Θ(n) search time
  • Average case relies on the assumption of simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of other keys

Theorem 11.1 — An unsuccessful search takes average-case time Θ(1 + α) under simple uniform hashing.

  • The search examines the entire list T[h(k)], which has expected length α

Theorem 11.2 — A successful search takes average-case time Θ(1 + α) under simple uniform hashing.

  • Expected elements examined: 1 + α/2 − α/(2n)

Key insight: If n = O(m), then α = O(1), and all dictionary operations take O(1) time on average.


Hash Functions

A good hash function satisfies (approximately) the assumption of simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of where other keys have hashed.

Interpreting Keys as Natural Numbers

  • Most hash functions assume keys are natural numbers
  • Non-numeric keys are converted: e.g., a character string can be interpreted as an integer in a suitable radix
    • Example: "pt" → (112 × 128) + 116 = 14452 (using ASCII values, radix 128)
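The radix conversion can be written directly as Horner's rule over the character codes (radix 128 as in the example):

```python
# Interpret a character string as a radix-128 integer (ASCII codes as digits).
def string_to_key(s, radix=128):
    k = 0
    for ch in s:
        k = k * radix + ord(ch)
    return k

print(string_to_key("pt"))   # 112 * 128 + 116 = 14452
```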

The Division Method

The hash function is:

h(k) = k mod m

  • Fast: requires only a single division
  • Choosing m:
    • Avoid powers of 2: if m = 2ᵖ, then h(k) is just the p lowest-order bits of k
    • Avoid m = 2ᵖ − 1 when keys are character strings in radix 2ᵖ (permuting characters doesn't change hash value)
    • A prime not too close to an exact power of 2 is often a good choice
    • Example: for n ≈ 2000 with desired α ≈ 3, choose m = 701 (prime near 2000/3)
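The method is a one-liner; here it is with the m = 701 example from above:

```python
# Division method: h(k) = k mod m.
def division_hash(k, m):
    return k % m

m = 701                            # a prime not too close to a power of 2
print(division_hash(14452, m))     # -> 432
```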

The Multiplication Method

The hash function is:

h(k) = ⌊m (kA mod 1)⌋

where "kA mod 1" means the fractional part of kA, and 0 < A < 1.

Steps:

  1. Multiply key k by constant A (0 < A < 1)
  2. Extract the fractional part of kA
  3. Multiply by m and take the floor

Advantages:

  • The value of m is not critical — typically choose m = 2ᵖ
  • Easy to implement on machines with word size w:
    • Let s = ⌊A · 2ʷ⌋ (an integer with 0 < s < 2ʷ)
    • Multiply k by s to get a 2w-bit result: r₁ · 2ʷ + r₀
    • The desired hash value is the p most significant bits of r₀

Suggested constant (Knuth):

A ≈ (√5 − 1)/2 = 0.6180339887...
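The word-size implementation above can be sketched in Python (the choices of w, p, and the test keys are illustrative):

```python
import math

w, p = 32, 14                    # word size and output bits (m = 2**p = 16384)
m = 1 << p
A = (math.sqrt(5) - 1) / 2       # Knuth's suggested constant
s = math.floor(A * (1 << w))     # s = floor(A * 2^w), an integer below 2^w

def mult_hash(k):
    r0 = (k * s) & ((1 << w) - 1)    # low-order w bits of the 2w-bit product k*s
    return r0 >> (w - p)             # the p most significant bits of r0

assert all(0 <= mult_hash(k) < m for k in (0, 1, 123456))
```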

Universal Hashing

Any fixed hash function is vulnerable to adversarial inputs that force all keys to the same slot → Θ(n) retrieval time.

Universal hashing selects the hash function randomly at runtime from a carefully designed class, guaranteeing good average-case performance for any input.

Definition: A finite collection H of hash functions mapping U → {0, 1, ..., m − 1} is universal if for each pair of distinct keys k, l ∈ U:

The number of hash functions h ∈ H for which h(k) = h(l) is at most |H|/m

Equivalently, the collision probability for any distinct pair is at most 1/m.

Theorem 11.3 — With universal hashing and chaining:

  • If key k is not in the table: E[n_{h(k)}] ≤ α = n/m
  • If key k is in the table: E[n_{h(k)}] ≤ 1 + α

Corollary 11.4 — Using universal hashing with chaining in a table with m slots, handling any sequence of n INSERT, SEARCH, DELETE operations (with O(m) inserts) takes expected Θ(n) total time.

Designing a Universal Class

Choose a prime p > |U| (every key k is in range 0 to p − 1). Define:

h_{ab}(k) = ((ak + b) mod p) mod m

where a ∈ {1, 2, ..., p − 1} and b ∈ {0, 1, ..., p − 1}.

The family:

H_{pm} = { h_{ab} : a ∈ Z*_p and b ∈ Z_p }

  • Contains p(p − 1) hash functions
  • Output range m is arbitrary (need not be prime)
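Drawing a random member of this family is straightforward; a Python sketch (the prime p and the table size m here are illustrative choices):

```python
import random

# The universal class H_pm: h_ab(k) = ((a*k + b) mod p) mod m.
p = 2_147_483_647        # a prime > |U| (here the Mersenne prime 2^31 - 1)
m = 701

def random_universal_hash():
    a = random.randrange(1, p)    # a in {1, ..., p - 1}
    b = random.randrange(0, p)    # b in {0, ..., p - 1}
    return lambda k: ((a * k + b) % p) % m

h = random_universal_hash()
assert 0 <= h(14452) < m
```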

Theorem 11.5 — The class H_{pm} is universal.

  • Proof relies on the fact that distinct keys k, l map to distinct intermediate values r, s modulo p (no collisions at the "mod p level"), and the probability that r ≡ s (mod m) is at most 1/m.

Open Addressing

In open addressing, all elements are stored directly in the hash table — no linked lists or external storage. Each table entry contains either an element or NIL.

  • The table can "fill up" — the load factor α can never exceed 1
  • Avoids pointers entirely, freeing memory for more slots → potentially fewer collisions

Probe Sequences

The hash function is extended to include the probe number:

h : U × {0, 1, ..., m − 1} → {0, 1, ..., m − 1}

For every key k, the probe sequence ⟨h(k,0), h(k,1), ..., h(k,m−1)⟩ must be a permutation of ⟨0, 1, ..., m − 1⟩.

HASH-INSERT(T, k)
    i = 0
    repeat
        j = h(k, i)
        if T[j] == NIL
            T[j] = k
            return j
        else i = i + 1
    until i == m
    error "hash table overflow"

HASH-SEARCH(T, k)
    i = 0
    repeat
        j = h(k, i)
        if T[j] == k
            return j
        i = i + 1
    until T[j] == NIL or i == m
    return NIL
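The two procedures can be sketched in Python, using a simple linear probe sequence h(k, i) = (k mod m + i) mod m as an illustrative choice:

```python
# Open addressing: all keys live directly in the table; None plays the role of NIL.
class OpenAddressTable:
    def __init__(self, m):
        self.m = m
        self.table = [None] * m

    def _h(self, k, i):            # linear probing (illustrative probe sequence)
        return (k % self.m + i) % self.m

    def insert(self, k):           # HASH-INSERT
        for i in range(self.m):
            j = self._h(k, i)
            if self.table[j] is None:
                self.table[j] = k
                return j
        raise OverflowError("hash table overflow")

    def search(self, k):           # HASH-SEARCH
        for i in range(self.m):
            j = self._h(k, i)
            if self.table[j] == k:
                return j
            if self.table[j] is None:
                return None
        return None


t = OpenAddressTable(7)
t.insert(10)
t.insert(17)    # collides with 10 at slot 3, probes on to slot 4
```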

Deletion

  • Cannot simply store NIL in a deleted slot (would break search for keys whose insertion probed that slot)
  • Solution: mark deleted slots with a special DELETED sentinel
    • INSERT treats DELETED as empty
    • SEARCH passes over DELETED values
  • Drawback: search times no longer depend solely on α → chaining is preferred when deletions are frequent

Probing Techniques

Linear Probing

h(k, i) = (h′(k) + i) mod m

  • h′ is an auxiliary hash function
  • Only m distinct probe sequences (determined entirely by h′(k))
  • Suffers from primary clustering: long runs of occupied slots build up, increasing average search time
    • An empty slot preceded by i full slots gets filled next with probability (i + 1)/m

Quadratic Probing

h(k, i) = (h′(k) + c₁i + c₂i²) mod m

  • c₁, c₂ are positive constants; values of c₁, c₂, and m must be chosen carefully to use the full table
  • Only m distinct probe sequences
  • Suffers from secondary clustering: if h(k₁, 0) = h(k₂, 0), then the entire probe sequences are identical
  • Milder than primary clustering

Double Hashing

h(k, i) = (h₁(k) + i · h₂(k)) mod m

  • Uses two auxiliary hash functions h₁ and h₂
  • The probe sequence depends on the key in two ways (initial position and offset)
  • Produces Θ(m²) distinct probe sequences — much closer to uniform hashing
  • h₂(k) must be relatively prime to m for the full table to be searched
    • If m is a power of 2: design h₂ to always produce an odd number
    • If m is prime: design h₂ to return a positive integer less than m
  • Common choice: m prime, h₁(k) = k mod m, h₂(k) = 1 + (k mod m′) where m′ = m − 1
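With the common choice above, every probe sequence is a full permutation of the slots, because h₂(k) and the prime m are relatively prime. A small Python check (m = 13 and the keys are illustrative):

```python
# Double hashing with m prime: h1(k) = k mod m, h2(k) = 1 + (k mod (m - 1)).
m = 13

def probe_sequence(k):
    h1, h2 = k % m, 1 + (k % (m - 1))
    return [(h1 + i * h2) % m for i in range(m)]

seq = probe_sequence(123456)
assert sorted(seq) == list(range(m))   # a full permutation of slots 0..m-1
```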

Uniform Hashing Assumption

The probe sequence of each key is equally likely to be any of the m! permutations of ⟨0, 1, ..., m − 1⟩. This is an idealization — none of the three techniques above achieves it (they produce at most m² sequences vs. m! required).

Analysis of Open-Address Hashing

All results assume uniform hashing with load factor α = n/m < 1.

Theorem 11.6 — Expected probes in an unsuccessful search: at most 1/(1 − α)

  • α = 0.50: at most 2 probes
  • α = 0.75: at most 4 probes
  • α = 0.90: at most 10 probes

Corollary 11.7 — Insertion requires at most 1/(1 − α) probes on average (an insertion is an unsuccessful search followed by placing the key).

Theorem 11.8 — Expected probes in a successful search: at most (1/α) ln(1/(1 − α))

  • α = 0.50: fewer than 1.387 probes
  • α = 0.75: fewer than 1.848 probes
  • α = 0.90: fewer than 2.559 probes
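Both bounds are easy to evaluate directly; the tables above come from these two formulas:

```python
import math

def expected_unsuccessful(alpha):
    """Theorem 11.6: at most 1/(1 - alpha) probes."""
    return 1 / (1 - alpha)

def expected_successful(alpha):
    """Theorem 11.8: at most (1/alpha) * ln(1/(1 - alpha)) probes."""
    return (1 / alpha) * math.log(1 / (1 - alpha))

# Reproduces the load factors used above: alpha = 0.50, 0.75, 0.90
for a in (0.50, 0.75, 0.90):
    print(a, expected_unsuccessful(a), expected_successful(a))
```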

Perfect Hashing

Perfect hashing provides O(1) worst-case search time when the set of keys is static (never changes once stored). Examples: reserved words in a programming language, file names on a CD-ROM.

Two-Level Hashing Scheme

Uses universal hashing at both levels:

  1. First level: hash n keys into m = n slots using a hash function h from a universal family
  2. Second level: for each slot j, use a secondary hash table S_j with its own hash function h_j
    • The size of S_j is m_j = n_j² (square of the number of keys hashing to slot j)
    • By choosing h_j carefully, we guarantee no collisions at the secondary level

No Collisions at the Secondary Level

Theorem 11.9 — If we store n keys in a table of size m = n² using a randomly chosen universal hash function, the probability of any collision is less than 1/2.

  • Expected number of collisions: C(n,2) · 1/n² = (n² − n)/(2n²) < 1/2
  • By Markov's inequality: Pr{collisions ≥ 1} < 1/2
  • Therefore, a collision-free hash function can be found with a few random trials

Since m = n² is too much space for the outer table, we use it only at the secondary level (where n_j is small for each slot j).
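The whole two-level construction can be sketched in Python. This is a minimal sketch, not CLRS's exact pseudocode: the prime p and the key set are illustrative, and `pick_hash` draws h_ab from the universal class defined earlier.

```python
import random

p = 2_147_483_647   # a prime > |U| (illustrative choice)

def pick_hash(m):
    """Draw h_ab(k) = ((a*k + b) mod p) mod m from the universal class H_pm."""
    a, b = random.randrange(1, p), random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

def build_perfect(keys):
    n = len(keys)
    h = pick_hash(n)                      # first level: m = n slots
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h(k)].append(k)
    secondary = []
    for bucket in buckets:
        mj = len(bucket) ** 2             # second level: m_j = n_j^2 slots
        while True:                       # retry until collision-free (< 2 tries expected)
            hj = pick_hash(mj) if mj else None
            slots = [None] * mj
            ok = True
            for k in bucket:
                j = hj(k)
                if slots[j] is not None:  # secondary collision: redraw hj
                    ok = False
                    break
                slots[j] = k
            if ok:
                secondary.append((hj, slots))
                break
    return h, secondary

def lookup(h, secondary, k):              # O(1) worst case: exactly two hashes
    hj, slots = secondary[h(k)]
    return hj is not None and slots[hj(k)] == k


keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]
h, sec = build_perfect(keys)
```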

Space Analysis

Theorem 11.10 — With m = n and a universal hash function at the first level:

E[Σ n_j²] < 2n

Corollary 11.11 — The expected total storage for all secondary hash tables is less than 2n → overall space is O(n).

Corollary 11.12 — The probability that total secondary storage ≥ 4n is less than 1/2.

  • By testing a few random first-level hash functions, we quickly find one with reasonable total storage

Summary of Perfect Hashing

  • Static key set only
  • Two-level universal hashing
  • O(1) worst-case search time
  • O(n) expected space
  • Both levels use hash functions from the universal class H_{pm}: h_{ab}(k) = ((ak + b) mod p) mod m