Pagefy

Book

Introduction to Algorithms

1.The Role of Algorithms in Computing 2.Getting Started 3.Growth of Functions 4.Divide and Conquer 5.Probabilistic Analysis and Randomized Algorithms 6.Heapsort 7.Quicksort 8.Sorting in Linear Time 9.Medians and Order Statistics 10.Elementary Data Structures 11.Hash Tables 12.Binary Search Trees 13.Red Black Trees 14.Augmenting Data Structures 15.Dynamic Programming 16.Greedy Algorithms 17.Amortized Analysis 18.B Trees 19.Fibonacci Heaps 20.van Emde Boas Trees 21.Data Structures for Disjoint Sets 22.Elementary Graph Algorithms 23.Minimum Spanning Trees 24.Single Source Shortest Paths 25.All Pairs Shortest Paths 26.Maximum Flow 27.Multithreaded Algorithms 28.Matrix Operations 29.Linear Programming 30.Polynomials and the FFT 31.Number Theoretic Algorithms 32.String Matching 33.Computational Geometry 34.NP Completeness 35.Approximation Algorithms 36.Appendix A Summations 37.Appendix B Sets Etc 38.Appendix C Counting and Probability 39.Appendix D Matrices

String Matching

Introduction to Algorithms

Chapter 32: String Matching

Text-editing programs frequently need to find all occurrences of a pattern in a text. Efficient algorithms for this problem — called string matching — have applications in document editing, DNA sequence analysis, and internet search engines.

Problem Formulation

The text is an array T[1..n] of length n
The pattern is an array P[1..m] of length m ≤ n
Elements of P and T are characters drawn from a finite alphabet Σ (e.g., Σ = {0, 1} or Σ = {a, b, ..., z})
Pattern P occurs with shift s in text T if 0 ≤ s ≤ n − m and T[s+1..s+m] = P[1..m]
If P occurs with shift s, then s is a valid shift; otherwise it is an invalid shift
The string-matching problem: find all valid shifts for a given pattern P in a given text T

Summary of Algorithms

Algorithm	Preprocessing Time	Matching Time
Naive	`0`	`O((n − m + 1)m)`
Rabin-Karp	`Θ(m)`	`O((n − m + 1)m)`
Finite Automaton	`O(m	Σ
Knuth-Morris-Pratt	`Θ(m)`	`Θ(n)`

Notation and Terminology

Σ* denotes the set of all finite-length strings formed from alphabet Σ
The empty string ε has zero length and belongs to Σ*
|x| denotes the length of string x
Concatenation of x and y, denoted xy, has length |x| + |y|
Prefix: w ⊏ x if x = wy for some string y ∈ Σ*. Implies |w| ≤ |x|
Suffix: w ⊐ x if x = yw for some string y ∈ Σ*. Implies |w| ≤ |x|
P_k denotes the k-character prefix P[1..k] of pattern P[1..m] (so P_0 = ε, P_m = P)
T_k denotes the k-character prefix of text T
The string-matching problem: find all shifts s in range 0 ≤ s ≤ n − m such that P ⊐ T_{s+m}

Overlapping-Suffix Lemma (Lemma 32.1)

If x ⊐ z and y ⊐ z, then:

If |x| ≤ |y|, then x ⊐ y
If |x| ≥ |y|, then y ⊐ x
If |x| = |y|, then x = y

32.1 The Naive String-Matching Algorithm

The simplest approach: check the condition P[1..m] = T[s+1..s+m] for each of the n − m + 1 possible values of s.

NAIVE-STRING-MATCHER(T, P)
  n = T.length
  m = P.length
  for s = 0 to n − m
      if P[1..m] == T[s+1..s+m]
          print "Pattern occurs with shift" s

Key Points

Conceptually slides a "template" containing the pattern over the text
For each shift s, compares characters until all match or a mismatch is found
No preprocessing required
Worst-case time: O((n − m + 1)m), which is Θ(n²) when m = ⌊n/2⌋
- Example worst case: text T = aⁿ and pattern P = aᵐ — every shift is valid, each requiring m comparisons
Inefficiency: entirely ignores information gained about the text for one value of s when considering other values of s

32.2 The Rabin-Karp Algorithm

Proposed by Rabin and Karp, this algorithm uses hashing to speed up string matching. It performs well in practice and generalizes to related problems (e.g., two-dimensional pattern matching).

Core Idea

Treat each string of m characters as a number and use modular arithmetic to compare pattern and text windows efficiently.

Assume Σ = {0, 1, ..., 9} (each character is a decimal digit; in general, use radix d = |Σ|)
Let p = decimal value of P[1..m]
Let t_s = decimal value of T[s+1..s+m] for s = 0, 1, ..., n − m
t_s = p if and only if T[s+1..s+m] = P[1..m] — i.e., s is a valid shift iff t_s = p

Computing Hash Values Efficiently

Computing p using Horner's rule in Θ(m) time:

p = P[m] + 10(P[m−1] + 10(P[m−2] + ··· + 10(P[2] + 10·P[1])···))

Computing t₀ from T[1..m] similarly in Θ(m) time.

Recurrence for remaining values — compute t_{s+1} from t_s in constant time:

t_{s+1} = 10·(t_s − 10^{m−1} · T[s+1]) + T[s+m+1]

This removes the high-order digit, shifts left, and adds the new low-order digit.

Example: if m = 5 and t_s = 31415, removing high-order digit 3 and adding new digit 2:

t_{s+1} = 10·(31415 − 10000·3) + 2 = 14152

The Modular Arithmetic Approach

Problem: p and t_s may be too large for single-precision arithmetic.

Solution: compute everything modulo q (a suitable prime):

t_{s+1} = (d·(t_s − T[s+1]·h) + T[s+m+1]) mod q

where h ≡ d^{m−1} (mod q) is the value of the high-order digit position.

Choose q as a prime such that d·q fits in one computer word
If t_s ≢ p (mod q), then t_s ≠ p — shift s is definitely invalid
If t_s ≡ p (mod q), we have a hit — but it could be a spurious hit
On a hit, explicitly verify P[1..m] = T[s+1..s+m]

Algorithm

RABIN-KARP-MATCHER(T, P, d, q)
  n = T.length
  m = P.length
  h = d^{m−1} mod q
  p = 0
  t₀ = 0
  for i = 1 to m                          // preprocessing
      p = (d·p + P[i]) mod q
      t₀ = (d·t₀ + T[i]) mod q
  for s = 0 to n − m                      // matching
      if p == t_s
          if P[1..m] == T[s+1..s+m]
              print "Pattern occurs with shift" s
      if s < n − m
          t_{s+1} = (d·(t_s − T[s+1]·h) + T[s+m+1]) mod q

Analysis

Preprocessing time: Θ(m)
Worst-case matching time: Θ((n − m + 1)m) — same as naive (e.g., P = aᵐ, T = aⁿ)
Expected matching time: O(n) + O(m(ν + n/q))
- ν = number of valid shifts
- Expected spurious hits: O(n/q) (assuming hash values behave like random mappings)
If ν = O(1) and q ≥ m, expected matching time is O(n + m) = O(n)

32.3 String Matching with Finite Automata

This method builds a finite automaton specifically designed to search for occurrences of pattern P in text T. It examines each text character exactly once, taking constant time per character.

Finite Automata Definition

A finite automaton M is a 5-tuple (Q, q₀, A, Σ, δ):

Q: finite set of states
q₀ ∈ Q: start state
A ⊆ Q: set of accepting states
Σ: finite input alphabet
δ: transition function from Q × Σ → Q

The automaton starts in state q₀, reads characters one at a time, and transitions: if in state q reading character a, it moves to state δ(q, a). A string w is accepted if the final state φ(w) ∈ A.

Final-state function φ (defined recursively):

φ(ε) = q₀
φ(wa) = δ(φ(w), a) for w ∈ Σ*, a ∈ Σ

The Suffix Function σ

For a given pattern P[1..m], the suffix function σ: Σ* → {0, 1, ..., m} is:

σ(x) = max{k : P_k ⊐ x}

σ(x) is the length of the longest prefix of P that is also a suffix of x.

String-Matching Automaton Construction

For pattern P[1..m]:

State set Q = {0, 1, ..., m}
Start state q₀ = 0
Only accepting state: state m
Transition function:

δ(q, a) = σ(P_q · a)

Intuition: state q means the longest prefix of P that matches a suffix of the text read so far has length q. On reading character a:

If a = P[q+1]: match continues, δ(q, a) = q + 1 (move along the "spine")
If a ≠ P[q+1]: find the longest smaller prefix of P that is a suffix of P_q · a

Matching Algorithm

FINITE-AUTOMATON-MATCHER(T, δ, m)
  n = T.length
  q = 0
  for i = 1 to n
      q = δ(q, T[i])
      if q == m
          print "Pattern occurs with shift" i − m

Matching time: Θ(n) — each character examined exactly once.

Key Theorem (Theorem 32.4)

The automaton maintains the invariant:

φ(T_i) = σ(T_i)

That is, after scanning T[i], the automaton is in the state equal to the length of the longest prefix of P that is a suffix of T_i.

Computing the Transition Function

COMPUTE-TRANSITION-FUNCTION(P, Σ)
  m = P.length
  for q = 0 to m
      for each character a ∈ Σ
          k = min(m + 1, q + 2)
          repeat
              k = k − 1
          until P_k ⊐ P_q · a
          δ(q, a) = k
  return δ

Running time: O(m³|Σ|) (can be improved to O(m|Σ|) with clever use of pattern information)
Total: O(m|Σ|) preprocessing + Θ(n) matching

Supporting Lemmas

Lemma 32.2 (Suffix-function inequality): For any string x and character a:

σ(xa) ≤ σ(x) + 1

Lemma 32.3 (Suffix-function recursion): For any string x and character a, if q = σ(x), then:

σ(xa) = σ(P_q · a)

This is the key insight: to compute σ(xa), we only need to know σ(x) (the current state), not the entire string x.

32.4 The Knuth-Morris-Pratt Algorithm

The KMP algorithm achieves Θ(n) matching time with only Θ(m) preprocessing — avoiding the O(m|Σ|) cost of computing the full transition function δ. It uses an auxiliary prefix function π stored in an array π[1..m].

The Prefix Function

The prefix function π for pattern P[1..m] is defined as:

π[q] = max{k : k < q and P_k ⊐ P_q}

π[q] is the length of the longest proper prefix of P_q that is also a suffix of P_q.

Example for P = ababaca:

i	1	2	3	4	5	6	7
P[i]	a	b	a	b	a	c	a
π[i]	0	0	1	2	3	0	1

Key Idea

Given that q characters have matched at shift s, the next potentially valid shift is:

s' = s + (q − π[q])

At the new shift s', the first π[q] characters of P are guaranteed to match (no need to re-compare them).

The KMP Matcher

KMP-MATCHER(T, P)
  n = T.length
  m = P.length
  π = COMPUTE-PREFIX-FUNCTION(P)
  q = 0                                   // number of characters matched
  for i = 1 to n                          // scan text left to right
      while q > 0 and P[q+1] ≠ T[i]
          q = π[q]                         // next character does not match
      if P[q+1] == T[i]
          q = q + 1                        // next character matches
      if q == m                            // is all of P matched?
          print "Pattern occurs with shift" i − m
          q = π[q]                         // look for the next match

Computing the Prefix Function

COMPUTE-PREFIX-FUNCTION(P)
  m = P.length
  let π[1..m] be a new array
  π[1] = 0
  k = 0
  for q = 2 to m
      while k > 0 and P[k+1] ≠ P[q]
          k = π[k]
      if P[k+1] == P[q]
          k = k + 1
      π[q] = k
  return π

Both procedures share the same structure: KMP-MATCHER matches text T against P, while COMPUTE-PREFIX-FUNCTION matches P against itself.

Running Time Analysis

COMPUTE-PREFIX-FUNCTION runs in Θ(m) (amortized analysis):

k starts at 0 and increases by at most 1 per iteration of the for loop (at most m − 1 total increases)
Each while loop iteration decreases k
k never goes negative
Total decreases ≤ total increases ≤ m − 1
Therefore the while loop executes at most m − 1 times total

KMP-MATCHER runs in Θ(n) by a similar aggregate amortized argument on q.

Total: Θ(m) preprocessing + Θ(n) matching = Θ(n + m)

How π Replaces δ

When the automaton is in state q and reads character a = T[i]:

If a = P[q+1]: match continues, move to state q + 1 (no need for π)
If a ≠ P[q+1]: iterate through states in π*(q) = {π[q], π[π[q]], ...} in decreasing order until finding a state q' where P[q'+1] = a, then move to q' + 1. If no such state exists, move to state 0 (or 1 if P[1] = a)

Prefix-Function Iteration Lemma (Lemma 32.5)

Define π*(q) = {π[q], π^(2)[q], π^(3)[q], ..., 0} (iterating π until reaching 0).

Then:

π*(q) = {k : k < q and P_k ⊐ P_q}

That is, iterating π enumerates all prefixes of P that are proper suffixes of P_q.

Correctness

COMPUTE-PREFIX-FUNCTION: At the start of each for-loop iteration, k = π[q−1]. Lines 6–9 adjust k to the correct value of π[q] using Corollary 32.7:
- If E_{q−1} = ∅ (no extendable prefix), then π[q] = 0
- Otherwise, π[q] = 1 + max{k ∈ E_{q−1}}
- where E_{q−1} = {k ∈ π*(q−1) : P[k+1] = P[q]}
KMP-MATCHER: Simulates FINITE-AUTOMATON-MATCHER — after processing T[i], the state q = σ(T_i). Correctness follows from Theorem 32.4 and the equivalence of the π-based transitions to the δ transitions.

Advantages over Finite Automaton Method

	Finite Automaton	KMP
Preprocessing	`O(m\|Σ\|)`	`Θ(m)`
Matching	`Θ(n)`	`Θ(n)`
Space	`O(m\|Σ\|)` for `δ` table	`O(m)` for `π` array

KMP saves a factor of |Σ| in both preprocessing time and space by computing π instead of δ, while computing transitions "on the fly" during matching.

Previous chapter

Number Theoretic Algorithms

Next chapter

Computational Geometry