String Matching
Introduction to AlgorithmsChapter 32: String Matching
Text-editing programs frequently need to find all occurrences of a pattern in a text. Efficient algorithms for this problem — called string matching — have applications in document editing, DNA sequence analysis, and internet search engines.
Problem Formulation
- The text is an array
T[1..n]of lengthn - The pattern is an array
P[1..m]of lengthm ≤ n - Elements of
PandTare characters drawn from a finite alphabetΣ(e.g.,Σ = {0, 1}orΣ = {a, b, ..., z}) - Pattern
Poccurs with shiftsin textTif0 ≤ s ≤ n − mandT[s+1..s+m] = P[1..m] - If
Poccurs with shifts, thensis a valid shift; otherwise it is an invalid shift - The string-matching problem: find all valid shifts for a given pattern
Pin a given textT
Summary of Algorithms
| Algorithm | Preprocessing Time | Matching Time |
|---|---|---|
| Naive | 0 | O((n − m + 1)m) |
| Rabin-Karp | Θ(m) | O((n − m + 1)m) |
| Finite Automaton | `O(m | Σ |
| Knuth-Morris-Pratt | Θ(m) | Θ(n) |
Notation and Terminology
- Σ* denotes the set of all finite-length strings formed from alphabet
Σ - The empty string
εhas zero length and belongs toΣ* - |x| denotes the length of string
x - Concatenation of
xandy, denotedxy, has length|x| + |y| - Prefix:
w ⊏ xifx = wyfor some stringy ∈ Σ*. Implies|w| ≤ |x| - Suffix:
w ⊐ xifx = ywfor some stringy ∈ Σ*. Implies|w| ≤ |x| - P_k denotes the k-character prefix
P[1..k]of patternP[1..m](soP_0 = ε,P_m = P) - T_k denotes the k-character prefix of text
T - The string-matching problem: find all shifts
sin range0 ≤ s ≤ n − msuch thatP ⊐ T_{s+m}
Overlapping-Suffix Lemma (Lemma 32.1)
If x ⊐ z and y ⊐ z, then:
- If
|x| ≤ |y|, thenx ⊐ y - If
|x| ≥ |y|, theny ⊐ x - If
|x| = |y|, thenx = y
32.1 The Naive String-Matching Algorithm
The simplest approach: check the condition P[1..m] = T[s+1..s+m] for each of the n − m + 1 possible values of s.
NAIVE-STRING-MATCHER(T, P)
n = T.length
m = P.length
for s = 0 to n − m
if P[1..m] == T[s+1..s+m]
print "Pattern occurs with shift" s
Key Points
- Conceptually slides a "template" containing the pattern over the text
- For each shift
s, compares characters until all match or a mismatch is found - No preprocessing required
- Worst-case time:
O((n − m + 1)m), which isΘ(n²)whenm = ⌊n/2⌋- Example worst case: text
T = aⁿand patternP = aᵐ— every shift is valid, each requiringmcomparisons
- Example worst case: text
- Inefficiency: entirely ignores information gained about the text for one value of
swhen considering other values ofs
32.2 The Rabin-Karp Algorithm
Proposed by Rabin and Karp, this algorithm uses hashing to speed up string matching. It performs well in practice and generalizes to related problems (e.g., two-dimensional pattern matching).
Core Idea
Treat each string of m characters as a number and use modular arithmetic to compare pattern and text windows efficiently.
- Assume
Σ = {0, 1, ..., 9}(each character is a decimal digit; in general, use radixd = |Σ|) - Let p = decimal value of
P[1..m] - Let t_s = decimal value of
T[s+1..s+m]fors = 0, 1, ..., n − m t_s = pif and only ifT[s+1..s+m] = P[1..m]— i.e.,sis a valid shift ifft_s = p
Computing Hash Values Efficiently
Computing p using Horner's rule in Θ(m) time:
p = P[m] + 10(P[m−1] + 10(P[m−2] + ··· + 10(P[2] + 10·P[1])···))
Computing t₀ from T[1..m] similarly in Θ(m) time.
Recurrence for remaining values — compute t_{s+1} from t_s in constant time:
t_{s+1} = 10·(t_s − 10^{m−1} · T[s+1]) + T[s+m+1]
This removes the high-order digit, shifts left, and adds the new low-order digit.
Example: if m = 5 and t_s = 31415, removing high-order digit 3 and adding new digit 2:
t_{s+1} = 10·(31415 − 10000·3) + 2 = 14152
The Modular Arithmetic Approach
Problem: p and t_s may be too large for single-precision arithmetic.
Solution: compute everything modulo q (a suitable prime):
t_{s+1} = (d·(t_s − T[s+1]·h) + T[s+m+1]) mod q
where h ≡ d^{m−1} (mod q) is the value of the high-order digit position.
- Choose
qas a prime such thatd·qfits in one computer word - If
t_s ≢ p (mod q), thent_s ≠ p— shiftsis definitely invalid - If
t_s ≡ p (mod q), we have a hit — but it could be a spurious hit - On a hit, explicitly verify
P[1..m] = T[s+1..s+m]
Algorithm
RABIN-KARP-MATCHER(T, P, d, q)
n = T.length
m = P.length
h = d^{m−1} mod q
p = 0
t₀ = 0
for i = 1 to m // preprocessing
p = (d·p + P[i]) mod q
t₀ = (d·t₀ + T[i]) mod q
for s = 0 to n − m // matching
if p == t_s
if P[1..m] == T[s+1..s+m]
print "Pattern occurs with shift" s
if s < n − m
t_{s+1} = (d·(t_s − T[s+1]·h) + T[s+m+1]) mod q
Analysis
- Preprocessing time:
Θ(m) - Worst-case matching time:
Θ((n − m + 1)m)— same as naive (e.g.,P = aᵐ,T = aⁿ) - Expected matching time:
O(n) + O(m(ν + n/q))ν= number of valid shifts- Expected spurious hits:
O(n/q)(assuming hash values behave like random mappings)
- If
ν = O(1)andq ≥ m, expected matching time isO(n + m) = O(n)
32.3 String Matching with Finite Automata
This method builds a finite automaton specifically designed to search for occurrences of pattern P in text T. It examines each text character exactly once, taking constant time per character.
Finite Automata Definition
A finite automaton M is a 5-tuple (Q, q₀, A, Σ, δ):
- Q: finite set of states
- q₀ ∈ Q: start state
- A ⊆ Q: set of accepting states
- Σ: finite input alphabet
- δ: transition function from
Q × Σ → Q
The automaton starts in state q₀, reads characters one at a time, and transitions: if in state q reading character a, it moves to state δ(q, a). A string w is accepted if the final state φ(w) ∈ A.
Final-state function φ (defined recursively):
φ(ε) = q₀φ(wa) = δ(φ(w), a)forw ∈ Σ*,a ∈ Σ
The Suffix Function σ
For a given pattern P[1..m], the suffix function σ: Σ* → {0, 1, ..., m} is:
σ(x) = max{k : P_k ⊐ x}
σ(x) is the length of the longest prefix of P that is also a suffix of x.
String-Matching Automaton Construction
For pattern P[1..m]:
- State set
Q = {0, 1, ..., m} - Start state
q₀ = 0 - Only accepting state: state
m - Transition function:
δ(q, a) = σ(P_q · a)
Intuition: state q means the longest prefix of P that matches a suffix of the text read so far has length q. On reading character a:
- If
a = P[q+1]: match continues,δ(q, a) = q + 1(move along the "spine") - If
a ≠ P[q+1]: find the longest smaller prefix ofPthat is a suffix ofP_q · a
Matching Algorithm
FINITE-AUTOMATON-MATCHER(T, δ, m)
n = T.length
q = 0
for i = 1 to n
q = δ(q, T[i])
if q == m
print "Pattern occurs with shift" i − m
Matching time: Θ(n) — each character examined exactly once.
Key Theorem (Theorem 32.4)
The automaton maintains the invariant:
φ(T_i) = σ(T_i)
That is, after scanning T[i], the automaton is in the state equal to the length of the longest prefix of P that is a suffix of T_i.
Computing the Transition Function
COMPUTE-TRANSITION-FUNCTION(P, Σ)
m = P.length
for q = 0 to m
for each character a ∈ Σ
k = min(m + 1, q + 2)
repeat
k = k − 1
until P_k ⊐ P_q · a
δ(q, a) = k
return δ
- Running time:
O(m³|Σ|)(can be improved toO(m|Σ|)with clever use of pattern information) - Total:
O(m|Σ|)preprocessing +Θ(n)matching
Supporting Lemmas
Lemma 32.2 (Suffix-function inequality): For any string x and character a:
σ(xa) ≤ σ(x) + 1
Lemma 32.3 (Suffix-function recursion): For any string x and character a, if q = σ(x), then:
σ(xa) = σ(P_q · a)
This is the key insight: to compute σ(xa), we only need to know σ(x) (the current state), not the entire string x.
32.4 The Knuth-Morris-Pratt Algorithm
The KMP algorithm achieves Θ(n) matching time with only Θ(m) preprocessing — avoiding the O(m|Σ|) cost of computing the full transition function δ. It uses an auxiliary prefix function π stored in an array π[1..m].
The Prefix Function
The prefix function π for pattern P[1..m] is defined as:
π[q] = max{k : k < q and P_k ⊐ P_q}
π[q] is the length of the longest proper prefix of P_q that is also a suffix of P_q.
Example for P = ababaca:
| i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| P[i] | a | b | a | b | a | c | a |
| π[i] | 0 | 0 | 1 | 2 | 3 | 0 | 1 |
Key Idea
Given that q characters have matched at shift s, the next potentially valid shift is:
s' = s + (q − π[q])
At the new shift s', the first π[q] characters of P are guaranteed to match (no need to re-compare them).
The KMP Matcher
KMP-MATCHER(T, P)
n = T.length
m = P.length
π = COMPUTE-PREFIX-FUNCTION(P)
q = 0 // number of characters matched
for i = 1 to n // scan text left to right
while q > 0 and P[q+1] ≠ T[i]
q = π[q] // next character does not match
if P[q+1] == T[i]
q = q + 1 // next character matches
if q == m // is all of P matched?
print "Pattern occurs with shift" i − m
q = π[q] // look for the next match
Computing the Prefix Function
COMPUTE-PREFIX-FUNCTION(P)
m = P.length
let π[1..m] be a new array
π[1] = 0
k = 0
for q = 2 to m
while k > 0 and P[k+1] ≠ P[q]
k = π[k]
if P[k+1] == P[q]
k = k + 1
π[q] = k
return π
Both procedures share the same structure: KMP-MATCHER matches text T against P, while COMPUTE-PREFIX-FUNCTION matches P against itself.
Running Time Analysis
COMPUTE-PREFIX-FUNCTION runs in Θ(m) (amortized analysis):
kstarts at 0 and increases by at most 1 per iteration of the for loop (at mostm − 1total increases)- Each while loop iteration decreases
k knever goes negative- Total decreases ≤ total increases ≤
m − 1 - Therefore the while loop executes at most
m − 1times total
KMP-MATCHER runs in Θ(n) by a similar aggregate amortized argument on q.
Total: Θ(m) preprocessing + Θ(n) matching = Θ(n + m)
How π Replaces δ
When the automaton is in state q and reads character a = T[i]:
- If
a = P[q+1]: match continues, move to stateq + 1(no need forπ) - If
a ≠ P[q+1]: iterate through states inπ*(q) = {π[q], π[π[q]], ...}in decreasing order until finding a stateq'whereP[q'+1] = a, then move toq' + 1. If no such state exists, move to state 0 (or 1 ifP[1] = a)
Prefix-Function Iteration Lemma (Lemma 32.5)
Define π*(q) = {π[q], π^(2)[q], π^(3)[q], ..., 0} (iterating π until reaching 0).
Then:
π*(q) = {k : k < q and P_k ⊐ P_q}
That is, iterating π enumerates all prefixes of P that are proper suffixes of P_q.
Correctness
-
COMPUTE-PREFIX-FUNCTION: At the start of each for-loop iteration,
k = π[q−1]. Lines 6–9 adjustkto the correct value ofπ[q]using Corollary 32.7:- If
E_{q−1} = ∅(no extendable prefix), thenπ[q] = 0 - Otherwise,
π[q] = 1 + max{k ∈ E_{q−1}} - where
E_{q−1} = {k ∈ π*(q−1) : P[k+1] = P[q]}
- If
-
KMP-MATCHER: Simulates
FINITE-AUTOMATON-MATCHER— after processingT[i], the stateq = σ(T_i). Correctness follows from Theorem 32.4 and the equivalence of theπ-based transitions to theδtransitions.
Advantages over Finite Automaton Method
| Finite Automaton | KMP | |
|---|---|---|
| Preprocessing | O(m|Σ|) | Θ(m) |
| Matching | Θ(n) | Θ(n) |
| Space | O(m|Σ|) for δ table | O(m) for π array |
KMP saves a factor of |Σ| in both preprocessing time and space by computing π instead of δ, while computing transitions "on the fly" during matching.
Previous chapter
Number Theoretic AlgorithmsNext chapter
Computational Geometry