Pagefy

Pagefy

Back to Data Structures and Algorithms

String Matching

Introduction to Algorithms

Chapter 32: String Matching

Text-editing programs frequently need to find all occurrences of a pattern in a text. Efficient algorithms for this problem — called string matching — have applications in document editing, DNA sequence analysis, and internet search engines.

Problem Formulation

  • The text is an array T[1..n] of length n
  • The pattern is an array P[1..m] of length m ≤ n
  • Elements of P and T are characters drawn from a finite alphabet Σ (e.g., Σ = {0, 1} or Σ = {a, b, ..., z})
  • Pattern P occurs with shift s in text T if 0 ≤ s ≤ n − m and T[s+1..s+m] = P[1..m]
  • If P occurs with shift s, then s is a valid shift; otherwise it is an invalid shift
  • The string-matching problem: find all valid shifts for a given pattern P in a given text T

Summary of Algorithms

AlgorithmPreprocessing TimeMatching Time
Naive0O((n − m + 1)m)
Rabin-KarpΘ(m)O((n − m + 1)m)
Finite Automaton`O(mΣ
Knuth-Morris-PrattΘ(m)Θ(n)

Notation and Terminology

  • Σ* denotes the set of all finite-length strings formed from alphabet Σ
  • The empty string ε has zero length and belongs to Σ*
  • |x| denotes the length of string x
  • Concatenation of x and y, denoted xy, has length |x| + |y|
  • Prefix: w ⊏ x if x = wy for some string y ∈ Σ*. Implies |w| ≤ |x|
  • Suffix: w ⊐ x if x = yw for some string y ∈ Σ*. Implies |w| ≤ |x|
  • P_k denotes the k-character prefix P[1..k] of pattern P[1..m] (so P_0 = ε, P_m = P)
  • T_k denotes the k-character prefix of text T
  • The string-matching problem: find all shifts s in range 0 ≤ s ≤ n − m such that P ⊐ T_{s+m}

Overlapping-Suffix Lemma (Lemma 32.1)

If x ⊐ z and y ⊐ z, then:

  • If |x| ≤ |y|, then x ⊐ y
  • If |x| ≥ |y|, then y ⊐ x
  • If |x| = |y|, then x = y

32.1 The Naive String-Matching Algorithm

The simplest approach: check the condition P[1..m] = T[s+1..s+m] for each of the n − m + 1 possible values of s.

NAIVE-STRING-MATCHER(T, P)
  n = T.length
  m = P.length
  for s = 0 to n − m
      if P[1..m] == T[s+1..s+m]
          print "Pattern occurs with shift" s

Key Points

  • Conceptually slides a "template" containing the pattern over the text
  • For each shift s, compares characters until all match or a mismatch is found
  • No preprocessing required
  • Worst-case time: O((n − m + 1)m), which is Θ(n²) when m = ⌊n/2⌋
    • Example worst case: text T = aⁿ and pattern P = aᵐ — every shift is valid, each requiring m comparisons
  • Inefficiency: entirely ignores information gained about the text for one value of s when considering other values of s

32.2 The Rabin-Karp Algorithm

Proposed by Rabin and Karp, this algorithm uses hashing to speed up string matching. It performs well in practice and generalizes to related problems (e.g., two-dimensional pattern matching).

Core Idea

Treat each string of m characters as a number and use modular arithmetic to compare pattern and text windows efficiently.

  • Assume Σ = {0, 1, ..., 9} (each character is a decimal digit; in general, use radix d = |Σ|)
  • Let p = decimal value of P[1..m]
  • Let t_s = decimal value of T[s+1..s+m] for s = 0, 1, ..., n − m
  • t_s = p if and only if T[s+1..s+m] = P[1..m] — i.e., s is a valid shift iff t_s = p

Computing Hash Values Efficiently

Computing p using Horner's rule in Θ(m) time:

p = P[m] + 10(P[m−1] + 10(P[m−2] + ··· + 10(P[2] + 10·P[1])···))

Computing t₀ from T[1..m] similarly in Θ(m) time.

Recurrence for remaining values — compute t_{s+1} from t_s in constant time:

t_{s+1} = 10·(t_s − 10^{m−1} · T[s+1]) + T[s+m+1]

This removes the high-order digit, shifts left, and adds the new low-order digit.

Example: if m = 5 and t_s = 31415, removing high-order digit 3 and adding new digit 2:

t_{s+1} = 10·(31415 − 10000·3) + 2 = 14152

The Modular Arithmetic Approach

Problem: p and t_s may be too large for single-precision arithmetic.

Solution: compute everything modulo q (a suitable prime):

t_{s+1} = (d·(t_s − T[s+1]·h) + T[s+m+1]) mod q

where h ≡ d^{m−1} (mod q) is the value of the high-order digit position.

  • Choose q as a prime such that d·q fits in one computer word
  • If t_s ≢ p (mod q), then t_s ≠ p — shift s is definitely invalid
  • If t_s ≡ p (mod q), we have a hit — but it could be a spurious hit
  • On a hit, explicitly verify P[1..m] = T[s+1..s+m]

Algorithm

RABIN-KARP-MATCHER(T, P, d, q)
  n = T.length
  m = P.length
  h = d^{m−1} mod q
  p = 0
  t₀ = 0
  for i = 1 to m                          // preprocessing
      p = (d·p + P[i]) mod q
      t₀ = (d·t₀ + T[i]) mod q
  for s = 0 to n − m                      // matching
      if p == t_s
          if P[1..m] == T[s+1..s+m]
              print "Pattern occurs with shift" s
      if s < n − m
          t_{s+1} = (d·(t_s − T[s+1]·h) + T[s+m+1]) mod q

Analysis

  • Preprocessing time: Θ(m)
  • Worst-case matching time: Θ((n − m + 1)m) — same as naive (e.g., P = aᵐ, T = aⁿ)
  • Expected matching time: O(n) + O(m(ν + n/q))
    • ν = number of valid shifts
    • Expected spurious hits: O(n/q) (assuming hash values behave like random mappings)
  • If ν = O(1) and q ≥ m, expected matching time is O(n + m) = O(n)

32.3 String Matching with Finite Automata

This method builds a finite automaton specifically designed to search for occurrences of pattern P in text T. It examines each text character exactly once, taking constant time per character.

Finite Automata Definition

A finite automaton M is a 5-tuple (Q, q₀, A, Σ, δ):

  • Q: finite set of states
  • q₀ ∈ Q: start state
  • A ⊆ Q: set of accepting states
  • Σ: finite input alphabet
  • δ: transition function from Q × Σ → Q

The automaton starts in state q₀, reads characters one at a time, and transitions: if in state q reading character a, it moves to state δ(q, a). A string w is accepted if the final state φ(w) ∈ A.

Final-state function φ (defined recursively):

  • φ(ε) = q₀
  • φ(wa) = δ(φ(w), a) for w ∈ Σ*, a ∈ Σ

The Suffix Function σ

For a given pattern P[1..m], the suffix function σ: Σ* → {0, 1, ..., m} is:

σ(x) = max{k : P_k ⊐ x}

σ(x) is the length of the longest prefix of P that is also a suffix of x.

String-Matching Automaton Construction

For pattern P[1..m]:

  • State set Q = {0, 1, ..., m}
  • Start state q₀ = 0
  • Only accepting state: state m
  • Transition function:
δ(q, a) = σ(P_q · a)

Intuition: state q means the longest prefix of P that matches a suffix of the text read so far has length q. On reading character a:

  • If a = P[q+1]: match continues, δ(q, a) = q + 1 (move along the "spine")
  • If a ≠ P[q+1]: find the longest smaller prefix of P that is a suffix of P_q · a

Matching Algorithm

FINITE-AUTOMATON-MATCHER(T, δ, m)
  n = T.length
  q = 0
  for i = 1 to n
      q = δ(q, T[i])
      if q == m
          print "Pattern occurs with shift" i − m

Matching time: Θ(n) — each character examined exactly once.

Key Theorem (Theorem 32.4)

The automaton maintains the invariant:

φ(T_i) = σ(T_i)

That is, after scanning T[i], the automaton is in the state equal to the length of the longest prefix of P that is a suffix of T_i.

Computing the Transition Function

COMPUTE-TRANSITION-FUNCTION(P, Σ)
  m = P.length
  for q = 0 to m
      for each character a ∈ Σ
          k = min(m + 1, q + 2)
          repeat
              k = k − 1
          until P_k ⊐ P_q · a
          δ(q, a) = k
  return δ
  • Running time: O(m³|Σ|) (can be improved to O(m|Σ|) with clever use of pattern information)
  • Total: O(m|Σ|) preprocessing + Θ(n) matching

Supporting Lemmas

Lemma 32.2 (Suffix-function inequality): For any string x and character a:

σ(xa) ≤ σ(x) + 1

Lemma 32.3 (Suffix-function recursion): For any string x and character a, if q = σ(x), then:

σ(xa) = σ(P_q · a)

This is the key insight: to compute σ(xa), we only need to know σ(x) (the current state), not the entire string x.


32.4 The Knuth-Morris-Pratt Algorithm

The KMP algorithm achieves Θ(n) matching time with only Θ(m) preprocessing — avoiding the O(m|Σ|) cost of computing the full transition function δ. It uses an auxiliary prefix function π stored in an array π[1..m].

The Prefix Function

The prefix function π for pattern P[1..m] is defined as:

π[q] = max{k : k < q and P_k ⊐ P_q}

π[q] is the length of the longest proper prefix of P_q that is also a suffix of P_q.

Example for P = ababaca:

i1234567
P[i]ababaca
π[i]0012301

Key Idea

Given that q characters have matched at shift s, the next potentially valid shift is:

s' = s + (q − π[q])

At the new shift s', the first π[q] characters of P are guaranteed to match (no need to re-compare them).

The KMP Matcher

KMP-MATCHER(T, P)
  n = T.length
  m = P.length
  π = COMPUTE-PREFIX-FUNCTION(P)
  q = 0                                   // number of characters matched
  for i = 1 to n                          // scan text left to right
      while q > 0 and P[q+1] ≠ T[i]
          q = π[q]                         // next character does not match
      if P[q+1] == T[i]
          q = q + 1                        // next character matches
      if q == m                            // is all of P matched?
          print "Pattern occurs with shift" i − m
          q = π[q]                         // look for the next match

Computing the Prefix Function

COMPUTE-PREFIX-FUNCTION(P)
  m = P.length
  let π[1..m] be a new array
  π[1] = 0
  k = 0
  for q = 2 to m
      while k > 0 and P[k+1] ≠ P[q]
          k = π[k]
      if P[k+1] == P[q]
          k = k + 1
      π[q] = k
  return π

Both procedures share the same structure: KMP-MATCHER matches text T against P, while COMPUTE-PREFIX-FUNCTION matches P against itself.

Running Time Analysis

COMPUTE-PREFIX-FUNCTION runs in Θ(m) (amortized analysis):

  • k starts at 0 and increases by at most 1 per iteration of the for loop (at most m − 1 total increases)
  • Each while loop iteration decreases k
  • k never goes negative
  • Total decreases ≤ total increases ≤ m − 1
  • Therefore the while loop executes at most m − 1 times total

KMP-MATCHER runs in Θ(n) by a similar aggregate amortized argument on q.

Total: Θ(m) preprocessing + Θ(n) matching = Θ(n + m)

How π Replaces δ

When the automaton is in state q and reads character a = T[i]:

  1. If a = P[q+1]: match continues, move to state q + 1 (no need for π)
  2. If a ≠ P[q+1]: iterate through states in π*(q) = {π[q], π[π[q]], ...} in decreasing order until finding a state q' where P[q'+1] = a, then move to q' + 1. If no such state exists, move to state 0 (or 1 if P[1] = a)

Prefix-Function Iteration Lemma (Lemma 32.5)

Define π*(q) = {π[q], π^(2)[q], π^(3)[q], ..., 0} (iterating π until reaching 0).

Then:

π*(q) = {k : k < q and P_k ⊐ P_q}

That is, iterating π enumerates all prefixes of P that are proper suffixes of P_q.

Correctness

  • COMPUTE-PREFIX-FUNCTION: At the start of each for-loop iteration, k = π[q−1]. Lines 6–9 adjust k to the correct value of π[q] using Corollary 32.7:

    • If E_{q−1} = ∅ (no extendable prefix), then π[q] = 0
    • Otherwise, π[q] = 1 + max{k ∈ E_{q−1}}
    • where E_{q−1} = {k ∈ π*(q−1) : P[k+1] = P[q]}
  • KMP-MATCHER: Simulates FINITE-AUTOMATON-MATCHER — after processing T[i], the state q = σ(T_i). Correctness follows from Theorem 32.4 and the equivalence of the π-based transitions to the δ transitions.

Advantages over Finite Automaton Method

Finite AutomatonKMP
PreprocessingO(m|Σ|)Θ(m)
MatchingΘ(n)Θ(n)
SpaceO(m|Σ|) for δ tableO(m) for π array

KMP saves a factor of |Σ| in both preprocessing time and space by computing π instead of δ, while computing transitions "on the fly" during matching.