Sabin M. Thomas - String Matching Algorithms

Post on 10-Apr-2018

  • 8/8/2019 Sabin M. Thomas - String Matching Algorithms

    1/28

    String Matching Algorithms

    Sabin Thomas

    History of String Search

    The brute force algorithm:

    invented in the dawn of computer history

    re-invented many times, still common

    Knuth & Pratt invented a better one in 1970

    published 1976 as Knuth-Morris-Pratt

    Boyer & Moore found a better one before 1976

    Published 1977

    Karp & Rabin found a better one in 1980

    Published 1987

    Brute-force

    Worst O(m*n)

    Best O(n)

    algorithm brute-force:
        input: an array of characters, T (the string to be analyzed), length n
               an array of characters, P (the pattern to be searched for), length m
        for i := 0 to n-m do
            for j := 0 to m-1 do
                compare T[i+j] with P[j]
                if not equal, exit the inner loop
            if the inner loop ran to completion, report a match at i
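The brute-force loop above can be sketched in Python roughly as follows. The function name `brute_force_search` and the convention of returning -1 when nothing is found are assumptions made for this illustration, not part of the slides.

```python
def brute_force_search(text, pattern):
    """Return the first index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    # Try every alignment i of the pattern against the text.
    for i in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or the pattern is exhausted.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i  # every character of the pattern matched
    return -1


print(brute_force_search("ABCEFGABCDE", "ABCD"))  # 6
```

The worst case O(m*n) arises when almost every alignment matches most of the pattern before failing, for example searching for "AAAB" in a long run of "A".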

    Boyer-Moore

    (Example 1)

    t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
     A     B     C     E     F     G     A     B     C     D     E
     A     B     C     D
    p[0]  p[1]  p[2]  p[3]
                       N

    There is no E in the pattern: thus the pattern can't match if any of its characters lie under t[3]. So, move four boxes to the right.

    Boyer-Moore

    (Example 1)

    t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
     A     B     C     E     F     G     A     B     C     D     E
                             A     B     C     D
                            p[0]  p[1]  p[2]  p[3]
                                               N

    Again, no match: t[7] is B, not D. But there is a B in the pattern, at p[1]. So move two boxes to the right, which lines p[1] up with t[7].

    Boyer-Moore

    (Example 1)

    t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
     A     B     C     E     F     G     A     B     C     D     E
                                         A     B     C     D
                                        p[0]  p[1]  p[2]  p[3]
                                         Y     Y     Y     Y

    Boyer-Moore

    (Pseudocode)

    Compares right to left

    2 precomputed functions

    Good suffix shift

    Bad character shift
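As a rough illustration of the right-to-left comparison and the bad character shift, here is a simplified Python sketch. It deliberately leaves out the good suffix shift, so it is not the full Boyer-Moore algorithm; the names `bad_character_table` and `bm_bad_char_search` are made up for this example.

```python
def bad_character_table(pattern):
    """Map each character to the index of its last occurrence in pattern."""
    return {ch: idx for idx, ch in enumerate(pattern)}


def bm_bad_char_search(text, pattern):
    """Search using only the bad character rule (simplified Boyer-Moore).

    Returns (position, shifts): the first match index (or -1) and the
    list of shifts taken on the way, so the moves can be inspected.
    """
    n, m = len(text), len(pattern)
    last = bad_character_table(pattern)
    shifts = []
    i = 0
    while i <= n - m:
        j = m - 1
        # Compare right to left.
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1
        if j < 0:
            return i, shifts
        # Line the mismatched text character up with its last occurrence
        # in the pattern, or jump past it if it does not occur at all.
        shift = max(1, j - last.get(text[i + j], -1))
        shifts.append(shift)
        i += shift
    return -1, shifts


print(bm_bad_char_search("ABCEFGABCDE", "ABCD"))  # (6, [4, 2])
```

On the text and pattern of Example 1 this takes a shift of four and then a shift of two before matching at t[6], the same moves as in the example slides.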

    Boyer-Moore

    (Performance)

    Performance depends on length of pattern

    Best case O(n/m)

    Longer patterns = better performance

    Smallest pattern = m = 1

    O(n) linear search

    Knuth-Morris-Pratt

    searches for occurrences of a "word" W

    within a main "text string" S

    Bypasses re-examination of previously matched characters.

    Knuth-Morris-Pratt

    (Example 1)

    t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10] t[11] t[12] t[13]
     A     B     C           A     B     C     D     A     B           A     B     C
     A     B     C     D     A     B     D
    p[0]  p[1]  p[2]  p[3]  p[4]  p[5]  p[6]
     Y     Y     Y     N

    m = 0

    Knuth-Morris-Pratt

    (Example 1)

    t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10] t[11] t[12] t[13]
     A     B     C           A     B     C     D     A     B           A     B     C
                             A     B     C     D     A     B     D
                            p[0]  p[1]  p[2]  p[3]  p[4]  p[5]  p[6]
                             Y     Y     Y     Y     Y     Y     N

    m = 4

    Knuth-Morris-Pratt

    (Example 1)

    t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10] t[11] t[12] t[13]
     A     B     C           A     B     C     D     A     B           A     B     C
                                                                 A     B     C     D     A     B     D
                                                                p[0]  p[1]  p[2]  p[3]  p[4]  p[5]  p[6]
                                                                 N

    m = 10

    Knuth-Morris-Pratt

    (Example 1)

    t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10] t[11] t[12] t[13]
     A     B     C           A     B     C     D     A     B           A     B     C
                                                                       A     B     C     ..
                                                                      p[0]  p[1]  p[2]  ..
                                                                       Y     Y     Y

    m = 11

    Knuth-Morris-Pratt

    PseudoCode

    Search O(n)

    algorithm kmp_search:
        Input: an array of characters, S (the text to be searched)
               an array of characters, W (the word sought)
        Output: an integer (the zero-based position in S at which W is found)
        define variables:
            an integer, m ← 0 (the beginning of the current match in S)
            an integer, i ← 0 (the position of the current character in W)
            an array of integers, T (the table, computed elsewhere)
        while m + i is less than the length of S, do:
            if W[i] = S[m + i],
                let i ← i + 1
                if i equals the length of W,
                    return m
            otherwise,
                let m ← m + i - T[i],
                if i > 0,
                    let i ← T[i]
        (if the loop ends without returning, W does not occur in S)
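A direct Python transcription of the search loop might look like the sketch below. The table is passed in as a parameter because the pseudocode treats it as "computed elsewhere"; returning -1 for "not found" and the guard for an empty word are assumptions of this sketch. The table used in the usage lines is the one that corresponds to the word ABCDABD under the T[0] = -1 convention.

```python
def kmp_search(s, w, table):
    """Return the zero-based position of w in s, or -1 if w does not occur.

    table is the partial match table for w, with table[0] == -1.
    """
    if not w:
        return 0  # assumption: an empty word matches at position 0
    m = 0  # beginning of the current match in s
    i = 0  # position of the current character in w
    while m + i < len(s):
        if w[i] == s[m + i]:
            i += 1
            if i == len(w):
                return m
        else:
            # Skip ahead; characters already matched are not re-examined.
            m = m + i - table[i]
            if i > 0:
                i = table[i]
    return -1


table = [-1, 0, 0, 0, 0, 1, 2]  # partial match table for "ABCDABD"
print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD", table))  # 15
print(kmp_search("ABC ABCDAB ABC", "ABCDABD", table))           # -1
```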

    Knuth-Morris-Pratt

    PseudoCode

    Partial Match Table O(k) (k = length of W)

    algorithm kmp_table:
        input: an array of characters, W (the word to be analyzed)
               an array of integers, T (the table to be filled)
        define variables:
            an integer, i ← 2 (the current position we are computing in T)
            an integer, j ← 0 (the zero-based index in W of the next character of the current candidate substring)
        let T[0] ← -1, T[1] ← 0
        while i is less than the length of W, do:
            (first case: the substring continues)
            if W[i - 1] = W[j], let T[i] ← j + 1, i ← i + 1, j ← j + 1
            (second case: it doesn't, but we can fall back)
            otherwise, if j > 0, let j ← T[j]
            (third case: we have run out of candidates. Note j = 0)
            otherwise, let T[i] ← 0, i ← i + 1
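The table construction can be sketched in Python along the same lines. Returning the table as a new list (instead of filling an array passed in) and the handling of words shorter than two characters are choices made for this illustration.

```python
def kmp_table(w):
    """Build the partial match table for w, using the T[0] = -1 convention."""
    if not w:
        return []
    table = [0] * len(w)
    table[0] = -1
    i = 2  # current position being computed in the table
    j = 0  # index in w of the next character of the candidate substring
    while i < len(w):
        if w[i - 1] == w[j]:
            # First case: the substring continues.
            table[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            # Second case: it doesn't, but we can fall back.
            j = table[j]
        else:
            # Third case: we have run out of candidates (j == 0).
            table[i] = 0
            i += 1
    return table


print(kmp_table("ABCDABD"))  # [-1, 0, 0, 0, 0, 1, 2]
```

Reading the result for ABCDABD: a mismatch at p[6] after matching ABCDAB can resume at p[2], because the last two matched characters, AB, are also a prefix of the word.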

    Karp-Rabin

    Slower for Single pattern match

    Fast for Multiple pattern match

    Trick is Hash compare.

    Karp-Rabin

    (Example 1)

    t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
     A     C     D     E     A     C     A     C     C     D     E
     A     C     D     B
    p[0]  p[1]  p[2]  p[3]

    Hash(ACDB) = 5

    Hash(ACDE) = 10

    Karp-Rabin

    (Example 1)

    t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
     A     C     D     E     A     C     A     C     C     D     E
           A     C     D     B
          p[0]  p[1]  p[2]  p[3]

    Hash(ACDB) = 5

    Hash(CDEA) = 6

    Karp-Rabin

    (Caveats)

    Good Hashing function with few collisions

    Hashing result must be a small number for faster compare

    Rolling Hash

    S.T1

    Slide 19

    S.T1 Rolling Hash - Allows for faster recomputing of the hash. Instead of completely recomputing from scratch, make use of the fact that the window has moved by just one letter: subtract the contribution of the letter that left the window and add the contribution of the letter that entered it.

    Sabin Thomas, 4/11/2007
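One common way to get this behaviour is a polynomial hash taken modulo a fixed number. The sketch below assumes that scheme; the constants `BASE` and `MOD` and the function names are chosen for illustration only, and the resulting values will not match the small hash values (5, 10, 6) shown in the example slides, which come from an unspecified hash function.

```python
BASE = 256        # assumed alphabet size
MOD = 1_000_003   # assumed modulus, keeps hash values small


def poly_hash(s):
    """Hash a string from scratch: O(len(s)) work."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def roll(h, old_ch, new_ch, high):
    """Slide the window one letter to the right in O(1).

    high must equal BASE ** (window_length - 1) % MOD.
    """
    h = (h - ord(old_ch) * high) % MOD    # drop the letter that left
    return (h * BASE + ord(new_ch)) % MOD  # append the letter that entered


text = "ACDEACACCDE"
m = 4
high = pow(BASE, m - 1, MOD)
h = poly_hash(text[0:m])             # hash of "ACDE"
h = roll(h, text[0], text[m], high)  # now the hash of "CDEA"
print(h == poly_hash("CDEA"))        # True
```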

    Karp-Rabin

    (Pseudocode)

    algorithm RabinKarp:
        Input: an array of characters, s, length n
               an array of characters, sub, length m
        hsub := hash(sub[1..m])
        hs := hash(s[1..m])
        for i from 1 to n-m+1
            if hs = hsub
                if s[i..i+m-1] = sub
                    return i
            if i < n-m+1
                hs := hash(s[i+1..i+m])
        return not found
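A runnable Python sketch of this search is given below. It differs from the pseudocode in two ways that should be read as assumptions of the example: indices are zero-based, and the window hash is updated with a polynomial rolling hash (with illustrative `base` and `mod` defaults) instead of being recomputed for each window.

```python
def rabin_karp(s, sub, base=256, mod=1_000_003):
    """Return the zero-based index of the first occurrence of sub in s, or -1."""
    n, m = len(s), len(sub)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the leading character
    hsub = hs = 0
    for k in range(m):
        hsub = (hsub * base + ord(sub[k])) % mod
        hs = (hs * base + ord(s[k])) % mod
    for i in range(n - m + 1):
        # Compare characters only when the hashes agree; this guards
        # against hash collisions.
        if hs == hsub and s[i:i + m] == sub:
            return i
        if i < n - m:
            # Roll the hash: drop s[i], append s[i + m].
            hs = ((hs - ord(s[i]) * high) * base + ord(s[i + m])) % mod
    return -1


print(rabin_karp("ACDEACACCDE", "ACDB"))  # -1
print(rabin_karp("ACDEACACCDE", "CCDE"))  # 7
```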

    Karp-Rabin

    (Example 2)

    t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
     A     C     D     E     A     C     A     C     C     D     E
     A     C     D     B
    p[0]  p[1]  p[2]  p[3]
     F     M     D     T
    j[0]  j[1]  j[2]  j[3]

    Hash(ACDE) = 10

    Hash(ACDB) = 5

    Hash(FMDT) = 62

    S.T2

    Slide 21

    S.T2 Multiple Pattern Search

    Sabin Thomas, 4/11/2007

    Karp-Rabin

    (Example 2)

    t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
     A     C     D     E     A     C     A     C     C     D     E
           A     C     D     B
          p[0]  p[1]  p[2]  p[3]
           F     M     D     T
          j[0]  j[1]  j[2]  j[3]

    Hash(CDEA) = 6

    Hash(ACDB) = 5

    Hash(FMDT) = 62

    S.T3

    Slide 22

    S.T3 Hashes of the patterns have already been precomputed.

    Sabin Thomas, 4/11/2007
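The multiple pattern idea from Example 2 can be sketched as follows: hash every pattern once up front, then check each window hash of the text against that precomputed set. The sketch assumes all patterns have the same length and uses an illustrative polynomial rolling hash; the function name `rabin_karp_multi` and its return format are made up for this example.

```python
def rabin_karp_multi(text, patterns, base=256, mod=1_000_003):
    """Return [(index, pattern), ...] for every occurrence of any pattern.

    Assumption: all patterns share one length m.
    """
    patterns = set(patterns)
    if not patterns:
        return []
    lengths = {len(p) for p in patterns}
    if len(lengths) != 1:
        raise ValueError("this sketch requires equal-length patterns")
    m = lengths.pop()
    n = len(text)
    if m == 0 or m > n:
        return []

    def full_hash(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Precompute the pattern hashes once.
    by_hash = {}
    for p in patterns:
        by_hash.setdefault(full_hash(p), set()).add(p)

    high = pow(base, m - 1, mod)
    hs = full_hash(text[:m])
    hits = []
    for i in range(n - m + 1):
        candidates = by_hash.get(hs)
        if candidates:
            window = text[i:i + m]
            if window in candidates:  # verify, in case of a collision
                hits.append((i, window))
        if i < n - m:
            hs = ((hs - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits


print(rabin_karp_multi("ACDEACACCDE", ["ACDB", "FMDT"]))  # []
print(rabin_karp_multi("ACDEACACCDE", ["ACDE", "CCDE", "FMDT"]))
# [(0, 'ACDE'), (7, 'CCDE')]
```

The text is scanned once no matter how many patterns are supplied, which is where the advantage over running BM or KMP once per pattern comes from.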

    Karp-Rabin

    (Performance)

    Single Pattern

    BM O(n/m) (best case)

    KMP O(n)

    Karp-Rabin O(mn) (worst case)

    Multiple Pattern (k patterns)

    BM, KMP O(n k)

    Karp-Rabin O(n + k) (expected)

    (Applications)

    BM - Text Editors search/replace

    Karp-Rabin - Plagiarism finder

    Questions?