Question

In: Computer Science

Overview: Pattern-Matching (aka String Search) is the process of algorithmically finding copies of a pattern P...

Overview:

Pattern-Matching (aka String Search) is the process of algorithmically finding copies of a pattern P inside a (generally much larger) text T. The goal is to implement and compare classical string-matching algorithms.

Input:

Your code should work for any text either inputted directly or read in from a file.

However, for testing - input file has been provided:

  • The Gettysburg Address (by President Abraham Lincoln, 1863)

You should minimally search for these three patterns in each text: FREE, BRAVE, NATION.

You should convert the entire string to upper case and conduct your searches accordingly.

Methodology and Output:

  1. Implement the Brute Force Algorithm, along the lines

Given a text txt[0..n-1] and a pattern pat[0..m-1], a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.

Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] =  "AABAACAADAABAABA"
        pat[] =  "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 12

2 Implement the Knuth-Morris-Pratt Algorithm, along the lines

Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] =  "AABAACAADAABAABA"
        pat[] =  "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 12

(For each algorithm and text/pattern combination)

(Track and tabulate the number of character-comparison operations)

(Track and tabulate the clock time for each algorithm)

(Print the first position within the text where the pattern appears, or -1 if it does not appear )

Solutions

Expert Solution

// C++ program for implementation of KMP pattern searching 
// algorithm 
#include <bits/stdc++.h> 

void computeLPSArray(char* pat, int M, int* lps); 

// Prints occurrences of txt[] in pat[] 
void KMPSearch(char* pat, char* txt) 
{ 
        int M = strlen(pat); 
        int N = strlen(txt); 

        // create lps[] that will hold the longest prefix suffix 
        // values for pattern 
        int lps[M]; 

        // Preprocess the pattern (calculate lps[] array) 
        computeLPSArray(pat, M, lps); 

        int i = 0; // index for txt[] 
        int j = 0; // index for pat[] 
        while (i < N) { 
                if (pat[j] == txt[i]) { 
                        j++; 
                        i++; 
                } 

                if (j == M) { 
                        printf("Found pattern at index %d ", i - j); 
                        j = lps[j - 1]; 
                } 

                // mismatch after j matches 
                else if (i < N && pat[j] != txt[i]) { 
                        // Do not match lps[0..lps[j-1]] characters, 
                        // they will match anyway 
                        if (j != 0) 
                                j = lps[j - 1]; 
                        else
                                i = i + 1; 
                } 
        } 
} 

// Fills lps[] for given patttern pat[0..M-1] 
void computeLPSArray(char* pat, int M, int* lps) 
{ 
        // length of the previous longest prefix suffix 
        int len = 0; 

        lps[0] = 0; // lps[0] is always 0 

        // the loop calculates lps[i] for i = 1 to M-1 
        int i = 1; 
        while (i < M) { 
                if (pat[i] == pat[len]) { 
                        len++; 
                        lps[i] = len; 
                        i++; 
                } 
                else // (pat[i] != pat[len]) 
                { 
                        // This is tricky. Consider the example. 
                        // AAACAAAA and i = 7. The idea is similar 
                        // to search step. 
                        if (len != 0) { 
                                len = lps[len - 1]; 

                                // Also, note that we do not increment 
                                // i here 
                        } 
                        else // if (len == 0) 
                        { 
                                lps[i] = 0; 
                                i++; 
                        } 
                } 
        } 
} 

// Driver program to test above function 
int main() 
{ 
        char txt[] = "ABABDABACDABABCABAB"; 
        char pat[] = "ABABCABAB"; 
        KMPSearch(pat, txt); 
        return 0; 
} 
  • Output:
    Found pattern at index 10

Searching Algorithm: KMP (Knuth Morris Pratt)
Unlike Naive algorithm, where we slide the pattern by one and compare all characters at each shift, we use a value from lps[] to decide the next characters to be matched. The idea is to not match a character that we know will anyway match.

How to use lps[] to decide next positions (or to know a number of characters to be skipped)?

  • We start comparison of pat[j] with j = 0 with characters of current window of text.
  • We keep matching characters txt[i] and pat[j] and keep incrementing i and j while pat[j] and txt[i] keep matching.
  • When we see a mismatch
    • We know that characters pat[0..j-1] match with txt[i-j…i-1] (Note that j starts with 0 and increment it only when there is a match).
    • We also know (from above definition) that lps[j-1] is count of characters of pat[0…j-1] that are both proper prefix and suffix.
    • From above two points, we can conclude that we do not need to match these lps[j-1] characters with txt[i-j…i-1] because we know that these characters will anyway match. Let us consider above example to understand this.
 
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
lps[] = {0, 1, 2, 3} 

i = 0, j = 0
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 1, j = 1
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 2, j = 2
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
pat[i] and pat[j] match, do i++, j++

i = 3, j = 3
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 4, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3

Here unlike Naive algorithm, we do not match first three 
characters of this window. Value of lps[j-1] (in above 
step) gave us index of next character to match.
i = 4, j = 3
txt[] = "AAAAABAAABA" 
pat[] =  "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 5, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3

Again unlike Naive algorithm, we do not match first three 
characters of this window. Value of lps[j-1] (in above 
step) gave us index of next character to match.
i = 5, j = 3
txt[] = "AAAAABAAABA" 
pat[] =   "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[2] = 2

i = 5, j = 2
txt[] = "AAAAABAAABA" 
pat[] =    "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[1] = 1 

i = 5, j = 1
txt[] = "AAAAABAAABA" 
pat[] =     "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[0] = 0

i = 5, j = 0
txt[] = "AAAAABAAABA" 
pat[] =      "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.

i = 6, j = 0
txt[] = "AAAAABAAABA" 
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

i = 7, j = 1
txt[] = "AAAAABAAABA" 
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

We continue this way...

Related Solutions

ADVERTISEMENT
ADVERTISEMENT
ADVERTISEMENT