Question

In: Computer Science

Overview: Pattern-Matching (aka String Search) is the process of algorithmically finding copies of a pattern P...

Overview:

Pattern-Matching (aka String Search) is the process of algorithmically finding copies of a pattern P inside a (generally much larger) text T. The goal is to implement and compare classical string-matching algorithms.

Input:

Your code should work for any text either inputted directly or read in from a file.

However, for testing - input file has been provided:

  • The Gettysburg Address (by President Abraham Lincoln, 1863)

You should minimally search for these three patterns in each text: FREE, BRAVE, NATION.

You should convert the entire string to upper case and conduct your searches accordingly.

Methodology and Output:

  1. Implement the Brute Force Algorithm, along the lines

Given a text txt[0..n-1] and a pattern pat[0..m-1], a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.

Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] =  "AABAACAADAABAABA"
        pat[] =  "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 12

2 Implement the Knuth-Morris-Pratt Algorithm, along the lines

Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] =  "AABAACAADAABAABA"
        pat[] =  "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 12

(For each algorithm and text/pattern combination)

(Track and tabulate the number of character-comparison operations)

(Track and tabulate the clock time for each algorithm)

(Print the first position within the text where the pattern appears, or -1 if it does not appear )

Solutions

Expert Solution

// C++ program for implementation of KMP pattern searching 
// algorithm 
#include <bits/stdc++.h> 

void computeLPSArray(char* pat, int M, int* lps); 

// Prints occurrences of txt[] in pat[] 
void KMPSearch(char* pat, char* txt) 
{ 
        int M = strlen(pat); 
        int N = strlen(txt); 

        // create lps[] that will hold the longest prefix suffix 
        // values for pattern 
        int lps[M]; 

        // Preprocess the pattern (calculate lps[] array) 
        computeLPSArray(pat, M, lps); 

        int i = 0; // index for txt[] 
        int j = 0; // index for pat[] 
        while (i < N) { 
                if (pat[j] == txt[i]) { 
                        j++; 
                        i++; 
                } 

                if (j == M) { 
                        printf("Found pattern at index %d ", i - j); 
                        j = lps[j - 1]; 
                } 

                // mismatch after j matches 
                else if (i < N && pat[j] != txt[i]) { 
                        // Do not match lps[0..lps[j-1]] characters, 
                        // they will match anyway 
                        if (j != 0) 
                                j = lps[j - 1]; 
                        else
                                i = i + 1; 
                } 
        } 
} 

// Fills lps[] for given patttern pat[0..M-1] 
void computeLPSArray(char* pat, int M, int* lps) 
{ 
        // length of the previous longest prefix suffix 
        int len = 0; 

        lps[0] = 0; // lps[0] is always 0 

        // the loop calculates lps[i] for i = 1 to M-1 
        int i = 1; 
        while (i < M) { 
                if (pat[i] == pat[len]) { 
                        len++; 
                        lps[i] = len; 
                        i++; 
                } 
                else // (pat[i] != pat[len]) 
                { 
                        // This is tricky. Consider the example. 
                        // AAACAAAA and i = 7. The idea is similar 
                        // to search step. 
                        if (len != 0) { 
                                len = lps[len - 1]; 

                                // Also, note that we do not increment 
                                // i here 
                        } 
                        else // if (len == 0) 
                        { 
                                lps[i] = 0; 
                                i++; 
                        } 
                } 
        } 
} 

// Driver program to test above function 
int main() 
{ 
        char txt[] = "ABABDABACDABABCABAB"; 
        char pat[] = "ABABCABAB"; 
        KMPSearch(pat, txt); 
        return 0; 
} 
  • Output:
    Found pattern at index 10

Searching Algorithm: KMP (Knuth Morris Pratt)
Unlike Naive algorithm, where we slide the pattern by one and compare all characters at each shift, we use a value from lps[] to decide the next characters to be matched. The idea is to not match a character that we know will anyway match.

How to use lps[] to decide next positions (or to know a number of characters to be skipped)?

  • We start comparison of pat[j] with j = 0 with characters of current window of text.
  • We keep matching characters txt[i] and pat[j] and keep incrementing i and j while pat[j] and txt[i] keep matching.
  • When we see a mismatch
    • We know that characters pat[0..j-1] match with txt[i-j…i-1] (Note that j starts with 0 and increment it only when there is a match).
    • We also know (from above definition) that lps[j-1] is count of characters of pat[0…j-1] that are both proper prefix and suffix.
    • From above two points, we can conclude that we do not need to match these lps[j-1] characters with txt[i-j…i-1] because we know that these characters will anyway match. Let us consider above example to understand this.
 
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
lps[] = {0, 1, 2, 3} 

i = 0, j = 0
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 1, j = 1
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 2, j = 2
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
pat[i] and pat[j] match, do i++, j++

i = 3, j = 3
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 4, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3

Here unlike Naive algorithm, we do not match first three 
characters of this window. Value of lps[j-1] (in above 
step) gave us index of next character to match.
i = 4, j = 3
txt[] = "AAAAABAAABA" 
pat[] =  "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 5, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3

Again unlike Naive algorithm, we do not match first three 
characters of this window. Value of lps[j-1] (in above 
step) gave us index of next character to match.
i = 5, j = 3
txt[] = "AAAAABAAABA" 
pat[] =   "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[2] = 2

i = 5, j = 2
txt[] = "AAAAABAAABA" 
pat[] =    "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[1] = 1 

i = 5, j = 1
txt[] = "AAAAABAAABA" 
pat[] =     "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[0] = 0

i = 5, j = 0
txt[] = "AAAAABAAABA" 
pat[] =      "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.

i = 6, j = 0
txt[] = "AAAAABAAABA" 
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

i = 7, j = 1
txt[] = "AAAAABAAABA" 
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

We continue this way...

Related Solutions

Using the following text string: axcbramneltodaydy Search for the following pattern: 'today' Specify whether or not...
Using the following text string: axcbramneltodaydy Search for the following pattern: 'today' Specify whether or not you found the pattern. The minimum requirement is to use the string STL, the brute force method and hard code both the text and the pattern. For extra credit, you can add any of the following: 1. Do not use the string STL. 2. Allow me to enter both the text string and the pattern, including spaces. You can designate a special end-of-text character...
ADVERTISEMENT
ADVERTISEMENT
ADVERTISEMENT