In: Computer Science
This program focuses on programming with Java Collections classes. You will implement a module that finds a simplified Levenshtein distance between two words represented by strings.
Your program will need support files LevenDistanceFinder.Java, dictionary.txt, while you will be implementing your code in the LevenDistanceFinder.java file.
INSTRUCTIONS
The Levenshtein distance, named for it's creator Vladimir Levenshtein, is a measure of the distance between two words. The edit distance between two strings is the minimum number of operations that are needed to transform one sting into the other. For a full implementation of the distance we would need to account for three different types of operations, removing a letter, adding a letter, or changing a letter.
For THIS program however, we are going to focus solely on the ability to change one letter to another. So, we will be thinking of changing one letter at a time, while maintaining the fact that the word is a still a valid word at all times.
So, for an example, the edit distance from "dog" to "cat" is 3. One possible path for this is
"dog" to "dot" to "cot" to cat"
Note, that there might be other paths, but they all involve at least 3 changes. Another longer example is from "witch" to "coven".
witch->watch->match->march->marcs->mares->mores->moves->coves->coven
You will be implementing a secondary class called LevenshteinFinder that will be used by the main program to solve the problem.
You are given a client program LevenDistanceFinder.java that does the file processing and user interaction. LevenDistanceFinder.java reads a dictionary text file as input and passes its entire contents to you as a list of strings. You are to write a class called LevenshteinFinder that will manage the actual work of finding the distance and the path.
The class will probably have the following necessary private fields:
Starting and Ending Strings
A map of words to a set of neighbors.
An integer distance that starts at -1.
A List of strings that is the path itself.
PUBLIC LEVENSHTEINFINDER(STRING, STRING, SET<STRING> )
Your constructor is passed a string which is the starting word of the path. A string that is the ending word of the path, and a collection of words as a set. It should use these values to initialize the class. The set of words will initially be all words from the dictionary.
You should throw an IllegalArgumentException if length of the two words is not the same.
You will want to store the starting and stopping words for later use, but you should not need to save the words list. First you will want to pull out only the words that are of the correct size. Then from this smaller list, what you will want to do is to create a map of words to their "neighbor" words. Start this by going through every word and add it and an empty set, to a map.
You will then go through the set a second time and check every word to every other word. If they are "neighbors" you will add each word to the set that goes with the other word. So if stringA and stringB are neigbors, then you add stringA to stringB's set, and vice versa.
Once the neighbor map is created, you will run the findDistacnce method and the findPath method to save those values for future delivery.
Note that this is all done in the constructor, so that all the work is done when it is “newed”/created.
PRIVATE INT DIFFERENTLETTERS(STRING A, STRING B)
This method should take two strings and find the number of letters that are different from each other and return that number.
PUBLIC INT GETDISTANCE()
This method is called by the main program to get the distance between the two words. Note that you found this in a different method. So this should just return a saved value. If it is longer than 2 lines, you are doing it wrong.
PUBLIC STRING GETPATH()
This method returns a string that is the path from the first word to the second word, assuming it exists. If the path distance is negative one, then return back an error message, if the path distance is zero or higher, then take the pathArray and convert it to a string with arrows in between.
Example: love->lave->late->hate
PRIVATE INT FINDDISTANCE(STRING A, STRING B)
This method finds the distance between two strings. (Note, only the distance, not the path itself). Start by creating two empty sets. Add the staring word to the second set. Set a counter to zero. Then while the sets are different sizes and the 2nd set does not contain the ending word, add everything from set 2 into set one, clear the 2nd set, and then take every word from set1 AND the neighbor of every word from set1 into set 2. Increment the loop counter.
When the loop finishes, if the 2nd set contains the final word return the counter because it was found, if the 2nd set doesn't contain the word, return -1 because there is no path.
PRIVATE VOID FINDPATH(STRING A, STRING B)
This method will find the path between two words. It should probably be the last method you work on. In fact I would create it, and have it do nothing until you are ready for it.
When running this method should only be run after findDistance() so that the distance has already been found and stored in a private int value.
When you are ready for it, this method should do the
following.
Initialize the class path List to a new empty List.
Check the distance, if it is negative, store an error message in
the path List and then exit the method. If the distance is zero or
positive add the first element to the list.
Now in a loop that starts at the distance minus 1, and goes down until 1, look at the set of neighbors of the word in the last box of the list. Find one that has a distance to the ending word that matches the current loop counter. There may be multiples, but there has to be at least one. Find this word, and add it to the list.
Now repeat the loop until the for loop quits. Then add the
ending word to the list. You are done.
Here is an example for the path from love -> hate.
The distance from love to hate is 3, so that is bigger than -1. So add love to the list. The list is now size one. Now start your loop from distance -1 (2) to 1. So i is currently 2. Find any word in the neighbor of love, that has a distance to hate of size 2. Use your findDistance method here!. One of those words is "lave". So add that to the array.
Next round, i is one. The list is now [love, lave]. So find a neighbor of "lave" that has a distance to "hate" of one. One possible word is "late". So add "late" to the list which now looks like [love, lave, late]
That should finish the loop, add "hate" to the list. The final list looks like [love, lave, late, hate]. Store that list in the class field. And you are done.
Your program should exactly reproduce the format and general behavior demonstrated on the previous pages.
You may assume that the list of words passed to your constructor will be a nonempty list of nonempty strings composed entirely of lowercase letters.
Use the TreeMaps, TreeSets and ArrayLists implementations for all of your ADT's.
HINTS
If you have not had to deal with exceptions before the proper way to throw and exception is the following: throw new IlleagalStateException() or IllegalArgumentEXception() as appropriate. This will end your method and return back to the primary program.
Given two strings str1 and str2 and below operations that can performed on str1. Find minimum number of edits (operations) required to convert ‘str1’ into ‘str2’.
1. Insert
2. Remove
3. Replace
All of the above operations are of equal cost.
What are the subproblems in this case?
The idea is process all characters one by one staring from either
from left or right sides of both strings.
Let us traverse from right corner, there are two possibilities for
every pair of character being traversed.
m: Length of str1 (first string) n: Length of str2 (second string)
1. If last characters of two strings are same, nothing much to do. Ignore last characters and get count for remaining strings. So we recur for lengths m-1 and n-1.
2. Else (If last characters are not same), we consider all operations on ‘str1’, consider all three operations on last character of first string, recursively compute minimum cost for all three operations and take minimum of three values.
1. Insert: Recur for m and n-1
2. Remove: Recur for m-1 and n
3. Replace: Recur for m-1 and n-1
Below is the java code:
// A Naive recursive Java program to find minimum number
// operations to convert str1 to str2
class EDIST {
static int min(int x, int y, int z)
{
if (x <= y && x <= z)
return x;
if (y <= x && y <= z)
return y;
else
return z;
}
static int editDist(String str1, String str2, int m, int n)
{
// If first string is empty, the only option is to
// insert all characters of second string into first
if (m == 0)
return n;
// If second string is empty, the only option is to
// remove all characters of first string
if (n == 0)
return m;
// If last characters of two strings are same, nothing
// much to do. Ignore last characters and get count for
// remaining strings.
if (str1.charAt(m - 1) == str2.charAt(n - 1))
return editDist(str1, str2, m - 1, n - 1);
// If last characters are not same, consider all three
// operations on last character of first string, recursively
// compute minimum cost for all three operations and take
// minimum of three values.
return 1 + min(editDist(str1, str2, m, n - 1), // Insert
editDist(str1, str2, m - 1, n), // Remove
editDist(str1, str2, m - 1, n - 1) // Replace
);
}
public static void main(String args[])
{
String str1 = "sunday";
String str2 = "saturday";
System.out.println(editDist(str1, str2, str1.length(), str2.length()));
}
}
Output:
3
The time complexity of above solution is exponential. In worst case, we may end up doing O(3m) operations. The worst case happens when none of characters of two strings match. Below is a recursive call diagram for worst case.
We can see that many subproblems are solved, again and again, for example, eD(2, 2) is called three times. Since same suproblems are called again, this problem has Overlapping Subprolems property. So Edit Distance problem has both properties (see this and this) of a dynamic programming problem. Like other typical Dynamic Programming(DP) problems, recomputations of same subproblems can be avoided by constructing a temporary array that stores results of subproblems.
Code in java:
// A Dynamic Programming based Java program to find
minimum
// number operations to convert str1 to str2
class
EDIST {
static
int
min(
int
x,
int
y,
int
z)
{
if
(x <= y && x <= z)
return
x;
if
(y <= x && y <= z)
return
y;
else
return
z;
}
static
int
editDistDP(String str1, String str2,
int
m,
int
n)
{
//
Create a table to store results of subproblems
int
dp[][] =
new
int
[m
+
1
][n +
1
];
//
Fill d[][] in bottom up manner
for
(
int
i =
0
; i <= m; i++) {
for
(
int
j =
0
; j <= n; j++) {
//
If first string is empty, only option is to
//
insert all characters of second string
if
(i ==
0
)
dp[i][j]
= j;
// Min. operations = j
//
If second string is empty, only option is to
//
remove all characters of second string
else
if
(j ==
0
)
dp[i][j]
= i;
// Min. operations = i
//
If last characters are same, ignore last char
//
and recur for remaining string
else
if
(str1.charAt(i -
1
)
== str2.charAt(j -
1
))
dp[i][j]
= dp[i -
1
][j -
1
];
//
If the last character is different, consider all
//
possibilities and find the minimum
else
dp[i][j]
=
1
+ min(dp[i][j -
1
],
// Insert
dp[i
-
1
][j],
//
Remove
dp[i
-
1
][j -
1
]);
// Replace
}
}
return
dp[m][n];
}
public
static
void
main(String
args[])
{
String
str1 =
"sunday"
;
String
str2 =
"saturday"
;
System.out.println(editDistDP(str1,
str2, str1.length(), str2.length()));
}
}
3
Time Complexity: O(m x n)
Auxiliary Space: O(m x n)
Space Complex Solution: In the above-given method we require O(m x n) space. This will not be suitable if the length of strings is greater than 2000 as it can only create 2D array of 2000 x 2000. To fill a row in DP array we require only one row the upper row. For example, if we are filling the i = 10 rows in DP array we require only values of 9th row. So we simply create a DP array of 2 x str1 length. This approach reduces the space complexity. Here is the C++ implementation of the above-mentioned problem.
// A Space efficient Dynamic Programming
// based C++ program to find minimum
// number operations to convert str1 to str2
#include <bits/stdc++.h>
using
namespace
std;
void
EditDistDP(string str1, string
str2)
{
int
len1
= str1.length();
int
len2
= str2.length();
// Create a DP array
to memoize result
// of previous
computations
int
DP[2][len1 + 1];
// To fill the DP
array with 0
memset
(DP,
0,
sizeof
DP);
// Base condition
when second string
// is empty then we
remove all characters
for
(
int
i = 0; i <= len1;
i++)
DP[0][i]
= i;
// Start filling the
DP
// This loop run for
every
// character in
second string
for
(
int
i = 1; i <= len2; i++)
{
//
This loop compares the char from
//
second string with first string
//
characters
for
(
int
j = 0; j <= len1; j++)
{
//
if first string is empty then
//
we have to perform add character
//
operation to get second string
if
(j == 0)
DP[i
% 2][j] = i;
//
if character from both string
//
is same then we do not perform any
//
operation . here i % 2 is for bound
//
the row number.
else
if
(str1[j - 1] == str2[i - 1]) {
DP[i
% 2][j] = DP[(i - 1) % 2][j - 1];
}
//
if character from both string is
//
not same then we take the minimum
//
from three specified operation
else
{
DP[i
% 2][j] = 1 + min(DP[(i - 1) % 2][j],
min(DP[i
% 2][j - 1],
DP[(i
- 1) % 2][j - 1]));
}
}
}
// after complete
fill the DP array
// if the len2 is
even then we end
// up in the 0th row
else we end up
// in the 1th row so
we take len2 % 2
// to get
row
cout << DP[len2
% 2][len1] << endl;
}
// Driver program
int
main()
{
string str1 =
"food"
;
string str2 =
"money"
;
EditDistDP(str1,
str2);
return
0;
}
Output:
4
Time Complexity: O(m x n)
Auxiliary Space: O( m )