In: Computer Science
Purpose
This project is meant to give you experience with sorting, binary searching, and Big-Oh complexity.
Objective "Write in Java please"
Your goal is to take a book, as a large text file, and build a digital “concordance”. A concordance is like an index for a book in that it contains entries for words in the book, but it also includes passages from the book containing that word. For example, a query for the word Dormouse in Alice in Wonderland would produce results like:
14153
he poured a little hot tea upon its nose. The Dormouse shook its head impatiently, and said
14803
`Once upon a time there were three little sisters,' the Dormouse began in a great hurry; `and their names were Elsie,
where 14153 is the position of the word Dormouse in the file, and the passage includes the 10 words before and after Dormouse.
Approach
You should approach the project in four steps:
1. Open the book as a text file, identify each word, and save its position
2. Sort the words in the text file alphabetically using a merge sort
3. Remove duplicate words, saving only their indices
4. Allow the user to enter words, and return the passages where the words occur.
STEP 1: READ THE TEXT FILE AND NUMBER ALL THE WORDS
Consider a text file fox.txt containing:
the quick brown fox jumps over the lazy dog
Generate a new file fox.txt_words.txt that reads as follows:
9
the 0
quick 1
brown 2
fox 3
jumps 4
over 5
the 6
lazy 7
dog 8
where the first number, 9, is the total number of words, and what follows is each word and its placement in the file.
Suggested approach
1. Read the text file and count the words but don’t save them.
2. Make an array to hold the words.txt
3. Make a new FileReader and Scanner object, read the words from the text file into that array
4. Write the number of words to a new text file, then step through the array and write each word and its index to that file.
STEP 2: SORT THE WORDS ALPHABETICALLY
Starting with the text file fox.txt_words.txt, make a new text file fox.txt_sorted.txt . That file should contain the same words as the previous file, but this time the words are in alphabetical order. This is what the file should look like:
9
brown 2
dog 8
fox 3
jumps 4
lazy 7
over 5
quick 1
the 0
the 6
You may try this initially with a bubble sort. However, to get full credit for this step, you must use a merge sort to sort the words.
Approach:
1. Make a filereader/scanner for words.txt, read the number of words, and make two arrays: a string array to hold the words, and an int array to hold the indices.
2. Use the merge sort algorithm we learned in class to sort the string array. Make sure when you adjust the string array that you also adjust the int array, preserving the index of each word
3. Make a printwriter and write the words and indices to a new file fox.txt_sorted.txt.
STEP 3: REMOVE DUPLICATE WORDS
Now use fox.txt_sorted.txt to make a fourth file fox.txt_index.txt that reads like this:
brown 2
dog 8
fox 3
jumps 4
lazy 7
over 5
quick 1
the 0 6
Where 8 refers to the number of unique words, and each word is followed by one or more indices.
Approach:
1. Make a filereader and printwriter
2. Repeat the following for each word:
3. Read a word and an index
4. Check if that word is the same as a the previous word read from the input file
5. If no, write that word and its index to the output file
6. If yes, just write the index to the output file
STEP 4: QUERY AND SEARCH
Your program should now prompt the user for a word, find the word, and return its indices. For example:
Enter a word: the
the: 0 6
Enter a word: lazy
lazy: 7
Enter a word: dormouse
Can’t find dormouse in the text
You must use a binary search, like we did in class, to locate the word.
Your program should continue to query and search until the user closes the program.
Approach:
1. Make an filereader/scanner for fox.txt_index.txt
2. Read in each word and its indices. You may want to use the nextLine function to read all the indices as a single string, in which case you’ll have two arrays: string words[] and string indices[].
3. Prompt the user for a word.
4. Use a binary search to find the matching word.
5. Print out its indices if it exists or an error if it doesn’t
6. Repeat steps 3 and 4 forever
STEP 5: PRINT OUT THE SURROUNDING WORDS
Modify your code in step 4 so that, for each index, it prints out text from the original file. You should print out the previous two words, the query word, and the subsequent two words. For example:
Enter a word: the
0: the quick brown
6: jumps over the lazy dog
Enter a word: lazy
7: over the lazy dog
Enter a word: dormouse
Can’t find dormouse in the text
Approach:
Modify your code from Step 4:
1. After reading from fox.txt_index.txt, make a second array string originaltext[] and read in the original fox.txt word-by-word.
2. After querying the user and searching for the word, get its index entry and:
3. Make an istringstream object to parse the index string
4. Read a number index from that string: an index where that word occurs
5. Make a for loop, starting at i=index-2 and going to i<=index+2
6. Print originaltext[i] followed by a space
7. Repeat steps 4 – 7 until you have printed all the indices.
STEP 6: TIMING
1.
Go to Project Gutenberg and download three books: a short book, a medium book, and a long book. For example, Alice in Wonderland’s page is
https://www.gutenberg.org/ebooks/11
and you should download the Plain Text as alice.txt.
I recommend Alice in Wonderland or the Wizard of Oz as a short book, Great Expectations as a medium book, and War and Peace or the King James Bible as a long book.
2.
Use your STEP 1 code and record the total number of words in each book.
3.
Use System.currentTimeMillis to measure the elapsed time for STEP 1 on each book. Depending on the speed of your computer, you may or may not get non-zero values. If you do, do you see a O(n) runtime? (Use the approach below).
4.
Now measure the merge sort times in STEP 2. Unless your computer is really fast, you ought to see a measurable delay. Run the program a few times on each book, throw out any outliers, and average the runtimes for each book.
Since merge-sort is O(n log n), the runtime should be
runtime = k * #words * log(#words)
Try computing k for each book:
k = runtime / ( #words * log(#words)) the log base shouldn’t matter.
Are all three k’s approximately the same?
5.
If you have a really slow computer, you might be able to time the queries and see if they have a O(log n) time. Don’t bother doing this if they all have zero time.
Grading
Grading for this project will be as follows:
You complete STEP 1 correctly 50% (F+)
You complete STEP 2, but use a sort that isn’t merge sort 60% (D-)
You complete STEP 2 using merge sort 70% (C-)
You complete STEP 3 75% (C)
You complete STEP 4 85% (B)
You complete STEP 5 95% (A)
You measure the timings in STEP 6 100% (A+)
No credit will be given for code that doesn’t compile. If you only partially complete a step, comment your work on the step out before submitting.
Basic Requirements
Working code implemented in Java and appropriate comments provided for better understanding:
Source code for Main.java:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;
public class Main {
public static void main(String[] args) {
Scanner in = new
Scanner(System.in);
boolean fileFound = false;
String fileName = null;
// prompt user for filename, will
keep asking until user gives valid file name
while (fileFound == false) {
System.out.println("What is the name of the file?");
fileName =
in.nextLine();
if ( new
File(fileName).exists()) {
fileFound = true;
} else {
System.out.println("File does not
exist.");
}
}
// open file now, also makes the
word file
int wordCount =
makeWordsFile(fileName);
// this sorts and created the
sorted file
alphaSort(fileName);
// this removes the
duplicates
int numUniqueWords =
removeDuplicates(fileName);
// this does the search and
print entries.
queryAndSearch(numUniqueWords,
fileName, wordCount);
in.close();
}
public static int makeWordsFile(String fileName)
{
int wordCount = 0; // tracks word
count
// try catch in case file does not
exist
try {
FileReader
fReader = new FileReader(fileName); // open reader
Scanner in = new
Scanner(fReader);
long time =
System.currentTimeMillis(); // start time
while
(in.hasNext() == true) {
// only counts word if it is not all
punctuation
if (in.next().replaceAll("\\p{Punct}",
"").equals("") == false) {
wordCount++;
}
}
in.close();
time =
System.currentTimeMillis() - time; // record elapsed time
System.out.println("The elapsed time to count words is " + time + "
ms.");
System.out.println("The number of words is " + wordCount);
try {
fReader.close();
} catch
(IOException e) {
e.printStackTrace();
}
} catch (FileNotFoundException ex)
{
ex.printStackTrace();
}
PrintWriter p;
// keeps track of current word
location
int currentNum = 0;
try {
// create word
file
p = new
PrintWriter(fileName.concat("_words.txt"));
FileReader
fReader = new FileReader(fileName);
Scanner in = new
Scanner(fReader);
// puts # of
words in the file
p.println(wordCount);
// continues if
their is another token available
while
(in.hasNext() == true) {
String current = in.next();
// print word to file only if the word is not
all punctuation
if (!current.replaceAll("\\p{Punct}",
"").toLowerCase().equals("")) {
p.println(current.replaceAll("\\p{Punct}", "").toLowerCase() + " "
+ currentNum);
currentNum++;
}
}
p.close(); //
close printer and scanner
in.close();
} catch (FileNotFoundException e)
{
// you don't
want to do make this empty because if a error happens here you
won't know what happened
e.printStackTrace();
}
return currentNum;
}
// creates file and sorts words
public static String alphaSort(String fileName)
{
int numWords = 0;
String[] words = null;
try {
// opens word
file
Scanner fileScan
= new Scanner(new FileReader(fileName.concat("_words.txt")));
// receives the
number of words, moves line cursor, parse string to int
numWords =
Integer.parseInt(fileScan.nextLine());
words = new
String[numWords];
for (int i = 0;
i < numWords; i++) { // collect entries
words[i] = fileScan.nextLine();
}
fileScan.close();
} catch (FileNotFoundException e)
{
e.printStackTrace();
}
long time =
System.currentTimeMillis(); // start time
words = mergeSort(words); // sort
entries
time = System.currentTimeMillis() -
time; // print elapsed time
System.out.println("Elapsed time to
organize words alphabetically is " + time + " ms");
// write entries to file
try {
PrintWriter pr2
= new PrintWriter(fileName.concat("_sorted.txt"));
// print # of
words
pr2.println(words.length);
// print word +
index to line
for (int i = 0;
i < numWords; i++) {
pr2.println(words[i]);
}
pr2.close();
} catch (FileNotFoundException e)
{
e.printStackTrace();
}
return
fileName.concat("_sorted.txt");
}
// merge sort
public static String[] mergeSort(String[] strs)
{
// base case if array is size
1
if (strs.length <=1)
return
strs;
// splits array into 2 halves
String[] left = new
String[strs.length/2];
String[] right = new
String[strs.length - strs.length/2];
// make two list and copy left and
right half
for (int i=0; i<strs.length/2;
i++) {
left[i] =
strs[i];
}
for (int i=0; i< strs.length -
strs.length/2; i++) {
right[i] =
strs[i + strs.length/2];
}
left = mergeSort(left); // assume
merge sort gives assorted list
right = mergeSort(right);
// both left and right are
sorted
strs = merge(left, right);
return strs;
}
public static String[] merge(String[] left, String[]
right) {
// make a place to hold the merged
list
String[] sorted = new
String[left.length + right.length];
int l = 0; // left index
int r = 0; // right index
int s = 0; // index in output
list
// keep going until one runs
out
while (l<left.length &&
r<right.length) {
if
(left[l].compareToIgnoreCase(right[r]) <= 0) { // if low element
in left list is smaller
sorted[s] = left[l]; // copy it down
l++;
s++;
} else {
sorted[s] = right[r]; // copy it down
r++;
s++;
}
}
// empty out any remaining in
left
while (l<left.length) {
sorted[s] =
left[l];
s++;
l++;
}
// empty out any remaining in
right
while (r<right.length) {
sorted[s] =
right[r];
s++;
r++;
}
return sorted;
}
public static int removeDuplicates(String fileName)
{
int numUniqueWords = 0;
// open file
try {
Scanner in = new
Scanner(new FileReader(fileName.concat("_sorted.txt")));
PrintWriter pr =
new PrintWriter(fileName.concat("_index.txt"));
// keeps track
of last word
String lastWord
= "";
int numWords
= in.nextInt();
// receive 1st
string
for (int i =
0; i < numWords; i++) {
String word = in.next();
int index = in.nextInt();
// if this word is not the last word then make
newline
if (!lastWord.equals(word) && i > 0)
{
pr.println();
}
// if this word is the last word, then just
print index
if (lastWord.equals(word)) {
pr.print(" " + index);
} else {
// print word, does not make
new line mark
lastWord = word;
pr.print(word + " " +
index);
// keeps track of unique
words, increase unique words by one
numUniqueWords++;
}
}
pr.close();
in.close();
} catch (FileNotFoundException e)
{
e.printStackTrace();
}
return numUniqueWords;
}
public static void queryAndSearch(int numUniqueWords, String fileName, int numWords) {
// receive entries
String[] words = new
String[numUniqueWords];
String[] indexs = new
String[numUniqueWords];
// open file, and go through each
line and store word and index into two separate arrays
try {
String
storeLine;
String[]
brokenLine; // breaks line into array , holds word and index
Scanner scan =
new Scanner( new FileReader(fileName.concat("_index.txt")));
for (int i = 0;
i < numUniqueWords; i++) {
storeLine = scan.nextLine();
brokenLine = storeLine.split(" ", 2);
// stores word and index into appropriate
position in array
words[i] = brokenLine[0];
indexs[i] = brokenLine[1];
}
scan.close();
} catch (FileNotFoundException e)
{
e.printStackTrace();
}
Scanner in = new Scanner(System.in);
boolean status = true;
// keep asking forever until
program is close
while (status == true) {
System.out.println("Enter the word. ");
// does binary
search, then receive result
int result =
binarySearch(words, in.nextLine());
if (result ==
-1) {
// if search is not in book then this
happens
System.out.println("Sorry this word is not in
the concordance.");
} else {
// if search is in array then print
results
System.out.print(words[result] + ": ");
// formats then results
organizePrintLn(indexs[result]);
// print all entries
printEntries(fileName, indexs[result],
numWords);
}
}
in.close();
}
public static void organizePrintLn(String str) {
String[] strings = str.split("
");
int currentLetters = 0;
for (String s : strings) {
currentLetters
+= s.length() + 1;
// if current
line has printed words is about 100 length then it will make next
line
// if not it
will print on same line
if
(currentLetters < 100) {
System.out.print(s + " ");
} else {
System.out.println(s + " ");
currentLetters = 0;
}
}
System.out.println();
}
public static int binarySearch(String[] array,
String searchInput) {
int leftBound = 0; // left index
window
int rightBound = array.length; //
right index window
while (leftBound <=
rightBound) {
int midPoint =
leftBound + (rightBound - leftBound)/2; // midpoint calculate
if
(array[midPoint].compareToIgnoreCase(searchInput) == 0) {
return midPoint;
// if is alphabetically lower
} else if
(array[midPoint].compareToIgnoreCase(searchInput) < 0) {
leftBound = midPoint +1;
// if is alphabetically higher
} else {
rightBound = midPoint - 1;
}
}
// if not found
return -1;
}
public static void printEntries(String fileName,
String indexs, int numWords) {
String[] originalText = new
String[numWords];
try {
FileReader fr =
new FileReader(fileName);
Scanner in = new
Scanner(fr);
// will put word
into array only if it is not all punctuation
// will add all
valid words to array
for (int i = 0;
i < numWords; i++) {
boolean check = true;
String currentWord;
// will check until it finds valid word
while (check == true) {
currentWord =
in.next().replaceAll("\\p{Punct}", "");
if (!currentWord.equals(""))
{ // checks if current word is not all punctuation, if so add to
array
originalText[i] = currentWord;
check =
false;
}
}
}
in.close();
} catch (FileNotFoundException
e) {
e.printStackTrace();
}
// takes string indexs
String[] stringIndex =
indexs.split(" ");
int numIndexs =
stringIndex.length;
int[] indexInt = new
int[numIndexs];
// turn index string to
numbers
for (int i = 0 ; i < numIndexs;
i++) {
indexInt[i] =
Integer.parseInt(stringIndex[i]);
}
// for all indexs in the
list
for (int index : indexInt ) {
int leftBound =
index; // leftindex
int rightBound =
index; // rightindex
// checks if
left bound is too close to beginning, if so the leftbound is
beginning of file
if ((leftBound -
10) < 0) {
leftBound = 0;
} else {
leftBound -= 10;
}
// checks is
right bound is too close to end of text file. if so the rightbound
is end of file
if ((rightBound
+ 11) > originalText.length) {
rightBound = originalText.length;
} else {
rightBound += 11;
}
System.out.print(index + ": ");
// prints
word from the leftbound to the rightbound
for (int i =
leftBound; i < rightBound; i++) {
System.out.print(" " + originalText[i]);
}
System.out.println();
}
}
}
Sample Output Screenshots:
Hope it helps, if you like the answer give it a thumbs up. Thank you.