Write a simple Java document retrieval program that first asks the user to enter a single-term query, then goes through two documents named doc1.txt and doc2.txt (provided with assignment7) to find which document is more relevant by calculating the frequency of the query term in each document. Output the conclusion to a file named asmnt7output.txt in the following format (shown here for the queries “java” and “problem”). The percentages in parentheses are the respective supporting frequencies.
java: doc1(6.37%) is more relevant than doc2(0.00%)
problem: doc2(0.41%) is more relevant than doc1(0.17%)
Your code should keep asking for a query until the user enters -1.
Matching of the query term and the document terms should be case
insensitive. In particular, if the query term is a prefix of a
document term ignoring the case, it should also be considered
matching (e.g. “Computer” matches “computers”, “Network” matches
“netWorking”, etc).
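The matching rule above can be sketched in isolation. This is a minimal illustration, not part of the assignment code: the class name PrefixMatchDemo and the method matches() are hypothetical names chosen for the example. It relies on String.regionMatches with ignoreCase set to true, which compares only the first query.length() characters of the document term.

```java
// A minimal sketch of the matching rule; PrefixMatchDemo and matches() are
// hypothetical names used for illustration only.
class PrefixMatchDemo {
    // True when query is a case-insensitive prefix of term
    // (an exact match is the case where the prefix is the whole term).
    static boolean matches(String query, String term) {
        // regionMatches returns false when term is shorter than query,
        // so no separate length check is needed.
        return term.regionMatches(true, 0, query, 0, query.length());
    }

    public static void main(String[] args) {
        System.out.println(matches("Computer", "computers")); // true
        System.out.println(matches("Network", "netWorking")); // true
        System.out.println(matches("computers", "Computer")); // false
    }
}
```

Note that the rule is one-directional: the query must be a prefix of the document term, not the other way around.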
The Java document retrieval program is given below; the file is named DocQuery.java. Every method is commented. Build it any way you like, but make sure the input files doc1.txt and doc2.txt are in the directory the program is run from, because they are opened by relative name. If you have any related questions, please leave a comment and I will get back to you.
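If the program runs but reports unexpected frequencies, a common cause is that the input files are not where the program looks for them. The following is a small optional sketch for checking this before running the program; the class InputFileCheck and its method missing() are hypothetical helpers, not part of DocQuery.java.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of DocQuery.java.
class InputFileCheck {
    // Returns the names that do not exist as regular files under dir.
    static List<String> missing(Path dir, String... names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (!Files.isRegularFile(dir.resolve(name))) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Relative names such as "doc1.txt" resolve against the working directory.
        Path cwd = Paths.get("").toAbsolutePath();
        System.out.println("Working directory: " + cwd);
        System.out.println("Missing input files: " + missing(cwd, "doc1.txt", "doc2.txt"));
    }
}
```

Run it from the same directory you intend to run DocQuery from; an empty list means both files were found.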
DocQuery.java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.BufferedReader;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.Scanner;
//The DocQuery class contains the methods used to read the files,
//split their content into terms, and calculate the supporting frequency of a query term
public class DocQuery {
//This method takes a filename as its argument, reads the content of the file
//line by line, and returns the entire text of the file as one string.
public String readFile(String filename) {
//We use a StringBuilder to build up the entire string;
//this is more efficient than repeated string concatenation
StringBuilder content = new StringBuilder();
//try-with-resources closes the reader even if reading fails
try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(new File(filename))))) {
String line;
//Read until readLine() returns null (EOF)
while((line = br.readLine()) != null) {
content.append(line).append('\n');
}
} catch(Exception ex) {
//On failure, report the error and continue with whatever was read so far
ex.printStackTrace();
}
//Return the string created by calling the StringBuilder's toString() method
return content.toString();
}
//Takes a string as input and returns all the document terms (valid words) in the string
//as an ArrayList of String
public ArrayList<String> splitIntoTerms(String line) {
ArrayList<String> termList = new ArrayList<>();
//We use a regex to find the terms.
//Here the regex \w+ matches any run of word characters: digits 0-9,
//letters A-Z or a-z, and the underscore character
Pattern regExp = Pattern.compile("\\w+");
//Create a matcher that scans the line for every run of characters matching the pattern
Matcher terms = regExp.matcher(line);
//Add the terms to termList
while(terms.find()) {
termList.add(terms.group());
}
//return termList
return termList;
}
//Calculates the supporting frequency percentage, given the text of a file and a query term
public double getSuportingFrequency(String content, String query) {
//Splits the text into lines
String lines[] = content.split("\n");
//initialize the counts of total and supporting terms to zero
int totalTerms = 0;
int supportingTerms = 0;
//Go through each line
for(int i = 0; i < lines.length; i++) {
//get the terms in each line
ArrayList<String> terms = splitIntoTerms(lines[i]);
//add to the totalTerms count
totalTerms += terms.size();
//traverse each term in each line
for(int j = 0; j < terms.size(); j++) {
//A document term matches when the query is a case-insensitive prefix of it
//(an exact match is the special case where the prefix is the whole term).
//startsWith returns false when the term is shorter than the query,
//so no separate length check is needed
if(terms.get(j).toLowerCase().startsWith(query.toLowerCase()))
supportingTerms++;
}
}
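//An empty or unreadable document has no terms; return 0 rather than dividing
//zero by zero, which would give NaN
if(totalTerms == 0)
return 0.0;
//calculate the percentage. Cast to double to keep the fractional part
double per = (double)supportingTerms/(double)totalTerms;
per *= 100;
return per;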
}
public static void main(String args[]) {
DocQuery dq = new DocQuery();
//Read the text from files doc1 and doc2
String doc1 = dq.readFile("doc1.txt");
String doc2 = dq.readFile("doc2.txt");
Scanner in = new Scanner(System.in);
//Open the output file required by the assignment; try-with-resources closes it
//when the loop ends (fully qualified names are used so no extra import is needed)
try (java.io.PrintWriter out = new java.io.PrintWriter("asmnt7output.txt")) {
//Keep asking for a query until the user enters -1
while(true) {
//get the query term
System.out.println("Enter the query term (-1 to quit): ");
if(!in.hasNext())
break;
String query = in.next();
if(query.equals("-1"))
break;
//calculate the supporting frequency for the term in both the files
double support1 = dq.getSuportingFrequency(doc1, query);
double support2 = dq.getSuportingFrequency(doc2, query);
//get a string representation of the percentage values with the values rounded
//to two places after the decimal
String s1_str = String.format("%.2f",support1);
String s2_str = String.format("%.2f",support2);
//build the result line; equal frequencies are reported as equally relevant
String result;
if(support1 > support2)
result = query+": doc1("+s1_str+"%) is more relevant than doc2("+s2_str+"%)";
else if(support1 < support2)
result = query+": doc2("+s2_str+"%) is more relevant than doc1("+s1_str+"%)";
else
result = query+": doc1("+s1_str+"%) and doc2("+s2_str+"%) are equally relevant";
//write the conclusion to the output file and echo it to the console
out.println(result);
System.out.println(result);
}
} catch(java.io.FileNotFoundException ex) {
//the output file could not be created or opened
ex.printStackTrace();
}
in.close();
}
}
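To check the frequency calculation without the input files, the same logic can be exercised on in-memory strings. The following is a rough sketch under stated assumptions, not a replacement for DocQuery: the class FrequencyDemo, its methods frequency() and verdict(), and the two sample strings are all made up for illustration. It also passes Locale.ROOT to String.format so the decimal separator is always a dot, which is an addition not present in the program above.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone check of the frequency logic; names are illustrative.
class FrequencyDemo {
    // Percentage of \w+ terms in text that start with query, ignoring case.
    static double frequency(String text, String query) {
        Matcher m = Pattern.compile("\\w+").matcher(text);
        int total = 0;
        int hits = 0;
        while (m.find()) {
            total++;
            if (m.group().regionMatches(true, 0, query, 0, query.length())) {
                hits++;
            }
        }
        // Avoid 0/0 (NaN) for text with no terms.
        return total == 0 ? 0.0 : 100.0 * hits / total;
    }

    // Builds the result line in the format required by the assignment.
    static String verdict(String query, double f1, double f2) {
        String a = String.format(Locale.ROOT, "doc1(%.2f%%)", f1);
        String b = String.format(Locale.ROOT, "doc2(%.2f%%)", f2);
        if (f1 > f2) return query + ": " + a + " is more relevant than " + b;
        if (f1 < f2) return query + ": " + b + " is more relevant than " + a;
        return query + ": " + a + " and " + b + " are equally relevant";
    }

    public static void main(String[] args) {
        String d1 = "Java is fun. JavaScript is not Java."; // 7 terms, 3 match "java"
        String d2 = "No match here at all";                 // 5 terms, 0 match
        System.out.println(verdict("java", frequency(d1, "java"), frequency(d2, "java")));
        // prints: java: doc1(42.86%) is more relevant than doc2(0.00%)
    }
}
```

Because "JavaScript" begins with "java", it counts as a match under the prefix rule, which is why 3 of the 7 terms in the first string are counted.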
Sample Input Files (doc1.txt and doc2.txt)
These files were used to test the program
Sample Output