Please complete in Python, and neatly explain and format the code. Use snake case when naming variables. Write a program named wordhistogram.py which takes one file as an argument. The file is a plain text file (make your own) which shall be analyzed by the program. Upon completing the analysis, the program shall output a report detailing the shortest word(s), the longest word(s), the most frequently used word(s), and a histogram of all the words used in the input file.
If there is a tie, then all words that are of the same length for that classification (longest, shortest, most frequent) are displayed as part of that class.
A word histogram shows an approximate representation of the distribution of words used in the input file. An example text, The_Jungle_Upton_Sinclair.txt, is provided as a starting point for analyzing natural language. Draw your histogram by listing the word first and then print up to 65 * characters to represent the frequency of the word.
Since there is limited space on a terminal, ensure that your histogram does not wrap along the right edge of the terminal. Assume that the histogram cannot be wider than 65 characters. In calculating your histogram, map the highest frequency to 65 characters. For example, if the most frequently used word in the text appears 2000 times, divide 2000 by 65 to find that each * character represents approximately 30 occurrences of a word. Thus a word that appears fewer than 30 times receives zero * characters, and a word that appears 125 times receives 4 * characters (the buckets being 0-30, 31-60, 61-90, 91-120, 121-150).
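The scaling rule above can be sketched as a small function. This only illustrates the arithmetic in the assignment text; the name stars_for and the clamp to the width are assumptions, not part of the assignment. The clamp matters because 2000 // 65 rounds down to 30, so the top word would otherwise get 2000 // 30 = 66 stars.

```python
def stars_for(count, max_count, width=65):
    """Return how many * characters a word with `count` occurrences gets."""
    # Each star stands for this many occurrences (at least 1, to avoid dividing by zero).
    per_star = max(1, max_count // width)
    # Integer division rounds down; clamp so the bar never exceeds the width.
    return min(width, count // per_star)


print(stars_for(125, 2000))   # 4
print(stars_for(29, 2000))    # 0
print(stars_for(2000, 2000))  # 65
```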
Print the order of the histogram from most frequent to least frequent.
The program must have a class named WordHistogram. This class must be the data structure that keeps track of the words that appear in the input text file and can return the histogram as a single string. The main function is responsible for opening and reading the input text file.
Make sure your WordHistogram class has data members that are correctly named (use the underscore character!), has an initializer, and any necessary methods and data members.
In your main, use the given function to read from the filehandle.
def word_iterator(file_handle):
    """ This iterates through all the words of a given file handle. """
    for line in file_handle:
        for word in line.split():
            yield word
DO NOT COMPUTE OR STORE THE HISTOGRAM OUTSIDE OF AN OBJECT named WordHistogram
Example Output
$ ./wordhistogram.py Candide_Voltaire.txt
Word Histogram Report
the (2179) ******************************************************************
of (1233) *************************************
to (1130) **********************************
and (1127) **********************************
a (863) **************************
in (623) ******************
i (446) *************
was (434) *************
that (414) ************
he (410) ************
with (395) ***********
is (348) **********
his (333) **********
you (317) *********
said (302) *********
not (276) ********
...
$
First we get all the words from the file using the given function and store them in a list called all_words:
import sys

def word_iterator(file_handle):
    """ This iterates through all the words of a given file handle. """
    for line in file_handle:
        for word in line.split():
            yield word

if __name__ == '__main__':
    file_name = sys.argv[1]  # get the name of the file from the command line
    with open(file_name) as file_handle:  # the with block closes the file when done
        all_words = []
        for word in word_iterator(file_handle):
            all_words.append(word.lower())  # store every word, lower-cased, in all_words
Next we define the class and its instance attributes:
class WordHistogram:
    def __init__(self, all_words):
        self.all_words = all_words  # stores all the words from the file
        self.shortest_words = []  # a list, because there could be more than one shortest word
        self.longest_words = []  # a list, because there could be more than one longest word
        self.freq = {}  # stores the frequency of each word
Then we define the class methods:
    def find_shortest_words(self):
        """
        Function to calculate the shortest word(s)
        """
        # start with the length of the first word
        shortest_length = len(self.all_words[0])
        for word in self.all_words[1:]:  # loop over all the words except the first
            shortest_length = min(shortest_length, len(word))
        # now that we know the shortest length, collect every word of that length
        for word in self.all_words:
            if len(word) == shortest_length:
                self.shortest_words.append(word)

    def find_longest_words(self):
        """
        Function to calculate the longest word(s)
        """
        longest_length = 0  # start at 0, since every word is longer than 0 characters
        for word in self.all_words:  # loop over all the words to find the longest length
            longest_length = max(longest_length, len(word))
        # now that we know the longest length, collect every word of that length
        for word in self.all_words:
            if len(word) == longest_length:
                self.longest_words.append(word)
    def calculate_word_frequency(self):
        """
        Function to calculate the word frequency
        """
        # We loop over all the words and add each one to the dictionary. If the word is
        # already present, we increment its count. If it is not present in freq, we get
        # a KeyError, and in that case we set the count to 1.
        for word in self.all_words:
            try:
                self.freq[word] += 1
            except KeyError:
                self.freq[word] = 1
        # We sort the dictionary by descending word frequency, as required by the question.
        # Dictionaries keep insertion order, so the sorted order is preserved.
        self.freq = dict(sorted(self.freq.items(), key=lambda item: item[1], reverse=True))
    def print_histogram(self):
        """
        Function to print the Word Histogram Report
        """
        print("Word Histogram Report")
        print("---------------------")
        print("Shortest word(s):")
        print(", ".join(sorted(set(self.shortest_words))))  # set() removes duplicates
        print()
        print("Longest word(s):")
        print(", ".join(sorted(set(self.longest_words))))
        print()
        print("Word frequency:")
        max_count = max(self.freq.values())  # the highest word frequency maps to 65 stars
        for key in self.freq:
            count = self.freq[key]
            number_of_stars = count * 65 // max_count  # scale the count to at most 65 stars
            print(f"{key} ({count}) " + "*" * number_of_stars)
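The assignment also asks for the most frequently used word(s), ties included, which the methods above do not report. One way to find them is sketched below as a standalone function; the name most_frequent_words and the use of collections.Counter are assumptions made for illustration, and in the actual program this logic would belong inside WordHistogram as a method working on self.freq.

```python
from collections import Counter


def most_frequent_words(all_words):
    """Return every word that shares the highest count, in first-seen order."""
    if not all_words:
        return []
    counts = Counter(all_words)  # maps each word to its number of occurrences
    highest = max(counts.values())
    # Keep all words tied at the highest count, not just the first one.
    return [word for word, count in counts.items() if count == highest]


print(most_frequent_words(["the", "cat", "the", "dog", "cat"]))  # ['the', 'cat']
```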
Now we need to get the words from the file and call the necessary class functions from the main method:
if __name__ == '__main__':
    file_name = sys.argv[1]  # get the name of the file from the command line
    with open(file_name) as file_handle:  # the with block closes the file when done
        all_words = []
        for word in word_iterator(file_handle):
            # since case doesn't matter, we convert the words to lower case
            all_words.append(word.lower())
    histogram = WordHistogram(all_words)  # pass all the words to the WordHistogram class
    histogram.find_longest_words()
    histogram.find_shortest_words()
    histogram.calculate_word_frequency()
    histogram.print_histogram()
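The reading step in the main block can also be wrapped in a helper so the file is always closed and the lower-casing happens in one place. This is a minimal sketch; read_words is a hypothetical name, the UTF-8 encoding is an assumption about the input file, and the temporary file exists only to make the demonstration self-contained.

```python
import os
import tempfile


def word_iterator(file_handle):
    """ This iterates through all the words of a given file handle. """
    for line in file_handle:
        for word in line.split():
            yield word


def read_words(file_name):
    """Return all whitespace-separated words in the file, lower-cased."""
    with open(file_name, encoding="utf-8") as file_handle:
        return [word.lower() for word in word_iterator(file_handle)]


# Small demonstration using a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("The cat\nsaw THE dog\n")
print(read_words(tmp.name))  # ['the', 'cat', 'saw', 'the', 'dog']
os.remove(tmp.name)  # clean up the temporary file
```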
The complete code is below:
import sys


class WordHistogram:
    def __init__(self, all_words):
        self.all_words = all_words  # stores all the words from the file
        self.shortest_words = []  # a list, because there could be more than one shortest word
        self.longest_words = []  # a list, because there could be more than one longest word
        self.freq = {}  # stores the frequency of each word

    def find_shortest_words(self):
        """
        Function to calculate the shortest word(s)
        """
        # start with the length of the first word
        shortest_length = len(self.all_words[0])
        for word in self.all_words[1:]:  # loop over all the words except the first
            shortest_length = min(shortest_length, len(word))
        # now that we know the shortest length, collect every word of that length
        for word in self.all_words:
            if len(word) == shortest_length:
                self.shortest_words.append(word)

    def find_longest_words(self):
        """
        Function to calculate the longest word(s)
        """
        longest_length = 0  # start at 0, since every word is longer than 0 characters
        for word in self.all_words:  # loop over all the words to find the longest length
            longest_length = max(longest_length, len(word))
        # now that we know the longest length, collect every word of that length
        for word in self.all_words:
            if len(word) == longest_length:
                self.longest_words.append(word)

    def calculate_word_frequency(self):
        """
        Function to calculate the word frequency
        """
        # We loop over all the words and add each one to the dictionary. If the word is
        # already present, we increment its count. If it is not present in freq, we get
        # a KeyError, and in that case we set the count to 1.
        for word in self.all_words:
            try:
                self.freq[word] += 1
            except KeyError:
                self.freq[word] = 1
        # We sort the dictionary by descending word frequency, as required by the question.
        # Dictionaries keep insertion order, so the sorted order is preserved.
        self.freq = dict(sorted(self.freq.items(), key=lambda item: item[1], reverse=True))

    def print_histogram(self):
        """
        Function to print the Word Histogram Report
        """
        print("Word Histogram Report")
        print("---------------------")
        print("Shortest word(s):")
        print(", ".join(sorted(set(self.shortest_words))))  # set() removes duplicates
        print()
        print("Longest word(s):")
        print(", ".join(sorted(set(self.longest_words))))
        print()
        print("Word frequency:")
        max_count = max(self.freq.values())  # the highest word frequency maps to 65 stars
        for key in self.freq:
            count = self.freq[key]
            number_of_stars = count * 65 // max_count  # scale the count to at most 65 stars
            print(f"{key} ({count}) " + "*" * number_of_stars)


def word_iterator(file_handle):
    """ This iterates through all the words of a given file handle. """
    for line in file_handle:
        for word in line.split():
            yield word


if __name__ == '__main__':
    file_name = sys.argv[1]  # get the name of the file from the command line
    with open(file_name) as file_handle:  # the with block closes the file when done
        all_words = []
        for word in word_iterator(file_handle):
            # since case doesn't matter, we convert the words to lower case
            all_words.append(word.lower())
    histogram = WordHistogram(all_words)  # pass all the words to the WordHistogram class
    histogram.find_longest_words()
    histogram.find_shortest_words()
    histogram.calculate_word_frequency()
    histogram.print_histogram()
If you want to change the way the report is displayed, you can edit the print_histogram function.
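The assignment asks that the class be able to return the histogram as a single string, whereas print_histogram prints directly. One possible shape for that is sketched below as a standalone function; histogram_as_string is a hypothetical name, the proportional scaling (count * width // max_count) is an assumption, and in the program this would be a WordHistogram method reading self.freq.

```python
def histogram_as_string(freq, width=65):
    """Build the histogram, most frequent word first, as one string."""
    if not freq:
        return ""
    max_count = max(freq.values())
    lines = []
    # sorted() is stable, so words with equal counts keep their original order.
    for word, count in sorted(freq.items(), key=lambda item: item[1], reverse=True):
        stars = "*" * (count * width // max_count)  # the top word gets `width` stars
        lines.append(f"{word} ({count}) {stars}")
    return "\n".join(lines)


print(histogram_as_string({"the": 4, "cat": 2, "dog": 1}, width=8))
```

The caller can then print the returned string, write it to a file, or compare it in a test.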