Using Python
In the directory rootdir some of the files contain randomly generated content while others are written by human authors. Determine how many of the files contained in the directory are written by human authors. Store your answer in the variable number_human_authored.
The logic I am going to use here is that human-authored files contain words that are in English, while randomly generated files do not contain English words. So the next problem is how to tell whether a word is a proper English word or not. For this we can use the nltk library in Python, which provides a corpus of English words along with many other language tools (word meanings, checking whether a word is English, etc.).
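To make the idea concrete, here is a minimal sketch that scores a single piece of text against a word set. The function name english_fraction and the tiny sample_words set are made up for this illustration; in the actual solution the set would come from nltk's word list.

```python
def english_fraction(text, dictionary):
    """Return the fraction of whitespace-separated tokens found in dictionary."""
    tokens = text.lower().split()
    if not tokens:
        # No words at all: treat as 0 rather than dividing by zero.
        return 0.0
    hits = sum(1 for token in tokens if token in dictionary)
    return hits / len(tokens)


# Tiny stand-in word set (an assumption for this sketch, not the nltk corpus).
sample_words = {"the", "cat", "sat", "on", "mat"}

human_like = english_fraction("The cat sat on the mat", sample_words)
random_like = english_fraction("xqzv plrt wnbk the", sample_words)
print(human_like, random_like)
```

Text written by a person should score close to 1, while random character strings should score close to 0, which is what makes a simple cutoff workable.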
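If you don't have nltk installed, please install the library using
pip install nltk
Remember to run the terminal / command prompt in administrator mode if you run into permission issues. The word list is a separate download, so run nltk.download('words') once in Python before using the code below.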
The code:
from nltk.corpus import words
import os

path = input("Please enter the path of the folder : ")
files = os.listdir(path)
number_human_authored = 0
threshold = 50
# Dictionary is a set that contains words from the English language.
# Lowercased so it matches the lowercased words read from the files.
dictionary = set(w.lower() for w in words.words())
for f in files:
    # os.path.join works on every operating system, unlike a hard-coded "\\".
    complete_path = os.path.join(path, f)
    # Skip subdirectories; only regular files are checked.
    if not os.path.isfile(complete_path):
        continue
    total_words = 0
    english_words = 0
    # Open each file, and read word by word. Undecodable bytes are ignored.
    with open(complete_path, 'r', errors='ignore') as file:
        for line in file:
            # split() with no argument also drops newlines and repeated spaces.
            for word in line.split():
                word = word.lower()
                # If word is in dictionary, it is an English word.
                if word in dictionary:
                    english_words += 1
                total_words += 1
    # Skip empty files to avoid dividing by zero.
    if total_words == 0:
        continue
    # Calculate percentage of English words.
    percent_english_words = (english_words / total_words) * 100
    # If percentage is greater than threshold, then the file is human authored.
    if percent_english_words > threshold:
        number_human_authored += 1
print("The number of human authored files is:", number_human_authored)
The threshold can be changed, but don't make it too high: the dictionary does not contain informal words, and words with punctuation attached (such as "hello,") will not match, so 50 percent is a reasonable threshold.
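If the punctuation issue pushes too many real words out of the dictionary, one option is to strip punctuation from each word before the lookup. This is only a sketch of that idea; normalize and sample_words are hypothetical names used for the example and are not part of the code above.

```python
import string


def normalize(word):
    """Lowercase a token and strip punctuation from both ends."""
    return word.strip(string.punctuation).lower()


# Tiny stand-in word set (an assumption for this sketch, not the nltk corpus).
sample_words = {"hello", "world"}

tokens = "Hello, world!".split()
raw_matches = [t.lower() in sample_words for t in tokens]
clean_matches = [normalize(t) in sample_words for t in tokens]
print(raw_matches, clean_matches)
```

Only punctuation at the ends of a token is removed, so a word like "don't" keeps its apostrophe. A token made up entirely of punctuation becomes an empty string and should probably be skipped rather than counted.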
When prompted for input, pass the path of the folder in which the files are kept, then run the code to get the answer.
The code has been commented so you can understand the code better.
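Since the question names the directory rootdir, you may prefer a version that takes the directory as an argument instead of asking for input. Below is a rough sketch of the same logic wrapped in a function. The function name count_human_authored, the sample_words set and the two demo files are assumptions made up so the example can run on its own; with nltk you would pass in the set built from words.words() and your real rootdir.

```python
import os
import tempfile


def count_human_authored(rootdir, dictionary, threshold=50):
    """Count files in rootdir whose percentage of dictionary words exceeds threshold."""
    count = 0
    for name in os.listdir(rootdir):
        full_path = os.path.join(rootdir, name)
        # Skip subdirectories; only regular files are checked.
        if not os.path.isfile(full_path):
            continue
        with open(full_path, "r", errors="ignore") as handle:
            tokens = handle.read().lower().split()
        # Skip empty files to avoid dividing by zero.
        if not tokens:
            continue
        percent = 100 * sum(t in dictionary for t in tokens) / len(tokens)
        if percent > threshold:
            count += 1
    return count


# Demo on a temporary directory with one readable file and one random-looking file.
sample_words = {"the", "quick", "brown", "fox", "jumps"}
with tempfile.TemporaryDirectory() as rootdir:
    with open(os.path.join(rootdir, "human.txt"), "w") as f:
        f.write("the quick brown fox jumps\n")
    with open(os.path.join(rootdir, "random.txt"), "w") as f:
        f.write("zxq vbn mlk the\n")
    number_human_authored = count_human_authored(rootdir, sample_words)

print(number_human_authored)
```

Keeping the dictionary and threshold as parameters makes it easy to try different cutoffs without editing the loop.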
I would love to resolve any queries in the comments. Please consider dropping an upvote to help out a struggling college kid :)
Happy Coding !!