Question

Using python: In the directory rootdir some of the files contain randomly generated content while others are written by human authors. Determine how many of the files contained in the directory are written by human authors. Store your answer in the variable number_human_authored.

Solutions

Expert Solution

The logic I am going to use here is that human-authored files consist mostly of English words, while randomly generated files do not. So the next problem is how to decide whether a word is a proper English word. For this purpose we can use the nltk library in Python, which provides many tools for this kind of task (word meanings, checking whether a word is English, etc.). Its words corpus is a plain list of English words that we can check each word against.
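To make the idea concrete, here is a minimal sketch of the scoring step on two short strings. SAMPLE_DICTIONARY and english_percentage are hypothetical names used only for this illustration, and the tiny word set stands in for the much larger nltk word list.

```python
# Tiny stand-in for the nltk word list (an assumption for illustration only).
SAMPLE_DICTIONARY = {"the", "cat", "sat", "on", "mat"}

def english_percentage(text, dictionary):
    """Return the percentage of whitespace-separated tokens found in dictionary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0  # no words at all, so nothing can match
    hits = sum(1 for token in tokens if token in dictionary)
    return 100 * hits / len(tokens)

human = english_percentage("The cat sat on the mat", SAMPLE_DICTIONARY)
noise = english_percentage("xqzv plorf the wkkj zzt", SAMPLE_DICTIONARY)
print(human, noise)
```

The readable sentence scores far higher than the random string, which is the gap the threshold is meant to separate.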

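If you don't have nltk installed, please install the library using

pip install nltk

If you hit permission errors, run the terminal/command prompt in administrator mode. Note that the word list itself is a separate one-time download, which you can fetch from Python with nltk.download("words").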
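Installing the package does not install the word list, so the first call to words.words() can fail with a LookupError. The following is a rough sketch of a loader that handles that case; load_english_words and its allow_download parameter are hypothetical names for this example, not part of nltk.

```python
def load_english_words(allow_download=False):
    """Return the nltk word list as a lowercase set, or None if unavailable."""
    try:
        import nltk
        from nltk.corpus import words
    except ImportError:
        return None  # nltk itself is not installed

    try:
        return {w.lower() for w in words.words()}
    except LookupError:
        # nltk raises LookupError when the "words" corpus is not on disk yet.
        # Downloading needs network access, so it is opt-in here.
        if allow_download and nltk.download("words", quiet=True):
            return {w.lower() for w in words.words()}
        return None

dictionary = load_english_words()
print("word list loaded" if dictionary else "word list not available")
```

Calling load_english_words(allow_download=True) once on a machine with network access should be enough; later runs read the corpus from disk.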

The code -

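import os
import string

import nltk
from nltk.corpus import words

# The word list is a one-time download; without it words.words() raises LookupError.
nltk.download("words", quiet=True)

path = input("Please enter the path of the folder : ")

number_human_authored = 0

# Minimum percentage of English words for a file to count as human authored.
threshold = 50

# dictionary is a set of lowercase English words (a set gives fast lookups).
dictionary = set(w.lower() for w in words.words())

for f in os.listdir(path):
    # os.path.join picks the right separator on every operating system.
    complete_path = os.path.join(path, f)

    # Skip subdirectories; only regular files are classified.
    if not os.path.isfile(complete_path):
        continue

    total_words = 0
    english_words = 0

    # Open each file and read it word by word; "with" closes the file afterwards.
    # errors='ignore' skips bytes that cannot be decoded in random files.
    with open(complete_path, 'r', errors='ignore') as file:
        for line in file:
            # split() without arguments splits on any whitespace and drops empty strings.
            for word in line.split():
                # Lowercase and strip surrounding punctuation before the lookup.
                word = word.lower().strip(string.punctuation)
                # If word is in dictionary, it is an English word.
                if word in dictionary:
                    english_words += 1
                total_words += 1

    # An empty file has no words, so skip it and avoid dividing by zero.
    if total_words == 0:
        continue

    # Calculate percentage of English words.
    percent_english_words = (english_words / total_words) * 100

    # If percentage is greater than threshold, then the file is human authored.
    if percent_english_words > threshold:
        number_human_authored += 1

print("The number of human authored files is:", number_human_authored)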

    

The threshold is changeable, but don't make it too high: the dictionary does not contain informal words, and a word with punctuation still attached will not match unless the punctuation is stripped first. For that reason 50 percent is a reasonable threshold.
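The effect of punctuation on the match rate can be seen in a small sketch. SAMPLE_DICTIONARY and match_rate are hypothetical names for this illustration, and the five-word set is only a stand-in for the real word list.

```python
import string

# Tiny stand-in for the nltk word list (an assumption for illustration only).
SAMPLE_DICTIONARY = {"hello", "world", "this", "is", "fine"}

def match_rate(text, dictionary, strip_punctuation):
    """Return the percentage of tokens in text that appear in dictionary."""
    tokens = text.lower().split()
    if strip_punctuation:
        # Remove leading and trailing punctuation such as "," or "!".
        tokens = [t.strip(string.punctuation) for t in tokens]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    if not tokens:
        return 0.0
    return 100 * sum(t in dictionary for t in tokens) / len(tokens)

sentence = "Hello, world! This is fine."
raw = match_rate(sentence, SAMPLE_DICTIONARY, strip_punctuation=False)
cleaned = match_rate(sentence, SAMPLE_DICTIONARY, strip_punctuation=True)
print(raw, cleaned)
```

With a threshold of 50, the same sentence falls below the cutoff when punctuation is left attached and clears it once the punctuation is stripped, which is why a very high threshold is risky on unnormalized text.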

When prompted, enter the path of the folder in which the files are kept (rootdir in the question) and run the program to get the answer.

The code has been commented so you can understand the code better.

I would love to resolve any queries in the comments. Please consider dropping an upvote to help out a struggling college kid :)

Happy Coding !!

