Read and implement the code in NLTK Ch 6, Section 1.3 “Document Classification”, which performs sentiment analysis of movie reviews.
a) Using the movie review document classifier discussed in this chapter, generate a list of the 30 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising?
b) Word features can be very useful for performing document classification, since the words that appear in a document give a strong indication about what its semantic content is. However, many words occur very infrequently, and some of the most informative words in a document may never have occurred in our training data. One solution is to make use of a lexicon, which describes how different words relate to one another. In a paragraph, describe how you might utilize the WordNet lexicon for this problem to improve the movie review document classifier.
a) As a first approximation, this answer treats the most frequent words in the corpus as the most important features.
import random
import nltk
# to download the corpus once, if needed:
# nltk.download('movie_reviews')

from nltk.corpus import movie_reviews

# build (word list, category) pairs for every review in the corpus
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# shuffle so positive and negative reviews are interleaved
random.shuffle(documents)

# print one document to inspect the data
print(documents[1])

# compute the frequency of every (lower-cased) word in the corpus
words_all = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# the 30 most frequent words, taken here as the most important features
final_data = words_all.most_common(30)

# print the frequency of each of the top 30 words
for word, freq in final_data:
    print("The frequency of", word, "is", freq)
The most frequent words are then treated as the most important features. However, if we simply take the top 30 words, most of them turn out to be punctuation marks or stopwords such as articles, which carry no information about sentiment.
To obtain more meaningful features, punctuation and stopwords are removed before selecting the top 30 words, as shown in the code below:
from nltk.corpus import stopwords
import string
# nltk.download('stopwords')

# English stopwords ("the", "a", "of", ...)
stopwords_english = stopwords.words('english')

# remove punctuation and stopwords before counting
clean_words = []
for w in movie_reviews.words():
    w = w.lower()
    if w not in stopwords_english and w not in string.punctuation:
        clean_words.append(w)

# compute the frequency of each remaining word
words_frequency = nltk.FreqDist(clean_words)

# top 30 features after cleaning
final_data = words_frequency.most_common(30)
print(final_data)
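Note that part a) literally asks for the 30 features the classifier itself finds most informative, and NLTK's naive Bayes classifier reports these directly via show_most_informative_features(). A minimal sketch following the chapter's recipe; the choice of 2,000 binary word features and a 100-document test set are the book's settings, not requirements:

import random
import nltk
from nltk.corpus import movie_reviews

# the 2,000 most frequent words serve as candidate features
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

def document_features(document):
    # binary "contains(word)" features, as in the chapter
    document_words = set(document)
    return {'contains({})'.format(word): (word in document_words)
            for word in word_features}

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

# the 30 features whose presence most strongly separates pos from neg
classifier.show_most_informative_features(30)

These features are informative because their presence shifts the pos/neg odds sharply; in the book's example run, features like contains(outstanding) and contains(seagal) rank near the top.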
b) After selecting the top k features, some of them may be correlated or near-synonymous. Such features add redundancy to the data, and the classifier takes more time to train and predict than it would if the redundancy were removed. The WordNet lexical database can help here: it groups words into sets of synonyms (synsets) and records relations such as antonymy and hypernymy. By collapsing synonymous words among the top k selected features into a single feature, we can reduce the dimensionality and complexity of the movie review document classifier. WordNet also addresses the unseen-word problem raised in the question: a word that never occurred in the training data can be mapped through its synset to a synonym (or hypernym) that did occur, so the classifier can still use it as evidence.
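A minimal sketch of the synonym-collapsing idea, using NLTK's WordNet interface. The canonical() helper is a hypothetical name introduced here for illustration; it naively picks the first synset, ignoring word-sense ambiguity:

from nltk.corpus import wordnet as wn
# nltk.download('wordnet')

def canonical(word):
    # hypothetical helper: map a word to the first lemma of its
    # first synset, so synonyms collapse to one feature name.
    # This naively assumes the first listed sense is the right one.
    synsets = wn.synsets(word)
    if synsets:
        return synsets[0].lemmas()[0].name().lower()
    return word

# 'movie' and 'film' share the synset movie.n.01, so both should
# map to the same canonical feature name
print(canonical('movie'))
print(canonical('film'))

Applying canonical() to every word before building word_features would merge synonymous features into one, and applying it to words in a test review maps unseen vocabulary onto features the classifier has already learned.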