Question

In: Computer Science


Natural Language Processing

Read and implement the code in NLTK Ch 6, Section 1.3 “Document Classification”, which examines the sentiment analysis of movie reviews.

a) Using the movie review document classifier discussed in this chapter, generate a list of the 30 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising?

b) Word features can be very useful for performing document classification, since the words that appear in a document give a strong indication about what its semantic content is. However, many words occur very infrequently, and some of the most informative words in a document may never have occurred in our training data. One solution is to make use of a lexicon, which describes how different words relate to one another. In a paragraph, describe how you might utilize the WordNet lexicon for this problem to improve the movie review document classifier.  

Solutions

Expert Solution

a) In this solution, the most frequent words in the corpus are treated as the most informative features.

# download the required nltk data (run once)
import random
import nltk
nltk.download('movie_reviews')

# import the movie_reviews dataset
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# randomly shuffle the documents
random.shuffle(documents)

# print one row to inspect the data
print(documents[1])

# lowercase every word in the corpus
words_all = [w.lower() for w in movie_reviews.words()]

# compute the frequency of all the words
words_all = nltk.FreqDist(words_all)

# the 30 most frequent words and their counts
final_data = words_all.most_common(30)
for word, freq in final_data:
    print("The frequency of", word, "is", freq)

The most frequent words are taken as the most important features. However, if we simply take the top 30 words, most of them turn out to be punctuation marks or articles, which carry no useful information.

So, to obtain meaningful features, punctuation and stopwords are removed before taking the top 30, as shown in the code below:

from nltk.corpus import stopwords
import string
nltk.download('stopwords')

# store all English stopwords
stopwords_english = stopwords.words('english')

# remove punctuation and stopwords from the frequency list
clean_words = []
for w in words_all:
    if w not in stopwords_english and w not in string.punctuation:
        clean_words.append(w)

# compute the frequency of each remaining word
words_frequency = nltk.FreqDist(clean_words)

# top 30 features
final_data = words_frequency.most_common(30)
print(final_data)
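Note that part (a) also asks for the 30 features the trained classifier itself finds most informative. NLTK's Naive Bayes classifier exposes this directly via show_most_informative_features. The sketch below follows the pipeline in NLTK Ch 6, Section 1.3 (the 2000-word feature cutoff and the 100-document test split are the chapter's choices; the corpus must already be downloaded):

```python
# Sketch of the full classifier pipeline from NLTK Ch 6, Sec 1.3.
# document_features() is the feature extractor described in the chapter.

def document_features(document, word_features):
    """Map a document (a list of tokens) to a dict of binary
    'contains(word)' features, one per candidate feature word."""
    document_words = set(document)
    return {"contains({})".format(word): (word in document_words)
            for word in word_features}

if __name__ == "__main__":
    import random
    import nltk
    from nltk.corpus import movie_reviews

    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    random.shuffle(documents)

    # use the 2000 most frequent corpus words as candidate features
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = [w for w, _ in all_words.most_common(2000)]

    featuresets = [(document_features(d, word_features), c)
                   for d, c in documents]
    train_set, test_set = featuresets[100:], featuresets[:100]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))

    # the 30 features the classifier itself finds most informative
    classifier.show_most_informative_features(30)
```

The informative features this prints are typically strongly polarized words (e.g. words that appear almost exclusively in negative or positive reviews), which is why they differ from the raw frequency list above.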

b) After selecting the top k features, some of them may be correlated or have similar meanings. The presence of such features adds redundancy to the data, and the classifier takes longer to train than it would if the redundancy were removed. The WordNet lexical database, which organizes words into sets of synonyms and antonyms, can be used to detect these similar features. By collapsing synonyms among the top k selected features in the training data, we can reduce the complexity of the movie review document classifier. WordNet also helps with the sparsity problem the question raises: a word that never occurred in training can be mapped to a synonym that did, so the document still triggers a known feature.
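One way to sketch the WordNet idea in code (this helper and its name are illustrative, not part of NLTK): collapse a word to a synonym the classifier already knows, using WordNet's synsets, so rare or unseen words back off to known features. It assumes the WordNet corpus has been downloaded via nltk.download('wordnet').

```python
# Sketch: back an unseen word off to a WordNet synonym that *was* seen
# in training, so rare words still fire a known feature.

def canonicalize(word, known_words, synonym_lookup):
    """Return `word` if it is already a known feature; otherwise the
    first of its synonyms that is known; otherwise the word unchanged.
    `synonym_lookup(word)` yields candidate synonym strings."""
    if word in known_words:
        return word
    for syn in synonym_lookup(word):
        if syn in known_words:
            return syn
    return word

if __name__ == "__main__":
    from nltk.corpus import wordnet as wn

    def wordnet_synonyms(word):
        # all lemma names from all synsets of the word, lowercased
        for synset in wn.synsets(word):
            for lemma in synset.lemma_names():
                yield lemma.lower()

    # e.g. if "dreadful" never appeared in training but "awful" did,
    # the review can still trigger the informative "awful" feature
    print(canonicalize("dreadful", {"awful", "great"}, wordnet_synonyms))
```

The same mapping, applied before counting frequencies, also merges synonymous features in the training data, which is the redundancy-removal idea described above.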

