In Python (NLTK may be used):
Given the following documents:
• Can we go to Disney??!!!!!! Let's go on a plane!
• The New England Patriots won the Super Bowl..
• I HATE going to school so early
• When will I be considered an adult?
• I want to go to A&M, Baylor, or the University of Texas.
Perform punctuation removal, stop word removal, case folding, lemmatization, and stemming on the documents.
#!/usr/bin/env python
# coding: utf-8
# In[4]:
from nltk.tokenize import word_tokenize
# First document from the question, copied verbatim.
text = "Can we go to Disney??!!!!!! Let's go on a plane!"
tokens = word_tokenize(text)
tokens = [x.lower() for x in tokens]
tokens
# In[5]:
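import re
# Match any non-word character or digit so it can be stripped from each token.
regex = re.compile(r'[\W\d]')
post_punctuation = []
for word in tokens:
    word = regex.sub('', word)
    # Skip tokens that were pure punctuation and are now empty.
    if len(word) > 0:
        post_punctuation.append(word)
post_punctuation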
# In[6]:
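from nltk.corpus import stopwords
# Build the stop word set once instead of reloading the list for every token.
stop_words = set(stopwords.words('english'))
no_stop_words = []
for word in post_punctuation:
    if word not in stop_words:
        no_stop_words.append(word)
no_stop_words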
# In[7]:
from nltk.stem import WordNetLemmatizer
word_lem = WordNetLemmatizer()
# Collect the lemmas in a new list; reassigning the loop variable would discard them.
lemmatized = [word_lem.lemmatize(word) for word in no_stop_words]
lemmatized
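The task also asks for stemming, which the script does not get to. Below is a minimal sketch of that step. It assumes NLTK's PorterStemmer is an acceptable choice (the question does not name a stemmer) and uses a hypothetical list of already-cleaned tokens as input rather than any variable from the script.

```python
from nltk.stem import PorterStemmer

# Hypothetical input: tokens that have already been lowercased, stripped of
# punctuation, filtered for stop words, and lemmatized.
cleaned_tokens = ["going", "disney", "plane", "considered"]

# PorterStemmer is an assumed choice; the question does not specify a stemmer.
stemmer = PorterStemmer()

# Stem each token and keep the results in a new list.
stemmed = [stemmer.stem(token) for token in cleaned_tokens]
print(stemmed)
```

PorterStemmer is rule-based, so it needs no corpus download, unlike the stop word list and the WordNet data. The same steps can be repeated for each of the five documents by looping over a list of the document strings.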