Perform SVM training and testing using the SMS spam data set spam.csv. The objective is to predict the class of a new SMS using an SVM classifier. The data set is a collection of SMS messages tagged for SMS spam research. It contains 5,574 messages in English, each tagged as ham (legitimate) or spam. The file contains one message per line, and each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
Data set: spam.csv. The details of the learning and testing tasks are as follows:
1. Understand the distribution of classes in the data set using suitable plots (a plotting sketch follows this list).
2. Plot the distribution of frequent words under the "spam" and "ham" classes.
3. Preprocess the data set if required.
4. Apply an SVM classifier.
5. Use cross-validation and hold-out approaches to learn and evaluate the SVM classifier.
6. Discuss the results achieved by the SVM classifier using the confusion matrix, sensitivity, specificity, and accuracy (a sketch for items 5 and 6 follows the source code below).
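For items 1 and 2, here is a minimal plotting sketch. It assumes matplotlib and collections.Counter and reads spam.csv with latin-1 encoding (these choices are assumptions on my part; any equivalent plotting approach works):

import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

# Assumption: the Kaggle copy of spam.csv is latin-1 encoded; adjust if yours is UTF-8.
email = pd.read_csv("../input/spam.csv", encoding="latin-1")[["v1", "v2"]]
email.columns = ["label", "message"]

# Item 1: class distribution as a bar plot
email["label"].value_counts().plot(kind="bar", title="Class distribution (ham vs. spam)")
plt.show()

# Item 2: most frequent words per class
for cls in ["ham", "spam"]:
    words = " ".join(email.loc[email["label"] == cls, "message"]).lower().split()
    tokens, counts = zip(*Counter(words).most_common(15))
    plt.bar(tokens, counts)
    plt.title("Most frequent words in %s messages" % cls)
    plt.xticks(rotation=45)
    plt.show()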
Question 3: Consider a problem where we are given a collection of reviews of a movie by 5 people. Each review is a sentence summarising the comments given by the person, and each review is classified as either good or bad.
1. Create an example supervised data set for this problem.
2. Formulate a Naive Bayes classifier on this data set to predict the category of a new review (a sketch follows the hint below).
[Hint: Please refer to the given example, Analyzing Textual Data with Natural Language Processing.pdf, for solving questions 2 & 3.]
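For Question 3, here is a minimal Naive Bayes sketch using scikit-learn's CountVectorizer and MultinomialNB. The five one-sentence reviews and their good/bad labels are made up purely for illustration; any similar small data set would do:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Item 1: a small, made-up supervised data set of five one-sentence reviews
reviews = pd.DataFrame({
    "review": [
        "the plot was gripping and the acting superb",
        "a dull and predictable story with weak characters",
        "beautiful cinematography and a touching ending",
        "i was bored for the entire two hours",
        "an entertaining film with great performances",
    ],
    "label": ["good", "bad", "good", "bad", "good"],
})

# Item 2: fit a Naive Bayes classifier on bag-of-words counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews["review"])
clf = MultinomialNB()
clf.fit(X, reviews["label"])

# Predict the category of a new review
new_review = ["a boring and predictable film"]
print(clf.predict(vectorizer.transform(new_review)))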
Note: I've used the SVM algorithm for the SMS spam case scenario.
Source code (in Python):
import re

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Load the data set and keep only the label (v1) and message (v2) columns.
# The Kaggle copy of spam.csv is usually latin-1 encoded; drop the encoding
# argument if your copy is plain UTF-8.
email = pd.read_csv("../input/spam.csv", encoding="latin-1")
email = email.rename(columns={'v1': 'label', 'v2': 'message'})
email = email[['label', 'message']]
email = email.dropna(axis=0, how='any')
num_e = email["message"].size

def processing(raw_email):
    """Remove non-letters, lowercase, and drop English stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", raw_email)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    m_w = [w for w in words if w not in stops]
    return " ".join(m_w)

# Clean every message and store the result in a new column.
clean_email = []
for i in range(0, num_e):
    clean_email.append(processing(email["message"][i]))
email["Processed_Msg"] = clean_email
email = email[["Processed_Msg", "label"]]

# Hold-out split: first 5000 messages for training, messages 5001-5499 for testing.
X_train = email["Processed_Msg"][:5000]
Y_train = email["label"][:5000]
X_test = email["Processed_Msg"][5001:5500]
Y_test = email["label"][5001:5500]

# Bag-of-words features limited to the 5000 most frequent terms.
vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None,
                             stop_words=None, max_features=5000)
train_data_features = vectorizer.fit_transform(X_train).toarray()
test_data_features = vectorizer.transform(X_test).toarray()

# Linear-kernel SVM classifier.
clf = svm.SVC(kernel='linear', C=1.0)
print("Training")
clf.fit(train_data_features, Y_train)

print("Testing")
predicted = clf.predict(test_data_features)
accuracy = np.mean(predicted == Y_test)
print("Accuracy: ", accuracy)

# Classify one unseen message (index 5501) as a quick sanity check.
X = email["Processed_Msg"][5501:5502]
validation_data = vectorizer.transform(X).toarray()
print("SMS: ", X)
classification = clf.predict(validation_data)
print("Classification: ", classification)
Note: If you have any doubts regarding the code, feel free to ask; I'll respond as soon as I can.
And if you like my answer, kindly upvote.