Question

In: Computer Science


Perform SVM training and testing using the SMS spam data set spam.csv. The objective is to predict the class of a new SMS using an SVM classifier. The data set is a collection of SMS messages collected for SMS spam research. It contains 5,574 messages in English, each tagged as ham (legitimate) or spam. The file contains one message per line; each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

Data set: The details of the learning and testing tasks are as follows:

1. Understand the distribution of classes in the data set using suitable plots.

2. Plot the distribution of frequent words under the "spam" and "ham" classes.

3. Preprocess the data set if required.

4. Apply the SVM classifier.

5. Use cross-validation and hold-out approaches to learn and evaluate the SVM classifier.

6. Discuss the results achieved by the SVM classifier using the confusion matrix, sensitivity, specificity, and accuracy.
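Tasks 1 and 2 can be sketched as follows. This is only an illustration: it uses a tiny invented sample in place of spam.csv (same v1/v2 column layout as described above) so it runs stand-alone; with the real file, the same counts would feed a bar plot.

```python
import pandas as pd
from collections import Counter

# Tiny invented stand-in for spam.csv (v1 = label, v2 = raw text)
data = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam", "ham"],
    "v2": ["see you at lunch", "WIN a FREE prize now", "call me later",
           "FREE entry win cash", "on my way home"],
})

# Task 1: class distribution (a bar plot of these counts shows the imbalance)
class_counts = data["v1"].value_counts()
print(class_counts)
# class_counts.plot(kind="bar")  # uncomment if matplotlib is installed

# Task 2: most frequent words under each class
for label in ["spam", "ham"]:
    words = " ".join(data.loc[data["v1"] == label, "v2"]).lower().split()
    print(label, Counter(words).most_common(3))
```

With the real data set, `data` would simply be the loaded spam.csv, and `most_common` could be raised to 20 or so words per class.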

Question 3 Consider a problem where we are given a collection of reviews of a movie by 5 people. Each review is a sentence summarising the person's comments. Each review is classified as either good or bad.

1. Create an example supervised data set for this problem.

2. Formulate a Naive Bayes classifier on this data set to predict the category of a new review.

[Hint:- Please refer to the given example Analyzing Textual Data with Natural Language Processing.pdf for solving question 2 & 3.]
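A minimal sketch of what Question 3 asks for, assuming scikit-learn's MultinomialNB as the Naive Bayes implementation; the five one-sentence reviews and their labels are invented examples:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# 1. Example supervised data set: 5 one-sentence reviews labelled good/bad (invented)
reviews = [
    "great acting and a wonderful story",
    "terrible plot and poor acting",
    "wonderful direction loved every scene",
    "boring slow and a waste of time",
    "great soundtrack and brilliant cast",
]
labels = ["good", "bad", "good", "bad", "good"]

# 2. Naive Bayes over bag-of-words counts
vec = CountVectorizer()
X = vec.fit_transform(reviews)
clf = MultinomialNB()
clf.fit(X, labels)

# Predict the category of a new review
new_review = ["wonderful story and brilliant acting"]
print(clf.predict(vec.transform(new_review)))  # → ['good'] for this toy data
```

The new review shares most of its words with the "good" examples, so the smoothed word likelihoods and the 3/5 "good" prior both favour that class.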

Solutions

Expert Solution

Note: I've used the SVM algorithm for this case scenario:

Source Code (Python):

import pandas as pd
import re
from nltk.corpus import stopwords
from sklearn import svm
# NOTE: the stop word list may need a one-time download: nltk.download('stopwords')

# The Kaggle spam.csv is typically Latin-1 encoded; adjust if your copy differs
email = pd.read_csv("../input/spam.csv", encoding="latin-1")
email = email.rename(columns={'v1': 'label', 'v2': 'message'})
email = email[['label', 'message']]
# Drop incomplete rows and reset the index so positional access below works
email = email.dropna(axis=0, how='any').reset_index(drop=True)

num_e = email["message"].size

def processing(raw_email):
    # Keep letters only, lowercase, and remove English stop words
    letters_only = re.sub("[^a-zA-Z]", " ", raw_email)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    m_w = [w for w in words if w not in stops]
    return " ".join(m_w)


# Clean every message
clean_email = []
for i in range(num_e):
    clean_email.append(processing(email["message"][i]))

email["Processed_Msg"] = clean_email
email = email[["Processed_Msg", "label"]]


# Hold-out split: first 5000 messages for training, next 500 for testing
X_train = email["Processed_Msg"][:5000]
Y_train = email["label"][:5000]
X_test = email["Processed_Msg"][5000:5500]
Y_test = email["label"][5000:5500]


import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words features, capped at the 5000 most frequent terms
vectorizer = CountVectorizer(analyzer="word", max_features=5000)

train_data_features=vectorizer.fit_transform(X_train)
train_data_features=train_data_features.toarray()

test_data_features=vectorizer.transform(X_test)
test_data_features=test_data_features.toarray()

# Linear-kernel SVM
clf = svm.SVC(kernel='linear', C=1.0)
print("Training")
clf.fit(train_data_features, Y_train)

print("Testing")
predicted = clf.predict(test_data_features)
accuracy = np.mean(predicted == Y_test)
print("Accuracy:", accuracy)


# Classify one additional held-back message as a quick sanity check
X = email["Processed_Msg"][5500:5501]
validation_data = vectorizer.transform(X).toarray()

print("SMS:", X)
classification = clf.predict(validation_data)
print("Classification:", classification)
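The code above only reports hold-out accuracy, but tasks 5 and 6 also ask for cross-validation and for the confusion matrix, sensitivity, and specificity. A hedged sketch of both follows; it uses a small synthetic stand-in for the vectorised SMS features so it runs on its own, and the real `train_data_features` / `Y_train` from above would drop in directly.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the bag-of-words features: two separable classes
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(2, 1, (50, 5))])
y = np.array(["ham"] * 50 + ["spam"] * 50)

clf = svm.SVC(kernel="linear", C=1.0)

# Task 5: 5-fold cross-validation accuracy
scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy:", scores.mean())

# Task 6: confusion matrix, sensitivity, specificity on a stratified hold-out
train_idx = np.r_[0:40, 50:90]    # 40 ham + 40 spam for training
test_idx = np.r_[40:50, 90:100]   # 10 ham + 10 spam held out
clf.fit(X[train_idx], y[train_idx])
pred = clf.predict(X[test_idx])

# With labels=["ham", "spam"], spam is treated as the positive class
tn, fp, fn, tp = confusion_matrix(y[test_idx], pred, labels=["ham", "spam"]).ravel()
sensitivity = tp / (tp + fn)   # true-positive rate: spam caught
specificity = tn / (tn + fp)   # true-negative rate: ham kept
accuracy = (tp + tn) / (tp + tn + fp + fn)
print("Sensitivity:", sensitivity, "Specificity:", specificity, "Accuracy:", accuracy)
```

On the real data, replacing `X`/`y` with the vectorised messages and labels gives the cross-validated and confusion-matrix view of the same classifier that the discussion in task 6 calls for.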

Note: If you have any doubts regarding the code, feel free to ask; I'll respond as soon as I can. And if you like my answer, kindly upvote.

