Question

In: Computer Science

Suppose you work for the University of Michigan. The university wants to implement an email spam...

Suppose you work for the University of Michigan. The university wants to implement an email spam filter. Using the mbox.txt file that we used in class, build a filter system that sorts emails into several categories of security levels: level 1 is email from U of M (@umich.edu), level 2 is email from other North American universities (@xyz.edu), level 3 is email from other educational institutions around the world (.uk, .za, etc.), and level 4 is email from other email services (gmail, etc.). Please print out these 4 levels and also list all the senders (only those in the "From" lines) in each category. Example output:

Level 1:
zqian@umich.edu

Level 2:
louis@media.berkeley.edu
ray@media.berkeley.edu
mmmy@indiana.edu

Level 3:
antranig@caret.cam.ac.uk
david.horwitz@uct.ac.za

Level 4:
gopal.ramasammycook@gmail.com

Solutions

Expert Solution

We will walk through the following steps to build this application:

  1. Preparing the text data.
  2. Creating a word dictionary.
  3. Feature extraction process.
  4. Training the classifiers.

1. Preparing the text data.


The data-set used here is split into a training set and a test set containing 702 mails and 260 mails respectively, divided equally between spam and ham mails. You can easily recognize the spam mails because their filenames contain *spmsg*.
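
Since the spam files are identifiable by name, the class labels could also be derived directly from the filenames instead of being hard-coded by index later on. Here is a minimal sketch of my own (assuming the train-mails/test-mails directory layout used further below):

import os
import numpy as np

def make_labels(mail_dir):
    # 1 for spam (filename contains 'spmsg'), 0 for ham; uses the same
    # os.listdir ordering as the feature extraction below so labels stay aligned
    return np.array([1 if 'spmsg' in f else 0 for f in os.listdir(mail_dir)])

# Example: train_labels = make_labels('train-mails')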

In any text mining problem, text cleaning is the first step, where we remove from the document those words that may not contribute to the information we want to extract. Emails may contain a lot of undesirable characters like punctuation marks, stop words, digits, etc., which may not be helpful in detecting spam. The emails in the Ling-spam corpus have already been preprocessed in the following ways:

a) Removal of stop words – Stop words like "and", "the", and "of" are very common in all English sentences and are not very meaningful in deciding spam or legitimate status, so these words have been removed from the emails.

b) Lemmatization – This is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, "include", "includes", and "included" would all be represented as "include". The context of the sentence is also preserved in lemmatization, as opposed to stemming (another common text mining technique, which does not consider the meaning of the sentence).

We still need to remove non-words like punctuation marks and special characters from the mail documents. There are several ways to do it. Here, we will remove such words after creating the dictionary, which is a very convenient approach: once you have a dictionary, you need to remove each such word only once. So for now you don't need to do anything.
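
The Ling-spam corpus arrives already cleaned, but if you were starting from raw email text, stop-word removal and lemmatization could be done with NLTK. This is a minimal sketch of my own (assuming nltk is installed and its stopwords/wordnet corpora have been downloaded):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads on first use:
# nltk.download('stopwords'); nltk.download('wordnet')

def preprocess(text):
    # Lower-case, keep alphabetic tokens, drop stop words, and lemmatize the rest
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    tokens = [w for w in text.lower().split() if w.isalpha() and w not in stop_words]
    return [lemmatizer.lemmatize(w) for w in tokens]

# Example usage: preprocess("This includes the included work")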

2. Creating a word dictionary.
A sample email in the data-set looks like this:

Subject: posting

hi , ' m work phonetics project modern irish ' m hard source . anyone recommend book article english ? ' , specifically interest palatal ( slender ) consonant , work helpful too . thank ! laurel sutton ( sutton @ garnet . berkeley . edu

It can be seen that the first line of the mail is the subject and the 3rd line contains the body of the email. We will only perform text analytics on the body content to detect spam mails. As a first step, we need to create a dictionary of words and their frequencies. For this task, the training set of 702 mails is utilized. This Python function creates the dictionary for you.

def make_Dictionary(train_dir):
    emails = [os.path.join(train_dir,f) for f in os.listdir(train_dir)]    
    all_words = []       
    for mail in emails:    
        with open(mail) as m:
            for i,line in enumerate(m):
                if i == 2:  #Body of email is only 3rd line of text file
                    words = line.split()
                    all_words += words
    
    dictionary = Counter(all_words)
    # Paste code for non-word removal here(code snippet is given below) 
    return dictionary

Once the dictionary is created, we can add the few lines of code written below to the above function to remove the non-words we talked about in step 1. I have also removed absurd single characters from the dictionary, which are irrelevant here. Do not forget to insert the code below inside the function make_Dictionary(train_dir).

# Copy the keys so we can delete entries while iterating (required in Python 3)
list_to_remove = list(dictionary)
for item in list_to_remove:
    if item.isalpha() == False:
        del dictionary[item]
    elif len(item) == 1:
        del dictionary[item]
# Keep only the 3000 most common words
dictionary = dictionary.most_common(3000)

The dictionary can be inspected with print(dictionary). You may find some absurd word counts to be high, but don't worry, it's just a dictionary and you always have the scope of improving it later. If you are following along with the provided data-set, make sure your dictionary has some of the entries given below among the most frequent words. Here I have kept the 3000 most frequently used words in the dictionary.

[('order', 1414), ('address', 1293), ('report', 1216), ('mail', 1127), ('send', 1079), ('language', 1072), ('email', 1051), ('program', 1001), ('our', 987), ('list', 935), ('one', 917), ('name', 878), ('receive', 826), ('money', 788), ('free', 762), ...]

3. Feature extraction process.


Once the dictionary is ready, we can extract a word count vector (our feature here) of 3000 dimensions for each email of the training set. Each word count vector contains the frequency of the 3000 dictionary words in the training file. Of course you might have guessed by now that most of them will be zero. Let us take an example. Suppose we have 500 words in our dictionary. Each word count vector then contains the frequency of those 500 dictionary words in the training file. Suppose the text in the training file was "Get the work done, work done"; it would be encoded as [0,0,0,0,0,…….0,0,2,0,0,0,……,0,0,1,0,0,…0,0,1,0,0,……2,0,0,0,0,0]. Here, the non-zero word counts are placed at the 296th, 359th, 415th, and 495th indices of the 500-length word count vector and the rest are zero.
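
To make the encoding concrete, here is a tiny illustrative sketch with a toy five-word dictionary of my own (not part of the original code):

import numpy as np

# Toy dictionary: each word owns one fixed position in the feature vector
toy_dictionary = ['done', 'get', 'work', 'spam', 'free']
sentence = "get the work done work done".split()

# Count how often each dictionary word occurs in the sentence
vector = np.array([sentence.count(word) for word in toy_dictionary])
print(vector)  # [2 1 2 0 0]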

The Python code below will generate a feature vector matrix whose rows denote the 702 files of the training set and whose columns denote the 3000 words of the dictionary. The value at index 'ij' will be the number of occurrences of the jth dictionary word in the ith file.

def extract_features(mail_dir):
    files = [os.path.join(mail_dir,fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files),3000))
    docID = 0
    for fil in files:
        with open(fil) as fi:
            for i,line in enumerate(fi):
                if i == 2:  # Body of email is the 3rd line of the text file
                    words = line.split()
                    for word in words:
                        wordID = 0
                        for j,d in enumerate(dictionary):
                            if d[0] == word:
                                wordID = j
                                features_matrix[docID,wordID] = words.count(word)
        docID = docID + 1
    return features_matrix
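
Scanning the 3000-entry dictionary list for every word is slow on larger corpora. A word-to-index lookup table gives the same matrix much faster; this is a small variation of my own, not part of the original post:

def extract_features_fast(mail_dir):
    # Same output as extract_features, but with an O(1) word -> column lookup
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    word_index = {word: idx for idx, (word, count) in enumerate(dictionary)}
    for docID, fil in enumerate(files):
        with open(fil) as fi:
            for i, line in enumerate(fi):
                if i == 2:  # body of the email is the 3rd line
                    for word in line.split():
                        if word in word_index:
                            features_matrix[docID, word_index[word]] += 1
    return features_matrix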

4. Training the classifiers.


Here, I will be using the scikit-learn ML library for training the classifiers. It is an open-source Python ML library which comes bundled with the Anaconda distribution, or it can be installed separately (for example with pip install scikit-learn). Once installed, we only need to import it in our program.

I have trained two models here, namely a Naive Bayes classifier and a Support Vector Machine (SVM). The Naive Bayes classifier is a conventional and very popular method for document classification problems. It is a supervised probabilistic classifier based on Bayes' theorem, assuming independence between every pair of features. SVMs are supervised binary classifiers which are very effective when you have a large number of features. The goal of an SVM is to find the hyper-plane that best separates the two classes; the training points closest to this separating boundary are called support vectors. The decision function of the SVM model that predicts the class of the test data is based on these support vectors and can make use of a kernel trick.

Once the classifiers are trained, we can check the performance of the models on the test set. We extract the word count vector for each mail in the test set and predict its class (ham or spam) with the trained NB classifier and SVM model. Below is the full code for the spam filtering application. You have to include the two functions we defined in step 2 and step 3.

import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn.metrics import confusion_matrix

# Create a dictionary of words with its frequency

train_dir = 'train-mails'
dictionary = make_Dictionary(train_dir)

# Prepare feature vectors per training mail and its labels

train_labels = np.zeros(702)
train_labels[351:702] = 1
train_matrix = extract_features(train_dir)

# Training SVM and Naive bayes classifier

model1 = MultinomialNB()
model2 = LinearSVC()
model1.fit(train_matrix,train_labels)
model2.fit(train_matrix,train_labels)

# Test the unseen mails for Spam
test_dir = 'test-mails'
test_matrix = extract_features(test_dir)
test_labels = np.zeros(260)
test_labels[130:260] = 1
result1 = model1.predict(test_matrix)
result2 = model2.predict(test_matrix)
print(confusion_matrix(test_labels,result1))
print(confusion_matrix(test_labels,result2))
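
As a quick follow-up of my own (not in the original post), the same results can be summarized as a single accuracy figure with the classifiers' score method:

# Fraction of test mails classified correctly by each model
print("Naive Bayes accuracy:", model1.score(test_matrix, test_labels))
print("Linear SVM accuracy:", model2.score(test_matrix, test_labels))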
