The five most common words appearing in spam emails are shipping!, today!, here!, available, and fingertips!. Many spam filters separate spam from ham (email not considered to be spam) through application of Bayes' theorem. Suppose that for one email account, in every messages is spam and the proportions of spam messages that have the five most common words in spam email are given below.
shipping! 0.050
today! 0.047
here! 0.034
available 0.016
fingertips! 0.016
Also suppose that the proportions of ham messages that have these words are
shipping! 0.0016
today! 0.0021
here! 0.0021
available 0.0041
fingertips! 0.0010
Round your answers to three decimal places.
a. If a message includes the word shipping!, what is the probability the message is spam?
If a message includes the word shipping!, what is the probability the message is ham?
Should messages that include the word shipping! be flagged as spam?
b. If a message includes the word today!, what is the probability the message is spam?
If a message includes the word here!, what is the probability the message is spam?
Which of these two words is a stronger indicator that a message is spam?
Why?
Because the probability is
c. If a message includes the word available, what is the probability the message is spam?
If a message includes the word fingertips!, what is the probability the message is spam?
Which of these two words is a stronger indicator that a message is spam?
Why?
Because the probability is
d. What insights do the results of parts (b) and (c) yield about what enables a spam filter that uses Bayes' theorem to work effectively?
Explain.
It is easier to distinguish spam from ham when a word occurs more often in spam and less often in ham.
a. Probability that a message is spam given that it has the word shipping:
Probability that a message is ham given that it has the word shipping:
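Both probabilities follow from Bayes' theorem. As a sketch, write w for the word in question and P(spam) for the overall proportion of spam in the account (the prior given in the problem statement); the symbol names here are chosen for illustration:

```latex
P(\text{spam} \mid w)
  = \frac{P(w \mid \text{spam})\,P(\text{spam})}
         {P(w \mid \text{spam})\,P(\text{spam}) + P(w \mid \text{ham})\,P(\text{ham})},
\qquad
P(\text{ham} \mid w) = 1 - P(\text{spam} \mid w),
\qquad
P(\text{ham}) = 1 - P(\text{spam}).
```

For shipping!, the tables give P(w | spam) = 0.050 and P(w | ham) = 0.0016.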
Yes, messages that include shipping! should be flagged as spam, because the probability that such a message is spam is much higher than the probability that it is ham.
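The calculation in part (a) can be sketched as a small Python function. The function name `posterior_spam` and its parameter names are made up for this illustration, and the prior proportion of spam is left as an argument that must be taken from the problem statement rather than from this code.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, prior_spam):
    """Return P(spam | word) using Bayes' theorem.

    prior_spam is the overall proportion of messages that are spam;
    it is an input here, not a value supplied by this sketch.
    """
    if not 0.0 < prior_spam < 1.0:
        raise ValueError("prior_spam must be strictly between 0 and 1")
    # Joint probability of (word, spam) and of (word, ham).
    joint_spam = p_word_given_spam * prior_spam
    joint_ham = p_word_given_ham * (1.0 - prior_spam)
    # Normalize by the total probability of seeing the word.
    return joint_spam / (joint_spam + joint_ham)


def posterior_ham(p_word_given_spam, p_word_given_ham, prior_spam):
    """Return P(ham | word), the complement of P(spam | word)."""
    return 1.0 - posterior_spam(p_word_given_spam, p_word_given_ham, prior_spam)
```

Calling `round(posterior_spam(0.050, 0.0016, prior), 3)` with the prior from the problem gives the three-decimal answer for shipping!; the other words work the same way with their own table entries.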
b. Probability that a message is spam given that it has the word today:
Probability that a message is spam given that it has the word here:
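The word today! is the stronger indicator. Both words appear in ham at the same rate (0.0021), but today! appears in spam more often (0.047 versus 0.034), so the probability that a message containing it is spam is higher.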
c. Probability that a message is spam given that it has the word available:
Probability that a message is spam given that it has the word fingertips:
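The two probabilities are not the same. Both words appear in spam at the same rate (0.016), but available appears in ham about four times as often as fingertips! (0.0041 versus 0.0010). The probability of spam is therefore higher for fingertips!, which makes it the stronger indicator.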
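d. Parts (b) and (c) show that what matters is how much more often a word appears in spam than in ham. In part (b), today! is the stronger indicator because it is more common in spam at the same ham rate. In part (c), fingertips! is the stronger indicator because it is less common in ham at the same spam rate. A spam filter based on Bayes' theorem works best with words that occur often in spam and rarely in ham.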
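One way to check the comparisons in parts (b) through (d) without knowing the prior is to rank the words by their likelihood ratio, P(word | spam) / P(word | ham). Bayes' theorem in odds form says posterior odds equal prior odds times this ratio, so for any fixed prior a larger ratio means a larger probability of spam. The sketch below uses only the two tables from the problem; the variable names are chosen for illustration.

```python
# Proportion of spam messages containing each word (from the problem).
spam_rate = {
    "shipping!": 0.050,
    "today!": 0.047,
    "here!": 0.034,
    "available": 0.016,
    "fingertips!": 0.016,
}

# Proportion of ham messages containing each word (from the problem).
ham_rate = {
    "shipping!": 0.0016,
    "today!": 0.0021,
    "here!": 0.0021,
    "available": 0.0041,
    "fingertips!": 0.0010,
}

# Likelihood ratio: how many times more common the word is in spam than in ham.
likelihood_ratio = {w: spam_rate[w] / ham_rate[w] for w in spam_rate}

# Strongest spam indicator first.
ranked = sorted(likelihood_ratio, key=likelihood_ratio.get, reverse=True)

for word in ranked:
    print(f"{word:12s} {likelihood_ratio[word]:6.2f}")
```

The ranking places today! above here! and fingertips! above available, which matches the answers to parts (b) and (c).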