The five most common words appearing in spam emails are shipping!, today!, here!, available, and fingertips!. Many spam filters separate spam from ham (email not considered to be spam) through application of Bayes' theorem. Suppose that for one email account, 1 in every 10 messages is spam and the proportions of spam messages that have the five most common words in spam email are given below.
shipping! 0.050
today! 0.047
here! 0.034
available 0.016
fingertips! 0.016
Also suppose that the proportions of ham messages that have these words are
shipping! 0.0016
today! 0.0021
here! 0.0021
available 0.0041
fingertips! 0.0010
Round your answers to three decimal places.
a. If a message includes the word shipping!, what is the probability the message is spam?
If a message includes the word shipping!, what is the probability the message is ham?
Should messages that include the word shipping! be flagged as spam?
b. If a message includes the word today!, what is the probability the message is spam?
If a message includes the word here!, what is the probability the message is spam?
Which of these two words is a stronger indicator that a message is spam?
Why?
Because the probability is
c. If a message includes the word available, what is the probability the message is spam?
If a message includes the word fingertips!, what is the probability the message is spam?
Which of these two words is a stronger indicator that a message is spam?
Why?
Because the probability is
d. What insights do the results of parts (b) and (c) yield about what enables a spam filter that uses Bayes' theorem to work effectively?
Explain.
It is easier to distinguish spam from ham when a word occurs more often in spam and less often in ham.
Let S denote the event that a message is spam and H the event that a message is ham. Since 1 in every 10 messages is spam,
P(S) = 1 /10 = 0.10, P(H) = 1 - P(S) = 1 - 0.10 = 0.90
From the given information we have
P(shipping|S) = 0.050, P(today|S) = 0.047, P(here|S) = 0.034, P(available|S) = 0.016, P(fingertips|S) = 0.016
P(shipping|H) = 0.0016, P(today|H) = 0.0021, P(here|H) = 0.0021, P(available|H) = 0.0041, P(fingertips|H) = 0.0010
By the law of total probability we have
P(shipping) = P(shipping|S) P(S) + P(shipping|H) P(H) = 0.050 * 0.10 + 0.0016 * 0.90 = 0.00644
P(today) = P(today|S) P(S) + P(today|H) P(H) = 0.047 * 0.10 + 0.0021 * 0.90 = 0.00659
P(here) = P(here|S) P(S) + P(here|H) P(H) = 0.034 * 0.10 + 0.0021 * 0.90 = 0.00529
P(available) = P(available|S) P(S) + P(available|H) P(H) = 0.016 * 0.10 + 0.0041 * 0.90 = 0.00529
P(fingertips) = P(fingertips|S) P(S) + P(fingertips|H) P(H) = 0.016 * 0.10 + 0.0010 * 0.90 = 0.0025
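The five marginal probabilities above can be double-checked with a short script. This is only a sketch of the same arithmetic; the names P_SPAM, P_HAM, LIKELIHOODS, and marginal are chosen here for illustration and do not come from the problem.

```python
# Prior probabilities from the problem statement (1 in every 10 messages is spam).
P_SPAM = 0.10
P_HAM = 1 - P_SPAM

# word -> (P(word | spam), P(word | ham)), copied from the two tables in the question.
LIKELIHOODS = {
    "shipping!": (0.050, 0.0016),
    "today!": (0.047, 0.0021),
    "here!": (0.034, 0.0021),
    "available": (0.016, 0.0041),
    "fingertips!": (0.016, 0.0010),
}


def marginal(word):
    """Return P(word) using the law of total probability."""
    p_word_given_spam, p_word_given_ham = LIKELIHOODS[word]
    return p_word_given_spam * P_SPAM + p_word_given_ham * P_HAM


for word in LIKELIHOODS:
    print(f"P({word}) = {marginal(word):.5f}")
```

Note that here! and available end up with the same marginal probability even though their spam and ham proportions differ, which is why the marginal alone says nothing about whether a word indicates spam.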
(a)
If a message includes the word shipping!, what is the probability the message is spam?
P(S | shipping) = [P(shipping |S) P(S)] / P(shipping) = [0.050 *0.10] / 0.00644 = 0.776
If a message includes the word shipping!, what is the probability the message is ham?
P(H | shipping) = [P(shipping |H) P(H)] / P(shipping) = [0.0016 *0.90] / 0.00644 = 0.224
Should messages that include the word shipping! be flagged as spam?
No, because P(H | shipping) = 0.224 is not very low: about 22% of the messages containing shipping! would be legitimate mail flagged by mistake.
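The calculation in part (a) can be reproduced in a few lines, and doing so makes clear that the flagging decision depends on how much certainty is required. In the sketch below, the function name posterior_spam and both threshold values are assumptions made for illustration; the problem itself does not specify a cutoff.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) by Bayes' theorem."""
    p_ham = 1 - p_spam
    # Denominator: P(word) from the law of total probability.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


p_spam_shipping = posterior_spam(0.050, 0.0016, 0.10)
p_ham_shipping = 1 - p_spam_shipping  # spam and ham are complementary events

print(round(p_spam_shipping, 3), round(p_ham_shipping, 3))  # prints 0.776 0.224

# Hypothetical cutoffs, not given in the problem: a filter that must avoid
# losing legitimate mail might demand 0.90, a more aggressive one only 0.50.
STRICT_THRESHOLD = 0.90
LENIENT_THRESHOLD = 0.50

flag_strict = p_spam_shipping >= STRICT_THRESHOLD
flag_lenient = p_spam_shipping >= LENIENT_THRESHOLD
print(flag_strict, flag_lenient)  # prints False True
```

Under the stricter assumed cutoff the word alone is not enough to flag a message, while under the lenient one it is.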
(b)
If a message includes the word today!, what is the probability the message is spam?
P(S|today) = [P(today|S)P(S)] / P(today) = [0.047 * 0.10] /0.00659 = 0.713
If a message includes the word here!, what is the probability the message is spam?
P(S|here) = [P(here|S)P(S)] / P(here) = [0.034 * 0.10] /0.00529 = 0.643
Which of these two words is a stronger indicator that a message is spam? today
Why? Because the probability P(S|today) is larger.
(c)
If a message includes the word available, what is the probability the message is spam?
P(S|available) = [P(available|S)P(S)] / P(available) = [0.016 * 0.10] /0.00529 = 0.302
If a message includes the word fingertips!, what is the probability the message is spam?
P(S|fingertips) = [P(fingertips|S)P(S)] / P(fingertips) = [ 0.016 * 0.10] /0.0025 = 0.640
Which of these two words is a stronger indicator that a message is spam? fingertips
Why?
Because the probability P(S|fingertips) is larger.
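The comparisons in parts (b) and (c) amount to ranking the words by P(S | word). A minimal sketch of that ranking follows; the dictionary layout and the names LIKELIHOODS, posterior_spam, and ranking are illustrative choices, not part of the problem.

```python
P_SPAM = 0.10  # prior from the problem statement

# word -> (P(word | spam), P(word | ham)), copied from the question.
LIKELIHOODS = {
    "shipping!": (0.050, 0.0016),
    "today!": (0.047, 0.0021),
    "here!": (0.034, 0.0021),
    "available": (0.016, 0.0041),
    "fingertips!": (0.016, 0.0010),
}


def posterior_spam(word):
    """Return P(spam | word) by Bayes' theorem."""
    p_word_given_spam, p_word_given_ham = LIKELIHOODS[word]
    numerator = p_word_given_spam * P_SPAM
    return numerator / (numerator + p_word_given_ham * (1 - P_SPAM))


# Strongest spam indicator first.
ranking = sorted(LIKELIHOODS, key=posterior_spam, reverse=True)
for word in ranking:
    print(f"{word:12s} {posterior_spam(word):.3f}")
```

The ranking places today! above here! and fingertips! above available, matching the answers to (b) and (c).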
(d)
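Applying Bayes' theorem shows that a word is a strong spam indicator when it appears far more often in spam than in ham. In part (c), available and fingertips! occur in the same proportion of spam messages (0.016), but available is about four times as common in ham (0.0041 versus 0.0010), so its posterior probability of spam is much lower (0.302 versus 0.640). A Bayesian spam filter therefore works effectively when it relies on words that are common in spam and rare in ham.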
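One way to make the answer to part (d) concrete is the odds form of Bayes' theorem: posterior odds = prior odds * [P(word|S) / P(word|H)]. This is an equivalent rewriting of the formula used above, not a different method, and the helper names likelihood_ratio and posterior_from_ratio are introduced here only for illustration.

```python
P_SPAM = 0.10  # prior from the problem statement


def likelihood_ratio(p_word_given_spam, p_word_given_ham):
    """How many times more often the word appears in spam than in ham."""
    return p_word_given_spam / p_word_given_ham


def posterior_from_ratio(ratio, p_spam=P_SPAM):
    """Convert a likelihood ratio into P(spam | word) via the odds form of Bayes' theorem."""
    prior_odds = p_spam / (1 - p_spam)
    posterior_odds = prior_odds * ratio
    return posterior_odds / (1 + posterior_odds)


# available and fingertips! have the same P(word | spam) but different P(word | ham).
lr_available = likelihood_ratio(0.016, 0.0041)
lr_fingertips = likelihood_ratio(0.016, 0.0010)

print(f"available:   ratio {lr_available:.1f}, P(S|word) = {posterior_from_ratio(lr_available):.3f}")
print(f"fingertips!: ratio {lr_fingertips:.1f}, P(S|word) = {posterior_from_ratio(lr_fingertips):.3f}")
```

With the prior fixed, the posterior depends on the word only through this ratio, so the filter separates spam from ham best with words whose ratio is large.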