Question

The five most common words appearing in spam emails are shipping!, today!, here!, available, and fingertips!. Many spam filters separate spam from ham (email not considered to be spam) by applying Bayes' theorem. Suppose that for one email account, 1 in every 10 messages is spam, and that the proportions of spam messages containing each of the five most common spam words are as given below.

Word           Proportion of spam messages
shipping!      0.050
today!         0.047
here!          0.034
available      0.016
fingertips!    0.016


Also suppose that the proportions of ham messages that have these words are

Word           Proportion of ham messages
shipping!      0.0016
today!         0.0021
here!          0.0021
available      0.0041
fingertips!    0.0010

Round your answers to three decimal places.

a. If a message includes the word shipping!, what is the probability the message is spam?

If a message includes the word shipping!, what is the probability the message is ham?

Should messages that include the word shipping! be flagged as spam?

b. If a message includes the word today!, what is the probability the message is spam?

If a message includes the word here!, what is the probability the message is spam?

Which of these two words is a stronger indicator that a message is spam?

Why?

Because the probability is

c. If a message includes the word available, what is the probability the message is spam?

If a message includes the word fingertips!, what is the probability the message is spam?

Which of these two words is a stronger indicator that a message is spam?

Why?

Because the probability is

d. What insights do the results of parts (b) and (c) yield about what enables a spam filter that uses Bayes' theorem to work effectively?

Explain.

It is easier to distinguish spam from ham when a word occurs more often in spam and less often in ham.

Solutions

Expert Solution

Let S denote the event that a message is spam and H the event that a message is ham. Assuming that 1 in every 10 messages is spam, we have

P(S) = 1 /10 = 0.10, P(H) = 1 - P(S) = 1 - 0.10 = 0.90

From the given information we have

P(shipping | S) = 0.050, P(today | S) = 0.047, P(here | S) = 0.034, P(available | S) = 0.016, P(fingertips | S) = 0.016

P(shipping | H) = 0.0016, P(today | H) = 0.0021, P(here | H) = 0.0021, P(available | H) = 0.0041, P(fingertips | H) = 0.0010

By the law of total probability we have

P(shipping) = P(shipping | S) P(S) + P(shipping | H) P(H) = 0.050 * 0.10 + 0.0016 * 0.90 = 0.00644

P(today) = P(today | S) P(S) + P(today | H) P(H) = 0.047 * 0.10 + 0.0021 * 0.90 = 0.00659

P(here) = P(here | S) P(S) + P(here | H) P(H) = 0.034 * 0.10 + 0.0021 * 0.90 = 0.00529

P(available) = P(available | S) P(S) + P(available | H) P(H) = 0.016 * 0.10 + 0.0041 * 0.90 = 0.00529

P(fingertips) = P(fingertips | S) P(S) + P(fingertips | H) P(H) = 0.016 * 0.10 + 0.0010 * 0.90 = 0.0025
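The total-probability arithmetic above can be checked with a short Python sketch. The variable and function names (P_SPAM, P_WORD_GIVEN_SPAM, total_probability) are illustrative choices, not part of the original problem, and the prior of 0.10 is the solution's assumption that 1 in every 10 messages is spam.

```python
# Prior assumed in the solution: 1 in every 10 messages is spam.
P_SPAM = 0.10
P_HAM = 1 - P_SPAM

# P(word | S) and P(word | H), copied from the problem statement.
P_WORD_GIVEN_SPAM = {
    "shipping!": 0.050,
    "today!": 0.047,
    "here!": 0.034,
    "available": 0.016,
    "fingertips!": 0.016,
}
P_WORD_GIVEN_HAM = {
    "shipping!": 0.0016,
    "today!": 0.0021,
    "here!": 0.0021,
    "available": 0.0041,
    "fingertips!": 0.0010,
}


def total_probability(word):
    """Law of total probability: P(word) = P(word | S) P(S) + P(word | H) P(H)."""
    return P_WORD_GIVEN_SPAM[word] * P_SPAM + P_WORD_GIVEN_HAM[word] * P_HAM


# Print the marginal probability of each word to five decimal places.
for word in P_WORD_GIVEN_SPAM:
    print(f"P({word}) = {total_probability(word):.5f}")
```

Note that P(here) and P(available) both come out to 0.00529 even though the two words have different conditional probabilities; the agreement is a coincidence of the numbers.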

(a)

If a message includes the word shipping!, what is the probability the message is spam?

P(S | shipping) = [P(shipping |S) P(S)]  / P(shipping) = [0.050 *0.10] / 0.00644 = 0.776

If a message includes the word shipping!, what is the probability the message is ham?

P(H | shipping) = [P(shipping |H) P(H)]  / P(shipping) = [0.0016 *0.90] / 0.00644 = 0.224

Should messages that include the word shipping! be flagged as spam?

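No. Although P(S | shipping) = 0.776 is high, P(H | shipping) = 0.224 is not very low, so flagging every message that contains shipping! would misclassify roughly 22% of those messages, which are actually ham.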
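As a quick check on part (a), here is a minimal Python sketch of Bayes' theorem for the two-class case. The function name posterior_spam and its parameters are hypothetical, and the default prior of 0.10 is the solution's assumption.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam=0.10):
    """Return P(S | word) by Bayes' theorem for two classes, spam and ham."""
    p_ham = 1 - p_spam
    numerator = p_word_given_spam * p_spam             # P(word | S) P(S)
    evidence = numerator + p_word_given_ham * p_ham    # P(word), by total probability
    return numerator / evidence


# Values for shipping! from the problem statement.
p_spam_given_shipping = posterior_spam(0.050, 0.0016)
# Spam and ham are complementary, so the two posteriors sum to 1.
p_ham_given_shipping = 1 - p_spam_given_shipping

print(round(p_spam_given_shipping, 3), round(p_ham_given_shipping, 3))  # prints 0.776 0.224
```

Because a message is either spam or ham, P(H | shipping) can be obtained as 1 - P(S | shipping) instead of applying the theorem a second time.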

(b)

If a message includes the word today!, what is the probability the message is spam?

P(S | today) = [P(today | S) P(S)] / P(today) = [0.047 * 0.10] / 0.00659 = 0.713

If a message includes the word here!, what is the probability the message is spam?

P(S | here) = [P(here | S) P(S)] / P(here) = [0.034 * 0.10] / 0.00529 = 0.643

Which of these two words is a stronger indicator that a message is spam? today!

Why? Because P(S | today) = 0.713 is larger than P(S | here) = 0.643.

(c)

If a message includes the word available, what is the probability the message is spam?

P(S | available) = [P(available | S) P(S)] / P(available) = [0.016 * 0.10] / 0.00529 = 0.302

If a message includes the word fingertips!, what is the probability the message is spam?

P(S | fingertips) = [P(fingertips | S) P(S)] / P(fingertips) = [0.016 * 0.10] / 0.0025 = 0.640

Which of these two words is a stronger indicator that a message is spam? fingertips!

Why?

Because P(S | fingertips) = 0.640 is larger than P(S | available) = 0.302.
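To compare all five words at once rather than pair by pair, the posteriors can be computed and sorted. This is only a sketch: the names LIKELIHOODS, posteriors, and ranked are illustrative, and the prior of 0.10 is again the solution's assumption.

```python
P_SPAM = 0.10  # assumed prior: 1 in every 10 messages is spam

# word -> (P(word | S), P(word | H)), from the problem statement
LIKELIHOODS = {
    "shipping!": (0.050, 0.0016),
    "today!": (0.047, 0.0021),
    "here!": (0.034, 0.0021),
    "available": (0.016, 0.0041),
    "fingertips!": (0.016, 0.0010),
}


def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam=P_SPAM):
    """P(S | word) by Bayes' theorem for the two classes spam and ham."""
    numerator = p_word_given_spam * p_spam
    return numerator / (numerator + p_word_given_ham * (1 - p_spam))


# Posterior probability of spam for every word.
posteriors = {word: posterior_spam(ps, ph) for word, (ps, ph) in LIKELIHOODS.items()}

# Words ordered from strongest to weakest spam indicator.
ranked = sorted(posteriors, key=posteriors.get, reverse=True)

for word in ranked:
    print(f"{word:12s} P(S | word) = {posteriors[word]:.3f}")
```

With these inputs the ordering is shipping!, today!, here!, fingertips!, available, which agrees with the pairwise comparisons in parts (b) and (c).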

(d)

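Bayes' theorem shows that each of these words occurs more often in spam messages than in ham messages, and that difference is what enables a spam filter based on Bayes' theorem to work effectively. The larger P(word | S) is relative to P(word | H), the larger the posterior probability P(S | word). For example, available and fingertips! are equally common in spam (0.016), but fingertips! is rarer in ham (0.0010 versus 0.0041), so it is the stronger indicator.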
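One way to make this insight concrete is the odds form of Bayes' theorem: posterior odds = prior odds * [P(word | S) / P(word | H)]. The sketch below is not part of the original solution. It uses hypothetical names (likelihood_ratio, posterior_from_odds) and the assumed prior of 0.10 to show that the ratio alone determines how strongly a word points to spam.

```python
P_SPAM = 0.10                        # assumed prior: 1 in every 10 messages is spam
PRIOR_ODDS = P_SPAM / (1 - P_SPAM)   # odds of spam before seeing any word

# word -> (P(word | S), P(word | H)), from the problem statement
LIKELIHOODS = {
    "shipping!": (0.050, 0.0016),
    "today!": (0.047, 0.0021),
    "here!": (0.034, 0.0021),
    "available": (0.016, 0.0041),
    "fingertips!": (0.016, 0.0010),
}


def likelihood_ratio(word):
    """How many times more often the word appears in spam than in ham."""
    p_spam_word, p_ham_word = LIKELIHOODS[word]
    return p_spam_word / p_ham_word


def posterior_from_odds(word):
    """P(S | word) obtained from posterior odds = prior odds * likelihood ratio."""
    odds = PRIOR_ODDS * likelihood_ratio(word)
    return odds / (1 + odds)


for word in LIKELIHOODS:
    print(f"{word:12s} ratio = {likelihood_ratio(word):6.2f}  "
          f"P(S | word) = {posterior_from_odds(word):.3f}")
```

The two words with the same P(word | S) = 0.016 illustrate the point: fingertips! has a ratio of about 16, while available has a ratio of about 3.9, which is why their posteriors differ (0.640 versus 0.302).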

