Bayesian Spam Filters: Key Concepts (Conditional Probability, Independence, Bayes' Theorem)

Upload: vincent-casey

Post on 17-Dec-2015



Bayesian Spam Filters

• Key Concepts
– Conditional Probability
– Independence
– Bayes' Theorem


Spam or Ham?

FROM: Terry Delaney [removed]
TO: (removed)
Subject: FDA approved on-line pharmacies! click here (removed)
Chose your product and site below:
Canadian pharmacy (removed) - Cialis Soft Tabs - $5.78, Viagra Professional - $4.07, Soma - $1.38, Human Growth Hormone - $43.37, Meridia - $3.32, Tramadol - $2.17, Levitra - $11.97.


Quick Reminders

• Conditional Probability: for events E, F with P(F) > 0, P(E | F) = P(E ∩ F) / P(F)

• Independence: E and F are independent if and only if P(E ∩ F) = P(E) P(F)


Bayes' Theorem: A Quick Proof


Proof cont.
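
The derivation on these slides is not preserved in the transcript. A standard argument, sketched here under the assumption that P(E) > 0 and 0 < P(F) < 1 (the slides' own steps may differ), goes as follows:

```latex
\begin{align*}
P(F \mid E)\,P(E) &= P(E \cap F) = P(E \mid F)\,P(F)
  && \text{(definition of conditional probability, applied twice)} \\
P(F \mid E) &= \frac{P(E \mid F)\,P(F)}{P(E)} \\
P(E) &= P(E \cap F) + P(E \cap F^{C})
      = P(E \mid F)\,P(F) + P(E \mid F^{C})\,P(F^{C})
  && \text{($E \cap F$ and $E \cap F^{C}$ are disjoint)} \\
P(F \mid E) &= \frac{P(E \mid F)\,P(F)}{P(E \mid F)\,P(F) + P(E \mid F^{C})\,P(F^{C})}
\end{align*}
```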



Applying Bayes' Theorem

• Let our sample space be the set of emails.
• Let S be the event a message is spam; hence S^C is the event a message is not spam.
• Let E be the event a message contains a word w.
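
With these events, Bayes' theorem gives the probability that a message is spam given that it contains w. A sketch in the notation of this slide, assuming P(E) > 0:

```latex
\[
P(S \mid E) = \frac{P(E \mid S)\,P(S)}{P(E \mid S)\,P(S) + P(E \mid S^{C})\,P(S^{C})}
\]
```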


Estimations
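
The formulas on this slide are missing from the transcript. Judging from the worked examples later in the deck, the estimates are presumably of the following form. The names p(w), q(w), and r(w) are the ones used in those examples, and the counts come from the training messages:

```latex
\[
\begin{gathered}
p(w) = \frac{\#\{\text{spam messages containing } w\}}{\#\{\text{spam messages}\}}, \qquad
q(w) = \frac{\#\{\text{non-spam messages containing } w\}}{\#\{\text{non-spam messages}\}} \\[1ex]
P(S \mid E) \approx r(w) = \frac{p(w)}{p(w) + q(w)}
\quad \text{when } P(S) = P(S^{C}) = \tfrac{1}{2}
\end{gathered}
\]
```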


Estimation Continued
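
As a rough illustration of how such estimates could be computed from training mail, here is a minimal Python sketch. The function name `fraction_containing`, the whitespace tokenization, and the training messages are all assumptions made up for this example; they are not taken from the slides.

```python
def fraction_containing(messages, word):
    """Fraction of `messages` (plain strings) that contain `word` as a whole token."""
    if not messages:
        raise ValueError("need at least one training message")
    word = word.lower()
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    hits = sum(1 for text in messages if word in text.lower().split())
    return hits / len(messages)


# Made-up training data, for illustration only.
spam_messages = ["cheap viagra now", "viagra and cialis sale", "you are a winner"]
ham_messages = ["lunch at noon?", "meeting notes attached"]

p_viagra = fraction_containing(spam_messages, "viagra")  # estimate of P(E | S)
q_viagra = fraction_containing(ham_messages, "viagra")   # estimate of P(E | S^C)
print(p_viagra, q_viagra)
```

A real filter would need better tokenization and some handling for words that never appear in one of the two classes.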


Spam based on single words?

• Probabilities based on single words: Bad Idea
– False positives AND false negatives aplenty

• Calculate based on n words, assuming each event E_i|S (E_i|S^C) is independent; P(S) = P(S^C).


Final Approximation
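
The formula itself is missing from the transcript. Based on the two-word example later in the deck, the approximation for n words w_1, ..., w_n is presumably:

```latex
\[
r(w_1, \dots, w_n) =
\frac{\prod_{i=1}^{n} p(w_i)}{\prod_{i=1}^{n} p(w_i) + \prod_{i=1}^{n} q(w_i)}
\]
```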



How do we use this?

• User must train the filter based on messages in his/her inbox to estimate probabilities

• The program or user must define a threshold probability r:

• If the estimated spam probability is greater than r, the message is considered spam.
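
The thresholding step could look roughly like the Python sketch below. The `classify` function, the word tables `p` and `q`, and every number in them are hypothetical values chosen for illustration. The sketch also assumes P(S) = P(S^C) and that the estimates for the words in a message are not all zero.

```python
from math import prod


def classify(message_words, p, q, threshold=0.9):
    """Return (r, is_spam) using only words that have estimates in both tables."""
    known = [w for w in message_words if w in p and w in q]
    if not known:
        return None, False  # nothing to base a decision on
    numerator = prod(p[w] for w in known)
    r = numerator / (numerator + prod(q[w] for w in known))
    return r, r > threshold


# Hypothetical per-word estimates, for illustration only.
p = {"free": 0.3, "winner": 0.2, "meeting": 0.01}   # estimates of P(word | spam)
q = {"free": 0.05, "winner": 0.01, "meeting": 0.2}  # estimates of P(word | not spam)

print(classify(["free", "winner"], p, q))
print(classify(["meeting"], p, q))
```

Skipping words that lack estimates is one possible design choice; the slides do not say how unseen words should be treated.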


Example

• Suppose the filter has the following data:
• Threshold probability: 0.9
• “Viagra” occurs in 250 of 2000 spam messages
• “Viagra” occurs in only 5 of 1000 non-spam messages
• Let’s try to estimate the probability, using the process we just defined


Example Cont.

• Step 1: Estimate the probability that a spam message contains the word “Viagra”.
– p(Viagra) = 250 / 2000 = 0.125

• Step 2: Estimate the probability that a non-spam message contains the word “Viagra”.
– q(Viagra) = 5 / 1000 = 0.005


Example Cont.

• Since we are assuming that it is equally likely that an incoming message is or is not spam, we can estimate the probability with this equation:
– r(Viagra) = p(Viagra) / (p(Viagra) + q(Viagra))


Example Cont.

• r(Viagra) = 0.125 / (0.125 + 0.005) = 0.125 / 0.130 = 0.962

Since r(Viagra) is greater than the threshold of 0.9, we can reject this message as spam.
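
The single-word calculation can be checked with a few lines of Python. The counts and threshold are the ones given in the example; the variable names are chosen for this sketch.

```python
# Counts from the example: "Viagra" in 250 of 2000 spam, 5 of 1000 non-spam.
spam_with_word, spam_total = 250, 2000
ham_with_word, ham_total = 5, 1000
threshold = 0.9

p = spam_with_word / spam_total  # p(Viagra)
q = ham_with_word / ham_total    # q(Viagra)
r = p / (p + q)                  # assumes P(S) = P(S^C)

is_spam = r > threshold
print(f"p={p}, q={q}, r={r:.3f}, spam={is_spam}")
```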


Harder Stuff

• Single-word detection can lead to a lot of false positives and false negatives.

• To counter this, most spam filters look for the presence of multiple words.


Another Example

• 2000 spam messages; 1000 real messages
• “Viagra” appears in 400 spam messages
• “Viagra” appears in 60 real messages
• “Cialis” appears in 200 spam and 25 real messages
• Threshold probability: 0.9
• Let’s calculate the probability that a message containing both words is spam.


Example Cont.

• Step 1: Estimate the probability that a spam message contains the word “Viagra”.
– p(Viagra) = 400 / 2000 = 0.2

• Step 2: Estimate the probability that a non-spam message contains the word “Viagra”.
– q(Viagra) = 60 / 1000 = 0.06


Example Cont.

• Step 3: Estimate the probability that a spam message contains the word “Cialis”.
– p(Cialis) = 200 / 2000 = 0.1

• Step 4: Estimate the probability that a non-spam message contains the word “Cialis”.
– q(Cialis) = 25 / 1000 = 0.025


Example Cont.

• Using our approximation, we have:

– r(Viagra, Cialis) = p(Viagra) * p(Cialis) / (p(Viagra) * p(Cialis) + q(Viagra) * q(Cialis))


Example Cont.

• r(Viagra, Cialis) = (0.2)(0.1) / ((0.2)(0.1) + (0.06)(0.025)) = 0.02 / 0.0215 = 0.930

Since r(Viagra, Cialis) is greater than the threshold probability of 0.9, this message will be rejected as spam.
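
The two-word calculation can be reproduced in Python as a quick check. The counts are the ones given in the example; the dictionary layout and variable names are assumptions of this sketch.

```python
from math import prod

spam_total, ham_total = 2000, 1000
threshold = 0.9

# word -> (spam messages containing it, real messages containing it)
counts = {"Viagra": (400, 60), "Cialis": (200, 25)}

p = {w: s / spam_total for w, (s, h) in counts.items()}  # p(Viagra)=0.2, p(Cialis)=0.1
q = {w: h / ham_total for w, (s, h) in counts.items()}   # q(Viagra)=0.06, q(Cialis)=0.025

numerator = prod(p.values())
r = numerator / (numerator + prod(q.values()))  # assumes independence and P(S) = P(S^C)

print(f"r={r:.3f}, spam={r > threshold}")
```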


Questions?