
Page 1: Naive Bayes Presentation

Naive Bayes

Md Enamul Haque Chowdhury

ID : CSE013083972D

University of Luxembourg

(Based on Ke Chen and Ashraf Uddin Presentation)

Page 2: Naive Bayes Presentation

Contents

Background

Bayes Theorem

Bayesian Classifier

Naive Bayes

Uses of Naive Bayes classification

Relevant Issues

Advantages and Disadvantages

Some NBC Applications

Conclusions

1

Page 3: Naive Bayes Presentation

Background

There are three methods to establish a classifier

a) Model a classification rule directly

Examples: k-NN, decision trees, perceptron, SVM

b) Model the probability of class memberships given input data

Example: perceptron with the cross-entropy cost

c) Make a probabilistic model of data within each class

Examples: Naive Bayes, model-based classifiers

a) and b) are examples of discriminative classification

c) is an example of generative classification

b) and c) are both examples of probabilistic classification

2

Page 4: Naive Bayes Presentation

Bayes Theorem

Given a hypothesis h and data D which bears on the hypothesis:

P(h): prior probability of h (before seeing the data D)

P(D): prior probability of the data D (independent of any particular hypothesis)

P(D|h): conditional probability of D given h (the likelihood)

P(h|D): conditional probability of h given D (the posterior probability)

3

Page 5: Naive Bayes Presentation

Maximum A Posteriori

Based on Bayes' Theorem, we can compute the Maximum A Posteriori (MAP) hypothesis for the data.

We are interested in the best hypothesis in some space H given the observed training data D.

H: the set of all hypotheses.

Note that we can drop P(D), as the probability of the data is constant (and independent of the hypothesis).

h_MAP = argmax_{h ∈ H} P(h|D)

      = argmax_{h ∈ H} P(D|h)·P(h) / P(D)

      = argmax_{h ∈ H} P(D|h)·P(h)

4

Page 6: Naive Bayes Presentation

Maximum Likelihood

Now assume that all hypotheses are equally probable a priori, i.e. P(h_i) = P(h_j) for all h_i, h_j in H.

This is called assuming a uniform prior. It simplifies computing the posterior:

This hypothesis is called the maximum likelihood hypothesis.

h_ML = argmax_{h ∈ H} P(D|h)

5
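As a quick illustration of the difference between the two rules, here is a minimal Python sketch (not from the slides; the hypothesis names, priors and likelihoods are made-up numbers): MAP weighs the likelihood by the prior, while ML ignores the prior.

# Illustrative sketch: picking the MAP vs. the ML hypothesis.
# The hypotheses and probabilities below are made-up numbers, not from the slides.
priors = {"h1": 0.70, "h2": 0.20, "h3": 0.10}        # P(h)
likelihoods = {"h1": 0.20, "h2": 0.50, "h3": 0.90}    # P(D|h)

# MAP: maximize P(D|h) * P(h); P(D) is constant and can be dropped.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# ML: assume a uniform prior, so only the likelihood P(D|h) matters.
h_ml = max(priors, key=lambda h: likelihoods[h])

print(h_map)  # 'h1' (0.20 * 0.70 = 0.14 is the largest product)
print(h_ml)   # 'h3' (largest likelihood)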

Page 7: Naive Bayes Presentation

Bayesian Classifier

The classification problem may be formalized using a posteriori probabilities:

P(C|X) = prob. that the sample tuple X=<x1,…,xk> is of class C.

E.g. P(class=N | outlook= sunny, windy=true,…)

Idea: assign to sample X the class label C such that P(C|X) is maximal

6

Page 8: Naive Bayes Presentation

Estimating a posteriori probabilities

Bayes theorem:

P(C|X) = P(X|C)·P(C) / P(X)

P(X) is constant for all classes

P(C) = relative freq of class C samples

C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum

Problem: computing P(X|C) directly is infeasible, since it requires estimating the joint distribution of all the attribute values for each class!

7

Page 9: Naive Bayes Presentation

Naive Bayes

Bayes classification:

P(C|X) ∝ P(X|C)·P(C) = P(X1, …, Xn|C)·P(C)

Difficulty: learning the joint probability P(X1, …, Xn|C)

Naive Bayes classification:

- Assumption: all input features are conditionally independent given the class!

P(X1, X2, …, Xn|C) = P(X1|X2, …, Xn, C)·P(X2, …, Xn|C) = P(X1|C)·P(X2, …, Xn|C) = P(X1|C)·P(X2|C)·…·P(Xn|C)

- MAP classification rule: for x = (x1, x2, …, xn), assign the label c* if

[P(x1|c*)·…·P(xn|c*)]·P(c*) > [P(x1|c)·…·P(xn|c)]·P(c), for all c ≠ c*, c ∈ {c1, …, cL}

8
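A minimal Python sketch of the factorized MAP rule above, assuming the probability tables are already available; the class names and table entries here are placeholders, not from the slides.

# Sketch of the Naive Bayes MAP decision rule:
# score(c) = P(c) * prod_j P(x_j | c); pick the class with the largest score.
# The tables below are placeholders; in practice they come from the learning phase.
from math import prod

priors = {"yes": 0.6, "no": 0.4}                       # P(C = c)
cond = {                                               # P(X_j = x_j | C = c)
    "yes": {"outlook=sunny": 0.2, "wind=strong": 0.3},
    "no":  {"outlook=sunny": 0.6, "wind=strong": 0.6},
}

def classify(features):
    """Return the MAP class for a list of observed feature values."""
    return max(priors, key=lambda c: priors[c] * prod(cond[c][f] for f in features))

print(classify(["outlook=sunny", "wind=strong"]))  # 'no' (0.4*0.6*0.6 = 0.144 beats 0.6*0.2*0.3 = 0.036)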

Page 10: Naive Bayes Presentation

Naive Bayes

Algorithm: Discrete-Valued Features

- Learning Phase: Given a training set S,

For each target class value c_i (i = 1, …, L):
P̂(C = c_i) ← estimate P(C = c_i) with the examples in S

For every feature value x_jk of each feature X_j (j = 1, …, n; k = 1, …, N_j):
P̂(X_j = x_jk|C = c_i) ← estimate P(X_j = x_jk|C = c_i) with the examples in S

Output: conditional probability tables; for each X_j, a table of N_j × L elements

- Test Phase: Given an unknown instance X′ = (a′1, …, a′n),

Look up the tables to assign the label c* to X′ if

[P̂(a′1|c*)·…·P̂(a′n|c*)]·P̂(c*) > [P̂(a′1|c)·…·P̂(a′n|c)]·P̂(c), for all c ≠ c*, c ∈ {c1, …, cL}

9
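The learning phase above amounts to counting relative frequencies. A hedged Python sketch, using a tiny made-up training set in place of S:

# Sketch of the discrete learning phase: estimate priors and conditional
# probability tables by counting relative frequencies in the training set S.
# The toy data below is illustrative only.
from collections import Counter, defaultdict

data = [  # (features, class label)
    ({"outlook": "sunny", "wind": "weak"},    "no"),
    ({"outlook": "sunny", "wind": "strong"},  "no"),
    ({"outlook": "rain",  "wind": "weak"},    "yes"),
    ({"outlook": "overcast", "wind": "weak"}, "yes"),
]

class_counts = Counter(label for _, label in data)
priors = {c: n / len(data) for c, n in class_counts.items()}   # P(C = c)

feature_counts = defaultdict(Counter)                          # (class, feature) -> value counts
for features, label in data:
    for name, value in features.items():
        feature_counts[(label, name)][value] += 1

def cond_prob(feature, value, label):
    """Estimate P(X_feature = value | C = label) by relative frequency.
    Note: returns 0 for unseen (value, class) pairs; Laplace smoothing (later slides) fixes this."""
    return feature_counts[(label, feature)][value] / class_counts[label]

print(priors)                               # {'no': 0.5, 'yes': 0.5}
print(cond_prob("outlook", "sunny", "no"))  # 1.0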

Page 11: Naive Bayes Presentation

Example

10

Page 12: Naive Bayes Presentation

Example

Learning Phase :

P(Play=Yes) = 9/14

P(Play=No) = 5/14

Outlook      Play=Yes  Play=No
Sunny        2/9       3/5
Overcast     4/9       0/5
Rain         3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

Humidity     Play=Yes  Play=No
High         3/9       4/5
Normal       6/9       1/5

Wind         Play=Yes  Play=No
Strong       3/9       3/5
Weak         6/9       2/5

11

Page 13: Naive Bayes Presentation

Example

Test Phase :

-Given a new instance, predict its label

x´=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)

-Look up the tables obtained in the learning phase

-Decision making with the MAP rule:

P(Outlook=Sunny|Play=Yes) = 2/9

P(Temperature=Cool|Play=Yes) = 3/9

P(Humidity=High|Play=Yes) = 3/9

P(Wind=Strong|Play=Yes) = 3/9

P(Play=Yes) = 9/14

P(Outlook=Sunny|Play=No) = 3/5

P(Temperature=Cool|Play=No) = 1/5

P(Humidity=High|Play=No) = 4/5

P(Wind=Strong|Play=No) = 3/5

P(Play=No) = 5/14

P(Yes|x´): [ P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes) ] P(Play=Yes) = 0.0053

P(No|x´): [ P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No) ] P(Play=No) = 0.0206

Given the fact P(Yes|x´) < P(No|x´) , we label x´ to be “No”.

12
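The two scores quoted above follow directly from the tables; this short Python check just multiplies out the slide's numbers.

# Reproduce the two MAP scores from the slide's lookup tables.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(x'|Yes) * P(Yes)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # P(x'|No)  * P(No)

print(round(p_yes, 4))                   # 0.0053
print(round(p_no, 4))                    # 0.0206
print("No" if p_no > p_yes else "Yes")   # 'No'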

Page 14: Naive Bayes Presentation

Naive Bayes

Algorithm: Continuous-valued Features

- A continuous-valued feature takes numberless (infinitely many) values, so frequency tables cannot be used

- The conditional probability is instead often modeled with the normal (Gaussian) distribution:

P̂(X_j|C = c_i) = 1/(√(2π)·σ_ji) · exp( −(X_j − μ_ji)² / (2·σ_ji²) )

μ_ji : mean (average) of the feature values X_j of the examples for which C = c_i

σ_ji : standard deviation of the feature values X_j of the examples for which C = c_i

- Learning Phase: for X = (X1, …, Xn), C = c1, …, cL

Output: n × L normal distributions and P(C = c_i), i = 1, …, L

- Test Phase: Given an unknown instance X′ = (a′1, …, a′n),

- Instead of looking up tables, calculate the conditional probabilities with the normal distributions obtained in the learning phase

- Apply the MAP rule to make a decision

13

Page 15: Naive Bayes Presentation

Naive Bayes

Example: Continuous-valued Features

-Temperature is naturally of continuous value.

Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8

No: 27.3, 30.1, 17.4, 29.5, 15.1

-Estimate mean and variance for each class

-Learning Phase: output two Gaussian models for P(temp|C)

μ = (1/N) Σ_{n=1..N} x_n ,   σ² = (1/(N−1)) Σ_{n=1..N} (x_n − μ)²

Yes: μ_Yes = 21.64, σ_Yes = 2.35

No: μ_No = 23.88, σ_No = 7.09

P̂(x|Yes) = 1/(2.35·√(2π)) · exp( −(x − 21.64)² / (2·2.35²) )

P̂(x|No) = 1/(7.09·√(2π)) · exp( −(x − 23.88)² / (2·7.09²) )

14
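These estimates can be reproduced from the temperature readings on the slide with a few lines of Python, using the sample standard deviation (the N−1 formula):

# Reproduce the Gaussian parameter estimates for the temperature feature.
import statistics
from math import exp, pi, sqrt

yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
no  = [27.3, 30.1, 17.4, 29.5, 15.1]

mu_yes, sd_yes = statistics.mean(yes), statistics.stdev(yes)   # sample std dev (N-1)
mu_no,  sd_no  = statistics.mean(no),  statistics.stdev(no)

print(round(mu_yes, 2), round(sd_yes, 2))  # 21.64 2.35
print(round(mu_no, 2),  round(sd_no, 2))   # 23.88 7.09

def gaussian(x, mu, sd):
    """Class-conditional density P_hat(x | c) under the normal model."""
    return exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sqrt(2 * pi) * sd)

print(gaussian(23.0, mu_yes, sd_yes))  # density of temp = 23.0 under the 'Yes' model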

Page 16: Naive Bayes Presentation

Uses of Naive Bayes classification

Text Classification

Spam Filtering

Hybrid Recommender System

- Recommender Systems apply machine learning and data mining techniques for

filtering unseen information and can predict whether a user would like a given

resource

Online Application

- Simple Emotion Modeling

15

Page 17: Naive Bayes Presentation

Why text classification?

Learning which articles are of interest

Classify web pages by topic

Information extraction

Internet filters

16

Page 18: Naive Bayes Presentation

Examples of Text Classification

CLASSES = BINARY

“spam” / “not spam”

CLASSES = TOPICS

“finance” / “sports” / “politics”

CLASSES = OPINION

“like” / “hate” / “neutral”

CLASSES = TOPICS

“AI” / “Theory” / “Graphics”

CLASSES = AUTHOR

“Shakespeare” / “Marlowe” / “Ben Jonson”

17

Page 19: Naive Bayes Presentation

Naive Bayes Approach

Build the Vocabulary as the list of all distinct words that appear in all the documents

of the training set.

Remove stop words and markings

The words in the vocabulary become the attributes, assuming that classification is

independent of the positions of the words

Each document in the training set becomes a record with frequencies for each word

in the Vocabulary.

Train the classifier on the training data set by computing the prior probability of each class and the conditional probabilities of the attributes.

Evaluate the results on the test data (a short sketch of the preprocessing steps follows below).

18
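A rough Python sketch of the preprocessing steps listed above; the documents and stop-word list are made up for illustration.

# Sketch of the text-preprocessing steps: build a vocabulary, drop stop words,
# and represent each training document as word frequencies.
# Documents and the stop-word list are illustrative placeholders.
from collections import Counter

docs = [("cheap pills buy now", "spam"),
        ("meeting agenda for monday", "not spam")]
stop_words = {"for", "now", "the", "a"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]

vocabulary = sorted({w for text, _ in docs for w in tokenize(text)})
records = [(Counter(tokenize(text)), label) for text, label in docs]

print(vocabulary)   # ['agenda', 'buy', 'cheap', 'meeting', 'monday', 'pills']
print(records[0])   # (Counter({'cheap': 1, 'pills': 1, 'buy': 1}), 'spam')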

Page 20: Naive Bayes Presentation

Text Classification Algorithm: Naive Bayes

T_ct – number of occurrences of word t in the documents of class c

Σ_t′ T_ct′ – total number of word occurrences in the documents of class c

B – number of distinct words across all classes (the vocabulary size)

These counts give the Laplace-smoothed word likelihood P̂(t|c) = (T_ct + 1) / (Σ_t′ T_ct′ + B), as in the sketch below.

19
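A small Python sketch of that estimate, with made-up word counts standing in for a real training corpus:

# Laplace-smoothed estimate of P(t | c) for the multinomial text model:
#   P_hat(t|c) = (T_ct + 1) / (sum over t' of T_ct' + B)
# Word counts per class are made-up illustrative numbers.
word_counts = {  # T_ct: occurrences of each word in documents of class c
    "spam":     {"cheap": 3, "pills": 2, "meeting": 0},
    "not spam": {"cheap": 0, "pills": 0, "meeting": 4},
}
B = 3  # number of distinct words in the vocabulary

def p_word_given_class(word, c):
    total = sum(word_counts[c].values())            # sum over t' of T_ct'
    return (word_counts[c].get(word, 0) + 1) / (total + B)

print(p_word_given_class("meeting", "spam"))   # 1/8 = 0.125, never zero
print(p_word_given_class("cheap", "spam"))     # 4/8 = 0.5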

Page 21: Naive Bayes Presentation

Relevant Issues

Violation of Independence Assumption

Zero conditional probability Problem

20

Page 22: Naive Bayes Presentation

Violation of Independence Assumption

Naive Bayesian classifiers assume that the effect of an attribute value on a given

class is independent of the values of the other attributes. This assumption is called

class conditional independence. It is made to simplify the computations involved and,

in this sense, is considered “naive.”

21

Page 23: Naive Bayes Presentation

Improvement

Bayesian belief networks are graphical models that, unlike naive Bayesian classifiers, allow the representation of dependencies among subsets of attributes.

Bayesian belief networks can also be used for classification.

22

Page 24: Naive Bayes Presentation

Zero conditional probability Problem

If a given class and feature value never occur together in the training set then the

frequency-based probability estimate will be zero.

This is problematic since it will wipe out all information in the other probabilities when

they are multiplied.

It is therefore often desirable to incorporate a small-sample correction in all

probability estimates such that no probability is ever set to be exactly zero.

23

Page 25: Naive Bayes Presentation

Naive Bayes Laplace Correction

To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to

each count

24

Page 26: Naive Bayes Presentation

Example

Suppose that for the class buys_computer = yes in some training database D containing 1,000 tuples,

we have 0 tuples with income = low,

990 tuples with income = medium, and

10 tuples with income = high.

The probabilities of these events, without the Laplacian correction, are 0, 0.990 (from 990/1000), and 0.010 (from 10/1000), respectively.

Using the Laplacian correction for the three quantities, we pretend that we have 1 more tuple for each income-value pair. In this way, we instead obtain the probabilities 1/1003 ≈ 0.001, 991/1003 ≈ 0.988, and 11/1003 ≈ 0.011, respectively.

The “corrected” probability estimates are close to their “uncorrected” counterparts, yet the zero probability value is avoided.

25
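The corrected values can be checked in a couple of lines of Python:

# Laplacian correction for the income counts in the example:
# add 1 to each of the three counts, so the denominator grows from 1000 to 1003.
counts = {"low": 0, "medium": 990, "high": 10}
corrected = {k: (v + 1) / (sum(counts.values()) + len(counts)) for k, v in counts.items()}
print(corrected)  # {'low': 0.000997..., 'medium': 0.988..., 'high': 0.01097...}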

Page 27: Naive Bayes Presentation

Advantages

Easy to implement

Requires a small amount of training data to estimate the parameters

Good results obtained in most of the cases

26

Page 28: Naive Bayes Presentation

Disadvantages

Assumption: class conditional independence, therefore loss of accuracy

Practically, dependencies exist among variables

- E.g., hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)

Dependencies among these cannot be modelled by a naive Bayesian classifier

27

Page 29: Naive Bayes Presentation

Some NBC Applications

Credit scoring

Marketing applications

Employee selection

Image processing

Speech recognition

Search engines…

28

Page 30: Naive Bayes Presentation

Conclusions

Naive Bayes is:

- Really easy to implement and often works well

- Often a good first thing to try

- Commonly used as a “punching bag” for smarter algorithms

29

Page 31: Naive Bayes Presentation

References

http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch6.pdf

Data Mining: Concepts and Techniques, 3rd Edition, Han, Kamber & Pei, ISBN: 9780123814791

http://en.wikipedia.org/wiki/Naive_Bayes_classifier

http://www.slideshare.net/ashrafmath/naive-bayes-15644818

http://www.slideshare.net/gladysCJ/lesson-71-naive-bayes-classifier

30

Page 32: Naive Bayes Presentation

Questions?