Bayesian Learning – Indian Institute of Technology...
TRANSCRIPT
Outline
• Bayes' Theorem
• MAP learners
• Bayes optimal classifier
• Naïve Bayes classifier
• Example – text classification
• Bayesian networks
• EM algorithm
Features of Bayesian Learning
• Practical learning algorithms
  • Naïve Bayes learning
  • Bayesian network learning
  • Combine prior knowledge with observations
  • Require prior probabilities
• Useful conceptual framework
  • "Gold standard" for evaluating other classifiers
  • Tools for analysis
Bayes' Theorem
• If A and B are two random variables:

  P(A|B) = P(B|A) P(A) / P(B)

• In the context of a classifier hypothesis h and training data I:

  P(h|I) = P(I|h) P(h) / P(I)

• P(h) – prior probability of hypothesis h
• P(I) – prior probability of the training data I
• P(h|I) – probability of h given I
• P(I|h) – probability of I given h
Choosing the Hypotheses
• Given the training data, we are interested in the most probable hypothesis.
• Maximum a posteriori (MAP) hypothesis – h_MAP:

  h_MAP ≡ argmax_{h∈H} P(h|I) ≡ argmax_{h∈H} P(I|h) P(h) / P(I) ≡ argmax_{h∈H} P(I|h) P(h)

• If every hypothesis is equally probable a priori, P(h_i) = P(h_j) for all h_i, h_j ∈ H, this simplifies to the maximum likelihood (ML) hypothesis – h_ML:

  h_ML = argmax_{h_i∈H} P(I|h_i)
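A minimal Python sketch contrasting the two rules over a toy finite hypothesis space; the hypotheses, priors, and likelihoods below are illustrative placeholders, not values from the lecture:

# Minimal sketch: MAP vs. ML hypothesis selection over a finite hypothesis space.
# The hypotheses, priors, and likelihoods are illustrative placeholders.
priors = {"h1": 0.9, "h2": 0.1}           # P(h)
likelihoods = {"h1": 0.05, "h2": 0.30}    # P(I | h) for the observed data I

h_ml = max(likelihoods, key=likelihoods.get)                    # argmax_h P(I|h)
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])   # argmax_h P(I|h) P(h)

print(h_ml, h_map)  # h2 h1 -- the strong prior on h1 changes the MAP choice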
Example
• Does the patient have cancer or not?
• A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
P(cancer) = 0.008        P(¬cancer) = 0.992
P(+|cancer) = 0.98       P(−|cancer) = 0.02
P(+|¬cancer) = 0.03      P(−|¬cancer) = 0.97

P(cancer|+) = P(+|cancer) P(cancer) / P(+) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21
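The posterior above can be checked with a few lines of Python, using only the values from the problem statement:

# Bayes' rule for the lab-test example above.
p_cancer = 0.008
p_pos_given_cancer = 0.98       # P(+ | cancer)
p_pos_given_no_cancer = 0.03    # P(+ | not cancer) = 1 - 0.97

p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * (1 - p_cancer)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))  # 0.209 -- a positive test still leaves cancer unlikely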
Brute-Force MAP Hypothesis Learner (1)
• We are given I = <(x_1, y_1), ..., (x_N, y_N)>, the examples and their class labels.
• For each hypothesis h ∈ H, calculate the posterior probability

  P(h|I) = P(I|h) P(h) / P(I)

• Output the hypothesis h_MAP that has the highest posterior probability

  h_MAP = argmax_{h∈H} P(h|I)
Brute-Force MAP Hypothesis Learner (2)
• If we are given I = <(x_1, y_1), ..., (x_N, y_N)>, the examples and the class labels, choose P(I|h):
  • P(I|h) = 1 if h is consistent with I
  • P(I|h) = 0 otherwise
• Choose P(h) to be the uniform distribution:
  • P(h) = 1/|H| for all h ∈ H
• Then P(h|I) = P(I|h) P(h) / P(I)
Brute-Force MAP Hypothesis Learner (3)
• If we are given I = <(x_1, y_1), ..., (x_N, y_N)>, the examples and the class labels, choose P(I|h):
  • P(I|h) = 1 if h is consistent with I
  • P(I|h) = 0 otherwise
• Choose P(h) to be the uniform distribution:
  • P(h) = 1/|H|
• Then

  P(h|I) = 1/|VS_{H,I}| if h is consistent with I, and 0 otherwise,

  where VS_{H,I} is the version space – the set of hypotheses in H consistent with I.
Evolution of Posterior Probabilities
[Figure: three plots of the distribution over hypotheses h – P(h), P(h|I_1), and P(h|I_1, I_2) – showing the posterior concentrating on fewer hypotheses as more training data is observed.]
Classifying new instances
• Given a new instance x, what is the most probable classification?
• One solution – h_MAP(x)
• But can we do better?
• Consider the following example containing three hypotheses:

  P(h_1|I) = 0.4,  P(h_2|I) = 0.3,  P(h_3|I) = 0.3

• Given a new instance x,

  h_1(x) = +,  h_2(x) = −,  h_3(x) = −

• What is the most probable classification for x?
Bayes Optimal Classifier (1)
• Combine the predictions of all hypotheses, weighted by their posterior probabilities.
• Bayes optimal classification:

  argmax_{y∈Y} Σ_{h_i∈H} P(y|h_i) P(h_i|I)

• Example:

  P(h_1|I) = 0.4,  P(−|h_1) = 0,  P(+|h_1) = 1
  P(h_2|I) = 0.3,  P(−|h_2) = 1,  P(+|h_2) = 0
  P(h_3|I) = 0.3,  P(−|h_3) = 1,  P(+|h_3) = 0
  Σ_{h_i∈H} P(+|h_i) P(h_i|I) = 0.4

  Σ_{h_i∈H} P(−|h_i) P(h_i|I) = 0.6

  so the Bayes optimal classification for x is −.
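A small Python sketch of this weighted vote for the three-hypothesis example above:

# Bayes optimal classification for the three-hypothesis example above.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h_i | I)
predictions = {"h1": "+", "h2": "-", "h3": "-"}    # h_i(x)

# argmax_y sum_i P(y | h_i) P(h_i | I), with P(y | h_i) = 1 if h_i predicts y, else 0
scores = {y: sum(p for h, p in posteriors.items() if predictions[h] == y) for y in ("+", "-")}
print(scores, max(scores, key=scores.get))  # {'+': 0.4, '-': 0.6} '-'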
Bayes Optimal Classifier (2)
• Optimal in the sense that:
  • No other classification method using the same hypothesis space and the same prior knowledge can outperform it on average.
  • It maximizes the probability that the new instance is classified correctly, given the available data, the hypothesis space, and the prior probabilities over the hypotheses.
• But it is inefficient:
  • It computes the posterior probability of every hypothesis and combines the predictions of all of them.
Gibbs Classifier
• Gibbs algorithm:
  • Choose a hypothesis h ∈ H at random, according to the posterior probability distribution over H.
  • Use h to classify the new instance x.
• Observation – assume the target concepts are drawn at random from H according to the priors on H. Then
  E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
Haussler et al., ML 1994
Naïve Bayes Classifier (1)
• Bayes rule, in a slightly different application:
  • Let Y = {c_1, c_2, ..., c_K} be the different class labels.
  • The label for the i-th instance is y_i ∈ Y.
• P(c_k|x) – posterior probability that instance x belongs to class c_k
• P(x|c_k) – probability that an instance drawn from class c_k would be x (likelihood)
• P(c_k) – probability of class c_k (prior)
• P(x) – probability of instance x (evidence)
  P(c_k|x) = P(x|c_k) P(c_k) / P(x)
Naïve Bayes Classifier (2)
• Classify instance x as the class y with maximum posterior probability:

  y = argmax_k P(c_k|x)

• Ignore the denominator (since we are only interested in the maximum):

  y = argmax_k P(x|c_k) P(c_k)

• If the prior is uniform:

  y = argmax_k P(x|c_k)
Naïve Bayes Classifier (3)
• Look at the classifier:

  y = argmax_k P(x|c_k)

• What is each instance x?
  • A D-dimensional tuple – (x_1, ..., x_D)
• Estimate the joint probability distribution P(x_1, ..., x_D | c_k)
• Practical issue – we need to know the probability of every possible instance given every possible class.
  • With D Boolean features and K classes – K·2^D probability values!
Naïve Bayes Classifier (4)
• Make the naïve Bayes assumption – features/attributes are conditionally independent given the target attribute (class label):

  P(x_1, ..., x_D | c_k) = Π_{d=1}^{D} P(x_d|c_k)

• This results in the naïve Bayes classifier (NBC):

  y = argmax_k P(c_k) Π_{d=1}^{D} P(x_d|c_k)
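A minimal Python sketch of this decision rule; the probability tables below are illustrative placeholders for two classes and two binary features, not values from the lecture:

import math

# Naive Bayes decision rule: argmax_k P(c_k) * prod_d P(x_d | c_k).
# The tables are illustrative placeholders: two classes, two binary features.
prior = {"c1": 0.6, "c2": 0.4}
cond = {  # cond[class][d][value] = P(x_d = value | class)
    "c1": [{0: 0.7, 1: 0.3}, {0: 0.2, 1: 0.8}],
    "c2": [{0: 0.4, 1: 0.6}, {0: 0.5, 1: 0.5}],
}

def classify(x):
    # Sum of logs avoids underflow when the number of features D is large.
    scores = {c: math.log(prior[c]) + sum(math.log(cond[c][d][v]) for d, v in enumerate(x))
              for c in prior}
    return max(scores, key=scores.get)

print(classify((1, 0)))  # 'c2': 0.4*0.6*0.5 = 0.12 beats 0.6*0.3*0.2 = 0.036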
NBC – Practical Issues (1)
• Estimating the probabilities from I.
• Prior probabilities:

  P(c_k) = |{(x_i, y_i) : y_i = c_k}| / |I|

• If the features are discrete:

  P(x_d = v | c_k) = |{(x_i, y_i) : x_id = v ∧ y_i = c_k}| / |{(x_i, y_i) : y_i = c_k}|
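A short Python sketch of these count-based estimates on a toy labelled sample (the data is illustrative only):

from collections import Counter, defaultdict

# Count-based estimates of P(c_k) and P(x_d = v | c_k) from a toy labelled sample I.
I = [((1, 0), "spam"), ((1, 1), "spam"), ((0, 1), "ham"), ((0, 0), "ham"), ((1, 0), "ham")]

class_counts = Counter(y for _, y in I)
prior = {c: n / len(I) for c, n in class_counts.items()}      # P(c_k)

value_counts = defaultdict(Counter)                            # (class, d) -> Counter of values
for x, y in I:
    for d, v in enumerate(x):
        value_counts[(y, d)][v] += 1

def p_feature(d, v, c):
    return value_counts[(c, d)][v] / class_counts[c]           # P(x_d = v | c_k)

print(prior, p_feature(0, 1, "spam"))  # {'spam': 0.4, 'ham': 0.6} 1.0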
NBC – Practical Issues (2)
• What if the features are continuous?
• Assume some parameterized distribution for x_d, e.g., Normal:
  • Learn the parameters of the distribution from data, e.g., the mean and variance of the x_d values.
  • Determine the parameters that maximize the likelihood.
  • P(x_d|c_k) ~ N(μ, σ²), where μ and σ² are unknown.
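A short sketch of fitting a per-class Gaussian by maximum likelihood and using its density as P(x_d|c_k); the feature values are illustrative:

import math

# Per-class Gaussian for a continuous feature: fit mu and sigma^2 by maximum likelihood,
# then use the normal density as P(x_d | c_k). The values are illustrative.
values = [4.9, 5.1, 5.3, 4.8]                              # x_d values observed in class c_k
mu = sum(values) / len(values)
var = sum((v - mu) ** 2 for v in values) / len(values)     # ML estimate (divide by n, not n-1)

def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

print(mu, var, normal_pdf(5.0, mu, var))  # this density plugs into the naive Bayes product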
NBC – Practical Issues (3)
• What if the features are continuous?
• Assume some parameterized distribution for x_d, e.g., Normal:
  • Learn the parameters of the distribution from data, e.g., the mean and variance of the x_d values.
  • Determine the parameters that maximize the likelihood.
• Or discretize the feature:
  • E.g., map price ∈ ℝ to price ∈ {low, medium, high}
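A minimal sketch of such a discretization; the bin edges are illustrative and would normally come from quantiles or domain knowledge:

# Discretizing a continuous feature into {low, medium, high}.
# The bin edges are illustrative placeholders.
def discretize_price(price, low_max=100.0, medium_max=500.0):
    if price <= low_max:
        return "low"
    if price <= medium_max:
        return "medium"
    return "high"

print([discretize_price(p) for p in (30, 250, 900)])  # ['low', 'medium', 'high']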
NBC – Practical Issues (4)
• If there are no examples in class c_k for which x_d = v, then

  P(x_d = v | c_k) = 0   and hence   Π_{d=1}^{D} P(x_d|c_k) P(c_k) = 0

• Use the m-estimate, defined as follows:

  P(x_d = v | c_k) = (|{(x_i, y_i) : x_id = v ∧ y_i = c_k}| + m·p) / (|{(x_i, y_i) : y_i = c_k}| + m)

• p – prior estimate of the probability
• m – equivalent sample size (how heavily to weight p relative to the observed data)
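A short sketch of the m-estimate; the counts below are illustrative:

# m-estimate of P(x_d = v | c_k): (n_v + m*p) / (n_c + m), where
#   n_v = number of class-c_k examples with x_d = v,  n_c = number of class-c_k examples,
#   p = prior estimate of the probability,  m = equivalent sample size.
def m_estimate(n_v, n_c, p, m):
    return (n_v + m * p) / (n_c + m)

# With no observed examples (n_v = 0) the estimate falls back towards p instead of 0.
print(m_estimate(0, 20, p=0.5, m=2))  # ~0.045, not 0
print(m_estimate(3, 20, p=0.5, m=0))  # 0.15 -- m = 0 recovers the plain relative frequency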
Example – Learn to Classify Text
• Problem definition:
  • Given a set of news articles that are of interest, we would like to learn to classify the articles by topic.
• Naïve Bayes is among the most effective algorithms for this task.
• What attributes will represent the documents?
  • A vector of words – one attribute per word position in the document.
• What is the target concept?
  • Is the document interesting?
  • The topic of the document.
Algorithm – Learn Naïve Bayes
• Collect all words and tokens that occur in the examples I.
  • Vocabulary – all distinct words and tokens in I.
• Compute the probabilities P(c_k) and P(x_d|c_k):
  • I_k – the examples for which the target label is c_k
  • P(c_k) = |I_k| / |I|
  • n – total number of words in I_k (counting duplicates multiple times)
  • For each word x_d in Vocabulary:
    • n_d = number of times word x_d occurs in I_k
    • P(x_d|c_k) = (n_d + 1) / (n + |Vocabulary|)
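A short Python sketch of these word-count estimates on a toy corpus (the documents and labels are illustrative only):

from collections import Counter

# Word-count estimates from the algorithm above:
#   P(c_k) = |I_k| / |I|,   P(x_d | c_k) = (n_d + 1) / (n + |Vocabulary|).
docs = [("buy cheap pills now", "spam"),
        ("meeting moved to noon", "ham"),
        ("cheap pills cheap", "spam")]

vocab = {w for text, _ in docs for w in text.split()}
classes = {c for _, c in docs}
prior = {c: sum(1 for _, y in docs if y == c) / len(docs) for c in classes}

word_counts = {c: Counter(w for text, y in docs if y == c for w in text.split()) for c in classes}
total_words = {c: sum(word_counts[c].values()) for c in classes}   # n, counting duplicates

def p_word(w, c):
    return (word_counts[c][w] + 1) / (total_words[c] + len(vocab))

print(prior["spam"], p_word("cheap", "spam"), p_word("cheap", "ham"))  # 0.667, 4/15, 1/12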
Algorithm – Classify Naïve Bayes
• Given a test instance:
  • Compute the frequency of occurrence, in the test instance, of each term in the vocabulary.
  • Apply the naïve Bayes classification rule.
Example: 20 Newsgroups
• Given 1000 training documents from each group, learn to classify new documents according to the newsgroup they came from.
• NBC – 89% accuracy.
Bayesian Network (1)
• The naïve Bayes assumption of conditional independence is too restrictive.
• But the problem is intractable without some conditional independence assumptions.
• Bayesian networks describe conditional independence among subsets of variables.
  • This allows combining prior knowledge about (in)dependencies among variables with training data.
• Recollect – conditional independence: X is conditionally independent of Y given Z if P(X|Y, Z) = P(X|Z).
Bayesian Network - Example
[Figure: a Bayesian network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire, together with the conditional probability table for Campfire given its parents Storm (S) and BusTourGroup (B):]

              S,B    S,¬B   ¬S,B   ¬S,¬B
  Campfire    0.4    0.1    0.8    0.2
  ¬Campfire   0.6    0.9    0.2    0.8
Bayes Network (2)
• The network represents the joint probability distribution over all the variables:

  P(Storm, BusTourGroup, ..., ForestFire)

• In general,

  P(x_1, x_2, ..., x_D) = Π_{d=1}^{D} P(x_d | Parents(x_d))

  where Parents(x_d) denotes the immediate predecessors of x_d in the graph (a small numerical sketch of this factorization follows below).
• What is the Bayes network corresponding to the naïve Bayes classifier?
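The sketch below computes one joint probability as a product of local terms, using the Campfire CPT from the example network; the priors assumed for Storm and BusTourGroup are illustrative placeholders, not values from the slides:

# Joint probability as a product of local terms P(x_d | Parents(x_d)), for the fragment
# Storm -> Campfire <- BusTourGroup of the example network. P(Campfire | Storm, BusTourGroup)
# is taken from the CPT above; the priors on Storm and BusTourGroup are illustrative placeholders.
p_storm = 0.1
p_bus_tour_group = 0.3
p_campfire_given = {(True, True): 0.4, (True, False): 0.1,
                    (False, True): 0.8, (False, False): 0.2}

# P(Storm=T, BusTourGroup=T, Campfire=T) = P(S) * P(B) * P(C | S, B)
p_joint = p_storm * p_bus_tour_group * p_campfire_given[(True, True)]
print(round(p_joint, 4))  # 0.012

As for the question above, the network corresponding to the naïve Bayes classifier has the class variable as the sole parent of every feature node, with no edges among the features.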
Bayes Network (3)
• Inference:
  • The Bayes network encodes all the information required for inference.
  • Exact inference methods work well for some structures.
  • Monte Carlo methods simulate the network randomly to calculate approximate solutions.
• Learning:
  • If the structure is known and there are no missing values, it is easy to learn a Bayes network.
  • If the network structure is known and there are some missing values, use the expectation-maximization (EM) algorithm.
  • If the structure is unknown, the problem is very difficult.