
Page 1: COMP595 HW11

COMP595DM Homework #11

Professor Wang Matthew Alcazar, Kyun Yong Park, Siriphong Plianchaow

4/8/2011

Predicting a class label using naïve Bayesian classification.

Page 2: COMP595 HW11

Example 6.4 Predicting a class label using naïve Bayesian classification.

Naïve Bayesian classification is a way of predicting class labels using statistical probability. It predicts membership probabilities, in other words, the probability that a given tuple belongs to a certain class. From basic conditional probability, we are given Bayes' theorem, stating

P(H|X) = P(X|H) P(H) / P(X)

In English, we would say: the probability that hypothesis H holds given that the tuple X has been observed. In this case, X is an n-dimensional attribute vector, X = (x1, x2, ..., xn).

So we are given classes C1, C2, ..., Cm, and given a tuple X we want to maximize the posterior probability P(Ci|X), so we use the equation given by Bayes' theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

The denominator P(X) is already constant for all classes, so we only need to maximize P(X|Ci) P(Ci).

In real-world situations we are dealing with many attributes and classes, so calculating P(X|Ci) directly would be expensive and time-consuming. So we make the assumption of class-conditional independence. This means that the values of the attributes are conditionally independent of one another given the class label; there are no dependence relationships among the attributes.

Now we have this equation:

P(X|Ci) = ∏ (k = 1 to n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

where xk refers to the value of attribute Ak for tuple X. Now we check whether each attribute is categorical or continuous-valued.

1. If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in the training set D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.

2. If Ak is continuous-valued, it is typically assumed to follow a Gaussian distribution, so we use the equation

P(xk|Ci) = g(xk, μCi, σCi) = (1 / (√(2π) σCi)) · exp(−(xk − μCi)² / (2 σCi²))

where μCi and σCi are the mean and standard deviation of attribute Ak for the tuples of class Ci.
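To make the procedure concrete, here is a minimal Python sketch of training and prediction for categorical attributes (an illustrative addition, not part of the original homework; the function names and data layout are our own):

    from collections import Counter, defaultdict

    def train_naive_bayes(rows, labels):
        """Estimate P(Ci) and P(xk|Ci) from categorical training data.

        rows   -- list of attribute tuples, e.g. ("youth", "high", "no", "fair")
        labels -- list of class labels, one per row
        """
        n = len(labels)
        prior = {c: cnt / n for c, cnt in Counter(labels).items()}  # P(Ci)

        # cond[class][attribute index][value] = P(value | class)
        cond = defaultdict(lambda: defaultdict(Counter))
        for row, c in zip(rows, labels):
            for k, value in enumerate(row):
                cond[c][k][value] += 1
        class_size = Counter(labels)
        for c in cond:
            for k in cond[c]:
                for v in cond[c][k]:
                    cond[c][k][v] /= class_size[c]
        return prior, cond

    def predict(x, prior, cond):
        """Return the class maximizing P(X|Ci) P(Ci)."""
        best_class, best_score = None, -1.0
        for c, p in prior.items():
            score = p
            for k, value in enumerate(x):
                score *= cond[c][k].get(value, 0.0)  # P(xk | Ci)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

Note that no smoothing is applied here: a zero count for any attribute value zeroes out the whole product, which real implementations usually avoid with Laplace smoothing.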

Page 3: COMP595 HW11

In the example, we are given the training data table.

RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no

We have the attributes age, income, student, and credit_rating. The class, buys_computer, has two values, {yes, no}. We can set the classes C1 and C2 to be buys_computer = yes and buys_computer = no, respectively. Here is the tuple X we wish to classify:

X = (age = youth, income = medium, student = yes, credit_rating = fair)

Now we need to maximize P(X|Ci) P(Ci) for i = 1, 2. In English, it is the probability of X given buys_computer = yes and given buys_computer = no, each weighted by the prior probability of that class.

In order to do all of the calculations, we first need the prior probability of each class.

Page 4: COMP595 HW11

Now we calculate the prior probabilities from the training data:

P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357

and the conditional probabilities of each attribute value of X:

P(age = youth | buys_computer = yes) = 2/9 = 0.222
P(age = youth | buys_computer = no) = 3/5 = 0.600
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

Furthermore, we calculate

P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X | buys_computer = no) P(buys_computer = no) = 0.019 × 0.357 = 0.007

Since we wanted to maximize P(X|Ci) P(Ci), we choose the higher value, which is 0.028. So with naïve Bayesian classification, we predict buys_computer = yes for the tuple X given above.
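As a quick sanity check (an illustrative addition, not part of the original slides), the numbers above can be reproduced directly from the table in a few lines of Python:

    # Training data from the table above: (age, income, student, credit_rating), label
    data = [
        (("youth", "high", "no", "fair"), "no"),
        (("youth", "high", "no", "excellent"), "no"),
        (("middle_aged", "high", "no", "fair"), "yes"),
        (("senior", "medium", "no", "fair"), "yes"),
        (("senior", "low", "yes", "fair"), "yes"),
        (("senior", "low", "yes", "excellent"), "no"),
        (("middle_aged", "low", "yes", "excellent"), "yes"),
        (("youth", "medium", "no", "fair"), "no"),
        (("youth", "low", "yes", "fair"), "yes"),
        (("senior", "medium", "yes", "fair"), "yes"),
        (("youth", "medium", "yes", "excellent"), "yes"),
        (("middle_aged", "medium", "no", "excellent"), "yes"),
        (("middle_aged", "high", "yes", "fair"), "yes"),
        (("senior", "medium", "no", "excellent"), "no"),
    ]

    x = ("youth", "medium", "yes", "fair")

    for c in ("yes", "no"):
        rows = [r for r, label in data if label == c]
        prior = len(rows) / len(data)                       # P(Ci)
        likelihood = 1.0
        for k, value in enumerate(x):                       # P(xk | Ci)
            likelihood *= sum(r[k] == value for r in rows) / len(rows)
        print(c, round(likelihood * prior, 3))              # P(X|Ci) P(Ci)

    # Prints: yes 0.028 and no 0.007, so we predict buys_computer = yes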

Page 5: COMP595 HW11

Exercise 1

• Only one in 1000 adults is afflicted with a rare disease for which a diagnostic test has been

developed. The test is such that, when an individual actually has the disease, a positive

result will occur 99% of the time, while an individual without the disease will show a

positive test result only 2% of the time. If a randomly selected individual is tested and the

result is positive, what is the probability that the individual has the disease?

According to this question, we want to find the probability that a person has the disease given that the person has tested positive. We can write this as P(disease | positive).

We can apply Bayes' theorem, expanding the denominator with the law of total probability:

P(disease | positive) = P(positive | disease) P(disease) / P(positive)

where

P(positive) = P(positive | disease) P(disease) + P(positive | no disease) P(no disease)

To get

P(disease | positive) = (0.99)(0.001) / [(0.99)(0.001) + (0.02)(0.999)]
                      = 0.00099 / 0.02097
                      ≈ 0.047

So even after a positive test result, the probability that the individual actually has the disease is only about 4.7%, because the disease is so rare in the population.
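A short check in Python (again an illustrative addition, not part of the original homework):

    p_disease = 0.001          # prior: 1 in 1000 adults has the disease
    p_pos_disease = 0.99       # P(positive | disease)
    p_pos_healthy = 0.02       # P(positive | no disease)

    p_positive = p_pos_disease * p_disease + p_pos_healthy * (1 - p_disease)
    print(p_pos_disease * p_disease / p_positive)   # 0.0472..., about 4.7%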

Page 6: COMP595 HW11

Exercise 2

• Consider a medical diagnosis problem in which there are two alternative hypotheses: (1)

that the patient has a particular form of cancer, and (2) that the patient does not. The

available data is from a particular laboratory with two possible outcomes: positive and

negative. We have prior knowledge that over the entire population of people only 0.008 have

this disease. Furthermore, the lab test is only an imperfect indicator of the disease. The test

returns a correct positive result in only 98% of the cases in which the disease is actually

present and a correct negative result in only 97% of the cases in which the disease is not

present. In other cases, the test returns the opposite result. Suppose we now observe a new

patient for whom the lab test returns a positive result. Should we diagnose the patient as

having cancer or not?

In this situation, we are given a scenario with multiple hypotheses, two to be exact, and we want to find out whether the patient has cancer or not based on the given data. In Bayesian learning, we use the maximum a posteriori (MAP) hypothesis to choose the best hypothesis. hMAP is given by

hMAP = argmax over h in H of P(D|h) P(h)

where the denominator P(D) is dropped because it is constant across hypotheses. We can show the probabilities based on the given data D:

P(cancer) = 0.008
P(not cancer) = 0.992
P(positive | cancer) = 0.98
P(negative | cancer) = 0.02
P(positive | not cancer) = 0.03
P(negative | not cancer) = 0.97

Now we can calculate hMAP:

P(positive | cancer) P(cancer) = (0.98)(0.008) = 0.0078
P(positive | not cancer) P(not cancer) = (0.03)(0.992) = 0.0298

Given the scenario of the patient testing positive, we have the two possibilities with their probabilities above, and the maximum is 0.0298, which corresponds to not cancer. So we will diagnose the patient as not having cancer.
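The MAP comparison is small enough to verify in Python (an illustrative sketch, not part of the original homework):

    p_cancer = 0.008
    p_pos_cancer = 0.98        # P(positive | cancer)
    p_pos_healthy = 0.03       # P(positive | not cancer)

    # Unnormalized posteriors P(positive | h) P(h) for each hypothesis
    scores = {
        "cancer": p_pos_cancer * p_cancer,             # 0.0078
        "not cancer": p_pos_healthy * (1 - p_cancer),  # 0.0298
    }
    print(max(scores, key=scores.get))   # not cancer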
