
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Statistical Inference (by Michael Jordan)

Bayesian perspective
– conditional perspective: inferences should be made conditional on the current data
– natural in the setting of a long-term project with a domain expert
– the optimist: let's make the best possible use of our sophisticated inferential tool

Frequentist perspective
– unconditional perspective: inferential methods should give good answers in repeated use
– natural in the setting of writing software that will be used by many people with many data sets
– the pessimist: let's protect ourselves against bad decisions, given that our inferential procedure is inevitably based on a simplification of reality


Bayes Classifier

A probabilistic framework for solving classification problems

Conditional probability:

P(C | A) = P(A, C) / P(A)
P(A | C) = P(A, C) / P(C)

Bayes theorem:

P(C | A) = P(A | C) P(C) / P(A)
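To make these definitions concrete, both probabilities can be estimated by counting over a set of observed (A, C) records. The sketch below is a minimal illustration; the function name and toy data are our own, not from the slides.

```python
def p_c_given_a(records):
    """Estimate P(C|A) = P(A, C) / P(A) by counting over (a, c) boolean pairs."""
    n_a = sum(1 for a, c in records if a)                # records where A holds
    n_a_and_c = sum(1 for a, c in records if a and c)    # records where A and C both hold
    return n_a_and_c / n_a

# toy observations of (A, C)
records = [(True, True), (True, False), (False, False), (True, True), (False, True)]
print(p_c_given_a(records))   # 2/3
```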


Example of Bayes Theorem

Given:
– A doctor knows that C causes A 50% of the time
– Prior probability of any patient having C is 1/50,000
– Prior probability of any patient having A is 1/20

If a patient has A, what is the probability that he/she has C?

P(C | A) = P(A | C) P(C) / P(A) = (0.5 × 1/50,000) / (1/20) = 0.0002
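The arithmetic can be checked by plugging the three given numbers straight into Bayes theorem; the snippet below is a minimal sketch with variable names of our own choosing.

```python
# Bayes theorem: P(C|A) = P(A|C) * P(C) / P(A)
p_a_given_c = 0.5       # C causes A 50% of the time
p_c = 1 / 50_000        # prior probability of a patient having C
p_a = 1 / 20            # prior probability of a patient having A

p_c_given_a = p_a_given_c * p_c / p_a
print(p_c_given_a)      # 0.0002
```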


Bayesian Classifiers

Consider each attribute and class label as random variables

Given a record with attributes (A1, A2,…,An)

– Goal is to predict class C

– Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An )

Can we estimate P(C| A1, A2,…,An ) directly from data?


Bayesian Classifiers

Approach:

– compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem

– Choose value of C that maximizes P(C | A1, A2, …, An)

– Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C)

How to estimate P(A1, A2, …, An | C )?

P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)


Naïve Bayes Classifier

Assume independence among attributes Ai when the class is given:

– P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj)

– Can estimate P(Ai| Cj) for all Ai and Cj.

– New point is classified to Cj if P(Cj) P(Ai| Cj) is maximal.


Training dataset

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
30…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'

Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)


Naïve Bayesian Classifier: Example

Compute P(X|Ci) for each class

P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007

X belongs to class “buys_computer=yes”
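These numbers can be reproduced by counting directly over the fourteen training records. The sketch below is one possible implementation, not code from the slides; the record encoding and helper name are our own.

```python
# Each record: (age, income, student, credit_rating, buys_computer)
records = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("30...40","high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31...40","low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31...40","medium", "no",  "excellent", "yes"),
    ("31...40","high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]

x = ("<=30", "medium", "yes", "fair")   # record to classify

def naive_bayes_score(cls):
    """P(X | class) * P(class), estimated by simple counting."""
    in_class = [r for r in records if r[-1] == cls]
    prior = len(in_class) / len(records)
    likelihood = 1.0
    for i, value in enumerate(x):
        count = sum(1 for r in in_class if r[i] == value)
        likelihood *= count / len(in_class)
    return prior * likelihood

for cls in ("yes", "no"):
    print(cls, naive_bayes_score(cls))
# yes ~ 0.028, no ~ 0.007  ->  X is classified as buys_computer = 'yes'
```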


Naïve Bayes Classifier

If one of the conditional probabilities is zero, then the entire expression becomes zero.

Probability estimation:

Original:   P(Ai | C) = Nic / Nc
Laplace:    P(Ai | C) = (Nic + 1) / (Nc + c)
m-estimate: P(Ai | C) = (Nic + m·p) / (Nc + m)

Nc: number of training instances belonging to class C
Nic: number of instances of class C having attribute value Ai
c: number of classes
p: prior probability
m: parameter (equivalent sample size)
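As a concrete illustration of the three estimates, the helper below shows how smoothing keeps a zero count from wiping out the whole product. The function name and the default values of c, p and m are our own assumptions for the example.

```python
def estimate(n_ic, n_c, method="original", c=2, p=0.5, m=3):
    """Estimate P(Ai|C); n_ic = count of value Ai within class C, n_c = count of class C.

    Defaults for c (number of classes), p (prior) and m (equivalent sample size)
    are illustrative assumptions only.
    """
    if method == "original":
        return n_ic / n_c
    if method == "laplace":
        return (n_ic + 1) / (n_c + c)
    if method == "m-estimate":
        return (n_ic + m * p) / (n_c + m)
    raise ValueError(f"unknown method: {method}")

# A zero count no longer forces the whole product to zero once smoothing is applied:
print(estimate(0, 9, "original"))    # 0.0
print(estimate(0, 9, "laplace"))     # 1/11 ~ 0.091
print(estimate(0, 9, "m-estimate"))  # 1.5/12 = 0.125
```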


Naïve Bayes (Summary)

Robust to isolated noise points because such points are averaged out.

Handles missing values by ignoring the instance during probability estimation

Robust to irrelevant attributes. If Ai is irrelevant, then P(Ai | Y) becomes almost uniformly distributed.

Independence assumption may not hold for some attributes

– Use other techniques such as Bayesian Belief Networks (BBN)


Instance-Based Classifiers

[Figure: a set of stored cases, each with attributes Atr1 … AtrN and a class label (A, B, or C), and an unseen case with attributes Atr1 … AtrN to be classified]

• Store the training records

• Use training records to predict the class label of unseen cases


Nearest Neighbor Classifiers

Basic idea:

– If it walks like a duck, quacks like a duck, then it’s probably a duck

[Figure: given a test record, compute its distance to the training records and choose the k "nearest" records]


Nearest-Neighbor Classifiers

Requires three things

– The set of stored records

– Distance Metric to compute distance between records

– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:

– Compute distance to other training records

– Identify k nearest neighbors

– Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)



Nearest Neighbor Classification

Compute distance between two points:

– Euclidean distance

Determine the class from nearest neighbor list

– take the majority vote of class labels among the k-nearest neighbors

– Weigh the vote according to distance: weight factor w = 1/d²

d(p, q) = sqrt( Σi (pi - qi)² )
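Putting the pieces together, a minimal k-nearest-neighbor sketch can combine the Euclidean distance, the selection of the k nearest records, and the distance-weighted vote w = 1/d². The names, tie handling, and toy data below are our own choices, not code from the slides.

```python
import math
from collections import defaultdict

def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2)"""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(training, test_point, k=3, eps=1e-9):
    """training: list of (attribute_vector, class_label); returns the predicted label."""
    # pick the k nearest stored records
    neighbors = sorted(training, key=lambda rec: euclidean(rec[0], test_point))[:k]
    # distance-weighted vote, w = 1/d^2 (eps guards against a zero distance)
    votes = defaultdict(float)
    for attrs, label in neighbors:
        votes[label] += 1.0 / (euclidean(attrs, test_point) ** 2 + eps)
    return max(votes, key=votes.get)

training = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_classify(training, (1.1, 1.0), k=3))   # -> "A"
```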


Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.


Nearest Neighbor Classification…

Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes


K-nearest neighbors


Distance functions

Dsum(A,B) = Dgender(A,B) + Dage(A,B) + Dsalary(A,B)

Dnorm(A,B) = Dsum(A,B)/max(Dsum)

Deuclid(A,B) = sqrt(Dgender(A,B)² + Dage(A,B)² + Dsalary(A,B)²)
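One way to read these definitions in code is to compute a per-attribute distance for gender, age and salary and then combine them as Dsum, Dnorm, or Deuclid. The sketch below is our own illustration; the per-attribute distances (0/1 match for gender, range-scaled differences for age and salary) and the ranges are assumptions, not given on the slide.

```python
import math

def d_gender(a, b):
    return 0.0 if a["gender"] == b["gender"] else 1.0

def d_age(a, b, age_range=100.0):
    return abs(a["age"] - b["age"]) / age_range            # scaled to [0, 1]

def d_salary(a, b, salary_range=150_000.0):
    return abs(a["salary"] - b["salary"]) / salary_range   # scaled to [0, 1]

def d_sum(a, b):
    return d_gender(a, b) + d_age(a, b) + d_salary(a, b)

def d_norm(a, b, max_dsum=3.0):
    # with three per-attribute distances in [0, 1], Dsum is at most 3
    return d_sum(a, b) / max_dsum

def d_euclid(a, b):
    return math.sqrt(d_gender(a, b) ** 2 + d_age(a, b) ** 2 + d_salary(a, b) ** 2)

A = {"gender": "F", "age": 34, "salary": 60_000}
B = {"gender": "M", "age": 40, "salary": 75_000}
print(d_sum(A, B), d_norm(A, B), d_euclid(A, B))
```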


Remarks on Lazy vs. Eager Learning

Instance-based learning: lazy evaluation
Decision-tree and Bayesian classification: eager evaluation

Key differences:

– A lazy method may consider the query instance xq when deciding how to generalize beyond the training data D

– An eager method cannot, since it has already committed to a global approximation by the time it sees the query

Efficiency: lazy methods spend less time in training but more time in predicting

Accuracy:

– Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function

– Eager: must commit to a single hypothesis that covers the entire instance space