Classification: KNN Classifier, Naive Bayesian Classifier

CS479/508


Outline

1 KNN Classifier

2 Naive Bayesian Classifier


k-Nearest-Neighbor classifiers

First introduced in the early 1950s

Distance metrics

Euclidean distance is generally used when the values are continuous; Hamming distance is used for categorical/nominal values.

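As a quick illustration, a minimal base-R sketch of the two metrics (the example vectors are made up):

x <- c(1.0, 2.0, 3.5); y <- c(2.0, 0.5, 3.0)
sqrt(sum((x - y)^2))            # Euclidean distance for continuous values

a <- c("red", "small", "round"); b <- c("red", "large", "oval")
sum(a != b)                     # Hamming distance: count of mismatched positions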

Algorithm idea

Let k be the number of nearest neighbors and D be the set of training examples.

for each test example z = (x′, y′) do
    Compute d(x′, x), the distance between z and every example (x, y) ∈ D
    Select Dz ⊆ D, the set of the k training examples closest to z
    y′ = argmax_v Σ_{(xi, yi) ∈ Dz} I(v == yi)   // majority voting
end for

v: a class label
yi: the class label of one of the nearest neighbors
I(·): indicator function that returns 1 if its argument is true and 0 otherwise

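A from-scratch R sketch of the algorithm above; the function name and the choice of Euclidean distance are illustrative, not prescribed by the slides:

knn_predict <- function(train_x, train_y, z, k) {
  d  <- sqrt(rowSums(sweep(train_x, 2, z)^2))   # d(x', x) for every (x, y) in D
  Dz <- order(d)[1:k]                           # indices of the k closest examples
  names(which.max(table(train_y[Dz])))          # argmax_v of sum I(v == yi)
}

# e.g., classify one made-up flower using the built-in iris data:
knn_predict(as.matrix(iris[, 1:4]), iris$Species, c(6.1, 2.8, 4.7, 1.2), k = 5)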

Discussions

Computation can be costly if the number of training examples is large.

Efficient indexing techniques are available to find the nearest neighbors of a test example.

Sensitivity to k: one way to reduce the impact of k is to weight the influence of each nearest neighbor xi by wi = 1 / d(x′, xi)²

Distance-weighted voting:

y′ = argmax_v Σ_{(xi, yi) ∈ Dz} wi × I(v == yi)

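The same sketch with distance-weighted voting (again illustrative; a zero distance gives that neighbor infinite weight, which simply lets an exact match dominate):

knn_weighted <- function(train_x, train_y, z, k) {
  d  <- sqrt(rowSums(sweep(train_x, 2, z)^2))
  Dz <- order(d)[1:k]
  w  <- 1 / d[Dz]^2                               # wi = 1 / d(x', xi)^2
  names(which.max(tapply(w, train_y[Dz], sum)))   # argmax_v of sum wi * I(v == yi)
}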

Characteristics

Instance-based learning: (1) requires a proximity measure; (2) makes predictions without having to maintain a model.

Lazy learner: no model building is required.

Predictions are made based on local information, so the method is quite susceptible to noise.

Can produce arbitrarily shaped decision boundaries; more flexible than the decision tree approach.


References

Hui Ding, Goce Trajcevski, Peter Scheuermann, Xiaoyue Wang, Eamonn J. Keogh: Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1(2):1542-1552 (2008)

Abdullah Mueen, Eamonn J. Keogh, Neal Young: Logical-shapelets: an expressive primitive for time series classification. KDD 2011:1154-1162

Minh Nhut Nguyen, Xiao-Li Li, See-Kiong Ng: Positive Unlabeled Learning for Time Series Classification. IJCAI 2011:1421-1426


kNN classifier – R

> install.packages('class')

> library(class)

> help(knn)

https://cran.r-project.org/web/packages/class/class.pdf

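A minimal usage sketch with the built-in iris data (the train/test split is arbitrary):

library(class)
set.seed(1)
idx  <- sample(1:150, 100)                # 100 training rows, 50 test rows
pred <- knn(train = iris[idx, 1:4],
            test  = iris[-idx, 1:4],
            cl    = iris$Species[idx],
            k     = 3)
table(pred, iris$Species[-idx])           # confusion matrix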

Naive Bayesian Classifier

Non-deterministic relationship between the attribute set and the class label.

Reason 1: noisy data
Reason 2: complex factors

Probabilistic relationships between the attribute set and the class label.

Bayesian classifiers: a probabilistic framework for solving classification problems.


Bayes Theorem

Conditional probability

P(C|A) = P(A, C) / P(A)

P(A|C) = P(A, C) / P(C)

Bayes theorem

P(C|A) = P(A|C) P(C) / P(A)


Example of Bayes Theorem

Given:

A doctor knows that meningitis causes a stiff neck 50% of the time
The prior probability of any patient having meningitis is 1/50,000
The prior probability of any patient having a stiff neck is 1/20

If a patient has a stiff neck, what is the probability that he/she has meningitis?

Solve this problem:

M: a patient has meningitis
S: a patient has a stiff neck
P(S|M) = 50%, P(M) = 1/50,000, P(S) = 1/20, P(M|S) = ?

P(M|S) = P(S|M) P(M) / P(S) = (1/2) · (1/50,000) / (1/20) = 20/100,000 = 0.0002

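The arithmetic can be checked in R:

p_s_given_m <- 0.5; p_m <- 1/50000; p_s <- 1/20
p_s_given_m * p_m / p_s                   # 2e-04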

Bayesian Classifiers

Consider each attribute and class label as random variables

Given a record with attributes (A1, A2, ..., An)

Goal is to predict class C. Specifically, we want to find the value of C that maximizes P(C|A1, A2, ..., An)

Can we estimate P(C|A1, A2, ..., An) directly from data?


Bayesian Classifiers – approach

Compute the posterior probability P(C|A1, A2, ..., An) for all values of C using the Bayes theorem

P(C|A1, A2, ..., An) = P(A1, A2, ..., An|C) P(C) / P(A1, A2, ..., An)

Choose the value of C that maximizes P(C|A1, A2, ..., An). Since the denominator P(A1, A2, ..., An) is the same for every class, this is equivalent to maximizing P(A1, A2, ..., An|C) P(C).

How to estimate P(A1, A2, ..., An|C)?


Naive Bayes Classifier

Assume independence among the attributes Ai when the class is given:

P(A1, A2, ..., An|Cj) = P(A1|Cj) P(A2|Cj) ... P(An|Cj)

Can estimate P(Ai|Cj) for all Ai and Cj.

A new point is classified as Cj if P(Cj) Π_i P(Ai|Cj) is maximal.


Estimate Probabilities from Data

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

Class prior: P(C = Ci) = Ni / N

Discrete attributes: P(Ai = ai|C = Cj) = Nij / Nj

Nij: # of instances having attribute Ai = ai and belonging to class Cj
Nj: # of instances belonging to class Cj

P(No) = 0.7, P(Yes) = 0.3
P(Status = Married|No) = 4/7
P(Refund = Yes|Yes) = 0

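A sketch of estimating these probabilities in R from the table above (the data frame is typed in by hand):

train <- data.frame(
  Refund  = c("Yes","No","No","Yes","No","No","Yes","No","No","No"),
  Marital = c("Single","Married","Single","Married","Divorced",
              "Married","Divorced","Single","Married","Single"),
  Income  = c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90),
  Cheat   = c("No","No","No","No","Yes","No","No","Yes","No","Yes"))

prop.table(table(train$Cheat))                    # class priors: No 0.7, Yes 0.3
prop.table(table(train$Marital, train$Cheat), 2)  # P(Marital = . | Cheat = .), per column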

Estimate Probabilities from Data – continuous attributes

Discretize the range into bins: one ordinal attribute per bin

Two-way split: (A < v) or (A ≥ v); choose only one of the two splits as the new attribute

Probability density estimation: assume the attribute follows a normal distribution and use the data to estimate the parameters of the distribution (e.g., mean µ and standard deviation σ). Once the probability distribution is known, it can be used to estimate the conditional probability

P(Ai|cj) = (1 / (√(2π) σij)) exp( −(Ai − µij)² / (2σij²) )

µij: estimated as the sample mean of Ai over all training records with class cj
σij²: estimated as the sample variance s² of Ai over all training records with class cj


Estimate Probabilities from Data – Example

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

P(Ai|cj) = (1 / (√(2π) σij)) exp( −(Ai − µij)² / (2σij²) )

For (Income, Class = No):
Sample mean: (125 + 100 + 70 + 120 + 60 + 220 + 75) / 7 = 110
Sample variance: 2975

P(Income = 120|No) = (1 / (√(2π) · 54.54)) exp( −(120 − 110)² / (2 · 2975) ) = 0.0072

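This is exactly the Gaussian density that R's dnorm computes, so the value can be checked directly (sd is the square root of the sample variance 2975):

dnorm(120, mean = 110, sd = sqrt(2975))   # ≈ 0.0072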

Example of Naive Bayes Classifier – Example 1

Given a test record: X = (Refund = No, Married, Income = 120K)

P(Refund = Yes|No) = 3/7
P(Refund = No|No) = 4/7
P(Refund = Yes|Yes) = 0
P(Marital Status = Single|No) = 2/7
P(Marital Status = Divorced|No) = 1/7
P(Marital Status = Married|No) = 4/7
P(Marital Status = Single|Yes) = 2/3
P(Marital Status = Divorced|Yes) = 1/3
P(Marital Status = Married|Yes) = 0
P(Class = Yes) = 0.3
P(Class = No) = 0.7

For taxable income:
If class = No: sample mean = 110, variance = 2975
If class = Yes: sample mean = 90, variance = 25


Example 1 (cont.)

P(X|Class = No) = P(Refund = No|No) × P(Married|No) × P(Income = 120K|No)
               = 4/7 × 4/7 × 0.0072
               = 0.0024

P(X|Class = Yes) = P(Refund = No|Yes) × P(Married|Yes) × P(Income = 120K|Yes)
                = 1 × 0 × 1.2 × 10⁻⁹
                = 0

Since P(X|No) × P(No) > P(X|Yes) × P(Yes), we have P(No|X) > P(Yes|X) → class = No

Note that P(Married|Yes) = 0. If one of the conditional probabilities is zero, then the entire expression becomes zero.


Naive Bayes Classifier – Probability estimation for zero conditional probability

Original:   P(Ai = ai|C = Cj) = Nij / Nj

Laplace:    P(Ai = ai|C = Cj) = (Nij + 1) / (Nj + c)

m-estimate: P(Ai = ai|C = Cj) = (Nij + m·p) / (Nj + m)

Nij: the number of instances with attribute value ai and belonging to class Cj
Nj: the number of instances belonging to class Cj
c: total number of classes
p: prior probability for the positive class Cj
m: parameter

E.g., using the m-estimate with m = 3 and p = 0.3 (prior for the positive class Yes):

P(Marital Status = Married|Yes) = (0 + 3 × 0.3) / (3 + 3) = 0.9 / 6 = 0.15

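A one-line R sketch of the m-estimate (the function and argument names are illustrative):

m_estimate <- function(n_ij, n_j, p, m) (n_ij + m * p) / (n_j + m)
m_estimate(0, 3, p = 0.3, m = 3)          # 0.15, matching the computation above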

Probability estimation – Zero probability in R package e1071

The R package e1071, method naiveBayes, assumes

independence of the predictor variables, and
Gaussian distribution (given the target class) of metric predictors.

For attributes with missing values, the corresponding table entries are omitted for prediction.

laplace and threshold are used to adjust the attribute likelihood P(Ai = ai|C = Cj).

laplace is a positive number controlling Laplace smoothing, which adjusts the attribute likelihood P(Ai = ai|C = Cj) to be (Nij + laplace) / (Nj + |Ai| × laplace), where |Ai| is the number of different values of attribute Ai. This parameter is only used for attributes with categorical values. The default laplace value is zero.

The threshold parameter is used to replace zero attribute likelihoods P(Ai = ai|C = Cj).

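A usage sketch with categorical predictors, loosely following the Titanic example from the e1071 documentation, with add-one smoothing turned on:

library(e1071)
data(Titanic)
m <- naiveBayes(Survived ~ ., data = Titanic, laplace = 1)
m$tables$Class                            # smoothed P(Class | Survived)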

Example of Naive Bayes Classifier – Example 2

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals


Example 2 (cont.)

Given test case:

Name  Give Birth  Can Fly  Live in Water  Have Legs  Class
      yes         no       yes            no         ?

A: attributes

M: mammals

N: non-mammals


Example 2 (cont.)

P(M) = 7/20, P(N) = 13/20

P(GB = Y|M) = 6/7, P(GB = N|M) = 1/7
P(Can Fly = Y|M) = 1/7, P(Can Fly = N|M) = 6/7
P(Live in Water = Y|M) = 2/7, P(Live in Water = N|M) = 5/7
P(Have Legs = Y|M) = 5/7, P(Have Legs = N|M) = 2/7

P(X|M) = P(GB = Y|M) × P(Can Fly = N|M) × P(Live in Water = Y|M) × P(Have Legs = N|M)
       = 6/7 × 6/7 × 2/7 × 2/7 = 0.06

P(X|N) = P(GB = Y|N) × P(Can Fly = N|N) × P(Live in Water = Y|N) × P(Have Legs = N|N)
       = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

P(X|M) × P(M) = 0.06 × 7/20 = 0.021
P(X|N) × P(N) = 0.0042 × 13/20 = 0.0027

Since P(X|M) × P(M) > P(X|N) × P(N), class = Mammals

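Checking Example 2's arithmetic in R:

p_x_m <- (6/7) * (6/7) * (2/7) * (2/7)        # ≈ 0.06
p_x_n <- (1/13) * (10/13) * (3/13) * (4/13)   # ≈ 0.0042
p_x_m * 7/20                                  # ≈ 0.021
p_x_n * 13/20                                 # ≈ 0.0027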

Naive Bayesian classifier – R

> install.packages('e1071', dependencies = TRUE)

> library(e1071)

> help(naiveBayes)

https://cran.r-project.org/web/packages/e1071/e1071.pdf

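A minimal end-to-end sketch with the built-in iris data (evaluated on the training set only, so the accuracy is optimistic):

library(e1071)
model <- naiveBayes(Species ~ ., data = iris)
pred  <- predict(model, iris[, 1:4])
table(pred, iris$Species)                 # confusion matrix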

Naive Bayesian classifier – R

Sometimes, we may run into the following mistake:

model_nb <- naiveBayes(trn.classlabel ~ ., data = trn.data)

predict(model_nb, tst.data, type=c("class"))

#factor(0)

#Levels:

One possible reason is that your trn.classlabel values are treated as numerical values.

They should be categorical class labels, as stated in the naiveBayes method description:

"Computes the conditional a-posterior probabilities of a categorical class variable given independent predictor variables using the Bayes rule."

You may want to run the following to check whether the class labels are categorical:

is.factor(trn.classlabel)

If they are not categorical class labels, the following change should work

model_nb <- naiveBayes(as.factor(trn.classlabel) ~ ., data = trn.data)

predict(model_nb, tst.data, type=c("class"))


Naive Bayes Classifier (Summary)

Robust to isolated noise points

Handle missing values by ignoring the instance during probability estimate calculations

Robust to irrelevant attributes

Independence assumption may not hold for some attributes

Use other techniques such as Bayesian Belief Networks (BBN)
