TRANSCRIPT
Statistics 202: Data Mining, Week 6
Based in part on slides from the textbook and slides of Susan Holmes
© Jonathan Taylor
October 29, 2012
Part I
Other classification techniques
Rule based classifiers
Examples:
if Refund == No and Marital_Status == Married ⇒ Cheat = No
if Blood_Type == Warm and Lay_Eggs == Yes ⇒ Class = Birds
LHS = rule antecedent (condition)
RHS = rule consequent
Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Rule based classifiers
A rule covers a case (instance) if the case's features satisfy the antecedent (condition).
The coverage of a rule on a dataset is the fraction of cases the rule covers.
The accuracy of a rule is the fraction of the cases it covers for which the consequent agrees with the label.
Ideal rules have both high coverage and high accuracy (a sketch of these two quantities follows below).
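To make the two definitions concrete, here is a minimal sketch in Python (my own illustration, not from the lecture; the records and the rule are hypothetical, echoing the Refund/Marital_Status example above):

```python
# Hypothetical records: (refund, marital_status, cheat)
records = [
    ("No",  "Married", "No"),
    ("No",  "Single",  "No"),
    ("Yes", "Married", "No"),
    ("No",  "Single",  "Yes"),
    ("No",  "Married", "No"),
]

def antecedent(r):
    """Condition of the rule: Refund == No and Marital_Status == Married."""
    return r[0] == "No" and r[1] == "Married"

consequent = "No"  # the rule predicts Cheat = No

covered = [r for r in records if antecedent(r)]
coverage = len(covered) / len(records)        # fraction of all cases covered
accuracy = sum(r[2] == consequent for r in covered) / len(covered)
print(f"coverage = {coverage:.2f}, accuracy = {accuracy:.2f}")
# coverage = 0.40, accuracy = 1.00
```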
Rule Coverage and Accuracy
Coverage of a rule: the fraction of records that satisfy the antecedent of the rule.
Accuracy of a rule: among the records that satisfy the antecedent, the fraction that also satisfy the consequent.
Example: the rule (Status = Single) → No has Coverage = 40% and Accuracy = 50%.
How does a Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
A lemur triggers rule R3, so it is classified as a mammal.
A turtle triggers both R4 and R5.
A dogfish shark triggers none of the rules.
Rule based classifiers
A rule-based classifier is determined by a set of rules.
The rules are mutually exclusive if each case is covered by one and only one rule.
The rules are exhaustive if each case is covered by at least one rule.
A decision tree is an example of a rule-based classifier: each leaf defines one rule, and together the leaves are mutually exclusive and exhaustive.
From Decision Trees To Rules
Each root-to-leaf path yields one rule; the resulting rules are mutually exclusive and exhaustive.
The rule set contains as much information as the tree (see the sketch below).
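As an illustration of the tree-to-rules translation, here is a sketch that walks a fitted tree and prints one rule per leaf. Using scikit-learn and the iris data is my choice here, not the course's:

```python
# Turn each root-to-leaf path of a fitted decision tree into an if-then rule.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
tree = clf.tree_

def paths(node=0, conds=()):
    """Recursively collect (conditions, predicted class) for each leaf."""
    if tree.children_left[node] == -1:          # leaf node
        klass = iris.target_names[tree.value[node][0].argmax()]
        yield conds, klass
        return
    name, thr = iris.feature_names[tree.feature[node]], tree.threshold[node]
    yield from paths(tree.children_left[node],  conds + (f"{name} <= {thr:.2f}",))
    yield from paths(tree.children_right[node], conds + (f"{name} > {thr:.2f}",))

for conds, klass in paths():
    print("if", " and ".join(conds), "=> Class =", klass)
```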
Rules Can Be Simplified
Initial rule: (Refund = No) ∧ (Status = Married) → No
Simplified rule: (Status = Married) → No
We can simplify this rule because, among married instances, the class is No whatever the value of Refund.
Possible issues
Rules may not be mutually exclusive: order the rules, or take a majority vote among the rules a case triggers.
Rules may not be exhaustive: use a default class for cases no rule covers (both fixes are sketched below).
Finding rules
Try to extract rules directly from the data (see PRIM in ESL).
Extract rules from a decision tree (some tree software, such as C4.5, does this).
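A minimal sketch of both fixes, ordering and a default class, using the vertebrate rules R1–R5 from earlier; the record encoding is my own:

```python
# Ordered rule list: fire the first rule whose antecedent is satisfied.
rules = [
    (lambda r: r["gives_birth"] == "no"  and r["can_fly"] == "yes",        "Birds"),
    (lambda r: r["gives_birth"] == "no"  and r["lives_in_water"] == "yes", "Fishes"),
    (lambda r: r["gives_birth"] == "yes" and r["blood_type"] == "warm",    "Mammals"),
    (lambda r: r["gives_birth"] == "no"  and r["can_fly"] == "no",         "Reptiles"),
    (lambda r: r["lives_in_water"] == "sometimes",                         "Amphibians"),
]

def classify(record, default="Unknown"):
    """Return the consequent of the first triggered rule, else the default class."""
    for antecedent, consequent in rules:
        if antecedent(record):
            return consequent
    return default                     # no rule covers the record

turtle = {"gives_birth": "no", "can_fly": "no",
          "lives_in_water": "sometimes", "blood_type": "cold"}
dogfish = {"gives_birth": "yes", "can_fly": "no",
           "lives_in_water": "yes", "blood_type": "cold"}
print(classify(turtle))    # "Reptiles": R4 fires before R5 in this ordering
print(classify(dogfish))   # "Unknown": falls through to the default class
```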
Learning a rule
PRIM (Patient Rule Induction Method)
PRIM searches for boxes in feature space in which one class is concentrated, shrinking the box one face at a time ("patiently"); each box defines a rule.
[Figure: removing the cases covered by a discovered rule in order to find additional rules.]
Evaluating a rule
$\text{Accuracy}(R) = 1 - \text{Misclassification Error}(R) = \frac{n_c(R)}{n(R)}$, where $n(R)$ is the number of cases the rule covers and $n_c(R)$ is the number of those whose label agrees with the consequent.
$\text{Laplace}(R) = \frac{n_c(R) + 1}{n(R) + k}$, where $k$ is the number of classes . . .
$\text{M-estimate}(R) = \frac{n_c(R) + \alpha\, p_{i(R)}}{n(R) + \alpha}$, where $i(R)$ is the majority class, $(p_1, \ldots, p_k)$ is a set of prior probabilities on the $k$ classes, and $\alpha > 0$ is some prior weight.
The Laplace rule is effectively a posterior estimate of the probability of an instance belonging to a class, based on a prior probability . . . (a sketch of these measures follows below).
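The three measures as code (a sketch; the function names are mine). Note how the smoothed measures temper a perfect-looking rule that covers only a few cases:

```python
def accuracy(n_c, n):
    """Raw accuracy: correct cases over covered cases."""
    return n_c / n

def laplace(n_c, n, k):
    """Laplace-smoothed accuracy; k is the number of classes."""
    return (n_c + 1) / (n + k)

def m_estimate(n_c, n, p, alpha):
    """p is the prior probability of the rule's class, alpha > 0 a prior weight."""
    return (n_c + alpha * p) / (n + alpha)

# A rule covering 3 cases, all correct, looks perfect by raw accuracy:
print(accuracy(3, 3))                    # 1.0
print(laplace(3, 3, k=2))                # 0.8
print(m_estimate(3, 3, p=0.5, alpha=2))  # 0.8
```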
Properties of rule-based classifiers
Flexible like decision trees (even more flexible).
At least as expressive as decision trees.
New cases are easy to classify (once an ordering of the rules is established).
Rules can be extracted from decision trees.
Disadvantage: there is no clear loss function that they optimize . . .
Nearest-neighbour classifiers
For each x, find its k nearest neighbours among the cases in the data matrix X.
Let k*(x) denote the majority class among these k nearest neighbours.
Assign f(x) = k*(x).
Easy to describe, but computing many nearest neighbours can be expensive . . . (a sketch follows below).
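A minimal k-nearest-neighbour sketch (my own illustration in Python; the course used R). It computes all distances by brute force, which is exactly the expense noted above:

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, x, k=3):
    """X: (n, p) training matrix, y: labels, x: (p,) query point."""
    dists = np.linalg.norm(X - x, axis=1)   # Euclidean distance to every case
    nearest = np.argsort(dists)[:k]         # indices of the k nearest cases
    return Counter(y[nearest]).most_common(1)[0][0]   # majority class

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(["a", "a", "b", "b"])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))   # "a"
```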
Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
[Diagram: given a test record, compute its distance to the training records and choose the k "nearest" ones.]
Choice of k
If k is too small, the classifier is sensitive to noise points.
If k is too large, the neighbourhood may include points from other classes.
[Figures: nearest-neighbour classifier decision boundaries for k = 1, 3, 10, 20.]
[Figure: Linear Discriminant Analysis using (sepal.width, sepal.length).]
Naive Bayes classifiers
Bayesian classifiers
In LDA / QDA, we used Bayes' Theorem to classify.
Formally, given a partition $E_1, \ldots, E_k$, Bayes' Theorem says
$$P(E_i \mid A) = \frac{P(A \mid E_i)\,P(E_i)}{P(A)} = \frac{P(A \mid E_i)\,P(E_i)}{\sum_{j=1}^{k} P(A \mid E_j)\,P(E_j)}$$
We applied this to the events $E_c = \{Y = c\}$ and $A = \{X = x\}$, yielding
$$P(Y = c \mid X = x) = \frac{P(X = x \mid Y = c)\,P(Y = c)}{\sum_{j=1}^{k} P(X = x \mid Y = j)\,P(Y = j)} \propto P(X = x \mid Y = c)\,P(Y = c)$$
In LDA / QDA, $P(X = x \mid Y = c)$ is modelled by the density $\phi_{\mu_c, \Sigma_c}(x)$, with $\phi_{\mu, \Sigma}$ a multivariate Gaussian density (LDA uses a common $\Sigma$ across classes).
Naive Bayes classifiers
A Naive Bayes classifier applies Bayes' Theorem with several features $x = (x_1, \ldots, x_p)$:
$$P(Y = c \mid X_1 = x_1, \ldots, X_p = x_p) \propto P(X_1 = x_1, \ldots, X_p = x_p \mid Y = c)\,P(Y = c)$$
To use the classifier, we need to estimate the right-hand side above.
The Naive Bayes classifier assumes that the features are independent given the class label. Therefore,
$$P(Y = c \mid X_1 = x_1, \ldots, X_p = x_p) \propto \left( \prod_{l=1}^{p} P(X_l = x_l \mid Y = c) \right) P(Y = c)$$
This fits a discriminant model within each feature separately.
For continuous features, typically a one-dimensional QDA model is used (i.e. a Gaussian within each class); see the sketch below.
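A sketch of the Gaussian (one-dimensional QDA per feature) version described above; my own illustration, not the e1071 implementation:

```python
import numpy as np

def fit(X, y):
    """Estimate per-class priors, per-feature means, and per-feature variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),      # prior P(Y = c)
                     Xc.mean(axis=0),       # per-feature means
                     Xc.var(axis=0))        # per-feature variances
    return params

def predict(params, x):
    """Pick the class maximizing log prior + sum of per-feature Gaussian log densities."""
    def log_post(prior, mu, var):
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                            + (x - mu) ** 2 / var)
    return max(params, key=lambda c: log_post(*params[c]))

X = np.array([[1.0, 2.0], [1.2, 1.9], [3.0, 4.0], [3.1, 4.2]])
y = np.array([0, 0, 1, 1])
print(predict(fit(X, y), np.array([1.1, 2.1])))   # 0
```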
Laplace smoothing
For discrete features, it uses a discrete estimate:
$$\hat{P}(X_j = l \mid Y = c) = \frac{\#\{i : X_{ij} = l,\; Y_i = c\}}{\#\{i : Y_i = c\}}$$
The discrete estimates have the problem that if no instances of class $c$ had feature $X_j = l$, then any new instance $x$ with $x_j = l$ cannot be classified to class $c$ (the product in the Naive Bayes formula is zero).
One solution: use the Laplace-smoothed probabilities
$$\hat{P}(X_j = l \mid Y = c) = \frac{\#\{i : X_{ij} = l,\; Y_i = c\} + \alpha}{\#\{i : Y_i = c\} + \alpha \cdot k},$$
where $k$ is the number of levels of the feature $X_j$ (a sketch follows below).
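A sketch of the smoothed estimate for a single discrete feature (function and variable names are my own); with α = 1, an unseen (level, class) pair gets a small positive probability instead of zero:

```python
from collections import Counter

def smoothed_probs(xs, ys, levels, alpha=1.0):
    """P(X = l | Y = c) for every level l and class c, Laplace-smoothed."""
    k = len(levels)
    joint = Counter(zip(xs, ys))     # counts of each (level, class) pair
    class_n = Counter(ys)            # counts of each class
    return {(l, c): (joint[(l, c)] + alpha) / (class_n[c] + alpha * k)
            for l in levels for c in class_n}

xs = ["red", "red", "blue", "green"]
ys = ["a",   "a",   "b",    "b"]
probs = smoothed_probs(xs, ys, levels=["red", "blue", "green"])
# "green" never occurs with class "a", but its probability is not zero:
print(probs[("green", "a")])    # (0 + 1) / (2 + 3) = 0.2
```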
[Figure: Naive Bayes classifier.]
[Figure: Quadratic Discriminant Analysis using (sepal.width, sepal.length).]
Properties of Naive Bayes classifiers
The features may not actually be conditionally independent given the class, which can introduce some bias.
Continuous features may not be Gaussian (LDA suffers from this too . . . ).
Predicting new instances can be slow (at least in e1071's implementation).