TRANSCRIPT
Statistics 202: Data Mining, Week 6
Based in part on slides from the textbook and slides of Susan Holmes
© Jonathan Taylor
October 29, 2012
Part I
Other classification techniques
Rule based classifiers
Examples:
if Refund == No and Marital_Status == Married ⇒ Cheat = No
if Blood_Type == Warm and Lay_Eggs == Yes ⇒ Class = Birds
LHS = rule antecedent (condition)
RHS = rule consequent
Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Rule based classifiers
A rule covers a case (instance) if the case's features satisfy the antecedent (condition).
The coverage of a rule on a dataset is the fraction of cases the rule covers.
The accuracy of a rule is the fraction of the cases it covers for which the consequent agrees with the label.
Ideal rules have both high coverage and high accuracy (a sketch of these two quantities follows below).
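To make the two definitions concrete, here is a minimal sketch in Python (my own illustration, not from the lecture; the records and the rule are hypothetical, echoing the Refund/Marital_Status example above):

```python
# Hypothetical records: (refund, marital_status, cheat)
records = [
    ("No",  "Married", "No"),
    ("No",  "Single",  "No"),
    ("Yes", "Married", "No"),
    ("No",  "Single",  "Yes"),
    ("No",  "Married", "No"),
]

def antecedent(r):
    """Condition of the rule: Refund == No and Marital_Status == Married."""
    return r[0] == "No" and r[1] == "Married"

consequent = "No"  # the rule predicts Cheat = No

covered = [r for r in records if antecedent(r)]
coverage = len(covered) / len(records)        # fraction of all cases covered
accuracy = sum(r[2] == consequent for r in covered) / len(covered)
print(f"coverage = {coverage:.2f}, accuracy = {accuracy:.2f}")
# coverage = 0.40, accuracy = 1.00
```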
Rule Coverage and Accuracy
Coverage of a rule: the fraction of records that satisfy the antecedent of the rule.
Accuracy of a rule: among the records that satisfy the antecedent, the fraction that also satisfy the consequent.
Example: the rule (Status = Single) → No has Coverage = 40% and Accuracy = 50%.
How does a Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
A lemur triggers rule R3, so it is classified as a mammal.
A turtle triggers both R4 and R5.
A dogfish shark triggers none of the rules.
Rule based classifiers
A rule-based classifier is determined by a set of rules.
The rules are mutually exclusive if each case is covered by one and only one rule.
The rules are exhaustive if each case is covered by at least one rule.
A decision tree is an example of a rule-based classifier: each leaf defines one rule, and together the leaves are mutually exclusive and exhaustive.
From Decision Trees To Rules
Each root-to-leaf path yields one rule; the resulting rules are mutually exclusive and exhaustive.
The rule set contains as much information as the tree (see the sketch below).
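As an illustration of the tree-to-rules translation, here is a sketch that walks a fitted tree and prints one rule per leaf. Using scikit-learn and the iris data is my choice here, not the course's:

```python
# Turn each root-to-leaf path of a fitted decision tree into an if-then rule.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
tree = clf.tree_

def paths(node=0, conds=()):
    """Recursively collect (conditions, predicted class) for each leaf."""
    if tree.children_left[node] == -1:          # leaf node
        klass = iris.target_names[tree.value[node][0].argmax()]
        yield conds, klass
        return
    name, thr = iris.feature_names[tree.feature[node]], tree.threshold[node]
    yield from paths(tree.children_left[node],  conds + (f"{name} <= {thr:.2f}",))
    yield from paths(tree.children_right[node], conds + (f"{name} > {thr:.2f}",))

for conds, klass in paths():
    print("if", " and ".join(conds), "=> Class =", klass)
```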
Rules Can Be Simplified
Initial rule: (Refund = No) ∧ (Status = Married) → No
Simplified rule: (Status = Married) → No
We can simplify this rule because, among married instances, the class is No whatever the value of Refund.
Possible issues
Rules may not be mutually exclusive: order the rules, or take a majority vote among the rules a case triggers.
Rules may not be exhaustive: use a default class for cases no rule covers (both fixes are sketched below).
Finding rules
Try to extract rules directly from the data (see PRIM in ESL).
Extract rules from a decision tree (some tree software, such as C4.5, does this).
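A minimal sketch of both fixes, ordering and a default class, using the vertebrate rules R1–R5 from earlier; the record encoding is my own:

```python
# Ordered rule list: fire the first rule whose antecedent is satisfied.
rules = [
    (lambda r: r["gives_birth"] == "no"  and r["can_fly"] == "yes",        "Birds"),
    (lambda r: r["gives_birth"] == "no"  and r["lives_in_water"] == "yes", "Fishes"),
    (lambda r: r["gives_birth"] == "yes" and r["blood_type"] == "warm",    "Mammals"),
    (lambda r: r["gives_birth"] == "no"  and r["can_fly"] == "no",         "Reptiles"),
    (lambda r: r["lives_in_water"] == "sometimes",                         "Amphibians"),
]

def classify(record, default="Unknown"):
    """Return the consequent of the first triggered rule, else the default class."""
    for antecedent, consequent in rules:
        if antecedent(record):
            return consequent
    return default                     # no rule covers the record

turtle = {"gives_birth": "no", "can_fly": "no",
          "lives_in_water": "sometimes", "blood_type": "cold"}
dogfish = {"gives_birth": "yes", "can_fly": "no",
           "lives_in_water": "yes", "blood_type": "cold"}
print(classify(turtle))    # "Reptiles": R4 fires before R5 in this ordering
print(classify(dogfish))   # "Unknown": falls through to the default class
```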
Learning a rule
PRIM (Patient Rule Induction Method)
PRIM searches for boxes in feature space in which one class is concentrated, shrinking the box one face at a time ("patiently"); each box defines a rule.
[Figure: removing the cases covered by a discovered rule in order to find additional rules.]
Evaluating a rule
$\text{Accuracy}(R) = 1 - \text{Misclassification Error}(R) = \frac{n_c(R)}{n(R)}$, where $n(R)$ is the number of cases the rule covers and $n_c(R)$ is the number of those whose label agrees with the consequent.
$\text{Laplace}(R) = \frac{n_c(R) + 1}{n(R) + k}$, where $k$ is the number of classes . . .
$\text{M-estimate}(R) = \frac{n_c(R) + \alpha\, p_{i(R)}}{n(R) + \alpha}$, where $i(R)$ is the majority class, $(p_1, \ldots, p_k)$ is a set of prior probabilities on the $k$ classes, and $\alpha > 0$ is some prior weight.
The Laplace rule is effectively a posterior estimate of the probability of an instance belonging to a class, based on a prior probability . . . (a sketch of these measures follows below).
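The three measures as code (a sketch; the function names are mine). Note how the smoothed measures temper a perfect-looking rule that covers only a few cases:

```python
def accuracy(n_c, n):
    """Raw accuracy: correct cases over covered cases."""
    return n_c / n

def laplace(n_c, n, k):
    """Laplace-smoothed accuracy; k is the number of classes."""
    return (n_c + 1) / (n + k)

def m_estimate(n_c, n, p, alpha):
    """p is the prior probability of the rule's class, alpha > 0 a prior weight."""
    return (n_c + alpha * p) / (n + alpha)

# A rule covering 3 cases, all correct, looks perfect by raw accuracy:
print(accuracy(3, 3))                    # 1.0
print(laplace(3, 3, k=2))                # 0.8
print(m_estimate(3, 3, p=0.5, alpha=2))  # 0.8
```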
Properties of rule-based classifiers
Flexible like decision trees (even more flexible).
At least as expressive as decision trees.
New cases are easy to classify (once an ordering of the rules is established).
Rules can be extracted from decision trees.
Disadvantage: there is no clear loss function that they optimize . . .
Nearest-neighbour classifiers
For each x, find its k nearest neighbours among the cases in the data matrix X.
Let k*(x) denote the majority class among these k nearest neighbours.
Assign f(x) = k*(x).
Easy to describe, but computing many nearest neighbours can be expensive . . . (a sketch follows below).
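A minimal k-nearest-neighbour sketch (my own illustration in Python; the course used R). It computes all distances by brute force, which is exactly the expense noted above:

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, x, k=3):
    """X: (n, p) training matrix, y: labels, x: (p,) query point."""
    dists = np.linalg.norm(X - x, axis=1)   # Euclidean distance to every case
    nearest = np.argsort(dists)[:k]         # indices of the k nearest cases
    return Counter(y[nearest]).most_common(1)[0][0]   # majority class

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(["a", "a", "b", "b"])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))   # "a"
```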
Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
[Diagram: given a test record, compute its distance to the training records and choose the k "nearest" ones.]
Choice of k
If k is too small, the classifier is sensitive to noise points.
If k is too large, the neighbourhood may include points from other classes.
[Figures: nearest-neighbour classifier decision boundaries for k = 1, 3, 10, 20.]
[Figure: Linear Discriminant Analysis using (sepal.width, sepal.length).]
Naive Bayes classifiers
Bayesian classifiers
In LDA / QDA, we used Bayes' Theorem to classify.
Formally, given a partition $E_1, \ldots, E_k$, Bayes' Theorem says
$$P(E_i \mid A) = \frac{P(A \mid E_i)\,P(E_i)}{P(A)} = \frac{P(A \mid E_i)\,P(E_i)}{\sum_{j=1}^{k} P(A \mid E_j)\,P(E_j)}$$
We applied this to the events $E_c = \{Y = c\}$ and $A = \{X = x\}$, yielding
$$P(Y = c \mid X = x) = \frac{P(X = x \mid Y = c)\,P(Y = c)}{\sum_{j=1}^{k} P(X = x \mid Y = j)\,P(Y = j)} \propto P(X = x \mid Y = c)\,P(Y = c)$$
In LDA / QDA, $P(X = x \mid Y = c)$ is modelled by the density $\phi_{\mu_c, \Sigma_c}(x)$, with $\phi_{\mu, \Sigma}$ a multivariate Gaussian density (LDA uses a common $\Sigma$ across classes).
Naive Bayes classifiers
A Naive Bayes classifier applies Bayes' Theorem with several features $x = (x_1, \ldots, x_p)$:
$$P(Y = c \mid X_1 = x_1, \ldots, X_p = x_p) \propto P(X_1 = x_1, \ldots, X_p = x_p \mid Y = c)\,P(Y = c)$$
To use the classifier, we need to estimate the right-hand side above.
The Naive Bayes classifier assumes that the features are independent given the class label. Therefore,
$$P(Y = c \mid X_1 = x_1, \ldots, X_p = x_p) \propto \left( \prod_{l=1}^{p} P(X_l = x_l \mid Y = c) \right) P(Y = c)$$
This fits a discriminant model within each feature separately.
For continuous features, typically a one-dimensional QDA model is used (i.e. a Gaussian within each class); see the sketch below.
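A sketch of the Gaussian (one-dimensional QDA per feature) version described above; my own illustration, not the e1071 implementation:

```python
import numpy as np

def fit(X, y):
    """Estimate per-class priors, per-feature means, and per-feature variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),      # prior P(Y = c)
                     Xc.mean(axis=0),       # per-feature means
                     Xc.var(axis=0))        # per-feature variances
    return params

def predict(params, x):
    """Pick the class maximizing log prior + sum of per-feature Gaussian log densities."""
    def log_post(prior, mu, var):
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                            + (x - mu) ** 2 / var)
    return max(params, key=lambda c: log_post(*params[c]))

X = np.array([[1.0, 2.0], [1.2, 1.9], [3.0, 4.0], [3.1, 4.2]])
y = np.array([0, 0, 1, 1])
print(predict(fit(X, y), np.array([1.1, 2.1])))   # 0
```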
Laplace smoothing
For discrete features, it uses a discrete estimate:
$$\hat{P}(X_j = l \mid Y = c) = \frac{\#\{i : X_{ij} = l,\; Y_i = c\}}{\#\{i : Y_i = c\}}$$
The discrete estimates have the problem that if no instances of class $c$ had feature $X_j = l$, then any new instance $x$ with $x_j = l$ cannot be classified to class $c$ (the product in the Naive Bayes formula is zero).
One solution: use the Laplace-smoothed probabilities
$$\hat{P}(X_j = l \mid Y = c) = \frac{\#\{i : X_{ij} = l,\; Y_i = c\} + \alpha}{\#\{i : Y_i = c\} + \alpha \cdot k},$$
where $k$ is the number of levels of the feature $X_j$ (a sketch follows below).
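A sketch of the smoothed estimate for a single discrete feature (function and variable names are my own); with α = 1, an unseen (level, class) pair gets a small positive probability instead of zero:

```python
from collections import Counter

def smoothed_probs(xs, ys, levels, alpha=1.0):
    """P(X = l | Y = c) for every level l and class c, Laplace-smoothed."""
    k = len(levels)
    joint = Counter(zip(xs, ys))     # counts of each (level, class) pair
    class_n = Counter(ys)            # counts of each class
    return {(l, c): (joint[(l, c)] + alpha) / (class_n[c] + alpha * k)
            for l in levels for c in class_n}

xs = ["red", "red", "blue", "green"]
ys = ["a",   "a",   "b",    "b"]
probs = smoothed_probs(xs, ys, levels=["red", "blue", "green"])
# "green" never occurs with class "a", but its probability is not zero:
print(probs[("green", "a")])    # (0 + 1) / (2 + 3) = 0.2
```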
[Figure: Naive Bayes classifier.]
[Figure: Quadratic Discriminant Analysis using (sepal.width, sepal.length).]
Properties of Naive Bayes classifiers
The features may not actually be conditionally independent given the class, which can introduce some bias.
Continuous features may not be Gaussian (LDA suffers from this too . . . ).
Predicting new instances can be slow (at least in e1071's implementation).