Case Based Reasoning
Lecture 3: CBR Case-Base Indexing
Outline
Indexing CBR case knowledge
Why might we want an index?
Decision tree indexes
C4.5 algorithm
Summary
Why might we want an index?
Efficiency
Similarity matching is computationally expensive for large case-bases
Similarity matching can be computationally expensive for complex case representations
Relevancy of cases for similarity matching
some features of the new problem may make certain cases irrelevant, despite their being very similar
With an index, cases are pre-selected from the case-base
Similarity matching is applied only to this subset of cases
What to index?
Client Ref #: 64
Client Name: John Smith
Address: 39 Union Street
Tel: 01224 665544
Photo:
Age: 37
Occupation: IT Analyst
Income: £20000
…
Case features are either indexed or unindexed. In this record, the identifying details (client ref, name, address, tel, photo) are unindexed features, while the predictive details (age, occupation, income) are indexed features.
Indexed vs Unindexed Features
Indexed features:
are used for retrieval
are predictive of the case’s solution
Unindexed features:
are not used for retrieval
are not predictive of the case’s solution
provide valuable contextual information and lessons learned
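As a minimal sketch of how this split might look in code (the class and field names are illustrative, not from the lecture), a case can carry both kinds of features while exposing only the indexed ones to retrieval:

```python
from dataclasses import dataclass

@dataclass
class ClientCase:
    # Unindexed features: context and lessons learned; ignored during retrieval
    client_ref: int
    name: str
    address: str
    tel: str
    # Indexed features: predictive of the solution; used for retrieval
    age: int
    occupation: str
    income: float

# Only these fields take part in similarity matching
INDEXED_FEATURES = ("age", "occupation", "income")
```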
Playing Tennis Example (case-base)
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Cloudy Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Cloudy Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Cloudy Mild High True Yes
Cloudy Hot Normal False Yes
Rainy Mild High True No
Decision Tree (Index) for Playing Tennis

outlook
  sunny -> humidity:
    high -> No
    normal -> Yes
  cloudy -> Yes
  rainy -> windy:
    true -> No
    false -> Yes
Choosing the Root Attribute

Candidate splits of the 14 cases, showing the Play values in each branch:

humidity: high -> {Yes Yes Yes No No No No}   normal -> {Yes Yes Yes Yes Yes Yes No}
temperature: hot -> {Yes Yes No No}   mild -> {Yes Yes Yes Yes No No}   cool -> {Yes Yes Yes No}
outlook: sunny -> {Yes Yes No No No}   cloudy -> {Yes Yes Yes Yes}   rainy -> {Yes Yes Yes No No}
windy: false -> {Yes Yes Yes Yes Yes Yes No No}   true -> {Yes Yes Yes No No No}

Which attribute is best for the root of the tree?
- the one that gives the best information gain
- in this case outlook (as we are going to see)
Building Decision Trees – C4.5 Algorithm

Based on information theory (Shannon, 1948)
Divide and conquer strategy:
Choose an attribute for the root node
Create a branch for each value of that attribute
Split the cases according to the branches
Repeat the process for each branch until all cases in the branch have the same class
Assumption:
the simplest tree which classifies the cases is best
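A minimal sketch of this divide-and-conquer recursion (not code from the lecture): cases are tuples whose last element is the class, and `score` stands in for the attribute-selection measure, i.e. the information gain defined on the following slides.

```python
from collections import Counter, defaultdict

def build_tree(cases, attributes, score, target=-1):
    # attributes: attribute name -> tuple position in a case
    # score(cases, position): rates a candidate split (e.g. information gain)
    classes = [case[target] for case in cases]
    if len(set(classes)) == 1:         # all cases in the branch share a class
        return classes[0]              # -> leaf node
    if not attributes:                 # nothing left to split on: majority leaf
        return Counter(classes).most_common(1)[0][0]
    # choose the best attribute, branch on each of its values, split, recurse
    best = max(attributes, key=lambda a: score(cases, attributes[a]))
    branches = defaultdict(list)
    for case in cases:
        branches[case[attributes[best]]].append(case)
    remaining = {a: p for a, p in attributes.items() if a != best}
    return (best, {value: build_tree(subset, remaining, score, target)
                   for value, subset in branches.items()})
```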
Entropy of a set of cases
Playing Tennis Example:
S is the set of 14 cases
We want to classify the cases according to the values of “Play”, i.e., Yes and No in this example:
the proportion of “Yes” cases is 9 out of 14: 9/14 = 0.64
the proportion of “No” cases is 5 out of 14: 5/14 = 0.36
Entropy measures the impurity of S:
Entropy(S) = -0.64 (log2 0.64) - 0.36 (log2 0.36)
= -0.64 (-0.644) - 0.36 (-1.474) = 0.41 + 0.53 = 0.94
[Figure: the 14 Playing Tennis cases, with the 9 “Yes” cases and 5 “No” cases marked]
Entropy of a set of cases
S is a set of cases
A is a feature
Play in the example
{S1 ... Si ... Sn} are the partitions of S according to the values of A
Yes and No in the example
{P1 ... Pi ... Pn} are the proportions of {S1 ... Si ... Sn} in S
Entropy(S) = -\sum_{i=1}^{n} p_i \log_2 p_i
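As a sketch in Python (the function name is mine), the formula transcribes directly and reproduces the 0.94 computed above:

```python
import math

def entropy(proportions):
    # Entropy(S) = -sum_i p_i * log2(p_i); empty partitions contribute nothing
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(round(entropy([9/14, 5/14]), 2))   # 0.94 for the Playing Tennis case-base
```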
Gain of an attribute
Calculate Gain (S, A) for each attribute A
expected reduction in entropy due to sorting on A
Choose the attribute with the highest gain as the root of the tree
Gain(S, A) = Entropy(S) - Expectation(A), where
Expectation(A) = \sum_{i=1}^{n} (|S_i|/|S|) \cdot Entropy(S_i)
{S1, ..., Si, ..., Sn} = partitions of S according to the values of attribute A
n = number of values of attribute A (i.e., the number of partitions)
|Si| = number of cases in the partition Si
|S| = total number of cases in S
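Combining the two formulas as a sketch (same case layout as the earlier build_tree sketch: tuples with the class label last):

```python
import math
from collections import Counter

def entropy(labels):
    # entropy of a list of class labels
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

def gain(cases, position, target=-1):
    # Gain(S, A) = Entropy(S) - sum_i |Si|/|S| * Entropy(Si)
    expectation = 0.0
    for value in set(case[position] for case in cases):
        subset = [case[target] for case in cases if case[position] == value]
        expectation += len(subset) / len(cases) * entropy(subset)
    return entropy([case[target] for case in cases]) - expectation
```

This `gain` is exactly the kind of `score` function the build_tree sketch above expects.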
Which attribute is root?
If Outlook is made root of the tree
There are 3 partitions of the cases
S1 for Sunny, S2 for Cloudy, S3 for Rainy
S1(Sunny)= {cases 1,2,8,9,11}
|S1| = 5
In these 5 cases, the values for Play are 3 No and 2 Yes
Entropy(S1) = -2/5 (log2 2/5) - 3/5 (log2 3/5) = 0.97
Similarly
Entropy(S2)= 0
Entropy(S3)= 0.97
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Cloudy Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Cloudy Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Cloudy Mild High True Yes
Cloudy Hot Normal False Yes
Rainy Mild High True No
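These partition entropies are easy to check (a sketch, reusing the entropy-over-labels form):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

print(round(entropy(["No", "No", "No", "Yes", "Yes"]), 2))  # S1 (Sunny): 0.97
print(round(entropy(["Yes", "Yes", "Yes", "Yes"]), 2))      # S2 (Cloudy): -0.0, i.e. 0
print(round(entropy(["Yes", "Yes", "Yes", "No", "No"]), 2)) # S3 (Rainy): 0.97
```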
Which attribute is root?
Gain(S, Outlook) = Entropy(S) - Expectation(Outlook)
= 0.94 - [5/14 * 0.97 + 4/14 * 0 + 5/14 * 0.97]
= 0.247
Similarly
Gain(S, Humidity) = 0.151
Gain(S, Windy) = 0.048
Gain(S, Temperature) = 0.029
Gain(S, Outlook) is the highest gain
Outlook should be the root of the decision tree (index)
Expectation(Outlook) = (|S_1|/|S|) \cdot Entropy(S_1) + (|S_2|/|S|) \cdot Entropy(S_2) + (|S_3|/|S|) \cdot Entropy(S_3)
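Running the gain computation over the whole case-base confirms the choice; a self-contained sketch (entropy and gain as defined earlier):

```python
import math
from collections import Counter

CASES = [  # (Outlook, Temperature, Humidity, Windy, Play)
    ("Sunny", "Hot", "High", False, "No"),     ("Sunny", "Hot", "High", True, "No"),
    ("Cloudy", "Hot", "High", False, "Yes"),   ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"), ("Rainy", "Cool", "Normal", True, "No"),
    ("Cloudy", "Cool", "Normal", True, "Yes"), ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"), ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),  ("Cloudy", "Mild", "High", True, "Yes"),
    ("Cloudy", "Hot", "Normal", False, "Yes"), ("Rainy", "Mild", "High", True, "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

def gain(cases, position):
    expectation = 0.0
    for value in set(case[position] for case in cases):
        subset = [case[-1] for case in cases if case[position] == value]
        expectation += len(subset) / len(cases) * entropy(subset)
    return entropy([case[-1] for case in cases]) - expectation

for name, position in [("Outlook", 0), ("Temperature", 1),
                       ("Humidity", 2), ("Windy", 3)]:
    print(name, round(gain(CASES, position), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048
```

(Humidity comes out as 0.152 here because no intermediate rounding is done; rounding the partition entropies first, as on the slides, gives 0.151. Either way Outlook wins.)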
Repeat for Sunny Node

The cloudy branch is already pure (all Yes), but the sunny branch (“?” in the tree) still mixes Yes and No, so the process repeats on its 5 cases with the remaining attributes. The Play values in each candidate branch are:

temperature: hot -> {No No}   mild -> {Yes No}   cool -> {Yes}
windy: false -> {Yes No No}   true -> {Yes No}
humidity: high -> {No No No}   normal -> {Yes Yes}

Only humidity yields pure partitions, so it is chosen for the sunny node.
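The same computation restricted to the five Sunny cases shows why humidity is chosen there (a sketch; entropy and gain as before):

```python
import math
from collections import Counter

SUNNY = [  # (Temperature, Humidity, Windy, Play) for the five Sunny cases
    ("Hot", "High", False, "No"),
    ("Hot", "High", True, "No"),
    ("Mild", "High", False, "No"),
    ("Cool", "Normal", False, "Yes"),
    ("Mild", "Normal", True, "Yes"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

def gain(cases, position):
    expectation = 0.0
    for value in set(case[position] for case in cases):
        subset = [case[-1] for case in cases if case[position] == value]
        expectation += len(subset) / len(cases) * entropy(subset)
    return entropy([case[-1] for case in cases]) - expectation

for name, position in [("Temperature", 0), ("Humidity", 1), ("Windy", 2)]:
    print(name, round(gain(SUNNY, position), 3))
# Temperature 0.571, Humidity 0.971, Windy 0.02 -> humidity wins
```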
Repeat for Rainy Node

The sunny subtree is now resolved (humidity: high -> No, normal -> Yes) and the cloudy branch is pure (Yes). The rainy branch still mixes Yes and No; its 5 cases are:

Temperature Humidity Windy Play
Mild High False Yes
Cool Normal False Yes
Cool Normal True No
Mild Normal False Yes
Mild High True No

Splitting on windy separates them perfectly: false -> {Yes Yes Yes}, true -> {No No}.
Decision Tree (Index) for Playing Tennis

outlook
  sunny -> humidity:
    high -> No
    normal -> Yes
  cloudy -> Yes
  rainy -> windy:
    true -> No
    false -> Yes
Case Retrieval via DTree Index
Typical implementation:
the case-base is indexed using a decision tree
cases are “stored” in the leaves of the index
the DTree is created from the cases, giving automated indexing of the case-base
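A hedged sketch of such retrieval (the tree layout and names are assumptions, not from the lecture): internal nodes are (attribute, branches) pairs as in the earlier build_tree sketch, except that each leaf holds the list of cases that reached it, and the caller supplies the similarity measure.

```python
def retrieve(tree, query, similarity, k=3):
    # Walk the decision-tree index down to a leaf...
    while isinstance(tree, tuple):         # internal node: (attribute, branches)
        attribute, branches = tree
        tree = branches[query[attribute]]  # follow the branch matching the query
    # ...then run k-NN on the pre-selected cases stored in that leaf
    return sorted(tree, key=lambda case: similarity(query, case),
                  reverse=True)[:k]
```

Only the cases in one leaf are compared, which is exactly the efficiency argument from the start of the lecture.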
Summary
A decision tree is built from the cases
Decision trees are often used for problem-solving directly
In CBR, the decision tree is used to partition the cases
Similarity matching is then applied to the cases in a leaf node
Indexing pre-selects the relevant cases for k-NN retrieval
BRING CALCULATOR on MONDAY