data mining tutorial - session 4: classification · data mining tutorial e. schubert, e. ntoutsi...

Data Mining Tutorial

E. Schubert, E. Ntoutsi

Exercise 3-3

Exercise 4-1

Exercise 4-2

Exercise 4-3

Exercise 5-1

Exercise 5-2

Exercise 5-3

Exercise 6-2

Data Mining Tutorial, Session 4: Classification

Erich Schubert, Eirini Ntoutsi

Ludwig-Maximilians-Universität München

2012-05-31 — KDD class tutorial


Evaluation of classifiers

Build the confusion matrix:

true \ predicted    A    B    C    C_i
A                   4    0    1     5
B                   2    2    1     5
C                   1    1    3     5
K_i                 7    3    5    15

class    |TP|    |FP|    |FN|
A         4       3       1
B         2       1       3
C         3       2       2


Precision(K,A) = 4/7    Precision(K,B) = 2/3    Precision(K,C) = 3/5

Recall(K,A) = 4/5       Recall(K,B) = 2/5       Recall(K,C) = 3/5

F1(K,A) = 2/3           F1(K,B) = 1/2           F1(K,C) = 3/5

$$\mathrm{Precision}(K, i) = \frac{|\{o \in K_i \mid K(o) = C(o)\}|}{|K_i|} = \frac{|TP_i|}{|TP_i| + |FP_i|}$$

$$\mathrm{Recall}(K, i) = \frac{|\{o \in C_i \mid K(o) = C(o)\}|}{|C_i|} = \frac{|TP_i|}{|TP_i| + |FN_i|}$$

$$F_1(K, i) = \frac{2 \cdot \mathrm{Recall}(K, i) \cdot \mathrm{Precision}(K, i)}{\mathrm{Recall}(K, i) + \mathrm{Precision}(K, i)} \qquad \left(\text{not general: } = \frac{2\,|TP_i|}{2\,|TP_i| + |FP_i| + |FN_i|}\right)$$

Micro Average F1:
|TP| = 4 + 2 + 3 = 9
|FP| = 3 + 1 + 2 = 6
|FN| = 1 + 3 + 2 = 6

Precision: 9/15
Recall: 9/15
Micro Average F1: 9/15 = 0.6

Notice: Precision = Recall = F1.
Not just by chance: each equals sum of diagonal / sum total.

Macro Average F1:
average precision: 1/3 · (4/7 + 2/3 + 3/5) ≈ 0.613
average recall: 1/3 · (4/5 + 2/5 + 3/5) = 0.6
Macro Average F1 ≈ (2 · 0.6 · 0.613) / (0.6 + 0.613) ≈ 0.606

Micro average: all instances are weighted equally.
Macro average: all classes are weighted equally.

Consider the following scenario: 1% of the objects are in class “interesting”, 99% are in class “uninteresting”. Which measure is then more useful?
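The per-class, micro-averaged and macro-averaged values from the slides can be recomputed from the confusion matrix. The following is a minimal sketch, assuming rows are true classes and columns are predicted classes (consistent with the K_i and C_i totals). The function and variable names are illustrative and not part of the tutorial. Macro F1 is computed from the averaged precision and recall, as in the slides.

```python
from fractions import Fraction

# Confusion matrix from the slides: rows = true class, columns = predicted class.
CLASSES = ["A", "B", "C"]
CONFUSION = [
    [4, 0, 1],
    [2, 2, 1],
    [1, 1, 3],
]

def per_class_counts(confusion):
    """Return a list of (TP, FP, FN) tuples, one per class."""
    n = len(confusion)
    counts = []
    for i in range(n):
        tp = confusion[i][i]
        fp = sum(confusion[r][i] for r in range(n)) - tp  # column sum minus TP
        fn = sum(confusion[i]) - tp                        # row sum minus TP
        counts.append((tp, fp, fn))
    return counts

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

counts = per_class_counts(CONFUSION)
precision = [Fraction(tp, tp + fp) for tp, fp, fn in counts]
recall = [Fraction(tp, tp + fn) for tp, fp, fn in counts]
f1_scores = [f1(p, r) for p, r in zip(precision, recall)]

# Micro average: pool the counts over all classes first.
tp_sum = sum(tp for tp, fp, fn in counts)
fp_sum = sum(fp for tp, fp, fn in counts)
fn_sum = sum(fn for tp, fp, fn in counts)
micro_precision = Fraction(tp_sum, tp_sum + fp_sum)
micro_recall = Fraction(tp_sum, tp_sum + fn_sum)
micro_f1 = f1(micro_precision, micro_recall)

# Macro average: average the per-class values, then combine.
macro_precision = sum(precision) / len(CLASSES)
macro_recall = sum(recall) / len(CLASSES)
macro_f1 = f1(macro_precision, macro_recall)

print(f1_scores)        # [Fraction(2, 3), Fraction(1, 2), Fraction(3, 5)]
print(micro_f1)         # 3/5
print(float(macro_f1))  # approximately 0.606
```

Exact fractions are used so that the results can be compared directly with the values on the slides.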

Naive Bayes

weather   snow    ski?
sunny     < 50    no
rainy     < 50    no
rainy     ≥ 50    no
snow      ≥ 50    yes
snow      < 50    no
sunny     ≥ 50    yes
snow      ≥ 50    yes
rainy     < 50    yes

A priori probabilities and conditional probabilities:

P(ski) = 1/2
P(¬ski) = 1/2

P(weather = sunny | ski) = 1/4
P(weather = snow | ski) = 2/4
P(weather = rainy | ski) = 1/4
P(weather = sunny | ¬ski) = 1/4
P(weather = snow | ¬ski) = 1/4
P(weather = rainy | ¬ski) = 2/4

P(snow ≥ 50 | ski) = 3/4
P(snow < 50 | ski) = 1/4
P(snow ≥ 50 | ¬ski) = 1/4
P(snow < 50 | ¬ski) = 3/4

Summary:

          a priori    weather                    snow
                      sunny    snow    rainy     ≥ 50    < 50
ski       1/2         1/4      2/4     1/4       3/4     1/4
¬ski      1/2         1/4      1/4     2/4       1/4     3/4

A) weather = sunny, snow ≥ 50

P(ski | weather = sunny, snow ≥ 50)
  = P(weather = sunny | ski) · P(snow ≥ 50 | ski) · P(ski) / P(weather = sunny, snow ≥ 50)
  = (1/4 · 3/4 · 1/2) / P(weather = sunny, snow ≥ 50)
  = (3/32) / P(weather = sunny, snow ≥ 50)

P(¬ski | weather = sunny, snow ≥ 50)
  = P(weather = sunny | ¬ski) · P(snow ≥ 50 | ¬ski) · P(¬ski) / P(weather = sunny, snow ≥ 50)
  = (1/4 · 1/4 · 1/2) / P(weather = sunny, snow ≥ 50)
  = (1/32) / P(weather = sunny, snow ≥ 50)

⇒ ski

B) weather = rainy, snow < 50

P(ski | weather = rainy, snow < 50)
  = P(weather = rainy | ski) · P(snow < 50 | ski) · P(ski) / P(weather = rainy, snow < 50)
  = (1/4 · 1/4 · 1/2) / P(weather = rainy, snow < 50)
  = (1/32) / P(weather = rainy, snow < 50)

P(¬ski | weather = rainy, snow < 50)
  = P(weather = rainy | ¬ski) · P(snow < 50 | ¬ski) · P(¬ski) / P(weather = rainy, snow < 50)
  = (2/4 · 3/4 · 1/2) / P(weather = rainy, snow < 50)
  = (6/32) / P(weather = rainy, snow < 50)

⇒ do not ski

C) weather = snow, snow < 50

P(ski | weather = snow, snow < 50)
  = P(weather = snow | ski) · P(snow < 50 | ski) · P(ski) / P(weather = snow, snow < 50)
  = (2/4 · 1/4 · 1/2) / P(weather = snow, snow < 50)
  = (2/32) / P(weather = snow, snow < 50)

P(¬ski | weather = snow, snow < 50)
  = P(weather = snow | ¬ski) · P(snow < 50 | ¬ski) · P(¬ski) / P(weather = snow, snow < 50)
  = (1/4 · 3/4 · 1/2) / P(weather = snow, snow < 50)
  = (3/32) / P(weather = snow, snow < 50)

⇒ do not ski
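The three Naive Bayes decisions can be reproduced with a few lines of code. This is a minimal sketch, assuming the eight training records from the ski table. The names `TRAINING`, `score` and `classify` are illustrative. `score` returns only the numerator of the posterior, because the evidence term P(weather, snow) is the same for both classes and cancels in the comparison.

```python
from fractions import Fraction

# Training data from the slides: (weather, snow depth, ski?)
TRAINING = [
    ("sunny", "<50", False),
    ("rainy", "<50", False),
    ("rainy", ">=50", False),
    ("snow", ">=50", True),
    ("snow", "<50", False),
    ("sunny", ">=50", True),
    ("snow", ">=50", True),
    ("rainy", "<50", True),
]

def score(weather, snow, ski):
    """P(weather | class) * P(snow | class) * P(class), estimated by counting."""
    rows = [row for row in TRAINING if row[2] == ski]
    prior = Fraction(len(rows), len(TRAINING))
    p_weather = Fraction(sum(row[0] == weather for row in rows), len(rows))
    p_snow = Fraction(sum(row[1] == snow for row in rows), len(rows))
    return prior * p_weather * p_snow

def classify(weather, snow):
    """Return True for 'ski'. A tie is resolved as 'do not ski' (an arbitrary choice)."""
    return score(weather, snow, True) > score(weather, snow, False)

decision_a = classify("sunny", ">=50")
decision_b = classify("rainy", "<50")
decision_c = classify("snow", "<50")
print(decision_a, decision_b, decision_c)  # True False False
```

No smoothing is applied here, since every attribute value occurs at least once in each class of this data set.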


kNN classification

Optimal classifier (only for random labels): always assign the majority label.
Expected error rate?

Since |A| = |B| = |D|/2, the error rate will be 50%.

Evaluation of classifiers: Leave-one-out validation

Leave-one-out validation: expected error rate?

The wrong class always becomes the majority, since we leave out only the test object.

Expected error will be 100%! This is obviously too pessimistic.
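The leave-one-out effect is easy to check by simulation. The following is a minimal sketch, assuming a balanced two-class data set and a classifier that always predicts the majority label of its training data. The data set of ten labels and the function names are made up for illustration.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]

def leave_one_out_error(labels):
    """Error rate of the majority classifier under leave-one-out validation."""
    errors = 0
    for i, true_label in enumerate(labels):
        training = labels[:i] + labels[i + 1:]  # everything except the test object
        if majority_label(training) != true_label:
            errors += 1
    return errors / len(labels)

# Balanced data set: |A| = |B| = |D| / 2
labels = ["A"] * 5 + ["B"] * 5
loo_error = leave_one_out_error(labels)
print(loo_error)  # 1.0
```

Removing the test object always leaves its own class in the minority (4 against 5 here), so every single prediction is wrong.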

Evaluation of classifiers: Bootstrap method

Bootstrap by sampling with replacement: every object is left out with a probability of

$$\left(1 - \frac{1}{n}\right)^n \approx 0.368$$

i.e. only about 63.2% of the objects are used for training. (In 10-fold cross validation, 90% of the data is used!) Regular error estimation would be pessimistic, as the training sample contains duplicates.

Common practice: also include the observed classification error (on the training data) during evaluation:

error rate = 0.632 · error on test set + 0.368 · error on training set

This is repeated multiple times (with different samples) and averaged.

The error rate of the constant classifier is ≈ 50% (if we average enough iterations).

New best classifier for the training set: “memorize”! On the training set, the “memorize” approach can achieve a precision of 100%!

Then:

error rate = 0.632 · 50% + 0.368 · 0% = 31.6%

which is too optimistic an estimate.
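Both the 0.368 constant and the 31.6% estimate can be checked numerically. This is a minimal sketch with illustrative function names. The simulation part uses a fixed seed and a sample size of 10000, both chosen arbitrarily.

```python
import math
import random

def left_out_probability(n):
    """Probability that a given object is never drawn in n draws with replacement."""
    return (1 - 1 / n) ** n

def simulated_left_out_fraction(n, rng):
    """Draw one bootstrap sample of size n and measure the fraction of unused objects."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n

def bootstrap_632_error(test_error, training_error):
    """Weighted combination of test and training error used by the bootstrap method."""
    return 0.632 * test_error + 0.368 * training_error

rng = random.Random(42)
analytic = left_out_probability(1000)              # close to 1/e
simulated = simulated_left_out_fraction(10000, rng)

# "Memorize" classifier on random labels: 50% test error, 0% training error.
memorize_estimate = bootstrap_632_error(0.5, 0.0)

print(round(analytic, 3), round(math.exp(-1), 3))
print(round(simulated, 3))
print(round(memorize_estimate, 3))
```

The limit of (1 − 1/n)^n for large n is 1/e, which is where the 0.368 and 0.632 weights come from.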

Nearest Neighbor Classification

[Figure: plot with both axes ticked from 1 to 9; the data points are not recoverable from the extracted text.]

Decision trees

Recall: when splitting T on attribute A into partitions T_1 ... T_m:

$$\mathrm{entropy}(T) = -\sum_{i=1}^{k} p_i \cdot \log p_i$$

$$\text{information-gain}(T, A) = \mathrm{entropy}(T) - \sum_{i=1}^{m} \frac{|T_i|}{|T|} \, \mathrm{entropy}(T_i)$$

Full data set: entropy(T) = 1, since p(R = low) = 1/2 = p(R = high).

Information gain for time attribute.

Entropy for T_1 (1-2 years): T_1 = Person 1, 4, 6
p(R = low) = 1/3, p(R = high) = 2/3

$$\mathrm{entropy}(T_1) = -\sum_{i=1,2} p_i \log p_i = -\left(\tfrac{1}{3}\log\tfrac{1}{3} + \tfrac{2}{3}\log\tfrac{2}{3}\right) \approx 0.918$$

Entropy for T_2 (2-7 years): T_2 = Person 2, 7, 8
p(R = low) = 2/3, p(R = high) = 1/3
entropy(T_2) = entropy(T_1) ≈ 0.918

Entropy for T_3 (> 7 years): T_3 = Person 3, 5
p(R = low) = 1/2, p(R = high) = 1/2

$$\mathrm{entropy}(T_3) = -\left(\tfrac{1}{2}\log\tfrac{1}{2}\right) \cdot 2 = 1$$

$$\text{information-gain}(T, \mathrm{Time}) = \mathrm{entropy}(T) - \sum_{i=1,2,3} \frac{|T_i|}{|T|}\,\mathrm{entropy}(T_i) = 1 - \left(\tfrac{3}{8} \cdot 0.918 + \tfrac{3}{8} \cdot 0.918 + \tfrac{2}{8} \cdot 1\right) \approx 0.06$$

Information gain for gender attribute.

Entropy for T_1 (m): T_1 = Person 1, 2, 5, 6, 8
p(R = low) = 2/5, p(R = high) = 3/5
entropy(T_1) ≈ 0.971

Entropy for T_2 (f): T_2 = Person 3, 4, 7
p(R = low) = 2/3, p(R = high) = 1/3
entropy(T_2) ≈ 0.918

$$\text{information-gain}(T, \mathrm{Gender}) = \mathrm{entropy}(T) - \sum_{i=1,2} \frac{|T_i|}{|T|}\,\mathrm{entropy}(T_i) = 1 - \left(\tfrac{5}{8} \cdot 0.971 + \tfrac{3}{8} \cdot 0.918\right) \approx 0.05$$

Information gain for area attribute.

Entropy for T_1 (urban): T_1 = Person 1, 7, 8
p(R = low) = 1, p(R = high) = 0
entropy(T_1) = 0

Entropy for T_2 (rural): T_2 = Person 2, 3, 4, 5, 6
p(R = low) = 1/5, p(R = high) = 4/5
entropy(T_2) ≈ 0.722

$$\text{information-gain}(T, \mathrm{Area}) = 1 - \left(0 + \tfrac{5}{8} \cdot 0.722\right) \approx 0.55$$

Attribute Area has the highest gain.

Area
  urban: Person 1, 7, 8; p(R = low) = 1
  rural: Person 2-6; p(R = low) = 1/5, p(R = high) = 4/5

Right branch:

$$\mathrm{entropy}(T) = -\left(\tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{4}{5}\log\tfrac{4}{5}\right) \approx 0.722$$

Information gain for time attribute (right branch).

Entropy for T_1 (1-2 years): T_1 = Person 4, 6
p(R = high) = 1
entropy(T_1) = 0

Entropy for T_2 (2-7 years): T_2 = Person 2
p(R = high) = 1
entropy(T_2) = 0

Entropy for T_3 (> 7 years): T_3 = Person 3, 5
p(R = low) = 1/2, p(R = high) = 1/2

$$\mathrm{entropy}(T_3) = -\left(\tfrac{1}{2}\log\tfrac{1}{2}\right) \cdot 2 = 1$$

$$\text{information-gain}(T, \mathrm{Time}) = \mathrm{entropy}(T) - \sum_{i=1,2,3} \frac{|T_i|}{|T|}\,\mathrm{entropy}(T_i) = 0.722 - \left(\tfrac{2}{5} \cdot 0 + \tfrac{1}{5} \cdot 0 + \tfrac{2}{5} \cdot 1\right) \approx 0.322$$

Information gain for gender attribute (right branch).

Entropy for T_1 (m): T_1 = Person 2, 5, 6
p(R = high) = 1
entropy(T_1) = 0

Entropy for T_2 (f): T_2 = Person 3, 4
p(R = low) = 1/2, p(R = high) = 1/2
entropy(T_2) = 1

$$\text{information-gain}(T, \mathrm{Gender}) = \mathrm{entropy}(T) - \sum_{i=1,2} \frac{|T_i|}{|T|}\,\mathrm{entropy}(T_i) = 0.722 - \left(\tfrac{3}{5} \cdot 0 + \tfrac{2}{5} \cdot 1\right) \approx 0.322$$

Same gain for both. Choose any.

Area
  urban: Person 1, 7, 8; p(R = low) = 1
  rural: Person 2-6; split on Gender
    m: Person 2, 5, 6; p(R = high) = 1
    f: Person 3, 4; split on Time
      > 7: Person 3; p(R = low) = 1
      1-2: Person 4; p(R = high) = 1
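The information gains in this exercise can be recomputed in a few lines. The sketch below assumes a training table reconstructed from the partitions listed on the slides (which persons fall into which attribute value, and the class proportions per partition), since the original exercise table is not part of this text. Logarithms are taken to base 2, which matches entropy(T) = 1 for the balanced full data set. All names are illustrative.

```python
from collections import Counter
from math import log2

# person id -> (time, gender, area, R); reconstructed from the slide partitions (assumption).
PEOPLE = {
    1: ("1-2", "m", "urban", "low"),
    2: ("2-7", "m", "rural", "high"),
    3: (">7", "f", "rural", "low"),
    4: ("1-2", "f", "rural", "high"),
    5: (">7", "m", "rural", "high"),
    6: ("1-2", "m", "rural", "high"),
    7: ("2-7", "f", "urban", "low"),
    8: ("2-7", "m", "urban", "low"),
}
ATTRIBUTES = {"time": 0, "gender": 1, "area": 2}
LABEL = 3

def entropy(rows):
    """Shannon entropy (base 2) of the class label R over the given rows."""
    total = len(rows)
    counts = Counter(row[LABEL] for row in rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def information_gain(rows, attribute):
    """entropy(T) minus the size-weighted entropy of the partitions of T by attribute."""
    column = ATTRIBUTES[attribute]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[column], []).append(row)
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(rows) - remainder

all_rows = list(PEOPLE.values())
rural_rows = [row for row in all_rows if row[2] == "rural"]

root_gains = {a: information_gain(all_rows, a) for a in ATTRIBUTES}
rural_gains = {a: information_gain(rural_rows, a) for a in ("time", "gender")}

for name, gain in root_gains.items():
    print("root", name, round(gain, 3))
for name, gain in rural_gains.items():
    print("rural", name, round(gain, 3))
```

At the root, Area has by far the largest gain, and in the rural branch Time and Gender tie, which is why either may be chosen there.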

Information gain

A) Uniform distribution on classes and values: ∀i: p_i = 1/k

$$-\sum_i p_i \log p_i = -k \cdot \frac{1}{k} \log\frac{1}{k} = -\log\frac{1}{k} = \log k$$

Since p_i = p_i(T_j^A), entropy(T_j^A) = entropy(T) = log k.

$$\text{information-gain}(T, A) = \mathrm{entropy}(T) - \sum_{j=1}^{m_A} \frac{|T_j^A|}{|T|} \cdot \mathrm{entropy}(T_j^A) = \log k - m_A \cdot \frac{1}{m_A} \log k = 0$$

As expected, splitting on this attribute yields no gain.

B) Additional uniform attribute:

$$\text{information-gain}(T, A) = \mathrm{entropy}(T) - \frac{1}{|T|} \sum_{j=1}^{m_A} |T_j^A| \cdot \mathrm{entropy}(T_j^A)$$

$$\text{information-gain}(T, A') = \mathrm{entropy}(T) - \frac{1}{|T|} \sum_{j=1}^{m_A+1} |T_j^{A'}| \cdot \mathrm{entropy}(T_j^{A'})$$

$$= \mathrm{entropy}(T) - \frac{1}{|T|} \left( \sum_{j=1}^{m_A} |T_j^{A'}| \cdot \underbrace{\mathrm{entropy}(T_j^{A'})}_{\le \log k} + |T_{m_A+1}^{A'}| \cdot \mathrm{entropy}(T_{m_A+1}^{A'}) \right)$$

Gain cannot improve, log k is the maximal entropy!

Therefore, a split on A is preferred to A′.

C) At most one instance per attribute value.

Within each partition: ∃i: p_i = 1, ∀j ≠ i: p_j = 0 ⇒ ∀i: entropy(T_i^A) = 0

$$\text{information-gain}(T, A) = \mathrm{entropy}(T) - \sum_i \frac{|T_i^A|}{|T|} \cdot 0 = \mathrm{entropy}(T)$$

Best choice: maximum information gain! Single split: every branch is pure and the tree is complete.

Danger of overfitting! There is no training error, but the tree is just memorizing the data and will not generalize to unseen data. The true error will likely be much higher.

Example: split by unique record ID.
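Cases A and C can be illustrated on a small example. This is a minimal sketch, assuming eight made-up class labels, a hypothetical attribute `uninformative` whose partitions have the same class distribution as the full set, and a unique record ID per object. All names and data are illustrative and base-2 logarithms are used.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain of splitting the labels by the given per-object attribute values."""
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Hypothetical labels for eight objects (four per class).
labels = ["low", "high", "low", "high", "high", "high", "low", "low"]

# Case A: every partition has the same class distribution as the full set.
uninformative = ["a", "a", "b", "b", "a", "b", "a", "b"]
gain_uninformative = information_gain(labels, uninformative)

# Case C: one object per attribute value, e.g. a unique record ID.
record_ids = list(range(1, len(labels) + 1))
gain_by_id = information_gain(labels, record_ids)

print(gain_uninformative)  # 0.0
print(gain_by_id)          # 1.0
```

The record ID reaches the maximal possible gain, entropy(T), although it carries no information about unseen objects, which is the overfitting risk described above.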



Kernel functions

“Kernel” can be confusing. Distinguish:
- Kernel function (this section)
- Kernel density function (in probability distributions)
- Kernel matrix (often: a precomputed distance matrix)
- Positive (semi-)definite matrix A, with d(x, x) := x^T A x ≥ 0

Positive definite matrix A ⇒ x^T A y is a kernel function.
Arbitrary kernel function ⇏ representable as positive definite matrix.

Positive semi-definite ↔ generalized dot products

Usual dot product: $\langle x, y \rangle = \sum_i x_i y_i$
Generalized dot product: $\langle x, y \rangle_A = x^T \cdot A \cdot y$

Matrix E such that $x^T \cdot E \cdot y = \langle x, y \rangle$?

$$\langle x, y \rangle = \sum_i \sum_j e_{ij} \cdot x_i \cdot y_j, \qquad e_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$

This is the unit matrix!
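The claim that the unit matrix reproduces the usual dot product can be checked directly. This is a minimal sketch in plain Python; the vectors and the second matrix are arbitrary examples and the function names are illustrative.

```python
def dot(x, y):
    """Usual dot product: sum_i x_i * y_i."""
    return sum(a * b for a, b in zip(x, y))

def generalized_dot(x, matrix, y):
    """Generalized dot product x^T A y = sum_i sum_j a_ij * x_i * y_j."""
    return sum(
        matrix[i][j] * x[i] * y[j]
        for i in range(len(x))
        for j in range(len(y))
    )

def unit_matrix(n):
    """n x n matrix with e_ij = 1 if i == j, else 0."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

x = [1, 2, 3]
y = [4, -5, 6]

plain = dot(x, y)
via_unit_matrix = generalized_dot(x, unit_matrix(3), y)
print(plain, via_unit_matrix)  # 12 12

# A different positive definite matrix (here: diagonal weights) gives a weighted dot product.
weights = [[2, 0, 0], [0, 1, 0], [0, 0, 3]]
weighted = generalized_dot(x, weights, y)
print(weighted)  # 52
```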

Proof for some kernel functions:

0) $k_0(x, y) = \langle x, y \rangle = x^T \cdot y$

$$k_0(x, x) = \langle x, x \rangle = \sum_i x_i x_i = \sum_i x_i^2 \ge 0 \quad \text{obviously}$$

A) $k_1(x, y) = 1 = c^+$ for $c^+ \ge 0$

$$k_1(x, x) = c^+ \ge 0 \quad \text{trivial.}$$


B) $k_2(x, y) = 3 \cdot x^T \cdot y = c^+ \cdot k_0(x, y)$

$$k_2(x, x) = \underbrace{c^+}_{\ge 0} \cdot \underbrace{k_0(x, x)}_{\ge 0} \ge 0$$

C) $k_3(x, y) = 3 \cdot x^T \cdot y + 5 = c^+ \cdot k_0(x, y) + d^+$

Same thing. More general: any polynomial built of non-negative factors and positive semi-definite kernel functions is positive semi-definite.

Example: $2 k_0(x, y) \cdot k_1(x, y) + k_0(x, y)^2 + k_1(x, y)^2 + 7$
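A numerical spot check can complement the argument for k3. The sketch below goes slightly beyond the slides: instead of only looking at k(x, x), it evaluates the quadratic form c^T K c of the Gram matrix K for random coefficient vectors c, which is the usual definition of positive semi-definiteness for a kernel. The sample points, the seed and all function names are arbitrary assumptions, and a passing check is evidence, not a proof.

```python
import random

def k0(x, y):
    """Linear kernel: the usual dot product."""
    return sum(a * b for a, b in zip(x, y))

def k3(x, y):
    """Kernel from part C: 3 * <x, y> + 5."""
    return 3 * k0(x, y) + 5

def gram_matrix(kernel, points):
    """Matrix of all pairwise kernel values."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(matrix, c):
    """c^T K c = sum_i sum_j c_i * K_ij * c_j."""
    n = len(c)
    return sum(c[i] * matrix[i][j] * c[j] for i in range(n) for j in range(n))

random.seed(0)
points = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(6)]
gram = gram_matrix(k3, points)

# Smallest quadratic form over many random coefficient vectors.
smallest = min(
    quadratic_form(gram, [random.uniform(-1, 1) for _ in points])
    for _ in range(1000)
)
print(smallest >= -1e-9)  # True
```

For this kernel the quadratic form equals 3·‖Σ c_i x_i‖² + 5·(Σ c_i)², so it can never be negative apart from floating-point rounding, which is why a small tolerance is used.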

