
CSE674 HW3 (Midterm Prep.) Review

Prof. S. Parthasarathy
prepared by Yiye Ruan

February 8, 2011


Outline

1 2.2 Classify Attributes

2 2.13 Problematic k-NN

3 2.16 TF-IDF

4 2.19 Similarity/Distance Measures

5 4.1 Decision Tree

6 4.2 Gini Index

7 4.3 Entropy and Information Gain

8 4.7 Decision Tree

9 Questions and Discussion


2.2 Classify Attributes

(a) Time in terms of AM or PM: binary, qualitative, ordinal (or nominal?).

(b) Brightness by light meter: continuous, quantitative, ratio.

(c) Brightness by human judgements: discrete, qualitative, ordinal.

(d) Angles in degrees: continuous, quantitative, ratio.

(e) Bronze, Silver and Gold medals: discrete, qualitative, ordinal (or nominal for sportsmanship :D).

(f) Height above sea level: continuous, quantitative, interval or ratio.

(g) Number of patients: discrete, quantitative, ratio.


2.2 Classify Attributes

(h) ISBN: discrete, qualitative, nominal.

(i) Ability to pass light: discrete, qualitative, ordinal.

(j) Military rank: discrete, qualitative, ordinal.

(k) Distance from campus center: continuous, quantitative, interval or ratio.

(l) Density of a substance: continuous, quantitative, ratio.

(m) Coat check number: discrete, qualitative, nominal.


2.13 Problematic k-NN

Problems (corner cases):

L[i-1], L[i] and L[i+1] have the same value but different labels.

L[2], L[3], L[4] are duplicates of the same object (assuming k = 5). May or may not be a problem.

Solutions:

Adaptive boundary: choose k adaptively, e.g. include every point tied at the k-th smallest distance.

Eliminate duplicates before voting. (Both fixes are sketched below.)
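A minimal Python sketch of both fixes, not part of the original solution; it assumes 1-D objects stored as (value, label) pairs, and the function name and interface are illustrative:

```python
from collections import Counter

def knn_predict(train, query, k, drop_duplicates=True):
    """Classify `query` by majority vote among its k nearest neighbors.

    If several points tie at the k-th smallest distance, all of them are
    included (an adaptive boundary), so the vote is well defined even when
    L[i-1], L[i], L[i+1] share the same value. A tie in the vote itself
    would still need a tie-breaking rule.
    """
    if drop_duplicates:
        train = list(dict.fromkeys(train))  # remove exact duplicate objects
    ranked = sorted(train, key=lambda p: abs(p[0] - query))
    cutoff = abs(ranked[k - 1][0] - query)  # the k-th smallest distance
    neighbors = [p for p in ranked if abs(p[0] - query) <= cutoff]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```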


2.16 TF-IDF

(a) If a term occurs in only one document, then df_i = 1 and the new score is tf_i × log m, i.e. the original score multiplied by log m.
If a term occurs in every document, then df_i = m, so log(m/df_i) = log 1 = 0 and the new score is 0.

(b) The purpose is to alleviate a drawback of the raw tf score. Stop words like "the" and "I" have high tf scores because they occur frequently in documents. However, these words are not representative, since they appear in most of the documents. When the inverse document frequency is factored in, the scores of stop words are reduced, since their df values are generally very close to m.
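A minimal sketch of the weighting just described; the function name and the toy counts are illustrative, not from the textbook, and the log base only rescales the weights:

```python
import math

def tfidf(term_counts, doc_freq, num_docs):
    """tf'_i = tf_i * log(m / df_i).

    term_counts: {term: tf in this document}
    doc_freq:    {term: df_i, number of documents containing the term}
    num_docs:    m, total number of documents
    """
    return {t: tf * math.log(num_docs / doc_freq[t])
            for t, tf in term_counts.items()}

# A stop word in all m documents gets weight log(m/m) = 0, while a term
# occurring in few documents keeps a large multiplier.
print(tfidf({"the": 50, "entropy": 3}, {"the": 100, "entropy": 2}, 100))
# {'the': 0.0, 'entropy': 11.73...}
```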


2.19 Similarity/Distance Measures

(a) Cosine: cos(x, y) = (x · y) / (‖x‖ ‖y‖)

(b) Correlation: corr(x, y) = s_xy / (s_x s_y)

(c) Euclidean: d(x, y) = √(∑_{k=1}^{n} (x_k − y_k)²)

(d) Jaccard: J(x, y) = f_11 / (f_01 + f_10 + f_11)
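The four measures are easy to compute directly. The following sketch (standard library only; function names are illustrative) mirrors the formulas above and reproduces the values on the next slide:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def correlation(x, y):
    # Sample statistics; the 1/(n-1) normalization cancels in the ratio.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)  # raises ZeroDivisionError when undefined, as in (a)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard(x, y):
    # Binary vectors only: f11 / (f01 + f10 + f11) ignores 0-0 matches.
    f11 = sum(1 for a, b in zip(x, y) if a == b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == b == 0)
    return f11 / (len(x) - f00)
```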


2.19 Similarity/Distance Measures

(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine: 1, corr: undefined, Euclidean: 2

(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine: 0, corr: −1, Euclidean: 2, Jaccard: 0

(c) x = (0, −1, 0, 1), y = (1, 0, −1, 0): cosine: 0, corr: 0, Euclidean: 2

(d) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1): cosine: 0.75, corr: 0.25, Jaccard: 0.6

(e) x = (2, −1, 0, 2, 0, −3), y = (−1, 1, −1, 0, 0, −1): cosine: 0, corr: 0
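Reusing the helper functions from the sketch above, case (d) can be checked directly:

```python
x = (1, 1, 0, 1, 0, 1)
y = (1, 1, 1, 0, 0, 1)
print(cosine(x, y))       # 0.75
print(correlation(x, y))  # 0.25 (up to floating-point rounding)
print(jaccard(x, y))      # 0.6
```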


4.1 Decision Tree

“In Boolean algebra, a parity function is a Boolean function whose value is 1 if the input vector has an odd number of ones.” [1]

This tree cannot be simplified because the class distribution won't be uniform even when any three Boolean attributes are fixed.

[1] http://en.wikipedia.org/wiki/Parity_function
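A brute-force check of this claim for four Boolean attributes (a sketch, not part of the original solution):

```python
from itertools import product

def parity(bits):
    # 1 iff the input vector has an odd number of ones
    return sum(bits) % 2

# Fix any three of the four attributes; the remaining attribute still flips
# the class, so every such node contains one example of each class and no
# subtree can be pruned away.
for fixed in product([0, 1], repeat=3):
    labels = [parity(fixed + (b,)) for b in (0, 1)]
    assert sorted(labels) == [0, 1]
print("No three fixed attributes determine the parity class.")
```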


4.2 Gini Index

Gini = 1 − ∑_{i=1}^{n} p(c_i)²

(a) Overall: 1 − (1/2)² − (1/2)² = 0.5

(b) Customer ID: 20 × (1/20) × [1 − (1/1)² − (0/1)²] = 0

(c) Gender: (10/20)[1 − (6/10)² − (4/10)²] + (10/20)[1 − (4/10)² − (6/10)²] = 0.48

(d) Car Type: (4/20)[1 − (1/4)² − (3/4)²] + (8/20)[1 − (8/8)² − (0/8)²] + (8/20)[1 − (1/8)² − (7/8)²] = 0.1625

(e) Shirt Size: (5/20)[1 − (3/5)² − (2/5)²] + (7/20)[1 − (3/7)² − (4/7)²] + (4/20)[1 − (2/4)² − (2/4)²] + (4/20)[1 − (2/4)² − (2/4)²] = 0.4914

(f) Car Type is the best attribute to split on because it has the lowest Gini index (see the sketch below).

(g) A tree split on Customer ID cannot classify any new record, since every ID value is unique to one training record; that is, the model overfits.
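A minimal Python sketch (not part of the original homework) that reproduces the weighted Gini values above; each partition is the (+, −) class-count pair for one attribute value:

```python
def gini(counts):
    """Gini = 1 - sum_i p(c_i)^2 for one node's class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(partitions):
    """Weighted Gini of a split; partitions = [[n_class1, n_class2], ...]."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

print(weighted_gini([[1, 3], [8, 0], [1, 7]]))  # Car Type -> 0.1625
print(weighted_gini([[6, 4], [4, 6]]))          # Gender   -> 0.48
```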


4.3 Entropy and Information Gain

Entropy = −∑_{i=1}^{n} p(c_i) log₂ p(c_i)

∆ = Entropy(parent) − ∑_{j=1}^{k} (N(v_j)/N) Entropy(v_j)

(a) Overall: (4/9) log₂(9/4) + (5/9) log₂(9/5) = 0.9911

(b) ∆(a1): 0.9911 − (4/9)[(3/4) log₂(4/3) + (1/4) log₂(4/1)] − (5/9)[(1/5) log₂(5/1) + (4/5) log₂(5/4)] = 0.2294

∆(a2): 0.9911 − (5/9)[(2/5) log₂(5/2) + (3/5) log₂(5/3)] − (4/9)[(2/4) log₂(4/2) + (2/4) log₂(4/2)] = 0.0072
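A short sketch (function names illustrative) that reproduces these entropy and gain values from the class counts:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent, partitions):
    total = sum(sum(p) for p in partitions)
    children = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent) - children

print(round(entropy([4, 5]), 4))                      # 0.9911
print(round(info_gain([4, 5], [[3, 1], [1, 4]]), 4))  # Delta(a1) = 0.2294
print(round(info_gain([4, 5], [[2, 3], [2, 2]]), 4))  # Delta(a2) = 0.0072
```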


4.3 Entropy and Information Gain

(c) ∆(a3) at each candidate split point (using the convention 0 · log₂(1/0) = 0 for pure nodes):

∆(a3 = 0.5): 0

∆(a3 = 2): 0.9911 − (1/9)[(1/1) log₂(1/1) + (0/1) log₂(1/0)] − (8/9)[(3/8) log₂(8/3) + (5/8) log₂(8/5)] = 0.1427

∆(a3 = 3.5): 0.9911 − (2/9)[(1/2) log₂(2/1) + (1/2) log₂(2/1)] − (7/9)[(3/7) log₂(7/3) + (4/7) log₂(7/4)] = 0.0026

∆(a3 = 4.5): 0.9911 − (3/9)[(2/3) log₂(3/2) + (1/3) log₂(3/1)] − (6/9)[(2/6) log₂(6/2) + (4/6) log₂(6/4)] = 0.0728

∆(a3 = 5.5): 0.9911 − (5/9)[(2/5) log₂(5/2) + (3/5) log₂(5/3)] − (4/9)[(2/4) log₂(4/2) + (2/4) log₂(4/2)] = 0.0072

∆(a3 = 6.5): 0.9911 − (6/9)[(3/6) log₂(6/3) + (3/6) log₂(6/3)] − (3/9)[(1/3) log₂(3/1) + (2/3) log₂(3/2)] = 0.0183

∆(a3 = 7.5): 0.9911 − (8/9)[(4/8) log₂(8/4) + (4/8) log₂(8/4)] − (1/9)[(0/1) log₂(1/0) + (1/1) log₂(1/1)] = 0.1022

∆(a3 = 8.5): 0
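The threshold scan can be automated. The sketch below assumes a3 values and labels reconstructed from the per-split counts above (Tan, Exercise 4.3); treat that exact dataset as an assumption:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

# Reconstructed (a3, class) pairs, sorted by a3.
data = [(1.0, '+'), (3.0, '-'), (4.0, '+'), (5.0, '-'), (5.0, '-'),
        (6.0, '+'), (7.0, '-'), (7.0, '+'), (8.0, '-')]
parent = entropy([4, 5])

for t in (0.5, 2.0, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5):
    sides = [[c for v, c in data if v <= t], [c for v, c in data if v > t]]
    kids = sum(len(s) / len(data) * entropy([s.count('+'), s.count('-')])
               for s in sides if s)  # skip an empty side
    print(f"a3 split at {t}: gain = {parent - kids:.4f}")
```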


4.3 Entropy and Information Gain

(d) According to the information gain, the best split is a1, since ∆(a1) = 0.2294 exceeds the best split of a3 (∆(a3 = 2) = 0.1427).

(e) Err(a1) = 2/9 = 0.2222, Err(a2) = 4/9 = 0.4444.
According to the classification error rate, the best split is a1.

(f) Gini(a1): (4/9)[1 − (3/4)² − (1/4)²] + (5/9)[1 − (1/5)² − (4/5)²] = 0.3444
Gini(a2): (5/9)[1 − (2/5)² − (3/5)²] + (4/9)[1 − (2/4)² − (2/4)²] = 0.4889

According to the Gini index, the best split is a1. (A sketch comparing the two measures follows below.)
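A small self-contained sketch comparing the two impurity measures on the same (+, −) class counts per attribute value:

```python
def error_rate(partitions):
    """Weighted classification error; each child predicts its majority class."""
    total = sum(sum(p) for p in partitions)
    return sum(min(p) for p in partitions) / total

def gini_split(partitions):
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * (1 - sum((c / sum(p)) ** 2 for c in p))
               for p in partitions)

a1 = [[3, 1], [1, 4]]  # class counts (+, -) for a1 = T and a1 = F
a2 = [[2, 3], [2, 2]]  # class counts (+, -) for a2 = T and a2 = F
print(error_rate(a1), error_rate(a2))  # 0.2222..., 0.4444...
print(gini_split(a1), gini_split(a2))  # 0.3444..., 0.4888...
```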


4.7 Decision Tree

(a) Err(overall) = 50/(50+50) = 0.5

Err(A) = (0+25)/(25+75) = 0.25, Gain(A) = 0.5 − 0.25 = 0.25

        A = T   A = F   Total
+       25      25      50
−       0       50      50
Total   25      75      100

Err(B) = (20+20)/(50+50) = 0.4, Gain(B) = 0.5 − 0.4 = 0.1

        B = T   B = F   Total
+       30      20      50
−       20      30      50
Total   50      50      100


4.7 Decision Tree

(a) Err(C) = (25+25)/(50+50) = 0.5, Gain(C) = 0.5 − 0.5 = 0

        C = T   C = F   Total
+       25      25      50
−       25      25      50
Total   50      50      100

According to the classification error rate, attribute A would be chosen as the first splitting attribute.


4.7 Decision Tree

(b) Left child (A = T) dataset:

B   C   +    −
T   T   5    0
F   T   20   0
T   F   0    0
F   F   0    0

No need to split because all data have the same class +.

Right child (A = F) dataset:

B   C   +    −
T   T   0    20
F   T   0    5
T   F   25   0
F   F   0    25

A further split is needed.


4.7 Decision Tree

(b) Err(B) = (20+0)/(45+30) = 0.2667

        B = T   B = F   Total
+       25      0       25
−       20      30      50
Total   45      30      75

Err(C) = (0+25)/(25+50) = 0.3333

        C = T   C = F   Total
+       0       25      25
−       25      25      50
Total   25      50      75

Attribute B would be chosen as the splitting attribute for the right child node A = F.

(c) The number of misclassified instances is 0 + 20 + 0 = 20.


4.7 Decision Tree

(d) Left child (C = T) dataset:

A   B   +    −
T   T   5    0
F   T   0    20
T   F   20   0
F   F   0    5

Then Err(A) = (0+0)/(25+25) = 0

        A = T   A = F   Total
+       25      0       25
−       0       25      25
Total   25      25      50


4.7 Decision Tree

(d) Err(B) = (5+5)/(25+25) = 0.2

        B = T   B = F   Total
+       5       20      25
−       20      5       25
Total   25      25      50

Attribute A would be chosen as the splitting attribute for the left child node C = T.


4.7 Decision Tree

(d) Right child (C = F) dataset:

A   B   +    −
T   T   0    0
F   T   25   0
T   F   0    0
F   F   0    25

Then Err(A) = (0+25)/(0+50) = 0.5

        A = T   A = F   Total
+       0       25      25
−       0       25      25
Total   0       50      50


4.7 Decision Tree

(d) Err(B) = (0+0)/(25+25) = 0

        B = T   B = F   Total
+       25      0       25
−       0       25      25
Total   25      25      50

Attribute B would be chosen as the splitting attribute for the right child node C = F. The number of misclassified instances is 0 + 0 + 0 + 0 = 0.


4.7 Decision Tree

(e) When building the decision tree, we always choose the splitting attribute that minimizes the classification error at the current level. Because we cannot foresee the error reduction contributed by deeper splits, this local optimization does not always lead to a global optimum; that is the greedy nature of the decision tree induction algorithm. When we split the dataset at the first level, attribute A produces the lowest classification error, yet the final error rate is 0.2 (20 of 100 records). If we instead choose C at the first level, the resulting tree classifies all records correctly, as the sketch below verifies.
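A small Python sketch that scores both trees; the joint counts are taken from the child-node tables above, so treat the exact reconstruction as an assumption:

```python
# Joint counts (A, B, C) -> (#+, #-), read off the child-node tables.
counts = {
    ('T', 'T', 'T'): (5, 0),  ('T', 'F', 'T'): (20, 0),
    ('T', 'T', 'F'): (0, 0),  ('T', 'F', 'F'): (0, 0),
    ('F', 'T', 'T'): (0, 20), ('F', 'F', 'T'): (0, 5),
    ('F', 'T', 'F'): (25, 0), ('F', 'F', 'F'): (0, 25),
}

def errors(leaf_of):
    """Total misclassifications when each leaf predicts its majority class."""
    leaves = {}
    for (a, b, c), (pos, neg) in counts.items():
        key = leaf_of(a, b, c)
        p, n = leaves.get(key, (0, 0))
        leaves[key] = (p + pos, n + neg)
    return sum(min(p, n) for p, n in leaves.values())

# Greedy tree from (a)-(b): split on A, then on B in the A = F branch.
print(errors(lambda a, b, c: (a,) if a == 'T' else (a, b)))   # 20
# Alternative tree from (d): split on C first, then A (C = T) or B (C = F).
print(errors(lambda a, b, c: (c, a) if c == 'T' else (c, b))) # 0
```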


Questions and Discussion

Good Luck!
