
Page 1: CIS671-Knowledge Discovery and Data Mining

CIS671-Knowledge Discovery and Data Mining

Vasileios Megalooikonomou
Dept. of Computer and Information Sciences

Temple University

AI reminders

(based on notes by Christos Faloutsos)

Page 2: CIS671-Knowledge Discovery and Data Mining

Agenda
• Statistics
• AI - decision trees
  – Problem
  – Approach
  – Conclusions
• DB

Page 3: CIS671-Knowledge Discovery and Data Mining

Decision Trees
• Problem: Classification - i.e.,
  • given a training set (N tuples, with M attributes, plus a label attribute)
  • find rules to predict the label for newcomers

Pictorially:

Page 4: CIS671-Knowledge Discovery and Data Mining

Decision trees

Age    Chol-level    Gender    …    CLASS-ID
30     150           M         …    +
…      …             …         …    -
…      …             …         …    ??

Page 5: CIS671-Knowledge Discovery and Data Mining

Decision trees

• Issues:
  – missing values
  – noise
  – ‘rare’ events

Page 6: CIS671-Knowledge Discovery and Data Mining

Decision trees

• types of attributes:
  – numerical (= continuous) - e.g.: ‘salary’
  – ordinal (= integer) - e.g.: ‘# of children’
  – nominal (= categorical) - e.g.: ‘car-type’

Page 7: CIS671-Knowledge Discovery and Data Mining

Decision trees

• Pictorially, we have

[scatter plot: training points labeled ‘+’ and ‘-’, plotted on num. attr#1 (e.g., ‘age’) vs. num. attr#2 (e.g., chol-level)]

Page 8: CIS671-Knowledge Discovery and Data Mining

Decision trees

• and we want to label ‘?’

[same scatter plot of ‘+’ and ‘-’ points on num. attr#1 (e.g., ‘age’) vs. num. attr#2 (e.g., chol-level), plus an unlabeled point marked ‘?’]

Page 9: CIS671-Knowledge Discovery and Data Mining

Decision trees

• so we build a decision tree:

[same scatter plot with the ‘?’ point, now with split lines drawn at age = 50 and chol-level = 40]

Page 10: CIS671-Knowledge Discovery and Data Mining

Decision trees

• so we build a decision tree:

age < 50?
  Y:  +
  N:  chol. < 40?
        Y:  -
        N:  ...
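As a minimal sketch (one reading of the tree above; the last branch is left unfinished on the slide), the same tree written as nested if-else tests in Python:

def classify(age, chol):
    # hypothetical classifier mirroring the tree sketched above
    if age < 50:
        return '+'
    else:
        if chol < 40:
            return '-'
        else:
            raise NotImplementedError('subtree omitted on the slide (...)')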

Page 11: CIS671-Knowledge Discovery and Data Mining

Agenda
• Statistics
• AI - decision trees
  – Problem
  – Approach
  – Conclusions
• DB

Page 12: CIS671-Knowledge Discovery and Data Mining

Decision trees

• Typically, two steps:
  – tree building
  – tree pruning (to avoid over-training/over-fitting)

Page 13: CIS671-Knowledge Discovery and Data Mining

Tree building

• How?

[same scatter plot of ‘+’ and ‘-’ points on num. attr#1 (e.g., ‘age’) vs. num. attr#2 (e.g., chol-level)]

Page 14: CIS671-Knowledge Discovery and Data Mining

Tree building

• How?
• A: Partition, recursively - pseudocode (Python sketch at the end of this page):

Partition(dataset S):
    if all points in S have the same label, then return
    evaluate splits along each attribute A
    pick the best split, to divide S into S1 and S2
    Partition(S1); Partition(S2)

• Q1: how to introduce splits along attribute Ai?

• Q2: how to evaluate a split?
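A minimal Python sketch of the recursive partitioning above; the best_split helper (i.e., how Q1 and Q2 are answered) is left as a parameter, and all names are illustrative:

def partition(S, best_split):
    """Recursively partition S, a list of (attribute_vector, label) pairs.

    best_split(S) is assumed to return (attribute index, threshold) for the
    split that best separates the labels - i.e., the answers to Q1 and Q2.
    """
    labels = {label for _, label in S}
    if len(labels) <= 1:                          # all points share one label: make a leaf
        return {'label': labels.pop() if labels else None}
    attr, threshold = best_split(S)               # evaluate splits, pick the best one
    S1 = [(x, y) for x, y in S if x[attr] < threshold]
    S2 = [(x, y) for x, y in S if x[attr] >= threshold]
    return {'attr': attr, 'threshold': threshold,
            'left': partition(S1, best_split),    # recurse on each half
            'right': partition(S2, best_split)}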

Page 15: CIS671-Knowledge Discovery and Data Mining

Tree building

• Q1: how to introduce splits along attribute Ai?
• A1:
  – for numerical attributes:
    • binary split (sketch below), or
    • multiple split
  – for categorical attributes:
    • compute all subsets (expensive!), or
    • use a greedy approach
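For the numerical case, one common way to enumerate binary-split candidates (an illustrative choice; the slide does not prescribe one) is to try the midpoints between consecutive sorted values:

def candidate_binary_splits(values):
    # candidate thresholds: midpoints between consecutive distinct values
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

# e.g. ages [30, 45, 52, 61] -> candidate thresholds [37.5, 48.5, 56.5]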

Page 16: CIS671-Knowledge Discovery and Data Mining

Tree building

• Q1: how to introduce splits along attribute Ai?

• Q2: how to evaluate a split?

Page 17: CIS671-Knowledge Discovery and Data Mining

Tree building

• Q1: how to introduce splits along attribute Ai?

• Q2: how to evaluate a split?

• A: by how close each subset is to being uniform (all one class) - i.e., we need a measure of uniformity:

Page 18: CIS671-Knowledge Discovery and Data Mining

Tree building

entropy: H(p+, p-) = - p+ log2(p+) - p- log2(p-)

[plot: H(p+, p-) as a function of p+, rising from 0 at p+ = 0 to 1 at p+ = 0.5 and back to 0 at p+ = 1]

Any other measure?

p+: relative frequency of class + in S

Page 19: CIS671-Knowledge Discovery and Data Mining

Tree building

entropy: H(p+, p-)

‘gini’ index: 1 - (p+)² - (p-)²

[plots: entropy and the gini index as functions of p+; both are 0 for a pure set (p+ = 0 or 1) and maximal at p+ = 0.5]

p+: relative frequency of class + in S

Page 20: CIS671-Knowledge Discovery and Data Mining

Tree building

entropy: H(p+, p-)        ‘gini’ index: 1 - (p+)² - (p-)²

(How about multiple labels?)
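Both measures extend to multiple labels by summing over all classes; a minimal sketch of the standard formulas (not spelled out on the slides):

from collections import Counter
from math import log2

def entropy(labels):
    # H = -sum over classes of p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # gini = 1 - sum over classes of p_i^2; for two classes: 1 - (p+)^2 - (p-)^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# both are 0 for a pure subset and maximal for an even mix, e.g.
# entropy(['+'] * 5 + ['-'] * 5) == 1.0 and gini(['+'] * 5 + ['-'] * 5) == 0.5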

Page 21: CIS671-Knowledge Discovery and Data Mining

Tree building

Intuition:

• entropy: #bits to encode the class label

• gini: classification error, if we randomly guess ‘+’ with prob. p+

Page 22: CIS671-Knowledge Discovery and Data Mining

Tree building

Thus, we choose the split that reduces the entropy/classification error the most. E.g.:

[same scatter plot of the 7 ‘+’ and 6 ‘-’ training points on num. attr#1 (e.g., ‘age’) vs. num. attr#2 (e.g., chol-level); the example split is evaluated on the next page]

Page 23: CIS671-Knowledge Discovery and Data Mining

Tree building

• Before the split we need
  (n+ + n-) * H(p+, p-) = (7+6) * H(7/13, 6/13)
  bits in total, to encode all the class labels.

• After the split we need:
  0 bits for the first half and
  (2+6) * H(2/8, 6/8) bits for the second half.
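Plugging in the numbers from this page (a quick sanity check, not part of the original slides):

from math import log2

def H(p, q):
    # binary entropy in bits; 0 by convention when one class is absent
    return -(p * log2(p) + q * log2(q)) if 0 < p < 1 else 0.0

before = (7 + 6) * H(7 / 13, 6 / 13)    # ~12.9 bits to encode all 13 labels
after = 0 + (2 + 6) * H(2 / 8, 6 / 8)   # ~6.5 bits; the pure half costs 0 bits
# so this split saves roughly 6.4 bits of class-label description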

Page 24: CIS671-Knowledge Discovery and Data Mining

Agenda
• Statistics
• AI - decision trees
  – Problem
  – Approach
    • tree building
    • tree pruning
  – Conclusions
• DB

Page 25: CIS671-Knowledge Discovery and Data Mining

Tree pruning
• What for?

[same scatter plot of ‘+’ and ‘-’ points on num. attr#1 (e.g., ‘age’) vs. num. attr#2 (e.g., chol-level)]

...

Page 26: CIS671-Knowledge Discovery and Data Mining

Tree pruning
• Q: How to do it?

[same scatter plot of ‘+’ and ‘-’ points on num. attr#1 (e.g., ‘age’) vs. num. attr#2 (e.g., chol-level)]

...

Page 27: CIS671-Knowledge Discovery and Data Mining

Tree pruning
• Q: How to do it?
• A1: use a ‘training’ and a ‘testing’ set - prune nodes whose removal improves classification on the ‘testing’ set. (Drawbacks?)

Page 28: CIS671-Knowledge Discovery and Data Mining

Tree pruning
• Q: How to do it?
• A1: use a ‘training’ and a ‘testing’ set - prune nodes whose removal improves classification on the ‘testing’ set. (Drawbacks?)
• A2: or, rely on MDL (= Minimum Description Length) - in detail:

Page 29: CIS671-Knowledge Discovery and Data Mining

Tree pruning
• envision the problem as compression (of what?)

Page 30: CIS671-Knowledge Discovery and Data Mining

Tree pruning
• envision the problem as compression (of what?)
• and try to minimize the # of bits to compress
  (a) the class labels AND
  (b) the representation of the decision tree

Page 31: CIS671-Knowledge Discovery and Data Mining

(MDL)
• a brilliant idea - e.g.: the best n-degree polynomial to compress these points:
• the one that minimizes (sum of errors + n)
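A toy sketch of that idea, assuming numpy and absolute errors as the error term (the slide only says "sum of errors + n"):

import numpy as np

def best_degree_mdl(x, y, max_degree=8):
    # pick the polynomial degree n that minimizes (sum of errors + n)
    best_n, best_cost = 0, float('inf')
    for n in range(max_degree + 1):
        coeffs = np.polyfit(x, y, n)                 # fit an n-degree polynomial
        errors = np.abs(np.polyval(coeffs, x) - y)   # data cost: how badly it fits
        cost = errors.sum() + n                      # plus model cost: the degree itself
        if cost < best_cost:
            best_n, best_cost = n, cost
    return best_n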

Page 32: CIS671-Knowledge Discovery and Data Mining

Conclusions
• Classification through trees
• Building phase - splitting policies
• Pruning phase (to avoid over-fitting)
• Observation: classification is subtly related to compression

Page 33: CIS671-Knowledge Discovery and Data Mining

Reference

• M. Mehta, R. Agrawal and J. Rissanen, ‘SLIQ: A Fast Scalable Classifier for Data Mining’, Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, March 1996.