CS 2750 Machine Learning
Lecture 11

Multi-way classification. Decision trees.

Milos Hauskrecht
[email protected]
Sennott Square
Multi-way classification
• Binary classification: $Y = \{0, 1\}$
• Multi-way classification:
  – K classes: $Y = \{0, 1, \ldots, K-1\}$
  – Goal: learn to classify K classes correctly
  – Or: learn $f: X \to \{0, 1, \ldots, K-1\}$
• Errors:
  – Zero-one (misclassification) error for an example:
    $Error_1(\mathbf{x}_i, y_i) = \begin{cases} 1 & f(\mathbf{x}_i, \mathbf{w}) \neq y_i \\ 0 & f(\mathbf{x}_i, \mathbf{w}) = y_i \end{cases}$
  – Mean misclassification error (for a dataset):
    $\dfrac{1}{n} \sum_{i=1}^{n} Error_1(\mathbf{x}_i, y_i)$
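A minimal sketch (not part of the original slides) of these two error measures in Python; the function names are my own, and predictions are assumed to be already-computed class labels:

```python
import numpy as np

def zero_one_error(y_pred, y_true):
    """Zero-one (misclassification) error for a single example."""
    return 0 if y_pred == y_true else 1

def mean_misclassification_error(y_pred, y_true):
    """Mean misclassification error over a dataset."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean(y_pred != y_true)

# Example: 2 mistakes out of 5 predictions -> error 0.4
print(mean_misclassification_error([0, 1, 2, 2, 1], [0, 1, 1, 2, 0]))
```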
Multi-way classification
Approaches:
• Generative model approach
  – Generative model of the distribution $p(\mathbf{x}, y)$
  – Learns the parameters of the model through density estimation techniques
  – Discriminant functions are based on the model
  – “Indirect” learning of a classifier
• Discriminative approach
  – Parametric discriminant functions
  – Learns the discriminant functions directly
  – Example: a logistic regression model
Generative model approach
Indirect:
1. Represent and learn the distribution $p(\mathbf{x}, y)$
2. Define and use probabilistic discriminant functions
   $g_i(\mathbf{x}) = \log p(y = i \mid \mathbf{x})$
Model: $p(\mathbf{x}, y) = p(\mathbf{x} \mid y)\, p(y)$
• $p(\mathbf{x} \mid y = i)$, $\forall i,\ 0 \le i \le K-1$ — the class-conditional distributions (densities); k class-conditional distributions in total
• $p(y)$ — priors on classes; $p(y = i)$ is the probability of class i, with $\sum_{i=0}^{K-1} p(y = i) = 1$
Multi-way classification. Example
[Figure: example data set with three classes (0, 1, 2) plotted in a two-dimensional input space]
Making class decision
Discriminant functions can be based on:
• Likelihood of data – choose the class (e.g., a Gaussian) that explains the input data x better:
  $i = \arg\max_{i=0,\ldots,k-1} p(\mathbf{x} \mid \boldsymbol{\theta}_i)$
  Choice for Gaussians: $p(\mathbf{x} \mid \boldsymbol{\theta}_i) \approx p(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma_i)$
• Posterior of a class – choose the class with the higher posterior probability (see the sketch below):
  $p(y = i \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid \Theta_i)\, p(y = i)}{\sum_{j=0}^{k-1} p(\mathbf{x} \mid \Theta_j)\, p(y = j)}$
  Choice: $i = \arg\max_{i=0,\ldots,k-1} p(y = i \mid \mathbf{x}, \boldsymbol{\theta}_i)$
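A minimal sketch (not from the slides) of this generative decision rule with Gaussian class-conditional densities; the maximum-likelihood fitting and all function names are my own implementation assumptions:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def fit_gaussian_classes(X, y, k):
    """ML estimates of class priors and class-conditional Gaussians (one per class)."""
    params = []
    for i in range(k):
        Xi = X[y == i]
        params.append((len(Xi) / len(X),            # prior p(y=i)
                       Xi.mean(axis=0),             # mean
                       np.cov(Xi, rowvar=False)))   # covariance (assumes d >= 2)
    return params

def classify(x, params):
    """Choose the class with the highest posterior p(y=i | x)."""
    joint = [prior * gaussian_pdf(x, mu, S) for prior, mu, S in params]
    return int(np.argmax(joint))  # the denominator p(x) is shared, so argmax of the joint suffices
```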
Discriminative approach
• Parametric model of discriminant functions
• Learns the discriminant functions directly
How to learn to classify multiple classes, say 0, 1, 2?
Approach 1:
– A binary logistic regression on every class versus the rest (see the sketch below):
  0 vs. (1 or 2)
  1 vs. (0 or 2)
  2 vs. (0 or 1)
[Figure: three logistic-regression units, each connected to the inputs $1, x_1, \ldots, x_d$]
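A minimal sketch of Approach 1, assuming batch gradient ascent on the binary logistic-regression log-likelihood; resolving the ambiguous regions by picking the most confident classifier is my own assumption, since the slides only point out the ambiguity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, t, lr=0.1, epochs=200):
    """Batch gradient ascent on the binary logistic-regression log-likelihood."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend the bias input "1"
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        w += lr * Xb.T @ (t - sigmoid(Xb @ w))
    return w

def one_vs_rest(X, y, k):
    """Approach 1: one binary classifier per class (class i vs. the rest)."""
    return [train_logreg(X, (y == i).astype(float)) for i in range(k)]

def predict_ovr(x, ws):
    """Pick the class whose 'i vs. rest' classifier is most confident."""
    xb = np.concatenate([[1.0], x])
    return int(np.argmax([sigmoid(w @ xb) for w in ws]))
```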
Multi-way classification. Approach 1.
[Figure: the three one-vs-rest boundaries — 0 vs {1,2}, 1 vs {0,2}, 2 vs {0,1} — drawn over the three-class data]
Multi-way classification. Approach 1.
[Figure: the same one-vs-rest boundaries; the plot highlights a “region of nobody”, claimed by no classifier, and an ambiguous region, claimed by more than one]
Multi-way classification. Approach 1.
[Figure: decision regions obtained by combining the three one-vs-rest classifiers — 0 vs {1,2}, 1 vs {0,2}, 2 vs {0,1}]
Discriminative approach
How to learn to classify multiple classes, say 0, 1, 2?
Approach 2:
– A binary logistic regression on all pairs (see the sketch below):
  0 vs. 1
  0 vs. 2
  1 vs. 2
[Figure: three pairwise logistic-regression units, each connected to the inputs $1, x_1, \ldots, x_d$]
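A minimal sketch of Approach 2, reusing `train_logreg` and `sigmoid` from the one-vs-rest sketch above; the majority-vote prediction rule is my own assumption about how the pairwise decisions are combined:

```python
import numpy as np
from itertools import combinations

def all_pairs(X, y, k):
    """Approach 2: one binary logistic regression per pair of classes (i vs. j)."""
    models = {}
    for i, j in combinations(range(k), 2):
        mask = (y == i) | (y == j)
        # relabel: class i -> 1, class j -> 0
        models[(i, j)] = train_logreg(X[mask], (y[mask] == i).astype(float))
    return models

def predict_pairwise(x, models, k):
    """Majority vote over all pairwise classifiers."""
    xb = np.concatenate([[1.0], x])
    votes = np.zeros(k, dtype=int)
    for (i, j), w in models.items():
        votes[i if sigmoid(w @ xb) >= 0.5 else j] += 1
    return int(np.argmax(votes))
```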
Multi-way classification. Approach 2
[Figure: the three pairwise boundaries — 0 vs 1, 0 vs 2, 1 vs 2 — drawn over the three-class data]
Multi-way classification. Approach 2
[Figure: the same pairwise boundaries; an ambiguous region remains where the pairwise votes do not single out one class]
Multi-way classification. Approach 2
[Figure: decision regions obtained by combining the three pairwise classifiers — 0 vs 1, 0 vs 2, 1 vs 2]
Multi-way classification with softmax
$\mu_i = p(y = i \mid \mathbf{x}) = \dfrac{\exp(\mathbf{w}_i^T \mathbf{x})}{\sum_j \exp(\mathbf{w}_j^T \mathbf{x})}$, with $\sum_i \mu_i = 1$
• A solution to the problem of having an ambiguous region
[Figure: softmax network — the inputs $1, x_1, \ldots, x_d$ feed linear units $z_0, z_1, z_2$, whose outputs pass through the softmax to give $\mu_0, \mu_1, \mu_2$]
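A minimal sketch of the softmax model's forward computation; the numerically stable max-shift is a standard implementation detail, not something stated on the slides:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: mu_i = exp(z_i) / sum_j exp(z_j)."""
    z = z - np.max(z)     # shifting z does not change the result, but avoids overflow
    e = np.exp(z)
    return e / e.sum()

def softmax_posteriors(x, W):
    """mu_i = p(y=i | x) for a weight matrix W with one row w_i per class."""
    xb = np.concatenate([[1.0], x])     # bias input "1"
    return softmax(W @ xb)
```

Because the outputs always sum to 1, every input x receives a unique most-probable class, which is why the ambiguous region disappears.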
Multi-way classification with softmax
[Figure: decision regions produced by the softmax model on the three-class data; the regions cover the whole input space, with no ambiguous region]
Learning of the softmax model
• Learning of parameters w: statistical view (analogous to the coin-toss model in the binary case)
• Assume outputs $y \in \{0, 1, \ldots, k-1\}$ are transformed into one-of-k (one-hot) vectors:
  $y = 0 \to (1, 0, \ldots, 0),\quad y = 1 \to (0, 1, \ldots, 0),\quad \ldots,\quad y = k-1 \to (0, 0, \ldots, 1)$
• The softmax network models the class posteriors:
  $\mu_0 = P(y = 0 \mid \mathbf{x}),\ \ldots,\ \mu_{k-1} = P(y = k-1 \mid \mathbf{x})$
Learning of the softmax model
• Learning of the parameters w: statistical view
• Likelihood of outputs:
  $L(D, \mathbf{w}) = p(Y \mid X, \mathbf{w}) = \prod_{i=1,\ldots,n} p(y_i \mid \mathbf{x}_i, \mathbf{w})$
• We want parameters w that maximize the likelihood
• Log-likelihood trick – optimize the log-likelihood of outputs instead:
  $l(D, \mathbf{w}) = \log \prod_{i=1,\ldots,n} p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \sum_{i=1,\ldots,n} \log p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \sum_{i=1,\ldots,n} \log \prod_{q=0}^{k-1} \mu_{i,q}^{\,y_{i,q}} = \sum_{i=1,\ldots,n} \sum_{q=0}^{k-1} y_{i,q} \log \mu_{i,q}$
• Objective to optimize:
  $J(D, \mathbf{w}) = -\sum_{i=1}^{n} \sum_{q=0}^{k-1} y_{i,q} \log \mu_{i,q}$
Learning of the softmax model
• Error to optimize:
  $J(D, \mathbf{w}) = -\sum_{i=1}^{n} \sum_{q=0}^{k-1} y_{i,q} \log \mu_{i,q}$
• Gradient:
  $\dfrac{\partial}{\partial w_{q,j}} J(D, \mathbf{w}) = -\sum_{i=1}^{n} x_{i,j}\,(y_{i,q} - \mu_{i,q})$
• The same very easy gradient update as used for binary logistic regression:
  $\mathbf{w}_q \leftarrow \mathbf{w}_q + \alpha \sum_{i=1}^{n} (y_{i,q} - \mu_{i,q})\,\mathbf{x}_i$
• But now we have to update the weights of k networks (a sketch follows below)
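A minimal sketch of this learning rule, updating all k weight vectors in one matrix step; the learning rate and epoch count are illustrative assumptions:

```python
import numpy as np

def train_softmax(X, y, k, lr=0.05, epochs=200):
    """Batch gradient descent on J(D,w) = -sum_i sum_q y_iq log mu_iq."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])       # bias input "1"
    Y = np.eye(k)[y]                           # one-of-k (one-hot) targets
    W = np.zeros((k, d + 1))                   # one weight row w_q per class
    for _ in range(epochs):
        Z = Xb @ W.T
        Z -= Z.max(axis=1, keepdims=True)      # numerical stability
        Mu = np.exp(Z)
        Mu /= Mu.sum(axis=1, keepdims=True)    # mu_iq = p(y_i = q | x_i)
        # w_q <- w_q + alpha * sum_i (y_iq - mu_iq) x_i, for all k networks at once
        W += lr * (Y - Mu).T @ Xb
    return W
```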
Multi-way classification
• Yet another approach (Approach 3): arrange pairwise classifiers in a decision DAG
[Figure: a DAG of pairwise tests — e.g. a “2 vs 1” node whose “Not 2” / “Not 1” branches lead to “2 vs 3” and “3 vs 1” nodes; following the “Not 1”, “Not 2”, “Not 3” branches eliminates classes until a single leaf class (1, 2, or 3) remains]
Decision trees
• An alternative approach to classification:
  – Partition the input space into regions
  – Regress or classify independently in every region
[Figure: a two-dimensional input space with axes $x_1$ and $x_2$, partitioned into rectangular regions, each containing mostly 0-labelled or mostly 1-labelled points]
Decision trees
• The partitioning idea is used in the decision tree model:
  – Split the space recursively according to the inputs in x
  – Regress or classify at the bottom of the tree
Example: binary classification into $\{0, 1\}$ with binary attributes $x_1, x_2, x_3$
[Tree: the root tests $x_3 = 0$; its t branch tests $x_1 = 0$ and its f branch tests $x_2 = 0$; the four leaves classify as 1, 0, 0, 1]
Decision trees
How to construct the decision tree?
• Top-down algorithm:
  – Find the best split condition (quantified based on the impurity measure)
  – Stop when no improvement is possible
• Impurity measure:
  – Measures how well the two classes are separated
  – Ideally we would like to separate all 0s and 1s
• Splits on finite-valued vs. continuous-valued attributes; a condition on a continuous-valued attribute: $x_3 \le 0.5$
Impurity measure
Let $p_i = \dfrac{|D_i|}{|D|}$ — the ratio of instances classified as i, where
$|D|$ is the total number of data entries and
$|D_i|$ is the number of data entries classified as i.
• The impurity measure defines how well the classes are separated
• In general, the impurity measure should satisfy:
  – Largest when the data are split evenly among the classes: $p_i = \dfrac{1}{\text{number of classes}}$
  – Should be 0 when all data belong to the same class
Impurity measures
• There are various impurity measures used in the literature:
  – Entropy-based measure (Quinlan, C4.5):
    $I(D) = Entropy(D) = -\sum_{i=1}^{k} p_i \log p_i$
  – Gini measure (Breiman, CART):
    $I(D) = Gini(D) = 1 - \sum_{i=1}^{k} p_i^2$
[Figure: entropy and Gini impurity plotted as functions of $p_1$ for $k = 2$; both peak at an even split and vanish at $p_1 = 0$ or $p_1 = 1$]
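A minimal sketch of the two measures; using base-2 logarithms for the entropy is my own choice, since the slides leave the base unspecified:

```python
import numpy as np

def class_proportions(labels, k):
    """p_i = |D_i| / |D| for each class i (labels are integers 0..k-1)."""
    counts = np.bincount(labels, minlength=k)
    return counts / counts.sum()

def entropy(labels, k):
    """I(D) = -sum_i p_i log p_i, with 0 log 0 taken as 0."""
    p = class_proportions(labels, k)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(labels, k):
    """I(D) = 1 - sum_i p_i^2."""
    p = class_proportions(labels, k)
    return 1.0 - np.sum(p ** 2)

# Both are largest for an even split and 0 for a pure node:
print(entropy(np.array([0, 0, 1, 1]), 2))   # 1.0
print(gini(np.array([0, 0, 0, 0]), 2))      # 0.0
```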
Impurity measures
• Gain due to a split – the expected reduction in the impurity measure (entropy example):
  $Gain(D, A) = Entropy(D) - \sum_{v \in Values(A)} \dfrac{|D_v|}{|D|}\, Entropy(D_v)$
  where $D_v$ is the partition of D in which attribute A has the value v
[Figure: a tree with root test $x_3 = 0$ and entropy $Entropy(D)$, compared with the same tree after adding a test $x_1 = 0$ on the t branch, whose partitions have entropies $Entropy(D_t)$ and $Entropy(D_f)$]
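A minimal sketch of the gain computation, reusing `entropy` from the impurity sketch above; the function name and signature are my own:

```python
import numpy as np

def gain(labels, attribute_values, k):
    """Gain(D, A) = Entropy(D) - sum_v |D_v|/|D| * Entropy(D_v)."""
    total = entropy(labels, k)
    n = len(labels)
    expected = 0.0
    for v in np.unique(attribute_values):
        Dv = labels[attribute_values == v]   # the partition with attribute A = v
        expected += len(Dv) / n * entropy(Dv, k)
    return total - expected
```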
Decision tree learning
• Greedy learning algorithm (a sketch follows below):
  Repeat until there is no (or only a small) improvement in the purity:
  – Find the attribute with the highest gain
  – Add the attribute to the tree and split the set accordingly
• Builds the tree in a top-down fashion
  – Gradually expands the leaves of the partially built tree
• The method is greedy:
  – It looks at a single attribute and its gain in each step
  – May fail when a combination of attributes is needed to improve the purity (e.g., parity functions)
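A minimal sketch of this greedy algorithm for binary attributes, reusing `gain` (and `entropy`) from the sketches above; the tree representation, stopping threshold, and function names are my own assumptions:

```python
import numpy as np

def build_tree(X, y, attrs, k, min_gain=1e-6):
    """Greedy top-down tree construction over binary attributes.

    Returns a leaf (majority class) when the node is pure or when no
    attribute improves the purity by more than min_gain; otherwise
    splits on the highest-gain attribute and recurses."""
    majority = int(np.bincount(y, minlength=k).argmax())
    if len(attrs) == 0 or len(np.unique(y)) == 1:
        return majority
    gains = [gain(y, X[:, a], k) for a in attrs]
    best = int(np.argmax(gains))
    if gains[best] < min_gain:
        return majority
    a = attrs[best]
    rest = [b for b in attrs if b != a]
    children = {}
    for v in (0, 1):
        mask = X[:, a] == v
        children[v] = build_tree(X[mask], y[mask], rest, k) if mask.any() else majority
    return (a, children)

def tree_classify(x, node):
    """Walk the tree from the root to a leaf and return the leaf's class."""
    while isinstance(node, tuple):
        a, children = node
        node = children[int(x[a])]
    return node
```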
Decision tree learning
• Limitations of greedy methods:
  Cases in which a combination of two or more attributes improves the impurity
[Figure: a parity-like (XOR) layout of 0s and 1s in two dimensions — no split on a single attribute reduces the impurity, although a combination of two splits separates the classes]
Decision tree learning
By reducing the impurity measure we can grow very large trees.
Problem: overfitting
• We may split and classify the training set very well, but we may do worse in terms of the generalization error
Solutions to the overfitting problem:
• Solution 1:
  – Prune branches of the tree built in the first phase
  – Use a validation set to test for the overfit
• Solution 2:
  – Test for the overfit during the tree-building phase
  – Stop building the tree when the performance on the validation set deteriorates
K-Nearest-Neighbours for Classification
• Given a data set with $N_k$ data points from class $C_k$ and $\sum_k N_k = N$ points in total, draw a sphere around the query point x containing exactly K points; if the sphere has volume V and contains $K_k$ points from class $C_k$, we have
  $p(\mathbf{x} \mid C_k) = \dfrac{K_k}{N_k V}$
• and correspondingly $p(\mathbf{x}) = \dfrac{K}{N V}$ and $p(C_k) = \dfrac{N_k}{N}$
• Since the posterior combines these, Bayes’ theorem gives
  $p(C_k \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})} = \dfrac{K_k}{K}$
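A minimal sketch of the resulting classifier (Euclidean distance and the function name are my own assumptions; ties between classes are not handled):

```python
import numpy as np

def knn_classify(x, X, y, K, k_classes):
    """p(C_k | x) = K_k / K: the fraction of the K nearest neighbours in class C_k."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = y[np.argsort(dists)[:K]]
    posterior = np.bincount(nearest, minlength=k_classes) / K
    return int(np.argmax(posterior)), posterior
```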
K-Nearest-Neighbours for Classification
[Figure: K-nearest-neighbour decision boundaries on the same data for K = 1 and K = 3]
Nonparametric kernel-based classification
• Kernel function $k(\mathbf{x}, \mathbf{x}')$:
  – Models the similarity between x and x'
  – Example: the Gaussian kernel we used in kernel density estimation,
    $k(\mathbf{x}, \mathbf{x}') = \dfrac{1}{(2\pi h^2)^{D/2}} \exp\left(-\dfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2h^2}\right)$
    with the density estimate $p(\mathbf{x}) = \dfrac{1}{N} \sum_{i=1}^{N} k(\mathbf{x}, \mathbf{x}_i)$
• Kernel for classification:
  $p(y = C_k \mid \mathbf{x}) = \dfrac{\sum_{\mathbf{x}' : y' = C_k} k(\mathbf{x}, \mathbf{x}')}{\sum_{\mathbf{x}'} k(\mathbf{x}, \mathbf{x}')}$
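A minimal sketch of this kernel-based classifier with the Gaussian kernel; the bandwidth default and function names are my own assumptions:

```python
import numpy as np

def gaussian_kernel(x, xp, h):
    """k(x, x') = (2 pi h^2)^(-D/2) exp(-||x - x'||^2 / (2 h^2))."""
    D = len(x)
    return np.exp(-np.sum((x - xp) ** 2) / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (D / 2)

def kernel_posterior(x, X, y, k_classes, h=0.5):
    """p(y = C_k | x): kernel weight of class-k training points over the total kernel weight."""
    w = np.array([gaussian_kernel(x, xi, h) for xi in X])
    return np.array([w[y == c].sum() for c in range(k_classes)]) / w.sum()
```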