CS 2750 Machine Learning
Lecture 11

Multi-way classification. Decision trees.

Milos Hauskrecht
[email protected]
Sennott Square
Multi-way classification
• Binary classification: $Y = \{0, 1\}$
• Multi-way classification:
  – K classes: $Y = \{0, 1, \ldots, K-1\}$
  – Goal: learn to classify K classes correctly
  – Or: learn $f: X \to \{0, 1, \ldots, K-1\}$
• Errors:
  – Zero-one (misclassification) error for an example:
    $Error_1(\mathbf{x}_i, y_i) = \begin{cases} 1 & f(\mathbf{x}_i, \mathbf{w}) \neq y_i \\ 0 & f(\mathbf{x}_i, \mathbf{w}) = y_i \end{cases}$
  – Mean misclassification error (for a dataset):
    $\dfrac{1}{n} \sum_{i=1}^{n} Error_1(\mathbf{x}_i, y_i)$
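A minimal sketch (not part of the original slides) of these two error measures in Python; the function names are my own, and predictions are assumed to be already-computed class labels:

```python
import numpy as np

def zero_one_error(y_pred, y_true):
    """Zero-one (misclassification) error for a single example."""
    return 0 if y_pred == y_true else 1

def mean_misclassification_error(y_pred, y_true):
    """Mean misclassification error over a dataset."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean(y_pred != y_true)

# Example: 2 mistakes out of 5 predictions -> error 0.4
print(mean_misclassification_error([0, 1, 2, 2, 1], [0, 1, 1, 2, 0]))
```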
Multi-way classification
Approaches:
• Generative model approach
  – Generative model of the distribution $p(\mathbf{x}, y)$
  – Learns the parameters of the model through density estimation techniques
  – Discriminant functions are based on the model
  – “Indirect” learning of a classifier
• Discriminative approach
  – Parametric discriminant functions
  – Learns the discriminant functions directly
  – Example: a logistic regression model
Generative model approach
Indirect:
1. Represent and learn the distribution $p(\mathbf{x}, y)$
2. Define and use probabilistic discriminant functions
   $g_i(\mathbf{x}) = \log p(y = i \mid \mathbf{x})$
Model: $p(\mathbf{x}, y) = p(\mathbf{x} \mid y)\, p(y)$
• $p(\mathbf{x} \mid y = i)$, $\forall i,\ 0 \le i \le K-1$ — the class-conditional distributions (densities); k class-conditional distributions in total
• $p(y)$ — priors on classes; $p(y = i)$ is the probability of class i, with $\sum_{i=0}^{K-1} p(y = i) = 1$
Multi-way classification. Example
[Figure: example data set with three classes (0, 1, 2) plotted in a two-dimensional input space]
Making class decision
Discriminant functions can be based on:
• Likelihood of data – choose the class (e.g., a Gaussian) that explains the input data x better:
  $i = \arg\max_{i=0,\ldots,k-1} p(\mathbf{x} \mid \boldsymbol{\theta}_i)$
  Choice for Gaussians: $p(\mathbf{x} \mid \boldsymbol{\theta}_i) \approx p(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma_i)$
• Posterior of a class – choose the class with the higher posterior probability (see the sketch below):
  $p(y = i \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid \Theta_i)\, p(y = i)}{\sum_{j=0}^{k-1} p(\mathbf{x} \mid \Theta_j)\, p(y = j)}$
  Choice: $i = \arg\max_{i=0,\ldots,k-1} p(y = i \mid \mathbf{x}, \boldsymbol{\theta}_i)$
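A minimal sketch (not from the slides) of this generative decision rule with Gaussian class-conditional densities; the maximum-likelihood fitting and all function names are my own implementation assumptions:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def fit_gaussian_classes(X, y, k):
    """ML estimates of class priors and class-conditional Gaussians (one per class)."""
    params = []
    for i in range(k):
        Xi = X[y == i]
        params.append((len(Xi) / len(X),            # prior p(y=i)
                       Xi.mean(axis=0),             # mean
                       np.cov(Xi, rowvar=False)))   # covariance (assumes d >= 2)
    return params

def classify(x, params):
    """Choose the class with the highest posterior p(y=i | x)."""
    joint = [prior * gaussian_pdf(x, mu, S) for prior, mu, S in params]
    return int(np.argmax(joint))  # the denominator p(x) is shared, so argmax of the joint suffices
```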
Discriminative approach
• Parametric model of discriminant functions
• Learns the discriminant functions directly
How to learn to classify multiple classes, say 0, 1, 2?
Approach 1:
– A binary logistic regression on every class versus the rest (see the sketch below):
  0 vs. (1 or 2)
  1 vs. (0 or 2)
  2 vs. (0 or 1)
[Figure: three logistic-regression units, each connected to the inputs $1, x_1, \ldots, x_d$]
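A minimal sketch of Approach 1, assuming batch gradient ascent on the binary logistic-regression log-likelihood; resolving the ambiguous regions by picking the most confident classifier is my own assumption, since the slides only point out the ambiguity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, t, lr=0.1, epochs=200):
    """Batch gradient ascent on the binary logistic-regression log-likelihood."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend the bias input "1"
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        w += lr * Xb.T @ (t - sigmoid(Xb @ w))
    return w

def one_vs_rest(X, y, k):
    """Approach 1: one binary classifier per class (class i vs. the rest)."""
    return [train_logreg(X, (y == i).astype(float)) for i in range(k)]

def predict_ovr(x, ws):
    """Pick the class whose 'i vs. rest' classifier is most confident."""
    xb = np.concatenate([[1.0], x])
    return int(np.argmax([sigmoid(w @ xb) for w in ws]))
```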
Multi-way classification. Approach 1.
[Figure: the three one-vs-rest boundaries — 0 vs {1,2}, 1 vs {0,2}, 2 vs {0,1} — drawn over the three-class data]
Multi-way classification. Approach 1.
[Figure: the same one-vs-rest boundaries; the plot highlights a “region of nobody”, claimed by no classifier, and an ambiguous region, claimed by more than one]
Multi-way classification. Approach 1.
[Figure: decision regions obtained by combining the three one-vs-rest classifiers — 0 vs {1,2}, 1 vs {0,2}, 2 vs {0,1}]
Discriminative approach
How to learn to classify multiple classes, say 0, 1, 2?
Approach 2:
– A binary logistic regression on all pairs (see the sketch below):
  0 vs. 1
  0 vs. 2
  1 vs. 2
[Figure: three pairwise logistic-regression units, each connected to the inputs $1, x_1, \ldots, x_d$]
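A minimal sketch of Approach 2, reusing `train_logreg` and `sigmoid` from the one-vs-rest sketch above; the majority-vote prediction rule is my own assumption about how the pairwise decisions are combined:

```python
import numpy as np
from itertools import combinations

def all_pairs(X, y, k):
    """Approach 2: one binary logistic regression per pair of classes (i vs. j)."""
    models = {}
    for i, j in combinations(range(k), 2):
        mask = (y == i) | (y == j)
        # relabel: class i -> 1, class j -> 0
        models[(i, j)] = train_logreg(X[mask], (y[mask] == i).astype(float))
    return models

def predict_pairwise(x, models, k):
    """Majority vote over all pairwise classifiers."""
    xb = np.concatenate([[1.0], x])
    votes = np.zeros(k, dtype=int)
    for (i, j), w in models.items():
        votes[i if sigmoid(w @ xb) >= 0.5 else j] += 1
    return int(np.argmax(votes))
```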
Multi-way classification. Approach 2
[Figure: the three pairwise boundaries — 0 vs 1, 0 vs 2, 1 vs 2 — drawn over the three-class data]
Multi-way classification. Approach 2
[Figure: the same pairwise boundaries; an ambiguous region remains where the pairwise votes do not single out one class]
Multi-way classification. Approach 2
[Figure: decision regions obtained by combining the three pairwise classifiers — 0 vs 1, 0 vs 2, 1 vs 2]
Multi-way classification with softmax
$\mu_i = p(y = i \mid \mathbf{x}) = \dfrac{\exp(\mathbf{w}_i^T \mathbf{x})}{\sum_j \exp(\mathbf{w}_j^T \mathbf{x})}$, with $\sum_i \mu_i = 1$
• A solution to the problem of having an ambiguous region
[Figure: softmax network — the inputs $1, x_1, \ldots, x_d$ feed linear units $z_0, z_1, z_2$, whose outputs pass through the softmax to give $\mu_0, \mu_1, \mu_2$]
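A minimal sketch of the softmax model's forward computation; the numerically stable max-shift is a standard implementation detail, not something stated on the slides:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: mu_i = exp(z_i) / sum_j exp(z_j)."""
    z = z - np.max(z)     # shifting z does not change the result, but avoids overflow
    e = np.exp(z)
    return e / e.sum()

def softmax_posteriors(x, W):
    """mu_i = p(y=i | x) for a weight matrix W with one row w_i per class."""
    xb = np.concatenate([[1.0], x])     # bias input "1"
    return softmax(W @ xb)
```

Because the outputs always sum to 1, every input x receives a unique most-probable class, which is why the ambiguous region disappears.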
Multi-way classification with softmax
[Figure: decision regions produced by the softmax model on the three-class data; the regions cover the whole input space, with no ambiguous region]
Learning of the softmax model
• Learning of parameters w: statistical view (analogous to the coin-toss model in the binary case)
• Assume outputs $y \in \{0, 1, \ldots, k-1\}$ are transformed into one-of-k (one-hot) vectors:
  $y = 0 \to (1, 0, \ldots, 0),\quad y = 1 \to (0, 1, \ldots, 0),\quad \ldots,\quad y = k-1 \to (0, 0, \ldots, 1)$
• The softmax network models the class posteriors:
  $\mu_0 = P(y = 0 \mid \mathbf{x}),\ \ldots,\ \mu_{k-1} = P(y = k-1 \mid \mathbf{x})$
Learning of the softmax model
• Learning of the parameters w: statistical view
• Likelihood of outputs:
  $L(D, \mathbf{w}) = p(Y \mid X, \mathbf{w}) = \prod_{i=1,\ldots,n} p(y_i \mid \mathbf{x}_i, \mathbf{w})$
• We want parameters w that maximize the likelihood
• Log-likelihood trick – optimize the log-likelihood of outputs instead:
  $l(D, \mathbf{w}) = \log \prod_{i=1,\ldots,n} p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \sum_{i=1,\ldots,n} \log p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \sum_{i=1,\ldots,n} \log \prod_{q=0}^{k-1} \mu_{i,q}^{\,y_{i,q}} = \sum_{i=1,\ldots,n} \sum_{q=0}^{k-1} y_{i,q} \log \mu_{i,q}$
• Objective to optimize:
  $J(D, \mathbf{w}) = -\sum_{i=1}^{n} \sum_{q=0}^{k-1} y_{i,q} \log \mu_{i,q}$
Learning of the softmax model
• Error to optimize:
  $J(D, \mathbf{w}) = -\sum_{i=1}^{n} \sum_{q=0}^{k-1} y_{i,q} \log \mu_{i,q}$
• Gradient:
  $\dfrac{\partial}{\partial w_{q,j}} J(D, \mathbf{w}) = -\sum_{i=1}^{n} x_{i,j}\,(y_{i,q} - \mu_{i,q})$
• The same very easy gradient update as used for binary logistic regression:
  $\mathbf{w}_q \leftarrow \mathbf{w}_q + \alpha \sum_{i=1}^{n} (y_{i,q} - \mu_{i,q})\,\mathbf{x}_i$
• But now we have to update the weights of k networks (a sketch follows below)
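A minimal sketch of this learning rule, updating all k weight vectors in one matrix step; the learning rate and epoch count are illustrative assumptions:

```python
import numpy as np

def train_softmax(X, y, k, lr=0.05, epochs=200):
    """Batch gradient descent on J(D,w) = -sum_i sum_q y_iq log mu_iq."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])       # bias input "1"
    Y = np.eye(k)[y]                           # one-of-k (one-hot) targets
    W = np.zeros((k, d + 1))                   # one weight row w_q per class
    for _ in range(epochs):
        Z = Xb @ W.T
        Z -= Z.max(axis=1, keepdims=True)      # numerical stability
        Mu = np.exp(Z)
        Mu /= Mu.sum(axis=1, keepdims=True)    # mu_iq = p(y_i = q | x_i)
        # w_q <- w_q + alpha * sum_i (y_iq - mu_iq) x_i, for all k networks at once
        W += lr * (Y - Mu).T @ Xb
    return W
```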
Multi-way classification
• Yet another approach (Approach 3): arrange pairwise classifiers in a decision DAG
[Figure: a DAG of pairwise tests — e.g. a “2 vs 1” node whose “Not 2” / “Not 1” branches lead to “2 vs 3” and “3 vs 1” nodes; following the “Not 1”, “Not 2”, “Not 3” branches eliminates classes until a single leaf class (1, 2, or 3) remains]
Decision trees
• An alternative approach to classification:
  – Partition the input space into regions
  – Regress or classify independently in every region
[Figure: a two-dimensional input space with axes $x_1$ and $x_2$, partitioned into rectangular regions, each containing mostly 0-labelled or mostly 1-labelled points]
Decision trees
• The partitioning idea is used in the decision tree model:
  – Split the space recursively according to the inputs in x
  – Regress or classify at the bottom of the tree
Example: binary classification into $\{0, 1\}$ with binary attributes $x_1, x_2, x_3$
[Tree: the root tests $x_3 = 0$; its t branch tests $x_1 = 0$ and its f branch tests $x_2 = 0$; the four leaves classify as 1, 0, 0, 1]
Decision trees
How to construct the decision tree?
• Top-down algorithm:
  – Find the best split condition (quantified based on the impurity measure)
  – Stop when no improvement is possible
• Impurity measure:
  – Measures how well the two classes are separated
  – Ideally we would like to separate all 0s and 1s
• Splits on finite-valued vs. continuous-valued attributes; a condition on a continuous-valued attribute: $x_3 \le 0.5$
Impurity measure
Let $p_i = \dfrac{|D_i|}{|D|}$ — the ratio of instances classified as i, where
$|D|$ is the total number of data entries and
$|D_i|$ is the number of data entries classified as i.
• The impurity measure defines how well the classes are separated
• In general, the impurity measure should satisfy:
  – Largest when the data are split evenly among the classes: $p_i = \dfrac{1}{\text{number of classes}}$
  – Should be 0 when all data belong to the same class
Impurity measures
• There are various impurity measures used in the literature:
  – Entropy-based measure (Quinlan, C4.5):
    $I(D) = Entropy(D) = -\sum_{i=1}^{k} p_i \log p_i$
  – Gini measure (Breiman, CART):
    $I(D) = Gini(D) = 1 - \sum_{i=1}^{k} p_i^2$
[Figure: entropy and Gini impurity plotted as functions of $p_1$ for $k = 2$; both peak at an even split and vanish at $p_1 = 0$ or $p_1 = 1$]
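A minimal sketch of the two measures; using base-2 logarithms for the entropy is my own choice, since the slides leave the base unspecified:

```python
import numpy as np

def class_proportions(labels, k):
    """p_i = |D_i| / |D| for each class i (labels are integers 0..k-1)."""
    counts = np.bincount(labels, minlength=k)
    return counts / counts.sum()

def entropy(labels, k):
    """I(D) = -sum_i p_i log p_i, with 0 log 0 taken as 0."""
    p = class_proportions(labels, k)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(labels, k):
    """I(D) = 1 - sum_i p_i^2."""
    p = class_proportions(labels, k)
    return 1.0 - np.sum(p ** 2)

# Both are largest for an even split and 0 for a pure node:
print(entropy(np.array([0, 0, 1, 1]), 2))   # 1.0
print(gini(np.array([0, 0, 0, 0]), 2))      # 0.0
```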
Impurity measures
• Gain due to a split – the expected reduction in the impurity measure (entropy example):
  $Gain(D, A) = Entropy(D) - \sum_{v \in Values(A)} \dfrac{|D_v|}{|D|}\, Entropy(D_v)$
  where $D_v$ is the partition of D in which attribute A has the value v
[Figure: a tree with root test $x_3 = 0$ and entropy $Entropy(D)$, compared with the same tree after adding a test $x_1 = 0$ on the t branch, whose partitions have entropies $Entropy(D_t)$ and $Entropy(D_f)$]
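A minimal sketch of the gain computation, reusing `entropy` from the impurity sketch above; the function name and signature are my own:

```python
import numpy as np

def gain(labels, attribute_values, k):
    """Gain(D, A) = Entropy(D) - sum_v |D_v|/|D| * Entropy(D_v)."""
    total = entropy(labels, k)
    n = len(labels)
    expected = 0.0
    for v in np.unique(attribute_values):
        Dv = labels[attribute_values == v]   # the partition with attribute A = v
        expected += len(Dv) / n * entropy(Dv, k)
    return total - expected
```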
Decision tree learning
• Greedy learning algorithm (a sketch follows below):
  Repeat until there is no (or only a small) improvement in the purity:
  – Find the attribute with the highest gain
  – Add the attribute to the tree and split the set accordingly
• Builds the tree in a top-down fashion
  – Gradually expands the leaves of the partially built tree
• The method is greedy:
  – It looks at a single attribute and its gain in each step
  – May fail when a combination of attributes is needed to improve the purity (e.g., parity functions)
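A minimal sketch of this greedy algorithm for binary attributes, reusing `gain` (and `entropy`) from the sketches above; the tree representation, stopping threshold, and function names are my own assumptions:

```python
import numpy as np

def build_tree(X, y, attrs, k, min_gain=1e-6):
    """Greedy top-down tree construction over binary attributes.

    Returns a leaf (majority class) when the node is pure or when no
    attribute improves the purity by more than min_gain; otherwise
    splits on the highest-gain attribute and recurses."""
    majority = int(np.bincount(y, minlength=k).argmax())
    if len(attrs) == 0 or len(np.unique(y)) == 1:
        return majority
    gains = [gain(y, X[:, a], k) for a in attrs]
    best = int(np.argmax(gains))
    if gains[best] < min_gain:
        return majority
    a = attrs[best]
    rest = [b for b in attrs if b != a]
    children = {}
    for v in (0, 1):
        mask = X[:, a] == v
        children[v] = build_tree(X[mask], y[mask], rest, k) if mask.any() else majority
    return (a, children)

def tree_classify(x, node):
    """Walk the tree from the root to a leaf and return the leaf's class."""
    while isinstance(node, tuple):
        a, children = node
        node = children[int(x[a])]
    return node
```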
Decision tree learning
• Limitations of greedy methods:
  Cases in which a combination of two or more attributes improves the impurity
[Figure: a parity-like (XOR) layout of 0s and 1s in two dimensions — no split on a single attribute reduces the impurity, although a combination of two splits separates the classes]
Decision tree learning
By reducing the impurity measure we can grow very large trees.
Problem: overfitting
• We may split and classify the training set very well, but we may do worse in terms of the generalization error
Solutions to the overfitting problem:
• Solution 1:
  – Prune branches of the tree built in the first phase
  – Use a validation set to test for the overfit
• Solution 2:
  – Test for the overfit during the tree-building phase
  – Stop building the tree when the performance on the validation set deteriorates
K-Nearest-Neighbours for Classification
• Given a data set with $N_k$ data points from class $C_k$ and $\sum_k N_k = N$ points in total, draw a sphere around the query point x containing exactly K points; if the sphere has volume V and contains $K_k$ points from class $C_k$, we have
  $p(\mathbf{x} \mid C_k) = \dfrac{K_k}{N_k V}$
• and correspondingly $p(\mathbf{x}) = \dfrac{K}{N V}$ and $p(C_k) = \dfrac{N_k}{N}$
• Since the posterior combines these, Bayes’ theorem gives
  $p(C_k \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})} = \dfrac{K_k}{K}$
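A minimal sketch of the resulting classifier (Euclidean distance and the function name are my own assumptions; ties between classes are not handled):

```python
import numpy as np

def knn_classify(x, X, y, K, k_classes):
    """p(C_k | x) = K_k / K: the fraction of the K nearest neighbours in class C_k."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = y[np.argsort(dists)[:K]]
    posterior = np.bincount(nearest, minlength=k_classes) / K
    return int(np.argmax(posterior)), posterior
```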
K-Nearest-Neighbours for Classification
[Figure: K-nearest-neighbour decision boundaries on the same data for K = 1 and K = 3]
Nonparametric kernel-based classification
• Kernel function $k(\mathbf{x}, \mathbf{x}')$:
  – Models the similarity between x and x'
  – Example: the Gaussian kernel we used in kernel density estimation,
    $k(\mathbf{x}, \mathbf{x}') = \dfrac{1}{(2\pi h^2)^{D/2}} \exp\left(-\dfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2h^2}\right)$
    with the density estimate $p(\mathbf{x}) = \dfrac{1}{N} \sum_{i=1}^{N} k(\mathbf{x}, \mathbf{x}_i)$
• Kernel for classification:
  $p(y = C_k \mid \mathbf{x}) = \dfrac{\sum_{\mathbf{x}' : y' = C_k} k(\mathbf{x}, \mathbf{x}')}{\sum_{\mathbf{x}'} k(\mathbf{x}, \mathbf{x}')}$
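A minimal sketch of this kernel-based classifier with the Gaussian kernel; the bandwidth default and function names are my own assumptions:

```python
import numpy as np

def gaussian_kernel(x, xp, h):
    """k(x, x') = (2 pi h^2)^(-D/2) exp(-||x - x'||^2 / (2 h^2))."""
    D = len(x)
    return np.exp(-np.sum((x - xp) ** 2) / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (D / 2)

def kernel_posterior(x, X, y, k_classes, h=0.5):
    """p(y = C_k | x): kernel weight of class-k training points over the total kernel weight."""
    w = np.array([gaussian_kernel(x, xi, h) for xi in X])
    return np.array([w[y == c].sum() for c in range(k_classes)]) / w.sum()
```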