TRANSCRIPT
Instance-Based Learning and Clustering
R&N 20.4, a bit of 20.3
Different Kinds of Inductive Learning
• Supervised learning
  – Basic idea: learn an approximation to a function y = f(x) based on labelled examples { (x1,y1), (x2,y2), …, (xn,yn) }
  – E.g. decision trees, Bayes classifiers, instance-based learning methods
• Unsupervised learning
Instance-Based Learning
• Idea: for every test data point, search the database of training data for 'similar' points and predict according to those points
• Four elements of an instance-based learner:
  – How do we define 'similarity'?
  – How many similar data points (neighbors) do we use?
  – (Optional) What weights do we give these neighbors?
  – How do we predict using these neighbors?
One-Nearest-Neighbor (1-NN)
• Simplest instance-based learning method
• Four elements of 1-NN:
  – How do we define 'similarity'? → Euclidean distance metric
  – How many similar data points (neighbors) do we use? → one
  – (Optional) What weights do we give these neighbors? → unused
  – How do we predict using these neighbors? → predict the same value as the nearest neighbor
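The four choices above can be sketched in a few lines of plain Python (the function names are illustrative, not from the slides):

```python
import math

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def predict_1nn(train, test_point):
    # train: list of (feature_vector, label) pairs.
    # 1-NN: return the label of the single nearest training point.
    nearest = min(train, key=lambda pair: euclidean(pair[0], test_point))
    return nearest[1]

train = [((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((5.0, 5.0), "B")]
print(predict_1nn(train, (4.0, 4.5)))  # nearest training point is (5,5) -> "B"
```

Note that there is no training step at all: the "model" is simply the stored training set.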
1-NN Prediction
1. Classification (predicting discrete-valued labels)
[Figure: a test point in a two-feature (p1, p2) plane; the prediction is the class, A or B, of its nearest training point.]
1-NN Prediction
1. Classification (predicting discrete-valued labels)
[Figure: three classes; the background color indicates the predicted class in each region, and solid lines are the decision boundaries between classes. Ignore the dashed purple line.]
1-NN Prediction
2. Regression (predicting real-valued labels)
[Figure: 1-NN regression example on real-valued training data.]
K-Nearest-Neighbor (K-NN)
• A generalization of 1-NN to multiple neighbors
• Four elements of K-NN:
  – How do we define 'similarity'? → Euclidean distance metric
  – How many similar data points (neighbors) do we use? → K
  – (Optional) What weights do we give these neighbors? → unused
  – How do we predict using these neighbors? → Classification: predict the majority label among the neighbors; Regression: predict the average value among the neighbors
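Both prediction modes can be sketched in plain Python (illustrative names; ties in the majority vote are broken arbitrarily here):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest(train, x, k):
    # train: list of (feature_vector, label_or_value) pairs;
    # return the k training pairs closest to x
    return sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]

def knn_classify(train, x, k):
    # classification: majority label among the K nearest neighbors
    labels = [label for _, label in k_nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, x, k):
    # regression: unweighted average of the K nearest neighbors' values
    values = [v for _, v in k_nearest(train, x, k)]
    return sum(values) / k

train_c = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "B"), ((5, 5), "B")]
train_r = [((0,), 1.0), ((1,), 2.0), ((2,), 3.0), ((10,), 100.0)]
print(knn_classify(train_c, (0.2, 0.2), k=3))  # two of the three nearest are "A"
```

Setting k=1 recovers 1-NN exactly, which is why the slides treat 1-NN as a special case.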
K-NN Prediction
1. Classification (K=3)
[Figure: a test point in the (p1, p2) plane; the prediction is the majority class among its three nearest training points.]
K-NN Prediction
1. Classification (K=15)
[Figure: three classes; the background color indicates the predicted class in each region, and solid lines are the decision boundaries between classes. Ignore the dashed purple line.]
[Figure: side-by-side comparison of decision boundaries for K=1 vs. K=15.]
K-NN Prediction
2. Regression (with K=9)
[Figure: side-by-side comparison of K-NN regression fits for K=1 vs. K=9.]
Example: Recognition of Handwritten Digits
• Each digit sample is a 20×30-pixel image; stacking its pixel values into one vector gives a 600-dimensional data point.
Example: Recognition of Handwritten Digits
• N sets of handwritten digit samples give 10×N 600-dimensional training points (in the figure, each color represents samples of a particular digit).
• A new handwritten sample is classified by K-NN in the 600-dimensional space, using the training data.
K-NN vs. Other Techniques
• Most instance-based methods work only for real-valued inputs.
• Instance-based methods do not need a training phase, unlike decision trees and Bayes classifiers.
• However, the nearest-neighbors-search step can be expensive for large/high-dimensional datasets.
• Instance-based learning is non-parametric, i.e. it makes no prior model assumptions.
• There is no foolproof way to pre-select K … one must try different values and pick one that works well.
• K-NN regression has problems with discontinuities and edge effects … these can be addressed by introducing weights for data points that are proportional to closeness.
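The weighting fix in the last bullet can be sketched with inverse-distance weights (one common choice; the slides only say "proportional to closeness", so this particular kernel and the function names are assumptions):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def weighted_knn_regress(train, x, k, eps=1e-9):
    # Distance-weighted K-NN regression: closer neighbors get
    # larger weights (here weight = 1/(distance + eps); eps avoids
    # division by zero when a test point coincides with a training point).
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]
    weights = [1.0 / (euclidean(p, x) + eps) for p, _ in neighbors]
    total = sum(weights)
    return sum(w * v for w, (_, v) in zip(weights, neighbors)) / total

train_w = [((0.0,), 0.0), ((1.0,), 10.0)]
```

Because the weights change smoothly as the test point moves, the prediction no longer jumps abruptly at the boundary where the neighbor set changes.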
Unsupervised Learning (a.k.a. Clustering)
• Unsupervised learning
  – Basic idea: learn structure in the data from unlabelled examples { x1, x2, …, xn } (no target values yi are given)
  – The goal is to uncover distinct classes of data points (clusters), which might then lead to a supervised learning scenario
  – E.g. K-means, hierarchical clustering
The following slides are adapted from Andrew Moore's slides at http://www.autonlab.org/tutorials/kmeans.html
K-means
• Even if we have no labels for a data set, there might still be interesting structure in the data in the form of distinct clusters/clumps.
• K-means is an iterative algorithm to find such clusters, given the assumption that exactly K clusters exist.
K-means
1. Ask the user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations.
3. Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints.)
4. Each Center finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat from step 3 until terminated!
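The loop above can be sketched in plain Python (illustrative names; an empty cluster simply keeps its old center, which is one of several common conventions):

```python
import math, random

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def centroid(points):
    # coordinate-wise mean of a non-empty list of points
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: random initial centers
    for _ in range(max_iters):
        # step 3: each point is owned by its nearest center
        owned = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: euclidean(c, p))
            owned[nearest].append(p)
        # steps 4-5: each center jumps to the centroid of the points it owns
        new_centers = [centroid(pts) if pts else c for c, pts in owned.items()]
        if new_centers == centers:           # step 6: stop when nothing moves
            break
        centers = new_centers
    return centers

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
```

On two well-separated clumps like `pts`, the centers converge to the two clump centroids after a handful of iterations.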
K-means Questions
• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal clustering?
• How should we start it?
Distortion
Given…
• an encoder function ENCODE : ℝ^m → [1..k]
• a decoder function DECODE : [1..k] → ℝ^m
Define…
$$\text{Distortion} = \sum_{i=1}^{R}\bigl(\mathbf{x}_i - \mathrm{DECODE}[\mathrm{ENCODE}(\mathbf{x}_i)]\bigr)^2$$
We may as well write $\mathbf{c}_j = \mathrm{DECODE}[j]$, so
$$\text{Distortion} = \sum_{i=1}^{R}\bigl(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\bigr)^2$$
The Minimal Distortion
What properties must centers c1, c2, …, ck have when distortion is minimized?
$$\text{Distortion} = \sum_{i=1}^{R}\bigl(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\bigr)^2$$
The Minimal Distortion (1)
What properties must centers c1, c2, …, ck have when distortion is minimized?
$$\text{Distortion} = \sum_{i=1}^{R}\bigl(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\bigr)^2$$
(1) x_i must be encoded by its nearest center: at the minimal distortion,
$$\mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)} = \arg\min_{\mathbf{c}\in\{\mathbf{c}_1,\ldots,\mathbf{c}_k\}}\bigl(\mathbf{x}_i - \mathbf{c}\bigr)^2$$
…why? Because otherwise the distortion could be reduced by replacing ENCODE[x_i] with the index of the nearest center.
The Minimal Distortion (2)
What properties must centers c1, c2, …, ck have when distortion is minimized?
(2) The partial derivative of Distortion with respect to each center location must be zero.
$$\begin{aligned}
\text{Distortion} &= \sum_{i=1}^{R}\bigl(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\bigr)^2 \\
&= \sum_{j=1}^{k}\;\sum_{i\in\mathrm{OwnedBy}(\mathbf{c}_j)}\bigl(\mathbf{x}_i - \mathbf{c}_j\bigr)^2 \\
\frac{\partial\,\text{Distortion}}{\partial\mathbf{c}_j} &= \frac{\partial}{\partial\mathbf{c}_j}\sum_{i\in\mathrm{OwnedBy}(\mathbf{c}_j)}\bigl(\mathbf{x}_i - \mathbf{c}_j\bigr)^2 \\
&= -2\sum_{i\in\mathrm{OwnedBy}(\mathbf{c}_j)}\bigl(\mathbf{x}_i - \mathbf{c}_j\bigr) \\
&= 0 \quad\text{(for a minimum)}
\end{aligned}$$
OwnedBy(c_j) = the set of records owned by Center c_j.
Thus, at a minimum:
$$\mathbf{c}_j = \frac{1}{|\mathrm{OwnedBy}(\mathbf{c}_j)|}\sum_{i\in\mathrm{OwnedBy}(\mathbf{c}_j)}\mathbf{x}_i$$
At the Minimum Distortion
What properties must centers c1, c2, …, ck have when distortion is minimized?
$$\text{Distortion} = \sum_{i=1}^{R}\bigl(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\bigr)^2$$
(1) x_i must be encoded by its nearest center.
(2) Each Center must be at the centroid of the points it owns.
Improving a Suboptimal Configuration…
What properties of the centers c1, c2, …, ck can be changed when distortion is not minimized?
$$\text{Distortion} = \sum_{i=1}^{R}\bigl(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\bigr)^2$$
(1) Change the encoding so that x_i is encoded by its nearest center.
(2) Set each Center to the centroid of the points it owns.
There's no point applying either operation twice in succession. But it can be profitable to alternate.
…And that's K-means!
Will We Find the Optimal Configuration?
• Not necessarily.
• Can you invent a configuration that has converged, but does not have the minimum distortion?
Trying to Find Good Optima
• Idea 1: Be careful about where you start.
• Idea 2: Do many runs of k-means, each from a different random start configuration.
• Many other ideas floating around.
Other Distance Metrics
• Note that we could have used the Manhattan distance metric instead of the Euclidean one above.
• If so,
$$\text{Distortion} = \sum_{i=1}^{R}\bigl|\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\bigr|$$
How would you find the distortion-minimizing centers in this case?
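As a nudge toward the answer to that question: for Manhattan distance, the coordinate-wise median (rather than the mean) minimizes the distortion. A quick empirical check in plain Python (function names illustrative):

```python
import statistics

def l1_distortion(points, center):
    # sum of Manhattan distances from each point to the given center
    return sum(sum(abs(p[d] - center[d]) for d in range(len(center)))
               for p in points)

def median_center(points):
    # coordinate-wise median of a list of points
    dims = len(points[0])
    return tuple(statistics.median(p[d] for p in points) for d in range(dims))

pts1d = [(0.0,), (1.0,), (10.0,)]
med = median_center(pts1d)                       # (1.0,)
mean = (sum(p[0] for p in pts1d) / len(pts1d),)  # (3.67,)
print(l1_distortion(pts1d, med) <= l1_distortion(pts1d, mean))  # True
```

Replacing the centroid step of K-means with this median step gives the K-medians variant of the algorithm.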
Example: Image Segmentation
• Once K-means has been performed, the resulting cluster centers can be thought of as K labelled data points for 1-NN on the entire training set, such that each data point is labelled with its nearest center. This is called Vector Quantization.
[Figure: vector quantization on pixel intensities]
[Figure: vector quantization on pixel colors]
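The vector-quantization step described above — labelling every point with its nearest K-means center — is just 1-NN against the centers; a minimal sketch (illustrative names):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def vector_quantize(points, centers):
    # For each point, return the index of its nearest center
    # (its quantization label); for an image, points would be
    # pixel intensities or colors and K would be small.
    labels = []
    for p in points:
        labels.append(min(range(len(centers)),
                          key=lambda j: euclidean(p, centers[j])))
    return labels

# toy example: 1-D pixel intensities quantized against two centers
print(vector_quantize([(0.0,), (0.2,), (0.8,), (1.0,)],
                      [(0.1,), (0.9,)]))  # [0, 0, 1, 1]
```

Replacing each pixel by its center's value then yields the segmented image with only K distinct intensities or colors.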
Common Uses of K-means
• Often used as an exploratory data analysis tool.
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets.
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (i.e. Vector Quantization).
• Also used for choosing color palettes on old-fashioned graphical display devices!
Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster."
2. Find the "most similar" pair of clusters.
3. Merge them into a parent cluster.
4. Repeat… until you've merged the whole dataset into one cluster.
You're left with a nice dendrogram, or taxonomy, or hierarchy of datapoints.
How do we define similarity between clusters?
• Minimum distance between points in the clusters (this is "single linkage")
• Maximum distance between points in the clusters
• Average distance between points in the clusters
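The merge loop above, with the minimum-distance (single linkage) similarity, can be sketched in plain Python (illustrative names; a naive O(n³)-ish version meant only to mirror the numbered steps):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def single_linkage(c1, c2):
    # single linkage: minimum distance between any pair of points
    return min(euclidean(p, q) for p in c1 for q in c2)

def hierarchical_cluster(points, target_k=1):
    # 1. every point starts as its own cluster
    clusters = [[p] for p in points]
    merges = []                       # record of merge steps (the dendrogram)
    # 2.-4. repeatedly merge the most similar pair of clusters
    while len(clusters) > target_k:
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

two = hierarchical_cluster([(0, 0), (0, 1), (10, 10), (10, 11)], target_k=2)
```

Stopping early at `target_k` clusters corresponds to cutting the dendrogram part-way, as discussed in the comments that follow.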
Hierarchical Clustering Comments
• It's nice that you get a hierarchy instead of an amorphous collection of groups.
• If you want k groups, just cut the (k−1) longest links.