Chapter 13 (Prototype Methods and Nearest-Neighbors), February 24, 2006



Page 1

Chapter 13: Prototype Methods and Nearest-Neighbors

February 24, 2006

Page 2

Contents

Prototype Methods and Nearest-Neighbors

1. Introduction
2. Prototype Methods
   2.1 K-means Clustering
   2.2 Learning Vector Quantization
   2.3 Gaussian Mixtures
3. k-Nearest-Neighbor Classifiers
   3.1 Example: A Comparative Study
   3.2 Example: k-Nearest-Neighbors and Image Scene Classification
   3.3 Invariant Metrics and Tangent Distance
4. Adaptive Nearest-Neighbor Methods
   4.1 Example
   4.2 Global Dimension Reduction for Nearest-Neighbors
5. Computational Considerations

Page 3

2. Prototype Methods

◆ Prototypes: a set of representative points, not necessarily examples from the training sample.

◆Prototype Methods: Classify a query point x to the class of the closest prototype.

-“Closest” is usually defined by Euclidean distance in the feature space, after each feature has been standardized to have overall mean 0 and variance 1 in the training sample.

The main issue: To figure out how many prototypes to use and where to put them.

Page 4

2.1 K-means Clustering

K-means clustering: a method for finding clusters and cluster centers (R of them) in a set of unlabeled data.

- K-means algorithm (a minimal code sketch of the iteration follows):
  1. Initialize a set of R centers.
  2. For each center, identify the subset of training points that is closer to it than to any other center.
  3. Compute the mean of each feature over the data points in each cluster; this mean vector becomes the new center for that cluster.

Iterate steps 2-3 until convergence. (Typically the initial centers are R randomly chosen observations from the training data.)
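A minimal NumPy sketch of this iteration (Lloyd's algorithm). The data matrix X, the number of centers R, and the convergence test are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def kmeans(X, R, n_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm): alternate assignment and mean updates."""
    rng = np.random.default_rng(seed)
    # Initialize centers as R randomly chosen training observations.
    centers = X[rng.choice(len(X), size=R, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest center (Euclidean distance).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of the points assigned to it.
        new_centers = np.array([X[labels == r].mean(axis=0) if np.any(labels == r)
                                else centers[r] for r in range(R)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```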


Page 5

2. Prototype Methods

- K-means classifier (labeled data); a code sketch follows the steps:
  1. Apply K-means clustering to the training data in each class separately, using R prototypes per class.
  2. Assign a class label to each of the K × R prototypes.
  3. Classify a new feature vector x to the class of the closest prototype.
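A sketch of this K-by-R prototype classifier, using scikit-learn's KMeans for step 1; the function names and the R = 5 default are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_kmeans_prototypes(X, y, R=5, seed=0):
    """Steps 1-2: run K-means within each class and label the resulting prototypes."""
    protos, proto_labels = [], []
    for cls in np.unique(y):
        km = KMeans(n_clusters=R, n_init=10, random_state=seed).fit(X[y == cls])
        protos.append(km.cluster_centers_)
        proto_labels.append(np.full(R, cls))
    return np.vstack(protos), np.concatenate(proto_labels)

def predict_prototypes(X_new, protos, proto_labels):
    """Step 3: assign each query point to the class of its closest prototype."""
    d = ((X_new[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return proto_labels[d.argmin(axis=1)]
```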

Drawback: for each class, the other classes have no say in the positioning of that class's prototypes. As a result, some of the prototypes end up near the class boundaries, leading to potential misclassification errors for points near those boundaries.

Remedy: use all of the data to position all prototypes → LVQ.

Figure 13.1 (upper panel)

Page 6

2. Prototype Methods

2.2 Learning Vector Quantization (LVQ1)

Drawback of LVQ procedures: they are defined by algorithms rather than by the optimization of some fixed criterion, which makes it difficult to understand their properties.
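For reference, LVQ1 (Kohonen) repeatedly samples a training point, finds its closest prototype, and moves that prototype toward the point if their class labels agree and away from it if they disagree, with a learning rate that decays toward zero. A minimal sketch under that description; the decay schedule and the use of the K-means prototypes as a starting point are illustrative assumptions:

```python
import numpy as np

def lvq1(X, y, protos, proto_labels, epsilon=0.05, n_epochs=20, seed=0):
    """LVQ1-style updates: attract the closest prototype to same-class points,
    repel it from different-class points; the learning rate decays each epoch."""
    rng = np.random.default_rng(seed)
    protos = protos.copy()
    for epoch in range(n_epochs):
        eps = epsilon * (1.0 - epoch / n_epochs)              # assumed decaying schedule
        for i in rng.permutation(len(X)):
            j = ((protos - X[i]) ** 2).sum(axis=1).argmin()   # closest prototype
            sign = 1.0 if proto_labels[j] == y[i] else -1.0
            protos[j] += sign * eps * (X[i] - protos[j])
    return protos
```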

Figure 13.1 (lower panel)

Page 7

2. Prototype Methods

-In the lower panel, the LVQ algorithm moves the prototypes away from the decision boundary.

Page 8

2.3 Gaussian Mixtures

- Each cluster is described by a Gaussian density, which has a centroid (playing the role of a prototype) and a covariance matrix.

Estimation of means and covariances (apply the EM algorithm in each class):
1. In the E-step, observations close to the center of a cluster will most likely get weight 1 for that cluster, and weight 0 for every other cluster.
2. In the M-step, each observation contributes to the weighted means for every cluster.

Classification is based on posterior probabilities:


(Soft clustering method)

Classification rule:
$$\hat{p}(x) = \{\hat{p}_1(x), \ldots, \hat{p}_K(x)\}, \qquad \hat{G}(x) = \arg\max_{k}\, \hat{p}_k(x).$$
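A sketch of this classifier using scikit-learn's GaussianMixture (fit by EM), fitting one mixture per class and classifying by posterior probability; the number of components per class and the equal-prior default are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_mixtures(X, y, R=3, seed=0):
    """Fit a Gaussian mixture (via EM) to the training data of each class."""
    return {cls: GaussianMixture(n_components=R, random_state=seed).fit(X[y == cls])
            for cls in np.unique(y)}

def predict_mixture(X_new, mixtures, priors=None):
    """Classify by posterior: class prior times the class-conditional mixture density."""
    classes = sorted(mixtures)
    if priors is None:                     # assume equal class priors
        priors = {cls: 1.0 / len(classes) for cls in classes}
    # score_samples returns the log density; add the log prior and take the argmax.
    log_post = np.column_stack([np.log(priors[c]) + mixtures[c].score_samples(X_new)
                                for c in classes])
    return np.array(classes)[log_post.argmax(axis=1)]
```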

Page 9

2. Prototype Methods

-Although the decision boundaries are roughly similar, those for the mixture model are smoother.

Figure 13.2

Page 10

3. k-Nearest-Neighbor Classifiers

- No model is fit. Given a query point x_0, we find the k training points x_(r), r = 1, ..., k, closest in distance to x_0, and then classify using majority vote among the k neighbors. (For simplicity we assume the features are real-valued, and we use Euclidean distance in feature space: $d_{(i)} = \lVert x_{(i)} - x_0 \rVert$.)
- Ties are broken at random.
- In 1-nearest-neighbor classification, each training point is a prototype.
- Large k: high bias, low variance (smooth boundaries). Small k: low bias, high variance (wiggly boundaries).
(A minimal sketch of the vote follows.)
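A minimal from-scratch sketch of the k-NN vote; random tie-breaking follows the text, while the default k = 5 is an illustrative assumption (features are assumed already standardized):

```python
import numpy as np
from collections import Counter

def knn_predict(x0, X_train, y_train, k=5, rng=None):
    """Classify x0 by majority vote among its k nearest training points (Euclidean)."""
    rng = rng or np.random.default_rng(0)
    d = np.sqrt(((X_train - x0) ** 2).sum(axis=1))     # d_(i) = ||x_(i) - x0||
    neighbors = y_train[np.argsort(d)[:k]]
    counts = Counter(neighbors)
    top = max(counts.values())
    winners = [cls for cls, c in counts.items() if c == top]
    return rng.choice(winners)                          # ties broken at random
```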

Page 11

3. k-Nearest-Neighbor Classifiers

-The decision boundary is fairly smooth compared to the lower panel, where a 1-nearest-neighbor classifier was used.

-The upper panel shows the misclassification errors as a function of neighborhood size.

Figure 13.3

Figure 13.4

Page 12

Test error bound for 1-NN (Cover and Hart, 1967)

- Asymptotically, the error rate of the 1-nearest-neighbor classifier is never more than twice the Bayes rate.

At x, let k* be the dominant class and p_k(x) the true conditional probability for class k.

Then the Bayes error at x is 1 − p_{k*}(x), and the asymptotic 1-NN error rate is that of a random rule, which satisfies


$$\sum_{k=1}^{K} p_k(x)\bigl(1 - p_k(x)\bigr) \;\le\; 2\bigl(1 - p_{k^*}(x)\bigr) - \frac{K}{K-1}\bigl(1 - p_{k^*}(x)\bigr)^{2}.$$
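For intuition, the two-class case (K = 2), with $p = p_{k^*}(x) \ge \tfrac{1}{2}$:
$$\sum_{k=1}^{2} p_k(x)\bigl(1 - p_k(x)\bigr) = 2p(1-p) \;\le\; 2(1-p) = 2 \times \text{Bayes error at } x.$$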

Page 13

3. k-Nearest-Neighbor Classifiers

3.1 Example: A Comparative Study

- Comparison of k-NN, K-means, and LVQ on two simulated problems. There are 10 independent features X_j, each uniformly distributed on [0, 1]; the two-class 0-1 target variable is defined from these features (differently in the two problems).

- We see that K-means and LVQ give nearly identical results. For the best choices of their tuning parameters, K-means and LVQ outperform nearest-neighbors on both the first and the second problem. Notice that the best value of each tuning parameter is clearly situation dependent. The results underline the importance of using an objective, data-based method such as cross-validation to estimate the best value of a tuning parameter. (A small simulation sketch follows.)
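A rough simulation sketch comparing k-NN with the K-means prototype classifier described earlier. The target rule Y = 1{X_1 > 1/2} is assumed here as the easier of the two problems, and LVQ is omitted because scikit-learn has no built-in implementation; sample sizes and tuning values are illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def make_data(n, p=10):
    X = rng.uniform(0, 1, size=(n, p))
    y = (X[:, 0] > 0.5).astype(int)        # assumed "easy" target rule
    return X, y

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(2000)

# k-nearest-neighbors
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("k-NN test error:", (knn.predict(X_te) != y_te).mean())

# K-means prototype classifier: R prototypes per class, nearest-prototype rule
R = 5
protos, labels = [], []
for cls in (0, 1):
    km = KMeans(n_clusters=R, n_init=10, random_state=0).fit(X_tr[y_tr == cls])
    protos.append(km.cluster_centers_)
    labels.append(np.full(R, cls))
protos, labels = np.vstack(protos), np.concatenate(labels)
d = ((X_te[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
print("K-means prototype test error:", (labels[d.argmin(axis=1)] != y_te).mean())
```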

Page 14

3. k-Nearest-Neighbor Classifiers

Figure 13.5

Page 15

3. k-Nearest-Neighbor Classifiers

3.2 Example: k-Nearest-Neighbors and Image Scene Classification

Figure 13.6

Page 16

3. k-Nearest-Neighbor Classifiers

Page 17

3. k-Nearest-Neighbor Classifiers

3.3 Example: Invariant Metrics and Tangent Distance

- In some problems, the training features are invariant under certain natural transformations (e.g., rotation of a handwritten digit). The nearest-neighbor classifier can exploit such invariances by incorporating them into the metric used to measure the distances between objects.
- Invariant metric: allowing the full range of transformations causes problems for large transformations (a heavily rotated "6" can look like a "9"), which motivates the use of tangent distances, local linear approximations to the transformation.
- "Hints": a simpler way to achieve this invariance is to add to the training set a number of rotated versions of each training image, and then just use a standard nearest-neighbor classifier. (A sketch of this augmentation approach follows.)
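A sketch of the "hints" augmentation on image data, using scipy.ndimage.rotate to add rotated copies of each training image before running a standard 1-NN classifier; the angle grid, image shape, and variable names are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import rotate
from sklearn.neighbors import KNeighborsClassifier

def augment_with_rotations(X_img, y, angles=(-15, -7.5, 7.5, 15)):
    """Add rotated versions of each training image ("hints") to the training set.
    X_img: array of shape (n, h, w) of grayscale images."""
    Xs, ys = [X_img], [y]
    for a in angles:
        # Rotate each image about its center, keeping the original shape.
        Xs.append(np.stack([rotate(img, a, reshape=False, mode="nearest")
                            for img in X_img]))
        ys.append(y)
    return np.concatenate(Xs), np.concatenate(ys)

# Usage sketch: flatten images to feature vectors and use a standard 1-NN classifier.
# X_aug, y_aug = augment_with_rotations(X_train_images, y_train)
# clf = KNeighborsClassifier(n_neighbors=1).fit(X_aug.reshape(len(X_aug), -1), y_aug)
```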

Page 18

3. k-Nearest-Neighbor Classifiers

Figure 13.10 Figure 13.11

Page 19

4. Adaptive Nearest-Neighbor Methods

- Median R: consider N data points uniformly distributed in the unit cube $[-\tfrac{1}{2}, \tfrac{1}{2}]^p$. Let R be the radius of a 1-nearest-neighborhood centered at the origin. Then median(R) is given by the formula below.
- We see that the median radius quickly approaches 0.5, the distance to the edge of the cube, as p grows.


- The high-dimensional problem: in a high-dimensional feature space, the nearest neighbors of a point can be very far away, causing bias and degrading the performance of the rule.

Figure 13.12

$$\mathrm{median}(R) = v_p^{-1/p}\Bigl(1 - \tfrac{1}{2}^{1/N}\Bigr)^{1/p},$$
where $v_p r^p$ is the volume of the sphere of radius $r$ in $p$ dimensions.
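A small numerical check of this formula, using $v_p = \pi^{p/2} / \Gamma(p/2 + 1)$; the choice N = 500 and the Monte Carlo replication count are illustrative assumptions:

```python
import numpy as np
from math import gamma, pi

def median_radius(p, N):
    """median(R) = v_p^(-1/p) * (1 - (1/2)^(1/N))^(1/p) for the unit cube [-1/2, 1/2]^p."""
    v_p = pi ** (p / 2) / gamma(p / 2 + 1)          # volume of the unit ball in p dims
    return v_p ** (-1 / p) * (1 - 0.5 ** (1 / N)) ** (1 / p)

def median_radius_mc(p, N, n_rep=1000, seed=0):
    """Monte Carlo estimate: median over replications of the 1-NN radius at the origin."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-0.5, 0.5, size=(n_rep, N, p))
    return np.median(np.linalg.norm(X, axis=2).min(axis=1))

for p in (2, 5, 10):
    print(p, round(median_radius(p, N=500), 3), round(median_radius_mc(p, N=500), 3))
```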

Page 20

4. Adaptive Nearest-Neighbor Methods

Page 21

4. Adaptive Nearest-Neighbor Methods

◆ DANN (discriminant adaptive nearest neighbor)

In the DANN metric below, W is the pooled within-class covariance matrix and B is the between-class covariance matrix, with W and B computed using only the 50 nearest neighbors around x_0. After computation of the metric, it is used in a nearest-neighbor rule at x_0.

DANN metric:
$$D(x, x_0) = (x - x_0)^T \,\Sigma\, (x - x_0),$$
where
$$\Sigma = W^{-1/2}\bigl[\,W^{-1/2} B\, W^{-1/2} + \epsilon I\,\bigr] W^{-1/2} = W^{-1/2}\bigl[\,B^{*} + \epsilon I\,\bigr] W^{-1/2}.$$
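A rough sketch of computing the DANN metric at a query point and using it in a nearest-neighbor vote. The 50-neighbor local estimates of W and B follow the slide; epsilon = 1 and the final k = 5 vote are illustrative choices:

```python
import numpy as np

def dann_metric(x0, X, y, n_nbrs=50, epsilon=1.0):
    """Estimate the DANN matrix Sigma at x0 from its n_nbrs nearest neighbors."""
    idx = np.argsort(np.linalg.norm(X - x0, axis=1))[:n_nbrs]
    Xn, yn = X[idx], y[idx]
    classes, counts = np.unique(yn, return_counts=True)
    pri = counts / counts.sum()
    mean = Xn.mean(axis=0)
    B = np.zeros((X.shape[1], X.shape[1]))      # local between-class covariance
    W = np.zeros_like(B)                        # local pooled within-class covariance
    for cls, pk in zip(classes, pri):
        Xc = Xn[yn == cls]
        mc = Xc.mean(axis=0)
        B += pk * np.outer(mc - mean, mc - mean)
        W += pk * np.cov(Xc, rowvar=False, bias=True)
    # Sigma = W^{-1/2} [ W^{-1/2} B W^{-1/2} + eps I ] W^{-1/2}
    evals, evecs = np.linalg.eigh(W)
    W_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-8))) @ evecs.T
    B_star = W_inv_sqrt @ B @ W_inv_sqrt
    return W_inv_sqrt @ (B_star + epsilon * np.eye(len(B))) @ W_inv_sqrt

def dann_predict(x0, X, y, k=5, **kw):
    """k-NN vote at x0 under D(x, x0) = (x - x0)^T Sigma (x - x0)."""
    Sigma = dann_metric(x0, X, y, **kw)
    diffs = X - x0
    D = np.einsum("ij,jk,ik->i", diffs, Sigma, diffs)
    nbrs = y[np.argsort(D)[:k]]
    vals, cnts = np.unique(nbrs, return_counts=True)
    return vals[cnts.argmax()]
```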

Page 22

4. Adaptive Nearest-Neighbor Methods

4.1 Example

Figure 13.14 Figure 13.15

Page 23

4.2 Global Dimension Reduction for Nearest-Neighbors

- A variation of the local dimension reduction done by DANN, used here for global dimension reduction (Hastie and Tibshirani). At each training point x_i, the between-centroids sum-of-squares matrix B_i is computed, and these matrices are then averaged over all training points: $\bar{B} = \frac{1}{N}\sum_{i=1}^{N} B_i$.
- Let e_1, e_2, ..., e_p be the eigenvectors of the matrix $\bar{B}$, ordered from largest to smallest eigenvalue $\theta_k$. The derivation is based on the fact that the best rank-L approximation to $\bar{B}$, $\bar{B}_{[L]} = \sum_{l=1}^{L} \theta_l e_l e_l^T$, solves the least squares problem
$$\min_{\mathrm{rank}(M)=L}\; \sum_{i=1}^{N} \operatorname{trace}\bigl[(B_i - M)^2\bigr]. \qquad (13.11)$$
- Since each B_i contains information on (a) the local discriminant subspace and (b) the strength of discrimination in that subspace, (13.11) can be seen as a way of finding the best approximating subspace of dimension L to a series of N subspaces by weighted least squares.
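A rough sketch of this procedure, reusing the 50-neighbor local estimate of the between-centroids matrix from the DANN sketch; the subspace dimension L and the final k are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def local_between_matrix(x0, X, y, n_nbrs=50):
    """Between-centroids sum-of-squares matrix B_i in the neighborhood of x0."""
    idx = np.argsort(np.linalg.norm(X - x0, axis=1))[:n_nbrs]
    Xn, yn = X[idx], y[idx]
    mean = Xn.mean(axis=0)
    B = np.zeros((X.shape[1], X.shape[1]))
    for cls in np.unique(yn):
        mc = Xn[yn == cls].mean(axis=0)
        pk = (yn == cls).mean()
        B += pk * np.outer(mc - mean, mc - mean)
    return B

def global_reduction_knn(X, y, L=2, k=5, n_nbrs=50):
    """Average the B_i, keep the top-L eigenvectors, and fit k-NN in the reduced space."""
    B_bar = np.mean([local_between_matrix(x, X, y, n_nbrs) for x in X], axis=0)
    evals, evecs = np.linalg.eigh(B_bar)
    V = evecs[:, np.argsort(evals)[::-1][:L]]   # leading L eigenvectors e_1, ..., e_L
    clf = KNeighborsClassifier(n_neighbors=k).fit(X @ V, y)
    return clf, V   # classify new points with clf.predict(X_new @ V)
```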

Page 24

5. Computational Considerations

- A drawback of nearest-neighbor rules in general is the computational load, both in finding the neighbors and in storing the entire training set.
- Reducing the computations for tangent distance (Hastie and Simard).
- Reducing the storage: the multi-edit algorithm (Devijver and Kittler) and the condensing procedure (Hart). (A sketch of the condensing idea follows.)
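A minimal sketch of a Hart-style condensing pass (an added illustration; the seeding and loop details here are assumptions): keep a small subset of the training data that still classifies every training point correctly under 1-NN.

```python
import numpy as np

def condense(X, y, rng=None):
    """Hart-style condensing: start with one point per class, then repeatedly add any
    training point that the current condensed set misclassifies under 1-NN."""
    rng = rng or np.random.default_rng(0)
    keep = [int(np.flatnonzero(y == c)[0]) for c in np.unique(y)]  # one seed per class
    changed = True
    while changed:
        changed = False
        for i in rng.permutation(len(X)):
            if i in keep:
                continue
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            if y[keep][int(d.argmin())] != y[i]:     # misclassified -> add to the subset
                keep.append(int(i))
                changed = True
    return np.array(keep)   # indices of the condensed training set
```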