Optimized Nearest Neighbor Methods: Cam-Weighted Distance & Statistical Confidence

Robert Ross Puckett

University of Hawaii at Manoa
puckett@hawaii.edu

Abstract

Nearest neighbor classification methods are useful and relatively straightforward to implement. Despite this appeal, they still suffer from the curse of dimensionality, and the assumptions built into the basic nearest neighbor model may not fit every data set. As a result, many optimizations have been proposed. Two such optimizations studied in this project are a statistical confidence measure and the "cam-weighted" distance measure. For statistical confidence, a confidence measure is used to adapt the k value in k-nearest-neighbor classification so that a more suitable set of neighbors is used. The cam-weighted distance mimics attractive and repulsive forces between prototypes in determining a non-metric distance measure. This report documents the progress of studying these methods and their interaction.

1. Introduction

The nearest neighbor methods are useful nonparametric techniques for pattern classification. Although simple in design, the computational and space requirements of the algorithm can become intractable as the number of training samples grows, and the curse of dimensionality exacerbates the problem. Numerous optimizations have been proposed over time, including partial distance, search trees, and editing [1]. This project studies the effects of two optimization techniques for k-nearest-neighbor. One optimization, proposed by J. Wang et al., incorporates a statistical confidence level as a means of determining the number k of neighbors to include in the computation [2]. Confidence is proportional to the size of the majority among the nearest k neighbors: if the majority is won by only a slim margin, the confidence will be very low, and in such cases it may be advisable to increase k. Thus, instead of globally increasing the k value and greatly impacting the computational complexity, this algorithm selectively increases the k value only when the confidence is below some threshold.

The other optimization, proposed by C. Zhou et al., approaches the problem from the direction of the relationships between training samples, referred to as prototypes [3]. This technique deforms the distribution through a transformation that simulates the strengthening and weakening effects between prototypes. The k nearest neighbors surrounding each prototype are used to estimate the parameters of the distribution, and the resulting inverse transformation provides a "cam weighted distance" in place of the Euclidean distance. The authors maintain that "CamNN significantly outperforms one nearest neighbor classification (1NN) and k-nearest-neighbor classification (kNN) in most benchmarks, while its computational complexity is comparable to 1NN classification" [3].

2. Problem Description

The problem being studied is a comparison and hybrid combination of two algorithms: statistical confidence and cam-weighted distance.

2.1. Statistical Confidence

The statistical confidence optimization to k-nearest-neighbor, as described by J. Wang et al., involves determining a confidence measure for a given set of neighbors. If the confidence is below a set threshold, the number of neighbors is expanded until the threshold is met. The algorithm and analysis provided in that paper were limited to two classes, and it would be interesting to extend the concept to multiple classes. However, any generalization must be done carefully to ensure it remains meaningful. The difficulty is that the confidence measure of a set D can be approximated from delta, the difference between the class counts, divided by the square root of the number of samples in the set, where delta is obtained by subtracting the true minority count from the true majority count. With multiple classes, there is more than one option for this difference: one could compare the top two majority classes, the top majority class and the bottom minority class, or perhaps something that includes all of the classes.
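To make the two-class case concrete, the following is a minimal sketch of such a confidence estimate, assuming the margin delta over the square root of the neighborhood size is mapped through an approximate standard normal CDF (built on the error function, which Section 3 lists among the system's utilities). The class and method names are illustrative, and the exact normalization used by Wang et al. may differ.

```java
/**
 * Illustrative two-class confidence estimate for a neighborhood of
 * n = majority + minority neighbors. Confidence grows with delta / sqrt(n),
 * mapped through an approximate standard normal CDF. This is a sketch;
 * the exact form in Wang et al. [2] may differ.
 */
public final class ConfidenceSketch {

    /** Abramowitz-Stegun style rational approximation of the error function. */
    static double erf(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x));
        double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                - 0.284496736) * t + 0.254829592) * t;
        double y = 1.0 - poly * Math.exp(-x * x);
        return x >= 0 ? y : -y;
    }

    /** Approximate standard normal CDF via the error function. */
    static double normalCdf(double z) {
        return 0.5 * (1.0 + erf(z / Math.sqrt(2.0)));
    }

    /** Confidence that the observed majority class is the true majority. */
    static double estimateConfidence(int majorityCount, int minorityCount) {
        int n = majorityCount + minorityCount;
        double delta = majorityCount - minorityCount;
        return normalCdf(delta / Math.sqrt(n));
    }

    public static void main(String[] args) {
        // A 4-to-3 split among 7 neighbors gives only weak confidence,
        // exactly the situation in which confNN would enlarge k.
        System.out.println(estimateConfidence(4, 3));
    }
}
```

For example, a 4-to-3 split among seven neighbors yields a confidence of only about 0.65 under this mapping, the kind of slim margin that would trigger a larger k.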

2.2. Cam-weighted Distance

Proposed and named by C. Zhou et al., the cam-weighted distance provides an optimization of the distance measure based upon the interactions of prototypes. That is, the attractive and repulsive forces of the training data distort the decision boundary. The algorithm outlined in Zhou's paper simulates this distorting effect via a simple transformation; the training data are then used to derive the parameters needed to produce an inverse transformation. The result is a set of smoothed decision boundaries, called cam contours. It is interesting to note that although this is a distance measure, it fails to be a metric. Zhou elaborates on this point: the cam weighted distance is not wholly symmetrical, so Cam(x1, x2) is not necessarily equal to Cam(x2, x1), and there may be values for which the cam measure is not defined.

3. System Description

The algorithm was developed in Java using Eclipse. A Google-hosted project, "nearestneighbor", serves as a Subversion repository and code host, providing both offsite backup and version control. Code-checking tools and a testing framework, including PMD, Checkstyle, JUnit, and Emma, have been used to help keep code quality high throughout development.

The system is divided into classifications, classifiers, and utility functions. Classifications contain the results of a classification; different classifiers return different useful data, which may include the k value, the confidence, and, of course, the assigned class. The classifiers include the k-nearest neighbor (kNN) base system, the cam-weighted nearest neighbor (camNN) system, and the statistical confidence nearest neighbor (confNN) system. Utility functions provide common mathematical tools such as Euclidean distance, the gamma function, and the error function.
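As a rough illustration of this division of responsibilities, a classification result object and one of the utility functions might look like the following sketch. The names and fields are illustrative only; the actual classes in the project's repository may be named and organized differently.

```java
/**
 * Rough sketch of the kind of result object and utility function described above.
 * The fields mirror the report (assigned class, k used, confidence); the actual
 * classes in the project's repository may differ.
 */
public final class Classification {
    public final int assignedClass;   // class label chosen by the classifier
    public final int kUsed;           // number of neighbors actually used
    public final double confidence;   // confidence of the decision, where computed

    public Classification(int assignedClass, int kUsed, double confidence) {
        this.assignedClass = assignedClass;
        this.kUsed = kUsed;
        this.confidence = confidence;
    }

    /** Euclidean distance between two feature vectors of equal length. */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    @Override
    public String toString() {
        return "class=" + assignedClass + ", k=" + kUsed + ", confidence=" + confidence;
    }
}
```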

The kNN system classifies a given data point based upon the majority class of the nearest k neighbors. Should there be no majority, the classification is chosen at random from the tied candidate classes, using Java's built-in random number generator.

The camNN system is divided into two major parts: training and classification. The training part is in effect a preprocessing of the training data that generates the parameters used in the cam-weighted distance measure, with one set of parameters for each training sample. For a training set of 2500 samples, the time to finish training can be considerable; however, training can be viewed as a one-time cost, and should the algorithm be deployed in the field, the important cost is the classification time. The camNN system classifies using a one nearest neighbor rule, so its computational complexity for classification outperforms kNN for almost all k.

For confNN, the system requires a threshold value to be specified in addition to the training data, the test point, and a starting k value. The major difference between confNN and kNN is that the number of neighbors used varies from point to point. A starting k value is given, but the algorithm may choose a higher one. During classification, the algorithm computes the statistical confidence of the k nearest neighbors; if the calculated confidence is not above the specified threshold, a larger number of neighbors is used, and the process is repeated until the threshold is met. The error function is used in estimating the cumulative distribution function that approximates the percent error; thus, the confidence measure is proportional to the estimated error from choosing the wrong class. If the original k value is sufficient to meet the desired confidence threshold, the algorithm need not increase k.
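The following is a minimal sketch of that adaptive loop for the two-class case, reusing the margin-based confidence estimate from Section 2.1. The names, the neighbor search by full sort, and the step by which k grows are illustrative; the project's actual implementation likely differs in these details.

```java
import java.util.Arrays;
import java.util.Comparator;

/**
 * Sketch of a confidence-driven nearest neighbor classifier for two classes
 * (labels 0 and 1), as described above: start from an initial k and enlarge
 * the neighborhood until the confidence of the majority vote reaches the
 * threshold. Names and the growth step (k + 2) are illustrative.
 */
public final class ConfNNSketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    static double erf(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x));
        double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                - 0.284496736) * t + 0.254829592) * t;
        double y = 1.0 - poly * Math.exp(-x * x);
        return x >= 0 ? y : -y;
    }

    /** Majority-margin confidence: normal CDF of delta / sqrt(n), cf. Section 2.1. */
    static double confidence(int majority, int minority) {
        int n = majority + minority;
        double z = (majority - minority) / Math.sqrt(n);
        return 0.5 * (1.0 + erf(z / Math.sqrt(2.0)));
    }

    /** Classify a test point, growing k until the confidence threshold is met. */
    static int classify(double[][] train, int[] labels, double[] test,
                        int startK, double threshold) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort the training indices once, by distance to the test point.
        Arrays.sort(order, Comparator.comparingDouble(i -> euclidean(train[i], test)));

        int k = Math.min(startK, train.length);
        while (true) {
            int ones = 0;
            for (int i = 0; i < k; i++) ones += labels[order[i]];
            int zeros = k - ones;
            int majority = Math.max(ones, zeros);
            int minority = Math.min(ones, zeros);
            // Stop when confident enough, or when all training points are in use.
            if (confidence(majority, minority) >= threshold || k >= train.length) {
                // A tie here is broken arbitrarily; the real system breaks ties at random.
                return ones > zeros ? 1 : 0;
            }
            k = Math.min(k + 2, train.length);   // the real growth step may differ
        }
    }
}
```

Because the neighbors are sorted once, enlarging k only requires counting further down the same ordering, so the extra cost of a low-confidence point stays modest in this sketch.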

4. Approaches / Methods

The experiments in this project used the ELENA database of artificial classification benchmarks (ftp://ftp.dice.ucl.ac.be/pub/neural-nets/ELENA/databases). For the purposes of this experiment, the Gaussian datasets ranging in dimension from two to eight were used. Each dataset includes 5000 classified samples; the first half was used as training data and the latter half as test data. These data sets are limited to two classes.

4.1 Verification of Algorithms

The verification phase ensured that the kNN, camNN, and confNN algorithms functioned as expected. Unfortunately, the paper outlining the confNN procedure did not include any data against which to compare. However, the camNN paper by Zhou provided data for both the kNN and camNN algorithms.

4.2 Four Test Cases

The motivation for this experiment is to see whether camNN can be improved by using statistical confidence to determine the size of the cam distributions. To recap, the camNN training phase assumes that the k nearest neighbors of each prototype belong to its cam distribution; this cam distribution is used to estimate the parameters needed to reverse an assumed distortion in the probability distribution. However, a fixed value for k amounts to assuming that every cam distribution has a fixed size, which may not be true (a sketch of this hybrid training idea follows the list of test cases below). Since the camNN algorithm is the cornerstone of this project and provided much useful data for comparison, the k values from the kNN and camNN experiments are used in the following tests. The test cases are:

1. The standard kNN algorithm.
2. The camNN algorithm.
3. The confNN algorithm.
4. A hybrid algorithm of camNN with statistical confidence training.
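As referenced in Section 4.2, test case 4 replaces the fixed-size neighborhood used in camNN training with one grown until a confidence threshold is met. The sketch below illustrates only that idea: estimateCamParameters is a placeholder for the parameter estimation in Zhou's paper, which is not reproduced here, and all names are illustrative.

```java
import java.util.Arrays;
import java.util.Comparator;

/**
 * Conceptual sketch of the hybrid training in test case 4: each prototype's cam
 * distribution is fitted from a neighborhood grown until the margin-based
 * confidence (Section 2.1) meets a threshold, instead of from a fixed-size
 * neighborhood. estimateCamParameters stands in for the estimation procedure
 * of Zhou's paper [3], which is not reproduced here.
 */
public final class HybridCamTrainingSketch {

    /** Placeholder container for one prototype's (a, b) cam parameters. */
    static final class CamParams {
        final double a, b;
        CamParams(double a, double b) { this.a = a; this.b = b; }
    }

    static double euclidean(double[] p, double[] q) {
        double s = 0.0;
        for (int i = 0; i < p.length; i++) { double d = p[i] - q[i]; s += d * d; }
        return Math.sqrt(s);
    }

    static double erf(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x));
        double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                - 0.284496736) * t + 0.254829592) * t;
        double y = 1.0 - poly * Math.exp(-x * x);
        return x >= 0 ? y : -y;
    }

    /** Margin-based confidence of the first k neighbors' labels (two classes, 0/1). */
    static double confidence(int[] labels, Integer[] order, int k) {
        int ones = 0;
        for (int i = 0; i < k; i++) ones += labels[order[i]];
        double delta = Math.abs(2 * ones - k);                 // majority minus minority
        return 0.5 * (1.0 + erf(delta / Math.sqrt(2.0 * k)));  // normal CDF of delta / sqrt(k)
    }

    /** Placeholder: fit cam parameters from the chosen neighborhood, per Zhou [3]. */
    static CamParams estimateCamParameters(double[][] neighborhood, double[] prototype) {
        return new CamParams(1.0, 0.0);  // neutral values; the real estimation is omitted
    }

    static CamParams[] train(double[][] prototypes, int[] labels, int startK, double threshold) {
        CamParams[] params = new CamParams[prototypes.length];
        for (int p = 0; p < prototypes.length; p++) {
            final double[] x = prototypes[p];
            Integer[] order = new Integer[prototypes.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, Comparator.comparingDouble(i -> euclidean(prototypes[i], x)));

            int k = startK;   // grow the neighborhood until it is confident enough
            while (k < prototypes.length && confidence(labels, order, k) < threshold) k++;

            // order[0] is the prototype itself; whether to exclude it is a detail left aside.
            double[][] neighborhood = new double[k][];
            for (int i = 0; i < k; i++) neighborhood[i] = prototypes[order[i]];
            params[p] = estimateCamParameters(neighborhood, x);
        }
        return params;
    }
}
```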

5. Results

This section provides the results of the tests outlined in the previous section.

5.1 Verification

In comparing the percent error from the system run against the expected values in the Zhou paper, the kNN values differed by an average of 0.8%, and the camNN values differed by an average of 3.7%.

5.2 Four Test Cases

Table 1 below shows the results of running the four test cases outlined above. Under each test column is the percent error from running that algorithm on the specified dataset; the average k value and average confidence are also presented for the confNN algorithm.

6. Discussion

This section provides an analysis of the results from the testing section above.

6.1 Verification of Algorithms

It is surprising that the kNN results are not identical to the results in the Zhou paper. However, the paper's results have an average standard deviation of 0.7% for the kNN percent error values, and it is possible that their algorithm uses a different tie-breaking mechanism when there is no clear majority. Either way, the results from this system correlate closely with the results from the Zhou paper. Similarly, the results for the camNN algorithm correlate closely with those in the Zhou paper, although the increasing error suggests that there is room for improvement in the current implementation of this algorithm.

6.2 Four Test Cases

As expected, the optimized nearest neighbor algorithms outperformed the standard kNN algorithm in the majority of cases. Similar to the results in Zhou's paper, the camNN algorithm proved to be the strongest, yielding the best percent error at high dimensions. Surprisingly, in lower dimensions the pure statistical confidence algorithm outperformed the rest of the algorithms; it is worth noting that in Zhou's paper, kNN outperformed camNN in lower dimensions. Additionally, although the number of neighbors used by confNN varies for each point, the average was still the same as in the kNN case, which suggests that for most points the starting k value was sufficient to attain the desired threshold.

Table 1 – Results of test cases showing percent error and relevant values (Test 1 = kNN, Test 2 = camNN, Test 3 = confNN, Test 4 = hybrid).

| File     | #Feat | Test 1 | k  | Test 2 | k  | Test 3 | k_av | conf_av  | Test 4 |
|----------|-------|--------|----|--------|----|--------|------|----------|--------|
| gauss_2D | 2     | 0.284  | 21 | 0.3272 | 16 | 0.28   | 21   | 0.864461 | 0.322  |
| gauss_3D | 3     | 0.228  | 23 | 0.26   | 5  | 0.2276 | 23   | 0.893266 | 0.2524 |
| gauss_4D | 4     | 0.202  | 19 | 0.2112 | 6  | 0.2    | 19   | 0.900218 | 0.21   |
| gauss_5D | 5     | 0.1924 | 13 | 0.1792 | 6  | 0.1944 | 13   | 0.896091 | 0.1796 |
| gauss_6D | 6     | 0.1876 | 9  | 0.1712 | 6  | 0.196  | 9    | 0.856219 | 0.1704 |
| gauss_7D | 7     | 0.1836 | 5  | 0.1576 | 6  | 0.1884 | 5    | 0.808817 | 0.158  |
| gauss_8D | 8     | 0.1992 | 3  | 0.1492 | 6  | 0.2184 | 3    | 0.754405 | 0.1496 |

Unfortunately, the hybrid algorithm with statistical confidence training only outperformed the other algorithms in one case. However, it returned results that were very close to those of the pure camNN algorithm, albeit with slightly higher percent error.

One weakness of the current approach is that the implementation of the camNN algorithm ignores the direction of the cam distribution. After much work to implement the algorithm, the system produced a high error rate; replacing the cos θ term with 1 in the cam distance measure equation yielded results surprisingly close to the Zhou paper. It is disappointing that this compromise had to be made, but it is enlightening that it produced such decent results. The substitution may prove useful in future work to reduce the computational complexity of the training phase: since only the a and b parameters were needed, it may be possible to dispense with some of the calculations needed to implement this algorithm. A sketch of this simplified distance appears at the end of this section.

Additionally, it is possible that the statistical confidence algorithm is not performing as well as it could. The error function, which follows a roughly sigmoid shape, is only approximated in the current implementation. The Taylor series expansion used in this computation converges slowly, and the terms added may exceed the precision available on the computer. Thus, the error function is simplified: values less than -2 are assumed to be -1, and values above 1 are assumed to be 1. Since the error function converges rapidly to those values, this should not make a large impact.

In regard to the hybrid implementation, this algorithm is a novel concept, so there is no readily available performance data against which to compare it. The camNN algorithm was only published in 2006, and its authors noted very few algorithms with similar concepts to theirs, so it may be a while before other hybrid implementations that incorporate camNN follow.
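As a concrete reading of that simplification, the sketch below assumes a cam distance of the form Euclidean distance divided by (a + b cos θ), so that replacing cos θ with 1 reduces it to a per-prototype rescaling by a + b. The exact formulation in Zhou's paper may differ, and the (a, b) values are assumed to come from a prior training step.

```java
/**
 * Sketch of the simplified cam-weighted 1-NN classification discussed above.
 * Assumption: the cam distance has the form ||x - prototype|| / (a + b * cos(theta)),
 * so replacing cos(theta) with 1 turns it into a per-prototype rescaling of the
 * Euclidean distance by (a + b). The (a, b) values are assumed to have been
 * estimated beforehand; the exact formulation in Zhou's paper may differ.
 */
public final class SimplifiedCamNNSketch {

    static double euclidean(double[] p, double[] q) {
        double s = 0.0;
        for (int i = 0; i < p.length; i++) { double d = p[i] - q[i]; s += d * d; }
        return Math.sqrt(s);
    }

    /** 1-NN under the simplified cam distance: prototype i carries parameters a[i], b[i]. */
    static int classify(double[][] prototypes, int[] labels, double[] a, double[] b, double[] x) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < prototypes.length; i++) {
            // cos(theta) replaced with 1: the directional term drops out.
            double camDist = euclidean(x, prototypes[i]) / (a[i] + b[i]);
            if (camDist < bestDist) {
                bestDist = camDist;
                best = i;
            }
        }
        return labels[best];
    }
}
```

Under this reading, classification needs only one stored pair of parameters per prototype, which is consistent with the observation above that only the a and b parameters were required.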

7. Summary and Conclusions

This system tested several implementations of optimizations for nearest neighbor classification: k-nearest-neighbor (kNN), cam-weighted nearest neighbor (camNN), statistical confidence nearest neighbor (confNN), and a hybrid of camNN and confNN. Overall, the pure camNN outperformed most of the other algorithms, although statistical confidence outperformed camNN in low dimensions. Unfortunately, the implemented hybrid algorithm failed to provide the best percent error in comparison to the other algorithms tested. However, it was enlightening to see that the statistical confidence algorithm provided better results exactly where the camNN algorithm had the most trouble: the confNN algorithm performed better than all other algorithms tested when the dimension was low. The Zhou paper demonstrates that camNN performs best at high dimensions but is bested by kNN in lower dimensions. Thus, a new hybrid algorithm might perform better if it used pure statistical confidence to classify data sets of low dimension and pure cam-weighted nearest neighbor to classify data sets of higher dimension.

8. Future Work

Future work will include a new hybrid algorithm that uses confNN for lower dimensions and camNN for higher dimensions. It would also be useful to develop a method for determining the optimal threshold value, perhaps by preprocessing the dataset to determine a threshold appropriate for processing the entire dataset. While it is possible that a single threshold value provides optimal results for all datasets, it is unlikely, so one would want to determine the optimal threshold per dataset. The results in this paper provide the average confidence value for each run of the confNN algorithm, which might be a useful head start for determining the optimal threshold. Additionally, it would be useful to study the computational complexity of the hybrid implementation. The confNN algorithm is likely to perform worse than the kNN algorithm owing to the frequent calculation of the confidence value; since the pure confNN algorithm does no preprocessing, all of its computational cost falls in the classification phase. Thus, it would be ideal if a hybrid implementation of confNN and camNN kept the statistical confidence calculations in the training phase, where they do not impact classification time.

9. References

[1] Duda, R. O., P. E. Hart, et al. (2001). Pattern Classification. New York: Wiley.
[2] Wang, J., P. Neskovic, et al. (2006). "Neighborhood size selection in the k-nearest-neighbor rule using statistical confidence." Pattern Recognition 39(3): 417-423.
[3] Zhou, C. Y. and Y. Q. Chen (2006). "Improving nearest neighbor classification with cam weighted distance." Pattern Recognition 39(4): 635-645.
