
International Journal of Computational Geometry & Applications Vol. 15, No. 2 (2005) 101-150 © World Scientific Publishing Company

GEOMETRIC PROXIMITY GRAPHS FOR IMPROVING NEAREST NEIGHBOR METHODS IN INSTANCE-BASED LEARNING AND DATA MINING*

GODFRIED TOUSSAINT

School of Computer Science, McGill University 3480 University St., McConnell Eng. Building, Room 318

Montréal, Quebec H3A 2A7, Canada [email protected]

Received 15 March 2003 Revised 5 November 2004

Communicated by Marina L. Gavrilova, Guest Editor

ABSTRACT

In the typical nonparametric approach to classification in instance-based learning and data mining, random data (the training set of patterns) are collected and used to design a decision rule (classifier). One of the most well known such rules is the k-nearest-neighbor decision rule (also known as lazy learning) in which an unknown pattern is classified into the majority class among its k nearest neighbors in the training set. Several questions related to this rule have received considerable attention over the years. Such questions include the following. How can the storage of the training set be reduced without degrading the performance of the decision rule? How should the reduced training set be selected to represent the different classes? How large should k be? How should the value of k be chosen? Should all k neighbors be equally weighted when used to decide the class of an unknown pattern? If not, how should the weights be chosen? Should all the features (attributes) be weighted equally, and if not, how should the feature weights be chosen? What distance metric should be used? How can the rule be made robust to overlapping classes or noise present in the training data? How can the rule be made invariant to scaling of the measurements? How can the nearest neighbors of a new point be computed efficiently? What is the smallest neural network that can implement nearest neighbor decision rules? Geometric proximity graphs such as Voronoi diagrams and their many relatives provide elegant solutions to these problems, as well as other related data mining problems such as outlier detection. After a non-exhaustive review of some of the classical canonical approaches to these problems, the methods that use proximity graphs are discussed, some new observations are made, and open problems are listed.

Keywords: Instance-based learning; proximity graphs; nearest-neighbor methods; data mining.

"This research was supported by NSERC and FCAR.



1. Nearest-Neighbor Decision Rules

In the typical non-parametric classification problem (see Aha4, Devroye, Gyorfi and Lugosi,79 Duda and Hart,82 Duda, Hart and Stork,83 McLachlan,167 O'Rourke and Toussaint183) we have available a set of d measurements or observations (also called a feature vector) taken from each member of a data set of n objects (patterns) denoted by {X, Y} = {(X1, Y1), (X2, Y2), ..., (Xn, Yn)}, where Xi and Yi denote, respectively, the feature vector of the ith object and the class label of that object. One of the most attractive decision procedures, conceived by Fix and Hodges in 1951, is the nearest-neighbor rule (1-NN-rule).91 Let Z be a new pattern (feature vector) to be classified and let Xj be the feature vector in {X, Y} = {(X1, Y1), (X2, Y2), ..., (Xn, Yn)} closest to Z. The nearest neighbor decision rule classifies the unknown pattern Z into class Yj. Figure 1 depicts the decision boundary of the 1-NN-rule. The feature space is partitioned into convex polyhedra (polygons in the plane). This partitioning is called the Voronoi diagram. Each pattern (Xi, Yi) in {X, Y} is surrounded by its Voronoi polyhedron consisting of those points in the feature space closer to (Xi, Yi) than to (Xj, Yj) for all j ≠ i. The 1-NN-rule classifies a new pattern Z that falls into the Voronoi polyhedron of pattern Xj into class Yj. Therefore the decision boundary of the 1-NN-rule is determined by those portions of the Voronoi diagram that separate patterns belonging to different classes. In the example depicted in Figure 1 the decision boundary is shown in bold lines and the resulting decision region of one class is shaded.
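To make the rule concrete, the following minimal sketch (in Python, with illustrative names such as nn_classify; it is not taken from the paper) classifies a query point by a brute-force scan for its nearest neighbor under the Euclidean metric.

```python
import numpy as np

def nn_classify(train_X, train_y, z):
    """Classify query z with the 1-NN rule: return the label of the
    closest training vector under the Euclidean metric."""
    train_X = np.asarray(train_X, dtype=float)
    dists = np.linalg.norm(train_X - np.asarray(z, dtype=float), axis=1)
    return train_y[int(np.argmin(dists))]

# Toy example: two classes in the plane.
X = np.array([[-1.0, 0.2], [-0.8, -0.1], [1.1, 0.0], [0.9, 0.3]])
y = np.array([0, 0, 1, 1])
print(nn_classify(X, y, [0.2, 0.1]))   # closest point is (0.9, 0.3), so class 1
```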

A key feature of this decision rule (also called lazy learning,4 instance-based learning,5,36 and memory-based reasoning236) is that it performs remarkably well considering that no explicit knowledge of the underlying distributions of the data is used. Consider for example the two-class problem and denote the a priori probabilities of the two classes by P(C1) and P(C2), the a posteriori probabilities by P(C1|X) and P(C2|X), and the mixture probability density function by

p(X) = P(C1) p(X|C1) + P(C2) p(X|C2)    (1)

where p(X|Ci) is the class-conditional probability density function given class Ci, i = 1, 2. In 1967 Cover and Hart57 showed, under some continuity assumptions on the underlying distributions, that the asymptotic error rate of the 1-NN rule, denoted by Pe[1-NN], is given by

Pe[1-NN] = 2 ∫ p(X) P(C1|X) P(C2|X) dX    (2)

They also showed that Pe[1-NN] is bounded from above by twice the Bayes error (the error of the best possible rule). More precisely, and for the more general case of M pattern classes, the bounds proved by Cover and Hart57 are given by:

Pe ≤ Pe[1-NN] ≤ Pe (2 - M Pe/(M - 1))    (3)

where Pe is the optimal Bayes probability of error given by:

Pe = 1 - ∫ p(X) max{P(C1|X), P(C2|X), ..., P(CM|X)} dX,    (4)


Fig. 1. The nearest neighbor decision boundary is a subset of the Voronoi diagram.

and

Pe[1-NN] = 1 - ∫ p(X) [ Σ_{i=1}^{M} P(Ci|X)² ] dX.    (5)

Stone238 and Devroye76 generalized these results by proving the bounds for all distributions. These bounds imply that the nearest neighbor of Z contains at least half of the total discrimination information contained in an infinite-size training set. Furthermore, a simple generalization of this rule called the k-NN-rule, in which a new pattern Z is classified into the class with the most members present among the k nearest neighbors of Z in {X, Y}, can be used to obtain good estimates of the Bayes error (Fukunaga and Hostetler98) and its probability of error asymptotically approaches the Bayes error (Devroye et al.79).
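As a quick numerical illustration of inequality (3) (a sketch added here, not part of the original text): for M = 2 classes and a Bayes error of Pe = 0.1, the upper bound evaluates to 0.18, a little below twice the Bayes error.

```python
def cover_hart_upper_bound(bayes_error, num_classes):
    """Right-hand side of inequality (3): Pe * (2 - M*Pe/(M - 1))."""
    M = num_classes
    return bayes_error * (2.0 - M * bayes_error / (M - 1))

print(cover_hart_upper_bound(0.10, 2))   # approximately 0.18
```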

Another simple and attractive special case of the k-NN-rule is the 2-NN-rule with a reject option. Let Z be a new pattern to be classified and let Xs and Xt be the two nearest neighbors of Z among {X, Y} = {(X1, Y1), (X2, Y2), ..., (Xn, Yn)}. Here if both Xs and Xt belong to the same class, say Ci, then Z is classified into class Ci. If Xs and Xt belong to different classes then Z is rejected (no decision is made). This rule was suggested by Hellman, who also showed that its asymptotic error rate was at most equal to the Bayes error.115 Thus this rejection rule reduces the upper bound on the asymptotic error of the 1-NN-rule by one half. More recently another rejection rule has been proposed that in practice reduces the error


by one half while keeping the number of rejections reasonably low (Decaestecker and Van de Merckt70). There is also a 2-NN-rule without a reject option where a tie is broken randomly. Djouadi80 shows that this rule yields a good finite-sample-size estimate of Pe[1-NN].
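A minimal sketch of the 2-NN rule with a reject option described above (Python; the function name and the handling of exact distance ties are choices of this sketch, not of the cited papers):

```python
import numpy as np

def two_nn_with_reject(train_X, train_y, z):
    """Return the common label of the two nearest neighbors of z,
    or None (reject) when the two neighbors disagree."""
    d = np.linalg.norm(np.asarray(train_X, float) - np.asarray(z, float), axis=1)
    first, second = np.argsort(d)[:2]
    return train_y[first] if train_y[first] == train_y[second] else None
```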

The measure Pe[1-NN] turns up in a surprising variety of related problems, sometimes in disguise. For example, it is also the error rate of the 1-NN rule based on ranks (Dasgupta and Lin65), as well as the error rate of the proportional prediction randomized decision rule considered by Goodman and Kruskal105 (see also Toussaint253). Breiman et al.32 refer to it as the Gini diversity index. Devijver and Kittler73 and Vajda269 refer to it as the quadratic entropy. Mathai and Rathie165 call it the harmonic mean coefficient. It is also closely related to the Bayesian distance (Devijver72) and the quadratic mutual information (Toussaint249). Incidentally, the Bayesian distance is called the cross-category feature importance in the instance-based learning literature (Stanfill and Waltz,236 Creecy et al.58). Furthermore, it is identical to the asymptotic probability of correct classification of the 1-NN-rule given by Pc[1-NN] = 1 - Pe[1-NN]. The error probability Pe[1-NN] also shares a property with Shannon's measure of equivocation. Both are special cases of the equivocation of order β (Toussaint251,256). The reader is referred to Breiman et al.32 for further interpretations of the measure Pe[1-NN] in the statistics literature.

There is a vast literature on the subject of nearest neighbor classification which will not be reviewed here. The interested reader is referred to the comprehensive treatment by Devroye, Gyorfi and Lugosi79 and the collected papers in the 1991 volume edited by Dasarathy.60 For more on the information measures closely related to the measure Pe[1-NN] the reader is referred to Mathai and Rathie165 (see also Toussaint252,255).

Since its conception, many pattern recognition practitioners have unfairly criticized the 1-NN-rule on the grounds of several mistaken assumptions. These mistaken assumptions are: (1) that all the data {X, Y} must be stored in order to implement such a rule; (2) that to determine the nearest neighbor of a pattern to be classified, distances must be computed between the unknown vector Z and all members of {X, Y}; and (3) that such nearest neighbor rules are not well suited for fast parallel implementation in the form of neural networks. As we shall see in the following, all three of these assumptions are incorrect, and computational geometric progress in the 1980's and 1990's, along with faster and cheaper hardware, has made the k-NN-rules a practical reality for pattern recognition applications in the 21st century.

In practice the size of the training set {X, Y} is not infinite. This raises two fundamental questions of both practical and theoretical interest. How fast does the error rate Pe[k-NN] approach the Bayes error Pe as n approaches infinity, and what is the finite-sample performance of the k-NN-rule (Psaltis, Snapp and Venkatesh,197 Kulkarni, Lugosi and Venkatesh146)? These questions have in turn generated a variety of additional problems concerning several aspects of k-NN-rules in practice.


Such problems include the following. How can the storage of the training set be reduced without degrading the performance of the decision rule? How should the reduced training set be selected to represent the different classes? How large should k be? How should a value of k be chosen? Should all k neighbors be equally weighted when used to decide the class of an unknown pattern? If not, how should the weights be chosen? Should all the features (attributes) be weighted equally, and if not, how should the feature weights be chosen? Which distance metric should be used? How can the rule be made robust to overlapping classes or noise present in the training data? How can the rule be made invariant to scaling of the measurements? How can the nearest neighbors of a new point be computed efficiently? What is the smallest neural network that can implement nearest neighbor decision rules? Geometric proximity graphs such as Voronoi diagrams and their many relatives provide elegant solutions to these problems, as well as other related problems such as outlier detection. After a brief and non-exhaustive review of some of the classical canonical approaches to these problems in order to provide some context, the methods that use proximity graphs are discussed, some new observations are made, and avenues for further research are proposed.

2. Reducing the Size of the Stored Training Data

2.1. Hart's condensed rule and its relatives

In 1968 Hart was the first to propose an algorithm for reducing the size of the stored data for the nearest neighbor decision rule.109 Hart defined a consistent subset of the data as one that classified the remaining data correctly with the nearest neighbor rule. He then proposed an algorithm for selecting a consistent subset by heuristically searching for data that were near the decision boundary. The algorithm is very simple. Let C denote the desired final consistent subset. Initially C is empty. First a random element from {X, Y} is transferred to C. Then C is used as a classifier with the 1-NN rule to classify all the remaining data in {X, Y}. During this scan of {X, Y}, whenever an element is incorrectly classified by C it is transferred from {X, Y} to C. Thus {X, Y} is shrinking and C is growing. This scan of {X, Y} is repeated as long as at least one element is transferred from {X, Y} to C during a complete pass of the remaining data in {X, Y}. The goal of the algorithm is to keep only a subset of the data {X, Y} that represents well the decision boundary of the entire training data {X, Y}. This algorithm has inspired researchers to design other algorithms to explicitly select instances on the boundary of a set (see Porter and Liu193). Such feature vectors are also called support vectors because they "support" the decision boundary (Vapnik270). The motivation for this heuristic is the intuition that data far from the decision boundary are not needed and that if an element is misclassified it must lie close to the decision boundary. By construction the resulting reduced set C classifies all the training data {X, Y} correctly and hence it is referred to here as a training-set consistent subset. In the literature Hart's algorithm is called CNN and the resulting subset of {X, Y} is called a consistent


subset. Here the longer term training-set consistent is used in order to distinguish it from another interesting type of subset: one that determines exactly the same decision boundary as the entire training set {X, Y}. The latter kind of subset will be called decision-boundary consistent. Clearly decision-boundary consistency implies training-set consistency but the converse is not necessarily true. Empirical results have shown that Hart's CNN rule considerably reduces the size of the training set and does not greatly degrade performance on a separate testing (validation) set. It is also easy to see that using a naive brute-force algorithm the complexity of computing the condensed subset of {X, Y} is O(dn³), where d is the dimension (number of measurements) of X. However, the method does not in general yield a minimal-size consistent subset and unfortunately may change the decision boundary of the original training set. Recently several theoretical results on CNN have been obtained by Devroye et al.79
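The following sketch illustrates the control flow of Hart's CNN procedure as described above (Python, with a brute-force 1-NN search inside the scan; variable names are illustrative and no attempt is made at efficiency):

```python
import numpy as np

def nn_label(C_X, C_y, x):
    """Label of the 1-NN of x among the current condensed set."""
    d = np.linalg.norm(np.asarray(C_X, float) - np.asarray(x, float), axis=1)
    return C_y[int(np.argmin(d))]

def cnn_condense(X, y, seed=0):
    """Hart's CNN: grow a training-set consistent subset C by repeatedly
    transferring every point that the current C misclassifies."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, float), np.asarray(y)
    remaining = list(rng.permutation(len(X)))
    C = [remaining.pop(0)]                    # start with one random element
    changed = True
    while changed and remaining:
        changed = False
        for i in list(remaining):             # one complete pass over the rest
            if nn_label(X[C], y[C], X[i]) != y[i]:
                remaining.remove(i)           # transfer misclassified point to C
                C.append(i)
                changed = True
    return np.sort(C)                         # indices of the condensed subset
```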

In 1987 Kibler and Aha134 proposed an algorithm called the growth-additive algorithm, which consists of only one pass of Hart's CNN rule. Such an algorithm of course may not discard as much data. On the other hand, a naive implementation of it runs in O(dn²) worst-case time.

It is clear from the preceding description that since CNN starts with a random element of {X, Y} transferred to C, the final subset C not only depends on the order in which {X, Y} is processed, but also may contain elements that are far from the decision boundary. Furthermore, points in C far from the decision boundary, selected for inclusion in C early in the execution of CNN, may be classified correctly by a new C in which they are absent. In an attempt to address these issues and further reduce the size of the consistent subset, Gates100 proposed what he called the reduced nearest neighbor rule (RNN). The RNN rule consists of first performing CNN and then following with a post-processing step in which elements of C are visited and deleted from C if their deletion does not result in misclassifying any elements in {X, Y}. Experimental results confirmed that RNN yields a slightly smaller training-set consistent subset of {X, Y} than that obtained with CNN.100

Tomek244 also argued that CNN keeps too many points that are not near the decision boundary because of its arbitrary initialization step. To combat this problem he proposed a modification of CNN in which a preliminary pass of {X, Y} is made to select an order-independent special subset of {X, Y} that lies close to the decision boundary. After this preprocessing step his method proceeds in the same manner as CNN but instead of processing {X, Y} it works on the special subset so preselected. The algorithm to preselect the special subset of {X, Y} consists of keeping all pairs of points (Xi, Yi), (Xj, Yj) such that Yi ≠ Yj (the two points belong to different classes) and the diametral sphere determined by Xi and Xj does not contain any points of {X, Y} in its interior. (The diametral sphere determined by two points Xi and Xj is the smallest sphere containing Xi and Xj, i.e., the sphere with center at (Xi + Xj)/2 and diameter d(Xi, Xj).) Such pairs are often called Tomek links in the literature. Clearly, pairs of points far from the decision boundary will tend to have other points in the interior of their diametral sphere.


It is claimed in Ref. [244] that the resulting subset of {X, Y} is training-set consistent. However, Toussaint263 demonstrated a counter-example. It should be noted that Tomek's preselected non-consistent subset using the diametral sphere test implicitly computes a subgraph of the Gabriel graph130 of {X, Y}, a proximity graph admirably suited for condensing the training data that will be discussed in the following sections. More recently Karagali and Krim132 rediscovered Hart's idea of keeping points close to the decision boundary, and they explicitly compute closest pairs of training patterns that belong to different classes for inclusion in the reduced training set.
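A brute-force sketch of the diametral-sphere test described above (Python; the quadratic pair enumeration is a choice of this sketch): a pair of points with different labels is retained when no other training point lies strictly inside the sphere whose diameter is the segment joining them, which is also the Gabriel-graph edge condition restricted to pairs from different classes.

```python
import numpy as np

def diametral_sphere_pairs(X, y):
    """Return index pairs (i, j) with different labels whose diametral
    sphere contains no other training point in its interior."""
    X = np.asarray(X, float)
    n, links = len(X), []
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j]:
                continue
            center = (X[i] + X[j]) / 2.0
            radius2 = np.sum((X[i] - X[j]) ** 2) / 4.0
            inside = np.sum((X - center) ** 2, axis=1) < radius2
            inside[[i, j]] = False          # the pair itself does not count
            if not inside.any():
                links.append((i, j))
    return links
```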

In 2000 an original and significantly different variant of Hart's condensed nearest neighbor rule was proposed by Baram18; it is inspired by local Voronoi diagrams although it does not compute them explicitly. Baram calls these minimum distance local separators. His algorithm consists of only a single pass of the data and in this respect resembles the algorithm of Kibler and Aha,134 and thus runs in O(dn²) time. There are two important points in which Baram's algorithm is different. First, each class is condensed separately and at the end the condensed sets for each class are combined to form the final classifier. Second, elements considered for addition to the condensed set are compared to all elements of different classes. Let {X, Ci} denote the training data for class Ci and consider computing the condensed set for Ci. The condensed set is initialized with a random element of {X, Ci}. The remaining elements of {X, Ci} are visited sequentially and each is added to the condensed set if it lies closer to an element of {X, Y} - {X, Ci} than to an element of the present condensed set of class Ci. Otherwise it is discarded from further consideration. Note that this is equivalent to retaining the points of class Ci if they are misclassified by a 1-NN classifier that consists of the union of (1) the growing condensed set for class Ci and (2) the set {X, Y} - {X, Ci}. Also note that this procedure yields a training-set consistent subset as well.

Another variant of Hart's method that discards instances in the "interior" of a group of instances belonging to one class considers one dimension (coordinate axis) at a time in the context of axis-parallel classification rules. Aguilar, Riquelme and Toro2,204 propose projecting and sorting {X, Y} on each axis, determining on each axis which instances are not next to instances of different classes, and discarding those instances that have this property in all d coordinate axes. This method does not depend on the order of processing the data and is simple to compute in O(dn log n) worst-case time.

Many other variants of Hart's original condensed nearest neighbor rule have been proposed in an attempt to find small (even non-consistent) subsets that have good generalizing capability. In pattern recognition these techniques are usually called condensing techniques. In the instance-based learning literature they are called competence preservation methods within the class of instance selection algorithms (Brighton and Mellish36). However, a large fraction of papers have concentrated on obtaining good performance on the training data rather than on separate testing


data. The reader is referred to Nock and Sebban,179 Dasarathy et al.,64 Brighton and Mellish,36 Liu and Motoda158 and Aha et al.60 for further references and in-depth comparative studies of many such techniques.

2.2. Order-independent subsets

CNN, RNN and Tomek's modification of CNN all have the undesirable property that the resulting reduced consistent subsets are a function of the order in which the data are processed. Several attempts have been made to obtain training-set consistent subsets that are less sensitive to the order of presentation of the data. One class of methods with this goal, suggested by Alpaydin,8 applies the preceding methods several times (processing the data in a different random order each time) to obtain a group of training-set consistent subsets. Then a voting technique among these groups is used to make the final decision. Kubat and Cooperson145 obtain just three small condensed training subsets and take a vote with the 1-NN rule used on the three subsets. Such voting techniques among a group of classifiers are also called boosting when the explicit goal is to combine several poor quality classifiers to improve the performance over one good classifier (Sebban et al.223). Another approach to minimize order-dependence, used by Salzberg,208 starts off with a small random batch of data rather than a single point as the preceding methods do.

A successful solution to obtaining order-independent training-set consistent subsets by generalizing Hart's CNN procedure was proposed by Devi and Murty.71

Recall that in Hart's procedure the subset C starts with a single random element from {X, Y} and subsequently each time an element from {X, Y} is misclassified it is transferred to C. In other words transfers are made one at a time and class membership is not an issue. In contrast, the method of Devi and Murty,71 which they call the modified condensed nearest neighbor rule (MCNN), initializes the reduced set (call it MC) by transferring, in batch mode, one representative of each class from {X, Y} to MC. Subsequently MC is used to classify all the remaining elements of {X, Y}. Then from each class of the resulting misclassified patterns a representative is transferred to MC (again in batch mode). This process is repeated until all the remaining patterns in {X, Y} are classified correctly. Note that if at some stage there is a class, say Ci, that has no misclassified patterns using MC, then no representative of that class is transferred from {X, Y} to MC at that stage. Hence the most difficult classes (the last ones to be completely correctly classified) receive more representatives in MC. Thus this approach provides a natural way to automatically decide how many representatives each class should be allotted and how they should be distributed. However, the proximity graph methods to be discussed in the following are superior in this respect.
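A simplified sketch of the batch MCNN idea just described (Python). Taking the first point of each class as its initial representative, and the first misclassified point per class in each pass, are simplifications of this sketch; Devi and Murty select representatives more carefully.

```python
import numpy as np

def mcnn_condense(X, y):
    """Batch condensing in the spirit of MCNN: add one misclassified
    representative per class per pass until the whole training set is
    classified correctly by the 1-NN rule on the reduced set MC."""
    X, y = np.asarray(X, float), np.asarray(y)
    classes = np.unique(y)
    MC = [int(np.flatnonzero(y == c)[0]) for c in classes]   # one seed per class
    while True:
        d = np.linalg.norm(X[:, None, :] - X[MC][None, :, :], axis=2)
        pred = y[np.asarray(MC)[np.argmin(d, axis=1)]]
        wrong = np.flatnonzero((pred != y) & ~np.isin(np.arange(len(X)), MC))
        if wrong.size == 0:
            return np.sort(MC)
        for c in classes:                     # batch mode: at most one new
            cand = wrong[y[wrong] == c]       # representative per class
            if cand.size:
                MC.append(int(cand[0]))
```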

2.3. Minimal size training-set consistent subsets

The first researchers to deal with computing a minimal-size training-set consistent subset were Ritter et al.205 They proposed a procedure they called a selective


nearest neighbor rule (SNN) to obtain a minimal-size training-set consistent subset of {X, Y}, call it S, with one additional property that Hart's CNN does not have. Any training-set consistent subset C obtained by CNN has the property that every element of {X, Y} is nearer to an element in C of the same class than to any element in C of a different class. On the other hand, the training-set consistent subset S of Ritter et al.205 has the additional property that every element of {X, Y} is nearer to an element in S of the same class than to any element of a different class in the complete set {X, Y}. This additional property of SNN tends to keep points closer to the decision boundary than does CNN. The additional property allows Ritter et al.205

to compute the selected subset S without testing all possible subsets of {X, Y}. Nevertheless, their algorithm still runs in time exponential in n in the worst case (see Wilfong276). However, Wilson and Martinez278 and Wilson280 claim that the average running time of SNN is O(dn³). Furthermore, experimental results indicate that the resulting cardinality of S is about the same as that of the reduced nearest neighbor consistent subsets of Gates.100 Hence the heavy computational burden of SNN makes it uncompetitive with RNN.

In 1994 Dasarathy61 proposed a complicated algorithm intended to compute a minimal-size training-set consistent subset but did not provide a proof of optimality. The algorithm uses a subset of {X, Y} that he calls the Nearest Unlike Neighbor (NUN) subset.62 Given an element Xi of {X, Y}, the element of {X, Y} closest to Xi but belonging to a different class is called the nearest unlike neighbor of Xi. The NUN subset consists of all points in {X, Y} that are nearest unlike neighbors of one or more elements of {X, Y}. The algorithm yields a training-set consistent subset of {X, Y} which he calls the MCS (Minimal Consistent Subset). Extensive experiments led him to conjecture that his algorithm generated an MCS that was indeed of minimal size. However, counter-examples to this claim have been found by Kuncheva and Bezdek,148 Cerveron and Fuertes46 and Zhang and Sun.287

Wilson and Martinez278 rediscovered the idea of using the nearest unlike neighbors to reduce the size of the training-set consistent subsets. They call the training-set condensing algorithms "instance pruning techniques" and refer to the nearest unlike neighbor as the nearest enemy. They also propose three algorithms for computing training-set consistent subsets. Several other similar algorithms can be found in the literature on instance-based and lazy learning (Mantaras and Armengol,68

Aha, Kibler and Albert5 and Aha3,4).

Wilfong276 showed in 1991 that the problem of finding the smallest-size training-set consistent subset is NP-complete when there are three or more classes. Furthermore, he showed that even for only two classes the problem of finding the smallest-size training-set consistent selective subset (Ritter et al.205) is also NP-complete. For more theoretical results on the complexity of computing optimal subsets of prototypes see Blum and Langley29 and Nock and Sebban.180


2.4. Prototype generation methods

The techniques discussed in the preceding have in common that they select a subset of the training set as the final classifier. There exists also a class of techniques that do not have this restriction when searching for a good set of prototypes. In the pattern recognition literature these methods are most often called prototype generation or replacement methods, although they are sometimes referred to as abstraction techniques (Lam et al.,151 Sanchez210). In the instance-based learning literature, as well as the data mining and knowledge discovery literature, they are often called instance construction methods within the class of instance selection algorithms (Reinartz,202 Madigan, Raghavan, DuMouchel, Nason, Posse and Ridgeway,161 Brighton and Mellish,35 Sen and Knight,225 Barreis21). They are also sometimes called instance averaging (Kibler and Aha,135 Keung and Lam133). One of the first such algorithms, proposed in 1974 by Chang,47 repeatedly merges the two nearest neighbors of the same class as long as this merger does not increase the error rate on the training set. This technique yields a training-set consistent set of prototypes that is reasonably small and is independent of the order in which the data are processed, but is not a subset of the original training set. One drawback of Chang's method is that it may yield prototypes that do not characterize the training set well in terms of generalization. To combat this Mollineda et al.172 modified Chang's algorithm to merge clusters (rather than pairs of data points) based on several geometric criteria. Thus the technique resembles hierarchical bottom-up clustering guided by a constraint on the resulting error rate on the training set. Bezdek et al.25 proposed another modification of Chang's method and demonstrated that it produced a smaller set of prototypes for the well known Iris data set.
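The following sketch conveys the flavor of Chang-style prototype merging (Python): repeatedly replace the closest same-class pair of prototypes by their midpoint, keeping a merge only if the resulting prototype set still classifies the whole training set correctly with the 1-NN rule. The unweighted midpoint and the greedy search order are simplifications of this sketch rather than details of Chang's algorithm.

```python
import numpy as np

def nn_predict(P, Py, X):
    """1-NN predictions for the rows of X given prototypes (P, Py)."""
    d = np.linalg.norm(X[:, None, :] - P[None, :, :], axis=2)
    return Py[np.argmin(d, axis=1)]

def chang_like_merge(X, y):
    """Greedily merge same-class prototype pairs while the prototype set
    remains training-set consistent."""
    X, y = np.asarray(X, float), np.asarray(y)
    P, Py = X.copy(), y.copy()
    merged = True
    while merged and len(P) > 1:
        merged = False
        d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
        d[np.arange(len(P)), np.arange(len(P))] = np.inf
        d[Py[:, None] != Py[None, :]] = np.inf       # only same-class pairs
        for flat in np.argsort(d, axis=None):        # closest candidates first
            i, j = np.unravel_index(flat, d.shape)
            if not np.isfinite(d[i, j]):
                break
            trialP = np.vstack([np.delete(P, [i, j], axis=0),
                                (P[i] + P[j]) / 2.0])
            trialy = np.append(np.delete(Py, [i, j]), Py[i])
            if np.all(nn_predict(trialP, trialy, X) == y):
                P, Py, merged = trialP, trialy, True
                break
    return P, Py
```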

Salzberg et al.209 proposed a novel and more extreme version of the preceding methods which focuses on the desired decision boundary and ignores the training set. They design a minimal-size set of prototypes to realize the desired decision boundary by synthesizing best-case prototypes. A related approach by Zhang288 looks for typical prototypes. Such methods are sometimes called compression methods, as distinguished from the filtering methods that use a subset of the original training set (Brighton and Mellish36).

Once we decide that we need not restrict the condensed set to consist of a subset of the training data we can bring the vast literatures of cluster analysis and vector quantization to bear on the problem. For example, Datta and Kibler66 use the k-means clustering algorithm to group the training samples of a class into clusters and then use the means of these clusters as the prototypes. Yen et al.285 combine vector quantization with incremental generation of prototypes where they are needed most, until the desired classification accuracy is achieved. These areas are beyond the scope of this study and will not be reviewed here.

One can of course combine the filtering approaches with the averaging methods to obtain hybrid methods (Lam et al.151). In a variant of this approach Kim and


Oommen136 first use a filtering method to condense the training set, and then use a vector-quantization technique to adjust the position of the selected prototypes.

Sanchez210 proposes a two-stage method in which the data are first recursively partitioned into two sets determined by the perpendicular bisector of the points that determine the diameter of the set, until the desired number of subsets is obtained. In stage two, the centroid of each subset is then used as the abstracted prototype representing that subset.

2.5. Selecting good representatives

There exist several families of data reduction methods that have more general goals than the methods discussed in the preceding. These methods try to select a subset of the training data that represents the original set as closely as possible in some sense. Of course such representations are very much a function of the intended application. One of the earlier methods, due to Fukunaga and Mantock, uses nearest neighbors and an information measure to reduce the data with the goal of approximating the underlying probability density function. For a more recent paper along these lines see Chaudhuri, Murthy and Chaudhuri.49 An extreme version of this approach is to use the median point of the data in each class. Here a suitable generalization of the median to high dimensions is required. This approach is not unlike the well established area in statistics concerned with robust estimators of location. One of the most well known generalizations of the median to higher dimensions is the convex hull peeling approach suggested by Tukey.266 Given a set of points belonging to a given class, points on the convex hull are successively discarded (in layers) until either a single point (the median) remains, or the convex hull contains no points in its interior. In the latter case the median is defined as the arithmetic mean of the remaining convex hull points. Thus this approach may use either original data points or generated prototypes. This area of statistics is broad and beyond the scope of this study. The reader is referred to the papers by Aloupis et al. for further references to this literature.7,6

In another context Mico et al.170 and Moreno-Seco et al.173 select representative sets they call base prototypes with the goal of reducing the number of distances computed in the k-nearest neighbor classification rule. Such prototypes are frequently called pivots as well. Bustos et al.42 compare several pivot selection techniques.

A related problem in yet another quite different context is that of simplifying dot density maps in cartography (see the paper by de Berg, Bose, Cheong and Morin).67 Here the goal is to visually represent well the true underlying density of the data.

2.6. Detecting outliers

The complementary problem of selecting good representatives of a training set of data is the selection of bad representatives or outliers.112,20 Nonparametric methods for the identification of outliers frequently use distances from a candidate to its


nearest neighbors. Three popular definitions for such outliers are listed by Bay and Schwabacher.22 Knorr et al.139 consider a pattern X to be an outlier if at most k other patterns have distance less than d from X. Ramaswamy et al.200 classify a pattern X as an outlier if it is ranked among the top n patterns according to the greatest distance from X to its k-th nearest neighbor. Finally, Angiulli and Pizzuti11 and Eskin et al.90 define X to be an outlier if it is ranked among the top n patterns according to the greatest sum of distances from X to its k nearest neighbors.
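The three nearest-neighbor-based outlier scores just listed can be sketched as follows (Python, brute-force distance matrix; the function names and parameter choices are illustrative):

```python
import numpy as np

def knn_distances(X, k):
    """Sorted distances from each point to its k nearest neighbors
    (excluding the point itself)."""
    X = np.asarray(X, float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.sort(d, axis=1)[:, :k]

def knorr_outliers(X, k, dist):
    """Knorr et al.: a point is an outlier if at most k other points lie
    within distance dist of it."""
    X = np.asarray(X, float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.flatnonzero((d < dist).sum(axis=1) <= k)

def ranking_scores(X, k):
    """Distance to the k-th nearest neighbor (Ramaswamy et al.) and sum of
    the k nearest-neighbor distances (Angiulli and Pizzuti; Eskin et al.);
    the top-ranked patterns under either score are flagged as outliers."""
    nd = knn_distances(X, k)
    return nd[:, -1], nd.sum(axis=1)
```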

The preceding methods may be considered traditional in the sense that they define an outlier without regard to class membership. However, in the context of supervised learning (where data have class labels attached to them) it makes sense to define outliers by taking such information into account. He et al.113 suggest several methods for doing this. A related method is the Paella algorithm of Limas et al.153 Although the data are not class labelled, the Paella algorithm first performs a cluster analysis of the data, and subsequently the metric used to decide whether a pattern is an outlier incorporates information about the resulting clusters.

2.7. Optimization methods

There have been many approaches that use approximate optimization techniques to find a subset close to the smallest-size training-set consistent subset. Such methods include tabu search (Cerveron and Ferri,45 Zhang and Sun287), gradient descent and deterministic annealing (Decaestecker69), genetic algorithms (Ho, Liu and Liu,117 Kuncheva,147 Kuncheva and Jain,149 Skalak,228 Chang and Lippmann48), evolutionary learning (Zhao and Higuchi,290 Zhao289), bootstrapping (Saradhi and Murty215) and other random search techniques (Lipowezky,156 Skalak229). Alternately, in finite-memory versions of the problem, first the size of the desired prototype subset is fixed and subsequently the error rate is minimized under this constraint (Bermejo and Cabestany,24 Smyth and Keane,232 Markovitch and Scott163). Liu and Nakagawa157 recently compared 11 optimization methods to each other. Unfortunately they did not compare those techniques to the proximity-graph methods (Toussaint, Bhattacharya and Poulsen245) to be discussed in the following.

2.8. Decision-boundary generation methods

Consider two elements (Xi, Yi) and (Xj, Yj) in {X, Y} such that Yi ≠ Yj. If the two points are used in the 1-NN rule they implement a linear decision boundary which is the hyperplane that orthogonally bisects the line segment joining (Xi, Yi) and (Xj, Yj). Thus when a subset of {X, Y} is being selected in the preceding methods, the hyperplanes are being chosen implicitly. However, we could just as well be selecting these hyperplanes explicitly.

When classifiers are designed by manipulating more hyperplanes than there are pattern classes they are called piece-wise linear classifiers (Sklansky and Michelotti230). There is a vast field devoted to this problem which is beyond the


scope of this study and the interested reader is referred to the book by Nilsson.178

One can also generate non-parametric decision boundaries with other surfaces besides hyperplanes. Priebe et al.,196,195 for example, model the decision surfaces with balls.

2.9. Support-vector machine methods

The feature vectors in {X, Y} that play a key role in defining the decision boundary of a classifier are also called support vectors by some researchers because they "support" the decision boundary (Vapnik,270 Syed, Liu and Sung240). Traditional optimization methods that compute a support-vector subset of {X, Y} may yield subsets that contain redundant or non-essential support vectors. Burges40 and Burges and Schoelkopf41 propose methods that yield smaller support-vector subsets by approximating them with fewer vectors that are not necessarily members of {X, Y}, similar to the prototype generation (replacement, abstraction) methods described in the preceding discussion. Downs, Gates and Masters81 on the other hand propose a method for eliminating support vectors that are linearly dependent in feature space without affecting the solution. Vincent and Bengio271 have shown that a modified nearest neighbor rule that incorporates a smoothing mechanism boosts the performance above the level of traditional support vector machines. Kim and Oommen136 showed empirically that support vector machines can be improved by using the 1-NN rule after adjusting the position of the support vectors obtained from traditional algorithms.

3. Editing the Training Data to Improve Performance

Consider a two-class problem with Gaussian distributions and uncorrelated bivariate features with equal covariance matrices such that the mean of p(X|C1) is (-1, 0) and the mean of p(X|C2) is (1, 0). Assume the a priori probabilities are equal. Then we know that the optimal (Bayes) decision boundary is the y-axis. This situation is illustrated in Figure 2, where m1 and m2 are the means of the two distributions.

Now assume we have a random finite sample from these distributions as our training set {X, Y} and use the 1-NN decision rule instead. Of course, if we knew the data came from these Gaussian distributions we would be stupid to use the nearest neighbor rule. However, this thought experiment provides insight into the most important behavioral property of the 1-NN rule. With high probability (depending on the size of the training set and the amount of overlap between p(X|C1) and p(X|C2)) there will be data points of p(X|C1) in the positive x-half-space, and points of p(X|C2) in the negative x-half-space, as shown in Figure 3. Therefore the 1-NN decision boundary will contain "islands" of feature space assigned to the wrong class and it may look quite different from the optimal decision boundary, as depicted in Figure 3.

This suggests that the decision rule could be improved if data near the optimal decision boundary were smoothed in some way. One possible way to smooth the


Fig. 2. A random sample of data from Gaussian distributions and the optimal Bayes decision boundary.

data is to change the label of a point if the majority of its Voronoi neighbors belong to another class. This operation is used for smoothing images and is called salt-and-pepper smoothing when using a unanimous vote instead of a majority vote. The result of this operation on the data of Figure 3 is shown in Figure 4 where it is clear that the resulting 1-NN boundary is now a much closer approximation to the optimal Bayes decision boundary. Alternately, if it is desired to reduce the size of the training set such points could be discarded.
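In the plane this smoothing step can be sketched with the Delaunay triangulation, whose edges join exactly the Voronoi neighbors (Python, using scipy; relabeling is shown, but the flagged points could instead be discarded, and class labels are assumed to be the integers 0, 1, ...).

```python
import numpy as np
from scipy.spatial import Delaunay

def voronoi_neighbor_smooth(X, y):
    """Relabel each point to the majority class among its Voronoi
    (Delaunay) neighbors when that majority disagrees with its own label."""
    X, y = np.asarray(X, float), np.asarray(y)
    tri = Delaunay(X)
    indptr, indices = tri.vertex_neighbor_vertices
    new_y = y.copy()
    for i in range(len(X)):
        nbrs = indices[indptr[i]:indptr[i + 1]]
        if nbrs.size == 0:
            continue
        counts = np.bincount(y[nbrs])         # labels must be small nonneg ints
        majority = int(np.argmax(counts))
        if counts[majority] > nbrs.size / 2 and majority != y[i]:
            new_y[i] = majority               # or: mark point i for deletion
    return new_y
```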

Such a move may not discard many points since only those near the decision boundary would be thrown out. However, it should result in a smoother decision boundary more like the optimal boundary. Methods such as these that have as their goal the improvement of recognition accuracy and generalization rather than data reduction are called editing rules in the pattern recognition literature. In the instance-based learning literature they are called competence enhancement methods within the class of instance selection algorithms (Brighton and Mellish36). In 1972 Wilson277 first conceived the idea of editing with this goal in mind, and proposed the following elegant and simple algorithm.


Fig. 3. The optimal Bayes decision boundary compared with the 1-NN boundary.

PREPROCESSING

A. For each i:
1. Find the k nearest neighbors of Xi among {X, Y} (not counting Xi).
2. Classify Xi to the class associated with the largest number of points among the k nearest neighbors (breaking ties randomly).
B. Edit {X, Y} by deleting all the points misclassified in the foregoing.

DECISION RULE

Classify a new unknown pattern Z using the 1-NN rule with the edited subset of {X, Y}.
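A direct sketch of the preprocessing step above (Python; ties are broken deterministically toward the smallest label rather than randomly, which is a simplification of this sketch):

```python
import numpy as np

def wilson_edit(X, y, k=3):
    """Wilson editing: delete every training point that is misclassified
    by the k-NN rule applied to the remaining n - 1 points."""
    X, y = np.asarray(X, float), np.asarray(y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # do not count the point itself
    keep = []
    for i in range(len(X)):
        nbrs = np.argsort(d[i])[:k]
        votes = np.bincount(y[nbrs])          # labels must be small nonneg ints
        if np.argmax(votes) == y[i]:
            keep.append(i)
    return np.array(keep)

# The edited set X[keep], y[keep] is then used with the 1-NN rule.
```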

This simple editing scheme is so powerful that the error rate of the 1-NN rule that uses the edited subset converges to the Bayes error. We remark here that a gap in the proof of Wilson277 was pointed out by Devijver and Kittler74 but alternate proofs were provided by Wagner273 and Penrod and Wagner.192

Wilson's deleted nearest neighbor rule deletes all the data misclassified by the k-NN majority rule. A modified editing scheme was proposed in 2000 by Hattori and Takahashi111 in which the data Xi are kept only if all their k nearest neighbors belong to the same class as that of Xi. Thus only the most strongly correctly classified


Fig. 4. The smoothed 1-NN decision boundary.

data are kept. This algorithm discards more data than Wilson's algorithm and gives empirically better results for some problems when a pair of parameters are optimized.

One problem with the two preceding editing algorithms is that, although the final decision is made using the 1-NN rule, the editing is done with the k-NN rule, and thus one is faced again with the problem of how to choose the value of k in practice. This problem was elegantly solved by Tomek243 and Devijver and Kittler,73 who proposed repeated application of Wilson editing with the 1-NN rule until no points are discarded. Devijver and Kittler73 showed empirically that with this scheme, which they call multi-edit, the 1-NN rule appears to have an error rate that leads asymptotically to the Bayes error rate. To the author's knowledge, however, this has not been proved in theory.

In 1981 Koplowitz and Brown141 generalized Wilson's editing scheme by discarding some instances from the training data and changing the class label of other instances. In particular, for each instance (Xi, Yi) in {X, Y}, its k nearest neighbors are examined, and if a large fraction of these belongs to one class, say Yj, then the label of the instance is changed to this class, i.e., Yi is changed to Yj; otherwise the instance (Xi, Yi) is discarded from {X, Y}. Sanchez et al.211 experimentally compared several variants of the algorithm of Koplowitz and Brown.141


Rosin and Fierens206 empirically compared the two extremes of this approach using the k-nearest neighbor majority rule: each instance in {X, Y} is compared to its k nearest neighbors and if its label is different from the majority of its k neighbors then, in the first approach, it is relabeled to the majority label, and in the second method it is discarded from {X, Y}. Their results showed that discarding the data gave better results than relabelling the data.

4. Hybrid Methods

Since Wilson editing is one of the most attractive competence-enhancement techniques, it makes sense to use it as a preprocessing step before applying condensing algorithms. Methods that perform both condensing (data reduction, competence preservation) and editing (smoothing, competence enhancement) are sometimes called hybrid methods.

One notable hybrid method (called DROP4) was proposed by Wilson and Martinez.280 The first phase consists of the Wilson editing step discussed in the preceding. The second phase belongs to the class of decremental techniques, which, unlike Hart's method and its variants (which are incremental), deletes instances from the set obtained from the first phase. For any member (Xi, Yi) of {X, Y} = {(X1, Y1), (X2, Y2), ..., (Xn, Yn)} the authors define those instances of {X, Y} other than (Xi, Yi) that have (Xi, Yi) as one of their k nearest neighbors as the associates of (Xi, Yi). Let S denote the subset of {X, Y} at any time during the decremental process. In phase two an instance (Xi, Yi) is removed from S if at least as many of its associates in {X, Y} are classified correctly without (Xi, Yi) as with it. Perhaps k-associates would have been a better term than just associates in order to emphasize their dependence on k. On the other hand, in the data management community such k-associates are called reverse k-nearest neighbors.1*2'162

One problem with reverse k-nearest neighbors is that one must select a value of k. Motivated by extending the ideas of Wilson and Martinez280 to make an adaptive algorithm for which a value of k is not needed, Brighton and Mellish36 proposed a new hybrid method and compared it to several other hybrid methods on 30 different classification data sets. Their elegant and simple algorithm, which appears to be one of the best in practice, is called iterative case filtering (ICF), and may be described as follows. The first part of the algorithm consists of preprocessing with the original Wilson editing scheme. It is worth pointing out that by replacing this step with the multi-edit algorithm of Devijver and Kittler,73 one may obtain even better results, but this remains to be determined. The second part of their algorithm, their main contribution, is an adaptive condensing procedure. Let {X, Y} = {(X1, Y1), (X2, Y2), ..., (Xn, Yn)} here denote the reduced training set obtained after performing Wilson editing (or multi-edit). The rule for discarding an element (Xk, Yk) of {X, Y} depends on the relative magnitude of two functions of (Xk, Yk) called the reachable set of (Xk, Yk) and the coverage set of (Xk, Yk). The reachable set of (Xk, Yk) consists of all the data points contained in a hypersphere


centered at Xk with radius equal to the distance from Xk to the nearest data point belonging to a class different from that of Xk. More precisely, let S(Xk, Yk) denote the hypersphere with center Xk and radius rk = min{d(Xk, Xj) : Yj ≠ Yk}, minimized over all j. Then all the data points of {X, Y} that are contained in S(Xk, Yk) constitute the reachable set of (Xk, Yk), denoted by R(Xk, Yk). The coverage set of (Xk, Yk), denoted by C(Xk, Yk), consists of all the data points in {X, Y} that have (Xk, Yk) in their own reachable set. More precisely, C(Xk, Yk) consists of all data points (Xi, Yi), i = 1, 2, ..., n, such that (Xk, Yk) is a member of R(Xi, Yi). The condensing part of the ICF algorithm of Brighton and Mellish36 can now be made precise. First, for all i, flag (Xi, Yi) if |R(Xi, Yi)| > |C(Xi, Yi)|. Then discard all flagged points. This idea of using the size of the reachable set to decide whether to discard an instance has been used quite frequently under different names. For example, Wu, Ianakiev and Govindaraju282 proposed an algorithm which also uses this idea but called it the attractive capacity of an instance. However, they did not compare their algorithm to any other condensing algorithms.
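One condensing pass over a Wilson-edited set, using the reachable and coverage sets exactly as defined above, can be sketched as follows (Python, brute-force distance matrix; the full ICF algorithm repeats this step until no point is flagged):

```python
import numpy as np

def icf_condense_step(X, y):
    """One pass of the ICF condensing rule: flag every point whose
    reachable set is larger than its coverage set, then discard all
    flagged points."""
    X, y = np.asarray(X, float), np.asarray(y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # distance from each point to its nearest unlike neighbor
    enemy = np.where(y[:, None] != y[None, :], d, np.inf).min(axis=1)
    # reachable[k, i] is True when point i lies inside the hypersphere
    # around X_k whose radius is the nearest-unlike-neighbor distance of X_k
    reachable = d < enemy[:, None]
    # point i is in the coverage set of k exactly when k reaches i
    coverage = reachable.T
    flagged = reachable.sum(axis=1) > coverage.sum(axis=1)
    return np.flatnonzero(~flagged)           # indices of the retained points
```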

There is also work on combining condensing and editing with other properties. These techniques are sometimes called integrated methods. Wilson and Martinez279 propose a new integrated algorithm and compare it to many other methods on 21 different data sets. Unfortunately these authors did not compare their algorithms to the proximity-graph approaches studied by Dasarathy et al.64 A nice experimental comparison of more than twenty methods, including editing with the Gabriel and relative neighborhood graphs, was done by Jankowski and Grochowski.128,107

A related problem that has its roots in statistics is that of improving performance by identifying and deleting erroneous or mislabeled data. Such data are sometimes also called outliers or mavericks. In classical statistics, techniques such as convex hull peeling pioneered by Tukey266 were used for obtaining better (more robust) estimates of parameters such as the mean (Huber122). In this estimator all data points lying on the convex hull of the data set are discarded. This step can of course be repeated several times and in the extreme reduces to a multivariate generalization of the median. In the context of non-parametric classification and instance-based learning, Brodley and Friedl38 compare several techniques designed for solving this problem.

5. Are Some Neighbors More Important than Others?

The k-NN rule makes a decision based on the majority class membership among the k nearest neighbors of an unknown pattern Z. In other words, every member among the k nearest neighbors has an equal say in the vote. However, it seems natural to give more weight to those members that are closer to Z. In 1966 Royall207 suggested exactly such a scheme, in which the i-th nearest neighbor receives weight wi, with w1 ≥ w2 ≥ ... ≥ wk and w1 + w2 + ... + wk = 1. In 1976 Dudani84 proposed such a weighting scheme where the weight given to Xi is inversely proportional to the distance between Z and Xi. He also showed empirically that for small training
sets and certain distributions this rule gave higher recognition accuracy than the standard k-NN rule. Similar results were obtained by Priebe194 when he applied a randomly weighted k-NN rule to an olfactory classification problem. However, these results do not imply that the weighted rule is better than the standard rule asymptotically (Bailey and Jain,16 Devroye et al.79).
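
A minimal sketch of one simple distance-weighted vote of this kind (inverse-distance weights, Euclidean metric, NumPy conventions) is given below; Dudani's own weighting differs in detail, so this is only illustrative.

    import numpy as np
    from collections import defaultdict

    def weighted_knn_classify(X, y, z, k, eps=1e-12):
        # Closer neighbors get a larger say: weight = 1 / distance.
        d = np.linalg.norm(X - z, axis=1)
        idx = np.argsort(d)[:k]
        votes = defaultdict(float)
        for i in idx:
            votes[y[i]] += 1.0 / (d[i] + eps)
        return max(votes, key=votes.get)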

6. Weighting the Features

Considerable work has been done on trying to improve the nearest neighbor decision rules by weighting the features (attributes, measurements) differently. It seems natural to try to put more weight on features that are better. Hence much effort has been directed at evaluating and comparing measures of the goodness of features. This problem is closely related to the vast field of feature selection, where one is interested in selecting a subset of features with which to design a classifier. This is similar to setting the weights to "one" for the features selected and to "zero" for the ones discarded (Ling and Wang,154 Caruana and Freitag44). Considering the n feature vectors in d-dimensional feature space as an n by d matrix, feature selection can be considered the dual of instance selection. In fact there are techniques that attempt to do both simultaneously.50

One popular method for measuring goodness is with measures of information. For example, Lee and Shin152 propose an enhanced nearest neighbor learning algorithm, with applications to relational databases, in which they use information theory to calculate the weights of the attributes. More specifically, they use the Hellinger divergence, a measure equivalent to the Bhattacharya coefficient, the Matusita distance and the affinity (Matusita,166 McLachlan167), to calculate the weights automatically. This measure of distance between the class-conditional probability distributions is closely related to the Bayes probability of error (Hellman and Raviv,116 Toussaint,254,257,258 Bhattacharya and Toussaint28).
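
For concreteness, with class-conditional distributions p1 and p2 over a (discrete) feature, the quantities mentioned above can be written as follows under one common set of conventions (normalizing constants vary from author to author):

    \rho(p_1,p_2) = \sum_{x}\sqrt{p_1(x)\,p_2(x)} \qquad \text{(Bhattacharya coefficient, or affinity)}

    M(p_1,p_2) = \Bigl[\sum_{x}\bigl(\sqrt{p_1(x)}-\sqrt{p_2(x)}\bigr)^{2}\Bigr]^{1/2} = \sqrt{2\,(1-\rho(p_1,p_2))} \qquad \text{(Matusita distance)}

With the convention that the Hellinger distance is M/\sqrt{2}, all three quantities are monotone functions of one another, which is the sense in which they are interchangeable for weighting features.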

A second popular measure is the mutual information originally proposed by Shannon in the context of communication theory (Cover and Thomas53). Wettschereck and Aha274 and Wettschereck et al.275 compare several methods for weighting features in nearest neighbor rules and claim that the mutual information gives good results. As with feature selection, one must be careful when calculating and evaluating weights of features independently of each other and then using them together, even if the features are independent (Toussaint248,247).
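
As a small illustration of a mutual-information weight for a single discrete (or discretized) feature, the plug-in estimate below can be used; this is only a sketch and not the particular estimator used in the cited studies.

    import numpy as np

    def mutual_information_weight(feature_values, labels):
        # I(F;Y) = sum_{f,y} p(f,y) * log( p(f,y) / (p(f) p(y)) ), estimated from counts.
        f_vals, f_idx = np.unique(feature_values, return_inverse=True)
        y_vals, y_idx = np.unique(labels, return_inverse=True)
        joint = np.zeros((len(f_vals), len(y_vals)))
        for fi, yi in zip(f_idx, y_idx):
            joint[fi, yi] += 1.0
        joint /= joint.sum()
        pf = joint.sum(axis=1, keepdims=True)    # marginal of the feature
        py = joint.sum(axis=0, keepdims=True)    # marginal of the class label
        nz = joint > 0
        return float(np.sum(joint[nz] * np.log(joint[nz] / (pf @ py)[nz])))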

We close this section by mentioning a third very popular measure for weighting features, referred to as the cross-category feature-importance measure by Wettschereck et al.,275 Wettschereck and Aha,274 Stanfill and Waltz,236 and Creecy et al.58 This measure is equivalent to the asymptotic probability of correct classification of the nearest neighbor decision rule and also has several other names, such as the Bayesian distance (Devijver72), as mentioned in the introduction. While such weighting schemes may sometimes improve classification performance in practice, and for some special cases may appear to give better results in theory, there
are no guarantees (Ling and Wang154). Most unsettling is the fact that even when using the Bayes probability of correct classification (the ultimate criterion) as an evaluation measure, and even when the features are independent in each and every pattern class, it has been theoretically established that selecting the best individual features may actually result in obtaining the worst possible feature subset (Toussaint,248 Cover,55 Cover and Van Campenhout56).

7. Choice of Metric

A problem related to feature weighting is the selection of the metric used to measure distances between the pattern vectors (see for example Gavrilova101). A change in metric may implicitly change the weights given to different coordinates or even take correlations into account. There has been considerable effort spent on finding the "optimal" metric. Empirical improvements in accuracy are often obtained when the metric adapts locally to the distribution of the data (Short and Fukunaga,227 Friedman,93 Hastie and Tibshirani,110 Avesani et al.,13 Ricci and Avesani203).

Ekin et al.88 compared several distance methods that use the Euclidean and the Manhattan metrics (Minkowski metrics of order 2 and 1, respectively). They found that for large data sets the Euclidean metric is not significantly better than the easier to compute Manhattan metric. On the other hand, for the small data sets used in their experiments, the Euclidean metric resulted in slightly lower accuracy. The authors offer no explanation for this behaviour.

One of the most important practical issues concerning nearest neighbor rules and cluster analysis (vector quantization) is the issue of scale in multi-dimensional data. Several elegant nearest neighbor rules have been devised that completely avoid this problem by being scale-invariant. In 1966 Anderson9 was the first to propose a k-nearest neighbor rule based on ranks for the univariate case. Anderson's rule may be described as follows. Rank all the instances in {X, Y} together with the unknown pattern Z which is to be classified. Count k instances on each side of Z. Classify Z based on a majority of these 2k rank nearest neighbors, breaking ties randomly. Dasgupta and Lin65 showed that the asymptotic probability of error of this rule is exactly the same as that of the conventional nearest neighbor rule when k = 1. It is not obvious how to generalize such rules to higher dimensions. An elegant and powerful generalization suggested independently by Olshen182 and Devroye75 uses empirical distances defined in terms of order statistics along the d coordinate axes (see also Devroye78). Bagui et al.15 recently proposed another generalization based on covariance matrices.
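
Anderson's univariate rule is simple enough to state as a few lines of code; the sketch below assumes scalar training values x aligned with labels y and breaks ties at random (near the extremes of the sample fewer than 2k rank neighbors may be available).

    import numpy as np
    import random
    from collections import Counter

    def anderson_rank_nn(x, y, z, k):
        # Take the k training values on each side of z in rank order (2k in total)
        # and classify z by a majority vote among their labels.
        x = np.asarray(x, dtype=float)
        order = np.argsort(x)
        pos = np.searchsorted(x[order], z)
        left = order[max(0, pos - k):pos]
        right = order[pos:pos + k]
        votes = Counter(y[i] for i in np.concatenate([left, right]))
        best = max(votes.values())
        return random.choice([c for c, v in votes.items() if v == best])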

The above methods are designed for feature vectors in a metric space. However, there are many applications where the patterns are strings of symbols. In this context distances between two strings are measured by the minimum amount of "work" required to transform one string into the other. The most well known such distance measure is the edit distance. For training-data reduction methods in this context the reader is referred to the paper by Martínez-Hinarejos et al.,164 and the
references therein.

8. Smoothing the Decision Boundary

It is well known that the nearest neighbor decision rules have a "rough" decision boundary for small training sets. This was pointed out in 1972 by Wilson,277 for example, and rediscovered much more recently by Vincent and Bengio.271 Indeed, Wilson's original editing scheme that discards data points close to the decision boundary yields, surprisingly, a smaller data set with a lower error rate. Vincent and Bengio271 proposed a different smoothing algorithm. Rather than discarding data, they compute a smoothed decision boundary by fantasizing the synthesis of fiducial new data added to the training set. They show empirically that their technique can improve the finite-sample error. Skurichina, Raudys and Duin231 take a step further along these lines and actually add noisy synthesized data to increase the size of the training set.

9. Finding Nearest Neighbors Efficiently

One of the widest misconceptions concerning the nearest neighbor decision rule has been that it requires too much computation because, in order to classify an unknown item Z, the distances must be computed between Z and every element in the training set {X, Y}. Certainly this is one possible way to implement the nearest neighbor decision rule, but unless the size of {X, Y} is very small it is probably the worst method available. Various approaches exist for computing a nearest neighbor of a query Z without computing distances to all the candidates. Friedman, Baskett and Shustek94 were the first to propose algorithms that computed far fewer than n distances on the average. Modifications by Friedman, Bentley and Finkel95 and Bentley, Weide and Yao23 soon followed. These elegant early methods depended on projections of high-dimensional data onto lower dimensions (Papadimitriou and Bentley187). In a related, much more recent approach, Moreno-Seco et al.173 select representative sets they call base prototypes, which they use to reduce the number of distance computations in the k-nearest neighbor classification rule.

A completely different approach partitions the space containing {X, Y} into regions, such as Voronoi diagrams, and applies point-location techniques to locate the query Z in the region that identifies its nearest neighbor. In these techniques, once the Voronoi diagram is computed in the design stage of the classifier, the nearest neighbors in the decision rule are found without computing any distances at all. Ramasubramanian and Paliwal199 propose methods that combine the projection methods with the Voronoi diagram methods. One of the most promising families of techniques for this problem, particularly when the dimension is high, uses either bucketing techniques or k-d trees (Devroye et al.79). Some authors have proposed combining training data condensation algorithms such as Hart's with a tree-building technique into one process.39 Algorithms which achieve efficiency at the expense of knowing the distribution of query points in advance have been found
by Clarkson.52 One practical approach is of course to sacrifice finding the exact nearest neighbor. If we are satisfied with finding approximate nearest neighbors, then more efficient algorithms are available (Arya, Mount and Narayan,12 Indyk and Motwani,124 Kleinberg138). This vast area is beyond the scope of this paper, which focuses on reducing the memory and error rate of nearest neighbor decision rules. For additional references to recent results concerning nearest neighbor search in arbitrary dimensions the reader is referred to the survey paper by Agarwal and Erickson.1

10. Proximity Graph Methods

10.1. Proximity graphs

The most fundamental and natural proximity graph defined on a set of points {X, Y} is the nearest neighbor graph (NNG). Here each point in {X, Y} is joined by an edge to its nearest neighbor (Paterson and Yao191). Another ubiquitous proximity graph, which contains the nearest neighbor graph as a subgraph, is the minimum spanning tree (MST) (Zahn286). For a problem such as instance-based learning the most useful proximity graphs are adaptive in the sense that the number of edges they contain is a function of how the data are distributed. The minimum spanning tree is not adaptive; for n points it always contains n − 1 edges. In 1980 the relative neighborhood graph (RNG) was proposed as a tool for extracting the shape of a planar pattern (see Toussaint261,259,264). An example of the planar RNG is shown in Figure 5, but such definitions are readily extended to higher dimensions (Su and Chang239). Proximity graphs have many applications in pattern recognition (see Toussaint260,246,245). There is a vast literature on proximity graphs and it will not be reviewed here. The reader is directed to Jaromczyk and Toussaint130 for a start. The most well known proximity graphs besides those mentioned above are the Gabriel graph GG, the Delaunay triangulation DT, and the Urquhart graph UG.267 All these are nested together in the following relationship:

NNG ⊆ MST ⊆ RNG ⊆ UG ⊆ GG ⊆ DT    (6)
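
To make the empty-region definitions behind two of these graphs concrete, the brute-force sketch below (O(dn^3) time, NumPy conventions, squared Euclidean distances) computes the Gabriel graph and the relative neighborhood graph of a point set; every RNG edge it produces is also a Gabriel edge, in agreement with relation (6).

    import numpy as np
    from itertools import combinations

    def gabriel_and_rng_edges(X):
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared distances
        n = len(X)
        gabriel, rng = [], []
        for i, j in combinations(range(n), 2):
            others = [k for k in range(n) if k != i and k != j]
            # Gabriel graph: no third point inside the ball with diameter X_i X_j.
            if all(d2[i, k] + d2[j, k] >= d2[i, j] for k in others):
                gabriel.append((i, j))
            # RNG: no third point closer to both X_i and X_j than they are to each other.
            if all(max(d2[i, k], d2[j, k]) >= d2[i, j] for k in others):
                rng.append((i, j))
        return gabriel, rng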

10.2. Decision-boundary-consistent subsets via proximity graphs

In 1978 Dasarathy and White were the first to characterize and compute explicitly the decision boundaries of nearest neighbor rules,59 but only for the case d = 2, 3. Their algorithm runs in O(n^4) time. In the worst case, computing the nearest neighbor decision boundary is equivalent to computing the Voronoi diagram. The first O(n log n) time algorithm for computing the Voronoi diagram in the plane is due to Shamos.226

In 1979 Toussaint and Poulsen246 were the first to use d-dimensional Voronoi diagrams to delete "redundant" members of {X, Y} in order to obtain a subset of {X, Y} that implements exactly the same decision boundary as would be obtained using all of {X, Y}. For this reason they called their method Voronoi condensing.


Fig. 5. The relative-neighborhood-graph of a planar set of points.

The algorithm in Ref. [246] is very simple. Two points in {X, Y} are called Voronoi neighbors if their corresponding Voronoi polyhedra share a face. First mark each point Xi if all its Voronoi neighbors belong to the same class as Xi. Then discard all marked points. The remaining points form the Voronoi condensed subset of {X, Y}. Voronoi condensing does not change the error rate of the resulting decision rule because the nearest neighbor decision boundary with the reduced set is identical to that obtained by using the entire set. For this reason the Voronoi condensed subset is called decision-boundary consistent. Clearly decision-boundary consistency implies training-set consistency, but the converse is not necessarily so. The most important consequence of this property is that all the theory developed for the 1-NN rule continues to hold when the rule is preprocessed with Voronoi condensing. While this approach to editing sometimes does not discard a large fraction of the training data (say 90 percent), that information in itself is extremely important to the pattern classifier designer, because the fraction of the data discarded is a measure of the resulting reliability of the decision rule. If few points are discarded it means that the feature space is relatively empty, because few points are completely "surrounded" by points of the same class. This means that either there are too many features or more training data are urgently needed to be able to obtain reliable and robust estimates of the future performance of the rule.

In 2003, Bremner et al.33 gave an O(n log k) time algorithm for computing the
nearest neighbor decision boundary of n points in the plane, where k is the number of points that contribute to the boundary. This problem is equivalent to computing the redundant set of points in Voronoi condensing.

Sixteen years after the proposal of Voronoi condensing by Toussaint and Poulsen,246 Murphy, Brooks and Kite176 rediscovered the algorithm in the context of neural network design and called it network reduction. In 1999, unaware of the above references, Esat89 again rediscovered Voronoi condensing and called it Voronoi polygon reduction.

In 1998 Bhattacharya and Kaller26 extended the above methods to the k-nearest neighbor rules. They call decision-boundary-consistent condensing exact thinning, and otherwise inexact thinning. They proposed a proximity graph they called the k-Delaunay graph and showed how exact thinning may be performed with this graph. It should be noted that Delaunay graphs have been generalized in other ways as well. For example, Bandyopadhyay and Snoeyink17 defined a nearest-neighbor relation they call Almost-Delaunay Simplices.

There is an interesting sort of dual problem in machine learning in which we are given a tessellation of the feature space and one would like to know the minimum number of prototypes such that their Voronoi diagram contains the tessellation. In simple terms: can a nearest neighbor rule learn the given tessellation? Heath and Kasif114 show that this problem is NP-hard.

10.3. Condensing prototypes via proximity graphs

In 1985 Toussaint, Bhattacharya and Poulsen245 generalized Voronoi condensing so that it would discard more points in a judicious and organized manner, so as not to degrade performance unnecessarily. To better understand the rationale behind their proximity-graph-based methods it is useful to cast the Voronoi condensing algorithm in its dual form. The dual of the Voronoi diagram is the Delaunay triangulation. In this setting Voronoi condensing can be described as follows. Compute the Delaunay triangulation of {X, Y}. Mark a vertex Xi of the triangulation if all its (graph) neighbors belong to the same class as that of Xi. Finally discard all the marked vertices. The remaining points of {X, Y} form the Voronoi condensed set. The methods proposed in Ref. [245] substitute a subgraph of the Delaunay triangulation for the triangulation itself. Since a subgraph has fewer edges, its vertices have lower degree on the average. This means the probability that all the graph neighbors of Xi belong to the same class as that of Xi is higher, which implies that more elements of {X, Y} will be discarded. By selecting an appropriate subgraph of the Delaunay triangulation one can control the number of elements of {X, Y} that are discarded. Furthermore, by virtue of the fact that the graph is a subgraph of the Delaunay triangulation and that the latter yields a decision-boundary consistent subset, we can be confident that performance degrades gracefully. Experimental results obtained in Ref. [245] suggested that the Gabriel graph is the best in this respect.
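
In this dual form the procedure is easy to prototype. The sketch below (assuming SciPy's Delaunay triangulation, which is practical only in low dimensions, and NumPy arrays X and y) marks every point all of whose Delaunay neighbors share its class and then discards the marked points; it is meant only as an illustration of the idea, not as the implementation of Ref. [245].

    import numpy as np
    from scipy.spatial import Delaunay

    def voronoi_condense(X, y):
        tri = Delaunay(X)
        indptr, indices = tri.vertex_neighbor_vertices
        keep = []
        for i in range(len(X)):
            neighbors = indices[indptr[i]:indptr[i + 1]]
            # Keep X_i only if at least one Delaunay (i.e., Voronoi) neighbor
            # belongs to a different class.
            if np.any(y[neighbors] != y[i]):
                keep.append(i)
        keep = np.array(keep, dtype=int)
        return X[keep], y[keep]

Replacing the Delaunay neighbors in this sketch by the neighbors in any of its subgraphs (for example the Gabriel graph) yields the more aggressive condensing schemes described above.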

Also in 1985 and independently of Toussaint, Bhattacharya and Poulsen,245
Ichino and Sklansky123 suggested the same idea but with a different proximity graph that is not necessarily a subgraph of the Delaunay triangulation. They proposed a graph which they call the rectangular-influence graph or RIG, defined as follows. Two points Xi and Xj in {X, Y} are joined by an edge if the smallest orthogonal hyper-rectangle that contains both Xi and Xj contains no other point of {X, Y}. An orthogonal hyper-rectangle has its edges parallel to the coordinate axes. Not surprisingly, condensing the training set with the RIG does not guarantee a decision-boundary consistent subset. On the other hand, recall that the RIG has the nice property that it is scale-invariant, which can be very useful in some classification problems.
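
A direct transcription of this definition is given below (a brute-force sketch over a NumPy array X); note that the test uses only coordinate-wise comparisons, which is why the resulting graph is unaffected by monotone rescaling of the individual axes.

    import numpy as np
    from itertools import combinations

    def rectangle_of_influence_edges(X):
        n = len(X)
        edges = []
        for i, j in combinations(range(n), 2):
            lo = np.minimum(X[i], X[j])
            hi = np.maximum(X[i], X[j])
            # X_i and X_j are joined iff no other point lies in the closed
            # axis-aligned hyper-rectangle spanned by them.
            blocked = any(np.all(X[k] >= lo) and np.all(X[k] <= hi)
                          for k in range(n) if k not in (i, j))
            if not blocked:
                edges.append((i, j))
        return edges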

In 1998 Bhattacharya and Kaller26 proposed a proximity graph they call the k-Gabriel graph and show how inexact thinning can be performed with this graph. The k-Gabriel graph is much easier to compute than the k-Delaunay graph and yields good results.

10.4. Editing via proximity graphs

Sanchez, Pla and Ferri213 extended Wilson's277 editing idea to incorporate proximity graphs. Their algorithms mimic Wilson's algorithm, except that the decision to discard a point is based on how it is classified by its graph neighbors rather than by its k nearest neighbors. They empirically investigated the relative neighborhood graph and the Gabriel graph and found that Gabriel graph editing was the best.
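
Given any precomputed proximity graph as an adjacency list, the editing step itself is a short loop over the points, as in the following sketch (assumed adjacency representation and label array y; not the authors' implementation).

    import numpy as np
    from collections import Counter

    def graph_edit(X, y, adjacency):
        # Discard X_i when the majority class among its graph neighbors disagrees with y_i.
        keep = []
        for i, nbrs in enumerate(adjacency):
            if len(nbrs) == 0:
                keep.append(i)
                continue
            majority = Counter(y[j] for j in nbrs).most_common(1)[0][0]
            if majority == y[i]:
                keep.append(i)
        keep = np.array(keep, dtype=int)
        return X[keep], y[keep]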

10.5. Combined editing and condensing via proximity graphs

Editing by itself smooths the decision boundary and improves performance with finite sample size. However, it tends not to discard much data. Therefore, to reduce the size of the training set, condensing is necessary. There has been much research lately on exploring the synergy between editing and condensing techniques (Dasarathy and Sanchez,63 Dasarathy, Sanchez and Townsend,64 Sanchez, Pla and Ferri213). One conclusion of these studies is that editing should be done before condensing to obtain the best results. An extensive experimental comparison of 26 techniques shows that the best approach in terms of both recognition accuracy and data compression is to first edit with either the Gabriel graph or the relative neighborhood graph and subsequently condense with the minimal-consistent subset (MCS) algorithm (Dasarathy, Sanchez and Townsend64).

10.6. Feature weighting with proximity graphs

Sebban and Nock222 apply the nearest-neighbor graph and the minimum spanning tree in conjunction with several information measures to feature evaluation in learning systems.


10.7. Piece-wise classifier design via proximity graphs

Proximity graphs have also found use in designing piece-wise linear and spherical classifiers. Sklansky and Michelotti230 and Park and Sklansky188,189 use the Gabriel graph edges that connect points of different classes (which they call Tomek links) to guide the selection of the final hyperplanes used to define the decision boundary. In particular they require the selection of hyperplanes that intersect all such edges. More recently, Tenmoto, Kudo and Shimbo242 use the Gabriel graph edges between classes (Tomek links) only as a starting point for the initial position of the hyperplanes and subsequently apply error-correction techniques to change the position of the hyperplanes if the local performance improves.

As mentioned earlier, one can also generate non-parametric decision boundaries with other surfaces besides hyperplanes. Priebe et al.196,195 model the decision surfaces with balls and use proximity graphs called catch digraphs to determine the number, location and size of the balls.

10.8. Cluster analysis and validation via proximity graphs

One of the most natural ways to select prototypes to represent a class in pattern recognition is to perform a cluster analysis on the training data of the class in question (Duda et al.83). The number and shape of the resulting clusters can then guide the designer in selecting the prototypes. Obviously one can bring the entire available clustering and vector quantization arsenals to bear on this problem, as discussed previously (Jardine and Sibson,129 Jain and Dubes,125 Kohonen,140 Baras and Subhrakanti19). However, the most powerful and robust methods for clustering turn out to be those based on proximity graphs. Florek et al.92 were the first to propose the minimum spanning tree proximity graph as a tool in classification. The minimum spanning tree contains the nearest-neighbor graph as a subgraph. Zahn286 and Wong281 demonstrated the power and versatility of the minimum spanning tree when applied to many pattern recognition problems. These techniques were later generalized by using other proximity graphs by Urquhart.267 Another generalization of the nearest-neighbor graph is the k-nearest-neighbor graph. This graph is obtained by joining each point with an edge to each of its k nearest neighbors. Brito et al.37 study the connectivity of the k-nearest-neighbor graph and apply it to clustering and outlier detection.

Once a clustering is obtained it is desirable to perform a cluster-validation test. Pal and Biswas186 propose some new indices of cluster validity based on three proximity graphs: the minimum spanning tree, the relative neighborhood graph and the Gabriel graph, and they show that for an interesting class of problems they outperform the existing indices.

A closely related problem is testing whether two training sets of data come from the same cluster or distribution. Such tests are referred to as two-sample tests in statistics. The earliest application of proximity graphs to this problem is due to Friedman and Rafsky96 who proposed a test based on the minimum
spanning tree. In 1986 Schilling217 proposed a two-sample test based on various nearest neighbor graphs. In 1993 Dwyer and Squire85 proposed a test based on the Delaunay triangulation and showed that it has better theoretical properties than both of the above methods.

10.9. Selecting good representatives via proximity graphs

The only proximity graph that appears to have been applied to the more general problem of selecting "good" representatives of a set of data is the minimum spanning tree (MST). One of the earliest methods, proposed in 1979 by Toussaint and Poulsen,246 first computes, for each class, the MST of the points belonging to that class. In a manner analogous to the convex hull layer peeling approach suggested by Tukey, the MST is "peeled" by pruning all the leaves of the tree. This step may be continued if deemed necessary. The final tree determines the selected representatives. Note that unlike the convex-hull peeling method this approach more accurately maintains the "shape" of the set of points.

Much more recently, Tahani, Plummer and Hemamalini241 also compute the MST for each class. They use the average edge-length R of the MST to compute the density at each point Xk, defined as the number of points contained in the disk centered at Xk with radius R. Then they select a subset of high and low density points as the representatives. However, they do not describe how many representatives are selected, or how. Another method that uses the MST was proposed by Hoya.119 If p representatives are desired then the MST is repeatedly split by finding the tree with the largest number of vertices and splitting it at its shortest edge. This is done until a forest of p trees is obtained. Finally a so-called absolute center is computed for each tree, and these centers constitute the representatives.

10.10. Outlier detection via proximity graphs

Djamel Zighed, Stephane Lallich and Fabrice Muhlenbach292 have developed the first non-parametric statistical test for class-separability in R^d. Their statistic, called the Cut Edge Weight, is based on the relative neighborhood graph.261 In particular, the statistic is defined in terms of the sum of the edge-lengths of the edges in the relative neighborhood graph that connect patterns that belong to different classes. This approach has been applied not only to outlier detection in Refs. [174, 150, 175] and Ref. [291], but also to feature and prototype selection in Ref. [220] and Ref. [221].

10.11. Scale-invariance via proximity graphs

A technique suggested independently by Ichino and Sklansky123 and Devroye et al.79 is the so-called rectangle-of-influence-graph decision rule. This rule classifies an unknown pattern Z by a majority vote among the rectangle-of-influence neighbors of Z in {X, Y}.

Fig. 6. The rectangle-of-influence neighbors of a point are scale-invariant.

A point Xi in {X, Y} is such a neighbor if the smallest hyper-rectangle containing both Z and Xi contains no other points of {X, Y}. Devroye et al.79 call this rule the layered nearest neighbor rule and have shown that if there are no ties it is asymptotically Bayes optimal. Figure 6 shows the rectangle-of-influence neighbors of a point. Clearly, the rectangle-of-influence neighbors of a point are invariant to (even non-linear) scale transformations of the data.

10.12. Proximity-graph-neighbor decision rules

The classical approaches to k-NN decision rules are rigid in at least two ways: (1) they obtain the k nearest neighbors of the unknown pattern Z based purely on distance information, and (2) the parameter k is fixed. Thus they disregard how the nearest neighbors are distributed around Z and leave open the question of what the value of k should be. These problems are solved naturally and efficiently with proximity-graph-neighbor decision rules.

In 1985 Ichino and Sklansky123 were the first to propose proximity-graph-neighbor decision rules, namely the rectangle-of-influence graph neighbor decision rule discussed in the preceding. More recently, new geometric definitions of neighborhoods have been suggested and new nearest neighbor decision rules based on other proximity graphs (Jaromczyk and Toussaint130) have been investigated. In 1996 Devroye et al.79 proposed the Gabriel neighbor rule, which takes a majority vote among all the Gabriel neighbors of Z in {X, Y}, breaking ties randomly.
Sanchez, Pla and Ferri212,214 proposed similar rules with other proximity graphs as well as the Gabriel and relative neighborhood graphs. Sebban, Nock and Lallich223 proposed using the relative neighborhood graph decision rule in the context of boosting to select prototypes. Thus in these approaches both the value of k and the distances of the neighbors vary locally and adapt naturally to the distribution of the data around Z. Note that these methods also automatically and implicitly assign different "weights" to the nearest geometric neighbors of Z. Therefore they are also "scale-invariant" in some strange sense. Indeed, they are graph-theoretical multivariate generalizations of Anderson's original nearest neighbor rule based on ranks discussed in the preceding.9 Consider for example Devroye's Gabriel neighbor rule for the univariate case. In the univariate case the Gabriel graph is a chain and the two Gabriel neighbors of Z are exactly the left and right rank neighbors in Anderson's rule. Thus all the proximity-graph-neighbor decision rules discussed in the preceding (including the minimum spanning tree decision rule) are equivalent to Anderson's rule in the univariate case and hence, from the work of Dasgupta and Lin,65 it follows that they have an asymptotic error rate that is less than twice the Bayes error.
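
Classifying a query Z with the Gabriel neighbor rule does not require building the whole graph; it suffices to test which training points are Gabriel neighbors of Z, as in this brute-force sketch (Euclidean metric, NumPy conventions, random tie-breaking).

    import numpy as np
    import random
    from collections import Counter

    def gabriel_neighbor_classify(X, y, z):
        d2z = ((X - z) ** 2).sum(axis=1)                       # squared distances to Z
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        neighbors = []
        for i in range(len(X)):
            # X_i is a Gabriel neighbor of Z iff no X_k lies inside the ball
            # whose diameter is the segment joining Z and X_i.
            if all(d2z[k] + d2[i, k] >= d2z[i] for k in range(len(X)) if k != i):
                neighbors.append(i)
        votes = Counter(y[i] for i in neighbors)
        best = max(votes.values())
        return random.choice([c for c, v in votes.items() if v == best])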

A radically different decision rule, based on the overall length of proximity graphs rather than majority votes, was proposed by Ekin, Hammer, Kogan and Winter.88 These authors proposed computing a discriminant function for each class as follows. In the design stage of the classifier the minimum Steiner tree is computed for the training points in each class. When a query point is to be classified it is inserted into the Steiner tree of each class. Then for each class the ratio between the lengths of the Steiner trees before and after insertion is calculated. Finally the query point is assigned to the class having the smallest ratio. Since computing the minimum Steiner tree is NP-hard they propose an approximation for use in practice.

11. Nearest-Neighbor-Rules and Neural Networks

Nearest neighbor decision rules and neural networks have a four-pronged research history. In the first of these lines of research both are used together as a composite classifier to exploit the advantages each has to offer (Mitiche and Lebidoff171). In the second line of research, nearest neighbor rules are used as filtering algorithms to reduce the size of training sets and increase their quality in order to make neural networks learn better and faster (see Rosin and Fierens206). In the third and fourth lines of research, neural networks have been applied in two ways to design nearest neighbor decision rules. On the one hand, neural networks have been used to attempt to solve the minimal consistent subset selection problem. Here neural networks are just another optimization tool to design a good prototype set to be used by a nearest neighbor classifier (Huang et al.,120,121 Yan,283 Jiang and Zhou131). On the other hand, neural networks are used as actual implementations of the nearest neighbor decision rule. Here (the topic of this section) the neural network is used to obtain a parallel version of the more traditional sequential implementation of the
nearest neighbor classifier.

The Voronoi diagram of a set of sites (data points) is a partition of space into convex regions (polyhedra) such that for each site Xi there corresponds a region with the property that all points in that region of space are closer to Xi than to any other site (see O'Rourke184 and Okabe et al.181 for a lucid introduction to Voronoi diagrams and some of their properties and applications). For this reason, when minimum distance classifiers are implemented as neural networks they are also called Voronoi networks (see Krishna et al.144). Recall that another way to look at 1-NN classification is as a point-location query rather than as a minimum distance computation. Given an unknown vector Z to be classified, determine in which Voronoi polyhedron it lies and classify it to the class associated with the resulting polyhedron. A convex polyhedron is determined by the intersection of half-spaces. Half-spaces (determined by hyperplanes) are a linear perceptron's best friend. One hyperplane can be specified exactly by a single McCulloch-Pitts neuron (simple perceptron, threshold logic unit, Nilsson178). This observation led Murphy177 to propose a neural network for parallel implementation of the 1-NN classifier by implementing explicitly the Voronoi diagram of the training set {X, Y}. The network consists of three hidden layers: one for the McCulloch-Pitts neurons, one consisting of AND gates for each Voronoi polyhedron, and a final layer of OR gates to select the "winning" polyhedron. Unfortunately the complexity of the Voronoi diagram grows exponentially with d, the number of features (the dimensionality of the feature space). More precisely, Raimund Seidel showed in 1991 that the number of faces of the d-dimensional Voronoi diagram of n points is O(n^⌊(d+1)/2⌋) in the worst case.224 Therefore any algorithm to compute the Voronoi diagram must have at least O(n^⌊(d+1)/2⌋) time and space complexity, and hence for any values of n and d other than very small this approach to designing the network is impractical. Realizing this computational brick wall, Murphy177 proposed a second, more efficient approach. Rather than constructing explicitly the Voronoi diagram in order to solve point-location queries, it suffices to test point-location queries with respect to a set of half-spaces. To test if a given query point Z is in the Voronoi polyhedron belonging to Xi it is sufficient to perform half-space tests for the n − 1 hyperplanes that bisect the line segments joining Xi to all the other points in {X, Y}, since only these hyperplanes may determine the facets of the Voronoi polyhedra. These bisecting hyperplanes can be computed in a simple and straightforward manner in O(dn^2) time and O(dn) space, a vast improvement over computing Voronoi diagrams. Both of these methods, however, lead to three-layer neural networks with n(n − 1)/2, or O(n^2), McCulloch-Pitts neurons.

Unaware of the paper by Murphy,177 Smyth233 proposed the same approach as Murphy but took it a step further. Realizing that some of the n(n − 1)/2 hyperplanes may be redundant, he proposed a simple heuristic algorithm for discarding redundant hyperplanes. First, for every pair of points (Xi, Xj) in {X, Y} that belong to different classes, compute the hyperplane that bisects the segment joining Xi and Xj. Then sort these hyperplanes in decreasing order of the distance between their
corresponding Xi and Xj. Finally, process this list of hyperplanes in order, selecting a hyperplane for inclusion in the network and discarding those hyperplanes that do not improve the classification obtained with the hyperplanes selected so far. This approach tries to select only the hyperplanes that are necessary for implementing the decision boundaries between the classes. Unfortunately, in the worst case it is still possible that O(n^2) McCulloch-Pitts neurons may result.

Unaware of the paper by Smyth,233 Bose and Garga30 suggested a much more complicated algorithm for removing redundant hyperplanes. Their idea is to partition the space of each class into unions of convex polyhedra and then implement the facets of these polyhedra. Their algorithm is quite involved and along the way computes both the Voronoi diagram and the convex hull of {X, Y}. Thus the complexity of their algorithm is, like Murphy's, exponential in d in both time and space, and although the number of neurons in the final network is, on the average, smaller than in Murphy's network, the end result is still a three-layer network with O(n^2) McCulloch-Pitts neurons in the worst case.

Dwyer showed in 1991 that the d-dimensional Voronoi diagram of n points uniformly distributed in the interior of a d-dimensional ball has O(n) expected complexity and may be computed in O(n) expected time87 in the real-RAM model of computation with a constant-time floor function, assuming a fixed constant value of d. However, the worst-case complexity of Dwyer's algorithm is O(n^⌊(d+1)/2⌋+1 log n).

In 1995 Murphy, Brooks and Kite176 proposed two modifications of Murphy's original algorithm. Firstly, motivated by Dwyer's result, they proposed another algorithm for computing the Voronoi diagram of n points uniformly distributed over a bounded d-dimensional hypercube, also in O(n) expected time and also for fixed d. They then used this algorithm to construct a three-layer neural network as in Ref. [177] in O(n) expected time with an O(n) expected number of neurons. Of course, if one has a pattern recognition problem with other distributions the complexity of their algorithm may still be exponential in d and the resulting neural network may still have O(n^2) neurons in the worst case. Secondly, they rediscover the Voronoi condensing algorithm of Toussaint and Poulsen246,245 which deletes points Xi in {X, Y} that have all their Voronoi neighbors belonging to the same class as that of Xi, and call this algorithm network reduction.

In 1999, and unaware of the previous work, Esat89 proposed a neural network similar to that of Murphy. It also has three layers and uses O(n^2) McCulloch-Pitts neurons in the worst case. Esat also rediscovered Voronoi condensing (Toussaint and Poulsen246) and called it Voronoi polygon reduction.

In 2001, and unaware of all the previous papers except that of Bose and Garga,30 Gentile and Sznaier103 proposed two improvements to the work of Bose and Garga.30 Their algorithm tends to discard more redundant hyperplanes and yields a network with two layers instead of three. However, their algorithm also computes the Voronoi diagram of {X, Y} and is thus again exponential in d in both time and space. Furthermore, in the worst case the number of neurons is still O(n^2).

All the above methods compute adjacencies of Voronoi cells by starting with the formidable task of computing the Voronoi diagram. However, it is possible to do this while avoiding Voronoi diagrams altogether. Voronoi adjacency computation may be formulated as a linear programming problem, and can be carried out much more efficiently than computing the Voronoi diagram (see Fukuda97). In fact, for a single Voronoi cell and fixed d the redundant hyperplanes can be found in O(nm) time and O(n) space, where m is the number of bounding (non-redundant) hyperplanes that determine the facets of the Voronoi cell (see Ottmann et al.185). Here again, as long as one is using Voronoi adjacencies the final network is doomed to contain O(n^2) McCulloch-Pitts neurons in the worst case.

However, one can do much better still by avoiding Voronoi adjacency computations altogether. Simply rewrite the distance computations in the 1-NN classifier as linear discriminant functions, one for each training pattern in {X, Y}. Each of these discriminant functions can be implemented with one McCulloch-Pitts neuron, leading to a neural network with only a single hidden layer consisting of n neurons followed by a maximum selector (Cover,54 Nilsson178). The maximum selector can be implemented either in a straightforward manner if desired, or with a variety of winner-take-all units (Maass,159,160 Tseng and Wu265). This approach computes, in O(nd) worst-case time, a neural network that has one hidden layer of at most n neurons and one maximum selector. Note that the worst-case complexity of this algorithm is linear in both n and d, rather than the exponential or quadratic complexity required by the algorithms in the preceding, and no learning is required. Furthermore, if {X, Y} is preprocessed by applying Wilson-type editing (Devroye et al.79) then the resulting neural network will be asymptotically Bayes optimal because the 1-NN rule will exhibit the behavior of the k-NN rule. Finally, if {X, Y} is further preprocessed by applying Voronoi condensing (Toussaint et al.246,245) then the number of neurons will be less than n as well, while still maintaining the asymptotic Bayes optimality of the resulting network.
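
The rewriting is elementary: since ||Z − Xi||^2 = ||Z||^2 − 2 Xi·Z + ||Xi||^2, and ||Z||^2 is common to all candidates, the nearest neighbor is the prototype maximizing the linear discriminant gi(Z) = Xi·Z − ||Xi||^2/2. The sketch below (NumPy, assumed notation) builds the corresponding weights and biases of the single hidden layer; the maximum selector is simulated with argmax.

    import numpy as np

    def one_nn_as_linear_layer(X, y):
        # One linear unit per stored prototype: weight vector X_i, bias -||X_i||^2 / 2.
        W = X                                    # shape (n, d)
        b = -0.5 * (X ** 2).sum(axis=1)          # shape (n,)
        def classify(z):
            scores = W @ z + b                   # larger score == closer prototype
            return y[int(np.argmax(scores))]     # maximum selector
        return classify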

It is interesting that Zhao and Higuchi290,289 noticed that n neurons were sufficient to implement nearest neighbor perceptrons, but they proposed a three-layer network and, rather than using Voronoi diagrams, applied evolutionary learning algorithms that start with a random subset of {X, Y} to obtain a reduced training set. Hence their algorithm, by introducing learning where none is needed, also sacrifices the asymptotic Bayes optimality of the resulting network. The neural network designs of Park and Bang190 as well as Gazula and Kabuka102 approximate the 1-NN and k-NN rules, respectively, and thus they also sacrifice the asymptotic Bayes optimality of the resulting network.

It is also interesting to note that Jain and Mao127 were the only authors to propose a faithful neural network implementation of the k-NN rule with only n neurons, but they failed to notice that with Wilson-type editing the 1-NN rule will do the job. As a result they have a complicated four-layer network to compute the k nearest neighbors. The only other neural network faithful to the k-NN rule is the one by Chen, Damper and Nixon,51 which has one less layer than the Jain-Mao
network but implements the Voronoi tessellation idea and thus uses O(n^2) neurons.

12. Estimation of Misclassification

A crucial problem in pattern recognition is the estimation of the performance of a decision rule on future data (see Devroye et al.,79 Glick,104 Jain et al.,126 Schiavo and Hand,216 Toussaint250). Many geometric problems occur here also, where proximity graphs offer elegant and efficient solutions. For example, a good method of estimating the performance of a decision rule is to delete each member of {X, Y} = {(X1,Y1), (X2,Y2), ..., (Xn,Yn)} in turn and classify it with the remaining set. This is the so-called deleted or leave-one-out estimate. For the nearest neighbor rule this problem reduces to computing, for a given set of points in d-space, the nearest neighbor of each point (the all-nearest-neighbors problem). Vaidya268 gives an O(n log n) time algorithm to solve this problem. The proximity graphs discussed here have application to this problem as well. For example, the Gabriel-graph-neighbor decision rule of Devroye et al.79 classifies a new point Z by a majority vote among all its Gabriel neighbors in {X, Y}, breaking ties randomly. To obtain a leave-one-out estimate (Toussaint250) of the probability of error of this rule it is sufficient to compute the Gabriel graph of {X, Y} once. Then each vertex can be visited and classified by examining its neighbors (excluding itself). The complexity of this procedure depends on the expected number of edges in the graph (Devroye77). The same can be done with other decision rules such as the rectangle-of-influence rule.
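
Given the Gabriel graph (or any other proximity graph) of {X, Y} as an adjacency list, the deleted estimate is then a single sweep over the vertices, as in this sketch (assumed adjacency representation, random tie-breaking, isolated vertices counted as errors).

    import random
    from collections import Counter

    def leave_one_out_graph_error(y, adjacency):
        errors = 0
        for i, nbrs in enumerate(adjacency):
            votes = Counter(y[j] for j in nbrs)     # vertex i is excluded automatically
            if not votes:
                errors += 1
                continue
            best = max(votes.values())
            label = random.choice([c for c, v in votes.items() if v == best])
            errors += (label != y[i])
        return errors / len(y)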

13. Open Problems and New Directions

Browsing large databases is an increasingly frequent activity in today's information-intensive world. Search algorithms typically return the k nearest neighbors of a document or image query according to a suitable similarity measure. A new approach to this problem stores the database as a proximity graph. Browsing then returns the graph neighbors of a query vertex in this graph. A step in this direction using the relative neighborhood graph261 was taken by Scuturici et al.219,218 This is a promising new area of research with many avenues still to be explored.

Recall that in 2003, Bremner et al.33 gave an O(n log k) time output-sensitive algorithm for computing the nearest neighbor decision boundary of n points in the plane, where k is the number of points that contribute to the decision boundary. This problem remains unexplored in higher dimensions. There are at least two interpretations of "output-sensitive" here: the number of points that contribute to the decision boundary, and the complexity of the decision boundary. Both problems remain open.

In 1985 Kirkpatrick and Radke137 (see also Radke198) proposed a generalization of the Gabriel and relative neighborhood graphs which they called β-skeletons, where β is a parameter that determines the shape of the neighborhood of two points that must be empty of other points before the two points are joined by an
edge in the graph. It is possible that for a suitable value of β these proximity graphs may yield better training-set condensing results than those obtained with the Gabriel graph. This should be checked out experimentally. It should be noted that there is a close relationship between β-skeletons and the concept of mutual nearest neighbors used by Gowda and Krishna.106 For β = 1 an edge in the β-skeleton has the property that the two points it joins are the mutual nearest neighbors of each other. For further references on computing β-skeletons the reader is referred to the paper by Rao and Mukhopadhyay.201

In the early 1980's I proposed a graph I called the sphere-of-influence graph as a visual primal sketch, originally intended to capture the low-level perceptual structure of visual scenes consisting of dot patterns, or more precisely, a set of n points in the plane (see Toussaint262). I also conjectured that (although not necessarily planar) this graph had a linear number of edges. Avis and Horton14 showed in 1985 that the number of edges in the sphere-of-influence graph of n points is bounded above by 29n. The best upper bound until recently remained fixed at 17.5n. Finally, in 1999 Michael Soss235 brought this bound down to 15n. David Avis conjectured that the correct upper bound is 9n, and he has found examples that require 9n edges, so the problem is still open. More relevant to the topic of interest here is the fact that the sphere-of-influence graph yields a natural clustering of points completely automatically, without the need to tune any parameters. Furthermore, Guibas, Pach and Sharir showed that even in higher dimensions it has O(n) edges for fixed dimension.108 Soss has also given results on the number of edges for metrics other than the Euclidean.234 Finally, Dwyer86 has some results on the expected number of edges in the sphere-of-influence graph. For two recent papers with many references to recent results on sphere-of-influence graphs the reader is referred to Michael and Quint169 and Boyer et al.31 To date the sphere-of-influence graph has not been explored for applications to nearest neighbor decision rules. It would be interesting to compare it with the subgraphs of the Delaunay triangulation that have been successfully applied to the problems discussed in this paper.

Recently several new classes of proximity graphs have surfaced. These include the sphere-of-attraction graphs of McMorris and Wang,168 and the class-cover catch digraphs of De Vinney and Priebe.272 It would be interesting to compare all these graphs as well on the problems discussed in this paper.

The Gabriel-graph-neighbor decision rule (Devroye et al.,79 Sanchez et al.212) and the rectangle-of-influence-graph decision rule (Ichino and Sklansky123) open up a variety of algorithmic problems in d dimensions. For any particular graph (relative neighborhood, Gabriel, etc.) and a training set {X, Y}, in order to classify an unknown Z we would like to be able to answer quickly the query: which elements of {X, Y} are the graph neighbors of Z? For the rectangle-of-influence graph some results have recently been obtained by Carmen Cortes, Belen Palop and Mario Lopez. For the case d = 2 some results are known on recognizing whether certain special graphs can be realized as the rectangle-of-influence graphs of some set of points (see Liotta, Lubiw, Meijer and Whitesides155). Also of interest is computing these proximity
graphs, especially the Gabriel graph, efficiently in high dimensions. Brute-force algorithms are simple but run in O(dn^3) worst-case time. In Ref. [245] a heuristic is proposed for computing the Gabriel graph in expected time closer to O(dn^2). Finding algorithms for computing these graphs in o(dn^3) worst-case time is an open problem.

Another interesting problem in this area is finding the nearest neighbor of a query point Z in {X, Y} efficiently while supporting insertions and deletions of elements of {X, Y}. This problem is relevant to a variety of instance-based learning methods such as Hart's condensing algorithm and its relatives. For example, Hart's CNN can be computed in O(dn^3) time. How much can this bound be improved?

Recall that Gordon Wilfong276 showed in 1991 that the problem of finding the smallest size training-set consistent subset is NP-complete when there are more than two classes. The complexity for the case of two classes remains an open problem.

Recall that several of the better algorithms for selecting a good subset of the training data, such as the hybrid method of Wilson and Martinez,280 make copious use of associates or reverse k-nearest neighbors.142,237,284,143,162 It is therefore of great interest to compute the reverse k-nearest neighbors of a query point efficiently. Almost nothing is known about this problem. The problem has many parameters: the points may be mono-chromatic or they may be bi-chromatic (or even multi-chromatic), the metrics of interest are the L1, L2 and L∞ metrics, and we have the dimension d. In Ref. [162], for d = 2 and the L2 metric, it is shown that there exists an O(n) space data structure that can be computed in O(n log n) time such that a reverse nearest neighbor query can be answered in O(log n) time. What about k > 1? What about d > 2? What about the L1 and L∞ metrics? What about the reverse furthest neighbor problem?

To date it has been difficult in practice to obtain an editing/condensing algorithm that works well in all situations, i.e., for all underlying distributions of the training data. An in-depth comparison of many algorithms led Brighton and Mellish34 to conclude that the existing algorithms tend to fall into two classes determined by whether the data distributions are homogeneous or not. In the past, algorithms that were good in one situation were not good in the other. Therefore a challenging problem is to find algorithms that work well for many distributions. It appears that one such algorithm has finally been obtained. Bhattacharya, Mukherjee and Toussaint27 proposed a four-stage algorithm that uses (in the following order): (1) Wilson editing with the Gabriel graph, (2) thinning with the Gabriel graph, (3) filtering with the algorithm of Brighton and Mellish, and (4) the Gabriel-neighbor decision rule to classify incoming queries. Extensive experiments suggest that their approach is the best on the market.
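
The flavor of the first stage can be conveyed by a short sketch of Wilson-style editing driven by the Gabriel graph: every training point whose label disagrees with the majority label of its Gabriel-graph neighbors is marked, and all marked points are deleted in one batch. This is only an illustrative brute-force rendering (cubic time, exact Gabriel test, arbitrary tie handling), not the implementation used in Ref. [27].

from collections import Counter

def squared_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def gabriel_edit(X, y):
    # One batch of Wilson-style editing on the Gabriel graph of the training set.
    n = len(X)
    keep = []
    for i in range(n):
        # Gabriel neighbors of X[i] among the other training points.
        nbrs = [j for j in range(n) if j != i and
                all(squared_dist(X[i], X[m]) + squared_dist(X[j], X[m])
                    > squared_dist(X[i], X[j])
                    for m in range(n) if m not in (i, j))]
        if not nbrs or Counter(y[j] for j in nbrs).most_common(1)[0][0] == y[i]:
            keep.append(i)            # label agrees with its Gabriel neighbors
    return [X[i] for i in keep], [y[i] for i in keep]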

Finally, two problems important in practice concern how all these techniques scale up to large databases and high-dimensional spaces.43 Not much is known about how the proximity graph methods scale to high dimensions. In Ref. [27] a modification of a data structure proposed by Michael Houle118 is used that computes approximate Gabriel graphs, rather than exact graphs, in such a way
that computational efficiency is obtained in high dimensions without compromising performance.

Acknowledgements

I thank Ned Glick of the Statistics Department at the University of British Columbia in Vancouver, for reading an earlier version of the manuscript and providing n pages of comments, as well as Djamel Zighed of the Department of Computer Science and Statistics at the University of Lyon 2, for several references on the application of proximity graphs to outlier detection.

References

1. Pankaj K. Agarwal and Jeff Erickson. Geometric range searching and its relatives. In J. E. Goodman B. Chazelle and R. Pollack, editors, Advances in Discrete and Computational Geometry. AMS Press, Providence, RI, 1998.

2. J. S. Aguilar, J. C. Riquelme, and M. Toro. Data set editing by ordered projection. In Proceedings of the Fourteenth European Conference on Artificial Intelligence, pages 251-255, Berlin, Germany, 2000.

3. D. W. Aha. Tolerating noisy, irrelevant and novel attributes in instance-based learn­ing algorithms. International Journal of Man-Machine Studies, 36:267-287, 1992.

4. D. W. Aha, editor. Lazy Learning. Kluwer, Norwell, MA, 1997. 5. D. W. Aha, D. Kibler, and M. Albert. Instance-based learning algorithms. In Ma­

chine Learning, 6, pages 37-66. Kluwer, Boston, Mass., 1991. 6. Greg Aloupis, Carmen Cortes, Francisco Gomez, Michael Soss, and Godfried T.

Toussaint. Lower bounds for computing statistical depth. Computational Statistics and Data Analysis, 40:223-229, 2002.

7. Greg Aloupis, Stefan Langerman, Michael Soss, and Godfried Toussaint. Algorithms for bivariate medians and a Fermat-Torricelli problem for lines. In Proceedings of the Thirteenth Canadian Conference on Computational Geometry, pages 21-24, Univer­sity of Waterloo, August 13-15 2001.

8. Ethem Alpaydin. Voting over multiple condensed nearest neighbors. Artificial Intel­ligence Review, 11:115-132, 1997.

9. T. W. Anderson. Some nonparametric multivariate procedures based on ranks. In P. R. Krishnaiah, editor, Proceedings of the First International Symposium on Mul­tivariate Analysis. Academic Press, New York, 1966.

10. D. V. Andrade and L. E. de Figueiredo. Good approximations for the relative neigh­borhood graph. In Proc. 13th Canadian Conference on Computational Geometry, University of Waterloo, August 13-15 2001.

11. F. Angiulli and C. Pizutti. Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on the Principles of Data Mining and Knowledge Discovery, pages 15-26, 2002.

12. S. Arya, D. M. Mount, and O. Narayan. Accounting for boundary effects in nearest-neighbor searching. Discrete and Computional Geometry, 16:155-176, 1996.

13. P. Avesani, E. Blanzieri, and F. Ricci. Advanced metrics for class-driven similarity search. In Proceedings of the International Workshop on Similarity Search, Septem­ber 1999.

14. David Avis and Joe Horton. Remarks on the sphere of influence graphs. Annals of the New York Academy of Sciences, 440:323-327, 1985.

15. Subhash C. Bagui, Sikha Bagui, Kuhu Pal, and Nikhil R. Pal. Breast cancer detection using rank nearest neighbor classification rules. Pattern Recognition, 36:25-34, 2003.

16. T. Bailey and A. Jain. A note on distance-weighted k-nearest neighbor rules. IEEE Transactions on Systems, Man, and Cybernetics, 8:311-313, 1978.

17. D. Bandyopadhyay and J. Snoeyink. Almost-Delaunay simplices: nearest-neighbor relations for imprecise points. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 410-419, 2004.

18. Yoram Baram. A geometric approach to consistent classification. Pattern Recogni­tion, 33:177-184, 2000.

19. John S. Baras and Subhrakanti Dey. Combined compression and classification with learning vector quantization. IEEE Transactions on Information Theory, 45:1911— 1920, 1999.

20. V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley and Sons Ltd, London, 1994.

21. E. R. Barreis. Exemplar-Based Knowledge Acquisition. Academic Press, Boston, MA, 1989.

22. Stephen D. Bay and Mark Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proc. ACM Conference on Knowledge Discovery in Data and Data Mining, Washington, DC, 2003.

23. J. L. Bentley, B. W. Weide, and A. C. Yao. Optimal expected-time algorithms for closest-point problems. ACM Transactions on Mathematical Software, 6:563-580, 1980.

24. Sergio Bermejo and Joan Cabestany. Adaptive soft k-nearest-neighbor classifiers. Pattern Recognition, 32:2077-2979, 1999.

25. James C. Bezdek, Thomas R. Reichherzer, Gek Sok Lim, and Yianni Attikiouzel. Multiple-prototype classifier design. IEEE Transactions on Systems, Man and Cy­bernetics - Part C: Applications and Reviews, 28:67-79, February 1998.

26. Binay Bhattacharya and Damon Kaller. Reference set thinning for the k-nearest neighbor decision rule. In Proceedings of the 14th International Conference on Pattern Recognition, volume 1, 1998.

27. Binay Bhattacharya, Kaustav Mukherjee, and Godfried Toussaint. Geometric deci­sion rules for high dimensions. In Proceedings of the 55th Session of the International Statistical Institute, Sydney, Australia, April 5-12 2005.

28. Binay K. Bhattacharya. Application of computational geometry to pattern recogni­tion problems. Ph.d. thesis, School of Computer Science, McGill University, 1982.

29. Avrim L. Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, pages 245-272, 1997.

30. N. K. Bose and A. K. Garga. Neural network design using Voronoi diagrams. IEEE Transactions on Neural Networks, 4:778-787, September 1993.

31. E. Boyer, L. Lister, and B. Shader. Sphere of influence graphs using the sup-norm. Mathematical and Computer Modelling, 32:1071-1082, 1999.

32. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Re­gression Trees. Wadsworth, Belmont, CA, 1984.

33. David Bremner, Erik Demaine, Jeff Erickson, John Iacono, Stefan Langerman, Pat Morin, and Godfried Toussaint. Output-sensitive algorithms for computing nearest-neighbour decision boundaries. In F. Dehne, J.-R. Sack, and M. Smid, editors, Al­gorithms and Data Structures, pages 451-461. Springer-Verlag, 2003.

34. Henry Brighton and Chris Mellish. On the consistency of information filters for lazy learning algorithms. In J. Zitkow and J. Rauch, editors, Principles of Data Mining and Knowledge Discovery. Springer-Verlag, Berlin, 1999.

35. Henry Brighton and Chris Mellish. Identifying competence-critical instances for instance-based learning algorithms. In Hiroshi Motoda and Huan Lui, editors, In­stance Selection and Construction for Data Mining, pages 1-18. Kluwer Academic Publishers, Boston, Mass., 2001.

36. Henry Brighton and Chris Mellish. Advances in instance selection for instance-based learning algorithms. Data Mining and Knowledge Discovery, 6:153-172, 2002.

37. M. R. Brito, E. L. Chaves, A. J. Quiroz, and J. E. Yukich. Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics and Probability Letters, 35:33-42, 1997.

38. C. E. Brodley and M. A. Friedl. Identifying mislabelled training data. Journal of Artificial Intelligence Research, 11:131-167, 1999.

39. R. L. Brown. Accelerated template matching using template trees grown by conden­sation. IEEE Transactions on Systems, Man and Cybernetics, 25:523-528, 1995.

40. C. J. C. Burges. Simplified support vector decision rules. In Proceedings of the 13th International Conference on Machine Learning, pages 71-77. Bari, Italy, 1996.

41. C. J. C. Burges and B. Schoelkopf. Improving speed and accuracy of support vector learning machines. Advances in Neural Information Processing Systems, 9:375-381, 1997.

42. Benjamin Bustos, Gonzalo Navarro, and Edgar Chavez. Pivot selection techniques for proximity searching in metric spaces. In Proceedings of the XXI International Conference of the Chilean Computer Science Society, pages 33-40, Punta Arenas, Chile, November 2001. IEEE Press.

43. Jose Ramon Cano, Francisco Herrera, and Manuel Lozano. Stratification for scaling up evolutionary prototype selection. Pattern Recognition Letters, 2004.

44. R. Caruana and D. Freitag. Greedy attribute selection. In Proceedings of the 1994 International Conference on Machine Learning, pages 28-36. Morgan Kaufmann, C A , 1994.

45. Vicente Cerveron and Francesc J. Ferri. Another move toward the minimum consis­tent subset: a tabu search approach to the condensed nearest neighbor rule. IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics, 31:408-413, 2001.

46. Vicente Cerveron and A. Fuertes. Parallel random search and Tabu search for the minimum consistent subset selection problem. In Lecture Notes in Computer Science, pages 248-259. Springer, Berlin, 1998.

47. C. L. Chang. Finding prototypes for nearest neighbor classifiers. IEEE Transactions on Computers, 23:1179-1184, 1974.

48. I. E. Chang and R. P. Lippmann. Using genetic algorithms to improve pattern clas­sification performance. Advances in Neural Information Processing, 3:797-803, 1991.

49. D. Chaudhuri, C. A. Murthy, and B. B. Chaudhuri. Finding a subset of representative points in a data set. IEEE Transactions on Systems, Man and Cybernetics, 24:1416-1424, September 1994.

50. Jian-Hung Chen, Hung-Ming Chen, and Shinn-Ying Ho. Design of nearest neighbor classifiers: multi-objective approach. International Journal of Approximate Reason­ing, 2005.

51. Yan Qiu Chen, Robert I. Damper, and Mark S. Nixon. On neural-network implementations of k-nearest neighbor pattern classifiers. IEEE Transactions on Circuits and Systems - I: Fundamental Theory and Applications, 44(7):622-629, 1997.

52. Ken L. Clarkson. Nearest neighbor queries in metric spaces. In Proc. 29th Annual ACM Symposium on the Theory of Computing, pages 609-617, 1997.

53. T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New York,

1991.
54. Thomas M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, EC-14:326-334, 1965.

55. Thomas M. Cover. The best two independent measurements are not the two best. IEEE Transactions on Systems, Man, and Cybernetics, 4:116-117, 1974.

56. Thomas M. Cover and Jan Van Campenhout. On the possible orderings in the mea­surement selection problem. IEEE Trans. Systems, Man, and Cybernetics, 7:657-661, 1977.

57. Thomas M. Cover and Peter E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21-27, 1967.

58. Robert H. Creecy, Brij M. Masand, Stephen J. Smith, and David L. Waltz. Trading MIPS and memory for knowledge engineering. Communications of the ACM, 35:48-63, August 1992.

59. Balakrishnan Dasarathy and Lee J. White. A characterization of nearest-neighbor rule decision surfaces and a new approach to generate them. Pattern Recognition, 10:41-46, 1978.

60. Belur V. Dasarathy, editor. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, CA, 1991.

61. Belur V. Dasarathy. Minimal consistent set (MCS) identification for optimal nearest neighbor decision system design. IEEE Trans, on Systems, Man and Cybernetics, 24:511-517, 1994.

62. Belur V. Dasarathy. Nearest unlike neighbor (NUN): an aid to decision confidence estimation. Optical Engineering, 34:2785-2792, 1995.

63. Belur V. Dasarathy and J. S. Sanchez. Tandem fusion of nearest neighbor editing and condensing algorithms - data dimensionality effects. In Proc. 15th International Conference on Pattern Recognition, pages 692-695, September 2000.

64. Belur V. Dasarathy, J. S. Sanchez, and S. Townsend. Nearest neighbor editing and condensing tools - synergy exploitation. Pattern Analysis and Applications, 3:19-30, 2000.

65. S. Dasgupta and H. E. Lin. Nearest neighbor rules for statistical classifications based on ranks. Sankhya, A-42:219-230, 1980.

66. P. Datta and D. Kibler. Symbolic nearest mean classifier. In Proceedings of the 14th National Conference on Artificial Intelligence, pages 82-87, 1997.

67. Mark de Berg, Prosenjit Bose, Otfried Cheong, and Pat Morin. On simplifying dot maps. In 18th European Workshop on Computational Geometry, 2002.

68. Ramon Lopez de Mantaras and Eva Armengol. Inductive and lazy methods. Data and Knowledge Engineering, 25:99-123, 1998.

69. Christine Decaestecker. Finding prototypes for nearest neighbor classification by means of gradient descent and deterministic annealing. Pattern Recognition, 30:281-288, 1997.

70. Christine Decaestecker and T. Van de Merckt. How to secure the decisions of a NN classifier. In Proc. IEEE International Conference on Neural Networks, pages 263-268, 1994.

71. V. Susheela Devi and M. Narasimha Murty. An incremental prototype set building technique. Pattern Recognition, 35:505-513, 2002.

72. Pierre Devijver. On a new class of bounds on Bayes risk in multihypothesis pattern recognition. IEEE Transactions on Computers, 23:70-80, 1974.

73. Pierre Devijver and Josef Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall, Englewood Cliffs, NJ, 1982.

74. Pierre A. Devijver and Josef Kittler. On the edited nearest neighbor rule. In Fifth International Conference on Pattern Recognition, pages 72-80, Miami, December 1980.

75. Luc Devroye. A universal A-nearest neighbor procedure in discrimination. In Pro­ceedings of the 1978 IEEE Computer Society Conference on Pattern Recognition and Image Processing, pages 142-147, 1978.

75. Luc Devroye. A universal k-nearest neighbor procedure in discrimination. In Proceedings of the 1978 IEEE Computer Society Conference on Pattern Recognition and Image Processing, pages 142-147, 1978.

77. Luc Devroye. The expected size of some graphs in computational geometry. Com­puters and Mathematics with Applications, 15:53-64, 1988.

78. Luc Devroye. A universal k-nearest neighbor procedure in discrimination. In Belur V. Dasarathy, editor, Nearest Neighbor Pattern Classification Techniques, pages 101-106. IEEE Computer Society Press, Los Alamitos, California, 1991.

79. Luc Devroye, Laszlo Gyorfi, and Gabor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag New York, Inc., 1996.

80. Abdelhamid Djouadi. On the reduction of the nearest-neighbor variation for more accurate classification and error estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:567-571, 1998.

81. Tom Downs, Kevin E. Gates, and Annette Masters. Exact simplification of support vector solutions. Journal of Machine Learning Research, 2:293-297, 2001.

82. Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. Wiley-Interscience, New York, 1973.

83. Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley and Sons, Inc., New York, 2001.

84. Sahibsingh A. Dudani. The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, 6:325-327, 1976.

85. Rex Dwyer. A multivariate two-sample test using the Voronoi diagram. Tech. Report TR-93-21, Department of Computer Science, North Carolina State University, 1993.

86. Rex Dwyer. The expected size of the sphere-of-influence graph. Computational Ge­ometry: Theory and Applications, 5:155-164, 1995.

87. Rex A. Dwyer. Higher-dimensional Voronoi diagrams in linear expected time. Discrete and Computational Geometry, 6:343-367, 1991.

88. Oya Ekin, Peter L. Hammer, Alexander Kogan, and Pawel Winter. Distance-based classification methods. INFOR, 37:337-352, 1999.

89. Ibrahim Esat. Neural network design based on decomposition of decision space. In Proceedings of the 6th International Conference on Neural Information Processing, volume 1, pages 366-370, 1999.

90. E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: detecting intrusions in unlabelled data. In Data Mining for Security Applications, 2002.

91. E. Fix and J. Hodges. Discriminatory analysis. Nonparametric discrimination: Con­sistency properties. Tech. Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.

92. K. Florek, J. Lucaszewicz, J. Perkal, H. Steinhaus, and S. Zubrzycki. Sur la liaison et la division des points d'un ensemble fini. Colloquium Mathematicae, 2:282-285, 1951.

93. J. H. Friedman. Flexible metric nearest neighbor classification. Stanford University, Stanford, California, November 1994. Technical Report.

94. J. H. Friedman, F. Baskett, and L. J. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, C-24:1000-1006, 1975.

95. J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3:209-226, 1977.

96. J. H. Friedman and L. C. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Annals of Statistics, 7:697-717, 1979.

97. Komei Fukuda. Frequently asked questions in polyhedral computation. In Technical Report, Swiss Federal Institute of Technology, October 2000.

98. Keinosuke Fukunaga and L. D. Hostetler. K-nearest-neighbor Bayes-risk estimation. IEEE Transactions on Information Theory, 21:285-293, 1975.

99. Keinosuke Fukunaga and J. M. Mantock. Nonparametric data reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:115-118, 1984.

100. W. Gates. The reduced nearest neighbor rule. IEEE Transactions on Information Theory, 18:431-433, 1972.

101. Marina L. Gavrilova. On a nearest-neighbor problem under Minkowski and power metrics for large data sets. The Journal of Supercomputing, 22(l):87-98, May 2002.

102. S. Gazula and M. R. Kabuka. Design of supervised classifiers using boolean neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:1239-1246, December 1995.

103. Camillo Gentile and Mario Sznaier. An improved Voronoi-diagram-based neural net for pattern classification. IEEE Transactions on Neural Networks, 12:1227-1234, September 2001.

104. Ned Glick. Additive estimators of probabilities of correct classification. Pattern Recognition, 10:211-222, 1978.

105. L. A. Goodman and W. H. Kruskal. Measures of association for cross classifications. J. Amer. Statistical Association, pages 723-763, 1954.

106. K. Chidananda Gowda and G. Krishna. The condensed nearest neighbor rule using the concept of mutual nearest neighborhood. IEEE Transactions on Information Theory, 25:488-490, 1979.

107. Marek Grochowski and Norbert Jankowski. Comparison of instance selection algo­rithms II: results and comments. In Proceedings VII International Conference on Artificial Intelligence and Soft Computing, volume 3070, Lecture Notes in Computer Science, pages 580-585, Zakopane, Poland, 2004.

108. Leo Guibas, Janos Pach, and Micha Sharir. Sphere-of-influence graphs in higher dimensions. In Intuitive Geometry (Szeged, 1991), pages 131-137. North-Holland, Amsterdam, 1994.

109. Peter E. Hart. The condensed nearest neighbor rule. IEEE Transactions on Infor­mation Theory, 14:515-516, 1968.

110. Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classi­fication. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18:607-616, June 1996.

111. Kazuo Hattori and Masahito Takahashi. A new edited k-nearest neighbor rule in the pattern classification problem. Pattern Recognition, 33:521-528, 2000.

112. D. Hawkins. Identification of Outliers. Chapman and Hall, 1980. 113. Zengyou He, Xiaofei Xu, Joshua Zhexue Huang, and Shengchun Deng. Mining class

outliers: concepts, algorithms and applications in CRM. Expert Systems with Appli­cations, 27:681-697, 2004.

114. David Heath and Simon Kasif. The complexity of finding minimal Voronoi covers with applications to machine learning. Computational Geometry: Theory and Appli­cations, 3:289-305, 1993.

115. Martin E. Hellman. The nearest neighbor classification rule with a reject option.

IEEE Transactions on Systems Science and Cybernetics, 6:179-185, 1970. 116. Martin E. Hellman and Josef Raviv. Probability of error, equivocation, and the

Chernoff bound. IEEE Transactions on Information Theory, 16:368-372, 1970. 117. Shinn-Ying Ho, Chia-Cheng Liu, and Soundy Liu. Design of an optimal nearest

neighbor classifier using an intelligent genetic algorithm. Pattern Recognition Letters, 23:1495-1503, 2002.

118. Michael Houle. SASH: A spatial approximation sample hierarchy for similarity search. Tech. Report RT-0517, IBM Tokyo Research Laboratory, 2003.

119. Tetsuya Hoya. Graph theoretic techniques for pruning data and their applications. IEEE Transactions on Signal Processing, 46:2574-2579, 1998.

120. Y. S. Huang, C. C. Chiang, J. W. Shieh, and E. Grimson. Prototype optimization for nearest neighbor classification. Pattern Recognition, 35:1237-1245, 2002.

121. Y. S. Huang, K. Liu, and C. Y. Suen. A new method of optimizing prototypes for nearest neighbor classifiers using a multi-layer network. Pattern Recognition Letters, 16:77-82, 1995.

122. P. H. Huber. Robust statistics: a review. Annals of Mathematical Statistics, 43:1041-1067, 1972.

123. Manabu Ichino and Jack Sklansky. The relative neighborhood graph for mixed fea­ture variables. Pattern Recognition, 18:161-167, 1985.

124. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards remov­ing the curse of dimensionality. In Proc. 30th Annual ACM Symp. on Theory of Computing, 1998.

125. Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, New Jersey, 1988.

126. Anil K. Jain, Richard C. Dubes, and Chaur-Chin Chen. Bootstrap techniques for error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:628-633, 1987.

127. Anil K. Jain and J. Mao. A k-nearest neighbor artificial neural network classifier. In Proc. International Joint Conference on Neural Networks, pages 515-520, Seattle, July 1991.

128. Norbert Jankowski and Marek Grochowski. Comparison of instance selection algo­rithms I: algorithms survey. In Proceedings VII International Conference on Artificial Intelligence and Soft Computing, volume 3070, Lecture Notes in Computer Science, pages 598-603, Zakopane, Poland, 2004.

129. Nicholas Jardine and Robin Sibson. Mathematical Taxonomy. John Wiley and Sons Ltd, London, 1971.

130. J. W. Jaromczyk and Godfried T. Toussaint. Relative neighborhood graphs and their relatives. Proceedings of the IEEE, 80(9):1502-1517, September 1992.

131. Yuan Jiang and Zhi-Hua Zhou. Editing training data for kNN classifiers with neural network ensemble. In First International Symposium on Neural Networks, volume 3173 of Lecture Notes in Computer Science, pages 356-361, Dalian, China, 2004.

132. Bilge Karagali and Hamid Krim. Fast minimization of structural risk by nearest neighbor rule. IEEE Transactions on Neural Networks, 14(1):127-137, January 2002.

133. C K. Keung and W. Lam. Prototype generation based on instance filtering and aver­aging. In Proceedings of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 142-152, Kyoto, Japan, 2000.

134. D. Kibler and D. W. Aha. Learning representative exemplars of concepts: An ini­tial case study. In Proceedings of the Fourth International Workshop on Machine Learning, pages 24-30, Irvine, CA, 1987.

135. D. Kibler and D. W. Aha. Comparing instance-averaging with instance-filtering

learning algorithms. In Proceedings of the Third European Working Session on Learn­ing, pages 63-80, 1988.

136. Sang-Woon Kim and B. J. Oommen. Enhancing prototype reduction schemes with LVQ3-type algorithms. Pattern Recognition, 36, 2003. in press.

137. David G. Kirkpatrick and John D. Radke. A framework for computational mor­phology. In Godfried T. Toussaint, editor, Computational Geometry, pages 217-248. North Holland, Amsterdam, 1985.

138. J. Kleinberg. Two algorithms for nearest-neighbor search in high dimension. In Proc. 29th Annual ACM Symposium on the Theory of Computing, pages 599-608, 1997.

139. E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: algorithms and applications. VLDB Journal: Very Large Databases, 8(3-4):237-253, 2000.

140. T. Kohonen. Self-Organizing Map. Springer-Verlag, Germany, 1995. 141. J. Koplowitz and T. A. Brown. On the relation of performance to editing in nearest

neighbor rules. Pattern Recognition, 13:251-255, 1981. 142. F. Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor

queries. In W. Chen, J. Naughton, and P. A. Bernstein, editors, Proc. ACM Inter­national Conference on Management of Data, pages 201-212, New York, 2000. ACM Press.

143. Flip Korn, S. Muthukrishnan, and Divesh Srivastava. Reverse nearest neighbor ag­gregates over data streams. In Proc. 28th VLDB Conference, Hong Kong, 2002.

144. K. Krishna, M. A. L. Thathachar, and K. R. Ramakrishnan. Voronoi networks and their probability of misclassification. IEEE Trans, on Neural Networks, 11:1361— 1372, November 2000.

145. Miroslav Kubat and Jr. Martin Cooperson. Voting nearest-neighbor subclassifiers. In Proceedings of the 17th International Conference on Machine Learning, pages 503-510. Stanford, CA, 2000.

146. Sanjeev R. Kulkarni, Gabor Lugosi, and Santosh S. Venkatesh. Learning pattern classification - a survey. IEEE Transactions on Information Theory, 44:2178-2206, 1998.

147. Ludmila I. Kuncheva. Fitness functions in editing k-NN reference set by genetic algorithms. Pattern Recognition, 30:1041-1049, 1997.

148. Ludmila I. Kuncheva and J. C. Bezdek. Nearest prototype classification: clustering, genetic algorithms, or random search. IEEE Transactions on Systems, Man and Cybernetics, 28:160-164, 1998.

149. Ludmila I. Kuncheva and Lakhmi C. Jain. Nearest neighbor classifier: Simultaneous editing and feature selection. Pattern Recognition Letters, 20:1149-1156, 1999.

150. S. Lallich, F. Muhlenbach, and D. A. Zighed. Improving classification by removing or relabeling mislabeled instances. Applied Stochastic Models in Business and Industry, 2005.

151. Wai Lam, Chi-Kin Keung, and Charles X. Ling. Learning good prototypes for clas­sification using filtering and abstraction of instances. Pattern Recognition, 35:1491-1506, 2002.

152. Chang-Hwan Lee and Dong-Guk Shin. Using Hellinger distance in a nearest neighbor classifier for relational data bases. Knowledge-Based Systems, 12:363-370, 1999.

153. M. C. Limas, J. B. Ordieres Mere, F. J. M. de Pison Ascacibar, and E. P. V. Gonzalez. Outlier detection and data cleaning in multivariate non-normal samples. Data Mining and Knowledge Discovery, 9(2):171-187, September 2004.

154. Charles X. Ling and Handong Wang. Computing optimal attribute weight settings for nearest neighbor algorithms. Artificial Intelligence Review, 11:255-272, 1997.

155. Giuseppe Liotta, Anna Lubiw, Henk Meijer, and Sue Whitesides. The rectangle of

influence drawability problem. Computational Geometry: Theory and Applications, 10:1-22, 1998.

156. U. Lipowezky. Selection of the optimal prototype subset for 1-NN classification. Pattern Recognition Letters, 19:907-918, 1998.

157. Cheng-Lin Liu and Masaki Nakagawa. Evaluation of prototype learning algorithms for nearest-neighbor classifier in application to handwritten character recognition. Pattern Recognition, 34:601-615, 2001.

158. Huan Liu and Hiroshi Motoda. On issues of instance selection. Data Mining and Knowledge Discovery, 6:115-130, 2002.

159. Wolfgang Maas. Neural computation with winner-take-all as the only nonlinear op­eration. Advances in Neural Information Processing Systems, 12, 2000.

160. Wolfgang Maas. On the computational power of winner-take-all. Neural Computa­tion, 12:2519-2535, 2000.

161. D. Madigan, N. Raghavan, W. DuMouchel, M. Nason, C. Posse, and G. Ridge-way. Likelihood-based data squashing: a modeling approach to instance construction. Data Mining and Knowledge Discovery, 6:173-190, 2002.

162. Anil Maheshwari, Jan Vahrenhold, and Norbert Zeh. On reverse nearest neighbor queries. In Proceedings Fourteenth Canadian Conference on Computational Geome­try, pages 128-132, Lethbridge, Alberta, Canada, August 2002.

163. S. Markovitch and P. D. Scott. Information filtering: selection mechanisms in learning systems. Machine Learning, 10:113-151, 1993.

164. C. D. Martinez-Hinarejos, A. Juan, and F. Casacuberta. Median strings for k-nearest neighbor classification. Pattern Recognition Letters, 24:173-181, 2003.

165. A. Mathai and P. Rathie. Basic Concepts in Information Theory and Statistics. Wiley Eastern Ltd., New Delhi, 1975.

166. K. Matusita. On the notion of affinity of several distributions and some of its appli­cations. Annals of the Institute of Statistical Mathematics, 19:181-192, 1967.

167. Geoffrey J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. John Wiley and Sons, Inc., New York, 1992.

168. F. R. McMorris and C. Wang. Sphere-of-attraction graphs. Congressus Numeran-tium, 142:149-160, 2000.

169. T. S. Michael and T. Quint. Sphere of influence graphs in general metric spaces. Mathematical and Computer Modelling, 29:45-53, 1999.

170. Luisa Mico, Jose Oncina, and E. Vidal. A new version of the nearest neighbor approx­imating and eliminating search algorithm (AESA) with linear preprocessing-time and memory requirements. Pattern Recognition Letters, 15:9-17, 1994.

171. A. Mitiche and M. Lebidoff. Pattern classification by a condensed neural network. Neural Networks, 14:575-580, 2001.

172. R. A. Mollineda, F. J. Ferri, and E. Vidal. An efficient prototype merging strategy for the condensed 1-NN rule through class-conditional hierarchical clustering. Pattern Recognition, 35:in press, 2002.

173. Francisco Moreno-Seco, Luisa Mico, and Jose Oncina. A modification of the LAESA algorithm for approximated k-NN classification. Pattern Recognition Letters, 24:47-53, 2003.

174. F. Muhlenbach, S. Lallich, and D. A. Zighed. Identifying and handling mislabelled instances. Journal of Intelligent Information Systems, 22(1):89-109, 2004.

175. F. Muhlenbach, S. Lallich, and D. A. Zighed. Outlier handling in the neighbourhood-based learning of a continuous class. In Einoshin Suzuki and Setsuo Arikawa, editors, Proceedings of Discovery Science, pages 314-321. Springer-Verlag, Berlin-Heidelberg, 2004.

176. O. Murphy, B. Brooks, and T. Kite. Computing nearest neighbor pattern classifica­tion perceptrons. Information Sciences, 83:133-142, 1995.

177. Owen J. Murphy. Nearest neighbor pattern classification perceptrons. Proceedings of the IEEE, 78:1595-1598, 1990.

178. Nils J. Nilsson. The Mathematical Foundations of Learning Machines. Morgan Kauf-mann Publishers, Inc., San Mateo, CA., 1990.

179. R. Nock and M. Sebban. Advances in adaptive prototype weighting and selection. International Journal on Artificial Intelligence Tools, 10:137-155, 2001.

180. Richard Nock and Marc Sebban. Sharper bounds for the hardness of prototype and feature selection. In Proceedings of the International Conference on Algorith­mic Learning Theory, pages 224-237. Springer-Verlag, 2000.

181. Atsuyuki Okabe, Barry Boots, and Kokichi Sugihara. Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. John Wiley and Sons, Chichester, England, 1992.

182. R. Olshen. Comments on a paper by C. J. Stone. Annals of Statistics, 5:632-633, 1977.

183. J. O'Rourke and G. Toussaint. Pattern recognition. In J. E. Goodman and J. O'Rourke, editors, Handbook of Discrete and Computational Geometry, chapter 43, pages 797-814. CRC Press LLC, Boca Raton, 1997.

184. Joseph O'Rourke. Computational Geometry in C (Second Edition). Cambridge Uni­versity Press, 1998.

185. Th. Ottmann, S. Schuierer, and S. Soundaralakshmi. Enumerating extreme points in higher dimensions. In E. W. Mayer and C. Pueh, editors, 12th Annual Symposium on Theoretical Aspects of Computer Science, pages 562-570. Springer-Verlag, 1995.

186. N. R. Pal and J. Biswas. Cluster validity using graph theoretic concepts. Pattern Recognition, 30:847-857, 1997.

187. Christos H. Papadimitriou and Jon Louis Bentley. A worst-case analysis of nearest neighbor searching by projection. In Automata, Languages and Programming, volume 85, pages 470-482. Springer-Verlag, 1980.

188. Y. Park and J. Sklansky. Automated design of multiple-class piecewise linear classi­fiers. Journal of Classification, 6:195-222, 1989.

189. Y. Park and J. Sklansky. Automated design of linear tree classifiers. Pattern Recog­nition, 23:1393-1412, 1990.

190. Y. H. Park and S. Y. Bang. A new neural network model based on nearest neighbor classifier. In Proc. IEEE International Joint Conference on Neural Networks, pages 2386-2389, Singapore, November 1991.

191. M. S. Paterson and F. F. Yao. On nearest-neighbor graphs. In Automata, Languages and Programming, volume 623, pages 416-426. Springer, 1992.

192. C. S. Penrod and T. J. Wagner. Another look at the edited nearest neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, 7:92-94, 1977.

193. William A. Porter and Wei Liu. Efficient exemplars for classifier design. Journal of the Franklin Institute, 332B(2):155-172, 1995.

194. Carey Priebe. Olfactory classification via randomly weighted nearest neighbors. Johns Hopkins University, Baltimore, Maryland, 1998. Technical Report No. 585.

195. Carey E. Priebe, Jason G. DeVinney, and David J. Marchette. On the distribution of the domination number for random class cover catch digraphs. Statistics and Probability Letters, 55:239-246, 2001.

196. Carey E. Priebe, David J. Marchette, Jason G. DeVinney, and Diego Socolinsky. Classification using class cover catch digraphs. Tech. Report January 15, Johns Hop­kins University, Baltimore, 2002.

197. Demetri Psaltis, Robert R. Snapp, and Santosh S. Venkatesh. On the finite sample performance of the nearest neighbor classifier. IEEE Transactions on Information Theory, 40:820-837, 1994.

198. John D. Radke. On the shape of a set of points. In Godfried T. Toussaint, editor, Computational Morphology, pages 105-136. North Holland, Amsterdam, 1988.

199. V. Ramasubramanian and K. K. Paliwal. Voronoi projection-based fast nearest-neighbour search algorithms: box search and mapping table-based search techniques. Digital Signal Processing, 7:260-277, 1997.

200. S. Ramaswami, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM SIGMOD Conference, pages 427-438, 2000.

201. S. V. Rao and Asish Mukhopadhyay. Fast algorithms for computing β-skeletons and their relatives. Pattern Recognition, 34:2163-2172, 2001.

202. Thomas Reinartz. A unified view on instance selection. Data Mining and Knowledge Discovery, 6:191-210, 2002.

203. Francesco Ricci and Paolo Avesani. Data compression and local metrics for nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intel­ligence, 21:380-384, April 1999.

204. Joee Riquelme, Jesus S. Aguilar-Ruiz, and Miguel Toro. Finding representative pat­terns with ordered projections. Pattern Recognition, 36:1009-1018, 2003.

205. G. L. Ritter, H. B. Woodruff, S. R. Lowry, and T. L. Isenhour. An algorithm for a selective nearest neighbor decision rule. IEEE Transactions on Information Theory, 21:665-669, November 1975.

206. Paul L. Rosin and Freddy Fierens. The effects of data filtering on neural network learning. Neurocomputing, 20:155-162, 1998.

207. R. Royall. A Class of Nonparametric Estimators of a Smooth Regression Function. Stanford University, Stanford, California, 1966. Ph.D. Thesis.

208. S. L. Salzberg. A nearest hyperrectangle learning method. Machine Learning, 6:251-276, 1991.

209. Steven Salzberg, Arthur L. Delcher, David Heath, and Simon Kasif. Best-case results for nearest-neighbor learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:599-608, 1995.

210. J. S. Sanchez. High training set size reduction by space partitioning and prototype abstraction. Pattern Recognition, 37:1561-1564, 2004.

211. J. S. Sanchez, R. Barandela, A. I. Marques, R. Alejo, and J. Badenas. Analysis of new techniques to obtain quality training sets. Pattern Recognition Letters, 24:1015-1022, 2003.

212. J. S. Sanchez, F. Pla, and F. J. Ferri. On the use of neighborhood-based non-parametric classifiers. Pattern Recognition Letters, 18:1179-1186, 1997.

213. J. S. Sanchez, F. Pla, and F. J. Ferri. Prototype selection for the nearest neighbor rule through proximity graphs. Pattern Recog. Lett, 18:507-513, 1997.

214. J. S. Sanchez, F. Pla, and F. J. Ferri. Improving the k-NCN classification rule through heuristic modifications. Pattern Recognition Letters, 19:1165-1170, 1998.

215. V. Vijaya Saradhi and M. Narasimha Murty. Bootstrapping for efficient handwritten digit recognition. Pattern Recognition, 34:1047-1056, 2001.

216. Rosa A. Schiavo and David J. Hand. Ten more years of error rate research. Interna­tional Statistical Review, 68:295-310, 2000.

217. M. F. Schilling. Multivariate two-sample tests based on nearest neighbors. Journal of the American Statistical Association, 81:799-806, 1986.

218. M. Scuturici, J. Clech, V. Scuturici, and D. A. Zighed. Topological representation

model for image database query. Journal of Experimental and Theoretical Artificial Intelligence, pages 1-16, 2005.

219. M. Scuturici, J. Clech, and D. A. Zighed. Topological query in image databases. In Proceedings of the 8th Ibero-American Congress on Pattern Recognition, pages 144-151, Havana, Cuba, 2003.

220. M. Sebban, S. Rabaseda, and D. A. Zighed. Feature selection for a gait identification model. In G. Ritschard, A. Berchtold, F. Due, and D. A. Zighed, editors, Appren-tissage: des Principes Naturels aux Methodes Artificielles, pages 139-150. Hermes, Paris, 1998.

221. M. Sebban, D. A. Zighed, and S. Di Palma. Selection and statistical validation of features and prototypes. In J. M. Zytkow and J. Rauch, editors, 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases, Lecture Notes in Artificial Intelligence 1704, pages 184-192. Springer, Prague, 1999.

222. Marc Sebban and Richard Nock. A hybrid filter/wrapper approach of feature selec­tion using information theory. Pattern Recognition, 35:835-846, 2002.

223. Marc Sebban, Richard Nock, and Stephane Lallich. Boosting neighborhood-based classifiers. In Proceedings of the 18th International Conference on Machine Learning. Williams College, MA, 2001.

224. Raimund Seidel. Exact upper bounds for the number of faces in d-dimensional Voronoi diagrams. In P. Gritzman and B. Sturmfels, editors, Applied Geometry and Discrete Mathematics: The Victor Klee Festschrift, volume 4 of DIM ACS Series in Discrete Mathematics and Theoretical Computer Science, pages 517-530. AMS Press, Providence, RI, 1991.

225. S. Sen and L. Knight. A genetic prototype learner. In C. S. Mellish, editor, Proceed­ings of the 14th International Joint Conference on Artificial Intelligence, volume 1, pages 725-731. Morgan Kaufmann, San Mateo, CA, 1995.

226. Michael I. Shamos. Geometric complexity. In Proc. 7th Annual ACM Symposium on the Theory of Computing, pages 224-253, 1975.

227. R. Short and K. Fukunaga. The optimal distance measure for nearest neighbor clas­sification. IEEE Transactions on Information Theory, 27:622-627, 1981.

228. David B. Skalak. Using a genetic algorithm to learn prototypes for case retrieval and classification. In Proceedings of the AAAI-9S Case-Based Reasoning Workshop, pages 64-69. Washington, D.C., 1993.

229. David B. Skalak. Prototype and feature selection by sampling and random mutation hill climbing algorithms. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 293-301. New Brunswick, NJ, 1994.

230. Jack Sklansky and Leo Michelotti. Locally trained piecewise linear classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2:101-111, 1980.

231. M. Skurichina, S. Raudys, and R. P. W. Duin. K-nearest neighbors directed noise injection in multilayer perceptron training. IEEE Transactions on Neural Networks, 11(2):504-511, March 2000.

232. B. Smyth and M. T. Keane. Remembering to forget. In Proceedings of the 14th International Conference of Artificial Intelligence, pages 377-382, 1995.

233. S. Gavin Smyth. Designing multilayer perceptrons from nearest-neighbor systems. IEEE Transactions on Neural Networks, 3:329-333, March 1992.

234. Michael Soss. The size of the open sphere of influence graph in L∞ metric spaces. In Proceedings Tenth Canadian Conference on Computational Geometry, pages 108-109, Montreal, Quebec, Canada, 1998.

235. Michael Soss. On the size of the Euclidean sphere of influence graph. In Proceedings Eleventh Canadian Conference on Computational Geometry, pages 43-46, Vancou-

ver, British Columbia, Canada, 1999. 236. C. Stanfill and D. L. Waltz. Toward memory-based reasoning. Communications of

the ACM, 29:1213-1228, December 1986. 237. Ioana Stanoi, Divyakant Agrawal, and Amr El Abbadi. Reverse nearest neighbor

queries for dynamic databases. In A CM Workshop on Research Issues in Data Mining and Knowledge Discovery, 2002.

238. C. Stone. Consistent nonparametric regression. Annals of Statistics, 8:1348-1360, 1977.

239. T.-H. Su and R.-C. Chang. On constructing the relative neighborhood graph in Euclidean k-dimensional spaces. Computing, 46:121-130, 1991.

240. N. Syed, H. Liu, and K. Sung. A study of support vectors on model independent example selection. In Proceedings of the International Conference on Knowledge Dis­covery and Data Mining, pages 272-276. New York, 1999.

241. H. Tahani, B. Plummer, and N. S. Hemamalini. A new data reduction algorithm for pattern classification. In Proceedings ICASSP, pages 3446-3449, 1996.

242. Hiroshi Tenmoto, Mineichi Kudo, and Masaru Shimbo. Piecewise linear classifiers with an appropriate number of hyperplanes. Pattern Recognition, 31:1627-1634, 1998.

243. I. Tomek. A generalization of the k-nn rule. IEEE Transactions on Systems, Man and Cybernetics, 6:121-126, 1976.

244. I. Tomek. Two modifications of CNN. IEEE Transactions on Systems, Man and Cybernetics, 6:769-772, 1976.

245. G. T. Toussaint, B. K. Bhattacharya, and R. S. Poulsen. The application of Voronoi diagrams to nonparametric decision rules. In Computer Science and Statistics: The Interface, pages 97-108, Atlanta, 1985.

246. G. T. Toussaint and R. S. Poulsen. Some new algorithms and software implementa­tion methods for pattern recognition research. In Proc. IEEE Int. Computer Software Applications Conf, pages 55-63, Chicago, 1979.

247. Godfried T. Toussaint. Comments on a modified figure of merit for feature selection in pattern recognition. IEEE Transactions on Information Theory, 17:618-620, 1971.

248. Godfried T. Toussaint. Note on optimal selection of independent binary-valued fea­tures for pattern recognition. IEEE Transactions on Information Theory, 17:618, 1971.

249. Godfried T. Toussaint. Feature evaluation with quadratic mutual information. In­formation Processing Letters, 1:153-156, 1972.

250. Godfried T. Toussaint. Bibliography on estimation of misclassification. IEEE Trans­actions on Information Theory, 20:472-479, 1974.

251. Godfried T. Toussaint. On information transmission, nonparametric classification, and measuring dependence between random variables. In Proceedings of Symposium on Statistics and Related Topics, pages 30.01-30.08, Ottawa, October 1974.

252. Godfried T. Toussaint. On some measures of information and their application to pattern recognition. In Proceedings of the Conference on Measures of Information and Their Applications, Indian Institute of Technology, Bombay, August 16-18 1974.

253. Godfried T. Toussaint. On the divergence between two distributions and the prob­ability of misclassification of several decision rules. In Proceedings of the Second International Joint Conference on Pattern Recognition, pages 27-34, Copenhagen, 1974.

254. Godfried T. Toussaint. Some properties of Matusita's measure of affinity of several distributions. Annals of the Institute of Statistical Mathematics, 26:389-394, 1974.

255. Godfried T. Toussaint. Sharper lower bounds for discrimination information in terms

of variation. IEEE Transactions on Information Theory, 21:99-100, 1975. 256. Godfried T. Toussaint. A generalization of Shannon's equivocation and the Fano

bound. IEEE Transactions on Systems, Man and Cybernetics, 7:300-302, 1977. 257. Godfried T. Toussaint. An upper bound on the probability of misclassification in

terms of the affinity. Proceedings of the IEEE, 65:275-276, 1977. 258. Godfried T. Toussaint. Probability of error, expected divergence and the affinity of

several distributions. IEEE Transactions on Systems, Man and Cybernetics, 8:482-485, 1978.

259. Godfried. T. Toussaint. Algorithms for computing relative neighbourhood graph. Electronics Letters, 16(22):860, 1980.

260. Godfried T. Toussaint. Pattern recognition and geometrical complexity. In Fifth In­ternational Conference on Pattern Recognition, pages 1324-1347, Miami, December 1980.

261. Godfried T. Toussaint. The relative neighbourhood graph of a finite planar set. Pattern Recognition, 12:261-268, 1980.

262. Godfried T. Toussaint. A graph-theoretical primal sketch. In Godfried T. Tous­saint, editor, Computational Morphology, pages 229-260. North-Holland, Amster­dam, Netherlands, 1988.

263. Godfried T. Toussaint. A counterexample to Tomek's consistency theorem for a condensed nearest neighbor rule. Pattern Recognition Letters, 15:797-801, 1994.

264. Godfried T. Toussaint and Robert Menard. Fast algorithms for computing the planar relative neighborhood graph. In Proc. Fifth Symposium on Operations Research, pages 425-428, University of Koln, August 1980.

265. Yuen-Hsien Tseng and Ja-Ling Wu. On a constant-time low-complexity winner-take-all neural network. IEEE Transactions on Computers, 44:601-604, 1995.

266. John W. Tukey. Mathematics and picturing data. In Proceedings of the International Congress of Mathematics, pages 523-531, Vancouver, 1974.

267. R. Urquhart. Graph theoretical clustering based on limited neighborhood sets. Pat­tern Recognition, 15:173-187, 1982.

268. P. M. Vaidya. An O(n log n) algorithm for the all-nearest-neighbors problem. Discrete and Computational Geometry, 4:101-115, 1989.

269. I. Vajda. A contribution to the informational analysis of pattern. In Satosi Watanabe, editor, Methodologies of Pattern Recognition, pages 509-519. Academic Press, New York, 1969.

270. V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. 271. Pascal Vincent and Yoshua Bengio. K-local hyperplane and convex distance nearest

neighbor algorithms. Tech. Report 1197, Dept. IRO, University of Montreal, June 2001.

272. Jason De Vinney and Carey Priebe. Class cover catch digraphs. Discrete Applied Mathematics, in press.

273. Terry J. Wagner. Convergence of the edited nearest neighbor. IEEE Transactions on Information Theory, 19:696-697, September 1973.

274. Dietrich Wettschereck and David W. Aha. Weighting features. In Proceedings of the First International Conference on Case-Based Reasoning, Sesimbra, Portugal, 1995.

275. Dietrich Wettschereck, David W. Aha, and Takao Mohri. A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms. Ar­tificial Intelligence Review, 11:273-314, 1997.

276. Gordon Wilfong. Nearest neighbor problems. In Proc. 7th Annual ACM Symposium on Computational Geometry, pages 224-233, 1991.

277. D. L. Wilson. Asymptotic properties of nearest neighbor rules using edited data.

IEEE Transactions on Systems, Man and Cybernetics, 2:408-420, 1972. 278. D. Randall Wilson and Tony R. Martinez. Instance pruning techniques. In D. Fisher,

editor, Machine Learning: Proceedings of the Fourteenth International Conference, pages 404-411. Morgan Kaufmann Publishers, San Francisco, CA, 1997.

279. D. Randall Wilson and Tony R. Martinez. An integrated instance-based learning algorithm. Computational Intelligence, 16:1-28, 2000.

280. D. Randall Wilson and Tony R. Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38:257-286, 2000.

281. M. Anthony Wong. A hybrid clustering method for identifying high-density clusters. Journal of the American Statistical Association, 77(380):841-847, 1982.

282. Yingquan Wu, Krassimir Ianakiev, and Venu Govindaraju. Improved k-nearest neighbor classification. Pattern Recognition, 35:2311-2318, 2002.

283. H. Yan. Prototype optimization for nearest neighbor classifiers using a two-layer perceptron. Pattern Recognition, 26:317-324, 1993.

284. Congjun Yang and King-Ip Lin. An index structure for efficient reverse nearest neigh­bor queries. In Proceedings of the Seventh International Database Engineering and Applications Symposium, 2002.

285. Chen-Wen Yen, Chieh-Neng Young, and Mark L. Nagurka. A vector quantization method for nearest neighbor classifier design. Pattern Recognition Letters, 25:725-731, 2004.

286. Charles T. Zahn. Graph theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 20:68-86, 1971.

287. Hongbin Zhang and Guangyu Sun. Optimal reference subset selection for nearest neighbor classification by tabu search. Pattern Recognition, 35:1481-1490, 2002.

288. J. Zhang. Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Machine Learning Workshop, pages 470-479. Morgan Kaufmann Publishers, San Mateo, CA, 1992.

289. Qiangfu Zhao. Stable on-line evolutionary learning of nearest-neighbor multilayer perceptron. IEEE Transactions on Neural Networks, 8:1371-1378, November 1997.

290. Qiangfu Zhao and Tatsuo Higuchi. Evolutionary learning of nearest-neighbor multi­layer perceptron. IEEE Transactions on Neural Networks, 7:762-767, May 1996.

291. D. A. Zighed, S. Lallich, and F. Muhlenbach. Separability index in supervised learn­ing. In T. Elomaa, H. Mannila, and H. Toivonen, editors, Principles of Data Mining and Knowledge Discovery, pages 475-487. Springer-Verlag, Berlin, 2002.

292. D. A. Zighed, S. Lallich, and F. Muhlenbach. A statistical approach for separability of classes. Applied Stochastic Models in Business and Industry, 2005.
