Nearest Neighbors by Neighborhood Counting

Hui Wang

Abstract—Finding nearest neighbors is a general idea that underlies many artificial intelligence tasks, including machine learning, data mining, natural language understanding, and information retrieval. This idea is explicitly used in the k-nearest neighbors algorithm (kNN), a popular classification method. In this paper, this idea is adopted in the development of a general methodology, neighborhood counting, for devising similarity functions. We turn our focus from neighbors to neighborhoods, where a neighborhood is a region in the data space covering the data point in question. To measure the similarity between two data points, we consider all neighborhoods that cover both data points, and we propose to use the number of such neighborhoods as a measure of similarity. Neighborhood can be defined for different types of data in different ways. Here, we consider one definition of neighborhood for multivariate data and derive a formula for such similarity, called the neighborhood counting measure or NCM. NCM was tested experimentally in the framework of kNN. Experiments show that NCM is generally comparable to VDM and its variants, the state-of-the-art distance functions for multivariate data, and, at the same time, is consistently better for relatively large k values. Additionally, NCM consistently outperforms HEOM (a mixture of Euclidean and Hamming distances), the "standard" and most widely used distance function for multivariate data. NCM has a computational complexity of the same order as the standard Euclidean distance function, is task independent, and works for numerical and categorical data in a conceptually uniform way. The neighborhood counting methodology is thus shown experimentally to be sound for multivariate data. We hope it will also work for other types of data.

Index Terms—Pattern recognition, machine learning, nearest neighbors, distance, similarity, neighborhood counting measure.

1 INTRODUCTION

Finding nearest neighbors is a general idea that underlies many artificial intelligence tasks, including machine learning, data mining, natural language understanding, and information retrieval [8]. This idea is explicitly used in the k-nearest neighbors algorithm (kNN) [13], a popular pattern classification method.

kNN uses a distance or similarity^1 function to find the k nearest neighbors of a data point in question and classifies this data point by, usually, a majority vote over the known class labels of the nearest neighbors. Therefore, the key to kNN is a distance/similarity function.

There are many distance/similarity functions for different types of data, for example, Euclidean distance for numerical data, Hamming distance for categorical data, edit distance for sequences, and maximal subgraph for graphs.

Complex real-world applications may generate new types of data or new combinations of existing types of data. Additionally, domain knowledge may need to be considered in calculating similarity. This may call for new distance/similarity functions. There is now growing interest in building domain knowledge models from a collection of natural language documents and using such a model to infer semantic similarity between words and between documents and to perform concept-based information retrieval [28]. At the center of these efforts is the design of a suitable similarity function.

To devise a new similarity function, we can take either an ad hoc approach or a principled approach, where we need a methodology to guide us.

The nearest neighbor idea is adopted in our endeavor to develop a general^2 methodology, neighborhood counting, for devising similarity functions. We turn our focus from neighbors to neighborhoods, where a neighborhood is a region in the data space covering the data point in question. To measure the similarity between two data points, we consider all neighborhoods that cover both data points. We propose using the number of such neighborhoods as a generic measure of similarity, which can then serve as a methodology.

To validate the methodology, we consider multivariate data. We use one definition of neighborhood based on hypertuples^3 and derive a formula for such similarity, called the neighborhood counting measure or simply NCM. NCM was tested in the framework of kNN. Experimental evaluations show that NCM is quite competitive compared with some of the widely used distance functions.

The NCM similarity is conceptually simple and straightforward to implement. It has a computational complexity of the same order as the standard Euclidean distance function, is task independent, and works for numerical and categorical data in a conceptually uniform way. This suggests that the neighborhood counting methodology is sound for multivariate data.

The rest of this paper is organized as follows: Section 2 presents a review of kNN with a focus on the distance functions commonly used in kNN for multivariate data. The general idea of the methodology is presented in Section 3. The methodology is applied to multivariate data and the NCM similarity formula thus devised is presented in Section 4, in the form of a procedure that shows the line of reasoning by which the formula was derived. The similarity function is evaluated and the experimental results are presented and analyzed in Section 5. The paper concludes with a summary and a discussion of possible future work.

The author is with the School of Computing and Mathematics, Faculty of Engineering, University of Ulster at Jordanstown, BT37 0QB, Northern Ireland, UK, and LITA, Universite de Metz, Ile du Saulcy, 57045 Metz Cedex, France. E-mail: [email protected].

Manuscript received 15 Apr. 2005; revised 23 Sept. 2005; accepted 12 Oct. 2005; published online 13 Apr. 2006. Recommended for acceptance by L. Kuncheva. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-0201-0405.

1. Similarity is usually considered the converse of distance [28], so we use similarity and distance interchangeably in this paper.

2. We use the word "general" to mean that the methodology is intended to work for a broad range of data types.

3. Neighborhood can be defined in different ways.

2 A REVIEW OF k-NEAREST NEIGHBOR RULE

The kNN method [13] is a simple yet effective method for classification in the areas of pattern recognition, machine learning, data mining, and information retrieval. It has been successfully used in a variety of real-world applications. kNN can be very competitive with the state-of-the-art classification methods [12], [17].

kNN is a lazy learning method, which defers processing of training data until a query (a data point in question) needs to be answered. This usually involves storing the training data in memory and finding relevant data to answer a particular query. Relevance is often measured using a distance function, with nearby points having high relevance [2]. This type of learning is also referred to as memory-based reasoning [24] or instance-based learning [18].

A successful application of kNN depends on a suitable distance function and a choice of k. The distance function puts data points in order according to their distance to the query, and k determines how many data points are selected and used as neighbors. Classification is usually done by voting among the neighbors. The vote of each neighbor can even be weighted, usually inversely, by its distance to the query.

There exist many distance functions in the literature. No distance function is known to perform consistently well, even under some conditions; no value of k is known to be consistently good, even under some circumstances. In other words, the performance of distance functions is unpredictable. This makes the use of kNN highly experience-dependent.

Although kNN is no longer considered state-of-the-art, it serves as a perfect vehicle through which new distance functions are tested and evaluated.

2.1 Majority Voting kNN Rule

To use kNN, a distance function, d, is needed. For a query t, a set of k data points nearest to t (in terms of d values) is selected. t is then assigned to the class represented by a majority of its k nearest neighbors. This rule is nowadays usually called the majority voting kNN rule.
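To make the rule concrete, the following is a minimal sketch of a majority-voting kNN classifier in Python. It is illustrative only: the function names, the list-based data layout, and the generic distance argument d are our assumptions, not part of the paper.

```python
from collections import Counter

def knn_majority_vote(query, training_data, labels, d, k):
    """Classify `query` by a majority vote among its k nearest neighbors.

    training_data: list of data points; labels: their class labels;
    d: a distance function d(x, y); k: the number of neighbors to use.
    """
    # Rank training points by their distance to the query.
    ranked = sorted(range(len(training_data)),
                    key=lambda i: d(training_data[i], query))
    # Vote over the class labels of the k nearest points.
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```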

Cover and Hart [7] have shown that, as the number N of data points and k both tend to infinity in such a manner that k/N → 0, the error rate of the kNN rule approaches the optimal Bayes error rate.

2.2 Distance Weighted kNN Rule

In voting kNN, the k neighbors are implicitly assumed to have equal weight in the decision, regardless of their distances to the query, t. It is conceptually appealing to give different weights to the k neighbors based on their distances to the query, with closer neighbors having greater weights. As a result, we can in principle take k to be the number of all given data points, thus simplifying the use of kNN.

Let $x_1, x_2, \ldots, x_k$ be the k nearest neighbors of t arranged in increasing order of $d(x_i, t)$, so $x_1$ is the first nearest neighbor of t. Dudani [11] proposes to assign to the ith nearest neighbor $x_i$ a weight

$$w_i = \frac{d(x_k, t) - d(x_i, t)}{d(x_k, t) - d(x_1, t)} \quad \text{if } d(x_k, t) \neq d(x_1, t), \qquad w_i = 1 \quad \text{if } d(x_k, t) = d(x_1, t).$$

Query t is assigned to the class for which the weights of the representatives of the class among the k nearest neighbors sum to the greatest value. This rule was shown by Dudani [11] to yield lower error rates than those obtained using the voting kNN rule. However, some other researchers reached less optimistic conclusions [3], [19], [8]. Denoeux [9] provides an excellent and detailed review of distance-weighted kNN.

2.3 Distance Functions in kNN

Distance functions are needed in many modern algorithms. In particular, a distance function is at the center of any kNN method. A variety of distance functions are available in the literature, including the Euclidean, Hamming, Minkowski, Mahalanobis, Canberra, Chebyshev, Quadratic, Correlation, Chi-square, and hyperrectangle distance functions [22], [10], the Value Difference Metric [24], and the Minimal Risk Metric [5].

Distance is the converse of similarity, so distance is sometimes also called dissimilarity. One way to transform between distance and similarity is to take the reciprocal, the standard method for transforming between resistance and conductance in physics and electronics [28]. In a general sense, similarity or dissimilarity measures the degree of coincidence or divergence between two objects.

In this section, we review some of the functions, or combinations of them, that are frequently used in kNN.

2.3.1 Scale of Measurement: Attribute Types

Before we discuss distance functions, we need to discuss the types of attribute from the scale-of-measurement point of view.

The scale of measurement of a variable (attribute) in mathematics and statistics describes how much information the values associated with the variable contain. Different mathematical operations on variables are possible, depending on the scale at which a variable is measured.

Four scales of measurement are usually recognized [25]:

- Nominal scale. The values in the domain of a variable are names or labels, which can be, and often are, replaced by verbal names. The only operations that can be meaningfully applied to variable values are "equality" and "inequality."

- Ordinal scale. The values have all the features of nominal scales and are numerical: they represent the rank order (first, second, third, etc.) of the objects measured. Comparisons of "greater" and "less" can be made, in addition to "equality" and "inequality."

- Interval scale. The values have all the features of the ordinal scale and, additionally, are separated by the same interval. In this case, differences between arbitrary pairs of values can be meaningfully compared. Operations such as "addition" and "subtraction" are therefore meaningful. Additionally, negative values on the scale can be used.

- Ratio scale. The values have all the features of the interval scale and also have meaningful ratios between arbitrary pairs of numbers. Operations such as "multiplication" and "division" are therefore meaningful. The zero value on a ratio scale is nonarbitrary. Most physical quantities, such as mass, length, or energy, are measured on ratio scales.


Nominal attributes are sometimes also called categorical attributes and ratio attributes are sometimes loosely called numerical variables in the literature, a term we adopt throughout this paper. In pattern recognition and machine learning applications, the most frequently encountered attributes are nominal, ordinal, or ratio [16].

In this paper, we consider two types of attribute: categorical and numerical. The operations applicable to categorical attributes include "equality" and "inequality," whereas those applicable to numerical attributes additionally include "greater" or "less than," "addition" and "subtraction," "multiplication," and "division."

2.3.2 Euclidean Distance

The Euclidean distance function is probably the most commonly used in any distance-based algorithm. It is defined as

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{a=1}^{n} (x_a - y_a)^2},$$

where x and y are two data vectors and n is the number of attributes (variables). Euclidean distance applies to numerical attributes only.

2.3.3 Hamming Distance

Hamming distance is usually used, directly or indirectly, for categorical attributes in kNN (see, for example, [26]). The Hamming distance between two data vectors is the number of attributes in which they differ (do not match).

2.3.4 Heterogeneous Euclidean-Overlap Metric

The Heterogeneous Euclidean-Overlap Metric (HEOM) [29] uses the overlap, or Hamming distance, for categorical attributes and the normalized Euclidean distance for numerical attributes. The distance between two values x and y of an attribute a is $d_a(x, y) = 1$ if x or y is unknown, $d_a(x, y) = \mathrm{overlap}(x, y)$ if a is categorical, and $d_a(x, y) = \mathrm{diff}_a(x, y)$ if a is numerical. Here, $\mathrm{overlap}(x, y) = 0$ if $x = y$ and 1 otherwise, and $\mathrm{diff}_a(x, y) = |x - y| / \mathrm{range}_a$, where $\mathrm{range}_a = \max_a - \min_a$ is used to normalize the values of attribute a.

Then, the distance between two (possibly heterogeneous, i.e., having a mixture of categorical and numerical attributes) data vectors x and y is given by

$$\mathrm{HEOM}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{a=1}^{n} d_a(x_a, y_a)^2}. \qquad (1)$$
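A minimal sketch of HEOM following (1). The attribute metadata (which indices are categorical and the range of each numerical attribute) is passed in explicitly; the parameter names and the use of None for unknown values are assumptions for illustration.

```python
import math

def heom(x, y, categorical, ranges):
    """HEOM(x, y) per (1).

    categorical: set of attribute indices that are categorical.
    ranges: dict mapping a numerical attribute index to (min_a, max_a).
    Missing values are represented as None.
    """
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa is None or ya is None:
            da = 1.0                              # unknown value
        elif a in categorical:
            da = 0.0 if xa == ya else 1.0         # overlap distance
        else:
            lo, hi = ranges[a]
            da = abs(xa - ya) / (hi - lo)         # range-normalized difference
        total += da * da
    return math.sqrt(total)
```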

2.3.5 Value Difference Metric

The Value Difference Metric (VDM) [24] was introduced to provide an appropriate distance function for categorical attributes. A simplified version of the VDM (without the weighting schemes) defines the distance between two values x and y of attribute a as

$$\mathrm{vdm}_a(x, y) = \sum_{c=1}^{C} \left| \frac{N_{a,x,c}}{N_{a,x}} - \frac{N_{a,y,c}}{N_{a,y}} \right|^q = \sum_{c=1}^{C} |P_{a,x,c} - P_{a,y,c}|^q,$$

where $N_{a,x}$ is the number of data vectors in the training set T that have value x for attribute a, $N_{a,x,c}$ is the number of data vectors in T that have value x for attribute a and output class c, C is the number of output classes in the problem domain, q is a constant, usually 1 or 2, and $P_{a,x,c}$ is the conditional probability that the output class is c given that attribute a has the value x.

Using the distance function $\mathrm{vdm}_a(x, y)$, two values are considered to be closer if they have more similar classifications. The distance between x and y is

$$\mathrm{VDM}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{a=1}^{n} \mathrm{vdm}_a(x_a, y_a)}. \qquad (2)$$

The original VDM algorithm [24] makes use of attribute weights, which are not included in the above equations. Some variants of VDM [6], [21], [10] have used alternative weighting schemes. A well-known variant is the Modified Value Difference Metric (MVDM) [6], [21]. It does not use attribute weights as in VDM but, instead, uses instance weights, which are determined according to their performance history.
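For illustration only, a sketch of the simplified (unweighted) VDM of (2). The factory make_vdm_a, the data layout, and the parameter names are assumptions; it simply estimates the class-conditional probabilities by counting over the training set.

```python
from collections import defaultdict
import math

def make_vdm_a(data, labels, q=2):
    """Build vdm_a(a, x, y) from training data: counts N_{a,x,c} and N_{a,x}."""
    n_axc = defaultdict(int)                  # (attribute, value, class) -> count
    n_ax = defaultdict(int)                   # (attribute, value) -> count
    classes = sorted(set(labels))
    for row, c in zip(data, labels):
        for a, v in enumerate(row):
            n_axc[(a, v, c)] += 1
            n_ax[(a, v)] += 1

    def vdm_a(a, x, y):
        total = 0.0
        for c in classes:
            p_x = n_axc[(a, x, c)] / n_ax[(a, x)] if n_ax[(a, x)] else 0.0
            p_y = n_axc[(a, y, c)] / n_ax[(a, y)] if n_ax[(a, y)] else 0.0
            total += abs(p_x - p_y) ** q      # |P_{a,x,c} - P_{a,y,c}|^q
        return total

    return vdm_a

def vdm(x, y, vdm_a):
    """VDM(x, y) per (2): square root of the sum of vdm_a over all attributes."""
    return math.sqrt(sum(vdm_a(a, xa, ya)
                         for a, (xa, ya) in enumerate(zip(x, y))))
```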

2.3.6 Heterogeneous Value Difference Metric

The Euclidean distance function is inappropriate for categorical attributes and VDM is inappropriate for continuous attributes, so neither is sufficient on its own for use on a heterogeneous application, i.e., one with both categorical and numerical attributes. The Heterogeneous Value Difference Metric (HVDM) [29] combines the two distance functions so that different distance functions are used for different types of attributes. It is defined as follows:

$$\mathrm{HVDM}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{a=1}^{n} d_a(x_a, y_a)^2}. \qquad (3)$$

The function $d_a(x, y)$ returns a distance between the two values x and y for attribute a and is defined as $d_a(x, y) = 1$ if x or y is unknown, $\mathrm{vdm}_a(x, y)$ if a is categorical, and $\mathrm{diff}_a(x, y)$ if a is numerical.

The function HVDM is similar to the function HEOM, except that it uses VDM instead of Hamming distance for categorical attributes.
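A sketch of HVDM per (3), following the simplified definition given in the text (without the extra normalization used in [29]); the vdm_a callable and attribute metadata are assumptions, e.g., produced by the previous sketches.

```python
import math

def hvdm(x, y, categorical, ranges, vdm_a):
    """HVDM(x, y) per (3), simplified form.

    vdm_a(a, xa, ya) is any callable computing the per-attribute VDM distance,
    e.g., built from training-set statistics as in the previous sketch."""
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa is None or ya is None:
            da = 1.0                              # unknown value
        elif a in categorical:
            da = vdm_a(a, xa, ya)                 # VDM for categorical attributes
        else:
            lo, hi = ranges[a]
            da = abs(xa - ya) / (hi - lo)         # normalized difference
        total += da * da
    return math.sqrt(total)
```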

2.3.7 Interpolated Value Difference Metric

VDM is, by definition, applicable to categorical attributes only. The Interpolated Value Difference Metric (IVDM) [29] extends VDM to numerical attributes. In the learning phase, it uses discretization to collect statistics and determine values of $P_{a,x,c}$ (in the VDM formula) for the continuous values occurring in the training data, but it then retains the continuous values for later use. In the testing phase, the value of $P_{a,y,c}$ for a continuous value y is interpolated between two other values of P, namely, $P_{a,x_1,c}$ and $P_{a,x_2,c}$, where $x_1 \le y \le x_2$. IVDM is, in fact, performing a nonparametric probability density estimation to determine the values of P for each class.

In IVDM, continuous values are discretized into s equal-width intervals. A value x of attribute a is discretized as $\mathrm{discretize}_a(x) = x$ if a is discrete, $s$ if $x = \max_a$, and $\lfloor (x - \min_a)/w_a \rfloor + 1$ otherwise, where $w_a = (\max_a - \min_a)/s$.

For two data vectors x and y, the distance function for IVDM is defined as

$$\mathrm{IVDM}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{a=1}^{n} \mathrm{ivdm}_a(x_a, y_a)^2}, \qquad (4)$$


where $\mathrm{ivdm}_a$ is defined as $\mathrm{ivdm}_a(x, y) = \mathrm{vdm}_a(x, y)$ if a is categorical and $\sum_{c=1}^{C} |P_{a,c}(x) - P_{a,c}(y)|^2$ otherwise. The interpolated probability value $P_{a,c}(x)$ of a continuous value x for attribute a and class c is

$$P_{a,c}(x) = P_{a,u,c} + \left( \frac{x - \mathrm{mid}_{a,u}}{\mathrm{mid}_{a,u+1} - \mathrm{mid}_{a,u}} \right) (P_{a,u+1,c} - P_{a,u,c}),$$

where $\mathrm{mid}_{a,u}$ and $\mathrm{mid}_{a,u+1}$ are the midpoints of two consecutive discretized ranges such that $\mathrm{mid}_{a,u} \le x < \mathrm{mid}_{a,u+1}$. $P_{a,u,c}$ is the probability value of the discretized range u, which is taken to be the probability value of the midpoint of range u (and similarly for $P_{a,u+1,c}$). The value of u is found by first setting $u = \mathrm{discretize}_a(x)$ and then subtracting 1 from u if $x < \mathrm{mid}_{a,u}$. The value of $\mathrm{mid}_{a,u}$ can be calculated easily. IVDM is quite complex and, like VDM, it applies to classification only.

2.3.8 Discretized Value Difference Metric

The Discretized Value Difference Metric (DVDM) is the same as IVDM except that DVDM need not retain the original continuous values; instead, it uses only the discretized values in the calculation. DVDM is defined as follows:

$$\mathrm{DVDM}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{a=1}^{n} |\mathrm{vdm}_a(\mathrm{discretize}_a(x_a), \mathrm{discretize}_a(y_a))|^2}, \qquad (5)$$

where $\mathrm{vdm}_a$ and $\mathrm{discretize}_a$ are both defined earlier.
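A hedged sketch of the equal-width discretization and of DVDM per (5). The function names and metadata layout are assumptions, and it presumes the vdm statistics behind the supplied vdm_a callable were collected on the discretized training data.

```python
import math

def discretize(a, v, categorical, ranges, s):
    """discretize_a(v): identity for categorical attributes; otherwise one of
    s equal-width bins numbered 1..s, with the maximum mapped to bin s."""
    if a in categorical:
        return v
    lo, hi = ranges[a]
    if v == hi:
        return s
    w = (hi - lo) / s
    return int((v - lo) // w) + 1

def dvdm(x, y, categorical, ranges, s, vdm_a):
    """DVDM(x, y) per (5): VDM applied to discretized values."""
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        dx = discretize(a, xa, categorical, ranges, s)
        dy = discretize(a, ya, categorical, ranges, s)
        total += vdm_a(a, dx, dy) ** 2
    return math.sqrt(total)
```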

2.3.9 Minimal Risk Metric

Blanzieri and Ricci [5] introduce the Minimal Risk Metric (MRM), a probability-based distance function for nearest neighbor classification and case-based reasoning. It is a function that directly minimizes the risk of misclassification.

Given a data point x in class $c_i$ and a neighbor y, the finite risk of misclassifying x is given by $p(c_i|x)(1 - p(c_i|y))$. The total finite risk is the sum of the risks extended to all the different classes and is given, as the MRM distance between x and y, by

$$\mathrm{MRM}(\mathbf{x}, \mathbf{y}) = r(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{C} p(c_i|\mathbf{x})(1 - p(c_i|\mathbf{y})), \qquad (6)$$

where C is the number of classes.

A key element in MRM is the estimation of $p(c_i|x)$. Any probability estimation technique can be used, for example, a naive Bayes estimator or a Gaussian kernel estimator. The applicability of MRM depends much on the underlying probability estimator, but, clearly, MRM can be used only for classification.
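A minimal sketch of the MRM distance of (6). The class-probability estimator is supplied externally (e.g., obtained from a naive Bayes model); its dictionary interface is an assumption.

```python
def mrm(x, y, class_probs):
    """MRM(x, y) per (6): sum over classes of p(c|x) * (1 - p(c|y)).

    class_probs(z) is any estimator returning a dict class -> p(class | z)."""
    px, py = class_probs(x), class_probs(y)
    return sum(p * (1.0 - py.get(c, 0.0)) for c, p in px.items())
```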

2.3.10 Discussion

Euclidean distance and Hamming distance are elementary distance functions that can be used to construct complex functions. The Euclidean distance function handles only numerical attributes, while the Hamming and VDM functions handle only categorical attributes. The remaining distance functions handle both categorical and numerical attributes or a mixture of both. HEOM is straightforward to implement, while HVDM, IVDM, and DVDM have more complex formulas to implement.

It was shown [29] that IVDM achieved higher average accuracy than HEOM, HVDM, and DVDM, albeit not consistently. MRM was shown [5] to have an edge over the VDM variants, but it has a much higher computational complexity. Our experiments showed that a straightforward implementation of MRM using the naive Bayes estimator is over 10 times more expensive computationally than a straightforward implementation of HEOM or DVDM.

As for task dependency, the Euclidean, Hamming, and HEOM distance functions are task independent, whereas DVDM, HVDM, IVDM, and MRM are task dependent: they work only for classification tasks.

3 MEASURING DISTANCE THROUGH NEIGHBORHOOD COUNTING

Generally speaking, distance or similarity measures the degree of divergence or coincidence between two data points [15], [28]. A distance function is usually nonnegative and symmetric.^4 If it further satisfies the triangle inequality, it is then a metric. Not all distance functions are metrics.

In this section, we present a methodology for devising similarity functions. To introduce the methodology, we consider an example. Fig. 1a shows a 2D data space along with three data points, where we want to find out which of a and b is more similar (closer) to x. Intuitively, a is clearly more similar to x than b. To quantify this intuition, we can use the Euclidean distance (see Fig. 1b).

We consider a generic approach. We draw circles around x and then count how many such circles contain a or b, as shown in Fig. 2a. In this figure, six circles contain a and two circles contain b. Alternatively, we draw rectangles around x and then count how many such rectangles contain a or b, as shown in Fig. 2b. In this figure, six rectangles contain a and two rectangles contain b. It is intuitively sensible that the higher this count of circles or rectangles, the more similar the data points are.


4. There are counterexamples to this property [14].

Fig. 1. (a) Which of a and b is closer to x? (b) Euclidean distance.

Fig. 2. (a) Circle-based neighborhood. (b) Rectangle-based neighborhood.

In more technical terms, the circles are neighborhoods. Neighborhood is one of the basic concepts in topology. Intuitively speaking, a neighborhood is a set of points such that you can "move" the points a bit without leaving the set, and a neighborhood of a point is a neighborhood containing the point. Formally, let U be a (topological) space and t be a point in U. A neighborhood of t is a set that contains an open set containing t [14].

Using the notion of neighborhood, we now state the above intuition formally. Let U be a (topological) data space and t and x be two points in U. To measure the similarity between t and x, we count the number of all neighborhoods that contain both t and x. We call this count the cover of t and x, denoted by cov(t, x). The higher cov(t, x) is, the more similar t and x are to each other.

The cover measure is clearly nonnegative and symmetric. It is a generic measure of similarity, as the neighborhood concept is very general, and the measure has to be specified if we want to use it for a particular type of data. It can thus be considered a methodology for devising similarity functions.

Example 1. To further illustrate the concepts of neighborhood and cover and the neighborhood counting idea, we consider Fig. 2a again, where x is a query data point and a and b are two other data points. The concentric circles are examples of the neighborhoods defined by some distance function (e.g., Euclidean distance). Here, each neighborhood is a region (disc) in the data space (here, the subspace of the 2D Euclidean space) delineated by a circle, and a neighborhood of data point x is a disc that contains x. We note the following:

- Neighborhood circles do not need to be concentric, in general; we draw concentric circles here just for clarity of presentation because, otherwise, the figure would have been very messy.

- Only a sample of the neighborhoods, not all of them, are shown here for illustration.

- There are other ways of defining neighborhoods. Fig. 2b shows a different interpretation of neighborhood: axis-parallel rectangles, which are defined by hypertuples (see the next section).

Now, we come back to Fig. 2a. In order to find out which of a and b is closer to x, we consider the 10 circles, the neighborhoods of x, and count how many of these circles also contain a or b. Clearly, there are six circles covering a and two circles covering b, i.e., cov(x, a) = 6 and cov(x, b) = 2. Therefore, the similarity between x and a is 6, and the similarity between x and b is 2. The same conclusion can be drawn from the rectangle-based neighborhoods.
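The circle-counting intuition of Example 1 can be mimicked with a small simulation: sample random discs that contain x and count how many also contain a or b. This is only an illustration of the counting idea (the coordinates, sampling scheme, and all names are our assumptions), not the formula derived in the next section.

```python
import math
import random

def disc_contains(center, radius, p):
    return math.dist(center, p) <= radius

def sampled_cover(x, other, num_discs=10_000, max_radius=10.0, seed=0):
    """Count randomly sampled discs that contain x and also contain `other`."""
    rng = random.Random(seed)
    count = 0
    for _ in range(num_discs):
        # Sample a disc that is a neighborhood of x: a random radius and a
        # random center lying within that radius of x.
        r = rng.uniform(0.0, max_radius)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        dist = rng.uniform(0.0, r)
        center = (x[0] + dist * math.cos(angle), x[1] + dist * math.sin(angle))
        if disc_contains(center, r, other):
            count += 1
    return count

x, a, b = (0.0, 0.0), (1.0, 1.0), (4.0, 3.0)
print(sampled_cover(x, a), sampled_cover(x, b))   # a's count exceeds b's
```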

4 NEIGHBORHOOD COUNTING FOR RELATIONAL DATA

To validate the neighborhood counting methodology, we consider multivariate data in this section and present a definition of neighborhood. We then derive a formula for the cover.

Given a data point, under any neighborhood definition, there may be a large number of neighborhoods; this number may even be infinite for numerical data. If we want to consider all possible neighborhoods, then the computational complexity is potentially very high. In the following, we present two approaches to counting all neighborhoods: a straightforward one that has high computational cost and a formula-based one that has much lower computational cost.

4.1 Neighborhood

In this section, we introduce a non-distance-based interpretation of neighborhood for multivariate data.

4.1.1 Notations

Multivariate data are described by attributes. Let $R = \{a_1, a_2, \ldots, a_n\}$ be a set of attributes and $\mathrm{dom}(a_i)$ be the domain of attribute $a_i \in R$. An attribute can be categorical or numerical. We assume that, if attribute $a_i$ is categorical, then $\mathrm{dom}(a_i)$ is finite; if it is numerical, then there is a lower bound and an upper bound, denoted by $\min(a_i)$ and $\max(a_i)$, respectively, i.e., $\min(a_i) \le x \le \max(a_i)$ for any $x \in \mathrm{dom}(a_i)$.

Additionally, we define

$$J_i = \begin{cases} 2^{\mathrm{dom}(a_i)}, & \text{if } a_i \text{ is categorical} \\ \{[b_1, b_2] : b_1, b_2 \in \mathrm{dom}(a_i) \text{ and } b_1 \le b_2\}, & \text{if } a_i \text{ is numerical.} \end{cases} \qquad (7)$$

If attribute $a_i$ is categorical, $J_i$ is the power set of its domain, i.e., $2^{\mathrm{dom}(a_i)}$; if $a_i$ is numerical, $J_i$ is the set of all closed intervals of its domain, i.e., the Borel set [1], which is needed in the mathematical study of probability.

Furthermore, let $V \stackrel{\mathrm{def}}{=} \prod_{i=1}^{n} \mathrm{dom}(a_i)$ and $L \stackrel{\mathrm{def}}{=} \prod_{i=1}^{n} J_i$. V is called the data space defined by R, and L an extended data space. A (given) data set is $D \subseteq V$, a sample of V.

If we write an element $t \in V$ as $\langle v_1, v_2, \ldots, v_n \rangle$, then $v_i \in \mathrm{dom}(a_i)$. If we write $h \in L$ as $\langle s_1, s_2, \ldots, s_n \rangle$, then $s_i \in J_i$.

An element of L is called a hypertuple and an element of V a simple tuple [27]. The difference between the two is that a field in a simple tuple is a value (hence, value-based), while a field in a hypertuple is a set (hence, set-based). If we interpret $v_i \in \mathrm{dom}(a_i)$ as a singleton set $\{v_i\}$, then a simple tuple is a special hypertuple. Thus, we can embed V into L.

Consider two hypertuples, $h_1$ and $h_2$, where $h_1 = \langle s_{11}, s_{12}, \ldots, s_{1n} \rangle$ and $h_2 = \langle s_{21}, s_{22}, \ldots, s_{2n} \rangle$. We say $h_1$ is covered by $h_2$ (or $h_2$ covers $h_1$), written $h_1 \le h_2$, if, for $i \in \{1, 2, \ldots, n\}$,

$$\begin{cases} \forall x \in s_{1i},\ \min(s_{2i}) \le x \le \max(s_{2i}), & \text{if } a_i \text{ is numerical} \\ s_{1i} \subseteq s_{2i}, & \text{if } a_i \text{ is categorical.} \end{cases}$$

For a simple tuple t, $t(a_i)$ represents the projection of t onto attribute $a_i$. For a hypertuple h, $h(a_i)$ is similarly defined.

4.1.2 Neighborhoods as Hypertuples

A neighborhood is here interpreted as a hypertuple, and any hypertuple in L is regarded as a neighborhood. For any simple tuple $t \in V$, a neighborhood of t is a hypertuple h that covers t, i.e., $t \le h$.

Clearly, any simple tuple has a neighborhood because the maximal hypertuple in the extended data space, i.e., the whole data space, covers any simple tuple; hence, it is a neighborhood of any simple tuple.

4.2 The Neighborhood Counting Procedure

Having defined neighborhood, it is now time to discuss the neighborhood counting procedure. Here, we first of all assume that the domain of any numerical attribute is finite and the values are natural numbers. Later, we will extend the results to the general case, where these restrictions are relaxed.

4.2.1 Generating All Neighborhoods

A neighborhood is a hypertuple, which is an element of $\prod_i J_i$. Therefore, the number of all hypertuples is $\prod_i N'_i$, where $N'_i = |J_i|$.

According to the definition of $J_i$, if $a_i$ is categorical, then $N'_i = 2^{m_i}$, where $m_i = |\mathrm{dom}(a_i)|$. If $a_i$ is numerical, the number of distinct intervals for a numerical attribute is $N'_i = \sum_{j=0}^{m_i - 1} (m_i - j) = \sum_{j=1}^{m_i} j = m_i(m_i + 1)/2$. To summarize, the number of all hypertuples is $\prod_i N'_i$, where

$$N'_i = \begin{cases} 2^{m_i}, & \text{if } a_i \text{ is categorical} \\ m_i(m_i + 1)/2, & \text{if } a_i \text{ is numerical.} \end{cases} \qquad (8)$$

4.2.2 Generating All Neighborhoods of a Simple Tuple

A neighborhood of a query (the simple tuple in question), $t \in V$, is a hypertuple that covers t. Clearly, not all hypertuples in $\prod_i J_i$ cover t. For a hypertuple h to cover t, we must have $t(a_i) \in h(a_i)$ for all i. Therefore, to generate a neighborhood of t, we can take an $s_i \in J_i$ such that $t(a_i) \in s_i$ for all i, resulting in a hypertuple $\langle s_1, s_2, \ldots, s_n \rangle$. If $a_i$ is categorical, the number of such $s_i$ is

$$N_i = \sum_{j=0}^{m_i - 1} \binom{m_i - 1}{j} = 2^{m_i - 1},$$

since $s_i$ is any subset of $\mathrm{dom}(a_i)$ that is a superset of $\{t(a_i)\}$. If $a_i$ is numerical, this number is $N_i = (\max(a_i) - t(a_i) + 1)(t(a_i) - \min(a_i) + 1)$, since $\max(a_i) - t(a_i) + 1$ is the number of values at or above $t(a_i)$ and $t(a_i) - \min(a_i) + 1$ is the number of values at or below $t(a_i)$. Any pair of values from the two parts, respectively, forms an interval.

To summarize, the number of neighborhoods of t is $\prod_i N_i$, where

$$N_i = \begin{cases} 2^{m_i - 1}, & \text{if } a_i \text{ is categorical} \\ (\max(a_i) - t(a_i) + 1)(t(a_i) - \min(a_i) + 1), & \text{if } a_i \text{ is numerical.} \end{cases} \qquad (9)$$

4.2.3 Cover of Simple Tuples

Now, we know exactly the number of all neighborhoods in a data space and the number of all neighborhoods of a given simple tuple t. We set out to find, for a simple tuple x in the data sample D, the number of neighborhoods of t that also cover x, that is, the cover of t and x, cov(t, x).

An obvious approach is to generate all neighborhoods of t and, for each simple tuple x in D, go through all the neighborhoods and count the number of neighborhoods h such that $x \le h$. Because there is an exponential number of neighborhoods for a simple tuple, the computational complexity of this approach is too high. An alternative approach with polynomial complexity will be presented in Section 4.4.

4.3 A Toy Example

To illustrate the above procedure, we consider a toy data set, as shown in Table 1, where $a_1$ and $a_2$ are prediction (independent) attributes and c is a class (dependent) attribute. We assume the domains of attributes $a_1$ and $a_2$ are both $\{1, 2, 3, 4, 5\}$ and the domain of c is $\{+, -\}$.

The attributes can be categorical or numerical, and we will look at both cases separately.

4.3.1 Numerical Case

We assume that both (prediction) attributes are numerical. Then, the data space can be displayed as the 2D grid in Fig. 3. Each unit square represents a simple tuple, which is the coordinate of the square.

- All neighborhoods: Here, $m_i = 5$ for $i = 1, 2$. According to (8), there are $m_i(m_i + 1)/2 = 5 \cdot 6/2 = 15$ intervals for every attribute, which are [1,1], [2,2], [3,3], [4,4], [5,5], [1,2], [2,3], [3,4], [4,5], [1,3], [2,4], [3,5], [1,4], [2,5], and [1,5]. As a result, we have $15 \times 15 = 225$ hypertuples in this example. An example is $\langle [2,3], [4,5] \rangle$, which is generated by the two intervals [2,3] and [4,5] from the two attributes, respectively.

- All neighborhoods of t: Consider $t = \langle 1, 1 \rangle$. According to (9), there are $(\max(a_i) - t(a_i) + 1)(t(a_i) - \min(a_i) + 1) = 5$ intervals in each attribute that can be used to construct a neighborhood of t, and they are $\{[1,1], [1,2], [1,3], [1,4], [1,5]\}$. As a result, t has $5 \times 5 = 25$ neighborhoods. One example is $\langle [1,2], [1,4] \rangle$, which is generated by the intervals [1,2] and [1,4].

- Cover: Here, we consider all five simple tuples and all neighborhoods of t in order to calculate their covers. This is simple but tedious work, as, for each simple tuple x, we need to go through all 25 neighborhoods and check if x is covered. There is no need to document the whole process, and we only list the results here: $\mathrm{cov}(t, \langle 3,2 \rangle) = 12$, $\mathrm{cov}(t, \langle 2,3 \rangle) = 12$, $\mathrm{cov}(t, \langle 4,4 \rangle) = 4$, $\mathrm{cov}(t, \langle 4,5 \rangle) = 2$, and $\mathrm{cov}(t, \langle 5,4 \rangle) = 2$.
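The tedious counting above can be reproduced mechanically. Below is a small brute-force check (an illustration, with assumed names) that enumerates the 25 neighborhoods of t = <1, 1> over the domain {1, ..., 5} for each attribute and counts those covering each data point; it yields the cover values listed above.

```python
from itertools import product

DOMAIN = range(1, 6)                      # domain {1, ..., 5} for both attributes

def intervals_containing(v):
    """All intervals [b1, b2] of an attribute that contain the value v."""
    return [(b1, b2) for b1 in DOMAIN for b2 in DOMAIN if b1 <= v <= b2]

def cover(t, x):
    """Number of neighborhoods (pairs of intervals) covering both t and x."""
    count = 0
    for ivs in product(*(intervals_containing(ti) for ti in t)):   # neighborhoods of t
        if all(b1 <= xi <= b2 for (b1, b2), xi in zip(ivs, x)):
            count += 1
    return count

t = (1, 1)
for x in [(3, 2), (2, 3), (4, 4), (4, 5), (5, 4)]:
    print(x, cover(t, x))    # prints 12, 12, 4, 2, 2 as in the text
```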

4.3.2 Categorical Case

Now, we assume both attributes are categorical. We go through the procedure on the basis of this assumption.

- All neighborhoods: According to (8), there are $2^5 = 32$ subsets of each attribute domain that can be used to generate hypertuples. As a result, there are $32 \times 32 = 1{,}024$ hypertuples.


TABLE 1: A Toy Example

Fig. 3. Data sample displayed in the data space.

- All neighborhoods of t: According to (9), for $t = \langle 1, 1 \rangle$, there are $2^4 = 16$ subsets of each attribute domain that can be used to generate neighborhoods. Therefore, there are, altogether, $16 \times 16 = 256$ neighborhoods.

- Cover: Once again, for each simple tuple x, we need to go through all 256 neighborhoods and count the number of neighborhoods that cover x, which is clearly simple but very tedious. We list only the results here, omitting the details of how these numbers are obtained: $\mathrm{cov}(t, \langle 3,2 \rangle) = 81$, $\mathrm{cov}(t, \langle 2,3 \rangle) = 81$, $\mathrm{cov}(t, \langle 4,4 \rangle) = 81$, $\mathrm{cov}(t, \langle 4,5 \rangle) = 81$, and $\mathrm{cov}(t, \langle 5,4 \rangle) = 81$.

4.4 A Polynomial Method to Calculate Cover

The key component in the above procedure is calculating the cover of a simple tuple in D. A straightforward approach, as discussed above, would involve going through an exponential number of neighborhoods. We can see from the above section that, even for a data set as simple as the toy example, the procedure for calculating the cover is very involved. This approach is clearly not feasible in practice. Here, we present an efficient method for calculating the cover, which works in polynomial time.

Consider two simple tuples $t = \langle t_1, t_2, \ldots, t_n \rangle$ and $x = \langle x_1, x_2, \ldots, x_n \rangle$. A neighborhood h of t covers t by definition, i.e., $t \le h$. What we need to do is check whether h covers x as well. In other words, we want to find all hypertuples that cover both t and x.

Equation (9) specifies the number of all hypertuples that cover t only. We take a similar approach here: we look at every attribute and determine the number of elements in $J_i$ that can be used to generate a hypertuple covering both t and x. Multiplying these numbers across all attributes gives the number we require.

Consider attribute $a_i$. If $a_i$ is numerical, then the number of intervals that can be used to generate a hypertuple covering both $x_i$ and $t_i$ is $N_i = (\max(a_i) - \max(\{x_i, t_i\}) + 1)(\min(\{x_i, t_i\}) - \min(a_i) + 1)$. If $a_i$ is categorical, the number of subsets for the same purpose is $N_i = 2^{m_i - 1}$ if $x_i = t_i$ and $2^{m_i - 2}$ otherwise. Recall that $m_i = |\mathrm{dom}(a_i)|$.

To summarize, the number of neighborhoods of t covering x is $\mathrm{cov}(t, x) = \prod_i N_i$, where

$$N_i = \begin{cases} (\max(a_i) - \max(\{x_i, t_i\}) + 1)(\min(\{x_i, t_i\}) - \min(a_i) + 1), & \text{if } a_i \text{ is numerical} \\ 2^{m_i - 1}, & \text{if } a_i \text{ is categorical and } x_i = t_i \\ 2^{m_i - 2}, & \text{if } a_i \text{ is categorical and } x_i \neq t_i. \end{cases} \qquad (10)$$

4.5 Use of Cover as Similarity Function

We have so far derived a formula for cov(t, x). Following the methodology discussed in Section 3, this is equivalent to deriving a similarity function for multivariate data. We call this function the neighborhood counting measure, or NCM for short.


TABLE 2: General Information about Data Sets

Since we are so far working on a simplified case, where all attributes are assumed to have finite domains, we call this formula the simplified NCM, or NCMs. In the next section, we will present a general version of NCM without such an assumption. More formally, for any two tuples x and y, the simplified NCM similarity between them is

$$\mathrm{NCM}_s(x, y) = \mathrm{cov}(x, y) = \prod_{i=1}^{n} N_i, \qquad (11)$$

where n is the number of attributes and $N_i$ is given by (10).

It is clear that $\mathrm{NCM}_s(x, x) \ge \mathrm{NCM}_s(x, y)$ and $\mathrm{NCM}_s(x, y) = \mathrm{NCM}_s(y, x)$. These are properties generally required of a similarity function [20].
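A minimal sketch of the simplified NCM of (11), using the closed form (10); the parameter names and metadata layout are assumptions.

```python
def ncm_s(x, y, categorical, bounds, domain_sizes):
    """NCM_s(x, y) per (11): product over attributes of N_i from (10).

    categorical  : set of categorical attribute indices
    bounds       : dict index -> (min_a, max_a) for numerical attributes
    domain_sizes : dict index -> m_i = |dom(a_i)| for categorical attributes
    """
    sim = 1
    for i, (xi, yi) in enumerate(zip(x, y)):
        if i in categorical:
            m = domain_sizes[i]
            sim *= 2 ** (m - 1) if xi == yi else 2 ** (m - 2)
        else:
            lo, hi = bounds[i]
            sim *= (hi - max(xi, yi) + 1) * (min(xi, yi) - lo + 1)
    return sim

# On the toy example: ncm_s((1, 1), (3, 2), categorical=set(),
# bounds={0: (1, 5), 1: (1, 5)}, domain_sizes={}) returns 12.
```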

4.6 Extension of NCM

The simplified NCM was derived by reasoning under the assumption that all attributes have finite domains. If an attribute is categorical, then this assumption is reasonable and generally true in practice. If it is numerical, then the assumption is too restrictive, as, in practice, we often find data sets that have continuous (but bounded) attributes, hence their domains are infinite.

Now, we consider a general case where, if an attribute is categorical, then its domain is finite and, if it is numerical, then its domain is bounded, either finite or infinite.

The formula for the simplified NCM may be problematic in the general case: the resulting similarity may be negative and we cannot interpret it in terms of "the number of neighborhoods." Additionally, different attributes have different domain ranges and, consequently, the resulting similarity is overwhelmingly dominated by those with large domain ranges, which is usually not intended. Where there is no external knowledge as to which attributes are more important, it is common to assume that all attributes are equally important.

In the spirit of the simplified NCM, we now define a general NCM as follows:

$$\mathrm{NCM}(x, y) = \prod_{i} C(x_i, y_i) / C(x_i), \qquad (12)$$

where

$$C(x_i, y_i) = \begin{cases} (\max(a_i) - \max(\{x_i, y_i\}))(\min(\{x_i, y_i\}) - \min(a_i)), & \text{if } a_i \text{ is numerical} \\ 2^{m_i - 1}, & \text{if } a_i \text{ is categorical and } x_i = y_i \\ 2^{m_i - 2}, & \text{if } a_i \text{ is categorical and } x_i \neq y_i \end{cases}$$

and

$$C(x_i) = \begin{cases} (\max(a_i) - x_i)(x_i - \min(a_i)), & \text{if } a_i \text{ is numerical} \\ 2^{m_i - 1}, & \text{if } a_i \text{ is categorical.} \end{cases}$$

It is clear that $0 \le \mathrm{NCM}(x, y) \le 1$, $\mathrm{NCM}(x, x) = 1$, and, in general, $\mathrm{NCM}(x, y) \neq \mathrm{NCM}(y, x)$. If we need a distance rather than a similarity, we can define a distance function as $1 - \mathrm{NCM}(x, y)$.


Fig. 4. Aggregated "closeness to best in percentage" over all data sets for different k values and different distance functions. (a) Without weighting and (b) with weighting.

TABLE 3: Runtime, in Seconds, of Different Distance Functions, where k = 11 and There Is No Weighting

5 IMPLEMENTATION AND EVALUATION

The NCM function as shown in (12) is simple, so implementing an NCM-based kNN is straightforward. To calculate the NCM similarity between two tuples, we need to calculate $N_i$ for all attributes. Therefore, the computational complexity of the NCM function is O(n), the same order as that of the Euclidean distance function, where n is the number of attributes.

5.1 Missing Values

Missing values occur because they were not measured, not answered, were unknown, or were lost. Typical ways in which missing values are treated are: they are simply ignored, the tuples containing missing values are omitted, or they are replaced with the mode or mean. What is common to these solutions is the principle that any missing value should play a minimal role. We adopt the same principle and treat missing values in a way suitable for the NCM function.

Recall that the NCM similarity is a product of per-attribute factors $N_i$, where i is the attribute index. For two data tuples, t and x, if there is a missing value in t or x for attribute i, then $N_i$ is set to 1. As a result, this attribute does not contribute toward the product, i.e., the NCM similarity between t and x.
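Combining (12) with the missing-value rule above, the following is a hedged sketch of the general NCM. The handling of the 0/0 case, which arises when a numerical value lies exactly on its domain bound, is our own assumption; the paper does not specify it.

```python
def ncm(x, y, categorical, bounds, domain_sizes):
    """General NCM(x, y) per (12), with missing values (None) contributing
    a factor of 1 as described in Section 5.1."""
    sim = 1.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        if xi is None or yi is None:
            continue                                  # missing value: factor 1
        if i in categorical:
            m = domain_sizes[i]
            c_xy = 2 ** (m - 1) if xi == yi else 2 ** (m - 2)
            c_x = 2 ** (m - 1)
        else:
            lo, hi = bounds[i]
            c_xy = (hi - max(xi, yi)) * (min(xi, yi) - lo)
            c_x = (hi - xi) * (xi - lo)
        if c_x == 0:
            # x_i lies on a domain bound; (12) gives 0/0 here. We treat the
            # factor as 1 when x_i == y_i and 0 otherwise (an assumption).
            sim *= 1.0 if xi == yi else 0.0
        else:
            sim *= c_xy / c_x
    return sim
```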

5.2 Evaluation

The purpose of the evaluation is to see how effectively the NCM function works in kNN for the task of classification when k assumes various values and the attributes are numerical, categorical, or a mixture of both.

The evaluation was done through experimentation on some popular benchmark public data sets from the UCI Machine Learning Repository [4]. These data sets were carefully selected so that there is a balanced mixture of numerical and categorical attributes: some consist of numerical attributes only, some consist of categorical attributes only, and some consist of a mixture of both types. General information about these data sets is shown in Table 2, which is split into three sections corresponding to the three types of attributes above.

We implemented a kNN system, with and without weighting,^5 along with the NCM function as well as the HEOM, HVDM, IVDM, and DVDM functions. MRM is very slow, so it was not included in our experiments. In the experiments, k was set to 1, 6, 11, 16, 21, 26, 31, and "all," which means that k was set to the number of data tuples in the training data. We ran 10-fold cross-validation 10 times with random partitions of the data for each function, data set, and k value.


TABLE 4: Averages and Standard Deviations of the Accuracies when k = 11 and without Weighting, along with the Levels of Statistical Significance of the Difference between These Functions and HEOM

5. The weighting scheme was presented in Section 2.2.

We present the experimental results, along with statistical analysis results, in Tables 4, 5, and 6 for k = 11 (without weighting), k = 21 (with weighting), and k = "all," respectively. We choose these k values because 1) k = 11 and k = 21 give the best overall results (see Table 7) and 2) k = all gives interesting, consistent results.

In our statistical analysis, we use HEOM as the reference because it is the best-known distance function. Statistical significance of the difference between one of {DVDM, NCM, IVDM, HVDM} and HEOM was computed using the two-tailed unpaired Student t-test [23].^6

From Table 4 (k = 11, without weighting), we observe the following:

- On average, there is little difference between all distance functions except HVDM, which is slightly worse.

- At the 99 percent confidence level (i.e., the superscript is 5), the counts that a distance function is significantly better (positive) or worse (negative) than HEOM are:

From Table 5 (k = 21, with weighting), we observe the following:

- On average, NCM is the best.

- At the 99 percent confidence level, the counts that a distance function is significantly better or worse than HEOM are:


TABLE 5: Averages and Standard Deviations of the Accuracies when k = 21 and with Weighting, along with the Levels of Statistical Significance of the Difference between These Functions and HEOM

6. We used the Analysis ToolPak in MS-Excel for this test and recorded the probability that two samples have not come from the same underlying population. These probability values are reported in Tables 4, 5, and 6 using the following index (as superscripts): 5 for 99 percent, 4 for 98 percent, 3 for 97.5 percent, 2 for 95 percent, 1 for 90 percent, and 0 for below 90 percent; "+" indicates that the average accuracy of the function is higher than that of HEOM and "-" otherwise. Consider DVDM = 87.4 ± 1.0 with superscript +5 in Table 4 as an example. This tells us that the average accuracy of DVDM on this data set is 87.4 with a standard deviation of 1.0 and that DVDM is significantly better than HEOM with a confidence of 99 percent.

From Table 6 (k = all, with weighting), we observe the following:

- On average, NCM is clearly the best.

- At the 99 percent confidence level, the counts that a distance function is significantly better or worse than HEOM are:

Fig. 4 shows the aggregated classification accuracy of the various functions over all data sets under different k values, with and without weighting. For each data set and each k, let x be the accuracy achieved by one function and best be the best accuracy achieved by all functions. We compute $(x - best) \times 100 / best$ and aggregate these values across all data sets. This is a way to compare different methods on a range of data sets, arguably better than averaging accuracies directly across all data sets. From Fig. 4, we observe the following:

- HVDM is clearly the worst in both cases for k > 6.

- NCM generally performed well for relatively large k. With weighting, it consistently outperformed all other functions when k > 11. To explain this, a theoretical study is needed, which is beyond the scope and page limit of this paper.

- NCM consistently outperformed HEOM when k > 1 without weighting and when k > 11 with weighting. This is an interesting observation, as they are the only functions that are task independent and hence more directly comparable.

The computational complexity of the five functions is reflected in their runtimes. Table 3 shows the runtime of the functions when k = 11 and there is no weighting. We can see that NCM is not the fastest among them, though it is not far from the fastest.


TABLE 6: Averages and Standard Deviations of the Accuracies when k = all and with Weighting, along with the Levels of Statistical Significance of the Difference between These Functions and HEOM

TABLE 7: Average Classification Accuracy over Different k Values, by All Methods on All Data Sets

6 CONCLUSION

This paper presents a methodology, neighborhood counting, for devising similarity functions. The methodology is based on the intuition that two data points are closer if they are covered by more common neighborhoods.

Applying this methodology to multivariate data, we obtained a similarity function (NCM) based on a definition of neighborhood. This function handles both numerical and categorical attributes in a conceptually uniform way, and it is defined without the use of class information, so it can be used for classification and clustering. This function has a simple, easy-to-implement formula and a computational complexity of the same order as the Euclidean distance function.

Experiments, in the framework of kNN with and without weighting, show that the NCM function is generally comparable with some of the state-of-the-art distance functions (HEOM and VDM variants). It consistently performed better than these functions for relatively large k values, especially with weighting. This was distinctively true when all data tuples were used as neighbors. Although we may not want to consider all data tuples as neighbors in practice, this is an interesting property not present in the other distance/similarity functions, which we believe deserves further study.

These observations suggest that the methodology is sound for multivariate data. This gives us reason to believe that this methodology can also be applied to other types of data to devise new similarity functions, as long as appropriate definitions of neighborhood are available.

Future work will include the application of the neighborhood counting methodology to other types of data in order to devise new or better similarity functions.

ACKNOWLEDGMENTS

The help by Mr. Liang Liang with the experiments is gratefully appreciated.

REFERENCES

[1] R.B. Ash and C. Doleans-Dade, Probability and Measure Theory. Academic Press, 2000.
[2] C.G. Atkeson, A.W. Moore, and S. Schaal, "Locally Weighted Learning," Artificial Intelligence Rev., vol. 11, nos. 1-5, pp. 11-73, 1997.
[3] T. Baily and A.K. Jain, "A Note on Distance-Weighted k-Nearest Neighbor Rules," IEEE Trans. Systems, Man, and Cybernetics, vol. 8, no. 4, pp. 311-313, 1978.
[4] C.L. Blake and C.J. Merz, UCI Repository of Machine Learning Databases, 1998.
[5] E. Blanzieri and F. Ricci, "Probability Based Metrics for Nearest Neighbor Classification and Case-Based Reasoning," Lecture Notes in Computer Science, vol. 1650, pp. 14-29, 1999.
[6] S. Cost and S. Salzberg, "A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features," Machine Learning, vol. 10, pp. 57-78, 1993.
[7] T.M. Cover and P.E. Hart, "Nearest Neighbour Pattern Classification," IEEE Trans. Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
[8] Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, B.V. Dasarathy, ed. Los Alamitos, Calif.: IEEE CS Press, 1991.
[9] T. Denoeux, "A k-Nearest Neighbor Classification Rule Based on Dempster-Shafer Theory," IEEE Trans. Systems, Man, and Cybernetics, vol. 25, pp. 804-813, 1995.
[10] P. Domingos, "Rule Induction and Instance-Based Learning: A Unified Approach," Proc. 1995 Int'l Joint Conf. Artificial Intelligence, 1995.
[11] S.A. Dudani, "The Distance-Weighted k-Nearest-Neighbor Rule," IEEE Trans. Systems, Man, and Cybernetics, vol. 6, pp. 325-327, 1976.
[12] C. Elkan, "Results of the KDD '99 Classifier Learning Contest," Sept. 1999, http://www.cs.ucsd.edu/users/elkan/clresults.html.
[13] E. Fix and J.L. Hodges, "Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties," Technical Report TR4, US Air Force School of Aviation Medicine, Randolph Field, Tex., 1951.
[14] Wikimedia Foundation, Wikipedia, The Free Encyclopedia, http://www.wikipedia.org, 2006.
[15] P. Gardenfors, Conceptual Spaces: The Geometry of Thought. The MIT Press, 2000.
[16] D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining. MIT Press, 2001.
[17] H. Hayashi, J. Sese, and S. Morishita, "Optimization of Nearest Neighborhood Parameters for KDD-2001 Cup 'the Genomics Challenge'," technical report, Univ. of Tokyo, 2001, http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/WS/PDFfiles/Morishita.pdf.
[18] T.M. Mitchell, Machine Learning. McGraw-Hill Companies, Inc., 1997.
[19] R.L. Morin and D.E. Raeside, "A Reappraisal of Distance-Weighted k-Nearest Neighbor Classification for Pattern Recognition with Missing Data," IEEE Trans. Systems, Man, and Cybernetics, vol. 11, no. 3, pp. 241-243, 1981.
[20] H. Osborne and D. Bridge, "Models of Similarity for Case-Based Reasoning," Proc. Interdisciplinary Workshop Similarity and Categorisation, pp. 173-179, 1997.
[21] J. Rachlin, S. Kasif, S. Salzberg, and D.W. Aha, "Towards a Better Understanding of Memory-Based and Bayesian Classifiers," Proc. 11th Int'l Machine Learning Conf., pp. 242-250, 1994.
[22] S. Salzberg, "A Nearest Hyperrectangle Learning Method," Machine Learning, vol. 6, pp. 251-276, 1991.
[23] G.W. Snedecor and W.G. Cochran, Statistical Methods. Ames, Iowa: Iowa State Univ. Press, 2002.
[24] C. Stanfill and D. Waltz, "Toward Memory-Based Reasoning," Comm. ACM, vol. 29, pp. 1213-1229, 1986.
[25] S.S. Stevens, Mathematics, Measurement, and Psychophysics (Handbook of Experimental Psychology). Wiley, 1951.
[26] G. Towell, J. Shavlik, and M. Noordewier, "Refinement of Approximate Domain Theories by Knowledge-Based Neural Networks," Proc. Eighth Nat'l Conf. Artificial Intelligence, pp. 861-866, 1990.
[27] H. Wang, I. Duntsch, G. Gediga, and A. Skowron, "Hyperrelations in Version Space," Int'l J. Approximate Reasoning, vol. 36, no. 3, pp. 223-241, 2004.
[28] D. Widdows, Geometry and Meaning. Univ. of Chicago Press, 2004.
[29] D.R. Wilson and T.R. Martinez, "Improved Heterogeneous Distance Functions," J. Artificial Intelligence Research, vol. 6, pp. 1-34, 1997.

Hui Wang received the BSc degree in computer science from Jilin University of China in 1985 and then went on a three-year MSc program in artificial intelligence. From 1988 to 1992, he worked at Jilin University as a lecturer. In early 1993, he went to Northern Ireland to begin a doctoral program and completed the DPhil degree in informatics in 1996. He worked at the University of Ulster as a lecturer from 1996 to 2002 and was promoted to senior lecturer in 2002. Dr. Wang has been working in the field of artificial intelligence for 18 years. His research interests include pattern recognition, machine learning, data/text mining, uncertainty reasoning, spatial reasoning, and information retrieval.

