

BOSTON UNIVERSITY

GRADUATE SCHOOL OF ARTS AND SCIENCES

Thesis

NETWORK POSITIONING FOR ONLINE NEAREST

NEIGHBORS SEARCH

by

XIN QI

B.E., Beijing University of Posts and Telecommunications, 1997
M.E., Beijing University of Posts and Telecommunications, 2000

Submitted in partial fulfillment of the

requirements for the degree of

Master of Arts

2005


Approved by

First Reader

Richard West, Ph.D.

Assistant Professor of Computer Science.

Second Reader

Shang-Hua Teng, Ph.D.

Professor of Computer Science.

Third Reader

George Kollios, Ph.D.

Assistant Professor of Computer Science.


Acknowledgments

Most special thanks to my advisor Richard West for all the inspiration and freedom
during the past three years of research work. These three years of working together
have made me more appreciative of the beauty of systems research. The culture he
developed in the BOSS group has brought all of my good colleagues together. Gary
Wong and Gabriel Parmer taught me about the happy part of the work.

I would also like to thank Dihan, who initiated the idea of this thesis, and Professor
Shang-Hua Teng, who gave me many insightful comments on the work.

I could never have done this without the encouragement of my wife. Her
support gave me strength during all the hard times.


NETWORK POSITIONING FOR ONLINE NEAREST

NEIGHBORS SEARCH

XIN QI

ABSTRACT

Current network embedding schemes have focused on preserving a stable quantity,

thought of as “network distance”, in a corresponding embedding space. This quantity

is usually taken to be the minimum round trip time (minRTT). However, for appli-

cations such as peer-to-peer systems and content-delivery networks, the performance

generally depends on instantaneous round trip time as influenced by current network

conditions (e.g., congestion in routers). Moreover, rather than determining actual

distances between all hosts, it is usually only necessary to determine the closest hosts

among an instantaneous sub-group of the hosts sending requests.

In this thesis, we propose a network embedding scheme using Lipschitz embedding

into the L∞ norm by choosing landmark nodes from outside those in a specific group of

hosts. This scheme makes possible an on-line refinement algorithm to guide network

measurement so as to quickly find the closest host with the smallest instantaneous

RTT, while radically reducing the number of measurements required. Compared

with other embedding schemes, our scheme also achieves better accuracy for short

distances, thereby improving the likelihood of finding the true nearest neighbors.

Complementary to this scheme is an approach to compensate for violations of the

triangle inequality, which is the foundation of the metric embedding. We show that

with the aid of an off-line network embedding mechanism based on the L∞ norm,

together with an on-line refinement algorithm to adjust for instantaneous network
fluctuations, a host in the network can effectively find its nearest neighbor, or neighbor

set, among a large number of candidates.


Contents

1 Introduction 1

2 Network Embedding 5

3 Network Positioning for Nearest Neighbors 9

3.1 Problem Definition and Approach . . . . . . . . . . . . . . . . . . . . 9

3.2 Geometric Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2.1 Lipschitz Embedding . . . . . . . . . . . . . . . . . . . . . . . 11

3.2.2 Contractive Embedding . . . . . . . . . . . . . . . . . . . . . 11

3.2.3 An On-line Refinement Algorithm . . . . . . . . . . . . . . . . 12

3.3 Landmark Selection in L∞ . . . . . . . . . . . . . . . . . . . . . . . . 14

3.3.1 The analysis for 2-d Euclidean space . . . . . . . . . . . . . . 16

3.3.2 The analysis for 3-d Euclidean space . . . . . . . . . . . . . . 23

3.4 Outside-max-distance algorithm . . . . . . . . . . . . . . . . . . . . . 25

4 Experimental Evaluation 27

4.1 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.2 Compensation for the Violation of Triangle Inequality . . . . . . . . . 28

4.3 Contractive Embedding: Tradeoff between Structure Error and Embedding Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31


4.4 Comparison with GNP . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.4.1 Pairwise Distance Predicting . . . . . . . . . . . . . . . . . . . 34

4.4.2 Nearest Neighbor Searching . . . . . . . . . . . . . . . . . . . 35

4.5 Comparison Using INET Data . . . . . . . . . . . . . . . . . . . . . . 36

4.5.1 Pairwise Distance Predicting . . . . . . . . . . . . . . . . . . . 37

4.5.2 Nearest Neighbor Searching . . . . . . . . . . . . . . . . . . . 38

5 Related Work 42

6 Conclusions and Future Work 44


1 Introduction

The growth of the Internet has stimulated the development of applications that are

largely data- rather than CPU-intensive. Examples include streaming media delivery,

interactive distance learning, web-casting, group simulations, live auctions, and stock

brokerage applications. Additionally, peer-to-peer (P2P) systems such as Gnutella [7],

Freenet [5], Kazaa [11] and more recent variations (e.g., Chord [26], CAN [23], Pas-

try [24] and Tapestry [30]) have become popular due to their ability to efficiently

locate and retrieve data of particular interest.

While the Internet continues to grow, there is an enormous untapped potential

to use distributed computing resources for large scale applications. Some projects

such as SETI@home [17] have already realized this potential, by utilizing spare com-

pute cycles on off-the-shelf computers to process application-specific data. Similarly,

grid computing technologies (e.g., Globus [3]) are attempting to coordinate and share

computing resources on the scale of the Internet, to support emerging applications in

areas such as scientific computing. However, there has been only limited research on

the construction of scalable distributed systems to support the delivery and process-

ing of high bandwidth, and potentially real-time, data streams transported over the

Internet [10]. An Internet-wide distributed system for data stream processing would

be desirable for applications such as interactive distance learning, tele-medicine, and

live video broadcasts, bringing together potentially many thousands of people located

in different regions of the world. Further, such a system would pave the way for new

applications, such as a distributed and adaptive traffic management system, and,

generally, any application that requires the dissemination of sensor data.

Essentially, applications are emerging that have requirements that extend beyond


what a classical client/server paradigm can provide. In this model, a client issues

a remote procedure call (RPC) to a server, sending the request and receiving the

reply via IP. Some applications exist that would be better served by a pipelined,

publisher/subscriber model. The delivery of QoS constrained media streams to an

extremely large population of interested end-hosts is a task ideally suited for the

publisher/subscriber model.

One of the projects the BOSS (Boston University Operating System and Services)

group is focusing on [20, 6, 22] is to present such a system based on pipelined pro-

cessing of data streams. Generally, this system can be utilized by any application

domain that requires the scalable, QoS and resource aware delivery and processing of

data-streams.

Essentially, data streams can be transferred over an overlay topology of end-hosts

on the Internet and at each hop processing can be performed on the stream by Stream

Processing Agents (SPAs). These SPAs can take the form of a QoS and resource aware

router, a filter to extract from a stream relevant information, an entity to perform

transformations on the data, an agent to build multi-cast trees, a splitting agent that

could separate the stream over two links to distribute bandwidth usage, or an agent

to merge these streams.

For example, a number of sensors, perhaps cameras, could publish the raw video

they capture onto the distributed system. It would be routed through the overlay via a

sequence of end-host intermediary nodes. At each of these intermediaries, a number of

SPAs can be applied to the data-stream to, for instance, compress and filter the data

so that a mobile device can easily receive and display the sensor output. A certain

population will subscribe to these data-streams. These subscribers are the destination


end-hosts and will display the video. They will also act as intermediaries, possibly

further routing streams to which they are currently subscribed to other interested

end-hosts, or perhaps applying SPAs to other data-streams. The distributed system

is responsible for providing certain QoS levels to each of the subscribers and each

stream will be processed and routed according to these constraints.

One of the fundamental aspects of this work is to leverage off-the-shelf computers

on the Internet to build a scalable distributed system for the routing and processing of

data streams. Of particular interest is the use of overlay networks to deliver real-time

media streams from one or more publishing nodes to potentially many thousands of

subscribers, each with their own quality-of-service (QoS) requirements. While many

existing P2P systems form overlay networks on top of the underlying physical network,

they do not yet support the timely delivery of data streams between publishers and

subscribers. For example, Pastry, Chord, and CAN all build logical topologies for the

purpose of forwarding messages to locate an object in a decentralized system, rather

than transporting data streams. Moreover, these systems make no attempt to route

data in accordance with latency and bandwidth requirements.

Most current overlay technologies depend on the random assignment of logical ids

to physical hosts, to achieve a logarithmic bound on the number of hops to locate

information of interest. One important concern is how to correlate the logical position

of a host in the overlay with its physical or geographical position. To address this

concern, one may ask, “Could we provide a GPS-like system, assigning a coordinate

to each node in the Internet so that when a node wants to join an overlay, it could

make use of its geographical information for further optimization?” The answer to

this question relates to the network positioning or network embedding problem. For


this reason, the rest of this thesis focuses on the network positioning problem and

embedding schemes that capture geographic information. Specifically, we present a

Lipschitz embedding scheme in the L∞ norm to capture physical distances between

hosts in an overlay. Our approach achieves better accuracy for determining short

distances, thereby improving the likelihood of finding the true nearest neighbors,

amongst a set of hosts that dynamically join and depart the overlay system.

The rest of the thesis is organized as follows. Section 2 describes further
motivation for network embedding. Our network positioning scheme is then analyzed

in Section 3, which also includes a more rigorous definition of the problem being ad-

dressed. The experimental evaluation of our approach is outlined in Section 4. This is

followed by a discussion of related work in Section 5. Finally, conclusions and future

work are mentioned in Section 6.


2 Network Embedding

The continued growth of the Internet, coupled with the decreasing price-performance

ratio of off-the-shelf systems, has paved the way for applications that utilize end-

system multicast [10], content-delivery networks [6] and peer-to-peer (P2P) systems [24,

26, 23]. The scale of these applications has opened new areas of research concerning

the structure of overlay networks [15, 6], as well as efficient means of trading state

versus messages to locate and retrieve information. Consequently, the problem of

identifying “nearby” hosts via which data can be retrieved or propagated is central to

many emerging Internet-based applications. For systems encompassing thousands or

even millions of hosts on the scale of the Internet, it would be impractical to directly

measure (e.g., using ping-style ICMP¹ messages) the “distances” between all pairs

of hosts. This has led to research work that focuses on the estimation of distances

between hosts, without requiring probing messages to be sent to every other host in

the entire system.

Many emerging approaches to derive distances between hosts rely on geometric

embedding techniques that map hosts to specific coordinates in a logical vector space.

Specifically, these “network positioning” schemes assign coordinates to nodes (e.g.,

hosts on the Internet) based on measurements to a fixed set of designated nodes, called

landmarks. Using such coordinates, the network distance (e.g., in terms of round-trip

propagation and/or transmission delay) between pairs of nodes is predicted without

explicit measurement.

Various embedding techniques to reduce the error between real and derived dis-

tances of all nodes in a data set include: (1) triangle-based heuristic solutions [4, 9];

¹ Internet Control Message Protocol


(2) nonlinear optimization methods [18, 25]; and (3) approaches using a combina-

tion of Lipschitz embedding and dimensionality reduction [27, 13]. While all these

techniques attempt to derive actual distances between nodes, many applications only

require knowledge of their nearest neighbors, or nearest neighbor sets. For example,

in unstructured P2P networks such as Gnutella or Napster, it is desirable for a client

to download files from the closest peer or, at least, one that is not far away. Simi-

larly, in a content distribution network, the aim is to minimize both the hop count

and/or latency along the path over which data is exchanged. This has an impact on the

design of overlay networks that implement logical routing topologies such as k-ary

n-cubes [6] and de Bruijn graphs [15] over the underlying physical network. For such

overlay networks, especially those that attempt to route QoS-constrained data such

as voice and video to many thousands of destinations, it is important for data to be

propagated through intermediaries that are geographically close.

The contributions of this thesis are therefore concerned with the design of a net-

work positioning scheme that accurately determines the nearest neighbor (or set of

neighbors) for a given host, amongst a dynamic subset of hosts taken from a global

set on the scale of the Internet. As with other work on network positioning, we lever-

age embedding techniques that map host positions into coordinates within a normed

vector space. The challenges faced by such a scheme include: (1) determining the

most appropriate normed vector space for deriving host coordinates, (2) deciding on

the number and location of landmarks used for coordinate derivation, and (3) dealing

with underlying network conditions that violate geometric requirements of the chosen

embedding method.

Considering all of the above problems, we propose a contractive embedding scheme,


in which embedded distances are always less than the corresponding real distances.

As will be shown, contractive embedding enables us to use an on-line refinement al-

gorithm, to compensate for “embedding errors” and thereby guarantee we find the

real nearest neighbor. In practice, this on-line refinement method can also be used

to send probing messages with other QoS related parameters such as bandwidth and

CPU utilization for further QoS-related optimization.

With respect to the above on-line refinement algorithm, we find that L∞ is the best

normed vector space for our contractive embedding. In L∞ space, we provide theo-

retical analysis regarding the number and positions of the landmarks for an effective

embedding. We observe that with L∞ normed vector space, the prediction of the

nearest neighbor depends only on a single landmark that yields the smallest error.

This makes our approach more robust against failures and dynamic changes in land-

mark sets. Both our analysis and experimental results show that L∞ provides

smaller prediction errors for shorter real distances. Moreover, even though our

scheme preserves overall pairwise distances less accurately than other methods

(e.g., non-linear optimization algorithms such as GNP [18]), its prediction of nearest

neighbors is as good as theirs, and typically better when combined with our on-line

refinement algorithm.

The final contribution of this work deals with underlying network conditions that

violate geometric requirements of embedding methods. Specifically, the “triangle

inequality” is a fundamental requirement of our embedding approach, and the nature

of physical networks such as the Internet occasionally leads to violations of this

constraint. We, therefore, propose a method to compensate for the violation of the

triangle inequality by adding a fixed offset to all of the edges of a corresponding met-


ric. In practice, this offset could be a global “remedy value”, which is added to the

probed “delay” between an end host and each landmark. However, if this “remedy

value” is too large, it will adversely affect the embedding errors. We therefore inves-

tigate the size of the remedy value needed to ensure the triangle inequality, while preventing

a further increase in embedding errors.


3 Network Positioning for Nearest Neighbors

3.1 Problem Definition and Approach

Before describing our network positioning scheme in detail, we begin by defining the

problem explicitly addressed in this thesis, as follows:

Given a set of hosts, S, in which each member, h ∈ S, has information about its

distance to all hosts in a subset L ⊂ S, find the nearest neighbor, hj, to a specific

host hi such that hi, hj ∈ S.

One efficient practical method to solve this nearest neighbor query problem is

to use a coordinate-based network positioning (or embedding) scheme [18]. How-

ever, unlike prior approaches that preserve as much information as possible about

actual distances between all pairs of hosts in a given set, S, we need to preserve

just enough information to determine nearest neighbors. Specifically, our approach

involves an off-line landmark selection scheme coupled with an on-line refine-

ment method to successively eliminate candidate hosts until the nearest neighbor is

found.

Having selected a subset of hosts, L, as landmarks, each host, h, in set S is given

a vector-based coordinate that represents the distance of h to each member of L. In

other words, each host, h, is given a coordinate based on the Lipschitz embedding

scheme. In the second stage, the on-line refinement algorithm first ranks each host

in S in increasing embedded distance from hi and then probes the actual distance to

successive hosts, starting with the host predicted to be nearest. Using a contractive

embedding scheme, the refinement algorithm is able to eliminate members of S, to


quickly converge on the host, hj, nearest to hi.

Having briefly outlined the problem and our approach, we now describe the actual

details of our embedding scheme, along with the landmark selection method and

refinement algorithm.

3.2 Geometric Embedding

In this thesis, we follow the terminology used in [8]. A tuple (S, d) is said to be

a finite metric space if S is a finite set of cardinality N and d : S × S → R+ is

a distance metric. A great deal of work has been done on embedding finite metric

spaces into low-dimensional real-normed spaces, which serve as the basis of a distance

metric. Usually, the norm is one of the Lp norms, ‖x‖p = (∑i |xi|^p)^(1/p). Distance

metrics based on such a norm are often termed Minkowski metrics when p ≥ 1. The

most common Minkowski metrics are the Euclidean distance metric (L2), the City

Block distance metric (L1), and the Chessboard distance metric (L∞).

Formally, we call a finite metric space (S, d) an lp-metric if there is a mapping

F : S → Rk such that ||F (x) − F (y)||p = d(x, y); we will often denote this by d ∈ lp.

To denote the fact that the lp space has k dimensions, we will also write the space
as l^k_p.

Before going on, we need to clarify several terms. The word “embedding” is a generic

term for any form of mapping from one space into another: it can be a mapping

from a distance metric into a norm space, or from a high dimensional norm space

to a low dimensional norm space. A distance-preserving embedding will be called

isometric. However, it is rare to find an isometric embedding between two spaces of
interest, and hence we often have to allow the mappings

to alter distances, thereby leading to some degree of distortion (or embedding error).

3.2.1 Lipschitz Embedding

Lipschitz embedding is a kind of geometric embedding, which is defined in terms of

a set R of subsets of S, R = {A1, A2, ..., Ak}. The subsets Ai are termed reference

sets. Let d(o, A) be an extension of the distance function d to a subset A ⊆ S such

that d(o, A) = minx∈A{d(o, x)}. An embedding with respect to R is defined as a

mapping F such that F (o) = (d(o, A1), d(o, A2), ..., d(o, Ak)). In other words, what

we are doing is defining a coordinate space where each axis corresponds to a subset

Ai ⊆ S of the objects, and the coordinate values of the object o are the distances from

o to the closest element in each of Ai. The intuition behind the Lipschitz embedding

is that, if x is an arbitrary object in the data set S, some information about the

distance between two arbitrary objects o1 and o2 is obtained with the aid of d(o1, x)

and d(o2, x), i.e., the value |d(o1, x) − d(o2, x)|. In particular, due to the triangle

inequality, we have |d(o1, x) − d(o2, x)| ≤ d(o1, o2).

In this thesis, as with other work in network embedding [27], we only consider

Lipschitz embeddings in which each reference set Ai is a singleton; these singletons

form the set of landmarks. Hence, the coordinates of a node in our scheme are the

network distances between that node and each landmark.
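To make the construction concrete, the singleton-reference-set coordinates can be sketched as follows (an illustrative sketch only: the host names, landmark names, and RTT values are hypothetical, and `rtt` stands in for an actual network measurement):

```python
# Sketch of Lipschitz coordinates with singleton reference sets: each
# host's coordinate vector holds its measured distance (e.g., minRTT)
# to every landmark. All names and values here are illustrative.

def lipschitz_coords(host, landmarks, dist):
    """Map `host` to the vector (d(host, l1), ..., d(host, lk))."""
    return [dist(host, l) for l in landmarks]

# Toy symmetric distance table (milliseconds).
RTT = {
    frozenset({"h", "l1"}): 20.0,
    frozenset({"h", "l2"}): 35.0,
    frozenset({"h", "l3"}): 12.0,
}

def rtt(a, b):
    return RTT[frozenset({a, b})]

print(lipschitz_coords("h", ["l1", "l2", "l3"], rtt))  # [20.0, 35.0, 12.0]
```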

3.2.2 Contractive Embedding

An embedding induced by a mapping F is said to be contractive with respect to S

if δ(F (o1), F (o2)) ≤ d(o1, o2) for all o1, o2 ∈ S.


L∞ is, by its nature, a contractive embedding because of the triangle inequality.

For other norms Lp, p ∈ [1,∞), a candidate contractive embedding exists. It should

be noted that, in theory, a contractive embedding can be constructed for any distance

definition as long as the distortion of the embedding is known. However, the following

contractive embedding covers the general case [14], without a-priori knowledge of the

embedding distortion:

δ(Fk(o1), Fk(o2)) = (∑i |d(o1, Ai) − d(o2, Ai)|^p)^(1/p) / k^(1/p)

Ai in the above definition is the reference set defined in Lipschitz embedding. The

proof of the contractive property depends on the triangle inequality. For each Ai ∈ R,

we have |d(o1, Ai) − d(o2, Ai)| ≤ d(o1, o2). Then, when δ is the above Lp distance

metric,

δ(Fk(o1), Fk(o2)) = (∑i |d(o1, Ai) − d(o2, Ai)|^p)^(1/p) / k^(1/p) ≤ (k · d(o1, o2)^p / k)^(1/p) = d(o1, o2)

Moreover, the δ function described above strictly increases with p. Thus con-

tractively embedding a data set into L∞ space should cause the least distortion. For

the nearest neighbor problem, given a fixed reference set, Ai, L∞ space should there-

fore lead to the best possible method for eliminating neighbors that are clearly not

the nearest, as discussed next.
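The two distance functions discussed above can be sketched as follows (an illustrative sketch; the function names are ours, and the coordinate vectors are assumed to be of equal length k):

```python
# Contractive distances between Lipschitz coordinate vectors.
# The general-case Lp form divides by k**(1/p) (following [14]);
# the Chessboard (L-infinity) form needs no scaling, since the
# triangle inequality alone makes it contractive.

def lp_contractive(c1, c2, p):
    """(sum_i |c1[i] - c2[i]|^p)^(1/p) / k^(1/p)."""
    k = len(c1)
    return (sum(abs(a - b) ** p for a, b in zip(c1, c2)) / k) ** (1.0 / p)

def chessboard(c1, c2):
    """L-infinity distance: max_i |c1[i] - c2[i]|."""
    return max(abs(a - b) for a, b in zip(c1, c2))

# The scaled Lp distance grows with p and approaches the Chessboard
# distance, which therefore under-estimates real distances the least.
c1, c2 = [20.0, 35.0, 12.0], [18.0, 30.0, 40.0]
assert lp_contractive(c1, c2, 1) < lp_contractive(c1, c2, 2) < chessboard(c1, c2)
```

The final assertion illustrates the point made above: since δ increases with p, the Chessboard distance, its limit, stays closest to the true distance from below.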

3.2.3 An On-line Refinement Algorithm

For a contractive embedding scheme, we can use the following on-line refinement pro-

cedure to adjust embedding errors and thus converge upon the “real” nearest neighbor.

This algorithm assumes the following prerequisites: (1) after the completion of some


off-line process, every end-host has a coordinate determined by its distance to each

of a known set of landmarks, and (2) during the on-line refinement stage, all hosts

in a set C piggyback their coordinates together with their requests for some content

(depending on the application), and now some server h (which is currently serving the

content) wants to find the “real” nearest neighbor among a candidate list to serve.

1. h sorts the end-hosts in increasing order of their distances from itself in the

embedding space. Suppose that point F (a) corresponding to host a is the

closest point to F (h) at the distance of δ(F (a), F (h)).

2. h physically sends a probing message to host a to get the real distance d(a, h)

between a and h. At this point, we know that any host x with δ(F (x), F (h)) >

d(a, h) cannot be the nearest neighbor of h since the contractive property then

guarantees that d(x, h) > d(a, h). Therefore, d(a, h) now serves as an upper

bound on the nearest-neighbor search in the embedding space, and we just

remove all such x from the candidate list.

3. h then finds the next closest point F (b) corresponding to node b, and physically

sends a probing packet to host b to obtain d(b, h), subject to the distance

constraint d(a, h). If d(b, h) < d(a, h), then b and d(b, h) replace object a and

d(a, h) as the current closest object and upper bound distance, respectively;

otherwise, a and d(a, h) are retained.

4. The previous step is repeated until we either converge on the nearest neighbor,

or we could simply stop after N probe messages have been sent (and assume that

we have the nearest neighbor at that point), thereby avoiding a large number

of probe messages being sent to neighboring hosts.


It is the nature of contractive embedding that enables us to quickly eliminate

neighbors from the search space of hosts that are potentially nearest. This enables

on-line probing to avoid unnecessary message transmissions to hosts that are relatively

far away.
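The four steps above can be sketched as follows (an illustrative sketch under stated assumptions: the embedded distance is the contractive Chessboard distance from Section 3.2.2, and the coordinate table, candidate list, and `probe` function are hypothetical stand-ins for real coordinates and probing messages):

```python
# On-line refinement: probe candidates in increasing order of embedded
# distance, keeping the best real distance seen so far. Because the
# embedding is contractive (embedded distance <= real distance), any
# candidate whose embedded distance exceeds the current best real
# distance can be pruned without probing it.

def refine_nearest(h, candidates, coords, probe, max_probes=None):
    dist_embed = lambda u, v: max(abs(a - b) for a, b in zip(u, v))
    order = sorted(candidates, key=lambda x: dist_embed(coords[h], coords[x]))
    best, best_d, probes = None, float("inf"), 0
    for x in order:
        if dist_embed(coords[h], coords[x]) > best_d:
            break  # all remaining candidates are farther: prune them
        d = probe(h, x)  # physical probing message
        probes += 1
        if d < best_d:
            best, best_d = x, d
        if max_probes is not None and probes >= max_probes:
            break
    return best, best_d

# Toy example: embedded distances (6, 9, 25) under-estimate the real
# distances (7, 10, 30), as contractiveness requires.
coords = {"h": [0.0], "a": [9.0], "b": [6.0], "c": [25.0]}
real = {"a": 10.0, "b": 7.0, "c": 30.0}
probe = lambda h, x: real[x]
print(refine_nearest("h", ["a", "b", "c"], coords, probe))  # ('b', 7.0)
```

In the toy run only one probing message is sent: after b is probed, every remaining candidate has an embedded distance exceeding d(b, h) = 7 and is pruned.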

3.3 Landmark Selection in L∞

Figure 1: Different Remote Landmarks

In this section, we tackle the second challenge raised in Section 2: deciding on
the number and location of landmarks used for coordinate derivation. [16] shows that
the intrinsic dimension of a real network data set is normally less than 7. We therefore
study the original data set as intrinsically high-dimensional Euclidean, but embed it
into a Chessboard space (l∞), since with respect to our on-line refinement algorithm
for nearest neighbor searching, we find that contractively embedding the data set into
a Chessboard space (l∞) is the most efficient. Due to space limitations, we give a
detailed analysis only for 2-d Euclidean space and concisely extend it to 3-d Euclidean
space; higher dimensions are left as future work.

One nice property of L∞ is that, by definition, the accuracy of the embedding
is determined by the single landmark that gives the least triangle prediction error.

As illustrated in fig. 1, {h1, h2} is the data set to be embedded, and ℓ1, ℓ2, ℓ3 are
possible landmarks (the reference set). If the triangle inequality holds, ℓ1 and ℓ3 give
us an isometric embedding, while ℓ2 gives the embedding with the largest distortion.
So, intuitively, we should choose landmarks like ℓ, such that the acute angle θ between
ℓh1 and h1h2 is small; moreover, the landmarks should be far away from all other
hosts.

Before our main result, we need the following corollary.

Corollary 1. As illustrated in figure 1, suppose h1h2 is embedded into l^1_∞ with
reference set ℓ. Then the distortion ε = |δ(d(ℓ, h1), d(ℓ, h2)) − d(h1, h2)| / d(h1, h2)
can be no worse than 1 − cos θ.

Proof. Denote the line segments ℓh1, ℓh2, and h1h2 by a, b, and c respectively, so
δ(d(ℓ, h1), d(ℓ, h2)) = |a − b|. Then

ε = |δ(d(ℓ, h1), d(ℓ, h2)) − d(h1, h2)| / d(h1, h2)
= 1 − δ(d(ℓ, h1), d(ℓ, h2)) / d(h1, h2)
= 1 − |a − b| / |c|

By the law of cosines,

|b|² = |a|² + |c|² − 2|a||c| cos(180° − θ) = |a|² + |c|² + 2|a||c| cos θ

Then

ε = 1 − | |a| − √(|a|² + |c|² + 2|a||c| cos θ) | / |c|
= 1 − | (|a|/|c|) − √(1 + (|a|/|c|)² + 2(|a|/|c|) cos θ) |

Let n = |a|/|c|; then

ε(n) = 1 − | n − √(1 + n² + 2n cos θ) |

ε(n) is a strictly increasing function, lim_{n→0} ε(n) = 0, and

lim_{n→∞} ε(n) = 1 − lim_{n→∞} | n − √(1 + n² + 2n cos θ) |
= 1 − lim_{n→∞} (1 + 2n cos θ) / (n + √(1 + n² + 2n cos θ))
= 1 − cos θ

Note that the above corollary is applicable in all dimensions since three nodes

define a plane.
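The bound can also be checked numerically (a quick sanity check of our own, not part of the original analysis): ε(n) stays below 1 − cos θ for every ratio n = a/c, approaching the bound only as the landmark moves infinitely far away.

```python
import math

# Distortion from Corollary 1: eps(n) = 1 - |n - sqrt(1 + n^2 + 2 n cos(theta))|
def distortion(n, theta):
    return 1.0 - abs(n - math.sqrt(1.0 + n * n + 2.0 * n * math.cos(theta)))

theta = math.radians(25)       # the angle used later in Corollary 4
bound = 1.0 - math.cos(theta)  # ~0.0937, i.e., under 10%

samples = [0.1, 0.5, 1.0, 10.0, 1000.0]
# eps(n) increases with n and never exceeds 1 - cos(theta).
assert all(distortion(a, theta) < distortion(b, theta)
           for a, b in zip(samples, samples[1:]))
assert all(distortion(n, theta) <= bound for n in samples)
```

For θ = 25° the bound is 1 − cos 25° ≈ 0.094, which is the fact Corollary 4 relies on below.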

3.3.1 The analysis for 2-d Euclidean space

Before considering high dimension, we consider the 2-d plane. This contains many of

the ideas inherent in the more general case.

We draw a smallest circle of radius r containing all the hosts, and assume that

there are “many” hosts scattered on or close to the circle. We use the standard notation

C(p, r) to denote a circle centered at p of radius r.

Definition 2. The arc length of a line segment with angle θ in 2-d Euclidean
space is g1 + g2, as illustrated in figure 2. The arc length of a point with angle θ in
2-d Euclidean space is defined similarly.

Now comes our main theorem in 2-d Euclidean space.


Figure 2: Arc Length. (a) General Case

Theorem 3. Given C(p, r), place two points a, b in C at random. To achieve 90%
precision for the distance ab, the expected number of landmarks is 7 if the landmarks
are evenly distributed on or close to C, or 43 if they are randomly distributed on or
close to C.

To prove our theorem 3, we need corollary 4 and theorem 5.

Corollary 4. Let θ, as denoted in figure 1, equal 25°. Then the distortion
ε = |δ(d(ℓ, h1), d(ℓ, h2)) − d(h1, h2)| / d(h1, h2) can be no worse than 10%.

Proof. Simply follow the proof for corollary 1.

Theorem 5. Given C(p, r), if two points a, b are placed in C at random, the
expectation of the arc length of ab with angle θ = 25° is 0.92r.

Proof. : As illustrated in figure 2, the arc length of the line segment ab with angle θ

will change with l,h and p. Let g(l, h, p) denotes the expectation of the arc length of

any line segment with angle θ, the meaning of variables l,h, p are clearly illustrated in

Page 24: NETWORK POSITIONING FOR ONLINE NEAREST NEIGHBORS SEARCH XIN QI

18

figure 2. It’s obvious that L ∼ unif(0, r). If L is fixed, H is also uniformly distributed

from 0 to 2√

r2 − l2), so H ∼ unif(0, 2√

r2 − l2). In the same way, P is uniformly

distributed from 0 to 2√

r2 − l2 − h, that is, P ∼ unif(0, 2√

r2 − l2 − h). So,

$$E[g(H,L,P)] = \int_0^r \int_0^{2\sqrt{r^2-l^2}} \int_0^{2\sqrt{r^2-l^2}-h} g(h,l,p)\, f(h,l,p)\, dp\, dh\, dl$$

$$+ \int_0^r \int_0^{2\sqrt{r^2-l^2}} \int_0^{2\sqrt{r^2-l^2}-h} g\bigl(h,l,\, 2\sqrt{r^2-l^2}-h-p\bigr)\, f(h,l,p)\, dp\, dh\, dl \quad (1)$$

where f(h, l, p) is the joint density function, factored by the chain rule:

$$f(h,l,p) = f(p \mid h, l)\, f(h \mid l)\, f(l) = \frac{1}{2\sqrt{r^2-l^2}-h} \times \frac{1}{2\sqrt{r^2-l^2}} \times \frac{1}{r} \quad (2)$$

where 0 ≤ l ≤ r, 0 ≤ h ≤ 2√(r² − l²) and 0 ≤ p ≤ 2√(r² − l²) − h. The value of f(h, l, p) does not depend on p, so

$$E[g(H,L,P)] = \int_0^r \int_0^{2\sqrt{r^2-l^2}} f(h,l,p) \int_0^{2\sqrt{r^2-l^2}-h} g(h,l,p)\, dp\, dh\, dl$$

$$+ \int_0^r \int_0^{2\sqrt{r^2-l^2}} f(h,l,p) \int_0^{2\sqrt{r^2-l^2}-h} g\bigl(h,l,\, 2\sqrt{r^2-l^2}-h-p\bigr)\, dp\, dh\, dl \quad (3)$$

Let e = 2√(r² − l²) − h − p; then

$$\int_0^{2\sqrt{r^2-l^2}-h} g\bigl(h,l,\, 2\sqrt{r^2-l^2}-h-p\bigr)\, dp = \int_{2\sqrt{r^2-l^2}-h}^{0} g(h,l,e)\, d(-e) = \int_0^{2\sqrt{r^2-l^2}-h} g(h,l,e)\, de \quad (4)$$


So, we can simplify it as

$$E[g(H,L,P)] = 2\int_0^r \int_0^{2\sqrt{r^2-l^2}} f(h,l,p) \int_0^{2\sqrt{r^2-l^2}-h} g(h,l,p)\, dp\, dh\, dl$$

$$= 2\int_0^r \int_0^{2\sqrt{r^2-l^2}} \frac{1}{2\sqrt{r^2-l^2}} \cdot \frac{1}{r} \left( \int_0^{2\sqrt{r^2-l^2}-h} \frac{g(h,l,p)}{2\sqrt{r^2-l^2}-h}\, dp \right) dh\, dl \quad (5)$$

The inner integral is the expectation of the arc length when l and h are fixed.

$$\tan\gamma = \frac{\sqrt{r^2-l^2} - p}{l} \quad (6)$$

$$\frac{\sqrt{(\sqrt{r^2-l^2}-p)^2 + l^2}}{\sin(90^\circ + \theta + \beta + \gamma)} = \frac{r}{\sin(90^\circ + \theta + \gamma)} \quad (7)$$

$$\frac{\sqrt{(\sqrt{r^2-l^2}-p)^2 + l^2}}{\sin(90^\circ - \theta + \alpha + \beta + \gamma)} = \frac{r}{\sin(90^\circ - \theta + \gamma)} \quad (8)$$

(7) ⇒

$$\beta + \gamma + \theta = \arccos\left(\frac{\sqrt{(\sqrt{r^2-l^2}-p)^2 + l^2}}{r}\cos(\theta + \gamma)\right) \quad (9)$$

(8) ⇒

$$\alpha + \beta + \gamma - \theta = \arccos\left(\frac{\sqrt{(\sqrt{r^2-l^2}-p)^2 + l^2}}{r}\cos(\gamma - \theta)\right) \quad (10)$$

Let

$$M = \frac{\sqrt{(\sqrt{r^2-l^2}-p)^2 + l^2}}{r} \quad (11)$$

(9) − (10) ⇒

$$2\theta - \alpha = \arccos(M\cos(\theta + \gamma)) - \arccos(M\cos(\theta - \gamma)) \quad (12)$$

$$\alpha = 2\theta - \arccos(M\cos(\theta + \gamma)) + \arccos(M\cos(\theta - \gamma))$$

So,

$$g(h,l,p) = \alpha \times r = \bigl(2\theta - \arccos(M\cos(\theta + \gamma)) + \arccos(M\cos(\theta - \gamma))\bigr) \times r$$

We set θ = 25° and use Mathematica to calculate the above integral, obtaining

$$E[g(H,L,P)] = 0.92 \times r \quad (13)$$

Proof of theorem 3. According to corollary 4, to get 90% precision of |ab|, θ should be less than or equal to 25°. According to theorem 5, the expectation of the arc length of ab with angle θ = 25° is 0.92 × r, so 2πr/(0.92 × r) ≈ 6.83, i.e., 7 landmarks evenly distributed on or close to the circle (or 43 landmarks randomly distributed on or close to the circle) will offer us the expectation.

(b) Best Case

In section 2.1, we mentioned that it is unnecessary to preserve the distances of long edges if the goal is to find the nearest neighbor, since the distance between a node and its nearest neighbor is normally small. This suggests that in practice we do not need as many landmarks as theorem 3 indicates, since theorem 3 gives the expected number of landmarks in the case where a line segment of any permitted length may be placed in C. The embedding algorithm we use, together with the landmark selection strategy, can be seen as a hierarchical scheme based on this reasoning: the shorter an edge is in the real network, the lower its distortion after embedding. First, let us see the best we can do.

[Figure 3: Best Case (segments a1a4 and a2a3 meeting at p1, segments b1b4 and b2b3 meeting at p2 on the bisector ℓ, with angles θ, α1–α3, β1–β3)]

Suppose the distance between any node and its nearest neighbor is small enough compared to the diameter of C that the pair can be treated as a point. Correspondingly, the variable h in figure 2 is small.

Lemma 6. Given C(p, r), put a point p randomly in C and draw two line segments l1, l2 intersecting at p with acute angle α; then the arc length surrounded by l1, l2 is 2αr.

Proof. As illustrated in figure 3, let p1 be the point randomly put in C, and let a1a4, a2a3 be the two line segments intersecting at p1 with acute angle α = 2θ; a1a4 and a2a3 intersect C at a1, a4 and a2, a3 respectively. Let ℓ be the angle bisector of α and p2 any point on ℓ; draw b1b4 parallel to a1a4 and b2b3 parallel to a2a3, intersecting at p2. It is clear that α3 = β3, so α1 + α2 = β1 + β2, which implies a1a2 + a3a4 = b1b2 + b3b4. If p2 moves rightwards until it touches C, it is clear that then b1b2 = 4θr and b3b4 = 0. So b1b2 + b3b4 = 4θr = 2αr, which finishes the proof.

So, in the best case, we need 2πr/(4θr) ≈ 3.6, i.e., 4 landmarks when θ = 25° (≈ 0.436 rad).

(c) Practical Case (Fixed Length Case)

However, it is rare to find applications with the idealized property mentioned in section 2.3(b); in other words, the edge distance between a node and its nearest neighbor is usually not small enough to be treated as a point. In such cases, we can take a sample from the real network space and define a learning value ξ as the maximum of the distances between each node and its nearest neighbor. The following lemma gives some sense of the relation between the number of landmarks and the distance distortion.

Lemma 7. Given C(p, r), randomly put two points a, b in C with fixed length ab = r/δ. Let δ = 5, which means the learning value ξ is one fifth of r. To achieve 90% precision of ab, the expected number of landmarks is 5 if landmarks are evenly distributed on or close to the circle, or 28 if landmarks are randomly distributed on or close to the circle.

Proof of lemma 7. The proof is very similar to that for theorem 3; however, here we fix h = r/5, so

$$E[g(L,P)] = 2\int_0^{r\sqrt{1-\frac{1}{4\delta^2}}} f(l,p) \int_0^{2\sqrt{r^2-l^2}-\frac{r}{\delta}} g(l,p)\, dp\, dl$$

$$= 2\int_0^{r\sqrt{1-\frac{1}{4\delta^2}}} \frac{1}{r\sqrt{1-\frac{1}{4\delta^2}}} \left( \int_0^{2\sqrt{r^2-l^2}-\frac{r}{\delta}} \frac{g(l,p)}{2\sqrt{r^2-l^2}-\frac{r}{\delta}}\, dp \right) dl \quad (14)$$


[Figure 4: 3-D cap area (showing the DSC, the DCA θ, and the spherical cap areas g1, g2)]

where δ = 5. According to corollary 4, θ = 25° makes the distortion less than 10%, and we use Mathematica to calculate the above integral, obtaining

$$E[g(L,P)] = 1.51 \times r \quad (15)$$

This means 2πr/(1.51 × r) ≈ 4.16, i.e., 5 landmarks evenly distributed on or close to the circle (or 28 landmarks randomly distributed on or close to the circle) will offer us the expectation.

3.3.2 The analysis for 3-d Euclidean space

We now extend the above analysis of the 2-d plane to higher-dimensional space. This subsection gives a concise analysis of 3-d Euclidean space for the general case and the best case; higher dimensions are left as future work.

We draw a smallest sphere of radius r containing all the hosts, and assume that there are "many" hosts scattered on or close to the sphere. We use the standard notation B(p, r) to denote a sphere centered at p with radius r.


Definition 8. Degenerate Spherical Cone (DSC): the surface of revolution obtained by cutting a conical "wedge", with vertex at any point of a sphere, out of the sphere. It is therefore a degenerate cone plus a spherical cap.

Definition 9. Degenerate Cone Angle (DCA): illustrated in figure 4.

Definition 10. The cap area of a line segment with DCA θ in 3-d Euclidean space is g1 + g2 (the sum of the areas of the two spherical caps), as illustrated in figure 4. The cap area of a point with DCA θ in 3-d Euclidean space is defined similarly.

Now comes our main theorem for 3-d Euclidean space.

(a) General Case

Theorem 11. Given B(p, r), randomly put two points a, b in B. To get 90% precision of ab, the expected number of landmarks we need is 24 if landmarks are evenly distributed on or close to B, or 171 if landmarks are randomly distributed on or close to B.

To prove our main theorem for 3-d Euclidean space, we need theorem 12.

Theorem 12. Given B(p, r), if we randomly put two points a, b in B, the expectation of the cap area of ab with DCA θ = 25° is 0.54r².

Proof of theorem 12. The proof is the same as that for theorem 5; the only difference is that now

$$g(h,l,p) = 2\pi r^2\left(1 - \cos\frac{\alpha}{2}\right) \quad (16)$$

which is the area of the spherical cap, where α is defined in equation (12).


Proof of theorem 11. The proof is almost the same as that for theorem 3. The only difference is that in 3-d, the expectation of the cap area of the line segment with DCA θ = 25° is 0.54r², so the expected number of landmarks is 4πr²/(0.54r²) ≈ 23.3, i.e., 24 if landmarks are evenly distributed on or close to B, or 171 if landmarks are randomly distributed on or close to B.

(b) Best Case

Lemma 13. Given B(p, r), put a point p randomly in B and draw the degenerate spherical cones d1, d2 at p with DCA θ in both directions; then the spherical cap area covered by d1 and d2 is 4πr²(1 − cos θ).

Proof. The proof is the same as that for lemma 6.

So, in the best case, we need 4πr²/(4πr²(1 − cos θ)) = 1/(1 − cos 25°) ≈ 10.7, i.e., 11 landmarks when θ = 25°.

3.4 Outside-max-distance algorithm

From the above theoretical analysis, we use the following heuristic for selecting landmarks for L∞.

1. Determine c candidate nodes on the outside of a ring.

(a) Randomly choose a node in the data set and find the node farthest from it; call it n1.

(b) From the data set, find another node that is farthest from n1; call it n2.

(c) Put n1 and n2 into a candidate list. If the number of members in the candidate list is equal to or greater than c, return the list; otherwise remove n1 and n2 from the data set and go to (a).

2. Use the max-distance algorithm to find nodes evenly distributed over the outside ring of candidates. We use the max-distance heuristic from [28], which iteratively selects the set of landmarks L as follows: the first landmark L1 is chosen from the set S at random. For m (1 < m ≤ |L|), the distance from a host hi to the set of already chosen landmarks L1, ..., Lm−1 is defined as min_{Lj} δ(F(hi), F(Lj)). The algorithm then selects as landmark Lm the host that has the maximum distance to the set L1, ..., Lm−1.

Intuitively, the algorithm tries to pick nodes evenly from the boundary of the data set.
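The two steps above can be sketched as follows (all names are illustrative; the thesis applies the max-distance step to embedded coordinates F(·), whereas this sketch works directly on a pairwise distance matrix):

```python
import random

def outside_candidates(dist, nodes, c):
    """Step 1: collect c boundary candidates by repeatedly pulling two
    mutually far-apart nodes out of the data set."""
    pool, cand = list(nodes), []
    while len(cand) < c and len(pool) >= 2:
        seed = random.choice(pool)                   # random start node
        n1 = max(pool, key=lambda x: dist[seed][x])  # farthest from seed
        n2 = max(pool, key=lambda x: dist[n1][x])    # farthest from n1
        cand += [n1, n2]
        pool = [x for x in pool if x not in (n1, n2)]
    return cand[:c]

def max_distance_landmarks(dist, candidates, k):
    """Step 2: greedy max-distance heuristic -- repeatedly add the candidate
    whose minimum distance to the already chosen landmarks is largest."""
    landmarks = [random.choice(candidates)]
    while len(landmarks) < k:
        best = max((x for x in candidates if x not in landmarks),
                   key=lambda x: min(dist[x][l] for l in landmarks))
        landmarks.append(best)
    return landmarks
```

The greedy step tends to spread the chosen landmarks evenly along the boundary, which is the intuition stated above.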


4 Experimental Evaluation

In this section, we show experimental results evaluating our scheme for finding the nearest neighbor, in comparison with GNP [18]. We also show the tradeoff between the structure error caused by triangle-inequality violations and the embedding error, with respect to contractive embedding into different norms.

4.1 Data Set

Several data sets are available that record round-trip times between Internet nodes. [27] provides a good investigation of seven such data sets, and shows that they have common intrinsic dimensions (around 5-7). Six of the data sets are provided as a pairwise distance matrix of N × M, where N ≫ M and M < 30. Only the AMP [1] data gives an N × N matrix with N > 100. The usual way other embedding schemes validate against such data is to choose a subset of M columns as landmarks, derive coordinates for each node, and use the remaining columns to verify the result. In our experiments, we use the AMP data [1] to verify the capability of our embedding scheme to find the nearest neighbor among a set of neighbors. For large data sets, we use the Inet 3.0 generator [29], which produces synthetic power-law networks. We keep all default configurations and generate 3050 nodes. These nodes are placed in a square region, and the delay of a link is the Euclidean distance between the end points of that link. End-to-end delay is the shortest-path delay.


[Figure 5: Error in finding nearest neighbor under different norms — x-axis: L(1), L(2), L(infinity); y-axis: percentage of nodes failing to find the nearest neighbor; bars: before remedy, after remedy]

4.2 Compensation for the Violation of Triangle Inequality

It has been found that Internet traffic does not always follow the shortest possible path

and that there is potential for violation of the triangle inequality due to the routing

policy [4]. In [27], the authors looked into the problem of whether the real networking

data obeys the triangle inequality, which is one requirement for the correctness of a

metric embedding. In particular, for the AMP data, there is evidence that only 1.4% of

all combinations of d(i, k), d(k, j) and d(i, j) violate the triangle inequality, in which

d(i, k) + d(k, j) < d(i, j).

[13] tries to analyze the effectiveness of different norms with respect to their qual-

ity of representing “topological” information. Specifically, the authors compared the

effectiveness of embedding network nodes into different norms and came to the con-

clusion that the accuracy of representing topological information in a data space

depends heavily on the distance metric. However, they did not take the violation of

the triangle inequality into account.

Using the AMP data from [1], we performed the same experiment as in [13]. We used a Lipschitz embedding to assign coordinates to nodes, embedding them into L1, L2 and L∞ normed spaces, and tried to find the nearest neighbor in the embedded space for each node. As shown in Fig. 5, the x-axis represents the different normed spaces, and the y-axis shows the total number of nodes that fail to find the actual nearest neighbor in the embedded space. As the "before remedy" results show, even using the L∞ norm, 52 of the total 101 nodes still cannot correctly find their nearest neighbor, which theoretically should be impossible. After further inspecting the data, we found that the main reason is violation of the triangle inequality.

Recall that the main problem we are solving is finding the nearest neighbor; that is, we only care about the relative ordering of distances to all the neighbors, not the absolute distances. For a particular host h, if di is the distance to its ith nearest neighbor, then among all distances before and after embedding we only need to preserve the ordering

d1 ≤ d2 ≤ . . . ≤ dn−1

To make the triangle inequality hold for the data, we add a fixed offset, called the remedy value, to every distance. Here we use the largest pairwise distance of the whole metric as the remedy value, which makes the AMP data obey the triangle inequality. As the "after remedy" results show, the error using L∞ drops to 0 after this adjustment, which agrees with the theoretical analysis.
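A minimal sketch of this remedy (illustrative names): adding the largest pairwise distance c to every edge guarantees (d(i,k) + c) + (d(k,j) + c) ≥ d(i,j) + c for every triple, since c ≥ d(i,j).

```python
def count_violations(dist):
    """Count ordered triples (i, j, k) with d(i,k) + d(k,j) < d(i,j)."""
    n = len(dist)
    return sum(1 for i in range(n) for j in range(n) for k in range(n)
               if len({i, j, k}) == 3 and dist[i][k] + dist[k][j] < dist[i][j])

def apply_remedy(dist, remedy):
    """Add a fixed offset (the remedy value) to every off-diagonal distance."""
    n = len(dist)
    return [[0.0 if i == j else dist[i][j] + remedy for j in range(n)]
            for i in range(n)]
```

With `remedy = max(max(row) for row in dist)`, every violation disappears, at the cost of distorting the metric's intrinsic structure, as discussed next.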

This result tells us that even though only a small portion (here 1.4%) of the triples in a data set violate the triangle inequality, large errors can occur in some situations when finding nearest neighbors. We call the error caused by violation of the triangle inequality the structure error. It also tells us that, if violations of the triangle inequality are handled correctly, the L∞ norm offers the possibility of isometrically embedding the data. In practice, however, we cannot give each node a coordinate with dimension n, the number of nodes in the data set. This is why we need landmarks for dimensionality reduction.

[Figure 6: Distribution of triangle inequality violation — x-axis: triangle violation length (0–60); y-axis: fraction of nodes (0–1)]

However, if we allow the remedy value to be too large, it changes the intrinsic structure of the metric. Let us define another parameter, the triangle inequality violation length d = c − (a + b), for any three edges a, b and c that violate the triangle inequality. If we add a d/2 offset to these three edges, they will obey the triangle inequality. Fig. 6 shows the distribution of the triangle inequality violation length for the AMP data. We can see that for 90% of all violations d is less than 10; that is, if we add 5 as the offset to every distance, we can repair 90% of the triangle inequality violations.
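Computing the violation-length distribution can be sketched as follows (illustrative names); the offset can then be chosen from a percentile of this distribution, as the text does with d/2 = 5:

```python
def violation_lengths(dist):
    """For every violating triple record d = c - (a + b): how far the long
    side c = d(i,j) exceeds the detour a + b = d(i,k) + d(k,j)."""
    n, out = len(dist), []
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k in (i, j):
                    continue
                d = dist[i][j] - (dist[i][k] + dist[k][j])
                if d > 0:       # only record actual violations
                    out.append(d)
    return out
```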


[Figure 7: Proximity with max number of probing refinements — x-axis: norm (1, 2, 4, Infinity); y-axis: proximity; bars: max probing 5, 10, 20]

4.3 Contractive Embedding, Tradeoff between Structure Error and Embedding Error

As analyzed in section 3.2.2, for a contractive embedding into $L_p^k$,

$$\delta(F_k(o_1), F_k(o_2)) = \frac{\left(\sum_i |d(o_1, A_i) - d(o_2, A_i)|^p\right)^{1/p}}{k^{1/p}}$$

There is a linear relationship among all the norms p, with L∞ the best. In this part, we examine the relationship between the different contractive embedding norms; specifically, we use L1, L2, L4 and L∞ as examples. We also give the tradeoff between the triangle error, which is caused by violation of the triangle inequality, and the embedding error, which is caused by the embedding scheme.
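The contractive Lipschitz embedding used here can be sketched in a few lines (illustrative names). The k-landmark normalization makes every embedded distance a lower bound on the true distance, since by the triangle inequality |d(o1, Ai) − d(o2, Ai)| ≤ d(o1, o2) for every landmark Ai:

```python
import math

def lipschitz_coords(node, landmarks, d):
    """A node's Lipschitz coordinates: its distances to the k landmarks."""
    return [d(node, a) for a in landmarks]

def embedded_distance(x, y, p):
    """Normalized L_p distance between coordinate vectors; the 1/k**(1/p)
    factor makes the embedding contractive (never larger than the true
    distance). p = float('inf') gives the L-infinity norm."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return (sum(t ** p for t in diffs) / len(x)) ** (1.0 / p)
```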

First, we examine the capability of the different norms to preserve proximity. In Fig. 7, the x-axis represents the norms, and the y-axis shows their capability to preserve proximity, in other words, their capability to preserve nearest-neighbor information. The different color bars correspond to different maximum numbers of probings. We can see a clear linear relationship among the norms for contractive embedding in preserving proximity. All of these norms allow on-line probing to compensate for the embedding error, but since L∞ has the minimum embedding error, under a given probing limit L∞ always achieves the best proximity preservation.

[Figure 8: Proximity with remedy value — x-axis: norm (1, 2, 4, Infinity); y-axis: proximity; bars: remedy value 0, 5, 10, 15]

[Figure 9: Probing number with remedy value — x-axis: norm (1, 2, 4, Infinity); y-axis: probing number; bars: remedy value 0, 5, 10, 15]

Fig. 8 and Fig. 9 show the tradeoff between the remedy value and the on-line probing number. In the experiment for these two figures, we give all norms the maximum probing limit; that is, they are allowed to probe as many times as needed to correct the embedding errors. In Fig. 8, the x-axis represents the norms and the y-axis shows the final proximity after probing; theoretically, the best value is 1. The bar categories show different remedy values. Fig. 8 shows that the L∞ norm is more sensitive to violation of the triangle inequality than the other norms, and can almost reach the real nearest neighbor through probing. For L∞ itself, a larger remedy value gives a better capability of preserving proximity information. However, from Fig. 9 we can see that the other norms must undertake more probing steps before stopping, and that for L∞ itself a larger remedy value invokes more probing steps.

We define the errors caused by violation of the triangle inequality as structure errors, and the errors caused by the embedding scheme itself as embedding errors. These two figures are good examples of the tradeoff between structure error and embedding error. Both contribute to the final error in finding the nearest neighbor and to the number of on-line probes. However, a larger remedy value, which compensates for the structure error, actually causes more embedding error; in turn, embedding error causes more probing.

Considering this tradeoff, using a remedy value of 5, which is d/2 for 90% of the violations, and an embedding in the L∞ norm, an end host only needs (on average) to probe 6.6 times to find the 1.5th real nearest neighbor.

4.4 Comparison with GNP

4.4.1 Pairwise Distance Predicting

[Figure 10: 10 Landmarks Predicting Error — x-axis: relative predicting error; y-axis: fraction of nodes; curves: outside-max-distance 10L, random 10L, gnp 10L 8D]

First, Fig. 10 shows a comparison of the predicting error between GNP and our algorithm. The error is defined as

error = |predicted distance − measured distance| / predicted distance

Note that this definition differs from the one in [18], where the denominator is the minimum of the predicted and real distances. Since the L∞ embedding is contractive, the definition in [18] would penalize our results.

We use two algorithms to choose landmarks: one is the outside-max-distance algorithm described earlier, the other chooses landmarks at random. For GNP, we use the first 10 nodes as landmarks with dimension 8. Since the AMP data nodes are arranged alphabetically, such an arrangement can be regarded as one configuration of randomly chosen landmarks. We can see that in the ability to correctly predict the distance between any two nodes, our angle-based L∞ embedding is not as good as GNP. This is not a surprising result, and it agrees with the results in [18].

We can also see that in our scheme, randomly choosing landmarks is almost as good as the outside-max-distance algorithm at preserving distance.

4.4.2 Nearest Neighbor Searching

[Figure 11: Proximity — x-axis: relative proximity error for nearest neighbor; y-axis: fraction of nodes; curves: outside-max-distance 10L with probing and refine, outside-max-distance 10L with probing, outside-max-distance 10L, random 10L, gnp 10L 8D]

In contrast with the above results, if we consider the capability of predicting nearest neighbors, our approach performs favorably. In what follows, define the proximity p of a node as follows: the node chosen as nearest neighbor in the embedded space is in fact the pth closest neighbor in real distance. For example, if the calculated nearest neighbor is actually the real nearest neighbor, the proximity p is 1.
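The proximity measure can be computed directly from the true and embedded distance matrices; a sketch with illustrative names:

```python
def proximity(true_dist, emb_dist, q):
    """Proximity p of node q: the rank (by true distance) of the node that
    looks nearest to q in the embedded space; p = 1 means the real nearest
    neighbor was found."""
    others = [i for i in range(len(true_dist)) if i != q]
    chosen = min(others, key=lambda i: emb_dist[q][i])     # embedded pick
    ranked = sorted(others, key=lambda i: true_dist[q][i]) # true ordering
    return ranked.index(chosen) + 1
```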

Given the above, Fig. 11 shows the capability of the two schemes to preserve proximity in the embedded space. From this figure we can make several observations. First, even though the random landmark selection algorithm achieves results comparable to the outside-max-distance algorithm in preserving distance, it is much worse at preserving nearest neighbors. This agrees with the claims in [18] that triangle-based schemes are sensitive to the positions of landmarks. The outside-max-distance algorithm preserves proximity information as well as GNP does; moreover, it is actually better at bounding the maximum proximity. Needless to say, with our contractive embedding scheme we can use (1) on-line probing to adjust embedding errors, and (2) a globally defined remedy value to adjust the structure error. These are shown in green and red, respectively, in the figure.

4.5 Comparison Using INET Data

In this section, we compare experimental results using an Inet [29] data set produced with the Inet 3.0 generator, with all parameters set to their defaults. This data set consists of 3050 nodes and represents a synthetic power-law network, in which nodes are randomly placed in a square region and the delay of a link is the Euclidean distance between the end points of that link. We use the Floyd-Warshall all-pairs shortest-path algorithm to generate the pairwise network distance between every pair of nodes. Note that since this distance metric is derived from shortest paths in a 2-D plane, it obeys the triangle inequality, so we do not need to apply the triangle inequality violation remedy to the data.
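Deriving the pairwise distances from the generated link delays can be sketched as follows (illustrative names; any shortest-path metric satisfies the triangle inequality by construction, which is why no remedy is needed here):

```python
def floyd_warshall(w):
    """All-pairs shortest paths over a dense weight matrix
    (math.inf marks an absent link). Returns the pairwise
    network-distance matrix."""
    n = len(w)
    d = [row[:] for row in w]          # copy so w is not mutated
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```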


[Figure 12: Predicting Error — x-axis: relative predicting error; y-axis: fraction of nodes; curves: outside-max-distance 50L, gnp median-cluster 50L 10D, gnp max-distance 50L 10D]

4.5.1 Pairwise Distance Predicting

First, we compare the pairwise distance predicting error, defined in section 4.4.1. For landmark selection, we use our two heuristics, outside-max-distance and convex-max-distance. Fifty landmarks are used for these two algorithms, and for GNP we set the dimension to 10.

We give GNP two sets of landmarks. The first is generated using the max-distance heuristic; for the second, we choose the first 50 nodes as landmarks. Surprisingly, using the first 50 nodes as landmarks gives much better results than the max-distance landmark selection method. After further investigating the data, we found that in Inet-generated data the nodes are ranked in decreasing order of degree, so the first nodes have higher degree and are more likely to be the median nodes of cluster groups. [18] showed that N-cluster-median is the best landmark selection scheme for GNP, and this result agrees with their finding.


In Fig. 12, we show the predicting error of GNP and of our scheme for four landmark selection algorithms. We can see that when GNP is used with the N-cluster-median landmark selection algorithm, its predicting error is much lower than that of our algorithm. This agrees with the results in [19], which show that with 40 landmarks in an 8-dimensional Euclidean space, 90% of the predictions have error less than 42%. With 10 more landmarks, our experiment is slightly better: 90% of the predictions have error less than 35%. Another reason is that we use a different error function, as expressed in section 4.4.1.

Neither of our two landmark selection heuristics is as good as GNP with N-cluster-median at predicting all pairwise distances, which is not surprising. However, if GNP is used with the max-distance heuristic to select landmarks, its predicting capability is even worse than our scheme's. These results show that landmark selection depends on the embedding scheme; in general, it is better to choose landmarks according to the intrinsic properties of the embedding scheme.

4.5.2 Nearest Neighbor Searching

Fig. 13 and Fig. 14 compare the capability of GNP and our algorithm in predicting the nearest neighbor. Here we use the distance derived from each scheme's coordinates to determine the nearest neighbor in the embedded space, and then look up the real distance metric to check the errors. Since we use 50 of the 3050 nodes as landmarks, each node actually tries to find its nearest neighbor among the other 3050 − 50 − 1 = 2999 nodes. For GNP, we use the median-cluster landmark selection algorithm, which has the best predicting error.

[Figure 13: Proximity Error — x-axis: relative proximity error for nearest neighbor; y-axis: fraction of nodes; curves: outside-max-distance 50L, outside-max-distance 50L probing 20, outside-max-distance 50L probing 200, gnp median-cluster 50L 10D]

From Fig. 13, if we consider the proximity in finding the nearest neighbor, our scheme is slightly better than GNP, as shown by the "gnp median-cluster" and "outside-max-distance 50L" curves. Considering about 90% of cases, the nearest neighbor found in the embedded space can only be guaranteed to be within the actual 1200th closest neighbor. But if we apply the on-line algorithm to refine the embedding errors, the results are much better: if we allow a maximum of 20 real probes, then with 90% probability we can guarantee finding a neighbor with proximity less than 1000; and if we allow a maximum of 200 probes, then with 90% probability we can guarantee proximity less than 270. Purely in terms of finding nearest neighbors among 3000 nodes, even allowing at most 200 real probes, we can only guarantee proximity below 270.

If we look at the problem another way, as shown in Fig. 14, and define the distance error as the ratio of the distance to the calculated nearest neighbor to the distance to the real nearest neighbor (so the ideal value is 1), the results are significant.


[Figure 14: Distance Error — x-axis: relative distance error for nearest neighbor; y-axis: fraction of nodes; curves: outside-max-distance 50L, outside-max-distance 50L probing 20, outside-max-distance 50L probing 200, gnp median-cluster 50L 10D]

We can see that a large portion of the predictions have distance error less than 2 in all four cases. With GNP, in 50% of the cases the neighbor found is less than 2 times farther away than the real nearest neighbor; with our algorithm the same holds in 60% of the cases. Furthermore, if we allow a maximum of 20 probes to adjust the errors, this rises to 80%, and with a maximum of 200 probes we reach about 93%. This is due to the high degree of clustering in the Internet, which is also discussed in [28]. Using an embedding scheme to find the nearest neighbor, the actual distance is within 2 or 3 times that to the real nearest neighbor, which we believe is a reasonable approximation.

One final question remains: could GNP also use probing to adjust the embedding error? Only a contractive embedding can guarantee quick convergence to the real nearest neighbor in the refinement stage. Table 1 shows the tradeoff between the number of probes and the error in predicting nearest neighbors.

Max Allowed Probing   Real Probing   Proximity   Distance Error
0                     0              408         2.3
20                    13.4           245         1.8
200                   118.9          81          1.3
3050                  429.8          1           1

Table 1: Probing Number vs. Nearest Neighbor Finding Accuracy

From the table, we can see that even with a maximum probing limit, in a large portion of cases the on-line probing algorithm stops before reaching that limit, due to the contraction property (and, hence, the elimination of unqualified candidates). If we allow at most 20 probes, on average the algorithm stops after 13; if we allow at most 200, on average it stops after 119. In the extreme case, even with no probing limit at all, it needs only 430 steps on average, far fewer than 3050, and in that case we are guaranteed to find the real nearest neighbor.
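The early stopping described above can be sketched as follows (illustrative names, not the thesis's exact procedure): candidates are visited in order of their embedded lower bound, and because a contractive embedding guarantees the embedded distance never exceeds the true one, the search stops as soon as the next lower bound exceeds the best probed distance.

```python
def probe_refine(coords, query_coord, probe, candidates, max_probes):
    """Online refinement under a contractive L-infinity embedding: the
    embedded distance lower-bounds the true distance, so once the best
    probed distance drops below the next candidate's lower bound, every
    remaining candidate can be eliminated without probing."""
    lower = lambda c: max(abs(a - b) for a, b in zip(coords[c], query_coord))
    best, best_d, probes = None, float('inf'), 0
    for c in sorted(candidates, key=lower):   # ascending lower bounds
        if lower(c) >= best_d or probes >= max_probes:
            break                             # no remaining candidate can win
        d = probe(c)                          # real network measurement
        probes += 1
        if d < best_d:
            best, best_d = c, d
    return best, probes
```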


5 Related Work

The paper by Ng and Zhang [18] is the first paper in the networking area to use a

non-linear optimization algorithm for network positioning [13, 21, 25, 27]. Similar

work has also been conducted in theoretical computer science [12] and by Tang

and Crovella [27]. These bodies of work show that even though it is impossible to

represent the positions of Internet hosts in a purely 2D space, it is possible to embed

Internet hosts into a relatively low-dimensional (on the order of 5 to 7) Euclidean

space, using traditional dimension reduction algorithms like MDS (Multidimensional

Scaling) [2] and PCA (Principal Component Analysis). This makes it possible to

accurately embed the positions of hosts on the scale of the Internet using a relatively

small number of landmarks. Our work extends these ideas by focusing specifically on

the problem of determining nearest neighbors which, to our knowledge, has not been

the primary object of prior work in network positioning.

More recent work by Ng and Zhang [19] concerns the construction of a real

network-positioning system (NPS), to provide a positioning capability for hosts across

the Internet. In essence, this is similar to the Domain Name System (DNS). Princi-

pally, it involves a hierarchical network positioning architecture that maintains posi-

tion consistency while enabling decentralization and adaptation to network topology

changes. In effect, this is similar in concept to the off-line stage of our network

positioning approach.

Other work, such as the Big-Bang-Simulation [25], tries to simulate embedding

errors as force fields, and uses a multi-phase procedure to iteratively reduce such

errors. These iterative non-linear optimization-based algorithms are more sensitive to

input parameters and are more expensive to compute. For example, related work [13]


shows that under some circumstances, GNP may have non-unique coordinates which

would lead to estimation inaccuracy.

Lipschitz embedding with dimensionality reduction using PCA has been studied

by various other researchers [27, 13]. While it has been shown possible to reduce

the dimension of coordinate vectors from as much as 100 down to 20 using “virtual”

landmarks, it is still necessary for each end-host to probe as many as 100 “physical”

landmarks. In our solution, the dimension (or length) of coordinate vectors used

for positioning end-hosts is the same as the number of landmarks, implying that we

require no extra communication costs in the off-line derivation of these coordinates.

Finally, a global architecture for estimating Internet host distances, called the

Internet Distance Map Service (IDMaps), was first proposed by Francis et al. [4].

This architecture separates “tracers” (equivalent to our notion of landmarks) that

collect and distribute distance information from clients that use a corresponding dis-

tance map. A distance query interface allows an application to query IDMaps servers

to find out network distance between pairs of hosts. This is different from our two

stage service architecture in which landmark servers only participate in off-line co-

ordinate derivation, while end-hosts derive their nearest neighbors using an on-line

probing/refinement scheme.


6 Conclusions and Future Work

In this thesis, we leverage geometric-based embedding techniques for the specific

objective of finding nearest neighbors. The nearest neighbor problem is of particular

importance to a large class of applications, in areas such as P2P systems, content-

distribution, overlay routing and end-system multicast. These large-scale applications

are now being deployed on scales that encompass many thousands of end-systems,

taken from a dynamic subset of all Internet hosts.

We propose a two-stage method for network positioning. In the first stage, which

is performed off-line, each host communicates with designated landmarks to derive

its coordinate. We use Lipschitz embedding in the L∞ normed vector space to assign

coordinates and derive distances between pairs of hosts. In the second stage, an on-

line refinement algorithm leverages the contractive property of L∞, to compensate

for embedding errors and quickly converge on the real nearest neighbor of a given

host. Once such a host is ascertained, it is possible to use a probe message (e.g., an

ICMP ping), sent as part of our refinement algorithm, to capture distances, perhaps

in terms of latency, which may then be used in applications such as QoS-constrained

routing.
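As a concrete illustration of the contractive property that the refinement stage relies on, the following sketch builds Lipschitz coordinates for synthetic hosts in the plane and checks that the embedded L∞ distance never exceeds the true distance. Replacing measured RTTs with Euclidean distances, and the specific landmark placement, are assumptions made purely for illustration:

```python
import itertools
import math
import random

def lipschitz_coord(host, landmarks, dist):
    """Off-line stage: a host's coordinate is its vector of distances
    to the k landmarks (here, Euclidean stand-ins for measured RTTs)."""
    return [dist(host, L) for L in landmarks]

def linf(u, v):
    """Distance estimate in the L-infinity normed vector space."""
    return max(abs(a - b) for a, b in zip(u, v))

# For any hosts a, b and landmark L, the triangle inequality gives
# |dist(a, L) - dist(b, L)| <= dist(a, b), so the embedded distance
# is always a lower bound on the real one (contraction).
random.seed(1)
hosts = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(30)]
landmarks = [(0, 0), (100, 0), (0, 100), (100, 100)]  # perimeter placement
coords = [lipschitz_coord(h, landmarks, math.dist) for h in hosts]
for (i, a), (j, b) in itertools.combinations(enumerate(hosts), 2):
    assert linf(coords[i], coords[j]) <= math.dist(a, b) + 1e-9
```

Note that the coordinate length equals the number of landmarks, matching the off-line stage described above, and the landmarks sit on the perimeter of the host set, matching our landmark-selection observation.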

Our analysis shows that by careful selection of landmarks on the perimeter of all

hosts in a given set, it is possible to determine nearest neighbors with low error rates,

using an L∞ embedding scheme. Although geometric embedding theory relies on the

triangle inequality, in practice this may be violated by the intrinsic properties of the

underlying network topology. We compensate for this by offsetting pairwise distances

between hosts using a “remedy value”. Care must be taken when using large remedy

values, since they may in turn increase embedding errors, which we were trying to


eliminate by asserting the triangle inequality.
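The remedy-value idea can be made concrete with a small sketch (the function name and the toy latency matrix below are ours, for illustration only). Since adding a constant r to every pairwise distance turns the condition d(x,z) ≤ d(x,y) + d(y,z) into d(x,z) − d(x,y) − d(y,z) ≤ r, the smallest sufficient remedy value is the worst violation over all ordered triples:

```python
import itertools

def minimal_remedy(dist):
    """Smallest constant r such that adding r to every pairwise distance
    restores the triangle inequality:
        (d_xz + r) <= (d_xy + r) + (d_yz + r)  <=>  d_xz - d_xy - d_yz <= r.
    dist: symmetric matrix (list of lists) of measured distances.
    """
    n = len(dist)
    worst = 0.0
    for x, y, z in itertools.permutations(range(n), 3):
        worst = max(worst, dist[x][z] - dist[x][y] - dist[y][z])
    return worst

# Hypothetical measured latencies with one violating triple:
# d(0,2) = 30 > d(0,1) + d(1,2) = 10 + 12.
d = [[0, 10, 30],
     [10, 0, 12],
     [30, 12, 0]]
r = minimal_remedy(d)  # worst violation: 30 - 10 - 12 = 8
```

This also makes the caution above visible: a large r inflates every distance uniformly, so although it restores the triangle inequality, it can distort the relative distances that the embedding is trying to preserve.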

Future work involves the design and implementation of a distributed version of

our embedding scheme, thereby making our method more scalable. We also intend

to investigate landmark selection schemes for network topologies that have an

intrinsic dimensionality greater than two or three. Specifically, the Internet is known

to have a higher dimensionality than two, which impacts the number and location of

landmarks necessary for an accurate embedding scheme that preserves information

about nearest neighbors.


References

[1] National Laboratory for Applied Network Research. Active Measurement Project

(AMP), http://watt.nlanr.net/.

[2] I. Borg and P. Groenen. Modern Multidimensional Scaling - Theory and Appli-

cations. Springer.

[3] I. Foster and C. Kesselman. Globus: A toolkit-based architecture. The Grid:

Blueprint for a New Computing Infrastructure, pages 259–278, 1999.

[4] P. Francis, S. Jamin, V. Paxson, and L. Zhang. An architecture for a global

internet host distance estimation service. In Proceedings of IEEE Infocom, 1999.

[5] Freenet, http://freenet.sourceforge.net/.

[6] G. Fry and R. West. Adaptive Routing of QoS-constrained Media Streams over

Scalable Overlay Topologies. In 10th IEEE Real-Time and Embedded Technology

and Applications Symposium (RTAS), May 2004.

[7] Gnutella, http://gnutella.wego.com/.

[8] G. R. Hjaltason and H. Samet. Properties of embedding methods for similarity

searching in metric spaces. IEEE Trans. Pattern Anal. Mach. Intell., 25(5):530–

549, 2003.

[9] S. Hotz. Routing information organization to support scalable interdomain rout-

ing with heterogeneous path requirements. Ph.D. thesis, University of Southern

California, 1994.


[10] Y.-H. Chu, S. G. Rao, and H. Zhang. A case for end system multicast (keynote

address). In Proceedings of the 2000 ACM SIGMETRICS International Con-

ference on Measurement and Modeling of Computer Systems, pages 1–12. ACM

Press, 2000.

[11] KaZaA, http://www.kazaa.com.

[12] J. Kleinberg, A. Slivkins, and T. Wexler. Triangulation and embedding using

small sets of beacons. In 45th IEEE Symposium on Foundations of Computer

Science (FOCS'04), 2004.

[13] H. Lim, J. C. Hou, and C.-H. Choi. Constructing internet coordinate system

based on delay measurement. In Proceedings of the 2003 ACM SIGCOMM con-

ference on Internet measurement, pages 129–142. ACM Press, 2003.

[14] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some

of its algorithmic applications. In Proc. 35th IEEE Annual Symposium on

Foundations of Computer Science, pages 577–591, 1994.

[15] D. Loguinov, A. Kumar, V. Rai, and S. Ganesh. Graph-theoretic analysis of

structured peer-to-peer systems: routing distances and fault resilience. In Pro-

ceedings of the 2003 conference on Applications, technologies, architectures, and

protocols for computer communications, pages 395–406. ACM Press, 2003.

[16] J. Matousek. Note on bi-Lipschitz embeddings into normed spaces.

www.emis.de/journals/CMUC/pdf/cmuc9201/ (see matousek.pdf).

[17] SETI@home, http://setiathome.ssl.berkeley.edu.


[18] T. S. E. Ng and H. Zhang. Predicting internet network distance with coordinates-

based approaches. In Proceedings of IEEE INFOCOM’02, 2002.

[19] T. S. E. Ng and H. Zhang. A network positioning system for the internet. In

USENIX 2004, 2004.

[20] G. Parmer, R. West, X. Qi, G. Fry, and Y. Zhang. An Internet-wide Distributed

System for Data-stream Processing. In 5th International Conference on Internet

Computing (IC'04), 2004.

[21] M. Pias, J. Crowcroft, S. Wilbur, S. Bhatti, and T. Harris. Lighthouses for

scalable distributed location. In Second International Workshop on Peer-to-Peer

Systems(IPTPS’03), 2003.

[22] X. Qi, G. Parmer, and R. West. An Efficient End-host Architecture for Clus-

ter Communication Services. In the IEEE International Conference on Cluster

Computing (Cluster’04), September 2004.

[23] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker. A scalable

content-addressable network. In Proceedings of the 2001 conference on Appli-

cations, technologies, architectures, and protocols for computer communications,

pages 161–172. ACM Press, 2001.

[24] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and

routing for large-scale peer-to-peer systems. In IFIP/ACM International Confer-

ence on Distributed Systems Platforms (Middleware), pages 329–350, Heidelberg,

Germany, November 2001.

[25] Y. Shavitt and T. Tankel. Big-bang simulation for embedding network distances

in euclidean space. In INFOCOM’03, 2003.


[26] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek,

and H. Balakrishnan. Chord: a scalable peer-to-peer lookup protocol for internet

applications. IEEE/ACM Transactions on Networking, 11(1):17–32, 2003.

[27] L. Tang and M. Crovella. Virtual landmarks for the internet. In Proceedings of

the 2003 ACM SIGCOMM conference on Internet measurement, pages 143–152.

ACM Press, 2003.

[28] L. Tang and M. Crovella. Geometric exploration of the landmark selection prob-

lem. In Passive & Active Measurement Workshop (PAM 2004), 2004.

[29] J. Winick and S. Jamin. Inet-3.0: Internet topology generator. Technical Report

UM-CSE-TR-456-02, University of Michigan, 2002.

[30] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. Tapestry: An infrastructure

for fault-tolerant wide-area location and routing. Technical report, 2001.