Tapestry: A Resilient Global-Scale Overlay for Service Deployment
Zhao, Huang, Stribling, Rhea, Joseph, Kubiatowicz
Presented by Rebecca Longmuir


Page 1

Tapestry: A Resilient Global-Scale Overlay for Service Deployment

Zhao, Huang, Stribling, Rhea, Joseph, Kubiatowicz

Presented by Rebecca Longmuir

Page 2

What is Tapestry?

A peer-to-peer overlay routing infrastructure offering efficient, scalable, location-independent routing of messages directly to nearby copies of an object or service, using only localized resources.

An extensible infrastructure that provides decentralized object location and routing (DOLR)

Page 3

What is DOLR?

An interface focused on routing messages to endpoints such as nodes or object replicas.

Virtualizes resources, since endpoints are named by opaque identifiers encoding nothing about physical location

Allows message delivery over an unstable underlying infrastructure; developers only need to think about the dynamics of the network for optimization

Page 4

Some background

Nodes participate in the overlay and are assigned nodeIDs uniformly at random from a large identifier space.

More than one node may be hosted on a single physical host.

Application-specific endpoints are assigned globally unique identifiers (GUIDs) selected from the same identifier space.

Page 5

More background

Node N has nodeID Nid and an object O has GUID OG.

Every message contains an application-specific identifier Aid

Page 6

Four part DOLR networking API

PUBLISHOBJECT(OG, Aid): Publish, or make available, object O on the local node. This is a best-effort call and receives no confirmation

UNPUBLISHOBJECT(OG, Aid): Best-effort attempt to remove location mappings for O

ROUTETOOBJECT(OG, Aid): Routes a message to a location of the object with GUID OG

ROUTETONODE(N, Aid, Exact): Routes a message to application Aid on node N. "Exact" specifies whether the destination ID must be matched exactly to deliver the payload
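A minimal sketch of how this four-call interface might look to an application, written as a Python protocol; the call names follow the slide, while the types (GUID, AppID) and signatures are illustrative assumptions rather than the authors' actual API.

from typing import Protocol

GUID = bytes    # opaque object identifier O_G, drawn from the node identifier space
AppID = bytes   # application-specific identifier A_id carried in every message

class DOLR(Protocol):
    """The four DOLR calls named on this slide (signatures illustrative)."""

    def publish_object(self, og: GUID, aid: AppID) -> None:
        """Best effort: advertise object O_G as available on the local node; no confirmation."""

    def unpublish_object(self, og: GUID, aid: AppID) -> None:
        """Best effort: remove location mappings for O_G."""

    def route_to_object(self, og: GUID, aid: AppID, msg: bytes) -> None:
        """Route msg toward some (nearby) location of the object with GUID O_G."""

    def route_to_node(self, n: bytes, aid: AppID, msg: bytes, exact: bool) -> None:
        """Route msg to application A_id on node N; exact=True requires an exact ID match."""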

Page 7

Routing and Object Location

Tapestry dynamically maps each identifier G to a unique live node called the identifier's root GR

If a node N exists with Nid = G, then that node is the root of G

Messages are delivered using a routing table at each node, consisting of the nodeIDs and IP addresses of the nodes it communicates with, referred to as its neighbors.

Routing involves forwarding messages across neighbor links to nodes whose nodeIDs are progressively closer to the destination ID in the identifier space

Page 8

Routing Mesh

Tapestry uses local tables at each node, called neighbor maps, to route overlay messages to the destination ID digit by digit

A node N has a neighbor map with multiple levels, where each level contains links to nodes matching a prefix up to a digit position in the ID and contains a number of entries equal to the ID’s base

Page 9

Routing Mesh

The primary ith entry in the jth level is the ID and location of the closest node whose ID begins with prefix(N, j-1) + "i"; for example, the ninth entry of the fourth level for node 325AE points to the closest node whose ID begins with 3259

It is this notion of the closest node that provides the locality properties of Tapestry
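A toy sketch of this neighbor-map structure and the prefix rule, assuming base-16 (hexadecimal) identifiers of 40 digits; the names BASE, DIGITS and entry_prefix are illustrative, not from the paper.

BASE = 16     # identifier base B, assumed hexadecimal for illustration
DIGITS = 40   # digits in a 160-bit identifier when B = 16

def entry_prefix(local_id, level, i):
    """Prefix that the primary i-th entry of the j-th level must match: prefix(N, j-1) + 'i'."""
    return local_id.upper()[: level - 1] + format(i, "X")

# The neighbor map: DIGITS levels of BASE entries each.  Entry [j-1][i] holds the
# (nodeID, IP address) of the closest known node whose ID starts with
# entry_prefix(local_id, j, i), or None if no such node is known yet.
neighbor_map = [[None] * BASE for _ in range(DIGITS)]

assert entry_prefix("325AE", 4, 9) == "3259"   # the example from this slide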

Page 10

Routing Mesh

Since the node reached at the nth hop shares a prefix of length >= n with the destination ID, Tapestry looks in the (n+1)th-level map for the entry matching the next digit in the destination ID

This guarantees that any existing node is reached in at most logB N logical hops, where N is the namespace size, IDs are of base B, and neighbor maps are assumed consistent.

If a digit cannot be matched, then Tapestry looks for a "close" digit using surrogate routing, where each nonexistent ID is mapped to some live node with a similar ID
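A minimal sketch of one routing step under these rules, reusing the neighbor_map layout above; the cyclic scan over digit values is a simplified stand-in for the paper's surrogate-routing rule, so treat it as illustrative only.

BASE = 16   # identifier base, as in the neighbor-map sketch

def next_hop(neighbor_map, dest_id, hop):
    """At the n-th hop the first `hop` digits of dest_id already match, so consult
    level hop+1 for the entry matching the next digit; if that entry is empty, fall
    back to a nearby digit value (simplified surrogate routing)."""
    if hop >= len(dest_id):
        return None                              # every digit resolved: this node is the destination
    want = int(dest_id[hop], BASE)               # next digit of the destination ID
    level = neighbor_map[hop]                    # level hop+1, stored at index hop
    for offset in range(BASE):
        entry = level[(want + offset) % BASE]    # exact digit first, then neighboring digits
        if entry is not None:
            return entry                         # (nodeID, IP) of the chosen next hop
    return None                                  # empty level: this node acts as the root/surrogate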

Page 11

Providing Resilience in the Routing Mesh

Make use of redundant routing paths

Each primary neighbor is augmented by backup links sharing the same prefix

There are c × B pointers per level (c is the number of neighbor links that differ only in the nth digit at the nth routing level)

The total size of the neighbor maps is c × B × logB N

The expected total number of backpointers is also c × B × logB N
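For a sense of scale (assumed parameters, not figures from the slides): with base B = 16, 160-bit identifiers (so logB N = 40 levels), and c = 2 links per entry (one primary plus one backup), the neighbor maps hold c × B × logB N = 2 × 16 × 40 = 1280 pointers, and the expected number of backpointers is the same.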

Page 12

Object Publication and Location

Each root node inherits a unique spanning tree for routing

This tree is used to locate objects by distributing soft-state directory information across nodes

Page 13

Object Publication and Location

A server S storing an object O periodically publishes the object by routing a publish message toward OR (the object's root)

Each node along the publication path stores a pointer mapping from the object's GUID to the server S

When replicas of an object are stored on separate servers, each server publishes its own copy

Nodes store the location mappings for replicas in sorted order of network latency

Page 14

Object Publication and Location

A client locates an object by routing a message toward OR

Each node on the path checks whether it has a location mapping for the object

If a node has a location mapping, it redirects the message to the servicing node S; otherwise, it forwards the message onward toward OR
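A toy sketch (not the authors' code) of this publish-and-locate walk; here each node is anything with a location_map dict, and path_to_root stands in for the sequence of overlay hops toward the object's root OR.

def publish(obj_guid, server, path_to_root):
    """Server S periodically republishes O toward O_R; every node on the publish path
    stores a soft-state mapping from the object's GUID to S."""
    for node in path_to_root:
        node.location_map.setdefault(obj_guid, []).append(server)

def locate(obj_guid, path_to_root):
    """Route toward O_R; the first node holding a mapping redirects to a servicing
    node (with replicas sorted by latency, index 0 is the closest), otherwise the
    message keeps moving toward the root."""
    for node in path_to_root:
        servers = node.location_map.get(obj_guid)
        if servers:
            return servers[0]
    return None   # reached the root without a mapping: the object was never published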

Page 15

Object Publication and Location

Each hop toward the root reduces the number of nodes satisfying the next-hop prefix constraint by a factor of the identifier base

The path to the root is a function of the destination ID only, not the source ID

The closer a query gets to the object, the more likely it is to cross the object's publication path, and the sooner it will reach the object

Page 16

Dynamic Node Algorithms

There are many mechanisms in place to maintain consistency and to make sure objects are not lost

Many of the control messages require acknowledgments and are retransmitted as needed

Page 17

Node Insertion (four components)

Need-to-know nodes are notified of N, because N fills a null entry in their routing tables

N might become the new object root for existing objects. References to those objects must be moved to N to maintain object availability

The algorithms must construct a near optimal routing table for N

Nodes near N are notified and may consider using N in their routing tables as an optimization

Page 18

Node Insertion

Begins at N’s surrogate S (the “root” node that Nid maps to in the existing network)

S determines the longest prefix length it shares with Nid; call this length p

S sends an acknowledged multicast message to all existing nodes sharing that length-p prefix

The nodes that receive the message add N to their routing tables and transfer references to locally rooted pointers as needed

Page 19

Node Insertion

N's initial neighbor set for its routing table consists of the nodes reached by the multicast

Beginning at level p, N performs an iterative nearest-neighbor search

N uses the neighbor set to fill level p, trims the set to size k, and requests that the k nodes send their backpointers

N decrements p and repeats the process until all levels are filled
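A rough sketch of this iterative fill, under the assumption that a node object exposes fill_level, latency_to and backpointers helpers (all hypothetical names) and that k is the trim size.

def fill_routing_table(new_node, contact_set, p, k=3):
    """Fill routing-table levels p down to 1 using an iterative nearest-neighbor search."""
    candidates = set(contact_set)                  # nodes reached by the acknowledged multicast
    for level in range(p, 0, -1):                  # start at level p, work down one level at a time
        new_node.fill_level(level, candidates)     # use the current candidate set to fill this level
        nearest = sorted(candidates, key=new_node.latency_to)[:k]   # trim to the k closest nodes
        candidates = set()
        for node in nearest:
            candidates.update(node.backpointers()) # ask each of the k nodes for its backpointers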

Page 20

Multiple nodes inserting at once

Every node A in the multicast keeps track of every node B that is still multicasting down to its neighbors

This lets any node C that is multicasting to A know of B’s existence

The multicasting nodes also keep track of the holes in the new node's routing table and check their own tables for any entries that can fill these holes

Page 21

Voluntary Node Deletion

When a node N leaves, it tells the set D of nodes holding backpointers to it that it is leaving, and it provides replacement nodes for each level from N's routing table

The notified nodes send republish-object traffic to both N and its replacement

N routes references to locally rooted objects to their new roots and notifies D when it is done

Page 22

Involuntary Node Deletion

Tapestry improves object availability and routing in failure-prone networks by building redundancy into the routing tables and object location references

Nodes send periodic beacons to detect outgoing link and node failures.

When such problems are noticed, repair of the routing mesh starts, and redistribution and replication of the object location references begins

This is also helped by soft-state republishing of object references

Page 23

Component Architecture

Transport layer provides the abstraction of communication channels from one overlay node to another

Neighbor Link provides secure but unreliable datagram facilities to layers above, including the fragmentation and reassembly of large messages

Neighbor Links also provide fault detection through keep-alive messages, plus latency and loss-rate estimation

Neighbor Link optimizes message processing by parsing the message headers and only deserializing the message contents when required

Page 24

Component Architecture

The router implements functionality unique to Tapestry

This layer includes the routing table and local object pointers

The router examines the destination GUID of each message it receives and determines the next hop using the routing table and local object pointers

Messages are then passed back to the neighbor link layer for delivery

Page 25

Component Architecture

Flow chart of the object location process

Also keep in mind that the routing table and object pointer database are constantly changing as nodes enter or leave the network and as latencies change

Page 26

Tapestry Upcall Interface

To support functions that require greater control over routing details than the DOLR API can provide, Tapestry supports an extensible upcall mechanism

Three primary calls provide the interaction between Tapestry and application handlers (G is a generic ID):

Deliver(G, Aid, Msg): Invoked on incoming messages destined for the local node

Forward(G, Aid, Msg): Invoked on incoming upcall-enabled messages

Route(G, Aid, Msg, NextHopNode): Invoked by the application handler to forward a message on to NextHopNode

Page 27

Tapestry Upcall Interface

Tapestry sends the message to the application via Forward(). The handler is responsible for calling Route() with the final destination. Finally, Tapestry invokes Deliver() on messages destined for the local node to complete routing
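A sketch of what an application-side handler for these three upcalls could look like; the tapestry handle and its next_hop()/send() methods are hypothetical stand-ins for the real node interface.

class UpcallHandler:
    def __init__(self, tapestry):
        self.tapestry = tapestry

    def deliver(self, g, aid, msg):
        # Invoked on incoming messages destined for the local node: routing is complete.
        print(f"delivered message for app {aid!r}: {len(msg)} bytes")

    def forward(self, g, aid, msg):
        # Invoked on upcall-enabled messages passing through this node; the handler may
        # inspect or modify msg and must call route() to keep the message moving.
        self.route(g, aid, msg, self.tapestry.next_hop(g))

    def route(self, g, aid, msg, next_hop_node):
        # Hands the message back to Tapestry to be forwarded on to next_hop_node.
        self.tapestry.send(next_hop_node, g, aid, msg)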

Page 28

Implementation of a Tapestry Node

Tapestry is implemented as an event-driven system for high throughput and scalability

This requires an asynchronous I/O layer as well as an efficient model for internal communication and control between components

Page 29

Implementation of a Tapestry Node

Network stage – a combination of part of the transport layer and part of the neighbor link layer, providing neighbor communication that is not provided by the operating system. It also works with the Patchwork monitoring stage to measure loss rates and latency

Core router – utilizes the routing and object reference tables to handle application-driven messages. It is the critical path for all messages entering or exiting the system

Page 30

Implementation of a Tapestry Node

Node membership – responsible for handling the integration of new nodes into the Tapestry mesh and the voluntary exit of nodes

Mesh repair – responsible for adapting the mesh as the environment changes, including recovering from network failures and updating the routing table as network latencies change

Patchwork – uses soft-state beacons to probe outgoing links for reliability and performance, allowing Tapestry to respond to failures and changes in topology

Page 31

Evaluation

Performed on several platforms

Microbenchmarks were run on a local cluster; the large-scale performance of a deployed Tapestry was measured on the PlanetLab global testbed; and a local network simulation layer was used to support controlled, repeatable experiments

It should also be noted that, to enable a wider variety of experiments, multiple Tapestry node instances were placed on each physical machine. Instances share only code, not data. In some cases this decreases the time to exchange messages, but it is more demanding on the processor, so the system is slowed in other ways

Page 32

Performance in a Stable Network

Used microbenchmarks on a network of two nodes to isolate Tapestry's message processing overhead.

The sender establishes a two-node network with the receiver and sends 10,001 messages for each message size

The receiver measures the latency for each size using the interarrival time between the first and last messages

Page 33

Microbenchmarks on Stable Tapestry

First, they eliminated network delay to measure raw message processing by placing both nodes on different ports of the same machine.

To see how performance scales with processor speed, the test was performed on different machines

For very small messages there is a dominant constant processing time. For messages larger than 2 kB, the cost of copying the data dominates and processing time becomes linear

Page 34

Microbenchmarks on Stable Tapestry

Measurements of routing throughput show that throughput is low for small messages, where processing dominates, but quickly increases as messages increase in size

For the average 4 kB message, the P-IV can process 7,100 messages/s, whereas the P-III can process 3,200 messages/s

Page 35

Routing Overhead to Nodes and Objects

The RDP for node routing is computed by measuring all-pairs round-trip routing latencies between the 400 Tapestry instances used and dividing each by the corresponding ping round-trip time

Figure 13 shows that the median value for node-to-node routing RDP starts at ~3 and slowly decreases to ~1

The object RDP is measured as the ratio of the one-way Tapestry route-to-object latency to the one-way network latency

Figure 14 shows the RDP values sorted by their ping values and collected in 5 ms bins, with the 90th-percentile and median values calculated per bin
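The metric itself is just a ratio; a minimal sketch with made-up numbers:

def rdp(tapestry_latency_ms, ip_latency_ms):
    """Relative Delay Penalty: overlay routing latency divided by the direct network
    (ping) latency for the same pair, or the one-way route-to-object latency over the
    one-way network latency in the object case."""
    return tapestry_latency_ms / ip_latency_ms

print(rdp(90.0, 30.0))   # a 90 ms overlay path over a 30 ms ping path gives an RDP of 3.0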

Page 36

Object Location Optimization

This figure demonstrates that the optimization can significantly lower the RDP observed by the bulk of all requesters at local-area network distances

Their technique trades extra storage space in the network for faster routing

Page 37

Single Node Insertion

Measured the overhead required for a single node to join the Tapestry network, in terms of the time required for the network to stabilize and the control message bandwidth used during insertion

Figure 16 shows insertion time as a function of network size. Latencies scale sublinearly with the size of the network

For each data point, a Tapestry network of size N is constructed, and then a single node is repeatedly inserted and deleted

Page 38

Single Node Insertion

Figure 17 shows that the total bandwidth for a single node insertion scales logarithmically with the network size

Page 39

Parallel Node Insertion

Started with a stable network of 200 nodes. Then repeated each parallel insert 20 times

They plotted the min, median and the 90th percentile.

There is significant variation in the 90th percentile, which is attributed to the effects of node virtualization

Page 40

Continuous Convergence and Self-Repair

Instead of measuring latency, these tests focus on large-scale behavior under failures

The routing-to-nodes test measures the success rate of sending requests to random keys in the namespace

The routing-to-objects test sends messages to previously published objects, located at servers that were guaranteed to stay alive in the network

Performance metrics include bandwidth and the rate of requests successfully reaching their destinations

Page 41

Continuous Convergence and Self-Repair

For both figures, 20% of the existing network is killed, and after 15 minutes new nodes equal to 50% of the existing network are inserted

Only a small fraction of requests are affected when large portions of the network fail.

Massive failures and insertions lead to a small dip in success rate, which quickly returns to 100%

Page 42

Continuous Convergence and Self-Repair

Each test included two churns of different levels of dynamicity

Constant change has little effect on Tapestry performance as success rates rarely fall even slightly below 100%

These dips happen independently of the parameters given to the churn

Page 43

Comparison with Other Peer-to-Peer Systems

Unlike Gnutella, Tapestry guarantees that queries find existing objects

Similar to Chord and CAN in that all three scale well and guarantee that queries find existing objects under nonfailure conditions

Differs from Chord and CAN, which do not take network distance into account when constructing their routing overlays; Tapestry instead constructs locally optimal routing tables from initialization and maintains them in order to reduce routing stretch

Tapestry allows applications to place objects according to their needs

Page 44

Conclusion

Tapestry provides efficient and scalable routing of messages directly to nodes and objects in a large, sparse address space

Simulations show Tapestry performs near-optimally under faults, while a small portion (~5%) of queries fail on the faulty wide-area deployment

Reference: Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, and John Kubiatowicz, "Tapestry: A Resilient Global-Scale Overlay for Service Deployment," IEEE Journal on Selected Areas in Communications, Vol. 22, No. 1, January 2004.