
Slide 1

PIER

Slide 2

Presentation overview: PIER

Core functionality and design principles.
Distributed join example.
CAN high-level functions.
Application, PIER, DHT/CAN in detail.
Distributed join / operations in detail.
Simulation results.
Conclusion.

Slide 3

What is PIER?

Peer-to-Peer Information Exchange and Retrieval:
- a distributed query engine for widely distributed environments
- a "general data retrieval system" using any source
- PIER internally uses a relational database format
- a read-only system

Slide 4

PIER overview (based on CAN)

[Architecture diagram, three tiers:]
- Tier 1, Applications: network monitoring, other user apps.
- Tier 2, PIER: core relational execution engine, catalog manager, query optimizer.
- Tier 3, DHT: DHT wrapper, storage manager, overlay routing.
- Below all tiers: the underlying IP network.

Slide 5

Relaxed consistency

Brewer states that a distributed data system can only have two out of three of the following properties:
1. (C)onsistency
2. (A)vailability
3. Tolerance of network (P)artitions

PIER prioritizes A and P and sacrifices C, i.e. best-effort results.
(Detailed in the distributed join part.)

Slide 6

Scalability

- Scalability: the amount of work scales with the number of nodes.
- The network can grow easily.
- Robust.
- PIER doesn't require a-priori allocation of resources.

Slide 7

Data sources

Data remains in its original source, which could be anything:
- a file system
- a live feed from a process

Wrappers or gateways have to be provided.

[Diagram: Source -> Wrapper -> Tuple]

Slide 8

Standard schemas

Design goal for the application layer.
Pro: bypasses the standardization process.
Con: limited by current applications.

Example: a wrapper around popular software (e.g. tcpdump):

  IP             Payload   TimeStamp
  131.54.78.128  10101001  03-02-2004 18:24:50

        | Wrapper
        v

  IP             Payload
  131.54.78.128  10101001        (relation)
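To make the wrapper idea concrete, here is a small hedged sketch in Python: a raw tcpdump-like record is projected onto the agreed standard schema, dropping the fields the schema does not define. The record layout and helper names are illustrative, not part of PIER.

```python
# Illustrative sketch, not PIER code: a wrapper that adapts raw records
# from a tool such as tcpdump to a fixed standard schema, dropping fields
# (here TimeStamp) that the schema does not define.

RAW_RECORD = {"ip": "131.54.78.128", "payload": "10101001",
              "timestamp": "03-02-2004 18:24:50"}

STANDARD_SCHEMA = ("ip", "payload")  # the agreed-upon relation schema

def wrap(record: dict) -> tuple:
    """Project a raw source record onto the standard schema."""
    return tuple(record[field] for field in STANDARD_SCHEMA)

print(wrap(RAW_RECORD))  # ('131.54.78.128', '10101001')
```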

Slide 9

PIER is independent of the DHT; currently it is CAN.
Currently multicast is used; other strategies are possible.
(Further explained in the DHT part.)

Slide 10

PIER storage overview

[Figure: application layer (query applications X and Y) on top of the PIER layer, the DHT layer, a wrapper and a data source.]

- Source: some data storage (a Windows file etc.); the source creates, updates and deletes the source data.
- The wrapper reads the source data. Each tuple has a schema and a namespace; DHT key = namespace || resourceID, where resourceID = tuple key (a unique key).
- The wrapper is preconfigured with knowledge about relation X: its namespace and multicast group. The wrapper informs the DHT layer that it has data belonging to namespace X.
- The DHT layer informs the PIER layer that relation X (its namespace and multicast group) exists. The DHT creates multicast group X.
- The PIER layer keeps track of all relations (for each relation: namespace, multicast group).
- The application layer issues a query; PIER turns it into DHT queries, i.e. requests for data.
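A minimal sketch of the key scheme just described (DHT key = namespace || resourceID), assuming the SHA-1 style ID space mentioned later for CAN; the separator and helper names are illustrative.

```python
# Illustrative sketch, not PIER code: build a DHT key from namespace and
# resourceID, then map it into a numeric ID space (SHA-1, as assumed later).
import hashlib

def dht_key(namespace: str, resource_id: str) -> str:
    """DHT key = namespace || resourceID (joined with '/' here)."""
    return f"{namespace}/{resource_id}"

def hash_key(key: str) -> int:
    """Map the textual key into the DHT's numeric ID space."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

key = dht_key("R", "311056-0485")   # relation R, tuple key = CPR
print(key, hash_key(key))
```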

Slide 11

Presentation overview: PIER

Core functionality and design principles.
Distributed join example.
CAN high-level functions.
Application, PIER, DHT/CAN in detail.
Distributed join / operations in detail.
Simulation results.
Conclusion.

Slide 12

DHT-based distributed join: example

• Distributed execution of relational database query operations, e.g. join, is the core functionality of the PIER system.
• A distributed join is a relational database join performed, to some degree in parallel, by a number of processors (machines) containing different parts of the relations on which the join is performed.
• Perspective: generic, intelligent "keyword"-based search built on distributed database query operations (e.g. like Google).
• The following example illustrates what PIER can do, and thus the main purpose of PIER, by means of a DHT-based distributed join. Details of how, and of which layer does what, are provided later.

Slide 13

DHT-based distributed join example: relational database join proper (1/2)

Relational database join of R and S on CPR:

  SELECT R.cpr, R.name, S.address
  FROM R, S
  WHERE R.cpr = S.cpr

Join result:

  CPR          Name              Address
  311056-0485  Peter Hansen      Århus
  200770-2312  Kirsten Andersen  København
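For reference, a centralized hash-join version of the same query, before any distribution; the relations are plain in-memory lists, for illustration only.

```python
# Centralized reference for the same equi-join (illustration only):
# SELECT R.cpr, R.name, S.address FROM R, S WHERE R.cpr = S.cpr

R = [("311056-0485", "Peter Hansen"), ("200770-2312", "Kirsten Andersen")]
S = [("København", "200770-2312"), ("Århus", "311056-0485")]

# Hash join: build a hash table on R's join key, then probe it with S.
build = {cpr: name for cpr, name in R}
result = [(cpr, build[cpr], addr) for addr, cpr in S if cpr in build]

for row in sorted(result):
    print(row)
# ('200770-2312', 'Kirsten Andersen', 'København')
# ('311056-0485', 'Peter Hansen', 'Århus')
```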

Slide 14

DHT-based distributed join example: relational database join proper (2/2)

Relation R:
- primary key = DHT resource ID: A1 || A2 || CPR || A4
- attributes selected in the join: CPR (join attribute), Name.

  A1   A2   CPR          A4   A5   Name              A7
  aaa  bbb  311056-0485  ccc  ddd  Peter Hansen      eee
  fff  ggg  200770-2312  hhh  iii  Kirsten Andersen  jjj

Relation S:
- primary key = DHT resource ID: A2 || A3 || A4
- attributes selected in the join: Address, CPR (join attribute).

  Address    A2   A3   A4   CPR
  København  bbb  aaa  ccc  200770-2312
  Århus      ggg  fff  hhh  311056-0485

Slide 15

DHT-based distributed join example (1/9)

Fig. 5.01: PIER: DHT-based pipelining symmetric hash join: R-table in the DHT.

[Figure: the entire DHT network, with namespace NR = multicast group R-group.]

Relation R:
  A1   A2   CPR  A4   A5   Name  A7
  aaa  bbb  7    ccc  ddd  xxx   eee
  fff  ggg  9    hhh  iii  yyy   jjj

For each relation (table) R, the tuples of the relation are distributed and stored among some subset of nodes, which:
- is assigned a specific namespace (NR), and
- is assigned a specific multicast group (R-group).
The DHT layer knows all relations (tables), namespaces and multicast groups.

Slide 16

DHT-based distributed join example (2/9)

Fig. 5.02: PIER: DHT-based pipelining symmetric hash join: S-table in the DHT.

[Figure: the entire DHT network, with namespace NS = multicast group S-group.]

Relation S:
  Address  A2   A3   A4   CPR
  xxx      bbb  aaa  ccc  9
  zzz      ggg  fff  hhh  7

Slide 17

DHT-based distributed join example (3/9)

Fig. 5.03: PIER: DHT-based pipelining symmetric hash join: the relational database join operation to be performed.

[Figure: the entire DHT network, with namespaces NR (R-group) and NS (S-group).]

Relation R:
  A1   A2   CPR  A4   A5   Name  A7
  aaa  bbb  7    ccc  ddd  xxx   eee
  fff  ggg  9    hhh  iii  yyy   jjj

Relation S:
  Address  A2   A3   A4   CPR
  xxx      bbb  aaa  ccc  9
  zzz      ggg  fff  hhh  7

Join result [4b]:
  CPR  Name  Address
  7    xxx   xxx
  9    yyy   zzz

Relational database join of R and S on CPR:
  SELECT R.cpr, R.name, S.address FROM R, S WHERE R.cpr = S.cpr

Slide 18

DHT-based distributed join example (4/9)

Fig. 5.04: PIER: DHT-based pipelining symmetric hash join: step 1: activate the operation on the nodes in NR.

[Figure: the entire DHT network, with namespaces NR and NS.]

Node-7, originator of the join:
  [1] multicast(R-group, "oper-5", "Select xxx");

Slide 19

DHT-based distributed join example (5/9)

Fig. 5.05: PIER: DHT-based pipelining symmetric hash join: step 3: generate intermediate data and push it to the DHT on the nodes in NR.

[Figure: the entire DHT network, with namespaces NR, NS and NQ.]

Relation R:
  A1   A2   CPR  A4   A5   Name  A7
  aaa  bbb  7    ccc  ddd  xxx   eee
  fff  ggg  9    hhh  iii  yyy   jjj

R-derived intermediate tuple [3Ra/b]:
  CPR  Name  Source table name
  9    yyy   R

Node in NR: [3R]: build table: lscan R; for each local tuple:
  [3Ra]: extract R.cpr, R.name. Create the intermediate tuple (R.cpr, R.name, "R"). Create an instanceID as a random number: random_no = random();
  [3Rb]: PUSH: put("NQ", R.cpr, random_no, (R.cpr, R.name, "R"), "60 minutes").

Slide 20

DHT-based distributed join example (6/9)

Fig. 5.06: PIER: DHT-based pipelining symmetric hash join: steps 2 and 3: activate the operation, then generate and push intermediate data to the DHT on the nodes in NS.
Note: step 3 is executed in parallel on the nodes in NR and NS.

[Figure: the entire DHT network, with namespaces NR, NS and NQ.]

Node-7, originator of the join:
  [2] multicast(S-group, "oper-5", "Select xxx");

Relation S:
  Address  A2   A3   A4   CPR
  xxx      bbb  aaa  ccc  9
  zzz      ggg  fff  hhh  7

S-derived intermediate tuple [3Sa/b]:
  CPR  Address  Source table name
  9    zzz      S

Node in NS: [3S]: build table: lscan S; for each local tuple:
  [3Sa]: extract S.cpr, S.address. Create the intermediate tuple (S.cpr, S.address, "S"). Create an instanceID as a random number: random_no = random();
  [3Sb]: PUSH: put("NQ", S.cpr, random_no, (S.cpr, S.address, "S"), "60 minutes").

Slide 21

DHT-based distributed join example (7/9)

Fig. 5.07: PIER: DHT-based pipelining symmetric hash join: step 4: pull intermediate data and perform the join on the nodes in namespace NQ.

[Figure: the entire DHT network, with namespaces NR, NS and NQ.]

Join result [4b]:
  CPR  Name  Address
  7    xxx   xxx
  9    yyy   zzz

S-derived intermediate tuple [3Sa/b]:
  CPR  Address  Source table name
  9    zzz      S

R-derived intermediate tuple [3Ra/b]:
  CPR  Name  Source table name
  9    yyy   R

Node in NQ: [4]: probe table: newData("NQ", item) is called by the DHT layer. For each call of newData:
  [4a]: PULL: get("NQ", item.resourceID);
  [4b]: if both an R-derived tuple and an S-derived tuple are returned:
    o extract R.cpr, R.name and S.address.
    o create the join result tuple (R.cpr, R.name, S.address).

Slide 22

DHT-based distributed join example (8/9)

Fig. 5.08: PIER: DHT-based pipelining symmetric hash join: step 5: join result tuples are pushed (sent) to the originator of the join.

[Figure: the entire DHT network, with namespaces NR, NS and NQ.]

Node-7, originator of the join:
  [5] receive("oper-5", DHT_item) (one call for each tuple in the result);

Join result [4b]:
  CPR  Name  Address
  7    xxx   xxx
  9    yyy   zzz

Node in NQ: [5]: probe table: PUSH: send("oper-5", hashkey(Node-7), (R.cpr, R.name, S.address)).

Slide 23

DHT-based distributed join example (9/9)

Fig. 5.09: PIER: DHT-based pipelining symmetric hash join: overview of the entire distributed join operation.

[Figure: the entire DHT network, with namespaces NR, NS and NQ.]

Node-7, originator of the join:
  [1] multicast(R-group, "oper-5", "Select xxx");
  [2] multicast(S-group, "oper-5", "Select xxx");
  [5] receive("oper-5", DHT_item) (one call for each tuple in the result);

Relation R:
  A1   A2   CPR  A4   A5   Name  A7
  aaa  bbb  7    ccc  ddd  xxx   eee
  fff  ggg  9    hhh  iii  yyy   jjj

Relation S:
  Address  A2   A3   A4   CPR
  xxx      bbb  aaa  ccc  9
  zzz      ggg  fff  hhh  7

Join result [4b]:
  CPR  Name  Address
  7    xxx   xxx
  9    yyy   zzz

S-derived intermediate tuple [3Sa/b]: (CPR, Address, source table name) = (9, zzz, S)
R-derived intermediate tuple [3Ra/b]: (CPR, Name, source table name) = (9, yyy, R)

Node in NQ: [4, 5]: probe table: newData("NQ", item) is called by the DHT layer. For each call of newData:
  [4a]: PULL: get("NQ", item.resourceID);
  [4b]: if both an R-derived tuple and an S-derived tuple are returned:
    o extract R.cpr, R.name and S.address.
    o [5] PUSH: send("oper-5", hashkey(Node-7), (R.cpr, R.name, S.address)).

Node in NR: [3R]: build table: lscan R; for each local tuple:
  [3Ra]: extract R.cpr, R.name. Create the intermediate tuple (R.cpr, R.name, "R"). Create an instanceID as a random number: random_no = random();
  [3Rb]: PUSH: put("NQ", R.cpr, random_no, (R.cpr, R.name, "R"), "60 minutes").

Node in NS: [3S]: build table: lscan S; for each local tuple:
  [3Sa]: extract S.cpr, S.address. Create the intermediate tuple (S.cpr, S.address, "S"). Create an instanceID as a random number: random_no = random();
  [3Sb]: PUSH: put("NQ", S.cpr, random_no, (S.cpr, S.address, "S"), "60 minutes").
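The data flow of steps [1] to [5] can be condensed into a single-process sketch; a dict stands in for namespace NQ and the function names mirror the provider calls on the slides. This illustrates the mechanism under those assumptions and is not PIER's implementation.

```python
# Single-process sketch of the pipelining symmetric hash join above.
# A dict stands in for namespace NQ; put/new_data mimic the provider calls
# on the slides. Illustration only, not PIER's implementation.
import random
from collections import defaultdict

R = [("aaa", "bbb", 7, "ccc", "ddd", "xxx", "eee"),
     ("fff", "ggg", 9, "hhh", "iii", "yyy", "jjj")]
S = [("xxx", "bbb", "aaa", "ccc", 9),
     ("zzz", "ggg", "fff", "hhh", 7)]

NQ = defaultdict(list)   # namespace NQ: resourceID -> intermediate tuples
results = []             # what Node-7 receives via receive("oper-5", ...)

def put(resource_id, instance_id, item, lifetime):
    """put("NQ", resourceID, instanceID, item, lifetime), simplified:
    instanceID and lifetime are accepted but ignored here."""
    NQ[resource_id].append(item)
    new_data(resource_id, item)                 # DHT layer callback

def new_data(resource_id, item):
    """newData("NQ", item): [4a] PULL the stored items, [4b] join the new
    item against tuples from the opposite source table, [5] 'send'."""
    for other in NQ[resource_id]:
        if other is item or other[-1] == item[-1]:
            continue                            # same tuple or same side
        r = item if item[-1] == "R" else other  # (cpr, name, "R")
        s = other if item[-1] == "R" else item  # (cpr, address, "S")
        results.append((r[0], r[1], s[1]))      # send to Node-7 [5]

# [3R] build: lscan R, push (R.cpr, R.name, "R") keyed on the join key.
for _, _, cpr, _, _, name, _ in R:
    put(cpr, random.random(), (cpr, name, "R"), "60 minutes")
# [3S] build: lscan S, push (S.cpr, S.address, "S").
for address, _, _, _, cpr in S:
    put(cpr, random.random(), (cpr, address, "S"), "60 minutes")

print(sorted(results))
```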

PIER's use of the DHT layer in the distributed join: addressing of a table by multicast group; a push/pull technique for intermediate results. Intermediate data is used as a queue, and the high degree of distribution and parallelism makes network delay much less important. Hashing on the primary key (rehashing) ensures related tuples land on the same node (content-addressable network / store based on the hashed key). The DHT is used as the exchange medium between nodes (in database terms).

Slide 24

Presentation overview: PIER

Core functionality and design principles.
Distributed join example.
CAN high-level functions.
Application, PIER, DHT/CAN in detail.
Distributed join / operations in detail.
Simulation results.
Conclusion.

Slide 25

CAN

- CAN is a DHT.
- The basic operations on CAN are insertion, lookup and deletion of (key, value) pairs.
- Each CAN node stores a chunk (called a zone) of the entire hash table.

Slide 26

- Every node is responsible for a small number of "adjacent" keys (its zone) in the table.
- Requests (insert, lookup, delete) for a particular key are routed by intermediate CAN nodes towards the CAN node that contains the key.

Slide 27

- The design centers around a virtual d-dimensional Cartesian coordinate space on a d-torus.
- The coordinate space is completely virtual and has no relation to any physical coordinate system.
- The key space is probably SHA-1.
- Key k1 is mapped onto a point p1 using a uniform hash function.
- (k1, v1) is stored at the node Nx that owns the zone containing p1.

Slide 28

Retrieving a value for a given key

- Apply the deterministic hash function to map the key onto a point P, then retrieve the corresponding value from the node owning P.
- One hash function per dimension.

  get(key) -> value:
  - lookup(key) -> ip address
  - ip.retrieve(key) -> value

Slide 29

Storing (key, value)

- The key is deterministically mapped onto a point P in the coordinate space using a uniform hash function.
- The corresponding (key, value) pair is then stored at the node that owns the zone within which the point P lies.

  put(key, value):
  - lookup(key) -> ip address
  - ip.store(key, value)
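A toy sketch of this two-step pattern (lookup, then talk to the owner), with a trivial hash-based assignment of keys to nodes standing in for real CAN zones; ToyCAN and its internals are assumptions for illustration.

```python
# Toy sketch of put/get as lookup + direct store/retrieve at the owner.
# A hash-based node assignment stands in for real CAN zones.
import hashlib

class ToyCAN:
    """A toy stand-in for a CAN: maps keys to 'nodes' (plain dicts)."""
    def __init__(self, nodes=4):
        self.nodes = [dict() for _ in range(nodes)]

    def lookup(self, key):          # lookup(key) -> owning node ("ip")
        point = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.nodes[point % len(self.nodes)]  # zone owning the point

    def put(self, key, value):      # put = lookup + store at the owner
        self.lookup(key)[key] = value

    def get(self, key):             # get = lookup + retrieve from the owner
        return self.lookup(key).get(key)

can = ToyCAN()
can.put("k1", "v1")
print(can.get("k1"))  # v1
```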

Slide 30

Routing

- A CAN node maintains a routing table that holds the IP address and virtual coordinate zone of each of its immediate neighbors in the coordinate space.
- Using its neighbor coordinate set, a node routes a message towards its destination by simple greedy forwarding to the neighbor with coordinates closest to the destination coordinates.

Slide 31

- A node maintains a routing table with its neighbors.
- Follow the straight-line path through the Cartesian space.

[Figure: 2-dimensional coordinate space from (0,0) to (16,16); data with key = (15,14) is routed greedily towards the zone containing that point.]
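The greedy rule can be sketched on the 16 x 16 torus of the figure; for simplicity the sketch routes hop-by-hop over unit grid points rather than over variable-sized zones, so it is an illustration of the forwarding rule only.

```python
# Sketch of greedy forwarding on a 2-d torus (16 x 16, as in the figure):
# at each hop, move to the neighbor closest to the destination.

SIZE = 16

def torus_dist(a, b):
    """Distance between points a and b on a 2-d torus of side SIZE."""
    return sum(min(abs(x - y), SIZE - abs(x - y)) for x, y in zip(a, b))

def neighbors(p):
    x, y = p
    return [((x + 1) % SIZE, y), ((x - 1) % SIZE, y),
            (x, (y + 1) % SIZE), (x, (y - 1) % SIZE)]

def route(src, dst):
    """Greedily forward from src towards dst; returns the path taken."""
    path, here = [src], src
    while here != dst:
        here = min(neighbors(here), key=lambda n: torus_dist(n, dst))
        path.append(here)
    return path

print(route((0, 0), (15, 14)))  # wraps around the torus: only 3 hops
```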

Slide 32

Node joining

1. A new node must find a node already in the CAN.
2. Next, using the CAN routing mechanisms, it must find a node whose zone will be split.
3. Neighbors of the split zone must be notified so that routing can include the new node.

Slide 33

CAN: construction

1) Discover some node "I" already in the CAN.
[Figure: the new node contacts node I.]

Slide 34

CAN: construction

2) Pick a random point (x,y) in the space.
[Figure: the new node picks (x,y).]

Slide 35

CAN: construction

3) I routes to (x,y) and discovers node J.
[Figure: the route from I to J's zone.]

Slide 36

CAN: construction

4) Split J's zone in half; the new node owns one half.
[Figure: J's zone is split between J and the new node.]
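A sketch of the split in step 4, assuming rectangular zones halved along their longer side; the concrete split rule here is an assumption for illustration.

```python
# Sketch of step 4: split a rectangular zone in half along its longer side
# and hand one half to the new node. The split rule is an assumption.

def split_zone(zone):
    """zone = ((x_lo, y_lo), (x_hi, y_hi)); returns (old_half, new_half)."""
    (x_lo, y_lo), (x_hi, y_hi) = zone
    if x_hi - x_lo >= y_hi - y_lo:            # split along x
        mid = (x_lo + x_hi) / 2
        return ((x_lo, y_lo), (mid, y_hi)), ((mid, y_lo), (x_hi, y_hi))
    mid = (y_lo + y_hi) / 2                   # otherwise split along y
    return ((x_lo, y_lo), (x_hi, mid)), ((x_lo, mid), (x_hi, y_hi))

j_zone = ((0, 0), (16, 16))                   # J's zone before the join
j_half, new_half = split_zone(j_zone)
print(j_half, new_half)  # J keeps one half; the new node owns the other
```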

Slide 37

Node departure

- A node explicitly hands over its zone and the associated (key, value) database to one of its neighbors.
- In case of network failure this is handled by a takeover algorithm.
- Problem: the takeover mechanism does not regenerate the lost data.
- Solution: every node keeps a backup of its neighbors' data.

Slide 38

Presentation overview: PIER

Core functionality and design principles.
Distributed join example.
CAN high-level functions.
Application, PIER, DHT/CAN in detail.
Distributed join / operations in detail.
Simulation results.
Conclusion.

Slide 39

Querying the Internet with PIER
(PIER = Peer-to-Peer Information Exchange and Retrieval)

Slide 40

What is a DHT?

• Take an abstract ID space and partition it among a changing set of computers (nodes).
• Given a message with an ID, route the message to the computer currently responsible for that ID.
• Messages can be stored at the nodes. This is like a "distributed hash table", providing a put()/get() API.

Slide 41

[Figure: 2-dimensional coordinate space from (0,0) to (16,16); data with key = (15,14).]

Given a message with an ID, route the message to the computer currently responsible for that ID.

Slide 42

Lots of effort is put into making DHTs better:
• scalable (thousands to millions of nodes)
• resilient to failure
• secure (anonymity, encryption, etc.)
• efficient (fast access with minimal state)
• load balanced
• etc.

Slide 43

[Architecture figure: declarative queries are turned into a query plan by PIER (core relational execution engine, catalog manager, query optimizer); PIER runs on the DHT (DHT wrapper, storage manager, overlay routing; based on CAN), which forms an overlay network on top of the physical IP network; applications (network monitoring, other user apps) sit on top.]

Declarative query example:
  SELECT R.cpr, R.name, S.address FROM R, S WHERE R.cpr = S.cpr

Slide 44

Applications

Any distributed relational database application.

Feasible applications:
- network monitoring
- intrusion detection
- fingerprint queries
- CPU load monitoring
- split zone

Slide 45

DHTs

- Implemented with CAN (Content Addressable Network).
- A node is identified by a rectangle in d-dimensional space.
- A key is hashed to a point and stored in the corresponding node.
- A routing table of neighbors is maintained: O(d) state.

Slide 46

DHT design

Routing layer:
- mapping for keys
- dynamic as nodes leave and join

Storage manager:
- DHT-based data

Provider:
- storage access interface for the higher levels

[Diagram: Provider / Storage Manager / Overlay Routing]

Slide 47

DHT – Routing

The routing layer maps a key to the IP address of the node currently responsible for that key. It provides exact lookups, and calls back to higher levels when the set of keys owned locally has changed.

Routing layer API:
  lookup(key) -> ipaddr      (asynchronous)
  join(landmarkNode)         (synchronous, local node)
  leave()                    (synchronous, local node)
  locationMapChange()        (callback)

[Diagram: Provider / Storage Manager / Overlay Routing]
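The same API written as an abstract Python interface; the callback-based delivery of the asynchronous lookup result is an assumption based on the slide's annotations.

```python
# Sketch of the routing layer API above as an abstract interface.
# Signatures follow the slide; callback semantics are assumptions.
from abc import ABC, abstractmethod

class RoutingLayer(ABC):
    @abstractmethod
    def lookup(self, key, callback):
        """Asynchronous: resolve key to the responsible node's IP address;
        the result is delivered via callback(ipaddr)."""

    @abstractmethod
    def join(self, landmark_node):
        """Synchronous, local node: join the overlay via a landmark node."""

    @abstractmethod
    def leave(self):
        """Synchronous, local node: leave the overlay gracefully."""

    @abstractmethod
    def location_map_change(self):
        """Callback: invoked when the set of keys this node owns changes."""
```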

Slide 48

DHT – Storage

The storage manager stores and retrieves records, which consist of key/value pairs. Keys are used to locate items and can be any data type or structure supported.

Storage manager API:
  store(key, items)
  retrieve(key) -> items (structure)
  remove(key)

[Diagram: Provider / Storage Manager / Overlay Routing]
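A minimal in-memory sketch of this API; a real storage manager persists items and supports arbitrary supported key types, and is shown here with a single item per store call for simplicity.

```python
# Minimal in-memory sketch of the storage manager API above.
from collections import defaultdict

class StorageManager:
    def __init__(self):
        self._items = defaultdict(list)

    def store(self, key, item):
        self._items[key].append(item)

    def retrieve(self, key):
        return list(self._items[key])   # -> items (structure)

    def remove(self, key):
        self._items.pop(key, None)

sm = StorageManager()
sm.store("k", ("a", 1))
print(sm.retrieve("k"))  # [('a', 1)]
```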

Slide 49

DHT – Provider (1)

[Diagram: Provider / Storage Manager / Overlay Routing]

The provider ties the routing and storage manager layers together and provides an interface to them.

Each object in the DHT has a namespace, a resourceID and an instanceID:
- DHT key = (hash1(namespace, resourceID) + ... + hashN(namespace, resourceID)), i.e. one hash function per CAN dimension.
- namespace: application or group of objects; a table or relation.
- resourceID: primary key or any attribute (object).
- instanceID: an integer, used to separate items with the same namespace and resourceID.

Lifetime: how long the item is stored; this adheres to the principle of relaxed consistency.

CAN's mapping of resourceID/object is equivalent to an index. (The number of hash functions depends on the dimension; on CAN the key space is presumably 0..2^160, the SHA-1 space mentioned earlier.)
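A sketch of the per-dimension hashing just described; the concrete hash construction and coordinate range are assumptions for illustration.

```python
# Sketch of the key construction above: one hash per CAN dimension, each
# derived from (namespace, resourceID), together giving the coordinates of
# the point where the object is stored. Construction details are assumed.
import hashlib

DIMENSIONS = 2
SIDE = 2**16          # coordinate range per dimension (illustrative)

def can_point(namespace: str, resource_id: str):
    """hash_i(namespace, resourceID) for i = 1..d -> a point in d-space."""
    coords = []
    for dim in range(DIMENSIONS):
        h = hashlib.sha1(f"{dim}|{namespace}|{resource_id}".encode())
        coords.append(int(h.hexdigest(), 16) % SIDE)
    return tuple(coords)

print(can_point("R", "311056-0485"))  # the zone owning this point stores it
```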

Slide 50

DHT – Provider (2)

Provider API:
  get(namespace, resourceID) -> item
  put(namespace, resourceID, item, lifetime)
  renew(namespace, resourceID, instanceID, lifetime) -> bool
  multicast(namespace, resourceID, item)
  lscan(namespace) -> items (structure/iterator)
  newData(namespace, item)

[Figure: table R is a namespace spread over nodes: node R1 stores tuples 1..n (rID1, rID2, ..., each with its item), node R2 stores tuples n+1..m.]

[Diagram: Provider / Storage Manager / Overlay Routing]
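A single-process, in-memory sketch of the provider interface, following the five-argument put used in the join example; lifetimes are taken in seconds and honored at read time, and multicast and real routing are omitted. Everything beyond the listed signatures is an assumption.

```python
# In-memory sketch of the provider API listed above. Lifetimes are honored
# at read time; multicast and real routing are omitted. Illustration only.
import time
from collections import defaultdict

class Provider:
    def __init__(self):
        # (namespace, resourceID) -> list of (instanceID, expiry, item)
        self._store = defaultdict(list)
        self._handlers = defaultdict(list)   # namespace -> newData handlers

    def put(self, namespace, resource_id, instance_id, item, lifetime_s):
        self._store[(namespace, resource_id)].append(
            (instance_id, time.time() + lifetime_s, item))
        for handler in self._handlers[namespace]:
            handler(namespace, item)          # newData(namespace, item)

    def get(self, namespace, resource_id):
        now = time.time()
        return [item for _, expiry, item
                in self._store[(namespace, resource_id)] if expiry > now]

    def renew(self, namespace, resource_id, instance_id, lifetime_s):
        entries = self._store[(namespace, resource_id)]
        for i, (iid, _, item) in enumerate(entries):
            if iid == instance_id:
                entries[i] = (iid, time.time() + lifetime_s, item)
                return True
        return False

    def lscan(self, namespace):
        """Iterate over all locally stored items in a namespace."""
        for (ns, _), entries in self._store.items():
            if ns == namespace:
                for _, _, item in entries:
                    yield item

    def on_new_data(self, namespace, handler):
        """Register a newData(namespace, item) callback (assumed mechanism)."""
        self._handlers[namespace].append(handler)
```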

Slide 51

Query processor

How it works:
• performs selection, projection, joins, grouping and aggregation (operators)
• push and pull ways of operating
• simultaneous execution of multiple operators pipelined together
• results are produced and queued as quickly as possible

How it modifies data:
• inserts, updates and deletes different items via the DHT interface

How it selects data to process:
• dilated-reachable snapshot: the data published by reachable nodes at query arrival time

Slide 52

[Figure: two joins executed via temporary data (namespaces) in the DHT, pushed and pulled, then joined and aggregated: (R ⋈ S) ⋈ (T ⋈ Q) on 4b1.CPR = 4b2.CPR.]

Relation R:
  A1   A2   CPR  A4   A5   Name  A7
  aaa  bbb  7    ccc  ddd  xxx   eee
  fff  ggg  9    hhh  iii  yyy   jjj

Relation S:
  Address  A2   A3   A4   CPR
  xxx      bbb  aaa  ccc  9
  zzz      ggg  fff  hhh  7

Join result [4b1] (R ⋈ S, via push/pull through temporary data in the DHT):
  CPR  Name  Address
  7    xxx   xxx
  9    yyy   zzz

Relation T:
  A1   A2   CPR  Dptno  A5   Name  A7
  aaa  bbb  7    1      ddd  xxx   eee
  fff  ggg  9    2      iii  yyy   jjj

Relation Q:
  Deptname  A2   Total Salary  Worked  Deptno
  xxx       bbb  30,000        30      1
  zzz       ggg  40,000        40      2

Join result [4b2] (T ⋈ Q, with grouping & aggregation on salary):
  CPR  DeptName  Salary
  7    xxx       30,000
  9    yyy       40,000

Final join result [4c] ((R ⋈ S) ⋈ (T ⋈ Q) on 4b1.CPR = 4b2.CPR):
  CPR  Name  Deptname  Salary  Address
  7    xxx   xxx       30,000  xxx
  9    yyy   yyy       40,000  zzz

Slide 53

DHT-based distributed join detailed: PIER layer procedures in the node originating the join (1/2).

Operations on node Node-7, originating the join (1/2). PIER core relational execution engine:

- [0]: Assign to the join operation a globally unique operation id and a namespace for the result:
  o operation id = "oper-5".
  o the namespace is chosen equal to the operation id = "oper-5".
- [1]: Activate the join in all nodes belonging to NR:
  o multicast group R-group was created when R was created (assumption).
  o DHT_resourceID = "oper-5".
  o DHT_item = "SELECT R.cpr, R.name, S.address FROM R, S WHERE R.cpr = S.cpr".
  o multicast syntax: multicast(namespace, resourceID, item).
  o multicast(R-group, DHT_resourceID, DHT_item);

Slide 54

DHT-based distributed join detailed: PIER layer procedures in the node originating the join (2/2).

Operations on node Node-7, originating the join (2/2):
- [2]: Activate the join in all nodes belonging to NS:
  o multicast group S-group was created when S was created (assumption).
  o DHT_resourceID = "oper-5".
  o DHT_item = "SELECT R.cpr, R.name, S.address FROM R, S WHERE R.cpr = S.cpr" (as above).
  o multicast(S-group, DHT_resourceID, DHT_item);
- [5]: Receive the result in namespace "oper-5":
  o the DHT provider calls the receive function in the PIER core relational execution engine.
  o assumed syntax of receive: receive(namespace, item).
  o for each tuple in the join result, the receive function is called: receive("oper-5", DHT_item);
  o for each call of receive: store the received DHT_item in the result table.

Slide 55

DHT-based distributed join detailed: PIER layer procedures in a node containing data of relation R.

Operations on a node containing data of relation R (multicast group NR). Building of the hash table: lscan R; for each tuple in R:
  o extract R.cpr, R.name.
  o DHT_item = (R.cpr, R.name, "R").
  o DHT_instanceID = random();
  o put syntax: put(namespace, resourceID, instanceID, item, lifetime):
  o [3R]: PUSH: put("NQ", R.cpr, DHT_instanceID, DHT_item, "60 minutes").

Slide 56

DHT-based distributed join detailed: PIER layer procedures in a node containing data of relation S.

(Symmetric to the procedures for relation R above: extract S.cpr and S.address, and put("NQ", S.cpr, DHT_instanceID, (S.cpr, S.address, "S"), "60 minutes").)

Slide 57

DHT-based distributed join detailed: PIER layer procedures in a node containing intermediate data (namespace NQ).

Operations on a node containing intermediate data (namespace NQ): newData("NQ", item) in the PIER layer is called by the DHT layer. For each call of newData:
  o get syntax: get(namespace, resourceID) -> list of items.
  o [4] PULL: get("NQ", item.resourceID);
  o if get returns both an R-derived tuple and an S-derived tuple, do the following:
    - extract R.cpr, R.name and S.address.
    - create the join result tuple (R.cpr, R.name, S.address); DHT_item = (R.cpr, R.name, S.address).
    - send DHT_item to the originator of the join. Assumed syntax of send: send(namespace, receiver-address, item); note: the receiver address is a hash key.
    - [5] PUSH: send("oper-5", hashkey(Node-7), DHT_item).

Slide 58

Presentation overview: PIER

Core functionality and design principles.
Distributed join example.
CAN high-level functions.
Application, PIER, DHT/CAN in detail.
Distributed join / operations in detail.
Simulation results.
Conclusion.

Slide 59

DHT-based distributed joins detailed: important properties.

The distributed join is performed by the PIER layer.

Slide 60

DHT-based distributed joins detailed: important properties.

- The PIER distributed joins are adaptations of well-known distributed database join algorithms.
- Distributed databases and database operations have been the object of research for many years.
- The PIER distributed join leverages the DHT in many aspects, because the DHT provides the desired scalability in network size.

Slide 61

DHT-based distributed join detailed: CAN multicast design.

Fig. 6-det: PIER: CAN multicast design.

[Figure: the entire CAN network is the original CAN network (identifier 0); multicast group R-group is a separate CAN network (identifier 1) with separate namespace NR; multicast group S-group is a separate CAN network (identifier 2) with separate namespace NS.]

CAN multicast is done by an enhanced flooding mechanism, which is acceptably efficient.

Slide 62

DHT-based distributed join consistency: in comparison, complete consistency in traditional distributed database systems.

Fig. 5.11: PIER: DHT-based pipelining symmetric hash join: complete consistency in traditional distributed database systems.

[Figure: the entire DHT network, with namespaces NR and NS.]

Consistency in traditional distributed database systems:
- Data consistency is ensured by traditional database mechanisms (including committed update transactions and a "keep alive" protocol). This includes clock synchronization.
  o data are made unavailable by the database system if they are inconsistent.
- The correct set of data is the union of the local data on the participating nodes, where each set of local data is from the same point in time: the time of the query, which is synchronized within a very short time interval on all participating nodes.
- Nodes are either available or the query is discarded by the database system (network partition robustness is sacrificed to consistency).

Slide 63

DHT-based distributed join consistency: consistency defined by the dilated-reachable snapshot.

Fig. 5.10: PIER: DHT-based pipelining symmetric hash join: consistency defined by the dilated-reachable snapshot.

[Figure: the entire DHT network, with namespaces NR and NS; node Y originated a tuple in S, which is stored on node X.]

Consistency defined by the dilated-reachable snapshot:
- Data consistency is ensured by traditional database mechanisms (including committed update transactions and a "keep alive" protocol; note: update is not part of PIER). This includes clock synchronization.
  o this may not be the case, but then query consistency is not interesting.
- The correct set of data is the (slightly time-dilated) union of the local snapshots of data published by reachable nodes, where each local snapshot is from the time of arrival of the multicast query message at that node.
- Data inconsistency (consistency sacrificed to availability and network partition robustness):
  o missing data: some nodes storing data of a relation may be unavailable.
  o invalid data: some stored data originated on nodes which are no longer available.

Slide 64

DHT-based distributed joins detailed: important properties.

PIER's use of the DHT layer in the distributed join:
- Relation -> multicast group = namespace. The join operation is multicast to all nodes containing data of a given relation (addressing of a table by multicast group). This multicast functionality is essential to achieve distribution and parallelism.
- Push/pull technique:
  • intermediate results are stored (pushed) in a hashed table.
  • nodes performing the next step in the workflow pull the intermediate results.
- DHT hashed tables with intermediate results are used as queues in the workflow.
- The DHT-based queue and the high degree of distribution and parallelism make network delay much less important.
- Hashing on the primary key (rehashing) ensures related tuples land on the same node (content-addressable network / store based on the hashed key).
- The DHT is used as the exchange medium between nodes (in database terms).

Slide 65

DHT-based distributed join detailed: important properties.

Essential properties of the DHT protocol used:
• reliability in inserting data (put)
• reliability in retrieving data (get)
• reliability in storing data (preserving data at node failure)
• scalability in terms of number of nodes and amount of work handled
  (traditional distributed database systems have this too, but DHT (at least CAN) scalability is larger)
• robustness to node and network failures
  (traditional database systems do not have this; cf. Brewer's "CAP" conjecture: you can only have two of Consistency, Availability and tolerance of network Partitions. Traditional distributed systems choose C and sacrifice A in the face of P.)
• support for a quickly and frequently varying number of participating nodes
  (traditional distributed databases normally do not have this).

Slide 66

DHT-based distributed join detailed: important properties.

Comparison of traditional distributed databases and PIER distributed query:
• A PIER query (as the name shows) is read only; there is no update.
• Committed update transactions and operation support facilities like backup (e.g. checkpoint based) are
  - not a design goal of PIER and other P2P systems.
  - an essential property of traditional distributed database systems, and have been for many years.

Slide 67

DHT-based distributed join detailed: important properties.

PIER: supported distributed join algorithms and query rewrite mechanisms (1/2):
• Note initially:
  - The basic principles of PIER's use of the DHT for distributed query are already shown by the above example of the pipelining symmetric hash join.
  - The pipelining symmetric hash join is the most general-purpose equi-join operation of PIER.
  - The other distributed query mechanisms primarily differ with respect to distributed database operation strategy and bandwidth-saving techniques.
  - The other distributed query mechanisms are only mentioned briefly here.

Slide 68

DHT-based distributed join detailed: important properties.

PIER: supported distributed join algorithms and query rewrite mechanisms (2/2):
• PIER supports DHT-based versions of two distributed binary equi-join algorithms:
  - pipelined symmetric hash join: shown in the example above.
  - fetch matches: one of the tables is already hashed on the join key and attributes.
• PIER supports DHT-based versions of two distributed bandwidth-reducing query rewrite strategies (a Bloom-filter sketch follows below):
  - symmetric semi-join: first a local projection to the "source" table's join keys and attributes, then a pipelined symmetric hash join.
  - Bloom join: roughly, an "index" table keyed on the join key of each participating table is put into an intermediate table and makes it possible to efficiently identify the matching set of tuples in the participating tables.
• Note: the pipelining symmetric hash join (above example) in itself saves much network bandwidth.
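Since the slide only gestures at the Bloom join, here is a generic Bloom-filter sketch of the underlying bandwidth-saving idea: ship a compact filter of one side's join keys so the other side only forwards tuples that might match. Parameters and flow are illustrative, not PIER's design.

```python
# Generic Bloom-filter sketch of the idea behind a Bloom join: each side
# ships a compact filter of its join keys instead of the keys themselves;
# only tuples passing the filter are sent on. Illustration only.
import hashlib

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.v = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}|{key}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.v |= 1 << p

    def might_contain(self, key):
        # False positives are possible; false negatives are not.
        return all(self.v >> p & 1 for p in self._positions(key))

# R's side builds a filter of its join keys and ships only the filter.
r_join_keys = [7, 9, 12]
bf = BloomFilter()
for k in r_join_keys:
    bf.add(k)

# S's side forwards only tuples whose join key might match R.
s_tuples = [("zzz", 7), ("xxx", 9), ("www", 42)]
print([t for t in s_tuples if bf.might_contain(t[1])])
# typically [('zzz', 7), ('xxx', 9)]
```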

Slide 69

Presentation overview: PIER

Core functionality and design principles.
Distributed join example.
CAN high-level functions.
Application, PIER, DHT/CAN in detail.
Distributed join / operations in detail.
Simulation results.
Conclusion.

Slide 70

Evaluation of peer scalability and robustness by means of simulation and limited-scale experimental operation:
- Subject of evaluation: the above two distributed join algorithms and the above two query rewrite methods.
- Simulation: a 10,000-node network.
- Experimental operation: a cluster of 64 PCs in a LAN.
- Simulation results:
  • the network scales well, both in number of nodes and in amount of work.
  • the network is robust to node failure. Robustness is provided by the DHT layer, not by the PIER layer.
- Evaluation by the presenters: the PIER simulation test cases seem relevant and the results seem acceptable.
- Experimental results: negligible (the experimental operation must be extended).
- Reference: "Querying the Internet with PIER", Ryan Huebsch et al.; see the course homepage. Hereafter "the article".
- The most important simulation tests and results are shown below.

Slide 71

Simulation testbed

- The same test distributed join is used in all tests.
- The amount of work, in terms of data traffic, of the test distributed join can be, and is, increased proportionally with the increasing number of nodes in the network.

Slide 72

Simulation: average join time when scaling network and work ("full mesh" topology) (1/2).

- Test case: average join time when both the amount of work and the number of nodes in the network scale.
- "Full mesh" network topology used.
- Result: the network scales well (a performance degradation of a factor of 4 when the number of nodes, and proportionally the amount of work, scales from 2 to 10,000).
- See article fig. 3 below.

Slide 73

Simulation: average join time when scaling network and work ("full mesh" topology) (2/2).

[Article fig. 3: average join time as the number of nodes and the amount of work scale.]

Slide 74

Simulation: average join time for different mechanisms (1/2).

Test case: average join time for:
• the supported distributed join algorithms:
  - pipelining symmetric hash join
  - fetch matches
• the supported query rewrite strategies:
  - symmetric semi-join
  - Bloom join

Result: see article table 4 below.

Slide 75

Simulation: average join time for different mechanisms (2/2).

[Article table 4: average join time per join algorithm and rewrite strategy.]

Slide 76

Simulation test cases and results: aggregate join traffic for different mechanisms (1/2).

- Test case: aggregate join traffic generated by the supported join algorithms and query rewrite strategies.
- Result: see article fig. 4 below.

Slide 77

Simulation test cases and results: aggregate join traffic for different mechanisms (2/2).

[Article fig. 4: aggregate join traffic per join algorithm and rewrite strategy.]

Slide 78

Simulation: average join time when scaling network and work ("transit stub" topology) (1/4).

- Note initially: the topologies examined are the underlying IP network topology, not the PIER/DHT overlay network topology.
- Test case: "transit stub" network topology compared to "full mesh" network topology.
- "Transit stub" is the only realistic network topology; "full mesh" is easier to simulate.
- "Transit stub" topology is a traditional hierarchical network topology:
  • 4 "transit domains", each having
  • 10 "transit nodes", each having
  • 3 "stub nodes".
  • PIER nodes are distributed equally over the "stub nodes".

Slide 79

Simulation: average join time when scaling network and work ("transit stub" topology) (2/4).

- The "transit stub" topology is compared with the "full mesh" topology by comparing, in the two topologies, the average join time when both the amount of work and the number of nodes scale.
- The performance scaling ability of the "full mesh" topology was shown in fig. 3 above.
- The performance scaling ability of the "transit stub" topology is shown in fig. 7 below.
- Result:
  • the "transit stub" topology's performance scaling ability is close to that of the "full mesh" topology.
  • the "transit stub" absolute delay is bigger (due to the realistic topology and network delays).
  • conclusion: "full mesh", which is easier to simulate, is used in the simulation tests, since it is sufficiently close to the realistic "transit stub" topology.

Slide 80

Simulation: the "transit stub" topology's ability to keep performance high when scaling (fig. 7): average join time when scaling network and work (3/4).

[Article fig. 7.]

Slide 81

Simulation: in comparison, the "full mesh" topology's ability to keep performance high when scaling (fig. 3): average join time when scaling network and work (4/4).

[Article fig. 3.]

Slide 82

Presentation overview: PIER

Core functionality and design principles.
Distributed join example.
CAN high-level functions.
Application, PIER, DHT/CAN in detail.
Distributed join / operations in detail.
Simulation results.
Conclusion.

Slide 83

Conclusion

• PIER is a promising technique.
• Network monitoring is limited to network state data, which does not need to be consistent with respect to a small time interval for updates.
• The only feasible applications are those which can accept highly inconsistent data.