Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval)

What is PIER?

Peer-to-Peer Information Exchange and Retrieval

Query engine that runs on top of a P2P network
• A step toward distributed query processing at a larger scale
• A path to massive distribution: querying heterogeneous data

The architecture combines traditional database query processing with recent peer-to-peer technologies

The key goal is a scalable indexing system for large-scale decentralized storage applications on the Internet
• P2P systems, large-scale storage management systems (OceanStore, Publius), wide-area name resolution services

What is Very Large? Depends on Who You Are

[Figure: scale spectrum. Single site; clusters; distributed (10's – 100's); Internet scale (1000's – millions). Labels: Database Community, Network Community.]

How to run DB-style queries at Internet scale!

Internet-scale systems vs. hundred-node systems

What are the Key Properties?

Lots of data that is:
1. Naturally distributed (where it's generated)
2. Centralized collection undesirable
3. Homogeneous in schema
4. Data is more useful when viewed as a whole

Who Needs Internet Scale? Example 1: Filenames

Simple ubiquitous schemas:
• Filenames, sizes, ID3 tags

Born from early P2P systems such as Napster, Gnutella, etc.

Content is shared by "normal" non-expert users… home users

Systems were built by a few individuals 'in their garages'
• Low barrier to entry

Example 2: Network Traces

Schemas are mostly standardized:
• IP, SMTP, HTTP, SNMP log formats

Network administrators are looking for patterns within their site AND with other sites:
• DoS attacks cross administrative boundaries
• Tracking virus/worm infections
• Timeliness is very helpful

Might surprise you how useful it is:
• Network bandwidth on PlanetLab (a world-wide distributed research test bed) is mostly filled with people monitoring the network status

Our Challenge

Our focus is on the challenge of scale:
• Applications are homogeneous and distributed
  • Already have significant interest
• Provide a flexible framework for a wide variety of applications

Four Design Principles (I)

Relaxed Consistency
• ACID transactions severely limit the scalability and availability of distributed databases
• We provide best-effort results

Organic Scaling
• Applications may start small, without a priori knowledge of size

Four Design Principles (II)

Natural Habitat
• No CREATE TABLE/INSERT
• No "publish to web server"
• Wrappers or gateways allow the information to be accessed where it is created

Standard Schemas via Grassroots Software
• Data is produced by widespread software, providing a de facto schema to utilize

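[Figure: PIER architecture, top to bottom. Applications (Network Monitoring, Other User Apps) submit declarative queries to PIER (Core Relational Execution Engine, Catalog Manager, Query Optimizer). PIER passes query plans to the DHT layer (DHT Wrapper, Storage Manager, Overlay Routing; based on CAN), which forms the overlay network on top of the IP network (the physical network).]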

Applications

P2P Databases
• Highly distributed and available data

Network Monitoring
• Intrusion detection
• Fingerprint queries

DHTs

Implemented with CAN (Content Addressable Network).
• Each node is identified by a hyper-rectangle in d-dimensional space
• A key is hashed to a point and stored in the corresponding node
• Each node maintains a routing table of its neighbours: O(d) entries

[Figure: a 2-dimensional CAN space with corners (0,0), (16,0), (0,16) and (16,16); a data item with key = (15,14) is stored at the node whose zone contains that point.]

Given a message with an ID, route the message to the computer currently responsible for that ID
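The mapping from a key to a point, and from a point to the responsible node, can be sketched in a few lines. This is only an illustration under stated assumptions: the four-zone layout, the SHA-256 hash, and the names `Zone`, `key_to_point` and `owner_of` are hypothetical and are not taken from CAN or PIER.

```python
import hashlib
from dataclasses import dataclass

SIDE = 16  # the slide's 2-d space spans (0,0)..(16,16)


@dataclass(frozen=True)
class Zone:
    """Axis-aligned rectangle [x0, x1) x [y0, y1) owned by one node."""
    owner: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, point):
        x, y = point
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1


def key_to_point(key: str):
    """Hash a key to a point in the 2-d space (hash choice is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    x = int.from_bytes(digest[:8], "big") % SIDE
    y = int.from_bytes(digest[8:16], "big") % SIDE
    return (x, y)


def owner_of(point, zones):
    """Return the node whose zone contains the point."""
    for zone in zones:
        if zone.contains(point):
            return zone.owner
    raise LookupError(f"no zone covers {point}")


# Four equal zones that tile the space (an illustrative layout only).
ZONES = [
    Zone("A", 0, 0, 8, 8),
    Zone("B", 8, 0, 16, 8),
    Zone("C", 0, 8, 8, 16),
    Zone("D", 8, 8, 16, 16),
]
```

With this layout, the point (15,14) from the figure falls in the zone of node "D", so `owner_of((15, 14), ZONES)` names the node that would store the item.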

DHT Design

Routing Layer
• Mapping for keys (dynamic as nodes leave and join)

Storage Manager
• DHT-based data

Provider
• Storage access interface for higher levels

DHT – Routing

Routing layer
• Maps a key into the IP address of the node currently responsible for that key
• Provides exact lookups and calls back to higher levels when the set of keys has changed

Routing layer API
• lookup(key) → ipaddr (asynchronous function)
• join(landmarkNode)
• leave()
• locationMapChange()
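A toy, single-process stand-in for the routing layer shows the shape of this API. It is a sketch under assumptions, not PIER's implementation: it places nodes on a sorted ring of hashed IDs instead of using CAN, `join` and `leave` take the node's own address rather than a landmark node, and the class name `ToyRoutingLayer` is hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 32-bit ID (hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:4], "big")


class ToyRoutingLayer:
    """Maps keys to the address of the node currently responsible for them."""

    def __init__(self, on_location_map_change=None):
        self._ids = []    # sorted node IDs
        self._addr = {}   # node ID -> IP address
        self._on_change = on_location_map_change

    def join(self, ipaddr):
        node_id = _hash(ipaddr)
        if node_id not in self._addr:
            bisect.insort(self._ids, node_id)
            self._addr[node_id] = ipaddr
            self._notify()

    def leave(self, ipaddr):
        node_id = _hash(ipaddr)
        if node_id in self._addr:
            self._ids.remove(node_id)
            del self._addr[node_id]
            self._notify()

    def lookup(self, key, callback):
        """Deliver the responsible node's address through a callback."""
        if not self._ids:
            raise LookupError("no nodes have joined")
        # First node ID at or after the key's hash, wrapping around the ring.
        i = bisect.bisect_left(self._ids, _hash(key)) % len(self._ids)
        callback(self._addr[self._ids[i]])

    def _notify(self):
        # Plays the role of locationMapChange(): tell higher levels that
        # the key-to-node mapping may have changed.
        if self._on_change:
            self._on_change()
```

The callback in `lookup` mirrors the asynchronous style of the API; in a real overlay the answer would arrive after network hops rather than immediately.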

DHT – Storage

Storage Manager
• Stores and retrieves records, which consist of key/value pairs
• Keys are used to locate items and can be any supported data type or structure

Storage Manager API
• store(key, item)
• retrieve(key) → item
• remove(key)

[Figure: DHT layers: Provider, Storage Manager, Overlay Routing]
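The three storage calls map naturally onto a dictionary. The sketch below is a minimal in-memory version; the class name `InMemoryStorageManager`, the choice to keep several items per key, and the requirement that keys be hashable are assumptions made for the example.

```python
class InMemoryStorageManager:
    """Minimal local store for key/value records (illustrative only)."""

    def __init__(self):
        self._items = {}  # key -> list of items stored under that key

    def store(self, key, item):
        self._items.setdefault(key, []).append(item)

    def retrieve(self, key):
        # Return a copy so callers cannot mutate the stored list.
        return list(self._items.get(key, []))

    def remove(self, key):
        self._items.pop(key, None)
```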

DHT – Provider (1)

Provider
• Ties the routing and storage manager layers together and provides an interface

Each object in the DHT has a namespace, resourceID and instanceID

DHT key = hash(namespace, resourceID)

• namespace – application or group of objects, table or relation
• resourceID – primary key or any attribute (Object)
• instanceID – integer, to separate items with the same namespace and resourceID
• Lifetime – item storage duration

CAN's mapping of resourceID/Object is equivalent to an index
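The key derivation can be illustrated as follows. The function name `dht_key`, the SHA-1 hash and the byte encoding are assumptions for the sketch; the slide only states that the key is a hash of namespace and resourceID.

```python
import hashlib


def dht_key(namespace: str, resource_id) -> int:
    """Derive a DHT key from (namespace, resourceID).

    The instanceID is deliberately left out: items that share a namespace
    and resourceID hash to the same key and therefore land on the same node.
    """
    # Prefix the namespace length so ("ab", "c") and ("a", "bc") differ.
    payload = f"{len(namespace)}:{namespace}|{resource_id!r}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(payload).digest(), "big")
```

Because every tuple of relation R with the same resourceID maps to the same key, hashing a table on an attribute behaves like building a hash index on that attribute.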

DHT – Provider (2)

Provider API
• get(namespace, resourceID) → item
• put(namespace, resourceID, item, lifetime)
• renew(namespace, resourceID, instanceID, lifetime) → bool
• multicast(namespace, resourceID, item)
• lscan(namespace) → items
• newData(namespace, item)

[Figure: table R (a namespace) partitioned across nodes. Node R1 holds tuples 1..n and node R2 holds tuples n+1..m; each tuple is stored as an item under its resourceID (rID1, rID2, rID3). Layers shown: Provider, Storage Manager, Overlay Routing.]
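A single-process sketch can show how `put`, `get`, `renew` and `lscan` interact with item lifetimes. It is not the PIER Provider: there is no network, `multicast` and `newData` are omitted, the injected `clock` and the class name `ToyProvider` are hypothetical, and returning the instanceID from `put` is an assumption made so the caller can renew the item later.

```python
import itertools


class ToyProvider:
    """Soft-state store keyed by (namespace, resourceID, instanceID)."""

    def __init__(self, clock):
        self._clock = clock  # callable returning the current time
        self._items = {}     # (ns, rid, iid) -> (item, expires_at)
        self._next_instance = itertools.count(1)

    def put(self, namespace, resource_id, item, lifetime):
        instance_id = next(self._next_instance)
        key = (namespace, resource_id, instance_id)
        self._items[key] = (item, self._clock() + lifetime)
        return instance_id

    def renew(self, namespace, resource_id, instance_id, lifetime):
        """Extend an item's lifetime; False if it has already expired."""
        self._expire()
        key = (namespace, resource_id, instance_id)
        if key not in self._items:
            return False
        item, _old_expiry = self._items[key]
        self._items[key] = (item, self._clock() + lifetime)
        return True

    def get(self, namespace, resource_id):
        self._expire()
        return [item for (ns, rid, _iid), (item, _exp) in self._items.items()
                if ns == namespace and rid == resource_id]

    def lscan(self, namespace):
        """Return every live item stored locally in the namespace."""
        self._expire()
        return [item for (ns, _rid, _iid), (item, _exp) in self._items.items()
                if ns == namespace]

    def _expire(self):
        now = self._clock()
        dead = [k for k, (_item, exp) in self._items.items() if exp <= now]
        for k in dead:
            del self._items[k]
```

Items that are not renewed simply disappear once their lifetime runs out, which fits the relaxed-consistency, best-effort design: no explicit delete is needed when a publisher goes away.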

Query Processor

How does it work?
• Performs selection, projection, joins, grouping, aggregation -> Operators
• Operators push and pull data
• Simultaneous execution of multiple operators pipelined together
• Results are produced and queued as quickly as possible

How does it modify data?
• Inserts, updates and deletes different items via the DHT interface

How does it select data to process?
• Dilated-reachable snapshot – data published by reachable nodes at the query arrival time
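Pipelined operators can be sketched with Python generators, where each operator pulls rows from its input and hands results downstream as soon as they are ready. This shows only the pull side and is an illustration, not PIER's operator code; the function names and the sample `FILES` rows are hypothetical.

```python
def scan(rows):
    """Source operator: emit rows one at a time."""
    for row in rows:
        yield row


def select(rows, predicate):
    """Selection: pass through only rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, columns):
    """Projection: keep only the named columns."""
    for row in rows:
        yield {c: row[c] for c in columns}


# Illustrative data in the spirit of the filename example.
FILES = [
    {"name": "a.mp3", "size": 4, "owner": "n1"},
    {"name": "b.mp3", "size": 9, "owner": "n2"},
    {"name": "c.txt", "size": 1, "owner": "n1"},
]


def big_file_names(rows, min_size):
    """Plan: scan -> select(size > min_size) -> project(name)."""
    return project(select(scan(rows), lambda r: r["size"] > min_size), ["name"])
```

No operator materializes its full input, so the first result is available after the first matching row has been scanned.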

Join Algorithms

Limited Bandwidth

Symmetric Hash Join:
- Rehashes both tables

Semi Joins:
- Transfer only matching tuples

At 40% selectivity, the bottleneck switches from computation nodes to query sites

Future Research

• Routing, Storage and Layering
• Catalogs and Query Optimization
• Hierarchical Aggregations
• Range Predicates
• Continuous Queries over Streams
• Sharing between Queries
• Semi-structured Data

Distributed Hash Tables (DHTs)

What is a DHT?
• Take an abstract ID space, and partition it among a changing set of computers (nodes)
• Given a message with an ID, route the message to the computer currently responsible for that ID
• Can store messages at the nodes
• This is like a "distributed hash table"
  • Provides a put()/get() API
• Cheap maintenance when nodes come and go

Distributed Hash Tables (DHTs)

Lots of effort is put into making DHTs better:
• Scalable (thousands → millions of nodes)
• Resilient to failure
• Secure (anonymity, encryption, etc.)
• Efficient (fast access with minimal state)
• Load balanced
• etc.

PIER's Three Uses for DHTs

Single elegant mechanism with many uses:
• Search: Index
  • Like a hash index
• Partitioning: Value (key)-based routing
  • Like Gamma/Volcano
• Routing: Network routing for QP messages
  • Query dissemination
  • Bloom filters
  • Hierarchical QP operators (aggregation, join, etc.)

Not clear there's another substrate that supports all these uses

Metrics

We are primarily interested in 3 metrics:
• Answer quality (recall and precision)
• Bandwidth utilization
• Latency

Different DHTs provide different properties:
• Resilience to failures (recovery time) → answer quality
• Path length → bandwidth & latency
• Path convergence → bandwidth & latency

Different QP Join Strategies:
• Symmetric Hash Join, Fetch Matches, Symmetric Semi-Join, Bloom Filters, etc.
• Big Picture: trade off bandwidth (extra rehashing) and latency
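Answer quality can be made concrete with the standard definitions of recall and precision. The helper below is a sketch; the function name and the convention of returning 1.0 for an empty set are assumptions for the example.

```python
def recall_and_precision(returned, expected):
    """Compare a best-effort answer with the ideal answer.

    recall    = fraction of the expected results that were returned
    precision = fraction of the returned results that were expected
    """
    returned, expected = set(returned), set(expected)
    hits = len(returned & expected)
    recall = hits / len(expected) if expected else 1.0
    precision = hits / len(returned) if returned else 1.0
    return recall, precision
```

For example, if a query should return tuples {1, 2, 3, 4, 5} but node failures cause it to return {1, 2, 3, 9}, recall is 3/5 and precision is 3/4.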

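Symmetric Hash Join (SHJ)

[Figure: SHJ query plan. R (NS=r) is filtered by r.b=constant and S (NS=s) by s.b=constant; each side is then PUT into a temporary namespace (NS=temp), R hashed on r.a and S on s.a. At the rehash sites the tuples are joined on r.a = s.a and filtered by r.c > s.c.]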
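The local join step at a rehash site can be sketched as a classic symmetric hash join: each arriving tuple is inserted into its own side's hash table and immediately probed against the other side's table. This is a single-node illustration with hypothetical names; the rehashing PUTs over the DHT are not modelled.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key_r, key_s, predicate=lambda r, s: True):
    """Join tuples as they arrive.

    stream yields ("R", row) or ("S", row) in arrival order; any tag other
    than "R" is treated as S.
    """
    table_r = defaultdict(list)
    table_s = defaultdict(list)
    for source, row in stream:
        if source == "R":
            k = key_r(row)
            table_r[k].append(row)              # build on own side
            for s in table_s.get(k, []):        # probe the other side
                if predicate(row, s):
                    yield (row, s)
        else:
            k = key_s(row)
            table_s[k].append(row)
            for r in table_r.get(k, []):
                if predicate(r, row):
                    yield (r, row)


ARRIVALS = [
    ("R", {"a": 1, "c": 5}),
    ("S", {"a": 1, "c": 3}),
    ("S", {"a": 2, "c": 0}),
    ("R", {"a": 2, "c": 1}),
    ("R", {"a": 1, "c": 2}),
]
```

Neither input has to be complete before results appear, which suits a setting where tuples trickle in from many nodes.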

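Fetch Matches (FM)

[Figure: FM query plan. R (NS=r) is filtered by r.b=constant. For each remaining R tuple, a GET on s.a fetches the matching S tuples from NS=s (r.a = s.a), and the predicates s.b=constant AND r.c > s.c are applied after the fetch.]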
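Fetch Matches can be sketched by treating the already-hashed S table as a dictionary lookup that stands in for the DHT GET. The names and sample data are hypothetical, and the sketch assumes S is indexed on the join attribute `a`.

```python
def fetch_matches(r_rows, s_index, r_filter, s_filter, join_pred):
    """For each qualifying R tuple, fetch S tuples with the same join value.

    s_index maps s.a -> list of S tuples and stands in for get() on NS=s.
    Returns the joined pairs and the number of GETs issued.
    """
    results = []
    gets = 0
    for r in r_rows:
        if not r_filter(r):
            continue            # the R selection runs before any GET
        gets += 1
        for s in s_index.get(r["a"], []):
            # The S selection cannot run inside the lookup, so it is
            # applied only after the tuples have been fetched.
            if s_filter(s) and join_pred(r, s):
                results.append((r, s))
    return results, gets


R_ROWS = [
    {"a": 1, "b": 7, "c": 5},
    {"a": 2, "b": 7, "c": 1},
    {"a": 3, "b": 0, "c": 9},
]
S_INDEX = {
    1: [{"a": 1, "b": 7, "c": 3}, {"a": 1, "b": 0, "c": 1}],
    2: [{"a": 2, "b": 7, "c": 4}],
}
```

Only one table is shipped on demand, so no rehashing is needed, but S tuples that later fail the S-side predicate are still fetched.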

Symmetric Semi Join (SSJ)

• Both R and S are projected to save bandwidth
• The complete R and S tuples are fetched in parallel to improve latency

[Figure: SSJ query plan. R (NS=r) is filtered by r.b=constant and projected to (r.a, r.key); S (NS=s) is filtered by s.b=constant and projected to (s.a, s.key). Both projections are PUT into NS=temp, hashed on r.a and s.a, and joined on r.a = s.a. For each match, two GETs fetch the complete R tuple by r.key from NS=r and the complete S tuple by s.key from NS=s; the full tuples are then joined and filtered by r.c > s.c.]
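The three phases of the symmetric semi-join (project, join the projections, fetch the full tuples) can be sketched on local lists. This is an illustration with hypothetical names: the DHT PUTs and GETs are replaced by in-memory dictionaries, and the selections on r.b and s.b are assumed to have been applied already.

```python
from collections import defaultdict


def symmetric_semi_join(r_rows, s_rows, join_pred):
    """Join R and S on attribute 'a', shipping only (a, key) pairs first."""
    # Phase 1: project both inputs down to (join attribute, primary key).
    r_proj = [(r["a"], r["key"]) for r in r_rows]
    s_proj = [(s["a"], s["key"]) for s in s_rows]

    # Phase 2: join the small projections on the join attribute.
    s_keys_by_a = defaultdict(list)
    for a, key in s_proj:
        s_keys_by_a[a].append(key)
    matches = [(rk, sk) for a, rk in r_proj for sk in s_keys_by_a.get(a, [])]

    # Phase 3: fetch the complete tuples by key (stand-in for two GETs)
    # and apply the remaining predicate on the full tuples.
    r_by_key = {r["key"]: r for r in r_rows}
    s_by_key = {s["key"]: s for s in s_rows}
    out = []
    for rk, sk in matches:
        r, s = r_by_key[rk], s_by_key[sk]
        if join_pred(r, s):
            out.append((r, s))
    return out


R_ROWS = [{"key": "r1", "a": 1, "c": 5}, {"key": "r2", "a": 2, "c": 1}]
S_ROWS = [{"key": "s1", "a": 1, "c": 3}, {"key": "s2", "a": 3, "c": 0}]
```

Only tuples whose join attribute matches on both sides are ever fetched in full, which is where the bandwidth saving comes from.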

Overview

CAN is a distributed system that maps keys onto values

Keys are hashed into a d-dimensional space

Interface:
• insert(key, value)
• retrieve(key)

Overview

[Figure: state of the system at time t, drawn in a 2-dimensional space with axes x and y. Legend: Peer, Resource, Zone.]

In this 2-dimensional space a key is mapped to a point (x,y)

DESIGN

• d-dimensional Cartesian coordinate space (d-torus)
• Every node owns a distinct zone
• Map key k1 onto a point p1 using a uniform hash function
• (k1,v1) is stored at the node Nx that owns the zone containing p1
• Each node maintains a routing table with its neighbors
  • Ex: node A holds {B,C,E,D}
• Follow the straight-line path through the Cartesian space

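Routing

[Figure: a query Q(x,y) routed from a peer toward the zone that holds the key; axes x and y. Legend: Peer, Q(x,y) Query/Resource, key.]

• d-dimensional space with n zones
• Two zones are neighbors if they overlap along d-1 dimensions (and abut along the remaining one)
• Routing path of length:
• Algorithm: choose the neighbor nearest to the destination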
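Greedy forwarding can be sketched on a simplified CAN in which the 2-d torus is divided into k × k equal zones, so each zone has exactly four neighbors. This is an illustration under those assumptions: real CAN zones have unequal sizes, and the function names here are hypothetical.

```python
def torus_delta(a, b, k):
    """Shortest distance between two coordinates on a ring of size k."""
    d = abs(a - b)
    return min(d, k - d)


def distance(p, q, k):
    """Grid distance between two zones on the k x k torus."""
    return torus_delta(p[0], q[0], k) + torus_delta(p[1], q[1], k)


def neighbors(cell, k):
    """The four zones that abut a zone, with wrap-around at the edges."""
    x, y = cell
    return [((x + 1) % k, y), ((x - 1) % k, y),
            (x, (y + 1) % k), (x, (y - 1) % k)]


def greedy_route(src, dst, k):
    """Forward hop by hop to the neighbor nearest to the destination."""
    path = [src]
    current = src
    while current != dst:
        # Some neighbor is always one step closer, so the loop terminates.
        current = min(neighbors(current, k), key=lambda c: distance(c, dst, k))
        path.append(current)
    return path
```

Each node needs to know only its immediate neighbors to make progress; no node holds a global view of the zone layout.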

CAN: construction

[Figure sequence: a new node joins a CAN via a bootstrap node.]

1) Discover some node "I" already in CAN
2) Pick random point (x,y) in space
3) I routes to (x,y), discovers node J
4) Split J's zone in half… new node owns one half
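Steps 3 and 4 of the join can be sketched for a 2-d space as follows. The sketch makes assumptions that the slides do not state: the zone is split along its longer side, the new node receives the upper half, and finding J is done by a linear scan instead of routing. The names `Zone` and `join` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """Rectangle [x0, x1) x [y0, y1) owned by one node."""
    owner: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x, y):
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1

    def area(self):
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def join(zones, new_owner, point):
    """Split the zone containing `point` and give one half to the new node."""
    x, y = point
    # Step 3: find node J, the current owner of the chosen point.
    j = next(z for z in zones if z.contains(x, y))
    # Step 4: split J's zone in half (longer side first, an assumption).
    if (j.x1 - j.x0) >= (j.y1 - j.y0):
        mid = (j.x0 + j.x1) / 2
        new_zone = Zone(new_owner, mid, j.y0, j.x1, j.y1)
        j.x1 = mid
    else:
        mid = (j.y0 + j.y1) / 2
        new_zone = Zone(new_owner, j.x0, mid, j.x1, j.y1)
        j.y1 = mid
    zones.append(new_zone)
    return new_zone
```

After any number of joins the zones still tile the whole space, so every point, and therefore every key, has exactly one owner.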

Maintenance

• Use zone takeover in case of failure or leaving of a node
• Send your neighbor table to your neighbors at discrete time intervals t to inform them that you are alive
• If your neighbor does not send an alive message within time t, take over its zone
• Zone reassignment is needed
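The failure-detection part of this scheme can be sketched as a table of last-heard times. The class name `NeighborMonitor` and the explicit `now` argument are assumptions chosen to keep the example deterministic; the takeover itself is not modelled.

```python
class NeighborMonitor:
    """Track periodic 'alive' messages from neighbors."""

    def __init__(self, timeout):
        self.timeout = timeout  # the interval t from the slide
        self.last_seen = {}     # neighbor name -> time of last alive message

    def heard_from(self, neighbor, now):
        """Record an alive message (in CAN, the neighbor's table update)."""
        self.last_seen[neighbor] = now

    def failed_neighbors(self, now):
        """Neighbors silent for longer than the timeout; takeover candidates."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)
```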

Node Departure

• Someone has to take over the zone
• Explicit handover of the zone to one of its neighbors
• Merge to a valid zone if "possible"
• If not possible, the two zones are temporarily handled by the smallest neighbor
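The handover rule can be sketched for 2-d zones written as (x0, y0, x1, y1) tuples. The sketch treats a merge as valid whenever the two rectangles combine into a single rectangle, which is a simplification, and the function names are hypothetical.

```python
def area(zone):
    x0, y0, x1, y1 = zone
    return (x1 - x0) * (y1 - y0)


def merged(a, b):
    """Return the union of a and b if it is a single rectangle, else None."""
    # Same vertical extent and touching along a vertical edge.
    if a[1] == b[1] and a[3] == b[3] and (a[2] == b[0] or b[2] == a[0]):
        return (min(a[0], b[0]), a[1], max(a[2], b[2]), a[3])
    # Same horizontal extent and touching along a horizontal edge.
    if a[0] == b[0] and a[2] == b[2] and (a[3] == b[1] or b[3] == a[1]):
        return (a[0], min(a[1], b[1]), a[2], max(a[3], b[3]))
    return None


def hand_over(leaving, neighbors):
    """Pick the neighbor that takes over a departing node's zone.

    neighbors maps node name -> zone. Returns (name, zones now held).
    """
    # Prefer a neighbor whose zone merges with the leaving zone.
    for name, zone in sorted(neighbors.items()):
        union = merged(leaving, zone)
        if union is not None:
            return name, [union]
    # Otherwise the smallest neighbor temporarily handles both zones.
    name = min(sorted(neighbors), key=lambda n: area(neighbors[n]))
    return name, [neighbors[name], leaving]
```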

Zone reassignment

[Figure: zoning and the corresponding partition tree for four zones 1, 2, 3 and 4.]

[Figure: zoning and partition tree after zone 2 is removed, leaving zones 1, 3 and 4.]

Design Improvements

• Multi-Dimension
• Multi-Coordinate Spaces
• Overloading the Zones
• Multiple Hash Functions
• Topologically Sensitive Construction
• Uniform Partitioning
• Caching

Multi-Dimension

• Increasing the number of dimensions reduces the path length
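The effect of the dimension on path length can be estimated numerically. For n equal zones in d dimensions, the CAN paper gives an average routing path length of (d/4)·n^(1/d) hops; the helper below evaluates that expression, and its name is hypothetical.

```python
def average_path_length(d: int, n: int) -> float:
    """Average hop count (d/4) * n**(1/d) for n equal zones in d dimensions."""
    return (d / 4) * n ** (1 / d)
```

With n = 1024 zones, the estimate falls from 16 hops at d = 2 to about 5.7 hops at d = 4, at the cost of each node tracking more neighbors.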

Multi-Coordinate Spaces

• Multiple coordinate spaces
• Each node is assigned a different zone in each of them
• Increases availability and reduces the path length

Overloading the Zones

• More than one peer is assigned to one zone
• Increases availability
• Reduces path length
• Reduces per-hop latency

Uniform Partitioning

Instead of directly splitting the zone of the occupant node:
• Compare the volume of its zone with those of its neighbors
• The one to split is the one with the biggest volume
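The selection rule can be sketched as a comparison of zone volumes. Zones are written here as lists of (low, high) intervals, one per dimension; that representation and the function names are assumptions for the example.

```python
def volume(zone):
    """Volume of a zone given as a list of (low, high) intervals."""
    v = 1.0
    for low, high in zone:
        v *= high - low
    return v


def choose_zone_to_split(occupant, neighbor_names, zones):
    """Pick the largest zone among the occupant and its neighbors.

    zones maps node name -> zone. On a tie the occupant is preferred
    because it is listed first.
    """
    candidates = [occupant] + list(neighbor_names)
    return max(candidates, key=lambda name: volume(zones[name]))
```

Splitting the largest nearby zone instead of the one that happened to be hit keeps zone sizes, and with them the load per node, closer to uniform.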
