Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval)
What is PIER?
Peer-to-Peer Information Exchange and Retrieval
A query engine that runs on top of a P2P network
• A step toward distributed query processing at a larger scale
• A way to massive distribution: querying heterogeneous data
The architecture combines traditional database query processing with recent peer-to-peer technologies
The key goal is a scalable indexing system for large-scale decentralized storage applications on the Internet
• P2P, large-scale storage management systems (OceanStore, Publius), wide-area name resolution services
What is Very Large? Depends on Who You Are
Database community: single site, clusters, distributed (10’s – 100’s)
Network community: Internet scale (1000’s – millions)
Internet-scale systems vs. hundred-node systems
How to run DB-style queries at Internet scale!
What are the Key Properties?
Lots of data that is:
1. Naturally distributed (where it’s generated)
2. Centralized collection undesirable
3. Homogeneous in schema
4. Data is more useful when viewed as a whole
Who Needs Internet Scale?
Example 1: Filenames
Simple ubiquitous schemas:
• Filenames, sizes, ID3 tags
Born from early P2P systems such as Napster, Gnutella, etc.
Content is shared by “normal” non-expert users… home users
Systems were built by a few individuals ‘in their garages’: low barrier to entry
Example 2: Network Traces
Schemas are mostly standardized:
• IP, SMTP, HTTP, SNMP log formats
Network administrators are looking for patterns within their site AND with other sites:
• DoS attacks cross administrative boundaries
• Tracking virus/worm infections
• Timeliness is very helpful
Might surprise you how useful it is:
• Network bandwidth on PlanetLab (a world-wide distributed research test bed) is mostly filled with people monitoring the network status
Our Challenge
Our focus is on the challenge of scale:
• Applications are homogeneous and distributed
• Already have significant interest
• Provide a flexible framework for a wide variety of applications
Four Design Principles (I)
Relaxed Consistency
• ACID transactions severely limit the scalability and availability of distributed databases
• We provide best-effort results
Organic Scaling
• Applications may start small, without a priori knowledge of size
Four Design Principles (II)
Natural Habitat
• No CREATE TABLE/INSERT
• No “publish to web server”
• Wrappers or gateways allow the information to be accessed where it is created
Standard Schemas via Grassroots Software
• Data is produced by widespread software providing a de facto schema to utilize
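[Figure: PIER architecture, top to bottom. Applications (Network Monitoring, Other User Apps) send declarative queries to PIER (Core Relational Execution Engine, Catalog Manager, Query Optimizer). PIER passes query plans to the DHT (DHT Wrapper, Storage Manager, Overlay Routing; based on CAN), which forms the overlay network. The DHT runs over the IP network, which is the physical network.]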
Applications
P2P Databases
• Highly distributed and available data
Network Monitoring
• Intrusion detection
• Fingerprint queries
DHTs
Implemented with CAN (Content Addressable Network).
• Each node is identified by a hyper-rectangle in d-dimensional space
• A key is hashed to a point and stored at the corresponding node
• Each node maintains a routing table of its neighbours: O(d) entries
[Figure: a 2-dimensional CAN coordinate space with corners (0,0), (16,0), (0,16) and (16,16); a data item with key = (15,14) is stored at the node whose zone contains that point.]
Given a message with an ID, route the message to the computer currently responsible for that ID
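The mapping from a key to a point, and from a point to the responsible node, can be sketched in a few lines. This is only an illustration: the SHA-256 based `key_to_point` function and the four equal zones named A to D are assumptions made for the example, not part of PIER or CAN.

```python
import hashlib

def key_to_point(key: str, dims: int = 2, size: int = 16) -> tuple:
    """Hash a key to a point in the space [0, size)^dims (assumed hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use 4 bytes of the digest for each coordinate.
    return tuple(
        int.from_bytes(digest[4 * i: 4 * i + 4], "big") % size
        for i in range(dims)
    )

# Hypothetical zones: node name -> (low corner, high corner), half-open.
ZONES = {
    "A": ((0, 0), (8, 8)),
    "B": ((8, 0), (16, 8)),
    "C": ((0, 8), (8, 16)),
    "D": ((8, 8), (16, 16)),
}

def owner(point, zones=ZONES):
    """Return the node whose hyper-rectangle contains the point."""
    for node, (lo, hi) in zones.items():
        if all(l <= p < h for p, l, h in zip(point, lo, hi)):
            return node
    raise KeyError(point)

print(owner((15, 14)))                   # the key from the figure
print(owner(key_to_point("song.mp3")))   # owner of an arbitrary key
```

With this zone layout the point (15,14) falls in the zone of node D, so D would store the item.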
DHT Design
Routing Layer
• Mapping for keys (dynamic as nodes leave and join)
Storage Manager
• DHT-based data
Provider
• Storage access interface for higher levels
DHT – Routing
Routing layer
Maps a key into the IP address of the node currently responsible for that key. Provides exact lookups and calls back higher levels when the set of keys has changed.
Routing layer API
lookup(key) → ipaddr (asynchronous function)
join(landmarkNode)
leave()
locationMapChange()
DHT – Storage
Storage Manager
Stores and retrieves records, which consist of key/value pairs. Keys are used to locate items and can be any supported data type or structure.
Storage Manager API
store(key, item)
retrieve(key) → item
remove(key)
[Figure: DHT layers: Provider, Storage Manager, Overlay Routing]
DHT – Provider (1)
Provider
Ties the routing and storage manager layers together and provides an interface
Each object in the DHT has a namespace, resourceID and instanceID
DHT key = hash(namespace, resourceID)
• namespace: application or group of objects, table or relation
• resourceID: primary key or any attribute (Object)
• instanceID: integer, to separate items with the same namespace and resourceID
• Lifetime: item storage duration
CAN’s mapping of resourceID/Object is equivalent to an index
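As a rough illustration of the key construction, the sketch below hashes a (namespace, resourceID) pair to an integer. The choice of SHA-1, the separator byte and the name `dht_key` are assumptions for the example; the slides do not say which hash function PIER uses.

```python
import hashlib

def dht_key(namespace: str, resource_id) -> int:
    """Hash (namespace, resourceID) to a 160-bit integer key (assumed scheme)."""
    # The separator byte keeps ("ab", "c") from colliding with ("a", "bc").
    material = (
        namespace.encode("utf-8") + b"\x00" + repr(resource_id).encode("utf-8")
    )
    return int.from_bytes(hashlib.sha1(material).digest(), "big")

# Items of relation "R" with the same resourceID get the same key, whatever
# their instanceID, so they are stored at the same node.
print(dht_key("R", 42) == dht_key("R", 42))
print(dht_key("R", 42) == dht_key("S", 42))
```

Because instanceID is not part of the key, all items that share a namespace and resourceID are co-located, which is why hashing on a join attribute behaves like an index on that attribute.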
DHT – Provider (2)
Provider API
get(namespace, resourceID) → item
put(namespace, resourceID, item, lifetime)
renew(namespace, resourceID, instanceID, lifetime) → bool
multicast(namespace, resourceID, item)
lscan(namespace) → items
newData(namespace, item)
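To make the soft-state behaviour of this API concrete, here is a single-process, in-memory sketch. It is not PIER's implementation: there is no network or multicast, `put` returning the instanceID is an assumption made so that `renew` can be demonstrated, and `on_new_data` is a hypothetical way to register the `newData` callback.

```python
import itertools
import time

class Provider:
    """In-memory stand-in for the Provider API (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}                # (ns, resourceID, instanceID) -> (item, expiry)
        self._ids = itertools.count(1)  # source of instanceIDs
        self._listeners = {}            # namespace -> list of callbacks

    def on_new_data(self, namespace, callback):
        """Register a callback that plays the role of newData(namespace, item)."""
        self._listeners.setdefault(namespace, []).append(callback)

    def put(self, namespace, resource_id, item, lifetime):
        instance_id = next(self._ids)
        expiry = self._clock() + lifetime
        self._items[(namespace, resource_id, instance_id)] = (item, expiry)
        for callback in self._listeners.get(namespace, []):
            callback(namespace, item)
        return instance_id  # assumption: returned so the caller can renew

    def renew(self, namespace, resource_id, instance_id, lifetime):
        key = (namespace, resource_id, instance_id)
        entry = self._items.get(key)
        if entry is None or entry[1] <= self._clock():
            return False  # unknown or already expired
        self._items[key] = (entry[0], self._clock() + lifetime)
        return True

    def get(self, namespace, resource_id):
        now = self._clock()
        return [item for (ns, rid, _), (item, exp) in self._items.items()
                if ns == namespace and rid == resource_id and exp > now]

    def lscan(self, namespace):
        """Return all live items of a namespace stored locally."""
        now = self._clock()
        return [item for (ns, _, _), (item, exp) in self._items.items()
                if ns == namespace and exp > now]
```

Items that are not renewed before their lifetime runs out simply stop being returned, which matches the relaxed-consistency, best-effort design described earlier.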
[Figure: Table R (a namespace) is partitioned across nodes: node R1 holds tuples 1..n and node R2 holds tuples n+1..m; each resourceID (rID1, rID2, rID3) maps to an item. DHT layers: Provider, Storage Manager, Overlay Routing.]
Query Processor
How does it work?
• Performs selection, projection, joins, grouping, aggregation → Operators
• Operators push and pull data
• Simultaneous execution of multiple operators pipelined together
• Results are produced and queued as quickly as possible
How does it modify data?
• Inserts, updates and deletes different items via the DHT interface
How does it select data to process?
• Dilated-reachable snapshot: data published by reachable nodes at the query arrival time
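The idea of pipelined operators can be illustrated with Python generators, where each operator pulls tuples from its input and hands results downstream as soon as they are available. This is a local, pull-only sketch; the operator names and the dictionary tuple format are assumptions, and PIER's own operators also push data across the network.

```python
def scan(rows):
    """Source operator: emit stored tuples one at a time."""
    for row in rows:
        yield row

def select(rows, predicate):
    """Selection: pass through only tuples that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def project(rows, columns):
    """Projection: keep only the named columns."""
    for row in rows:
        yield {c: row[c] for c in columns}

rows = [
    {"a": 1, "b": 1, "c": "x"},
    {"a": 2, "b": 0, "c": "y"},
    {"a": 3, "b": 1, "c": "z"},
]
# The three operators run interleaved: the first result is available
# before the scan has read the whole input.
pipeline = project(select(scan(rows), lambda r: r["b"] == 1), ["a", "c"])
for result in pipeline:
    print(result)
```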
Join Algorithms
Limited Bandwidth
Symmetric Hash Join:
- Rehashes both tables
Semi Joins:
- Transfer only matching tuples
At 40% selectivity, the bottleneck switches from computation nodes to query sites
Future Research
• Routing, Storage and Layering
• Catalogs and Query Optimization
• Hierarchical Aggregations
• Range Predicates
• Continuous Queries over Streams
• Sharing between Queries
• Semi-structured Data
Distributed Hash Tables (DHTs)
What is a DHT?
• Take an abstract ID space, and partition it among a changing set of computers (nodes)
• Given a message with an ID, route the message to the computer currently responsible for that ID
• Can store messages at the nodes
• This is like a “distributed hash table”
  - Provides a put()/get() API
  - Cheap maintenance when nodes come and go
Lots of effort is put into making DHTs better:
• Scalable (thousands → millions of nodes)
• Resilient to failure
• Secure (anonymity, encryption, etc.)
• Efficient (fast access with minimal state)
• Load balanced
• etc.
PIER’s Three Uses for DHTs
Single elegant mechanism with many uses:
• Search: Index
  - Like a hash index
• Partitioning: Value (key)-based routing
  - Like Gamma/Volcano
• Routing: Network routing for QP messages
  - Query dissemination
  - Bloom filters
  - Hierarchical QP operators (aggregation, join, etc.)
Not clear there’s another substrate that supports all these uses
Metrics
We are primarily interested in 3 metrics:
• Answer quality (recall and precision)
• Bandwidth utilization
• Latency
Different DHTs provide different properties:
• Resilience to failures (recovery time) → answer quality
• Path length → bandwidth & latency
• Path convergence → bandwidth & latency
Different QP Join Strategies:
• Symmetric Hash Join, Fetch Matches, Symmetric Semi-Join, Bloom Filters, etc.
• Big Picture: Trade off bandwidth (extra rehashing) and latency
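Symmetric Hash Join (SHJ)
[Figure: Symmetric Hash Join plan. R (NS=r) is filtered on r.b = constant and S (NS=s) is filtered on s.b = constant. Both are rehashed with PUT on the join attribute (r.a and s.a) into NS=temp, where the join r.a = s.a and the predicate r.c > s.c are evaluated.]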
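The local part of a symmetric hash join can be sketched as follows: each arriving tuple is inserted into the hash table for its own relation and immediately probed against the other relation's table, so results appear without waiting for either input to finish. This is a single-node illustration under assumed names; in PIER the rehash step (PUT into NS=temp) is what brings tuples with the same join value to the same node.

```python
from collections import defaultdict

def symmetric_hash_join(stream, key_r, key_s):
    """Join tuples arriving as ("R", tuple) or ("S", tuple), in any order."""
    table_r, table_s = defaultdict(list), defaultdict(list)
    for side, tup in stream:
        if side == "R":
            k = key_r(tup)
            table_r[k].append(tup)              # build on R's table
            for match in table_s.get(k, []):    # probe S's table
                yield (tup, match)
        else:
            k = key_s(tup)
            table_s[k].append(tup)              # build on S's table
            for match in table_r.get(k, []):    # probe R's table
                yield (match, tup)

arrivals = [
    ("R", (1, "x")), ("S", (1, "y")), ("S", (2, "z")),
    ("R", (2, "w")), ("R", (1, "v")),
]
for pair in symmetric_hash_join(arrivals, lambda r: r[0], lambda s: s[0]):
    print(pair)
```

A residual predicate such as r.c > s.c would be applied to each emitted pair.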
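Fetch Matches (FM)
[Figure: Fetch Matches plan. R (NS=r) is filtered on r.b = constant. For each remaining R tuple, a GET on s.a against NS=s fetches the S tuples with r.a = s.a; the predicates s.b = constant AND r.c > s.c are applied to the fetched tuples.]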
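A minimal sketch of the Fetch Matches idea, assuming S is already stored in the DHT keyed on the join attribute s.a. A plain dictionary stands in for the DHT, and the function and parameter names are hypothetical.

```python
def fetch_matches(r_tuples, s_by_a, r_const, s_const):
    """Join R with S on r.a = s.a by probing an existing index on s.a."""
    for r in r_tuples:
        if r["b"] != r_const:               # local selection r.b = constant
            continue
        # The dictionary lookup stands in for GET(NS=s, r.a).
        for s in s_by_a.get(r["a"], []):
            # S-side predicates can only be applied after the fetch.
            if s["b"] == s_const and r["c"] > s["c"]:
                yield (r, s)

R = [
    {"a": 1, "b": 7, "c": 5},
    {"a": 2, "b": 7, "c": 1},
    {"a": 1, "b": 0, "c": 9},
]
S_BY_A = {
    1: [{"a": 1, "b": 7, "c": 3}, {"a": 1, "b": 7, "c": 8}],
    2: [{"a": 2, "b": 7, "c": 0}],
}
for pair in fetch_matches(R, S_BY_A, r_const=7, s_const=7):
    print(pair)
```

Only R tuples cause network traffic here, but S tuples that later fail the selection are still fetched, which is the bandwidth cost of this strategy.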
Symmetric Semi Join (SSJ)
Both R and S are projected to save bandwidth
The complete R and S tuples are fetched in parallel to improve latency
[Figure: Symmetric Semi Join plan. R (NS=r) is filtered on r.b = constant and projected to (r.a, r.key); S (NS=s) is filtered on s.b = constant and projected to (s.a, s.key). Both projections are rehashed with PUT on r.a and s.a into NS=temp and joined on r.a = s.a. The complete tuples are then fetched with GET by r.key from NS=r and by s.key from NS=s (shown as r.a = r.a and s.a = s.a), and the predicate r.c > s.c is applied.]
Overview
• CAN is a distributed system that maps keys onto values
• Keys are hashed into a d-dimensional space
• Interface:
  - insert(key, value)
  - retrieve(key)
[Figure: state of the system at time t in a 2-dimensional space with axes x and y, showing peers, resources and zones. In this 2-dimensional space a key is mapped to a point (x,y).]
Design
• D-dimensional Cartesian coordinate space (d-torus)
• Every node owns a distinct zone
• Map key k1 onto a point p1 using a uniform hash function
• (k1,v1) is stored at the node Nx that owns the zone containing p1
• Each node maintains a routing table with its neighbors
  - Ex: node A holds {B,C,E,D}
• Follow the straight-line path through the Cartesian space
Routing
[Figure: a d-dimensional space with n zones, showing a peer, a key, and a query/resource Q(x,y) at point (x,y).]
Two zones are neighbors if they overlap in d-1 dimensions and abut in the remaining one
Routing path length: O(d·n^(1/d)) hops
Algorithm: choose the neighbor nearest to the destination
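The greedy rule can be sketched on a small example. The code below builds a hypothetical 4×4 grid of equal zones in a 16×16 space, applies the neighbor test (overlap in d-1 dimensions, abut in one) and forwards toward the neighbor whose zone centre is closest to the destination. Torus wrap-around is ignored and the node names are invented, so this is an illustration rather than CAN's actual routing code.

```python
import math

def contains(zone, p):
    lo, hi = zone
    return all(l <= x < h for x, l, h in zip(p, lo, hi))

def center(zone):
    lo, hi = zone
    return tuple((l + h) / 2 for l, h in zip(lo, hi))

def are_neighbors(z1, z2):
    """Zones abut along exactly one dimension and overlap along the others."""
    (lo1, hi1), (lo2, hi2) = z1, z2
    abut = overlap = 0
    for l1, h1, l2, h2 in zip(lo1, hi1, lo2, hi2):
        if h1 == l2 or h2 == l1:
            abut += 1
        elif l1 < h2 and l2 < h1:
            overlap += 1
    return abut == 1 and overlap == len(lo1) - 1

def route(zones, start, dest):
    """Return the nodes visited from start to the owner of point dest."""
    path, current = [start], start
    while not contains(zones[current], dest):
        nbrs = [n for n in zones
                if n != current and are_neighbors(zones[current], zones[n])]
        # Greedy step: forward to the neighbor nearest to the destination.
        current = min(nbrs, key=lambda n: math.dist(center(zones[n]), dest))
        if current in path:
            raise RuntimeError("greedy routing looped")
        path.append(current)
    return path

# Hypothetical layout: 16 equal zones named n<i><j>.
zones = {f"n{i}{j}": ((4 * i, 4 * j), (4 * i + 4, 4 * j + 4))
         for i in range(4) for j in range(4)}
print(route(zones, "n00", (15, 14)))
```

Each node only needs the zones of its immediate neighbors to make the forwarding decision, which is why the per-node routing state stays small.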
CAN: construction
1) The new node discovers some node “I” already in CAN (via a bootstrap node)
2) Pick a random point (x,y) in space
3) I routes to (x,y), discovers node J
4) Split J’s zone in half… the new node owns one half
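The final step of the join can be sketched as follows. The zone owner is found by a direct scan rather than by routing, and the zone is split along its longest dimension; both simplifications are assumptions made for this example, and the function names are hypothetical.

```python
def contains(zone, p):
    lo, hi = zone
    return all(l <= x < h for x, l, h in zip(p, lo, hi))

def split_zone(zone):
    """Split a zone in half along its longest dimension (assumed rule).

    Returns (half kept by the current owner, half given to the new node).
    """
    lo, hi = zone
    dim = max(range(len(lo)), key=lambda i: hi[i] - lo[i])
    mid = (lo[dim] + hi[dim]) / 2
    kept_hi = tuple(mid if i == dim else h for i, h in enumerate(hi))
    new_lo = tuple(mid if i == dim else l for i, l in enumerate(lo))
    return (lo, kept_hi), (new_lo, hi)

def join(zones, new_node, point):
    """Add new_node by splitting the zone that contains the chosen point."""
    owner = next(n for n, z in zones.items() if contains(z, point))
    zones[owner], zones[new_node] = split_zone(zones[owner])
    return owner

zones = {"I": ((0, 0), (16, 16))}
join(zones, "J", (12, 3))       # J takes half of I's zone
join(zones, "new", (15, 14))    # the random point lands in J's zone
print(zones)
```

In a real system the key/value pairs that fall in the transferred half would also be handed to the new node, and the neighbors of the split zone would update their routing tables.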
Maintenance
• Use zone takeover in case of failure or departure of a node
• Send your neighbor table to your neighbors at discrete time intervals t to inform them that you are alive
• If a neighbor does not send an alive message within time t, take over its zone
• Zone reassignment is needed
Node Departure
• Someone has to take over the zone
• Explicit handover of the zone to one of its neighbors
• Merge into a valid zone if “possible”
• If not possible, the two zones are temporarily handled by the smallest neighbor
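Failure detection and the choice of the node that takes over can be sketched with two small helpers. The function names, the timestamp dictionary and the tie-breaking by name are assumptions for illustration only.

```python
def silent_neighbors(last_heard, now, t):
    """Neighbors whose last alive message is older than the interval t."""
    return sorted(n for n, ts in last_heard.items() if now - ts > t)

def takeover_node(candidates, volumes):
    """Pick the candidate with the smallest zone volume (ties broken by name)."""
    return min(candidates, key=lambda n: (volumes[n], n))

# Node A last heard from B, C and D at these times; it is now t = 12.
last_heard = {"B": 10.0, "C": 4.0, "D": 9.5}
failed = silent_neighbors(last_heard, now=12.0, t=5.0)
print(failed)

# Among the remaining neighbors of the failed node, the one with the
# smallest zone takes over.
print(takeover_node(["B", "D"], {"B": 32, "D": 16}))
```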
Zone reassignment
[Figure: two examples, each showing the zoning and the corresponding partition tree: first with zones 1, 2, 3 and 4, then with zones 1, 3 and 4.]
Design Improvements
• Multi-Dimension
• Multi-Coordinate Spaces
• Overloading the Zones
• Multiple Hash Functions
• Topologically Sensitive Construction
• Uniform Partitioning
• Caching
Multi-Dimension
Increasing the number of dimensions reduces the path length
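A quick calculation shows the effect. The sketch below assumes the average path length of (d/4)·n^(1/d) hops that the CAN design gives for n equal zones in d dimensions; the function name and the node count are chosen only for illustration.

```python
def avg_path_length(n: int, d: int) -> float:
    """Average hops for n equal zones in d dimensions: (d/4) * n**(1/d)."""
    return (d / 4) * n ** (1 / d)

for d in (2, 3, 4):
    print(d, round(avg_path_length(4096, d), 2))
```

For 4096 nodes this gives roughly 32, 12 and 8 hops for d = 2, 3 and 4. The price is a larger routing table, since each node keeps O(d) neighbors.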
Multi-Coordinate Spaces
• Multiple coordinate spaces
• Each node is assigned a different zone in each of them
• Increases the availability and reduces the path length
Overloading the Zones
More than one peer is assigned to one zone.
• Increases availability
• Reduces path length
• Reduces per-hop latency
Uniform Partitioning
Instead of directly splitting the zone of the occupant node:
• Compare the volume of its zone with those of its neighbors
• The zone to split is the one with the biggest volume
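The selection rule can be sketched as below. Zones are (low corner, high corner) pairs, and the function names and example zones are hypothetical; ties go to the occupant because it is listed first.

```python
from math import prod

def volume(zone):
    lo, hi = zone
    return prod(h - l for l, h in zip(lo, hi))

def choose_zone_to_split(occupant, neighbors, zones):
    """Return the node with the largest zone among the occupant and its neighbors."""
    candidates = [occupant] + list(neighbors)
    return max(candidates, key=lambda n: volume(zones[n]))

zones = {
    "J": ((0, 0), (4, 4)),   # occupant of the randomly chosen point
    "A": ((4, 0), (8, 8)),   # neighbor with a larger zone
    "B": ((0, 4), (4, 8)),
}
print(choose_zone_to_split("J", ["A", "B"], zones))
```

Splitting the largest nearby zone instead of the occupant's own keeps zone volumes, and therefore the load per node, more even.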