Query Processing and Networking Infrastructures
Day 2 of 2
Joe HellersteinUC Berkeley
September 27, 2002
Outline
Day 1: Query Processing Crash Course
Intro
Queries as indirection
How do relational databases run queries?
How do search engines run queries?
Scaling up: cluster parallelism and distribution
Day 2: Research Synergies w/ Networking
Queries as indirection, revisited
Useful (?) analogies to networking research
Some of our recent research at the seams
Some of your research?
Directions and collective discussion
Indirections
Standard: Spatial Indirection
Allows the referent to move without changes to referrers
Doesn’t matter where the object is, we find it
Alternative: copying
Works if updates are managed carefully, or don’t exist
Temporal Indirection
Asynchronous communication is indirection in time
Doesn’t matter when the object arrives, you find it
Analogy to space: Sender is the referrer, Recipient the referent
Generalizing
Indirection in Space: x-to-one or x-to-many? Physical or logical mapping?
Indirection in Time: Persistence model: storage or re-xmission. Persistence role: sender or receiver
Indirection in Space, Redux
One-to-one, one-to-many, many-to-many?
Standard relational issue
E.g. virtual address is many-to-one
E.g. email distribution list is one-to-many
Physical or logical? Mapping table?
Physical: e.g. page tables, mailing lists, DNS, multicast group lists
Logical: e.g. queries, subscriptions, interests
Indirection in Time, Redux
Persistence model: storage or re-xmission
Storage: e.g. DB, heap, stack, NW buffer, mailqueue
Re-xmission: e.g. polling, retries. “Joe is so persistent”
Persistence of put or get
Put: e.g. DB insert, email, retry
Get: e.g. subscription, polling
Examples: Storage Systems
Virtual Memory System: Space: 1-to-1, physical; Time: synchronous (no indirection)
Database System: Space: many-to-many, logical; Time: synchronous (no indirection)
Broadcast Disks: Space: 1-to-1; Time: re-xmitted put
Examples: Split-Phase APIs
Polling: Space: no indirection; Time: re-xmitted get
Callbacks: Space: no indirection; Time: stored get
Active Messages: Space: no indirection; Time: stored get
The app stores a get with the putter, which tags it on messages
Examples: Communication
Email: Space: one-to-many, physical
Mapping is one-to-many, delivery is one-to-one (copies)
Time: stored put
Multicast: Space: one-to-many, physical
Both mapping and delivery are one-to-many
Time: roughly synchronous?
Examples: Distributed APIs
RPC: Space: 1-to-1, physical (can be 1-to-many); Time: synchronous (no indirection)
Messaging systems: Space: 1-to-1, physical (often 1-to-many); Time: depends!
Transactional messaging is stored put: exactly-once transmission guaranteed
Other schemes are re-xmitted put: at-least-once transmission. Idempotency of messages becomes important!
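A minimal sketch of why idempotency matters under at-least-once delivery: if the network may redeliver a message, a non-idempotent handler double-applies it, while deduplicating on a message id makes redelivery harmless. The names (Account, apply_naive, apply_idempotent) are illustrative, not from any real messaging system.

```python
# Sketch: at-least-once delivery forces idempotent message application.
# All names here are hypothetical, for illustration only.

class Account:
    def __init__(self):
        self.balance = 0
        self.seen = set()  # message ids already applied

    def apply_naive(self, msg_id, amount):
        # Non-idempotent: a retransmitted put double-applies the deposit.
        self.balance += amount

    def apply_idempotent(self, msg_id, amount):
        # Deduplicate by message id: redelivery becomes a no-op.
        if msg_id in self.seen:
            return
        self.seen.add(msg_id)
        self.balance += amount

a, b = Account(), Account()
for _ in range(2):  # simulate one deposit message delivered twice
    a.apply_naive("m1", 100)
    b.apply_idempotent("m1", 100)
# a.balance is wrongly 200; b.balance is correctly 100
```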
Examples: Logic-based APIs
Publish-Subscribe: Space: one-to-many, logical; Time: stored receiver
Tuplespaces: Space: one-to-many, logical; Time: stored sender
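The stored-receiver vs. stored-sender distinction can be sketched in a few lines: pub/sub persists the receiver's get (the subscription), so a publish with no subscriber is lost; a tuplespace persists the sender's put, so a tuple waits for a later get. The tiny APIs below are illustrative assumptions, not any real middleware.

```python
# Toy contrast of the two persistence roles. Hypothetical, minimal APIs.

class PubSub:
    """Stored get: subscriptions persist; each put is matched and forwarded."""
    def __init__(self):
        self.subs = []  # (predicate, inbox) pairs held for receivers

    def subscribe(self, pred, inbox):
        self.subs.append((pred, inbox))

    def publish(self, item):
        for pred, inbox in self.subs:
            if pred(item):
                inbox.append(item)

class TupleSpace:
    """Stored put: tuples persist; each get matches against stored data."""
    def __init__(self):
        self.tuples = []

    def put(self, item):
        self.tuples.append(item)

    def get(self, pred):
        for i, t in enumerate(self.tuples):
            if pred(t):
                return self.tuples.pop(i)
        return None

ps = PubSub()
ps.publish({"topic": "a"})                       # no subscriber yet: dropped
inbox = []
ps.subscribe(lambda t: t["topic"] == "a", inbox)
ps.publish({"topic": "a"})                       # delivered to the stored get

ts = TupleSpace()
ts.put({"topic": "a"})                           # stored until someone gets it
hit = ts.get(lambda t: t["topic"] == "a")
```

So neither mechanism dominates: they differ in which side's persistence survives a missed rendezvous.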
Indirection Summary
2 binary indirection variables for space, 2 for time
Can have indirection in one without the other
Leads to 24 indirection options: 16 joint space/time indirections, 4 space-only, 4 time-only
And few lessons about the tradeoffs!
Note: issues here in performance and SW engineering and …
E.g. “Are tuplespaces better than pub/sub?” Not a unidimensional question!
Rendezvous
Indirection on both sender and receiver side
In time and/or space on each side
Most general: neither sender nor receiver knows where or when rendezvous will happen!
Each chases a reference for where
Each must persist for when
Join as Rendezvous
Recall pipelining hash join: combine all blue and gray tuples that match
A batch rendezvous
In space: the data items were not stored in a fixed location, copied into HTs
In time: both sides put-persist in the join algorithm via storage
A hint of things to come:
In parallel DBs, the hash table is content-addressed (via the exchange routing function)
What if the hash table is distributed?
If a tuple in the join is doing “get”, is there a distinction between sender/recipient? Between query and data?
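The pipelining hash join recalled above can be sketched as a symmetric hash join: each arriving tuple is put-persisted into its side's hash table (indirection in time) and immediately probed against the other side's (rendezvous), so neither input needs to arrive first. A minimal sketch, with made-up relation names R and S:

```python
# Symmetric (pipelining) hash join: insert each arrival into its own
# hash table, then probe the other side's table for past arrivals.
from collections import defaultdict

def symmetric_hash_join(stream):
    """stream yields (side, key, value) events; side is 'R' or 'S'."""
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    for side, key, val in stream:
        other = "S" if side == "R" else "R"
        tables[side][key].append(val)        # put-persist for future arrivals
        for match in tables[other][key]:     # rendezvous with past arrivals
            yield (key, val, match) if side == "R" else (key, match, val)

# Tuples from R and S arrive interleaved, in any order.
events = [("R", 1, "r1"), ("S", 1, "s1"), ("S", 2, "s2"), ("R", 2, "r2")]
results = list(symmetric_hash_join(events))
# results == [(1, "r1", "s1"), (2, "r2", "s2")]
```

Note there is no sender/receiver asymmetry here: both sides store and both sides probe, which is exactly the rendezvous framing above.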
Some resonances
We said that query systems are an indirection mechanism: logical, many-to-many, but synchronous query-response
And some dataflow techniques inside query engines seem to provide useful indirection mechanisms
If we add a network into the picture, life gets very interesting
Indirection in space is very useful
Indirection in time is critical
Rendezvous is a basic operation
More Resonance
More Interaction: CS262 Experiment w/ Eric Brewer
Merged OS & DBMS grad class, over a year
Eric/Joe, point/counterpoint
Some tie-ins were obvious: memory mgmt, storage, scheduling, concurrency
Surprising: QP and networks go well side by side
E.g. eddies and TCP congestion control
Both use back-pressure and simple control theory to “learn” in an unpredictable dataflow environment
[Figure 3: Example Router Graph]
Scout
Paths are the key to a comm-centric OS
“Making Paths Explicit in the Scout Operating System”, David Mosberger and Larry L. Peterson, OSDI ’96
CLICK
A NW router is a query plan! With a twist: flow-based context
An opportunity for “autonomous” query optimization
Revisiting a NW Classic with DB Goggles
Clark & Tennenhouse, SIGCOMM ‘90
“Architectural Considerations for a New Generation of Protocols”
Love it for two reasons:
Tries to capture the essence of what networks do
Great for people who need the 10,000-foot view! I’m a fan of doing this (witness last week)
Tries to move the community up the food chain
Resonances everywhere!!
C&T Overview (for amateurs like me)
Core function of protocols: data xfer
Data Manipulation: buffer, checksum, encryption, xfer to/from app space, presentation
Transfer Control: flow/congestion ctl, detecting transmission problems, acks, muxing, timestamps, framing
Exchange! Data Modeling! Query Opt!
Thesis: nets are good at xfer control, not so good at data manipulation
Some C&T wacky ideas for better data manipulation:
Xfer semantic units, not packets (ALF)
Auto-rewrite layers to flatten them (ILP)
Minimize cross-layer ordering constraints
Control delivery in parallel via packet content
C & T’s Wacky Ideas
DB People Should Be Experts!
BUT… remember the basic Internet assumption:
“a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)
Spoils the whole optimize-then-execute architecture of query optimization
What happens when d(environment)/dt < query length??
What about the competing conversations? How do we handle the unknown topology? What about partial failure?
Ideally, we’d like: the semantics and optimization of DB dataflow, with the agility and efficiency of NW dataflow
The Cosmic Convergence
NETWORKING RESEARCH (adaptivity, federated control, geo-scalability): XML routing, router toolkits, content addressing and DHTs, directed diffusion
DATABASE RESEARCH (data models, query opt, data scalability): adaptive query processing, continuous queries/streams, P2P query engines, sensor query engines
What does the QP perspective add?
In terms of high-level languages?
In terms of a reusable set of operators?
In terms of optimization opportunities?
In terms of batch-I/O tricks?
In terms of approximate answers?
A “safe” route to Active Networks?
Not computationally complete
Optimizable and reconfigurable -- data independence applies
Fun to be had here! Addressing a few fronts at Berkeley…
Some of our work at the seams
Starting with a centralized engine for remote data sets and streams
Telegraph: eddies, SteMs, FLuX
“Deep Web”, filesharing systems, sensor streams
More recently, querying sensor networks
TinyDB/TAG: in-network queries
And DHT-based overlay networks
PIER
Telegraph Overview
Telegraph: An Adaptive Dataflow System
Themes: Adaptivity and Sharing
Adaptivity encapsulated in operators
Eddies for order of operations
State Modules (SteMs) for transient state
FLuX for parallel load-balance and availability
Work- and state-sharing across flows
Unlike traditional relational schemes, try to share physical structures
Franklin, Hellerstein, Hong and students (to follow)
Telegraph Architecture
[Architecture diagram: request parsing and metadata (SQL, explicit dataflows, XML catalog) feed an adaptive routing and optimization layer (Eddy, FLuX, Juggle) over modules (Join, Select, Project, Group, Aggregate, Transitive Closure, DupElim, SteM) and ingress operators (File Reader, Sensor Proxy, P2P Proxy), with TeSS storage, inter-module communication and scheduling (Fjords), and online query processing.]
Continuous Adaptivity: Eddies
A little more state per tuple
Ready/done bits (extensible a la Volcano/Starburst)
Minimal state in the Eddy itself
Queue + parameters being learned
Decisions: which tuple in the queue goes to which operator
Query processing = dataflow routing!!
Ron Avnur
Two Key Observations
Break the set-oriented boundary
Usual DB model: algebra expressions: (R ⋈ S) ⋈ T
Common DB implementation: pipelining operators!
Subexpressions needn’t be materialized
Typical implementation is more flexible than the algebra
We can reorder in-flight operators
Don’t rewrite the graph; impose a router
Graph edge = absence of a routing constraint
Observe operator consumption/production rates
Consumption: cost. Production: cost × selectivity
Could break these down per values of tuples
So fun! Simple, incremental, general Brings all of query optimization online
And hence a bridge to ML, Control Theory, Queuing Theory
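The routing idea above can be sketched with a toy eddy over commutative filters: the router observes per-operator pass rates online and sends each tuple to the most selective eligible operator first. This is a deliberately simplified policy (real eddies use richer schemes such as lottery scheduling); all names below are illustrative.

```python
# Toy eddy: route tuples through commutative filters, learning operator
# selectivities online and visiting the most selective operator first.

def eddy(tuples, ops):
    """ops: dict name -> predicate. Returns tuples that pass all ops."""
    seen = {name: 1.0 for name in ops}     # tuples routed to each op
    passed = {name: 1.0 for name in ops}   # tuples that survived it
    out = []
    for t in tuples:
        done = set()
        alive = True
        while alive and len(done) < len(ops):
            # Route to the eligible op with the lowest observed pass rate.
            name = min((n for n in ops if n not in done),
                       key=lambda n: passed[n] / seen[n])
            seen[name] += 1
            if ops[name](t):
                passed[name] += 1
                done.add(name)
            else:
                alive = False              # tuple filtered out: stop routing
        if alive:
            out.append(t)
    return out

ops = {"gt10": lambda x: x > 10, "even": lambda x: x % 2 == 0}
result = eddy(range(20), ops)
# result == [12, 14, 16, 18]
```

Because the graph is never rewritten, the "plan" can drift tuple by tuple as the observed rates change, which is the sense in which query optimization moves online.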
State Modules (SteMs)
Goal: further adaptivity through competition
Multiple mirrored sources (AMs): handle rate changes, failures, parallelism
Multiple alternate operators
Join = Routing + State; the SteM operator manages the tradeoffs
State Module: unifies caches, rendezvous buffers, join state
Competitive sources/operators share building/probing SteMs
Join algorithm hybridization!
Eddies + SteMs tackle the full (single-site) query optimization problem online
Vijayshankar Raman, Amol Deshpande
[Figure: static dataflows vs. eddy vs. eddy + SteMs]
FLuX: Routing Across a Cluster
Fault-tolerant, Load-balancing eXchange
Continuous/long-running flows need high availability
Big flows need parallelism
Adaptive load-balancing req’d
FLuX operator: Exchange plus…
Adaptive flow partitioning (River)
Transient state replication & migration
Replication & checkpointing for SteMs
Note: set-based, not sequence-based!
Needs to be extensible to different ops: content-sensitivity, history-sensitivity
Dataflow semantics: optimize based on edge semantics
Networking tie-in again: at-least-once delivery? Exactly-once delivery? In/out of order?
Mehul Shah
Continuously Adaptive Continuous Queries (CACQ)
Continuous queries clearly need all this stuff!
Natural application of Telegraph infrastructure
4 ideas in CACQ:
Use eddies to allow reordering of ops, but one eddy serves all queries
Queries are data: join with a Grouped Filter
A la stored get! This idea extended in PSoup (Chandrasekaran & Franklin)
Explicit tuple lineage
Mark each tuple with per-op ready/done bits
Mark each tuple with per-query completed bits
Joins via SteMs, shared across all queries
Note: mixed-lineage tuples in a SteM, i.e. shared state is not shared algebraic expressions!
Delete a tuple from the flow only if it matches no query
Sam Madden, Mehul Shah, Vijayshankar Raman, Sirish Chandrasekaran
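The "queries are data" idea can be sketched as a grouped filter: the predicates of many continuous queries are stored as data (a stored get), and each arriving tuple is joined against all of them at once. This is a simplified sketch of the concept; the class and method names are hypothetical, and real grouped filters index the predicates for sub-linear matching.

```python
# Sketch of a grouped filter: many queries' predicates stored as data,
# each arriving tuple matched against all of them in one pass.
from collections import defaultdict

class GroupedFilter:
    def __init__(self):
        # Group range predicates over the same attribute together,
        # so registered queries are evaluated side by side.
        self.lower_bounds = defaultdict(dict)  # attr -> {query_id: bound}

    def add_query(self, query_id, attr, lower_bound):
        """Register a continuous query of the form: attr > lower_bound."""
        self.lower_bounds[attr][query_id] = lower_bound

    def matches(self, tuple_):
        """Return the set of query ids this tuple satisfies."""
        hits = set()
        for attr, bounds in self.lower_bounds.items():
            for qid, lb in bounds.items():
                if tuple_.get(attr, float("-inf")) > lb:
                    hits.add(qid)
        return hits

gf = GroupedFilter()
gf.add_query("q1", "temp", 30)   # e.g. SELECT ... WHERE temp > 30
gf.add_query("q2", "temp", 50)   # e.g. SELECT ... WHERE temp > 50
hits = gf.matches({"temp": 40})
# hits == {"q1"}
```

The per-query hit set is exactly what the completed-bits lineage above carries along with each tuple.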
Sensor QP: TinyDB/TAG
A spectrum of devices, from Smart Dust motes (TinyOS) to Palm devices (Linux)
Varying degrees of power and network constraints
The fun is on the small side!
Our current platform: Mica and TinyOS
4 MHz Atmel CPU, 4 KB RAM, 40 kbit radio, 512K EEPROM, 128K Flash
Sensors: temp, light, accelerometer, magnetometer, mic, etc.
Wireless, single-ported, multi-hop ad hoc network
Spanning-tree communication through “root”
TinyDB
A query/trigger engine for motes
Declarative (SQL-like) language for optimizability
Data independence arguments in spades here!
Non-programmers can deal with it
Lots of challenges at the seams of queries and routing
Query plans over a dynamic multi-hop network
With power and bandwidth consumption as key metrics
Sam Madden (w/Hellerstein, Hong, Franklin)
[Chart: number of messages vs. aggregation function, comparing EXTERNAL, MAX, AVERAGE, COUNT, and MEDIAN; y-axis 0 to 100,000 messages]
Focus: Hierarchical Aggregation
Aggregation is natural in sensornets
The “big picture” is typically interesting
Aggregation can smooth noise and loss
E.g. signal-processing aggs like wavelets
Provides data reduction
Power/network reduction: in-network aggregation
Hierarchical version of parallel aggregation
Tricky design space: power vs. quality, topology selection, value-based routing; a dynamic environment requires adaptivity
TinyDB Sample Apps
Habitat Monitoring: what is the average humidity in the populated petrel burrows on Great Duck Island right now?
Smart Office: find me the conference rooms that have been reserved but unoccupied for 5 minutes.
Home Automation: lower blinds when light intensity is above a threshold.
Performance in SensorNets
Power consumption
Communication >> computation; METRIC: radio wake time
Send > receive; METRIC: messages generated
“Run for 5 years” vs. “burn power for critical events” vs. “run my experiment”
Bandwidth constraints
Internal >> external (volume >> surface area)
Result Quality Noisy sensors Discrete sampling of continuous phenomena Lossy communication channel
TinyDB
SQL-like language for specifying continuous queries and triggers
Schema management, etc.
Proxy on desktop, small query engine per mote
Plug and play (query snooping)
To keep the engine “tiny”, use an eddy-style architecture
One explicit copy of each iterator’s code image
Adaptive dataflow in the network
Alpha available for download on SourceForge
Some of the Optimization Issues
Extensible aggregation API: Init(), Iter(), SplitFlow(), Close()
Properties: amount of intermediate state, duplicate sensitivity, monotonicity, exemplary vs. summary
Hypothesis testing
Snooping and suppression
Compression, presumption, interpolation
Generally, QP and NW issues intertwine!
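In-network aggregation in the spirit of the API above can be sketched for AVERAGE: each mote carries a small partial state record (sum, count), merges its children's records as they flow up the spanning tree, and only the root closes the aggregate. The function names below (avg_init, avg_merge, avg_close) are illustrative stand-ins; TAG's actual API and signatures differ.

```python
# Sketch of in-network AVERAGE: small partial state records merged up
# a spanning tree instead of shipping raw readings. Names are hypothetical.

def avg_init(value):
    """Initialize a partial state record from one sensor reading."""
    return (value, 1)                        # (sum, count)

def avg_merge(a, b):
    """Combine two partial state records (the in-network step)."""
    return (a[0] + b[0], a[1] + b[1])

def avg_close(state):
    """Evaluate the final aggregate at the root."""
    s, c = state
    return s / c

# A parent node merges its own reading with its children's records and
# forwards a single (sum, count) pair up the tree.
children = [avg_init(10), avg_init(20)]      # partial states from children
state = avg_init(30)                         # parent's own reading
for c in children:
    state = avg_merge(state, c)
result = avg_close(state)
# result == 20.0
```

The properties listed above fall out of this shape: AVERAGE has constant intermediate state, is duplicate-sensitive (merging the same record twice inflates the count), and is a summary rather than exemplary aggregate.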
PIER: Querying the Internet
Querying the Internet
As opposed to querying over the Internet
Have to deal with Internet realities: scale, dynamics, federated admin, partial failure, etc.
Standard distributed DBs won’t work
Applications
Start with real-time, distributed network monitoring
Traffic monitoring, intrusion/spam detection, software deployment detection (e.g. via TBIT), etc.
Use PIER’s SQL as a workload generator for networks?
Virtual “tables” determine the load produced by each site
“Queries” become a way of specifying site-to-site communication
Move to infect the network more deeply?
E.g. indirection schemes like i3, rendezvous mechanisms, etc. Overlays only?
And p2p QP, Obviously
Gnutella done right. And it’s so easy! :-)
Crawler-free web search
Bring WYGIWIGY queries to the people: ranking, recommenders, etc.
Got to be more fun here
If p2p takes off in a big way, queries have to be a big piece
Why p2p DB, anyway? No good reason I can think of! :-)
Focus on the grassroots nature of p2p
Schema integration and transactions and … ?? No! Work with what you’ve got! Query the data that’s out there
Nothing complicated for users will fly
Avoid the “DB” word: P2P QP, not P2P DB
Approach: Leverage DHTs
“Distributed Hash Tables”
Family of distributed content-routing schemes: CAN, Chord, Pastry, Tapestry, etc.
Internet-scale “hash table”
A la a wide-area, adaptive Exchange routing table, with some notion of storage
Leverage DHTs aggressively
As distributed indexes on stored data
As state modules for query processing
E.g. use DHTs as the hash tables in a hash join
As rendezvous points for exchanging info
E.g. Bloom filters
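Using a DHT as the hash table in a hash join can be sketched as follows: each tuple from either relation is shipped to the node responsible for hash(key), so matching tuples rendezvous by content and each node can join locally. The toy node ring below is an illustrative stand-in for CAN/Chord-style routing, not a real DHT implementation.

```python
# Sketch: DHT as the distributed hash table in a hash join.
# Matching keys are co-located by content-based routing, so each
# node joins locally. The "DHT" here is a toy four-node stand-in.
import hashlib

NODES = ["node0", "node1", "node2", "node3"]

def responsible_node(key):
    """Map a key to its responsible node via a stable hash."""
    h = int(hashlib.sha1(str(key).encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

store = {n: [] for n in NODES}   # per-node storage

def dht_put(rel, key, val):
    """Publish a tuple to the node responsible for its join key."""
    store[responsible_node(key)].append((rel, key, val))

for k, v in [(1, "r1"), (2, "r2")]:
    dht_put("R", k, v)
for k, v in [(1, "s1"), (3, "s3")]:
    dht_put("S", k, v)

# Each node performs a purely local join over its co-located tuples.
joined = []
for n in NODES:
    rs = [(k, v) for rel, k, v in store[n] if rel == "R"]
    ss = [(k, v) for rel, k, v in store[n] if rel == "S"]
    joined += [(k1, v1, v2) for k1, v1 in rs for k2, v2 in ss if k1 == k2]
# joined == [(1, "r1", "s1")]
```

This is the exchange routing function of parallel DBs reborn at Internet scale: the DHT plays both the role of the routing table and the role of the stored hash-table state.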
PIER: P2p Information Exchange and Retrieval
Relational-style query executor
With front-ends for SQL and catalogs
Standard and continuous queries
With access to DHT APIs
Currently CAN and Chord, working on Tapestry; a common DHT API would help
Currently simulating queries running on tens of thousands of nodes. Look ma, it scales!
Widest-scale relational engine ever looks feasible
Most of the simulator code will live on in the implementation
On Millennium and PlanetLab this fall/winter
Ryan Huebsch and Boon Thau Loo (w/Hellerstein, Shenker, Stoica)
PIER Challenges
How does this batch workload stress DHTs?
How does republishing of soft state interact with dataflow? And with the semantics of query answers?
Materialization/precomputation/caching: physical tuning meets SteMs meets materialized views
How to do query optimization in this context? Distributed eddies!
Partial failure is a reality: at storage nodes? At query execution nodes? Impact on results, mitigation
What about aggregation? Similarities/differences with TAG? With Astrolabe [Birman et al.]?
The “usual” CQ and data-stream query issues, distributed
Analogous to work in Telegraph, and at Brown, Wisconsin, Stanford…
All together now?
I thought about changing the names: Telegraph*, Teletiny…? The group didn’t like the branding
Teletubby!
Seriously: integration? It’s a plausible need
Sensor data + map data + historical sensor logs + …
Filesharing + Web
We have done both of these cheesily, but there are fun questions of doing it right
E.g. pushing predicates and data into the sensor net or not?
References & Resources
Database Texts
Undergrad textbooks
Ramakrishnan & Gehrke, Database Management Systems
Silberschatz, Korth, Sudarshan, Database System Concepts
Garcia-Molina, Ullman, Widom, Database Systems: The Complete Book
O’Neil & O’Neil, Database: Principles, Programming, and Performance
Abiteboul, Hull, Vianu, Foundations of Databases
Graduate texts
Stonebraker & Hellerstein, Readings in Database Systems (a.k.a. “The Red Book”)
Brewer & Hellerstein, Readings book (e-book?) in progress. Fall 2003?
Research Links
DB group at Berkeley: db.cs.berkeley.edu
GiST: gist.cs.berkeley.edu
Telegraph: telegraph.cs.berkeley.edu
TinyDB: telegraph.cs.berkeley.edu/tinydb, berkeley.intel-research.net/tinydb
Red Book: redbook.cs.berkeley.edu