Adaptive Query Processing: Progress and Challenges
Alon Halevy
University of Washington [Gore]
Joint work with Zack Ives, Dan Weld
(later: Nimble Technology)
[Figure: mybooks.com mediated schema (Books, Inventory, Orders, Shipping, Reviews) over sources reached via WAN and Internet: West, East, ..., FedEx, UPS, Orders, Customer Reviews, NY Times, alt.books.reviews, ..., Morgan-Kaufman, Prentice-Hall]
Data Integration Systems
Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet: in enterprises, WWW, big science.
Recent Trends in Data Integration Research
• Issues such as architectures, query reformulation, and wrapper construction are reasonably well understood (but good work is still going on).
• Query execution and optimization raise significant challenges.
• Problems for traditional query processing model:
– Few statistics (autonomous sources)
– Unanticipated delays and failures (network-bound sources).
• Conclusion (ours): cannot afford to separate optimization from execution. Need to be adaptive.
• See IEEE Data Engineering Bulletin, June, 2000.
Outline
• Tukwila (version 1):
– Interleaving optimization and execution at the core.
• The unsolved problem: when to switch?
• The complicating new challenges:
– XML, want first tuples fast.
• Tukwila (version 2):
– Completely pipelined XML query processing.
• Some experiences from Nimble
Tukwila: Version 1
• Key idea: build adaptive features into the core of the system.
• Interleave planning and execution (replan when you know more about your data)
– Rule-based mechanism for changing behavior.
• Adaptive query operators:
– Revival of the double-pipelined join.
– Collectors (a.k.a. “smart union”).
• See details in SIGMOD-99.
[Figure: Tukwila architecture. Components: Reformulator, Optimizer / (Re-)Optimizer with Mem Alloc-Fragmenter, Execution Engine (Event Handler, Query Operators), Temp Store, Catalog. Labeled flows: query, source mappings, logical plan, exec plan, exec results, data, answer.]
Tukwila Data Integration System
Novel components:
– Event handler
– Optimization-execution loop
Handling Execution Events
• Adaptive execution via event-condition-action rules
• During execution, events are generated:
– Timeout, n tuples read, operator opens/closes, memory overflows, execution fragment completes, …
• Events trigger rules:
– Test conditions: memory free, tuples read, operator state, operator active, …
– Execution actions: re-optimize, reduce memory, activate/deactivate operator, …
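A minimal sketch of how event-condition-action rules like these could be dispatched. It is not Tukwila's implementation: the class names, the state fields (`operator_active`, `memory_free`), the subjects `sourceA` and `join1`, and the 1024 threshold are all hypothetical, chosen only to mirror the event, condition, and action kinds listed above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, object]  # snapshot of execution state (hypothetical field names)


@dataclass
class Rule:
    event: str                          # event kind, e.g. "timeout"
    subject: str                        # operator or source the event refers to
    condition: Callable[[State], bool]  # the IF part, tested against the state
    action: str                         # the THEN part, recorded as a label here


@dataclass
class EventHandler:
    rules: List[Rule] = field(default_factory=list)
    fired: List[str] = field(default_factory=list)

    def dispatch(self, event: str, subject: str, state: State) -> None:
        """Fire every rule registered for this event whose condition holds."""
        for rule in self.rules:
            if rule.event == event and rule.subject == subject and rule.condition(state):
                self.fired.append(rule.action)


handler = EventHandler(rules=[
    Rule("timeout", "sourceA", lambda s: s["operator_active"], "deactivate operator"),
    Rule("memory_overflow", "join1", lambda s: s["memory_free"] < 1024, "reduce memory"),
])

# Simulated events raised during execution.
handler.dispatch("timeout", "sourceA", {"operator_active": True, "memory_free": 4096})
handler.dispatch("memory_overflow", "join1", {"operator_active": True, "memory_free": 4096})
handler.dispatch("memory_overflow", "join1", {"operator_active": True, "memory_free": 512})
print(handler.fired)
```

In this sketch actions are only recorded as labels; a real engine would call back into the executor or the optimizer.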
Interleaving Planning and Execution
Re-optimize if at unexpected state:
– Evaluate at key points, re-optimize un-executed portion of plan [Kabra/DeWitt SIGMOD98]
– Plan has pipelined units, fragments
– Send back statistics to optimizer.
– Maintain optimizer state for later reuse.
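[Figure: query plan split into two fragments, Fragment 0 and Fragment 1, with a Materialize & Test point between them. Operators and sources shown: HashJoin (twice), East, FedEx, Orders.]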
WHEN end_of_fragment(0) IF card(result) > 100,000 THEN re-optimize
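The interleaving loop can be sketched as follows, assuming a plan is an ordered list of fragments and that each fragment's result is materialized before the next one starts. The function names, the toy fragments, and the `reoptimize` callback are hypothetical; a real optimizer would re-cost the un-executed portion of the plan using the observed statistics rather than swap in a fixed alternative.

```python
from typing import Callable, List, Tuple

Row = Tuple
Fragment = Callable[[List[Row]], List[Row]]  # consumes the previous materialized result


def run_with_reoptimization(
    fragments: List[Fragment],
    reoptimize: Callable[[List[Fragment], int], List[Fragment]],
    threshold: int = 100_000,
) -> Tuple[List[Row], int]:
    """Run fragments in order; at each materialization point test the cardinality."""
    result: List[Row] = []
    remaining = list(fragments)
    replans = 0
    while remaining:
        fragment = remaining.pop(0)
        result = fragment(result)  # end of fragment: result is materialized
        if remaining and len(result) > threshold:
            # Unexpected state: hand the observed cardinality back to the optimizer.
            remaining = reoptimize(remaining, len(result))
            replans += 1
    return result, replans


def fragment0(_: List[Row]) -> List[Row]:
    return [(i,) for i in range(200)]  # stands in for the first join's output


def plan_for_small_input(rows: List[Row]) -> List[Row]:
    return [("small-input plan", len(rows))]


def plan_for_large_input(rows: List[Row]) -> List[Row]:
    return [("large-input plan", len(rows))]


def reoptimize(remaining: List[Fragment], observed_card: int) -> List[Fragment]:
    return [plan_for_large_input]  # toy replacement for the rest of the plan


result, replans = run_with_reoptimization(
    [fragment0, plan_for_small_input], reoptimize, threshold=100
)
print(result, replans)
```

The threshold is lowered to 100 here only so the toy input of 200 rows triggers the rule.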
Adaptive Operators: Double Pipelined Join
Hybrid Hash Join
– Partially pipelined: no output until inner read
– Asymmetric (inner vs. outer): optimization requires source behavior knowledge
Double Pipelined Hash Join
– Enhancement to [Wilschut PDIS91]: uses multithreading, handles overflow
– Outputs data immediately
– Symmetric: requires less source knowledge to optimize
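A minimal single-threaded sketch of the symmetric idea: each arriving tuple is inserted into its own side's hash table and immediately probes the other side's table, so matches are emitted as soon as both partners have been seen. Round-robin alternation between the inputs stands in for the multithreading mentioned above, and memory overflow is not handled. The function name and the sample tuples are illustrative assumptions.

```python
from collections import defaultdict
from itertools import zip_longest


def double_pipelined_join(left, right, left_key, right_key):
    """Symmetric hash join: yields (left_row, right_row) pairs incrementally."""
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    end = object()  # sentinel marking an exhausted input
    for l, r in zip_longest(left, right, fillvalue=end):
        if l is not end:
            k = left_key(l)
            left_table[k].append(l)            # build on the left side
            for match in right_table.get(k, ()):
                yield (l, match)               # probe the right side
        if r is not end:
            k = right_key(r)
            right_table[k].append(r)           # build on the right side
            for match in left_table.get(k, ()):
                yield (match, r)               # probe the left side


orders = [(1, "book A"), (2, "book B"), (3, "book C")]
shipments = [(2, "FedEx"), (1, "UPS")]

joined = list(double_pipelined_join(orders, shipments, lambda o: o[0], lambda s: s[0]))
print(joined)
```

Each matching pair is produced exactly once, by whichever tuple of the pair arrives second, and neither input has to be designated as inner or outer.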
Adaptive Operators: Collector
Utilize mirrors and overlapping sources to produce results quickly
– Dynamically adjust to source speed & availability
– Scale to many sources without exceeding net bandwidth
– Based on policy expressed via rules
[Figure: collector operator C over the sources CustReviews, NYTimes, and alt.books]
WHEN timeout(CustReviews) DO activate(NYTimes), activate(alt.books)
WHEN timeout(NYTimes) DO activate(alt.books)
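A minimal sketch of a collector driven by timeout rules like the two above. The sources are simulated as plain functions, with a timeout modeled as a raised `TimeoutError`; polling is sequential rather than concurrent, and duplicate elimination across overlapping sources is an assumption of this sketch. The review tuples are made-up placeholder data.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, str]


def collect(
    sources: Dict[str, Callable[[], List[Row]]],
    initially_active: List[str],
    on_timeout: Dict[str, List[str]],
) -> List[Row]:
    """Union the rows of active sources, activating fallbacks when one times out."""
    pending = list(initially_active)
    tried = set()
    seen = set()
    results: List[Row] = []
    while pending:
        name = pending.pop(0)
        if name in tried:
            continue  # a source may be named by several rules; contact it once
        tried.add(name)
        try:
            rows = sources[name]()
        except TimeoutError:
            pending.extend(on_timeout.get(name, []))  # WHEN timeout(name) DO activate(...)
            continue
        for row in rows:
            if row not in seen:  # overlapping sources may return the same row
                seen.add(row)
                results.append(row)
    return results


def cust_reviews() -> List[Row]:
    raise TimeoutError("CustReviews did not respond")


def ny_times() -> List[Row]:
    return [("Readings in Database Systems", "review 1")]


def alt_books() -> List[Row]:
    return [("Readings in Database Systems", "review 1"),
            ("Readings in Database Systems", "review 2")]


sources = {"CustReviews": cust_reviews, "NYTimes": ny_times, "alt.books": alt_books}
rules = {"CustReviews": ["NYTimes", "alt.books"], "NYTimes": ["alt.books"]}

reviews = collect(sources, ["CustReviews"], rules)
print(reviews)
```

Keeping the policy in a separate rules table means the fallback order can change without touching the operator.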
Highlights from Version 1
• It worked well (graphs to prove it)!
• Unified architecture that encompassed previous techniques:
– Choose nodes (Cole & Graefe)
– Mid-stream re-optimization (Kabra & DeWitt)
– Query scrambling (Urhan, Franklin, Amsaleg)
• Optimizer can have global view of different factors affecting adaptive behavior.
The Unsolved Problem
• Find interleaving points? When to switch from optimization to execution?
• Some straightforward solutions worked reasonably, but student who was supposed to solve the problem graduated prematurely.
• Some work on this problem:
– Rick Cole (Informix)
– Benninghoff & Maier (OGI).
• One solution being explored: execute first and break pipeline later as needed.
• Another solution: change operator ordering in mid-flight (Eddies, Avnur & Hellerstein).
More Urgent Problems
• Users want answers immediately:
– Optimize time to first tuple
– Give approximate results earlier.
• XML emerges as a preferred platform for data integration:
– But all existing XML query processors are based on first loading XML into a repository.
Tukwila Version 2
• Able to transform, integrate and query arbitrary XML documents.
• Support for output of query results as early as possible:
– Streaming model of XML query execution.
• Efficient execution over remote sources that are subject to frequent updates.
• Philosophy: how can we adapt relational and object-relational execution engines to work with XML?
Tukwila V2 Highlights
• The X-scan operator that maps XML data into tuples of subtrees.
• Support for efficient memory representation of subtrees (use references to minimize replication).
• Special operators for combining and structuring bindings as XML output.
Example XML File
<db>
  <book publisher="mkp">
    <title>Readings in Database Systems</title>
    <editors>
      <name>Stonebraker</name>
      <name>Hellerstein</name>
    </editors>
    <isbn>123-456-X</isbn>
  </book>
  <company ID="mkp">
    <name>Morgan Kaufmann</name>
    <city>San Mateo</city>
    <state>CA</state>
  </company>
</db>
XML Data Graph
[Figure: graph of the example data, with nodes numbered #1 to #13. Root db; a company node with ID mkp; leaf values shown: Readings in Database Systems, 123-456-X, Stonebraker, Hellerstein, Principles of Transaction Processing, 235-711-Y, Bernstein, Newcomer, Morgan Kaufmann, San Mateo, CA.]
Example Query
WHERE <db>
        <book publisher=$pID>
          <title>$t</>
        </> ELEMENT_AS $b
      </> IN "books.xml",
      <db>
        <publication title=$t>
          <source ID=$pID>$p</>
          <price>$pr</>
        </>
      </> IN "amazon.xml",
      $pr < 49.95
CONSTRUCT <book>
            <name>$t</>
            <publisher>$p</>
          </>
X-Scan
• The operator at the leaves of the plan.
• Given an XML stream and a set of regular expressions – produces a set of bindings.
• Supports both trees and graph data.
• Uses a set of state machines to traverse the data and match the patterns.
• Maintains a list of unseen element IDs, and resolves them upon arrival.
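A rough illustration of the streaming idea using Python's standard `xml.etree.ElementTree.iterparse`. This is not X-scan: fixed root-to-node paths stand in for regular path expressions, a tag stack stands in for the state machines, and the reference-resolution logic is a simplification that only follows the `publisher` attribute of the example document. All function names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

DOC = b"""<db>
  <book publisher="mkp">
    <title>Readings in Database Systems</title>
    <editors><name>Stonebraker</name><name>Hellerstein</name></editors>
    <isbn>123-456-X</isbn>
  </book>
  <company ID="mkp">
    <name>Morgan Kaufmann</name><city>San Mateo</city><state>CA</state>
  </company>
</db>"""


def stream_bindings(xml_bytes, path):
    """Yield the text of each element whose root-to-node tag path equals `path`."""
    target = path.split("/")
    stack = []  # tags of the currently open elements
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == target:
                yield elem.text  # element is complete, so its text is available
            stack.pop()


def resolve_publishers(xml_bytes):
    """Pair book titles with publisher names, resolving references as targets arrive."""
    ids = {}         # ID index: company ID -> company name
    unresolved = []  # (title, reference) pairs whose target has not been seen yet
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "book":
            ref, title = elem.get("publisher"), elem.findtext("title")
            if ref in ids:
                yield (title, ids[ref])
            else:
                unresolved.append((title, ref))
        elif elem.tag == "company":
            company_id = elem.get("ID")
            ids[company_id] = elem.findtext("name")
            waiting = []
            for title, ref in unresolved:
                if ref == company_id:
                    yield (title, ids[ref])  # target arrived: resolve the reference
                else:
                    waiting.append((title, ref))
            unresolved = waiting


editors = list(stream_bindings(DOC, "db/book/editors/name"))
publishers = list(resolve_publishers(DOC))
print(editors, publishers)
```

Both functions are generators, so bindings can be consumed while the rest of the document is still being parsed.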
X-scan Data Structures
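[Figure: X-scan data structures. An XML stream (<db> <lab ID=... <name>Seattle... <location> <city>Seattle... ...) feeds the XML Tree Mgr, which holds a structural index, an ID index (ID1, ID2, ID3, ...), and a list of unresolved IDREFs. State machines with a stack produce binding tuples (shown: l = #1, c = #3).]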
Other Features of Tukwila V.2
• X-scan:
– Can also be made to preserve XML order.
– Careful handling of cycles in the XML graph.
– Can apply certain selections to the bindings.
• Uses much of the code of Tukwila I.
• No modifications to traditional operators.
• XML output producing operators.
• Nest operator.
In the “Pipeline”
• Partial answers: no blocking. Produce approximate answers as data is streaming.
• Policies for recovering from memory overflow [More Zack].
• Efficient updating of XML documents (and an XML update language) [w/Tatarinov]
• Dan Suciu: a modular/composable toolset for manipulating XML.
• Automatic generation of data source descriptions (Doan & Domingos)
First 5 Results
[Figure: bar chart of query execution time until the first 5 results, comparing Tukwila, DOM Parse, Xalan, XT, and Relational on six queries: Order 1234 (R, 5MB); LineItem Qty < 32 (R, 31MB); Orders x Cust (R, 5x0.5MB); LineItems x Orders (R, 31x7MB); Papers under Confs (I, 0.2x9MB); Papers with ConfRefs (D, 39MB).]
Completion Time
[Figure: bar chart of query execution time to completion, comparing Tukwila, Xalan, XT, and Relational on six queries: Order 1234 (R, 5MB); LineItem Qty < 32 (R, 31MB); Orders x Cust (R, 5x0.5MB); LineItems x Orders (R, 31x7MB); Papers under Confs (I, 0.2x9MB); Papers with ConfRefs (D, 39MB).]
Intermediate Conclusions
• First scalable XML query processor for networked data.
• Work done in relational query processing is very relevant to XML query processing.
• We want to avoid decomposing XML data into relational structures.
Some Observations from Nimble
• What is Nimble?
– Founded in June, 1999 with Dan Weld.
– Data integration engine built on an XML platform.
– Query language is XML-QL.
– Mostly geared to enterprise integration, some advanced web applications.
– 70+ person company (and hiring!)
– Ships in trucks (first customer is Paccar).
System Architecture
[Figure: Nimble system architecture. Front end: user applications (Lens™, File InfoBrowser™, Software Developers Kit) issue XML queries through the NIMBLE™ APIs. Integration layer: Nimble Integration Engine™ (Compiler, Executor, Metadata Server, Cache) presenting a common XML view over relational, data warehouse/mart, legacy, flat file, and web page sources. Management tools: Lens Builder™, Integration Builder, Security Tools, Data Administrator, Concordance Developer.]
The Current State of Enterprise Information
• Explosion of intranet and extranet information
• 80% of corporate information is unmanaged
• By 2004, 30X more enterprise data than in 1999
• The average company:
– maintains 49 distinct enterprise applications
– spends 35% of total IT budget on integration-related efforts
[Figure: growth of enterprise information, 1995 to 2005. Source: Gartner, 1999]
Design Issues
• Query language for XML: tracking the W3C committee.
• The algebra:
– Needs to handle XML, relational, hierarchical and support it all efficiently!
– Need to distinguish physical from logical algebra.
• Concordance tables need to be an integral part of the system. Need to think of data cleaning.
• Need to deal with down times of data sources (or refusal times).
• Need to provide range of options between on-demand querying and pre-materialization.