
Adaptive Query Processing: Progress and Challenges

Alon Halevy
University of Washington [Gore]

Joint work with Zack Ives, Dan Weld (later: Nimble Technology)

[Figure: the mybooks.com mediated schema (Reviews, Shipping, Orders, Inventory, Books) over autonomous sources reached via WAN/Internet: East and West Orders databases, CustomerReviews, FedEx, UPS, NY Times, alt.books.reviews, Morgan-Kaufman, Prentice-Hall, ...]

Data Integration Systems

Uniform query capability across autonomous, heterogeneous data sources on a LAN, WAN, or the Internet: in enterprises, on the WWW, and in big science.

Recent Trends in Data Integration Research

• Issues such as architectures, query reformulation, and wrapper construction are reasonably well understood (though good work is still going on).

• Query execution and optimization raise significant challenges.

• Problems for the traditional query processing model:
  – Few statistics (autonomous sources).
  – Unanticipated delays and failures (network-bound sources).

• Conclusion (ours): we cannot afford to separate optimization from execution. We need to be adaptive.

• See the IEEE Data Engineering Bulletin, June 2000.

Outline

• Tukwila (version 1):
  – Interleaving optimization and execution at the core.

• The unsolved problem: when to switch?

• The complicating new challenges:
  – XML; users want first tuples fast.

• Tukwila (version 2):
  – Completely pipelined XML query processing.

• Some experiences from Nimble.

Tukwila: Version 1

• Key idea: build adaptive features into the core of the system.

• Interleave planning and execution (replan when you know more about your data):
  – Rule-based mechanism for changing behavior.

• Adaptive query operators:
  – Revival of the double-pipelined join.
  – Collectors (a.k.a. "smart union").

• See details in SIGMOD-99.

Tukwila Data Integration System

[Figure: Tukwila architecture. A query is reformulated (using source mappings from the catalog) into a logical plan; the optimizer/re-optimizer and memory allocator-fragmenter produce an execution plan for the execution engine, whose query operators, temp store, and event handler feed execution results and statistics back to the optimizer; the engine reads data from the sources and returns the answer.]

Novel components:
– Event handler
– Optimization-execution loop

Handling Execution Events

• Adaptive execution via event-condition-action rules (a minimal sketch follows this list).

• During execution, events are generated:
  Timeout, n tuples read, operator opens/closes, memory overflows, execution fragment completes, …

• Events trigger rules:
  – Test conditions:
    Memory free, tuples read, operator state, operator active, …
  – Execution actions:
    Re-optimize, reduce memory, activate/deactivate operator, …
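As a concrete illustration, here is a minimal event-condition-action rule engine sketch in Python. The Rule shape, event names, and actions are illustrative assumptions, not Tukwila's actual interface:

    # Minimal event-condition-action (ECA) rule engine sketch.
    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class Rule:
        event: str                          # e.g. "timeout", "end_of_fragment"
        condition: Callable[[dict], bool]   # tested against execution state
        action: Callable[[dict], None]      # e.g. re-optimize, activate operator

    @dataclass
    class EventHandler:
        rules: list = field(default_factory=list)

        def raise_event(self, event: str, state: dict) -> None:
            # Fire every registered rule whose event matches and whose
            # condition holds in the current execution state.
            for rule in self.rules:
                if rule.event == event and rule.condition(state):
                    rule.action(state)

    # WHEN timeout(CustReviews) DO activate(NYTimes), activate(alt.books)
    handler = EventHandler()
    handler.rules.append(Rule(
        event="timeout",
        condition=lambda s: s["source"] == "CustReviews",
        action=lambda s: print("activating NYTimes and alt.books"),
    ))
    handler.raise_event("timeout", {"source": "CustReviews"})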

Interleaving Planning and Execution

Re-optimize if at an unexpected state:

– Evaluate at key points; re-optimize the un-executed portion of the plan [Kabra/DeWitt SIGMOD98].

– Plan has pipelined units, fragments.

– Send statistics back to the optimizer.

– Maintain optimizer state for later reuse.

[Figure: an example plan split into pipelined fragments. Fragment 0 hash-joins FedEx and Orders and feeds a Materialize & Test operator; Fragment 1 hash-joins that result with East.]

WHEN end_of_fragment(0) IF card(result) > 100,000 THEN re-optimize
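A minimal sketch of the loop such a rule drives, with hypothetical plan, fragment, and optimizer interfaces:

    # Sketch: interleave optimization and execution at fragment boundaries.
    def run(plan, optimizer):
        while plan.has_fragments():
            fragment = plan.next_fragment()
            stats = fragment.execute()      # run one pipelined unit to completion
            # At the fragment boundary, compare observed statistics against
            # the estimates; re-optimize the un-executed portion if surprised.
            if stats["card"] > 100_000:     # threshold from the rule above
                plan = optimizer.reoptimize(plan.remaining(), stats)
        return plan.result()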

Adaptive Operators: Double Pipelined Join

Hybrid hash join:
– Partially pipelined: no output until the inner relation is read.
– Asymmetric (inner vs. outer): optimization requires knowledge of source behavior.

Double pipelined hash join (enhancement to [Wilschut PDIS91]: uses multithreading, handles overflow):
– Outputs data immediately.
– Symmetric: requires less source knowledge to optimize.
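For concreteness, here is a minimal single-threaded sketch of the symmetric join in Python. Tukwila's version is multithreaded and handles memory overflow; the round-robin interleaving below stands in for "read whichever source has data ready", and all names are illustrative:

    # Symmetric (double pipelined) hash join sketch. Each arriving tuple is
    # inserted into its side's hash table and immediately probed against the
    # other side, so output begins before either input has been fully read.
    from collections import defaultdict

    def interleave(left, right):
        # Round-robin stand-in for reading whichever source is ready.
        pending = [("L", iter(left)), ("R", iter(right))]
        while pending:
            for side, it in list(pending):
                try:
                    yield next(it), side
                except StopIteration:
                    pending.remove((side, it))

    def double_pipelined_join(left, right, key_left, key_right):
        tables = {"L": defaultdict(list), "R": defaultdict(list)}
        keys = {"L": key_left, "R": key_right}
        for tup, side in interleave(left, right):
            k = keys[side](tup)
            tables[side][k].append(tup)             # insert into own table
            other = "R" if side == "L" else "L"
            for match in tables[other].get(k, []):  # probe the other table
                yield (tup, match) if side == "L" else (match, tup)

    # Usage: first results appear as soon as matching tuples have arrived.
    orders = [("123-456-X", 2), ("235-711-Y", 1)]
    reviews = [("123-456-X", "5 stars")]
    for pair in double_pipelined_join(orders, reviews,
                                      lambda t: t[0], lambda t: t[0]):
        print(pair)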

Adaptive Operators: Collector

Utilize mirrors and overlapping sources to produce results quickly:

– Dynamically adjust to source speed & availability.

– Scale to many sources without exceeding net bandwidth.

– Based on policy expressed via rules.

[Figure: a collector C over the overlapping review sources CustReviews, NYTimes, and alt.books.]

WHEN timeout(CustReviews) DO activate(NYTimes), activate(alt.books)
WHEN timeout(NYTimes) DO activate(alt.books)
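A minimal sketch of a collector applying this kind of policy; the source interface, threading scheme, and timeout handling are illustrative assumptions:

    # Collector ("smart union") sketch using threads and a shared queue:
    # start with a preferred source, activate mirrors when it times out.
    import queue, threading

    def collector(primary, backups, timeout_s=5.0):
        out: queue.Queue = queue.Queue()

        def pump(source):
            for tup in source:
                out.put(tup)
            out.put(None)                   # end-of-source marker

        threading.Thread(target=pump, args=(primary,), daemon=True).start()
        open_sources = 1
        activated = False
        seen = set()                        # dedupe overlapping sources
        while open_sources:
            try:
                tup = out.get(timeout=timeout_s)
            except queue.Empty:
                if not activated:           # WHEN timeout(primary)
                    for b in backups:       # DO activate(backup)
                        threading.Thread(target=pump, args=(b,),
                                         daemon=True).start()
                        open_sources += 1
                    activated = True
                continue
            if tup is None:
                open_sources -= 1
            elif tup not in seen:
                seen.add(tup)
                yield tup

Because the sources overlap, the seen set suppresses duplicate answers; a fuller policy would also deactivate slow sources and respect bandwidth limits.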

Highlights from Version 1

• It worked well (graphs to prove it)!

• Unified architecture that encompassed previous techniques:
  – Choose nodes (Cole & Graefe).
  – Mid-stream re-optimization (Kabra & DeWitt).
  – Query scrambling (Urhan, Franklin, Amsaleg).

• The optimizer can have a global view of the different factors affecting adaptive behavior.

The Unsolved Problem

• Finding interleaving points: when to switch from optimization to execution?

• Some straightforward solutions worked reasonably well, but the student who was supposed to solve the problem graduated prematurely.

• Some work on this problem:
  – Rick Cole (Informix).
  – Benninghoff & Maier (OGI).

• One solution being explored: execute first and break the pipeline later as needed.

• Another solution: change operator ordering in mid-flight (Eddies, Avnur & Hellerstein).

More Urgent Problems

• Users want answers immediately:
  – Optimize time to first tuple.
  – Give approximate results earlier.

• XML emerges as a preferred platform for data integration:
  – But all existing XML query processors are based on first loading the XML into a repository.

Tukwila Version 2

• Able to transform, integrate and query arbitrary XML documents.

• Support for outputting query results as early as possible:
  – Streaming model of XML query execution.

• Efficient execution over remote sources that are subject to frequent updates.

• Philosophy: how can we adapt relational and object-relational execution engines to work with XML?

Tukwila V2 Highlights

• The X-scan operator that maps XML data into tuples of subtrees.

• Support for efficient memory representation of subtrees (use references to minimize replication).

• Special operators for combining and structuring bindings as XML output.

Tukwila V2 Architecture

Example XML File

<db>
  <book publisher="mkp">
    <title>Readings in Database Systems</title>
    <editors>
      <name>Stonebraker</name>
      <name>Hellerstein</name>
    </editors>
    <isbn>123-456-X</isbn>
  </book>
  <company ID="mkp">
    <name>Morgan Kaufmann</name>
    <city>San Mateo</city>
    <state>CA</state>
  </company>
</db>

XML Data Graph

[Figure: the example data as a graph rooted at db (#1), with book and company subtrees. The books are "Readings in Database Systems" (ISBN 123-456-X, editors Stonebraker and Hellerstein) and "Principles of Transaction Processing" (ISBN 235-711-Y, editors Bernstein and Newcomer); the company node "mkp" has name Morgan Kaufmann, city San Mateo, state CA. The publisher="mkp" IDREF appears as a cross-edge, making the data a graph rather than a tree.]

Example Query

WHERE  <db>
         <book publisher=$pID>
           <title>$t</>
         </> ELEMENT_AS $b
       </> IN "books.xml",
       <db>
         <publication title=$t>
           <source ID=$pID>$p</>
           <price>$pr</>
         </>
       </> IN "amazon.xml",
       $pr < 49.95
CONSTRUCT <book>
            <name>$t</>
            <publisher>$p</>
          </>
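To trace the query over the example file: the first pattern binds $pID = "mkp" and $t = "Readings in Database Systems", with $b bound to the whole <book> subtree; the second pattern then joins amazon.xml publications on the shared $t and $pID, binding the publisher name $p and price $pr; finally $pr < 49.95 filters the join results, and CONSTRUCT emits one output <book> element per surviving binding. (The contents of amazon.xml are not shown, so the final result depends on that source.)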

Query Execution Plan

X-Scan

• The operator at the leaves of the plan.

• Given an XML stream and a set of regular path expressions, produces a set of bindings.

• Supports both tree and graph data.

• Uses a set of state machines to traverse the data and match the patterns (see the sketch after this list).

• Maintains a list of unseen element IDs, and resolves them upon arrival.
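Here is a minimal sketch of the X-scan idea in Python, driving one path-matching state machine from a SAX event stream. The real operator runs several machines against regular path expressions, binds subtree references rather than text, and resolves IDREFs; this shows only the core matching loop, and all names are illustrative:

    # X-scan sketch: advance a path-matching state machine on SAX events,
    # emitting a binding whenever the machine reaches its final state.
    import xml.sax

    class PathMatcher(xml.sax.ContentHandler):
        def __init__(self, path):               # e.g. ["db", "book", "title"]
            super().__init__()
            self.path = path
            self.stack = []                     # saved states, one per depth
            self.state = 0
            self.text = ""
            self.bindings = []

        def startElement(self, name, attrs):
            self.stack.append(self.state)
            # Advance if this element matches the next path step.
            if self.state < len(self.path) and name == self.path[self.state]:
                self.state += 1
                if self.state == len(self.path):
                    self.text = ""              # start collecting the match

        def characters(self, content):
            if self.state == len(self.path):
                self.text += content

        def endElement(self, name):
            if self.state == len(self.path) and name == self.path[-1]:
                self.bindings.append(self.text) # emit a binding, e.g. for $t
            self.state = self.stack.pop()       # restore state on exit

    # Usage over a fragment of the example document:
    handler = PathMatcher(["db", "book", "title"])
    xml.sax.parseString(b"""<db><book publisher="mkp">
      <title>Readings in Database Systems</title></book></db>""", handler)
    print(handler.bindings)   # ['Readings in Database Systems']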

X-scan Data Structures

[Figure: X-scan's runtime structures over the incoming XML stream: the XML tree manager with its structural index and ID index, the state machines with their stack, the binding tuples being produced, and a table of unresolved IDREFs awaiting their target IDs.]

State Machines for X-scan

[Figure: the machines for the example query. Mb matches db.book (states 1-2-3); MpID matches @publisher from a book (states 4-5); Mt matches title from a book (states 6-7).]

Other Features of Tukwila V.2

• X-scan:
  – Can also be made to preserve XML order.
  – Careful handling of cycles in the XML graph.
  – Can apply certain selections to the bindings.

• Uses much of the code of Tukwila version 1.

• No modifications to traditional operators.

• XML output-producing operators.

• Nest operator.

In the “Pipeline”

• Partial answers: no blocking; produce approximate answers while data is streaming.

• Policies for recovering from memory overflow [more from Zack].

• Efficient updating of XML documents (and an XML update language) [w/Tatarinov]

• Dan Suciu: a modular/composable toolset for manipulating XML.

• Automatic generation of data source descriptions (Doan & Domingos)

First 5 Results

[Chart: query execution time (seconds, 0-50 scale) to the first 5 results for Tukwila vs. DOM Parse, Xalan, XT, and a relational engine, over six queries: Order 1234 (R, 5MB); LineItem Qty < 32 (R, 31MB); Orders x Cust (R, 5x0.5MB); LineItems x Orders (R, 31x7MB); Papers under Confs (I, 0.2x9MB); Papers with ConfRefs (D, 39MB). One off-scale bar is labeled 938.]

Completion Time

[Chart: total completion time (seconds, 0-400 scale) for Tukwila vs. Xalan, XT, and a relational engine over the same six queries; one off-scale bar is labeled 938.]

Intermediate Conclusions

• First scalable XML query processor for networked data.

• Work done in relational query processing is very relevant to XML query processing.

• We want to avoid decomposing XML data into relational structures.

Some Observations from Nimble

• What is Nimble?
  – Founded in June 1999 with Dan Weld.
  – Data integration engine built on an XML platform.
  – Query language is XML-QL.
  – Mostly geared to enterprise integration, some advanced web applications.
  – 70+ person company (and hiring!).
  – Ships in trucks (first customer is Paccar).

System Architecture

[Figure: Nimble's architecture. User applications (Lens™, File, InfoBrowser™, Software Developers Kit) sit on the NIMBLE™ APIs, which accept XML queries and return XML through the front end. Management tools (Lens Builder™, Integration Builder, Security Tools, Data Administrator, Concordance Developer) surround the integration layer: the Nimble Integration Engine™ with its compiler, executor, metadata server, and cache, exposing a common XML view over relational databases, data warehouses/marts, legacy systems, flat files, and web pages.]

The Current State of Enterprise Information

• Explosion of intranet and extranet information.

• 80% of corporate information is unmanaged.

• By 2004, 30x more enterprise data than in 1999.

• The average company:
  – maintains 49 distinct enterprise applications;
  – spends 35% of its total IT budget on integration-related efforts.

[Chart: enterprise information growth, 1995-2005. Source: Gartner, 1999.]

Design Issues

• Query language for XML: tracking the W3C committee.

• The algebra:
  – Needs to handle XML, relational, and hierarchical data, and support it all efficiently!
  – Need to distinguish the physical from the logical algebra.

• Concordance tables need to be an integral part of the system. Need to think about data cleaning.

• Need to deal with down times of data sources (or refusal times).

• Need to provide a range of options between on-demand querying and pre-materialization.

Non-Technical Issues

• SQL not really a standard.

• Legacy systems are not necessarily old.

• IT managers are skeptical of "truths".

• People are very confused out there.

• Need a huge organization to support the effort.