GALT Workshop Oct 16, 2003 1 Languages for Distributed Information Val Tannen Computer and Information Science Department University of Pennsylvania Acknowledgements: • Philippa Gardner and Sergio Maffeis, Imperial College • Arnaud Sahuguet, Bell Labs (formerly at UPenn) • Zack Ives and Benjamin Pierce, UPenn


GALT Workshop, Oct 16, 2003

Languages for Distributed Information

Val Tannen

Computer and Information Science Department

University of Pennsylvania

Acknowledgements:

• Philippa Gardner and Sergio Maffeis, Imperial College

• Arnaud Sahuguet, Bell Labs (formerly at UPenn)

• Zack Ives and Benjamin Pierce, UPenn


Distributed information

• Past: data was distributed because it did not fit on one system; data placement was a big topic.

• Present: independent data sources agree to collaborate.

• Queries that require integration across multiple sources.

• Heterogeneity. Is it solved by XML? Results are mixed.

• Redundancy is what makes things really interesting!


A query arrives… and demands satisfaction

[Figure: a query arriving at a process]

A query arrives at any node. It refers to names, and names relate to data in various nodes. For example:

node A: proj (join R (sel S))

Need to discover that R is at node A and S is at node B.

Need a plan to compute the result of

proj (join tableR@A (sel tableS@B))

and return the answer to the client that asked the query.
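The discovery step described above (finding that R is at node A and S is at node B) can be sketched as a rewrite over the query expression. This is only an illustration: the catalog dictionary, the class names, and the `discover` and `show` helpers are assumptions introduced here, not part of the talk.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

# Hypothetical catalog mapping each name to the node that stores it.
CATALOG: Dict[str, str] = {"R": "A", "S": "B"}

@dataclass(frozen=True)
class Name:
    name: str                 # an unresolved name such as R

@dataclass(frozen=True)
class Located:
    table: str                # a resolved reference such as tableR@A
    node: str

@dataclass(frozen=True)
class Op:
    op: str                   # "proj", "join", or "sel"
    args: Tuple["Expr", ...]

Expr = Union[Name, Located, Op]

def discover(e: Expr, catalog: Dict[str, str]) -> Expr:
    """Replace every unresolved Name by a Located table reference."""
    if isinstance(e, Name):
        return Located("table" + e.name, catalog[e.name])
    if isinstance(e, Op):
        return Op(e.op, tuple(discover(a, catalog) for a in e.args))
    return e

def show(e: Expr) -> str:
    """Print an expression in the notation used on the slides."""
    if isinstance(e, Name):
        return e.name
    if isinstance(e, Located):
        return f"{e.table}@{e.node}"
    parts = [f"({show(a)})" if isinstance(a, Op) else show(a) for a in e.args]
    return e.op + " " + " ".join(parts)

query = Op("proj", (Op("join", (Name("R"), Op("sel", (Name("S"),)))),))
resolved = discover(query, CATALOG)
print(show(query))      # proj (join R (sel S))
print(show(resolved))   # proj (join tableR@A (sel tableS@B))
```

A real system would not have a single complete catalog in one place; how the catalog is obtained is exactly the discovery problem the talk returns to later.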


One language for queries, views, and query plans

[Figure: one language covering the data language, the query language (SQL, XQuery), the view language, and the query plan language, together with processes]


Motives and intentions

Existing view and plan languages are limited.

Capture distributed query processing in a high-level language (SQL vs. C++).

Easier to program self-tuning features.

Easier to generate automatically, e.g., by optimizers.

Easier to model toward formal verification.

My ambition :) is to convince

(1) database people that process calculi are useful,

(2) process calculi people to look at some new problems.


Concurrency and process calculi

Why process calculi?

Because queries arrive asynchronously and must be processed concurrently within a node (and across nodes).

Composition primitives:

e1 | e2      parallel
e1 ; e2      sequential
e1 or e2     nondeterministic choice

where e1, e2 are expressions denoting processes.


Synchronization and communication

CCS (blocking synchronization):

notif L ; e1 | wait L ; e2   →   e1 | e2

pi-calculus (blocking synchronization plus communication on channel c):

new c . ( send c v ; e1 | recv c x. e2(x) )   →   new c . e1 | e2(v)

Anticipation: the blocking operational semantics does not capture query execution well. The old dataflow paradigm might be better for that. But channels are essential.
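The pi-calculus communication rule above can be sketched as one reduction step over a "soup" of parallel processes. This is a toy model under stated assumptions: the `Send`, `Recv`, and `Done` classes and the `step` function are names invented for this sketch, channels are plain strings, and channel scoping (`new c`) is not modeled.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

@dataclass
class Send:
    chan: str
    value: Any
    cont: Any                      # e1, the process that continues after the send

@dataclass
class Recv:
    chan: str
    cont: Callable[[Any], Any]     # x -> e2(x)

@dataclass
class Done:
    label: str                     # a finished process, labeled for inspection

def step(soup: List[Any]) -> Optional[List[Any]]:
    """One communication: send c v ; e1 | recv c x. e2(x)  ->  e1 | e2(v)."""
    for i, p in enumerate(soup):
        if not isinstance(p, Send):
            continue
        for j, q in enumerate(soup):
            if isinstance(q, Recv) and q.chan == p.chan:
                rest = [r for k, r in enumerate(soup) if k not in (i, j)]
                return rest + [p.cont, q.cont(p.value)]
    return None                    # blocked: no matching send/recv pair

soup = [Send("c", 42, Done("e1")), Recv("c", lambda x: Done(f"e2({x})"))]
after = step(soup)
print([d.label for d in after])    # ['e1', 'e2(42)']
```

Both sides are blocked until a partner exists: `step` returns `None` for a lone receiver, which is the blocking behavior the slide argues is a poor fit for streaming query execution.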


Capturing query plans

The original query may be “decomposed” into subordinate queries running at different nodes. Recall

node A: proj (join R (sel S))

Get from here … to here:

[Figure: plan with sel over S running at node B, proj and join with R running at node A, and the result delivered to the client]


Query plan “development”

node A: proj (join R (sel S))

> meta-step (discovery) >

node A: proj (join R@A (sel S@B))

> meta-step (discovery) >

node A: proj (join tableR (sel S@B))

> meta-step (optimization) >

node A: new c . proj (join tableR (split c (sel S@B)))

> operational semantics step >

node A: new c . proj (join tableR (recv c))
        | send c (sel S@B)


A new primitive: split

new c . e1(split c (e2))   →   new c . e1(recv c) | send c e2

(We are using a simplified form of recv since the value passed through the channel is consumed in just one place.)

split is typically introduced by optimization steps.

Alternative (query shipping vs. data shipping):

node A: new c . proj (join tableR (sel (split c S@B)))
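The split rule can be sketched as a rewrite on expression trees: every `split c e2` becomes `recv c` in place, and a parallel `send c e2` process is emitted. The nested-tuple encoding and the `unsplit` function are assumptions made for this illustration; the sketch performs one step and does not look for further splits inside the body of a split.

```python
from typing import List, Tuple, Union

Expr = Union[str, tuple]   # a leaf name, or (operator, arg1, arg2, ...)

def unsplit(e: Expr) -> Tuple[Expr, List[Expr]]:
    """Rewrite e1(split c e2) into e1(recv c) plus a list of parallel senders."""
    if isinstance(e, str):
        return e, []
    if e[0] == "split":
        _, chan, body = e
        return ("recv", chan), [("send", chan, body)]
    new_args, senders = [], []
    for arg in e[1:]:
        rewritten, found = unsplit(arg)
        new_args.append(rewritten)
        senders.extend(found)
    return (e[0], *new_args), senders

# split placed outside sel: the selection travels with the sender
split_outside_sel = ("proj", ("join", "tableR", ("split", "c", ("sel", "S@B"))))
# split placed inside sel: S is sent as is and selected on the receiving side
split_inside_sel = ("proj", ("join", "tableR", ("sel", ("split", "c", "S@B"))))

main1, senders1 = unsplit(split_outside_sel)
main2, senders2 = unsplit(split_inside_sel)
print(main1, senders1)
print(main2, senders2)
```

Where the optimizer places the split decides which work ends up in the sender process and which stays with the receiver, which is the choice between the two alternatives on the slide.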


Distribution and migration

Each node handles a “soup” of parallel processes:

( node A: e1 | e2 ) || ( node B: e3 | e4 | e5 )

Process migration:

( node A: e1 | migrate B e2 ) || ( node B: e3 )   →   ( node A: e1 ) || ( node B: e3 | e2 )

Migration will be used to move subordinate queries.
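The migration rule can be sketched over a network represented as a dictionary from node names to soups of processes. The representation and the `migrate_step` function are assumptions of this sketch; processes are opaque strings except for `("migrate", target, body)` tuples.

```python
from typing import Dict, List, Union

Proc = Union[str, tuple]          # a name like "e1", or ("migrate", target, body)
Network = Dict[str, List[Proc]]   # node name -> soup of parallel processes

def is_migrate(p: Proc) -> bool:
    return isinstance(p, tuple) and p[0] == "migrate"

def migrate_step(net: Network) -> Network:
    """Move the body of every top-level migrate process to its target node."""
    # Processes that stay where they are.
    result: Network = {node: [p for p in soup if not is_migrate(p)]
                       for node, soup in net.items()}
    # Migrating processes join the soup of their target node.
    for soup in net.values():
        for p in soup:
            if is_migrate(p):
                _, target, body = p
                result.setdefault(target, []).append(body)
    return result

before = {"A": ["e1", ("migrate", "B", "e2")], "B": ["e3"]}
after = migrate_step(before)
print(after)   # {'A': ['e1'], 'B': ['e3', 'e2']}
```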


Query plan, continued

node A: new c . proj (join tableR (recv c))
        | send c (sel S@B)

> meta-step (subordination) >

node A: new c . proj (join tableR (recv c))
        | migrate B (send c (sel S@B))

> operational semantics step >

new c@A . ( node A: proj (join tableR (recv c))
         || node B: send c@A (sel S@B) )

Global channels vs. located channels.

We can get by with located channels.


Distributed query evaluation

> meta-step (discovery) >

new c@A . ( node A: proj (join tableR (recv c))
         || node B: send c@A (sel tableS) )

Should not block send/recv on the entire table (all-push). Data should be streamed.

All-pull (“iterators”) blocks recv/send on each tuple: not good.

Push-pull (queued) is most reasonable.
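The push-pull (queued) option can be sketched with a bounded queue standing in for the channel c: the sender blocks only when the buffer is full and the receiver only when it is empty. The tables, the selection predicate, the `END` marker, and the buffer size are all made up for this illustration.

```python
import queue
import threading

END = object()   # end-of-stream marker (an assumption of this sketch)

def send(chan, tuples):
    """Producer: push tuples one by one; blocks only while the queue is full."""
    for t in tuples:
        chan.put(t)
    chan.put(END)

def recv(chan):
    """Consumer: pull tuples as they arrive; blocks only while the queue is empty."""
    while True:
        t = chan.get()
        if t is END:
            return
        yield t

table_s = [("s1", 10), ("s2", 25), ("s3", 40)]   # made-up data at node B
table_r = {"s2": "r-two", "s3": "r-three"}       # made-up data at node A, keyed on the join attribute

c = queue.Queue(maxsize=2)   # bounded buffer between node B and node A

# node B: send c (sel tableS), with a made-up selection predicate
producer = threading.Thread(target=send, args=(c, [t for t in table_s if t[1] > 20]))
producer.start()

# node A: proj (join tableR (recv c)), consuming tuples while B is still producing
result = [table_r[key] for (key, _) in recv(c) if key in table_r]
producer.join()
print(result)   # ['r-two', 'r-three']
```

With `maxsize=1` the behavior approaches the all-pull case, and with an unbounded queue filled before the consumer starts it approaches all-push; the bounded queue sits between the two.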

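[Figure: plan with sel over S running at node B, proj and join with R running at node A, and the result delivered to the client]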


“Lay down the pipes … Turn on the faucets”

The approach of ubQL [Sahuguet & T., 2001]. Separate:

• deployment phase: queries are split; channels are created; subordinate query processes migrate, carrying channel names with them

• execution phase: data is streamed through channels and processed according to the subordinate queries

• adaptive behavior: execution can be interrupted and reset; channels can be flushed and reset; alternative plans can be started


Adaptive query processing

Another new primitive (like Cardelli’s “Web combinators”)

adapt stream AltPlan   =   stream     if adequate
                           AltPlan    if drying up (redundancy)

Example:

new c@A .

node A: proj (join tableR (adapt (recv c) AltPlan))

|| node B: send c@A (sel tableS)

where, e.g., AltPlan = new c’@A . split c’ (sel S@C)
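One possible reading of adapt can be sketched as a generator that consumes the primary stream while it is adequate and switches to the alternative plan once the stream dries up. The talk does not define "drying up" or what happens to tuples already delivered, so both are assumptions here: an empty poll is modeled as `None`, `patience` consecutive empty polls trigger the switch, and tuples already seen are not repeated.

```python
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")

def adapt(stream: Iterable[Optional[T]],
          alt_plan: Callable[[], Iterable[T]],
          patience: int = 2) -> Iterator[T]:
    """Yield from stream while adequate; switch to alt_plan() if it dries up."""
    misses = 0
    seen = set()
    for item in stream:
        if item is None:              # an empty poll: nothing arrived in time
            misses += 1
            if misses >= patience:
                break                 # the stream is drying up
            continue
        misses = 0
        seen.add(item)
        yield item
    else:
        return                        # the stream ended normally, no switch needed
    for item in alt_plan():           # redundant source, e.g. a copy of S at node C
        if item not in seen:
            yield item

primary = [1, 2, None, None, 3]       # stalls after two tuples
alternative = lambda: [1, 2, 3, 4]    # hypothetical redundant source
print(list(adapt(primary, alternative)))        # [1, 2, 3, 4]
print(list(adapt([1, None, 2], alternative)))   # [1, 2]
```

The alternative plan is passed as a function so that it is only started when needed, mirroring the idea that AltPlan creates its own channel c’ when it is invoked.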


Distributed database systems: 25+ years

A lot of useful techniques, especially optimizations to reduce bandwidth. E.g., use of semijoins.
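As a reminder of why semijoins reduce bandwidth, here is a minimal sketch: node A ships only the join-column values of R to node B, B returns only the S rows that will find a partner, and A finishes the join. The function name, the row encoding as dictionaries, and the sample data are assumptions of this sketch.

```python
def semijoin_plan(table_r, table_s, key_r, key_s):
    """Join R (at node A) with S (at node B), shipping only what is needed."""
    # Step 1 (A -> B): ship only the distinct join-column values of R.
    keys = {row[key_r] for row in table_r}
    # Step 2 (at B): semijoin, keeping only S rows that will find a partner.
    s_reduced = [row for row in table_s if row[key_s] in keys]
    # Step 3 (B -> A): ship the reduced S and finish the join at A.
    joined = [{**r, **s} for r in table_r for s in s_reduced if r[key_r] == s[key_s]]
    return joined, len(keys), len(s_reduced)

R = [{"dept": "db", "manager": "ann"}, {"dept": "pl", "manager": "bob"}]
S = [{"name": "ann", "age": 41}, {"name": "cat", "age": 29},
     {"name": "dan", "age": 35}, {"name": "bob", "age": 52}]

joined, keys_shipped, rows_shipped = semijoin_plan(R, S, "manager", "name")
print(joined)
print(keys_shipped, rows_shipped, len(S))   # 2 2 4
```

Here two key values and two rows cross the network instead of all four rows of S; the saving depends on how selective the join is.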

Essential limitation: need a powerful, all-knowing node to generate the query plans.

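[Figure: plan across nodes A and B with operator labels proj, join, sel, and an additional proj, over R and S, delivering to the client]

These are also subordinate queries. Not quite decomposition… still easily expressible.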


Where is the data? Discovery!

• Distributed Database Systems: centralized, complete, consistent knowledge.

• No clue, go out and search everywhere (Gnutella)!

• Keyword-node indexes (catalogs) in each node.

• Finally something clever: Distributed Hash Tables

• Layered organization using views (simple version in Mediated Information Integration: Tsimmis, Kleisli, Information Manifold, K2, Garlic, etc.)
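For the distributed hash table option in the list above, the core idea can be sketched with a consistent-hashing ring: every node can compute which node is responsible for a key without any central catalog. This is a generic textbook construction, not the specific algorithm the talk has in mind; the `Ring` class, the use of SHA-1, and the node names are assumptions.

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Hash a string to a point on the ring."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node sits at the point given by the hash of its name.
        self._points = sorted((h(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        """The first node clockwise from the key's point owns the key."""
        i = bisect.bisect_right(self._points, (h(key), "")) % len(self._points)
        return self._points[i][1]

nodes = ["A", "B", "C"]
ring = Ring(nodes)
owner = ring.owner("tableS")
print(owner in nodes)   # True

# Removing a node that does not own the key leaves the key's owner unchanged.
survivor = next(n for n in nodes if n != owner)
print(Ring([owner, survivor]).owner("tableS") == owner)   # True
```

A real DHT additionally routes the lookup across the network in a bounded number of hops, which this local sketch does not model.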


Views can organize distributed data

node C:  S1 = expr( localTables )
         S2 = expr( localTables )

node B:  R = expr( localTables, S2@C )

node A:  query = expr( R@B, S1@C )
         V = expr( S1@C, S2@C )

composition-with-views

rewriting-with-views
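Composition-with-views can be sketched as unfolding: each view name in the query is replaced by its definition until only local tables remain. The operator names `f`, `g`, `h`, and `q` below are placeholders for the unspecified `expr` on the slide, and the tuple encoding and `unfold` function are assumptions of this sketch, which also assumes view definitions are not recursive.

```python
from typing import Dict, Union

Expr = Union[str, tuple]   # a name such as "R@B", or (operator, arg1, arg2, ...)

# View definitions from the slide, with placeholder operator names.
VIEWS: Dict[str, Expr] = {
    "S1@C": ("f", "localTables@C"),
    "S2@C": ("g", "localTables@C"),
    "R@B": ("h", "localTables@B", "S2@C"),
}

def unfold(e: Expr, views: Dict[str, Expr]) -> Expr:
    """Replace each view name by its definition, recursively."""
    if isinstance(e, str):
        return unfold(views[e], views) if e in views else e
    # Keep the operator name, unfold only the arguments.
    return (e[0],) + tuple(unfold(arg, views) for arg in e[1:])

query = ("q", "R@B", "S1@C")
unfolded = unfold(query, VIEWS)
print(unfolded)
# ('q', ('h', 'localTables@B', ('g', 'localTables@C')), ('f', 'localTables@C'))
```

Rewriting-with-views goes in the opposite direction (recognizing that part of a query can be answered from a view such as V) and is considerably harder; it is not sketched here.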


New kinds of views

Generalize split:

e1(spawn e2 e3)   →   e1(e2) | e3

split c e   =   spawn (recv c) (send c e)

View definition that invokes a remote continuous query:

R = new cr. spawn (recv cr) (send cq cr)

Continuous query “installed” elsewhere:

recv* cq x. send x (sel tableT)

cr is a data stream channel

cq is a standard pi-calculus channel
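The view R and the installed continuous query can be sketched with Python queues standing in for channels: the view creates a fresh reply channel cr, sends its name on cq, and then reads the data stream from cr. The table contents, the selection predicate, and the `END` and `STOP` markers are assumptions of this sketch; a true replicated receiver (`recv*`) would serve requests concurrently, whereas this loop serves them one at a time.

```python
import queue
import threading

END = object()    # end-of-stream marker on a data channel (assumption)
STOP = object()   # shuts the server loop down so the example terminates (assumption)

table_t = [3, 8, 15, 4]   # made-up table at the remote node

def sel(rows):
    return [x for x in rows if x > 3]   # made-up selection predicate

def continuous_query(cq):
    """recv* cq x. send x (sel tableT): answer every reply channel received on cq."""
    while True:
        cr = cq.get()
        if cr is STOP:
            return
        for row in sel(table_t):
            cr.put(row)
        cr.put(END)

def view_R(cq):
    """R = new cr. spawn (recv cr) (send cq cr)."""
    cr = queue.Queue()   # new cr: a fresh data stream channel
    cq.put(cr)           # spawned side: send cq cr
    while True:          # main side: recv cr, consumed as a stream
        row = cr.get()
        if row is END:
            return
        yield row

cq = queue.Queue()       # standard channel carrying channel names
server = threading.Thread(target=continuous_query, args=(cq,))
server.start()
first = list(view_R(cq))
second = list(view_R(cq))
cq.put(STOP)
server.join()
print(first, second)     # [8, 15, 4] [8, 15, 4]
```

Each use of the view gets its own reply channel, so two consumers of R never see each other's tuples.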


Some tentative conclusions

• Very useful process primitives: parallel and sequential composition, migration, channels.

• Need a new behavior for certain channels: streaming data (at least at the language level; underneath we have bounded buffers/queues).

• Standard channel semantics is still useful for passing small values or channel names, e.g., for continuous queries.

• Some new primitives should be considered: split/spawn, adapt.


Where to?

Service calls, active (dynamic) data, done by Gardner and Maffeis.

Clarify semantics of execution phase so we can apply verification techniques for adaptive query plans.

In the presence of redundancy, building globally optimal plans seems hopeless. How to define local optimality? Is there even a concept of “good enough”?

Combine layers of views with distributed hashing at the level of schema?


An interesting complication: services

Service calls in query: to applications that provide additional processing (that cannot be programmed in the query language). E.g., analysis tools, scientific (BLAST) or financial.

Service calls in data (active data): part of the data is to be retrieved only on demand. E.g., Active XML.

Service definition, result computation, and result consumption all in potentially different nodes.


Mixing query features with process features

In principle we can take any query language (SQL, OQL, XQuery) and add the process primitives we saw.

Really? The query language had better be nicely “compositional” and “referentially transparent”. SQL: lots of special cases…

Even better is to also have a robust static type system.

Things can get complicated:

foreach r in Dept@A collect
    foreach s in Emp@B where s.name = r.manager collect {s.age}


Where is the relevant data? (2)

The basic algorithm can be iterated and made useful to databases (Ives). But what to do with irregular, pre-existing data with complex schema?


Some tentative conclusions (2)

• Services are modeled in the same framework with little effort. Service call arguments can also be carried as part of migrating processes.

• Did not show it, but subscriptions, continuous queries, caching, proxying (query redirecting) can be approached too. Useful feature from process calculi: replicated processes.

• Well-designed query languages will mix nicely with process calculi primitives.

• Operational semantics steps and meta-steps must be interleaved (this makes me uncomfortable!)