Lookup Tables: Fine-Grained Partitioning for Distributed Databases
Aubrey L. Tatarowicz, Carlo Curino, Evan P. C. Jones, Sam Madden
Massachusetts Institute of Technology, USA


Page 1:

Lookup Tables: Fine-Grained Partitioning for Distributed Databases

Aubrey L. Tatarowicz, Carlo Curino, Evan P. C. Jones, Sam Madden
Massachusetts Institute of Technology, USA

Page 2:

To scale a distributed OLTP DBMS, the database must be partitioned horizontally across nodes.

To be effective, the partitioning strategy must minimize the number of nodes involved in each query or transaction.

The most common strategy is to horizontally partition the database using hash or range partitioning.

BACKGROUND

Page 3:

Many-to-many relationships are hard to partition.

For social networking workloads, simple partitioning schemes create a large fraction of distributed queries and transactions.

While queries on the partitioning attribute go to a single partition, queries on other attributes must be broadcast to all partitions.

Problems

Page 4:

Use a fine-grained partitioning strategy: related individual tuples are co-located in the same partition.

Use a partition index: it specifies which partitions contain tuples matching a given attribute value, without partitioning the data by those attributes.

Solution: Lookup Tables

Page 5:

To solve both the fine-grained partitioning and partition index problems, we introduce lookup tables.

Lookup tables map from a key to a set of partition ids that store the corresponding tuples.

Lookup tables are small enough that they can be cached in memory on the database query routers, even for large databases.

Lookup Tables
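
As a concrete illustration, here is a minimal sketch of such a mapping, assuming a plain in-memory Java map; the class and method names are made up for illustration and are not the paper's implementation.

```java
import java.util.*;

// Minimal sketch of a lookup table: key value -> ids of the partitions holding
// matching tuples. Names are illustrative, not taken from the paper's code.
public class LookupTable {
    private final Map<Long, Set<Integer>> keyToPartitions = new HashMap<>();

    // Record that tuples with this key live on the given partition.
    public void put(long key, int partitionId) {
        keyToPartitions.computeIfAbsent(key, k -> new HashSet<>()).add(partitionId);
    }

    // Return the partitions to query for this key; empty if the key is unknown.
    public Set<Integer> lookup(long key) {
        return keyToPartitions.getOrDefault(key, Collections.emptySet());
    }
}
```

In the common case each key maps to a single partition; the set form also covers keys whose tuples are replicated on several partitions.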

Page 6:

Lookup tables must be stored compactly in RAM, to avoid adding disk accesses when processing queries.

Lookup tables must also be maintained efficiently in the presence of updates.

Challenges for lookup tables

Page 7:

Applications interact with the database using a JDBC driver. The system consists of two layers:
• Backend databases (each with an agent)
• Query routers, which contain the lookup tables and partitioning metadata

OVERVIEW: The structure of our system

Page 8:

The routers are given the network address of each backend, the schema, and the partitioning metadata when they are started.

Lookup tables are stored in memory and consulted to determine which backends should run each query.

Query routers send queries to the backend databases.

This results in excellent performance, providing 40% to 300% better throughput than hash or range partitioning.

Overview: Basic flow path of the system

Page 9:

LOOKUP TABLE QUERY PROCESSING

START-UP, UPDATES AND RECOVERY

STORAGE ALTERNATIVES

EXPERIMENTAL EVALUATION

CONCLUSION

Content

Page 10:

When the router receives a query, the lookup tables first tell it which backends store the data that is referenced.

If a query references a column that uses a lookup table, the router consults its local copy of the lookup table and determines where to send the query.

If multiple backends are referenced, the query is rewritten and a separate query is sent to each backend.

Basic Lookup Table Operation
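
A minimal sketch of this routing decision, assuming a query with an equality predicate on a lookup-table column; the lookup table is represented as a plain map and the backend interface is simplified, so none of these names come from the paper.

```java
import java.util.*;

// Sketch of the routing step described above. The lookup table is a plain map from
// key to the set of partition ids; all names are illustrative assumptions.
public class QueryRouter {
    private final Map<Long, Set<Integer>> lookupTable;      // e.g., users.id -> partitions
    private final Map<Integer, BackendConnection> backends; // partition id -> connection

    public QueryRouter(Map<Long, Set<Integer>> lookupTable,
                       Map<Integer, BackendConnection> backends) {
        this.lookupTable = lookupTable;
        this.backends = backends;
    }

    public void route(String sql, long key) {
        Set<Integer> partitions = lookupTable.getOrDefault(key, Collections.emptySet());
        if (partitions.isEmpty()) {
            // No entry for this key: fall back to broadcasting to every backend.
            for (BackendConnection b : backends.values()) b.execute(sql);
        } else {
            // One query per referenced backend; with a single partition this is one message.
            for (int p : partitions) backends.get(p).execute(sql);
        }
    }
}

interface BackendConnection {
    void execute(String sql);
}
```

With a single matching partition this costs one message; the broadcast fallback for unknown keys is revisited on the partial lookup table slide later.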

Page 11:

We will use two tables:
• users(id, status)
• followers(source, destination)

source and destination are foreign keys referencing users.id.

Basic operation: Example

Page 12:

A user wants to get the status of all the users they are following:

R = SELECT destination FROM followers WHERE source = x
SELECT * FROM users WHERE id IN (R)

Page 13:

Traditional hash partitioning:
• Partition the users table by id
• Partition the followers table by source

Problem:
• The second query accesses several partitions
• It is hard to scale this system by adding more machines

Example: Partitioning

Page 14:

Example data (the slide shows it spread across partitions by hash partitioning):

users(id, status): (1, 0), (2, 0), (3, 1), (4, 1)

followers(source, destination): (1, 2), (1, 4), (2, 1), (3, 1), (3, 2), (3, 4)

With users hashed on id and followers hashed on source, the follower rows a user reads point at users stored on other partitions, so the second query touches several partitions.

Page 15:

Intelligent partitioning: co-locate users who share many friends.

Partition 1:
users(id, status): (1, 0), (3, 1)
followers(source, destination): (1, 2), (1, 4), (3, 1), (3, 2), (3, 4)

Partition 2:
users(id, status): (2, 0), (4, 1)
followers(source, destination): (2, 1), (4, 1)

Page 16:

Defining Lookup Tables

Query Planning

LOOKUP TABLE QUERY PROCESSING

Page 17:

CREATE TABLE users (
  id int, ...,
  PRIMARY KEY (id),
  PARTITION BY lookup(id) ON (part1, part2)
  DEFAULT NEW ON hash(id));

This says that users is partitioned with a lookup table on id.

ALTER TABLE users SET PARTITION=part2 WHERE id=27;

This places one or more users into a given partition.

Define Lookup Tables

Page 18:

ALTER TABLE followers
  PARTITION BY lookup(source) SAME AS users;

This specifies that the followers table should be partitioned in the same way as the users table: each followers tuple f is placed on the same partition as the users tuple u where u.id = f.source.

CREATE SECONDARY LOOKUP l_a ON users(name);

This defines a partition index: it specifies that a lookup table l_a on users(name) should be maintained.

Page 19:

Each router maintains a copy of the partitioning metadata. This metadata describes how each table is partitioned or replicated.

The router parses each query to extract the tables and attributes that are being accessed.

The goal is to push the execution of queries to the backend nodes, involving as few of them as possible.

Query Planning
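
As a hypothetical illustration of this planning step (not the paper's actual planner), a query such as SELECT * FROM users WHERE id IN (...) can be split so that each backend receives one rewritten query listing only the ids it actually stores. The planner class and single-partition map below are assumptions.

```java
import java.util.*;

// Group the ids in an IN-list by partition and emit one rewritten query per backend.
// The lookup table here is a plain map from id to partition id (one partition per key).
public class InListPlanner {
    public static Map<Integer, String> plan(Map<Long, Integer> lookupTable, List<Long> ids) {
        Map<Integer, List<Long>> idsByPartition = new HashMap<>();
        for (long id : ids) {
            Integer partition = lookupTable.get(id);
            if (partition != null) {
                idsByPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(id);
            }
        }
        Map<Integer, String> queries = new HashMap<>();
        for (Map.Entry<Integer, List<Long>> e : idsByPartition.entrySet()) {
            StringJoiner in = new StringJoiner(", ");
            for (long id : e.getValue()) in.add(Long.toString(id));
            queries.put(e.getKey(), "SELECT * FROM users WHERE id IN (" + in + ")");
        }
        return queries; // partition id -> rewritten query to send to that backend
    }
}
```

Grouping the keys by partition keeps the number of backends involved as small as the data placement allows.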

Page 20:

When starting, each router knows the network address of each backend. This is part of the static configuration data.

The router then attempts to contact other routers to copy their lookup tables.

As a last resort, it contacts each backend agent to obtain the latest copy of each lookup table subset.

START-UP

Page 21:

To ensure correctness, the copy of the lookup table at each router is considered a cache that may not be up to date.

To keep the routers up to date, backends piggyback changes with query responses.

This is only a performance optimization, and is not required for correctness.

UPDATES

Page 22:

Lookup table entries are usually unique: a key typically maps to a single partition.

If tuples are found, the existence of a tuple on a backend indicates that the query was routed correctly.

Otherwise, the lookup table entry is either stale or there is no lookup table entry for that key.

Piggyback changes
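
A minimal sketch of how a router might react to an empty result under this rule: treat its entry as stale or missing, fall back to a broadcast, and repair the mapping from whichever backend returns the tuple. The broadcast-and-repair flow and all names here are assumptions for illustration, not the paper's exact protocol.

```java
import java.util.*;

// Sketch: if the backend we routed to returns nothing, the local entry may be stale
// or missing, so broadcast and refresh the mapping from the backend that has the tuple.
public class StaleEntryHandler {
    public static Optional<String> fetch(Map<Long, Integer> lookupTable,
                                         Map<Integer, SimpleBackend> backends,
                                         long key) {
        Integer expected = lookupTable.get(key);
        if (expected != null) {
            String row = backends.get(expected).fetchByKey(key);
            if (row != null) return Optional.of(row);    // routed correctly
        }
        // Stale or missing entry: fall back to a broadcast and repair the table.
        for (Map.Entry<Integer, SimpleBackend> e : backends.entrySet()) {
            String row = e.getValue().fetchByKey(key);
            if (row != null) {
                lookupTable.put(key, e.getKey());         // refresh the mapping
                return Optional.of(row);
            }
        }
        return Optional.empty();                           // key does not exist anywhere
    }
}

interface SimpleBackend {
    String fetchByKey(long key);  // returns the matching row, or null if absent
}
```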

Page 23:

Lookup tables must be stored in RAM to avoid imposing a performance penalty.

Two implementations of lookup tables:
• Hash tables: they can support any data type and sparse key spaces, and hence are a good default choice.
• Arrays: they work better for dense key spaces, but are not always an option because they require mostly-dense, countable key spaces.

STORAGE ALTERNATIVES
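
A minimal sketch of the array alternative for a dense integer key space; the one-byte-per-key layout, the -1 "no entry" marker, and the names are assumptions, not the paper's encoding.

```java
// Array-backed lookup table for a dense, countable key space.
// Keys are offsets from a base id; each slot holds a small partition id.
// A value of -1 marks a key with no entry.
public class ArrayLookupTable {
    private final long baseKey;       // smallest key in the dense range
    private final byte[] partitionOf; // one byte per key covers up to 127 partitions

    public ArrayLookupTable(long baseKey, int size) {
        this.baseKey = baseKey;
        this.partitionOf = new byte[size];
        java.util.Arrays.fill(partitionOf, (byte) -1);
    }

    public void put(long key, int partitionId) {
        partitionOf[(int) (key - baseKey)] = (byte) partitionId;
    }

    // Returns the partition id, or -1 if the key has no entry.
    public int lookup(long key) {
        long index = key - baseKey;
        if (index < 0 || index >= partitionOf.length) return -1;
        return partitionOf[(int) index];
    }
}
```

At one byte per key, hundreds of millions of keys fit in a few hundred megabytes even before compression, which is why arrays are attractive when the key space is dense.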

Page 24:

Lookup Table Reuse
• Reuse the same lookup table in the router for tables with location dependencies
• At the cost of slightly more complex handling of metadata

Compressed Tables
• Trade CPU time to reduce space
• Specifically, we used Huffman encoding

Efficiently store large lookup tables in RAM

Page 25:

Hybrid Partitioning
• Combine the fine-grained partitioning of a lookup table with the space-efficient representation of range or hash partitioning
• The idea is to place "important" tuples in specific partitions, while treating the remaining tuples with a default policy (see the sketch below)
• To derive a hybrid partitioning, we use decision tree classifiers
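
A minimal sketch of the lookup side of such a hybrid: "important" keys get explicit entries and everything else falls through to a default hash policy. The exception map, the hash default, and the names are assumptions; deriving the split with decision tree classifiers is not shown.

```java
import java.util.*;

// Hybrid placement: explicit lookup entries for important keys, hash for the rest.
public class HybridLookupTable {
    private final Map<Long, Integer> exceptions = new HashMap<>(); // fine-grained entries
    private final int numPartitions;                               // for the default policy

    public HybridLookupTable(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public void placeExplicitly(long key, int partitionId) {
        exceptions.put(key, partitionId);
    }

    public int partitionFor(long key) {
        Integer explicit = exceptions.get(key);
        if (explicit != null) return explicit;        // important tuple: explicit entry
        int h = Long.hashCode(key) % numPartitions;   // remaining tuples: default hash policy
        return h < 0 ? h + numPartitions : h;
    }
}
```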

Page 26:

Partial Lookup Tables
• Trade memory for performance by maintaining only the recently used part of a lookup table
• This is effective if the data is accessed with skew
• The basic approach is to allow each router to maintain its own least-recently-used lookup table over part of the data
• If the id being accessed is not found in the table, the router falls back to a broadcast query and adds the mapping to its current table (see the sketch below)
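
A minimal sketch of such a partial table built on an access-ordered LinkedHashMap; the capacity, names, and the broadcast interface are assumptions for illustration.

```java
import java.util.*;

// Partial lookup table: an LRU cache over part of the key space. On a miss the router
// broadcasts the query and remembers which backend actually held the tuple.
public class PartialLookupTable {
    private final int capacity;
    private final LinkedHashMap<Long, Integer> cache;

    public PartialLookupTable(int capacity) {
        this.capacity = capacity;
        // Access-order LinkedHashMap evicts the least-recently-used entry when full.
        this.cache = new LinkedHashMap<Long, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Integer> eldest) {
                return size() > PartialLookupTable.this.capacity;
            }
        };
    }

    // Returns the partition for the key, broadcasting on a miss and caching the answer.
    public int lookup(long key, Broadcaster broadcaster) {
        Integer partition = cache.get(key);
        if (partition == null) {
            partition = broadcaster.findPartitionOf(key); // broadcast, note who answered
            cache.put(key, partition);                    // add the mapping to the table
        }
        return partition;
    }
}

interface Broadcaster {
    int findPartitionOf(long key); // sends the query to all backends, returns the owner
}
```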

Page 27:

Backend nodes run Linux and MySQL. The backend servers are older single-CPU, single-disk systems.

The query router is written in Java and communicates with the backends using MySQL's protocol via JDBC.

All machines were connected to the same gigabit Ethernet switch. The network was not a bottleneck.

EXPERIMENTAL EVALUATION

Page 28:

We partition the Wikipedia data using both lookup tables and hash/range partitioning.

The snapshot includes approximately 1.5 million entries in each of the revision and text tables, and occupies 36 GB of space in MySQL.

We extracted the most common operation: fetching the current version of an article.

EXPERIMENTAL EVALUATION: Wikipedia

Page 29:

Wikipedia

Page 30:

R = SELECT pid, latest FROM page WHERE title = 'world'
Z = SELECT rid, page, text_id FROM R, revision WHERE revision.page = R.pid AND revision.rid = R.latest
SELECT text.tid FROM text WHERE text.tid = Z.text_id

Query statements

Page 31:

Alternative 1: partition page on title, revision on rid, and text on tid.

The first query is efficient and goes to a single partition: 1 message.

The join must be executed in two steps across all partitions (fetch page by pid, which queries all partitions, then fetch revision where rid = p.latest): k+1 messages.

Finally, text can be fetched directly from one partition: 1 message.

Alternative 1

Page 32:

The read-only distributed transaction can be committed with another broadcast to all partitions (a distributed transaction that accesses more than one partition must use two-phase commit; thanks to the 2PC read-only optimization a single broadcast round suffices): k messages.

Total: 2k + 3 messages

Page 33:

Alternative 2: partition page on pid, revision on page, and text on tid.

The first query goes everywhere: k messages.

The join is pushed down to a single partition: 1 message.

The final query goes to a single partition: 1 message.

With the commit broadcast, this results in a total of 2k + 2 messages.

Alternative 2

Page 34:

Alternative 3: hash or range partition page on title: 1 message.

Build a lookup table on page.pid and co-locate revisions with their corresponding page by partitioning revision using the lookup table: 2 messages.

Create a lookup table on revision.text_id and partition text on text.tid = revision.text_id: 1 message.

A total of 4 messages.

Alternative 3

Page 35:

The number of messages required for hash and range partitioning grows linearly with the number of backends, implying that this solution will not scale. Lookup tables need only a constant number of messages as the number of backends grows, and thus scale better.
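
For example, plugging k = 8 backends into the counts above: Alternative 1 needs 2k + 3 = 19 messages and Alternative 2 needs 2k + 2 = 18 messages per article fetch, while the lookup-table plan still needs 4.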

Page 36:

Because the lookup keys are mostly dense integers (76% to 92% dense), we use an array implementation of lookup tables.

We reuse lookup tables when there are location dependencies. In this case, one lookup table is shared by page.pid and revision.page, and a second table is shared by revision.text_id and text.tid.

We can store the entries for the 360 million tuples in the complete Wikipedia snapshot in less than 200 MB of memory, which easily fits in RAM.

Storage Optimization
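
As a rough sanity check of these figures: 200 MB for 360 million tuples is about 0.55 bytes, or roughly 4-5 bits, per tuple, which is plausible once the same array is shared across location-dependent tables and compressed.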

Page 37:

The primary benefit of lookup tables is to reduce the number of distributed queries and transactions.

To examine the cost of distributed queries, we:
• Scale the number of backends
• Increase the percentage of distributed queries

Cost of Distributed Queries

Page 38:

Throughput with 1, 4, and 8 backends: as the percentage of distributed queries increases, the throughput decreases. The reason is that the communication overhead for each query is a significant cost.

Page 39:

We partitioned the Wikipedia data across 1, 2, 4, and 8 backends, with both hash partitioning and lookup tables.

Example: Wikipedia

Page 40:

Shared-nothing distributed databases typically support only hash or range partitioning of the data.

Lookup tables can be used with all these systems, in conjunction with their existing support for partitioning.

Related Work

Page 41:

We use lookup tables as a type of secondary index for tables that are accessed via more than one attribute.

Bubba proposed Extended Range Declustering, where a secondary index on the non-partitioned attributes is created and distributed across the database nodes.

Our approach simply stores this secondary data in memory across all query routers, avoiding an additional round trip.

Page 42:

Previous work has argued that hard-to-partition applications containing many-to-many relationships can be partitioned effectively by placing tuples in partitions based on their relationships.

Schism uses graph partitioning algorithms to derive such a partitioning, but it does not discuss how to use the fine-grained partitioning it produces.

Page 43:

Using lookup tables, application developers can implement any partitioning scheme they desire, and can also create partition indexes that make it possible to efficiently route queries to just the partitions they need to access.

The paper presented a set of techniques to efficiently store and compress lookup tables, and to manage updates, inserts, and deletes to them.

CONCLUSION

Page 44:

With these applications, we showed that lookup tables with an appropriate partitioning scheme can achieve 40% to 300% better performance than either hash or range partitioning, and show greater potential for further scale-out.

Page 45:

THANK YOU FOR YOUR TIME!