MegaStore Google Inc. Presented by: Noha Elprince 22 June, 2011 Jason Baker, Chris Bond, James C Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh. CIDR 2011.



What is Megastore?

•  A storage system developed to meet the requirements of today's online interactive services.
•  Megastore is the data engine supporting the Google App Engine (GAE): https://appengine.google.com/
•  GAE cloud computing technology:
   -  Hosts/virtualizes web apps across multiple servers on Google's platform.
   -  Fast development and deployment.
   -  Simple administration.
   -  No need to worry about hardware, patches, backups, or scalability.


Outline

•  Motivation & Problem
•  Methodology
•  Design of Megastore
   -  Data Model
   -  Data Storage
   -  Transactions and Concurrency Control
•  How Megastore achieves Availability and Scalability
   -  Paxos
   -  Megastore's approach
•  Experience
•  Related Work
•  Conclusion


Megastore: Motivation

•  Storage requirements of today's interactive online applications:
   -  High scalability
   -  Rapid development
   -  Low latency
   -  Durability and consistency
   -  Availability and fault tolerance
•  These requirements are in conflict!


CAP Theorem (Eric Brewer, 2000)

"In a distributed database system, you can only have at most two of the following three characteristics":
•  Consistency
•  Availability
•  Partition tolerance

ACID = Atomicity, Consistency, Isolation, Durability.


Problem

•  Trade-offs between the existing systems:
   -  RDBMS: a rich set of features and an expressive language that helps development, but difficult to scale. E.g., MySQL, PostgreSQL, MS SQL Server, Oracle RDB.
   -  NoSQL datastores: highly scalable, but with limited APIs and loose consistency models. E.g., Google's Bigtable, Apache Hadoop's HBase, Facebook's Cassandra.
•  The reliability of a single datacenter can't be guaranteed 100%.

["Always expect the unexpected" (James Patterson)]


Methodology

•  Megastore blends the scalability of NoSQL with the convenience of a traditional RDBMS.
•  High reliability is achieved by:
   -  Keeping data in multiple datacenters.
   -  Writing to a majority of datacenters synchronously.
   -  Letting the infrastructure decide which datacenter to read from and write to.
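The "write to a majority" rule can be illustrated with a few lines of Python. This is only a sketch of the counting rule, not Megastore code; the function names and the datacenter labels are made up for the example.

```python
def majority(n_replicas: int) -> int:
    """Smallest number of acknowledgements that forms a majority."""
    return n_replicas // 2 + 1


def write_committed(acks: dict) -> bool:
    """A write counts as committed once a majority of datacenters acknowledged it.

    acks maps a (hypothetical) datacenter name to True/False.
    """
    return sum(acks.values()) >= majority(len(acks))


# With three datacenters, two acknowledgements are enough.
print(write_committed({"dc-a": True, "dc-b": True, "dc-c": False}))
```

Because any two majorities overlap in at least one datacenter, a later reader that contacts a majority is guaranteed to meet a replica that saw the write.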


Design of Megastore: Data Model

•  The data model is declared in a schema.
•  Each schema has a set of tables: root tables or child tables.
•  Entity group: consists of a root entity along with all of its child entities.

CREATE SCHEMA PhotoApp;

CREATE TABLE User {
  required int64 user_id;
  required string name;
} PRIMARY KEY(user_id), ENTITY GROUP ROOT;

CREATE TABLE Photo {
  required int64 user_id;
  required int32 photo_id;
  required int64 time;
  required string full_url;
  optional string thumbnail_url;
  repeated string tag;
} PRIMARY KEY(user_id, photo_id),
  IN TABLE User,
  ENTITY GROUP KEY(user_id) REFERENCES User;


Design of Megastore: Data Model

•  (Hierarchical) data is denormalized to eliminate join costs.
•  Joins are implemented at the application level.
•  Outer joins are built from parallel queries using secondary indexes.
•  This provides an efficient stand-in for SQL-style joins.
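An application-level outer join with parallel lookups could look roughly like the sketch below. The dictionaries are in-memory stand-ins for the User table and a per-user photo index; user 103 ("Ann") and all function names are hypothetical and only there to show the unmatched-row case.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the User table and a photo index.
USERS = {101: "John", 102: "Mary", 103: "Ann"}
PHOTOS_BY_USER = {101: [501, 502], 102: [600, 601]}


def photos_for(user_id):
    """Second-stage index lookup; returns [] when the user has no photos."""
    return PHOTOS_BY_USER.get(user_id, [])


def users_left_join_photos(user_ids):
    """Left outer join of users with their photos, done in application code."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The per-user lookups run in parallel; map() keeps the input order.
        photo_lists = list(pool.map(photos_for, user_ids))
    rows = []
    for uid, photos in zip(user_ids, photo_lists):
        if photos:
            rows.extend((uid, USERS[uid], pid) for pid in photos)
        else:
            rows.append((uid, USERS[uid], None))  # outer join keeps unmatched users
    return rows


print(users_left_join_photos([101, 103]))
```

The first lookup produces the list of keys, and the second stage fans out one index lookup per key, which is why the join cost is paid by the application rather than by the datastore.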


Design of Megastore: Data Storage

How is it stored in Bigtable?

"A Bigtable is a compressed, high-performance, proprietary database system built on the Google File System (GFS), the Chubby lock service, and other Google programs."


Design of Megastore: Data Storage

Example:

User { user_id: 101, name: 'John' }
Photo { user_id: 101, photo_id: 501, time: 2009, full_url: 'john-pic1', tag: 'vacation', tag: 'holiday', tag: 'Paris' }
Photo { user_id: 101, photo_id: 502, time: 2010, full_url: 'john-pic2', tag: 'office', tag: 'friends', tag: 'pub' }

User { user_id: 102, name: 'Mary' }
Photo { user_id: 102, photo_id: 600, time: 2009, full_url: 'mary-pic1', tag: 'office', tag: 'picnic', tag: 'Paris' }
Photo { user_id: 102, photo_id: 601, time: 2011, full_url: 'mary-pic2', tag: 'birthday', tag: 'friends' }

Row Key     User.name   Photo.time   Photo.tag                  Photo URL
101         John
101, 501                2009         Vacation, Holiday, Paris
101, 502                2010         Office, Friends, Pub
102         Mary
102, 600                2009         Office, Picnic, Paris
102, 601                2011         Birthday, Friends
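The layout in the table works because rows are kept in key order, so a root row is immediately followed by its child rows. A small Python sketch of that idea, using tuples as stand-ins for the real (byte-string) row keys; the helper names are hypothetical:

```python
def user_row_key(user_id):
    return (user_id,)


def photo_row_key(user_id, photo_id):
    return (user_id, photo_id)


# Rows inserted in arbitrary order, as (row key -> columns).
rows = {
    photo_row_key(102, 600): {"Photo.time": 2009},
    user_row_key(101): {"User.name": "John"},
    photo_row_key(101, 502): {"Photo.time": 2010},
    user_row_key(102): {"User.name": "Mary"},
    photo_row_key(101, 501): {"Photo.time": 2009},
}


def scan_entity_group(rows, user_id):
    """Range scan: every key that starts with the root key, in key order."""
    return [key for key in sorted(rows) if key[0] == user_id]


print(sorted(rows))
print(scan_entity_group(rows, 101))
```

Sorting places `(101,)` before `(101, 501)` and `(101, 502)`, so one contiguous scan returns a whole entity group.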


Design of Megastore: Data Storage

•  Indexing
   -  Local index: finds data within an entity group.
      CREATE LOCAL INDEX PhotosByTime ON Photo(user_id, time);
   -  Global index: spans entity groups.
      CREATE GLOBAL INDEX PhotosByTag ON Photo(tag) STORING (thumbnail_url);
•  The STORING clause
   -  Keeps a copy of the listed properties in the index entry itself, so they can be retrieved faster, without fetching the primary row.


Design of Megastore: Data Storage

How is it stored in Bigtable?

PhotosByTime

Row Key
101, 2009, 101, 501
101, 2010, 101, 502
102, 2009, 102, 600
102, 2011, 102, 601

PhotosByTag

Row Key              Thumbnail.Url
Birthday, 102, 601   …
Friends, 101, 502    …
Friends, 102, 601    …
Holiday, 101, 501    …
Office, 101, 502     …
Office, 102, 600     …
Paris, 101, 501      …
Paris, 102, 600      …
Pub, 101, 502        …
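How the two index tables are derived from the Photo rows can be sketched in Python. Each index row key is the indexed value(s) followed by the photo's primary key, and a repeated field such as `tag` yields one index entry per value. The function names and the in-memory list are assumptions for illustration only.

```python
PHOTOS = [
    {"user_id": 101, "photo_id": 501, "time": 2009,
     "tag": ["Vacation", "Holiday", "Paris"]},
    {"user_id": 101, "photo_id": 502, "time": 2010,
     "tag": ["Office", "Friends", "Pub"]},
]


def photos_by_time(photos):
    """Local index: (user_id, time) followed by the primary key."""
    return sorted(
        (p["user_id"], p["time"], p["user_id"], p["photo_id"]) for p in photos
    )


def photos_by_tag(photos):
    """Global index on a repeated field: one entry per tag value."""
    return sorted(
        (tag, p["user_id"], p["photo_id"]) for p in photos for tag in p["tag"]
    )


for entry in photos_by_tag(PHOTOS):
    print(entry)
```

Two photos with three tags each produce six entries in the tag index, which is why global indexes over repeated fields can grow faster than the primary table.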


Transactions and Concurrency Control

•  Each entity group acts as a mini-database that provides ACID semantics.
•  Transaction management uses write-ahead logging (WAL).
•  Bigtable feature: the ability to store multiple values in the same row/column cell with different timestamps.
•  Cross-entity-group transactions are supported via two-phase commit (2PC).
•  Entities in an entity group employ multiversion concurrency control (MVCC).
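The timestamped-cell feature that MVCC relies on can be modelled in a few lines. The class below is a toy, single-process stand-in for one Bigtable cell, not Bigtable's API; all names are illustrative.

```python
import bisect


class VersionedCell:
    """One row/column cell holding several timestamped values."""

    def __init__(self):
        self._timestamps = []  # kept sorted ascending
        self._values = []      # values aligned with _timestamps

    def write(self, timestamp, value):
        i = bisect.bisect_left(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)

    def read(self, at_timestamp):
        """Return the newest value written at or before at_timestamp."""
        i = bisect.bisect_right(self._timestamps, at_timestamp)
        return self._values[i - 1] if i else None


cell = VersionedCell()
cell.write(10, "a")
cell.write(20, "b")
print(cell.read(15))  # a reader at timestamp 15 is unaffected by the later write
```

A reader pinned to a timestamp never sees later writes, so readers and writers do not have to block each other.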


Transactions and Concurrency Control

•  MVCC (multiversion concurrency control): using timestamps, reads and writes do not block each other.
•  Read consistency
   -  Current: first ensures that all previously committed writes are applied, then reads at the timestamp of the latest committed transaction.
   -  Snapshot: does not wait; reads at the timestamp of the last fully applied transaction, even if some committed writes have not been applied yet.
   -  Inconsistent: ignores the state of the log and reads the latest values directly (data may be stale).
•  Write consistency
   -  Determine the next available log position.
   -  Assign the mutations in the write-ahead log (WAL) entry a timestamp higher than any previous one.
   -  Use Paxos to settle the contention: one writer wins the log position for the entity group; the others abort and retry their operations.
   -  Writes (mutations) use optimistic concurrency control (OCC): a transaction assumes its data does not conflict with others and proceeds without locks.
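The optimistic write path (read the next log position, then try to claim it) can be sketched as follows. This is a single-process toy: in Megastore the winner of a log position is decided by Paxos across replicas, whereas here a simple length check stands in for that decision. All names are hypothetical.

```python
class EntityGroupLog:
    """Write-ahead log for one entity group; list index = log position."""

    def __init__(self):
        self.entries = []

    def next_position(self):
        return len(self.entries)

    def try_commit(self, position, mutations):
        """Commit succeeds only if nobody else has taken this position."""
        if position != len(self.entries):
            return False  # lost the race: the caller must re-read and retry
        self.entries.append(mutations)
        return True


log = EntityGroupLog()
pos_a = log.next_position()  # writer A and writer B both observe position 0
pos_b = log.next_position()
print(log.try_commit(pos_a, ["A's mutation"]))  # A wins the position
print(log.try_commit(pos_b, ["B's mutation"]))  # B detects the conflict
print(log.try_commit(log.next_position(), ["B's mutation"]))  # B retries
```

No lock is held between reading the position and committing, which is the optimistic part: conflicts are detected at commit time and resolved by retrying.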


Transactions and Concurrency Control

•  Queues
   -  Provide transactional messaging between entity groups.
   -  Each message has a single sending and a single receiving entity group, and is either:
      -  Synchronous: the sending and receiving entity group are the same.
      -  Asynchronous: the sending and receiving entity groups differ.
   -  Useful for operations that affect many entity groups.

Fig. Operations across entity groups


Transactions and Concurrency Control

•  Two-Phase Commit (2PC)
   -  Coordinator: the component that receives the commit/abort request.
   -  Participants: the resource managers that did work on behalf of the transaction (by reading/updating resources).
   -  Goal: ensure that the coordinator and all participants either all commit or all abort the transaction, so that atomicity is satisfied.
   -  Disadvantage: high latency.
   -  Advantage: simplifies code for unique secondary key enforcement.

Source: Ref. [2]
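The coordinator/participant roles can be made concrete with a minimal, failure-free sketch of two-phase commit. It ignores timeouts, logging, and recovery, and the class and function names are assumptions for illustration.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        """Phase 1: vote yes or no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):
        """Phase 2: obey the coordinator's decision."""
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants):
    """Coordinator: commit only if every participant voted yes."""
    votes = [p.prepare() for p in participants]  # phase 1: collect all votes
    decision = all(votes)
    for p in participants:                       # phase 2: broadcast decision
        p.finish(decision)
    return decision


group_a, group_b = Participant("group-a"), Participant("group-b", can_commit=False)
print(two_phase_commit([group_a, group_b]), group_a.state, group_b.state)
```

The two sequential rounds of messages in this sketch are where the extra latency noted on the slide comes from.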


Other Features

•  Integrated backup system
   -  Used to restore an entity group's state to any point in time.
•  Data encryption
   -  Uses a distinct key per entity group.


Megastore: Availability / Scalability

Megastore Replication System

•  Replication is done per entity group, by synchronously replicating the group's transaction log to a number of replicas.
•  Reads and writes can be initiated from any replica.
•  Writes typically require one round of inter-datacenter communication.
•  ACID semantics are preserved regardless of which replica a client starts from.

Fig. Scalable Replication


Megastore: Replication

•  Paxos algorithm
   -  A way to reach consensus among a group of replicas on a single value.
   -  Databases typically use Paxos to replicate a transaction log, where a separate instance of Paxos is used for each position in the log.
   -  Advantage: tolerates delayed or reordered messages and replicas that fail by stopping (with 2F+1 replicas it tolerates up to F failures, i.e. fewer than N/2).
   -  Disadvantage: high latency, because it demands multiple rounds of communication, so Megastore uses an improved version.

Source: Ref. [3]
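A minimal, in-process sketch of one Paxos instance (one log position) shows the two rounds the slide refers to. Method calls stand in for network messages, there are no failures or retries, and the names are illustrative rather than taken from any Megastore code.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest proposal number promised so far
        self.accepted = None  # (number, value) of the last accepted proposal

    def prepare(self, n):
        """Round 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Round 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run one Paxos round; return the chosen value, or None if it fails."""
    quorum = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [accepted for ok, accepted in replies if ok]
    if len(promises) < quorum:
        return None
    prior = [accepted for accepted in promises if accepted is not None]
    if prior:
        # A value may already be chosen: adopt the highest-numbered one.
        value = max(prior, key=lambda nv: nv[0])[1]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= quorum else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "x"))
print(propose(acceptors, 2, "y"))  # a later proposer must adopt the chosen value
```

The second call illustrates the safety property: once a value is chosen for a log position, later proposals can only re-propose that same value.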


Megastore: Replication

•  Master-based approach
   -  A master-slave model is generally used, where the master handles the replication of all writes.
   -  But the master then becomes a bottleneck.


Megastore: Replication

•  Megastore replication system (modified Paxos)
   -  Fast reads
      -  Allow local reads from anywhere.
      -  A coordinator at each replica tracks the set of entity groups for which its replica has observed all Paxos writes; the replica serves local reads for those groups.
   -  Fast writes
      -  A specific replica is chosen as the leader for each log position.
      -  The leader arbitrates which value may use proposal number zero.
      -  The first writer to submit a value to the leader wins the right to ask all replicas to accept that value.
•  The next write's leader is selected using the closest-replica heuristic. The aim is to minimize writer-leader latency, based on the observation that most apps submit writes from the same region repeatedly.
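The leader's role in the fast-write path can be sketched as a first-come-first-served grant. What a losing writer does next is not spelled out on the slide; the sketch assumes it falls back to a regular two-phase Paxos round. Class and function names are hypothetical.

```python
class Leader:
    """Leader for one log position: grants the fast-path slot only once."""

    def __init__(self):
        self.granted_value = None

    def accept_first(self, value):
        """Return True for the first writer only."""
        if self.granted_value is None:
            self.granted_value = value
            return True
        return False


def submit(leader, value):
    """Return which path a writer takes for this log position."""
    # Assumption: writers that lose the race run a full Paxos round instead.
    return "fast path" if leader.accept_first(value) else "full Paxos"


leader = Leader()
print(submit(leader, "write-1"))
print(submit(leader, "write-2"))
```

The winner skips the Prepare round entirely, which is how a write can complete in a single round of inter-datacenter communication.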


Experience

•  Real-world deployment
   -  More than 100 production applications use Megastore (e.g., Google App Engine).
   -  Most applications see extremely high availability.
   -  Most users see average write latencies of 100 to 400 ms.


Related Work

•  NoSQL data storage systems
   -  Bigtable, Cassandra, Yahoo PNUTS, Amazon SimpleDB
•  Data replication process
   -  HBase, CouchDB, Dynamo, …
   -  Extend the replication schemes of traditional RDBMS systems.
•  Paxos algorithm
   -  Scalaris, Keyspace, …
   -  Few have used Paxos to achieve synchronous replication.


Conclusion

Megastore:
•  A scalable, highly available datastore for interactive internet services.
•  Paxos is used for synchronous replication.
•  Built on Bigtable as the scalable datastore, while adding richer primitives (ACID transactions, indexes).
•  Has over 100 applications in production.


Megastore

Any Questions?


References

•  [1] Jason Baker et al. "Megastore: Providing Scalable, Highly Available Storage for Interactive Services." CIDR 2011.
•  [2] Philip A. Bernstein and Eric Newcomer. "Principles of Transaction Processing." Morgan Kaufmann, 2009.
•  [3] http://paprika.umw.edu/~ernie/cpsc321/10312006.html
•  [4] Google Megastore presentation at SIGMOD 2008. http://perspectives.mvdirona.com/2008/07/10/GoogleMegastore.aspx


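Megastore: Replication (Megastore Read Process)

•  Each replica stores mutations and metadata for the log entries.
•  Read process
   1. Query local: check whether the local replica is up to date.
   2. Find position: determine the highest log position and select a replica that has applied the log through that position.
   3. Catch up: for log positions the selected replica is missing, get the consensus value from another replica.
   4. Validate: once the local replica is synchronized, mark it as up to date.
   5. Query data: read the data at the timestamp of the selected log position.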
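The five read steps can be traced with a simplified Python sketch. It makes several assumptions that are not on the slide: the local replica is always the one selected, every missing log position is available on some other replica, and a boolean flag stands in for the up-to-date check. All names are hypothetical.

```python
class Replica:
    def __init__(self, name, log, up_to_date):
        self.name = name
        self.log = dict(log)          # log position -> committed value
        self.up_to_date = up_to_date  # result of the local up-to-date check

    def highest_position(self):
        return max(self.log, default=-1)


def current_read(local, others):
    """Return (log position to read at, replica name) after the five steps."""
    # 1. Query local: if the local replica is up to date, read it directly.
    if local.up_to_date:
        return local.highest_position(), local.name
    # 2. Find position: highest log position known to any replica.
    position = max(r.highest_position() for r in [local] + others)
    # 3. Catch up: fetch missing log entries from other replicas.
    #    (Assumes some other replica holds each missing position.)
    for pos in range(position + 1):
        if pos not in local.log:
            source = next(r for r in others if pos in r.log)
            local.log[pos] = source.log[pos]
    # 4. Validate: the local replica now reflects all known writes.
    local.up_to_date = True
    # 5. Query data: read at the selected log position.
    return position, local.name


local = Replica("A", {0: "a"}, up_to_date=False)
remote = Replica("B", {0: "a", 1: "b"}, up_to_date=True)
print(current_read(local, [remote]), local.log)
```

In the common case step 1 succeeds and the read never leaves the local datacenter; steps 2 to 4 only run after the local replica has missed a write.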


Megastore: Replication (Megastore Write Process)

•  Each replica stores mutations and metadata for the log entries.
•  Write process
   1. Accept leader: ask the leader to accept the value as proposal number zero.
   2. Prepare: run the Paxos Prepare phase at all replicas.
   3. Accept: ask the remaining replicas to accept the value.
   4. Invalidate: fault handling for replicas that did not accept the value.
   5. Apply: apply the value's mutations at as many replicas as possible.
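The control flow of the five write steps can be traced with the sketch below. It only records which steps run and which replicas need invalidation; it assumes that Prepare is skipped when the leader accepts the value, and that a write without a majority of accepts is simply reported as failed so the caller can retry. The function name and inputs are made up for the example.

```python
def write_steps(leader_accepts, accept_votes):
    """Trace the write process; return (committed, invalidated, steps).

    leader_accepts: whether the leader accepted the value in step 1.
    accept_votes:   replica name -> whether that replica accepted the value.
    """
    steps = ["accept leader"]
    if not leader_accepts:
        steps.append("prepare")  # assumed fallback: run the Prepare phase
    steps.append("accept")
    accepted = [name for name, ok in accept_votes.items() if ok]
    if len(accepted) <= len(accept_votes) // 2:
        return False, [], steps  # no majority: the caller has to retry
    # Replicas that missed the write must not serve local reads as current.
    invalidated = [name for name, ok in accept_votes.items() if not ok]
    steps.append("invalidate")
    steps.append("apply")
    return True, invalidated, steps


print(write_steps(True, {"A": True, "B": True, "C": False}))
```

Invalidation before Apply is what keeps local reads safe: a replica that missed the write is flagged before any client could read stale data from it as if it were current.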