Megastore: Providing Scalable, Highly Available Storage for Interactive Services
J. Baker, C. Bond, J.C. Corbett, JJ Furman, A. Khorlin, J. Larson, J-M Léon, Y. Li, A. Lloyd, V. Yushprakh
Google Inc.
CIDR 2011, Jan. 12 2011
With Great Scale Comes Great Responsibility
A billion Internet users
  Small fraction is still huge
Must please users
  Bad press is expensive - never lose data
  Support is expensive - minimize confusion
  No unplanned downtime
  No planned downtime
  Low latency
Must also please developers, admins
Making Everyone Happy
Technology Options
Megastore
Started in 2006 for app development at Google
Service layered on:
  Bigtable (NoSQL scalable data store per datacenter)
  Chubby (Config data, config locks)
Turnkey scaling (apps, users)
Developer-friendly features
Wide-area synchronous replication
  partition by "Entity Group"
Entity Groups
Entity Groups are sub-databases
Entity Groups
Cheap transactions within an entity group (common)
Entity Groups
Expensive or loosely-consistent operations across Entity Groups (rare)
Scale Axis vs. Wide Replication Axis
Entity Group Mapping Examples
Applications must choose their partitioning
Common operations within an EG

| Application | Entity Groups | Cross-EG Operations |
|---|---|---|
| Email | User accounts | none (out-of-system) |
| Blogs | Users, Blogs | Access control, notifications, global indexes |
| Mapping | Local patches | Patch-spanning ops (2PC) |
| Social | Users, Groups | Messages, bi-directional relationships, notifications |
| Resources | Sites | Shipments |

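
The mapping above implies a simple rule: an operation that touches one entity group can run as a cheap local transaction, while one that spans groups needs 2PC or a queue. A minimal Python sketch of that rule for the Blogs row follows. The `Op` type, the `plan` function, and the `("blog", 7)`-style keys are hypothetical names chosen for illustration, not Megastore APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """An application operation and the entity groups it touches (illustrative)."""
    name: str
    entity_groups: frozenset  # hypothetical keys of the form (kind, id)


def plan(op: Op) -> str:
    """Pick the cheap path for a single entity group, else the expensive one."""
    if len(op.entity_groups) <= 1:
        return "local transaction"
    return "cross-EG (2PC or queue)"


# Editing a post stays inside one blog's entity group.
edit_post = Op("edit post", frozenset({("blog", 7)}))
# Notifying a follower touches the blog's group and a user's group.
notify = Op("notify follower", frozenset({("blog", 7), ("user", 101)}))

print(plan(edit_post))
print(plan(notify))
```

Choosing entity groups so that most operations take the first branch is the partitioning decision the slide leaves to the application.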
Achieving Technical Goals
Scale
  Bigtable within a datacenter
  Easy to add Entity Groups (storage, throughput)
ACID Transactions
  Write-ahead log per Entity Group
  2PC or Queues between Entity Groups
Wide-Area Replication
  Paxos
  Tweaks for optimal latency
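
To make "write-ahead log per Entity Group" concrete, here is a toy in-memory sketch in Python. The slide names only the log itself; the commit rule used here (a transaction commits only if the log has not grown since it started reading) is an assumption for illustration, and `EntityGroupLog` is a hypothetical class, not a Megastore interface.

```python
class EntityGroupLog:
    """Toy write-ahead log for ONE entity group (hypothetical, in-memory)."""

    def __init__(self):
        self.entries = []  # committed mutations, in log order
        self.state = {}    # current values, the result of applying entries

    def begin(self):
        """Start a transaction: remember the log position and take a snapshot."""
        return len(self.entries), dict(self.state)

    def commit(self, read_position, mutations):
        """Append only if no other transaction committed since begin()."""
        if read_position != len(self.entries):
            return False                # lost the race; caller should retry
        self.entries.append(mutations)  # write the log entry first
        self.state.update(mutations)    # then apply it
        return True


log = EntityGroupLog()

# Two read-modify-write transactions start from the same log position.
pos_a, snap_a = log.begin()
pos_b, snap_b = log.begin()
ok_a = log.commit(pos_a, {"balance": snap_a.get("balance", 0) + 10})
ok_b = log.commit(pos_b, {"balance": snap_b.get("balance", 0) + 5})  # conflicts with A

# The loser retries against fresh state.
pos_b, snap_b = log.begin()
ok_retry = log.commit(pos_b, {"balance": snap_b["balance"] + 5})
```

Because every entity group has its own log, two transactions conflict only when they touch the same group, which is why adding entity groups adds throughput.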
Paxos: Quorum-based Consensus
"While some consensus algorithms, such as Paxos, have started to find their way into [large-scale distributed storage systems built over failure-prone commodity components], their uses are limited mostly to the maintenance of the global configuration information in the system, not for the actual data replication."
-- Lamport, Malkhi, and Zhou, May 2009
Paxos: Megastore Tweaks
Replicates transaction log entries on each write
Writes: one WAN round-trip (avg.)
Strong Reads: zero WAN round-trips (avg.)
  per-replica bitmap invalidated on faults
Reads/Writes from any replica (no master)
  no pipelining: limited per-EG throughput
  batching will improve throughput
Background scanners finish all operations
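
The sketch below models two ideas from this slide in Python: a write succeeds once a majority of replicas accept the log entry, and a replica that missed a write is flagged so it stops serving strong reads locally. It is not Paxos (there are no proposal numbers or competing proposers), and the single flag per replica is an assumed simplification of the slide's per-replica bitmap. `ReplicatedEntityGroup` and its methods are hypothetical names.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a quorum out of n."""
    return n // 2 + 1


class ReplicatedEntityGroup:
    """Toy model of one entity group's log copied to several replicas."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.log = {r: [] for r in self.replicas}
        self.up_to_date = {r: True for r in self.replicas}

    def catch_up(self, replica):
        """Copy the longest log to a stale replica and mark it current."""
        longest = max(self.log.values(), key=len)
        self.log[replica] = list(longest)
        self.up_to_date[replica] = True

    def write(self, entry, reachable):
        """Append entry if a majority is reachable; flag replicas that miss it."""
        acks = [r for r in self.replicas if r in reachable]
        if len(acks) < majority(len(self.replicas)):
            return False  # no quorum, nothing is written
        for r in self.replicas:
            if r in reachable:
                if not self.up_to_date[r]:
                    self.catch_up(r)  # fill in missed entries before appending
                self.log[r].append(entry)
            else:
                self.up_to_date[r] = False  # invalidate: no local strong reads
        return True

    def can_serve_strong_read(self, replica):
        """A strong read is local only if the replica has seen every write."""
        return self.up_to_date[replica]


eg = ReplicatedEntityGroup(["A", "B", "C"])
wrote_e1 = eg.write("e1", reachable={"A", "B"})  # 2 of 3 is a quorum
c_is_current = eg.can_serve_strong_read("C")     # C missed e1
wrote_e2 = eg.write("e2", reachable={"A"})       # 1 of 3 is not a quorum
eg.catch_up("C")
```

In this model a strong read on an up-to-date replica needs no remote call, which matches the slide's "zero WAN round-trips (avg.)" for reads.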
Comparison with Other Approaches

| NoSQL | Megastore | RDBMS |
|---|---|---|
| Minimal features | Scalable features | Full-featured |
| Highly scalable | Highly scalable | Medium scale with effort |
| PK lookup and scan | Indexes, scans, physical clustering | Storage abstraction, complex query planning and execution |
| Limited/eventual consistency | Partitioned consistency | Global consistency |

Features
Declarative schema
Serializable Transactions (within Entity Group)
Queues and 2PC (between Entity Groups)
Indexes
  declared fields
  full-text
Online backup and restore
Built-in encryption and compression
Omissions (current)
(currently) No query language
  Apps must implement query plans
  Apps have fine-grained control of physical placement
(currently) Limited per-Entity Group update rate
Is Everybody Happy?
Admins
  linear scaling, transparent rebalancing (Bigtable)
  instant transparent failover
  symmetric deployment
Developers
  ACID transactions (read-modify-write)
  many features (indexes, backup, encryption, scaling)
  single-system image makes code simple
  little need to handle failures
End Users
  fast up-to-date reads, acceptable write latency
  consistency
Take-Aways
Sync WAN replication on each write
Constraints acceptable to most apps
  EG partitioning
  High write latency
  Limited per-EG throughput
Turnkey scaling achieved
  >100 apps
  >3 billion writes/day
  >20 billion reads/day
  ~1PB data (before index, replication)
Most apps get carrier-grade (five 9's) availability
In production use for over 4 years
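
For a sense of scale, the daily totals above can be converted to per-second rates, and "five 9's" to a yearly downtime budget. The short Python calculation below assumes load spread evenly over 24 hours (real traffic will peak higher) and a 365-day year; since the slide gives ">" figures, the rates are lower bounds.

```python
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400
MINUTES_PER_YEAR = 365 * 24 * 60      # 525,600

writes_per_sec = 3_000_000_000 / SECONDS_PER_DAY   # about 34,722 writes/s
reads_per_sec = 20_000_000_000 / SECONDS_PER_DAY   # about 231,481 reads/s

availability = 0.99999                               # "five 9's"
downtime_min_per_year = (1 - availability) * MINUTES_PER_YEAR  # about 5.26 minutes

print(f"{writes_per_sec:,.0f} writes/s")
print(f"{reads_per_sec:,.0f} reads/s")
print(f"{downtime_min_per_year:.2f} minutes of downtime per year")
```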
For more information
Read our full paper
Become a Megastore customer:
  Use Google App Engine ("high replication")
Ask a question...
Extra Slides
Megastore Architecture
Why Not Lots of RDBMSs?
Functional
  Need a place to store global and full-text indexes
Space and Time
  Create new local EG in ~10ms
  Overhead of <1KB per EG
Administration
  Load-rebalancing
  Fault recovery
  Monitoring
  Operational team
Schema
CREATE SCHEMA PhotoApp;

CREATE TABLE User {
  required int64 user_id;
  required string name;
} PRIMARY KEY(user_id), ENTITY GROUP ROOT;

CREATE TABLE Photo {
  required int64 user_id;
  required int32 photo_id;
  required int64 time;
  required string full_url;
  optional string thumbnail_url;
  repeated string tag;
} PRIMARY KEY(user_id, photo_id),
  IN TABLE User,
  ENTITY GROUP KEY(user_id) REFERENCES User;

CREATE LOCAL INDEX PhotosByTime
  ON Photo(user_id, time);

CREATE GLOBAL INDEX PhotosByTag
  ON Photo(tag) STORING (thumbnail_url);
Locality
Bigtable
  column-oriented storage
  faster access to nearby rows

| Row key | User.name | Photo.time | Photo.tag | Photo.url | Photo.-I.PhotosByTime |
|---|---|---|---|---|---|
| 101 | John | | | | |
| 101,500 | | 12:30:01 | Dinner, Paris | http://... | |
| 101,502 | | 12:15:22 | Betty, Paris | http://... | |
| 101,12:15:22,502 | | | | | X |
| 101,12:30:01,500 | | | | | X |
| 102 | Mary | | | | |

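
The row keys above are what make "faster access to nearby rows" pay off: because keys sort by their leading `user_id`, a user's row and that user's photos are contiguous. The Python sketch below imitates this with sorted tuples. It is an in-memory illustration only; `entity_group_keys` is a hypothetical helper, and the index rows are kept in a separate list for simplicity rather than in the same table as on the slide.

```python
from bisect import bisect_left

# Data rows keyed as on the slide: (user_id,) for User, (user_id, photo_id) for Photo.
rows = {
    (102,): {"User.name": "Mary"},
    (101, 502): {"Photo.time": "12:15:22", "Photo.tag": ["Betty", "Paris"]},
    (101,): {"User.name": "John"},
    (101, 500): {"Photo.time": "12:30:01", "Photo.tag": ["Dinner", "Paris"]},
}

# Tuples compare element by element, so a user's row sorts just before its photos.
sorted_keys = sorted(rows)


def entity_group_keys(keys, user_id):
    """Return the contiguous slice of keys that belong to one user."""
    lo = bisect_left(keys, (user_id,))
    hi = bisect_left(keys, (user_id + 1,))
    return keys[lo:hi]


# Index rows for PhotosByTime are keyed (user_id, time, photo_id), so scanning
# them in key order yields one user's photos ordered by time.
photos_by_time_keys = sorted(
    (key[0], row["Photo.time"], key[1])
    for key, row in rows.items()
    if len(key) == 2
)

johns_rows = entity_group_keys(sorted_keys, 101)
johns_photos_by_time = [photo_id for uid, _, photo_id in photos_by_time_keys if uid == 101]
```

A single range scan over `johns_rows` reads the whole entity group, which is the locality the slide attributes to Bigtable's sorted storage.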
Timeline of read algorithm
Timeline of write algorithm
Operations Across Entity Groups