Spanner Google’s scalable, multi-version, globally-distributed and synchronously-replicated database Presented By Alon Adler – Based on OSDI ’12 (USENIX Association)



Why was Spanner born?

• Google already had Bigtable and Megastore.

• Why not Bigtable?

• It is difficult to use for applications with complex, evolving schemas.

• It offers only eventual consistency across datacenters.

• Its transactional scope is limited to a single row.

• Why not Megastore?

• Relatively poor performance, especially write throughput.

So, What is Spanner?

• At the highest level of abstraction, it is a database that shards data across many sets of Paxos state machines in datacenters spread all over the world.

• Spanner is designed to scale up to millions of machines across hundreds of datacenters and trillions of database rows.

• Spanner maintains multiple replicas of each piece of data.

• Replication is used for global availability.

• Applications can use Spanner for high availability, even in the face of wide-area natural disasters.

So, What is Spanner?

• Spanner supports general-purpose transactions (ACID).

• Atomicity, Consistency, Isolation, Durability.

• Sometimes the eventual consistency of Bigtable isn’t good enough.

• Spanner provides a SQL-based query language.

• This gives applications the ability to handle complex schemas.

A Spanner deployment is called a “universe”.

• universemaster: shows the status of all zones.

• placement driver: transfers data between zones.

• zonemaster: allocates data to spanservers.

• location proxies: used by clients to locate the spanservers that hold the data they need.

• spanservers: between one hundred and several thousand per zone.

Spanserver Software Stack

Tables are sharded by row into tablets (similar to Bigtable’s tablets).

A tablet implements the mapping (key:string, timestamp:int64) -> string.

Each spanserver is responsible for between 100 and 1000 tablets.

A Paxos state machine on top of each tablet supports synchronous replication.
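The tablet mapping above can be pictured as a multi-version key-value store. The following is a minimal sketch, not Spanner’s actual storage format: the class name `Tablet`, its methods, and the sample key are all hypothetical, chosen only to illustrate how a read at a timestamp selects the newest version that is not newer than that timestamp.

```python
import bisect
from typing import Dict, List, Optional


class Tablet:
    """Toy multi-version map: (key, timestamp) -> value. Names are hypothetical."""

    def __init__(self) -> None:
        # For each key: timestamps in ascending order and the matching values.
        self._timestamps: Dict[str, List[int]] = {}
        self._values: Dict[str, List[str]] = {}

    def write(self, key: str, timestamp: int, value: str) -> None:
        ts = self._timestamps.setdefault(key, [])
        vals = self._values.setdefault(key, [])
        i = bisect.bisect_left(ts, timestamp)
        if i < len(ts) and ts[i] == timestamp:
            vals[i] = value  # overwrite the version stored at this timestamp
        else:
            ts.insert(i, timestamp)
            vals.insert(i, value)

    def read(self, key: str, timestamp: int) -> Optional[str]:
        """Return the newest version whose timestamp is <= the given timestamp."""
        ts = self._timestamps.get(key, [])
        i = bisect.bisect_right(ts, timestamp)
        return self._values[key][i - 1] if i > 0 else None


tablet = Tablet()
tablet.write("user/1/name", 10, "Ann")
tablet.write("user/1/name", 20, "Anna")
```

Reading `"user/1/name"` at timestamp 15 returns the version written at 10, while a read at timestamp 5 finds no version at all. Keeping old versions is what makes reads in the past possible.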

Paxos State Machine

Paxos state machines are used to implement a consistently replicated bag of mappings.

The key-value mapping state of each replica is stored in its corresponding tablet.

Writes must initiate the Paxos protocol at the leader.

The set of replicas is collectively a Paxos group.

Each replica can be located in a different datacenter.

Spanner’s Features

• As a globally-distributed database, Spanner provides several interesting features.

• Applications can specify constraints to control which datacenters contain which data:

• How far data is from its users (to control read latency).

• How far replicas are from each other (to control write latency).

• How many replicas are maintained (to control durability, availability and read performance).

• In addition, Spanner has two features that are difficult to implement in a distributed database:

• Externally-consistent reads and writes.

• Globally-consistent reads across the database at a timestamp.

Spanner’s Features

• Why are externally-consistent reads and writes, and globally-consistent reads across the database at a timestamp, difficult to implement in a distributed database?

• Because there is no global “wall clock” shared by all machines.

So, what can we do?

• With a global “wall clock”, external consistency would mean: commit order respects global wall-time order.

• So, we transform the problem into two requirements:

• Timestamp order respects global wall-time order.

• Timestamp order == commit order.
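The external-consistency requirement can be written as a single implication. The notation follows the Spanner paper rather than these slides, so the symbols are introduced here: $e_i^{start}$ and $e_i^{commit}$ are the start and commit events of transaction $T_i$, $t_{abs}(e)$ is the absolute time of an event, and $s_i$ is the commit timestamp of $T_i$.

```latex
t_{abs}(e_1^{commit}) < t_{abs}(e_2^{start}) \;\Rightarrow\; s_1 < s_2
```

In words: if $T_2$ starts after $T_1$ has committed, then $T_2$ must receive a larger timestamp than $T_1$.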

Assigning timestamps to RW transactions

• Transactions that write use two-phase locking (2PL).

• Each transaction T is assigned a timestamp s.

• Data written by T is timestamped with s.

• Assign timestamp while locks are held.

[Figure: timeline of transaction T. Locks are acquired, s = now() is picked while the locks are held, and then the locks are released.]

TIMESTAMP INVARIANTS

Timestamp order == commit order

Timestamp order respects global wall-time order

[Figure: timeline showing transactions T1, T2, T3 and T4.]

TrueTime API

• The key enabler of these properties (previous slide) is a new TrueTime API and its implementation.

• The API exposes clock uncertainty, and the guarantees on Spanner’s timestamps depend on the bounds that the implementation provides.

• The implementation keeps uncertainty small (generally less than 10 ms) by using multiple modern clock references (GPS and atomic clocks).

TrueTime

• “Global wall-clock time” with bounded uncertainty.

[Figure: on the time axis, TT.now() returns an interval from earliest to latest, of width 2*ε, that contains the absolute time.]
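To make the interval idea concrete, here is a small sketch of a TrueTime-like interface. It is only an illustration: the real implementation derives ε from GPS and atomic clock references, whereas this toy version assumes a fixed `epsilon` and an injectable `clock` function, both of which are hypothetical parameters. The method names `now`, `after` and `before` mirror the API described in the Spanner paper.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float  # lower bound on absolute time, in seconds
    latest: float    # upper bound on absolute time, in seconds


class TrueTime:
    """Toy stand-in for TrueTime; epsilon is an assumed, fixed error bound."""

    def __init__(self, epsilon: float = 0.005, clock=time.time) -> None:
        self.epsilon = epsilon
        self._clock = clock

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True if t has definitely not arrived yet."""
        return self.now().latest < t


# A frozen clock makes the example deterministic.
tt = TrueTime(epsilon=0.005, clock=lambda: 100.0)
interval = tt.now()
```

With the frozen clock, `interval` spans 99.995 to 100.005, so its width is 2*ε. A caller never learns the exact time, only bounds on it.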

TIMESTAMPS AND TRUETIME

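[Figure: timeline of transaction T. After acquiring locks, T picks s = TT.now().latest, then performs a commit wait until TT.now().earliest > s, and only then releases its locks. Picking s at the latest bound and the commit wait each account for an interval of ε on average.]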
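The commit-wait rule can be sketched in a few lines. This is a simplified illustration under stated assumptions: `EPSILON` is a made-up fixed uncertainty, `tt_now` is a hypothetical stand-in for TT.now() built on the local clock, and lock handling is left to the caller.

```python
import time

EPSILON = 0.005  # assumed clock uncertainty in seconds (hypothetical value)


def tt_now():
    """Return (earliest, latest) bounds on the current absolute time."""
    t = time.time()
    return t - EPSILON, t + EPSILON


def commit_with_wait(apply_writes):
    """Pick s while locks are held, then wait out the clock uncertainty."""
    # The caller is assumed to hold all locks at this point.
    _, s = tt_now()               # s = TT.now().latest
    apply_writes(s)               # data written by T is timestamped with s
    while tt_now()[0] <= s:       # commit wait: until TT.now().earliest > s
        time.sleep(EPSILON / 10)
    return s                      # only now may the caller release its locks


written = []
start = time.time()
s = commit_with_wait(written.append)
elapsed = time.time() - start
```

Because the locks are released only after s is guaranteed to be in the past, any transaction that starts afterwards will pick a larger timestamp. With a fixed ε, as in this sketch, the wait lasts roughly 2*ε.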

Operations

• Spanner supports:

• Read-write transactions.

• Read-only transactions.

• Snapshot reads.

• A read-only transaction must be pre-declared as not having any writes.

• Reads in a read-only transaction execute at a system-chosen timestamp without locking, so that incoming writes are not blocked.

• A snapshot read is a read in the past that executes without locking.

• For a snapshot read, the client can either specify a timestamp or provide an upper bound on the desired timestamp’s staleness and let Spanner choose the timestamp.

Reads within read-write transactions

• Writes that occur in a transaction are buffered at the client until commit. As a result, reads in a transaction do not see the effects of the transaction’s own writes.

• The client issues reads to the leader replica of the appropriate group.

• The leader acquires read locks and then reads the most recent data.

• While a client transaction remains open, the client sends “keep-alive” messages to prevent participant leaders from timing out the transaction.

• When the client has completed all reads and buffered all writes, the write (commit) protocol begins.
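The effect of client-side write buffering can be shown with a toy transaction object. This is a sketch only: `ReadWriteTransaction` and the dict-based `store` are hypothetical stand-ins for the client library and the group leader, and locking and keep-alive messages are omitted.

```python
class ReadWriteTransaction:
    """Toy client-side transaction; `store` stands in for the group leader."""

    def __init__(self, store):
        self._store = store
        self._buffered_writes = {}

    def read(self, key):
        # Reads go to the store, not to the buffer, so the transaction
        # does not see its own uncommitted writes.
        return self._store.get(key)

    def write(self, key, value):
        self._buffered_writes[key] = value  # nothing is sent yet

    def commit(self):
        # All buffered writes are handed over together at commit time.
        self._store.update(self._buffered_writes)
        self._buffered_writes.clear()


store = {"balance": 100}
txn = ReadWriteTransaction(store)
txn.write("balance", 50)
seen_before_commit = txn.read("balance")
txn.commit()
seen_after_commit = store["balance"]
```

Here `seen_before_commit` is still 100, because the write to `"balance"` is only buffered. The new value 50 becomes visible in the store after `commit()`.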

RW transactions that involve one Paxos group

[Figure: timeline of transaction T at the group leader: acquire locks, pick s, start consensus, achieve consensus, commit wait done, release locks, notify slaves.]

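RW transactions that involve more than one Paxos group: 2PC protocol

[Figure: 2PC timeline with coordinator TC and participants TP1 and TP2. Every group acquires locks. Each participant computes its own prepare timestamp s, logs a prepare record (start logging, done logging), and sends s to the coordinator (prepared). The coordinator computes the overall s, finishes its commit wait, commits, and notifies the participants of s. All groups then release their locks.]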
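How the coordinator computes the overall s can be sketched as follows. The three constraints are the ones stated in the Spanner paper: s must be at least every prepare timestamp, greater than TT.now().latest when the coordinator received the commit request, and greater than any timestamp the leader assigned earlier. The function name, its parameters, and the use of integer timestamps are assumptions made for illustration.

```python
from typing import Iterable


def choose_commit_timestamp(prepare_timestamps: Iterable[int],
                            coordinator_now_latest: int,
                            last_assigned: int) -> int:
    """Return the smallest integer timestamp satisfying all three constraints."""
    return max(
        max(prepare_timestamps),      # s >= every participant's prepare timestamp
        coordinator_now_latest + 1,   # s > TT.now().latest at the coordinator
        last_assigned + 1,            # s > timestamps assigned earlier (monotonic)
    )


overall_s = choose_commit_timestamp([6, 8], coordinator_now_latest=5, last_assigned=5)
```

With prepare timestamps 6 and 8, the overall commit timestamp is 8, as long as the coordinator’s clock bound and its previously assigned timestamps are smaller.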

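EXAMPLE

• A 2PC transaction (coordinator TC, participant TP) removes X from my friend list and removes myself from X’s friend list.

• Prepare timestamps: sC = 6, sP = 8. Overall commit timestamp: s = 8.

• Transaction T2 publishes risky post P at s = 15.

Time          <8      8       15
My friends    [X]     []
My posts                      [P]
X’s friends   [me]    []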

Serving Reads at a Timestamp

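• Each replica maintains a safe time, t_safe.

• A replica can satisfy a read at a timestamp t if t <= t_safe.

• t_safe = min(t_safe^Paxos, t_safe^TM).

• t_safe^Paxos is the timestamp of the highest-applied Paxos write.

• t_safe^TM (the transaction manager’s safe time) is much harder:

• t_safe^TM = ∞ if there are no prepared (but not yet committed) 2PC transactions.

• Otherwise, t_safe^TM = min_i(s_{i,g}^prepare) - 1 over the transactions i prepared in group g.

• Thus, t_safe is the maximum timestamp at which reads are safe.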
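The safe-time rule can be sketched as a small function. It follows the definition in the Spanner paper, where a replica’s safe time t_safe is the minimum of the Paxos safe time and the transaction manager’s safe time, and the latter is one less than the smallest prepare timestamp of any prepared transaction. The function and parameter names are hypothetical.

```python
import math
from typing import Iterable


def tm_safe_time(prepare_timestamps: Iterable[float]) -> float:
    """Transaction-manager safe time: infinity if nothing is prepared."""
    ts = list(prepare_timestamps)
    return math.inf if not ts else min(ts) - 1


def can_serve_read(t: float,
                   highest_applied_write_ts: float,
                   prepare_timestamps: Iterable[float]) -> bool:
    """A replica may serve a read at t only if t <= t_safe."""
    t_safe = min(highest_applied_write_ts, tm_safe_time(prepare_timestamps))
    return t <= t_safe
```

For example, a replica that has applied writes up to timestamp 12 can serve a read at 10 when nothing is prepared, but not while a transaction prepared at timestamp 10 is still pending, because that transaction might commit at a timestamp the read would have to observe.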

Read-Only transactions

• Executes in two phases:

• Assign a timestamp s_read.

• Execute the transaction’s reads as snapshot reads at s_read.

• The snapshot reads can execute at any replicas that are sufficiently up-to-date.

• The simple assignment s_read = TT.now().latest preserves external consistency.

• Such a timestamp may require the data reads at s_read to block if t_safe has not advanced sufficiently.

• To reduce the chances of blocking, Spanner should assign the oldest timestamp that preserves external consistency.

Read-Only transactions

• Assigning a timestamp requires a negotiation phase between all of the Paxos groups that are involved in the read.

• As a result, Spanner requires a scope expression that summarizes the keys that will be read by the entire transaction.

• If the scope’s values are served by a single Paxos group:

• The client issues the read-only transaction to the group leader.

• The leader assigns s_read = LastTS(), the timestamp of the last committed write at the Paxos group.

• The read then executes at any up-to-date replica.

• If the scope’s values are served by multiple Paxos groups:

• Spanner avoids the negotiation round and uses s_read = TT.now().latest (the read may have to wait for safe time to advance).
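The two cases for picking the read timestamp can be summarized in a short sketch. It is illustrative only: `choose_read_timestamp` and its arguments are hypothetical, and the single-group shortcut is shown without the extra condition, noted in the Spanner paper, that the group has no prepared transactions.

```python
def choose_read_timestamp(group_last_ts, tt_now_latest):
    """Pick the timestamp for a read-only transaction.

    group_last_ts maps each Paxos group in the scope to its LastTS() value.
    """
    if len(group_last_ts) == 1:
        # Single group: use the timestamp of its last committed write.
        (last_ts,) = group_last_ts.values()
        return last_ts
    # Multiple groups: skip negotiation and read at TT.now().latest.
    return tt_now_latest


single = choose_read_timestamp({"g1": 40}, tt_now_latest=100)
multi = choose_read_timestamp({"g1": 40, "g2": 55}, tt_now_latest=100)
```

The single-group read uses the older timestamp 40 and is therefore less likely to block. The multi-group read uses 100 and may have to wait until each replica’s safe time reaches that value.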

Benchmarks

Setup: 50 Paxos groups, 2500 directories (buckets), 4 KB reads or writes, datacenters less than 1 ms apart.

Latency remains mostly constant as the number of replicas increases, because Paxos executes in parallel at a group’s replicas.

Benchmarks

All leaders are explicitly placed in zone Z1.

Red line: killing a non-leader has no effect on read throughput.

Green line (leader-soft): killing the leader while giving the leaders time to hand off leadership.

Blue line (leader-hard): killing the leader with no warning for the leaders.

Questions?

Thanks!