sinfonia - university of cambridge · sinfonia a new paradigm for building scalable distributed...

25
Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos Karamanolis

Upload: vuonganh

Post on 28-May-2018

218 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

Sinfoniaa new paradigm

for building scalable distributed systems

(SOSP 2007)

Marcos AguileraArif MerchantMehul Shah

Alistair VeitchChristos Karamanolis

Page 2: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

Memory Nodes

App & Tx Coordinator Nodes

On the face of it, we have transaction coordinators alongside the application, and memory nodes to store the data. Is it just another ACID store that forces 2PC on you?

What I found interesting about this paper is that other approaches avoid 2pc at all costs; instead they relax consistency requirements (many use memcached) or avoid multi-word transactions (bigtable uses single row transactions)

Page 3: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

Memory Nodes

App & Tx Coordinator Nodes

Another distributed transactional store?

On the face of it, we have transaction coordinators alongside the application, and memory nodes to store the data. Is it just another ACID store that forces 2PC on you?

What I found interesting about this paper is that other approaches avoid 2pc at all costs; instead they relax consistency requirements (many use memcached) or avoid multi-word transactions (bigtable uses single row transactions)

Page 4: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

insert

delete

read

update

DB

read

A brief recap about db operation and 2pc. App begins a transaction, makes r requests (and touches m nodes in the process).DB often locks the data.

Page 5: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

insert

delete

read

update

DB

r data requests to m nodes

read

A brief recap about db operation and 2pc. App begins a transaction, makes r requests (and touches m nodes in the process).DB often locks the data.

Page 6: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

DB

prepare

prepare

prepare

2 pc starts after data is over. The db nodes log their data and prepare decision, thenthe coordinator logs its commit decision.

Page 7: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

DB

commit

commit

commit

Page 8: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

DB

commit

commit

commit

r + 2mround trips

Page 9: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

DB

commit

commit

commit

r + 2mround trips

m + 1disk writes

Page 10: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

sinfonia

if_cmp_rw_prepare

if_cmp_rw_prepare

if_cmp_rw_prepare

Sinfonia also exactly two phases, where the data transfer part is combined with prepare. Items are try-locked, then comparison is done, then the read and write is done. If the lock fails, the request fails.

It can do it because there is exactly one request allowed to read and write data. Clearly, this influences the way one writes application. The important takeaway is that such an API is useful in the real world. Multi-stage read-writes just have to be written as higher-order transactions (as with multi-word CAS)

To be fair, one can do this with stored procedures, but current databases donʼt allow you to exec stored procedure and prepare in one network hop.

Page 11: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

sinfonia

m data requests to m nodes

if_cmp_rw_prepare

if_cmp_rw_prepare

if_cmp_rw_prepare

Sinfonia also exactly two phases, where the data transfer part is combined with prepare. Items are try-locked, then comparison is done, then the read and write is done. If the lock fails, the request fails.

It can do it because there is exactly one request allowed to read and write data. Clearly, this influences the way one writes application. The important takeaway is that such an API is useful in the real world. Multi-stage read-writes just have to be written as higher-order transactions (as with multi-word CAS)

To be fair, one can do this with stored procedures, but current databases donʼt allow you to exec stored procedure and prepare in one network hop.

Page 12: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

sinfonia

commit

commit

commit

no coordinator disk write, plus commit write is lazy.

Page 13: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

sinfonia

1.5 m round trips

commit

commit

commit

no coordinator disk write, plus commit write is lazy.

Page 14: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

sinfonia

1.5 m round trips m disk writes

commit

commit

commit

no coordinator disk write, plus commit write is lazy.

Page 15: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

DB Sinfoniastructuredstorage linear range

locking isolation levels duration deadlocks

brief, deterministic locking interval

blocking non-blocking

db nodes don’t know about each other

mem nodes know about others, for each tx

sinfonia: much lower level; app may have to worry about garbage collecting space

sinfonia: no blocking. If lock not acquired, does not prepare.

2pc: coordinator is a bottleneck for recovery because only it knows the participants.

Page 16: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

coordinator crash

Management infrastructure periodically polls mem nodes about in-doubt transactions and a recovery coordinator kicks in when a coordinator crashes. It tells each node, for each in-doubt tx, to abort the tx unless it voted commit. If for a particular tx, all participating nodes say they voted to commit, then the rec. coordinator drives the tx to commit.

This is correct because this scheme can run concurrently with a coordinator which may have come back (maybe it got stuck in a GC or network hiccup). It is correct because no-one changes their vote.

To me, it is not clear from the paper how the management infrastructure knows which nodes the crashed coordinator was responsible for (unless there is a hint in the transaction id). Otherwise, it is reckless to start aborting all transactions currently in progress at all nodes.

Page 17: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

coordinator crash

Management infrastructure periodically polls mem nodes about in-doubt transactions and a recovery coordinator kicks in when a coordinator crashes. It tells each node, for each in-doubt tx, to abort the tx unless it voted commit. If for a particular tx, all participating nodes say they voted to commit, then the rec. coordinator drives the tx to commit.

This is correct because this scheme can run concurrently with a coordinator which may have come back (maybe it got stuck in a GC or network hiccup). It is correct because no-one changes their vote.

To me, it is not clear from the paper how the management infrastructure knows which nodes the crashed coordinator was responsible for (unless there is a hint in the transaction id). Otherwise, it is reckless to start aborting all transactions currently in progress at all nodes.

Page 18: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

coordinator crash

recovery coordinator

Management infrastructure periodically polls mem nodes about in-doubt transactions and a recovery coordinator kicks in when a coordinator crashes. It tells each node, for each in-doubt tx, to abort the tx unless it voted commit. If for a particular tx, all participating nodes say they voted to commit, then the rec. coordinator drives the tx to commit.

This is correct because this scheme can run concurrently with a coordinator which may have come back (maybe it got stuck in a GC or network hiccup). It is correct because no-one changes their vote.

To me, it is not clear from the paper how the management infrastructure knows which nodes the crashed coordinator was responsible for (unless there is a hint in the transaction id). Otherwise, it is reckless to start aborting all transactions currently in progress at all nodes.

Page 19: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

for every in-doubt tx abort unless you remember voting commit return vote

coordinator crash

recovery coordinator

Management infrastructure periodically polls mem nodes about in-doubt transactions and a recovery coordinator kicks in when a coordinator crashes. It tells each node, for each in-doubt tx, to abort the tx unless it voted commit. If for a particular tx, all participating nodes say they voted to commit, then the rec. coordinator drives the tx to commit.

This is correct because this scheme can run concurrently with a coordinator which may have come back (maybe it got stuck in a GC or network hiccup). It is correct because no-one changes their vote.

To me, it is not clear from the paper how the management infrastructure knows which nodes the crashed coordinator was responsible for (unless there is a hint in the transaction id). Otherwise, it is reckless to start aborting all transactions currently in progress at all nodes.

Page 20: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

coordinator crash

commit

commit

commit

if all of them voted to commit, a commit is sent to all, else abort.

Page 21: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

memory node crash

No recovery coordinator is used when a mem node crashes. Nodes know the other nodes that were involved in the various transactions, so they ask each other while recovering. Iʼm not a big fan of this architecture; Iʼd have preferred a recovery coordinator in all cases; it is less complex and a recovery from a system crash (like a power failure) doesnʼt swamp the network.

Page 22: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

memory node crash

No recovery coordinator is used when a mem node crashes. Nodes know the other nodes that were involved in the various transactions, so they ask each other while recovering. Iʼm not a big fan of this architecture; Iʼd have preferred a recovery coordinator in all cases; it is less complex and a recovery from a system crash (like a power failure) doesnʼt swamp the network.

Page 23: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

memory node crash

No recovery coordinator is used when a mem node crashes. Nodes know the other nodes that were involved in the various transactions, so they ask each other while recovering. Iʼm not a big fan of this architecture; Iʼd have preferred a recovery coordinator in all cases; it is less complex and a recovery from a system crash (like a power failure) doesnʼt swamp the network.

Page 24: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

enhancements

structured storage

smarter mem nodes

if cmp_read/write/prepare else read

temporary blocking instead of BAD_LOCK

Temporary blocking: Iʼd like a separate option that says block on locking for a limited amount of time instead of returning. It does introduce the possibility of limited-duration deadlocks, but may improve throughput.

Structured storage: different addressing options, not just offset, count. Ranges are prone to “off-by-one” errors that could result in livelocks and corrupted data. Key/Value storage keeps one keyʼs space logically separate from another.

App will also have to worry about portability. Fig. 7 in paper writes &newAttributes. This is tied to the current structure of attributes and to the machine that used it.

smarter: Compare could be any predicate (field2 > field 3). Actions could be increment, arithmetic, insertions etc.

Unnecessary round-trips on contention.

Page 25: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos

related reading

Google: Chubby, BigTable, TaskMasterYouTube architecture

Yahoo: , PNuts

Microsoft: Partitioning and Recovery Service

Apache/Yahoo Hadoop Project: Zookeeper