(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That Power Our Platform | AWS...

November 13, 2014 | Las Vegas, NV Al Vermeulen and Swami Sivasubramanian

Upload: amazon-web-services

Post on 02-Jul-2015


DESCRIPTION

AWS and Amazon.com operate some of the world's largest distributed systems infrastructure and applications. Over the past 18 years of operating this infrastructure, we have come to realize that building distributed systems large enough to meet the durability, reliability, scalability, and performance needs of AWS requires us to build our services on a few common distributed systems primitives. Examples of these primitives include a reliable method for building consensus in a distributed system, a reliable and scalable key-value store, infrastructure for a transactional logging system, scalable database query layers using both NoSQL and SQL APIs, and a system for scalable and elastic compute infrastructure. In this session, we discuss some of the solutions that we employ in building these primitives and the lessons we have learned from operating these systems. We also cover the history of some of these primitives (DHTs, transactional logging, materialized views, and various other deep distributed systems concepts), how their design evolved over time, and how we continue to scale them to AWS.

TRANSCRIPT

Page 1: (SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That Power Our Platform | AWS re:Invent 2014

November 13, 2014 | Las Vegas, NV

Al Vermeulen and Swami Sivasubramanian

Page 2:
Page 3:

Trend #1: The race between computing power and expectations

Computing systems keep getting more capable

But… expectations are going up even faster

Page 4:
Page 5:

Trend #2: Every application is a distributed app

The number of computers is going up fast

Many applications are distributed

Distributed systems is not a niche field anymore

Page 6:
Page 7:

Hardware and software trends

Specialized (expensive) hardware

Built-in redundancy

Simple software

Commodity hardware

Smarter software

Page 8:

Trend #3: Commodity Hardware and Smarter Software

With commodity hardware and scale, server failures are inevitable!

But by using smarter software, we can build more robust systems

Page 9:
Page 10:

[Figure: five labels, each reading "Online"]

Page 11:

[Figure: five labels, each reading "Online"]

Page 12:

[Figure: five labels, each reading "Online"]

Page 13:

Cloud – Elasticity is the new normal

This results in fleets being dynamic

Page 14:

Our World

Page 15:

Challenges

Page 16:

Addressing Distributed Computing challenges

primitives

Page 17:

Core Distributed Systems Primitives

Group Membership

Discovery

Metadata Store

Failure Detection

Workflows

Page 18:

Group Membership

Page 19:

Group Membership

Amazon RDS Multi-AZ

Amazon ElastiCache Group – List of caches in a Memcached group

Page 20:

An example…

[Figure labels: Replica A, Replica B, Replica C, Replica D, Replica E, Replica F; "Writes from client A"; "Reads and Writes from client B"; "New member in the group"; "Should I continue to serve reads?"; "Should I start a new quorum?"]

Classic split-brain issue in replicated systems, leading to lost writes!
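
One common guard against the split-brain situation on this slide is to let a partition accept writes only if it can reach a strict majority of the known group. The sketch below is illustrative only: the helper name `has_quorum` and the way the six replicas are split are assumptions made for the example, not something the slides specify.

```python
def has_quorum(reachable: set, members: set) -> bool:
    """Return True if the reachable replicas are a strict majority of the group."""
    return len(reachable & members) > len(members) // 2


# Hypothetical six-replica group, named after the replicas on the slide.
members = {"A", "B", "C", "D", "E", "F"}

# An even 3/3 partition: neither side has a majority, so neither may accept writes.
side_one = {"A", "B", "C"}
side_two = {"D", "E", "F"}
even_split = (has_quorum(side_one, members), has_quorum(side_two, members))

# A 4/2 partition: only the larger side may keep accepting writes.
uneven_split = (has_quorum({"A", "B", "C", "D"}, members),
                has_quorum({"E", "F"}, members))
```

Because two disjoint partitions cannot both hold a strict majority, at most one side keeps accepting writes. The check only works if every replica agrees on what `members` is, which is why membership changes themselves need care.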

Page 21:

Group Membership Fundamentals

Adding a new member to the group

Removing a member from the group

Discovering when the group membership changes

Discovering roles within the group
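
The four operations listed above can be sketched as a small interface. This is a single-process, in-memory illustration only; the class name `Group`, its method names, and the role strings are hypothetical, and a real implementation would keep this state in a consistent, replicated store rather than a local dictionary.

```python
from typing import Callable, Dict, List, Optional


class Group:
    """Toy group-membership registry (in-memory, not distributed)."""

    def __init__(self) -> None:
        self._roles: Dict[str, str] = {}  # member id -> role
        self._watchers: List[Callable[[List[str]], None]] = []
        self.version = 0  # incremented on every notified change

    def watch(self, callback: Callable[[List[str]], None]) -> None:
        """Register a callback that receives the member list after each change."""
        self._watchers.append(callback)

    def add(self, member: str, role: str = "replica") -> None:
        """Add a member (or update its role) and notify watchers."""
        self._roles[member] = role
        self._changed()

    def remove(self, member: str) -> None:
        """Remove a member; watchers are notified only if it was present."""
        if self._roles.pop(member, None) is not None:
            self._changed()

    def role_of(self, member: str) -> Optional[str]:
        """Return the member's role, or None if it is not in the group."""
        return self._roles.get(member)

    def members(self) -> List[str]:
        return sorted(self._roles)

    def _changed(self) -> None:
        self.version += 1
        for callback in self._watchers:
            callback(self.members())


group = Group()
events: List[List[str]] = []
group.watch(events.append)
group.add("replica-a", role="primary")
group.add("replica-b")
group.remove("replica-a")
```

The `version` counter gives observers a cheap way to tell whether they have seen the latest membership, which matters when deciding whether it is still safe to serve reads or form a quorum.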

Page 22:

Discovery

Page 23:

Discovery

Page 24:

Discovery – Configuration File

Page 25:

Discovery – DNS

Page 26:

Discovery – DNS (cons)

Page 27:

Discovery – Gossip Protocol

Page 28:

Discovery – Gossip Protocol (Cons)

Page 29:

Discovery – Metadata store/consensus

Amazon DynamoDB

Page 30:

Metadata Store

Page 31:

Metadata Store

Page 32:

Metadata Store - what are good characteristics?

Simplicity

Availability

Scalability

Amazon DynamoDB – top choice for metadata storage in Amazon

Page 33:

Metadata Store – Lessons Learned

Page 34:

Failure Detection

Page 35:

Failure Detection - Challenges

Page 36:

Failure Detection - Techniques

Page 37:

Failure detection: Lessons Learned

Page 38:

Workflows

Page 39:

Workflow – What is it?

To execute a series of actions asynchronously

Page 40:

What is a workflow?

Page 41:

What is not a workflow?

synchronous

asynchronous

Page 42:

Workflow – A simple script

Page 43:

Workflow – Recommended Approach

Activity 1

Activity 2

Activity 3

Activity 4

Page 44:

Workflow – Lessons Learned

Idempotent

metadata

Amazon Simple Workflow Service
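
One possible reading of the "Idempotent" and "metadata" keywords on this slide is that a workflow records each completed activity in a metadata store, so a retry after a failure skips the steps that already finished. The sketch below illustrates that idea under stated assumptions: `run_workflow`, the activity names, and the plain dictionary standing in for a metadata store are all hypothetical, and this is not how Amazon Simple Workflow Service is implemented.

```python
from typing import Callable, Dict, List, Tuple


def run_workflow(workflow_id: str,
                 activities: List[Tuple[str, Callable[[], None]]],
                 completed: Dict[Tuple[str, str], bool]) -> List[str]:
    """Run activities in order, skipping any already recorded as completed.

    Returns the names of the activities executed during this call.
    """
    executed = []
    for name, action in activities:
        key = (workflow_id, name)
        if completed.get(key):
            continue  # finished in an earlier attempt
        action()  # an exception here leaves earlier records intact
        completed[key] = True
        executed.append(name)
    return executed


calls: List[str] = []
attempts = {"count": 0}


def flaky_activity() -> None:
    """Fails on its first invocation, succeeds afterwards."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("transient failure")
    calls.append("activity-2")


steps = [
    ("activity-1", lambda: calls.append("activity-1")),
    ("activity-2", flaky_activity),
    ("activity-3", lambda: calls.append("activity-3")),
]
store: Dict[Tuple[str, str], bool] = {}  # stand-in for a metadata store

try:
    run_workflow("wf-1", steps, store)
except RuntimeError:
    pass  # first attempt stops at activity-2

second_run = run_workflow("wf-1", steps, store)
third_run = run_workflow("wf-1", steps, store)
```

A process can still crash after an activity runs but before its completion is recorded, in which case the activity will run again on retry. That is why each activity should itself be idempotent as well.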

Page 45:

What is the underlying problem?

Group Membership

Discovery

Metadata Store

Failure Detection

Workflows

Page 46:

Consensus

Page 47:

Paxos and consensus

single point of failure

Paxos at the bottom

broken

Page 48:

Consensus – Lessons Learned

Page 49:

consensus..

Page 50:

Paxos at Amazon

Page 51:

Group Membership

Discovery

Metadata Store

Failure Detection

Workflows

Lock Management

Amazon Kinesis

Amazon DynamoDB Streams

???

Page 52:

Summary

Page 53:

http://bit.ly/awsevals