(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That Power Our Platform | AWS...

Post on 02-Jul-2015

Category: Technology

DESCRIPTION

AWS and Amazon.com operate some of the world's largest distributed systems infrastructure and applications. In our past 18 years of operating this infrastructure, we have come to realize that building such large distributed systems to meet the durability, reliability, scalability, and performance needs of AWS requires us to build our services using a few common distributed systems primitives. Examples of these primitives include a reliable method to build consensus in a distributed system, a reliable and scalable key-value store, infrastructure for a transactional logging system, scalable database query layers using both NoSQL and SQL APIs, and a system for scalable and elastic compute infrastructure. In this session, we discuss some of the solutions that we employ in building these primitives and our lessons in operating these systems. We also cover the history of some of these primitives (DHTs, transactional logging, materialized views, and various other deep distributed systems concepts), how their design evolved over time, and how we continue to scale them to AWS.

TRANSCRIPT

November 13, 2014 | Las Vegas, NV

Al Vermeulen and Swami Sivasubramanian

Trend #1: The race between computing power and expectations

Computing systems keep getting more capable

But… expectations are going up even faster

Trend #2: Every application is a distributed app

The number of computers is going up fast

Many applications are distributed

Distributed systems is not a niche field anymore

Hardware and software trends

Specialized (expensive) hardware

Built in redundancy

Simple Software

Commodity hardware

Smarter software

With commodity hardware and scale - server failures are inevitable!

But by using smarter software, we can build more robust systems

Trend #3: Commodity Hardware and Smarter Software

Cloud – Elasticity is the new normal

This results in fleets being dynamic

Our World

Challenges

Addressing Distributed Computing challenges

primitives

Core Distributed Systems Primitives

Group Membership

Discovery

Metadata Store

Failure Detection

Workflows

Group Membership

Amazon RDS Multi-AZ

Amazon ElastiCache Group – List of caches in a Memcache group

An example…

[Diagram: Replicas A, B, C, D, E, and F. Client A sends writes; client B sends reads and writes. A new member joins the group, and the replicas must decide: "Should I continue to serve reads?" "Should I start a new quorum?"]

Classic Split Brain Issue in Replicated systems leading to lost writes!
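One common guard against the split-brain scenario above is to let a partition accept writes only while it can reach a strict majority of the configured group. The sketch below is illustrative only: the function name `has_quorum` and the replica names are assumptions, not something the slides specify.

```python
def has_quorum(reachable, members):
    """Return True if `reachable` covers a strict majority of `members`."""
    members = set(members)
    live = members & set(reachable)  # ignore nodes that are not group members
    return len(live) > len(members) // 2


# Hypothetical five-replica group split by a network partition.
group = {"A", "B", "C", "D", "E"}
side_one = {"A", "B", "C"}  # can reach 3 of 5: may keep accepting writes
side_two = {"D", "E"}       # can reach 2 of 5: must stop accepting writes
```

Because two disjoint partitions can never both hold a strict majority, at most one side keeps accepting writes. This only works if every replica agrees on what `members` is, which is why membership changes themselves need care.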

Group Membership Fundamentals

Adding a new member to the group

Removing a member from the group

Discovering when the group membership changes

Discovering roles within the group
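The four operations listed above can be sketched as a small in-memory registry. This is a minimal illustration under stated assumptions: the class `GroupMembership`, its method names, and the version counter are hypothetical, and a real system would keep this state in a replicated, durable store rather than in one process.

```python
class GroupMembership:
    """Toy membership view: member id -> role, plus a version per change."""

    def __init__(self):
        self._members = {}
        self._version = 0
        self._watchers = []

    def watch(self, callback):
        # Discovering when the group membership changes.
        self._watchers.append(callback)

    def add(self, member, role="replica"):
        # Adding a new member to the group (or updating its role).
        self._members[member] = role
        self._publish()

    def remove(self, member):
        # Removing a member from the group; unknown members are ignored.
        if member in self._members:
            del self._members[member]
            self._publish()

    def role_of(self, member):
        # Discovering roles within the group.
        return self._members.get(member)

    def view(self):
        return self._version, dict(self._members)

    def _publish(self):
        self._version += 1
        version, members = self.view()
        for callback in self._watchers:
            callback(version, members)
```

The version number lets a watcher discard stale notifications: a view with a lower version than one already seen can be ignored.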

Discovery

Discovery – Configuration File

Discovery - DNS

Discovery – DNS (cons)

Discovery – Gossip Protocol
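As a rough illustration of gossip-based discovery, the sketch below has each node periodically exchange its membership view with one random peer and keep the entry with the higher version for each member. It is an assumption-laden toy (the names `merge` and `gossip_round`, the `(version, status)` entries, and the push-pull exchange are all choices made here), not the protocol used by any AWS service.

```python
import random


def merge(local, remote):
    """Merge `remote` into `local`, keeping the higher-versioned entry per member."""
    for member, (version, status) in remote.items():
        if member not in local or version > local[member][0]:
            local[member] = (version, status)


def gossip_round(views, rng):
    """Each node exchanges views with one randomly chosen peer (push-pull)."""
    names = list(views)
    for name in names:
        peer = rng.choice([n for n in names if n != name])
        merge(views[peer], views[name])
        merge(views[name], views[peer])


# Each node starts out knowing only about itself.
views = {n: {n: (1, "alive")} for n in ["a", "b", "c", "d", "e"]}
rng = random.Random(0)
rounds = 0
while any(len(v) < len(views) for v in views.values()) and rounds < 100:
    gossip_round(views, rng)
    rounds += 1
```

No node needs a complete list up front, but views are only eventually consistent: between rounds, two nodes can disagree about who is in the group.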

Discovery – Gossip Protocol (Cons)

Discovery – Metadata store/consensus

Amazon DynamoDB

Metadata Store

Metadata Store - what are good characteristics?

Simplicity

Availability

Scalability

Amazon DynamoDB – top choice for metadata storage in Amazon

Metadata Store – Lessons Learned

Failure Detection

Failure Detection - Challenges

Failure Detection - Techniques
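One widely used technique is heartbeating with a timeout: a node is suspected once no heartbeat has arrived within a fixed window. The sketch below is a minimal example of that idea only; the class `HeartbeatDetector`, the injectable `clock`, and the timeout value are assumptions for illustration, and the slides do not say which techniques the talk covered.

```python
import time


class HeartbeatDetector:
    """Suspect a node when its last heartbeat is older than `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the detector can be tested
        self.last_seen = {}

    def heartbeat(self, node):
        # Record the arrival time of a heartbeat from `node`.
        self.last_seen[node] = self.clock()

    def suspected(self):
        # Nodes whose most recent heartbeat is older than the timeout.
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)
```

A timeout cannot distinguish a crashed node from a slow or partitioned one, so the result is a suspicion rather than a verdict; a short timeout detects failures quickly but raises the false-positive rate.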

Failure detection: Lessons Learned

Workflows

Workflow – What is it?

To execute a series of actions asynchronously

What is a workflow?

What is not a workflow?

synchronous

asynchronous

Workflow – A simple script

Workflow – Recommended Approach

Activity 1

Activity 2

Activity 3

Activity 4

Workflow – Lessons Learned

Idempotent

metadata
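To illustrate why idempotent activities and recorded workflow metadata go together, the sketch below records each completed activity so that a retried workflow skips finished steps. It is a hypothetical example: `run_workflow`, the activity names, and the plain `dict` standing in for a durable metadata store are assumptions, not the Amazon Simple Workflow Service API.

```python
def run_workflow(workflow_id, activities, store):
    """Run activities in order, recording completions so a retry can resume."""
    done = store.setdefault(workflow_id, [])
    for name, action in activities:
        if name in done:
            continue  # already completed in an earlier attempt
        action()
        # A crash between action() and this append re-runs the activity on
        # retry, which is why each activity should be idempotent.
        done.append(name)


log = []
store = {}  # stand-in for a durable metadata store
state = {"fail_once": True}


def provision():
    log.append("provision")


def configure():
    if state["fail_once"]:
        state["fail_once"] = False
        raise RuntimeError("simulated crash")
    log.append("configure")


def register():
    log.append("register")


steps = [("provision", provision), ("configure", configure), ("register", register)]
try:
    run_workflow("wf-1", steps, store)
except RuntimeError:
    run_workflow("wf-1", steps, store)  # the retry resumes after "provision"
```

Here `provision` runs exactly once even though the workflow was started twice, because its completion was recorded before the simulated crash.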

Amazon Simple Workflow Service

What is the underlying problem?

Group Membership

Discovery

Metadata Store

Failure Detection

Workflows

Consensus

Paxos and consensus

single point of failure

Paxos at the bottom

broken

Consensus – Lessons Learned

consensus..

Paxos at Amazon

Group Membership

Discovery

Metadata Store

Failure Detection

Workflows

Lock Management

Amazon Kinesis

Amazon DynamoDB Streams

???

Summary

http://bit.ly/awsevals
