Resilient Cloud Applications

MICROSOFT CONFIDENTIAL – INTERNAL ONLY
Mark Simms (@mabsimms), Principal Program Manager, Windows Azure Customer Advisory Team

Upload: kailey

Post on 24-Feb-2016



TRANSCRIPT

Page 1: Resilient Cloud Applications

Resilient Cloud Applications
Mark Simms (@mabsimms)
Principal Program Manager
Windows Azure Customer Advisory Team

Page 2: Resilient Cloud Applications

Session Objectives

Designing resilient large-scale services requires careful design and architecture choices.

This session explores key patterns and practices for highly available cloud services, illustrated with customer examples.

Interactivity rocks: please ask questions throughout!

Page 3: Resilient Cloud Applications

Setting the Stage

Page 4: Resilient Cloud Applications

Setting the stage

• Scalability
• Availability
• Insight

Page 5: Resilient Cloud Applications

Setting the stage

• Maximize service availability for consumers: ensure customers (and client devices) can access and use the service
• Minimize impact of failure on consumers: degrade gracefully, isolate faults, fall back to alternate delivery paths
• Maximize performance and capacity: services that are “live” but cannot handle desired/required demand are not available

Page 6: Resilient Cloud Applications

Musings on application design

• Traditional web service design (N-tier)
• Make “everything stateless”

[Diagram: Load Balancer, Web Servers, App Servers]

Page 7: Resilient Cloud Applications

Musings on application design

• Traditional web service design (N-tier)
• Make “everything stateless”
• Separate logic from data (state)
• Leverage specialized external state services: cache, load balancer, relational database, document database, key/value store, etc.

[Diagram: Load Balancer, Web Servers, App Servers, Database, Distributed Cache, Doc Store, ...]

Page 8: Resilient Cloud Applications

Musings on application design

• No service is an island
• Dependencies on other internal and external services
• Trading time-to-market and agility for control

[Diagram: Load Balancer, Web Servers, App Servers, Database, Distributed Cache, Doc Store, ..., External Services (SendGrid, Twitter, Facebook, etc.)]

Page 9: Resilient Cloud Applications

What’s in a workload?

#1: without the relational database the application cannot fulfill any workloads

#2: the relational database is an external service, subject to partial availability

Page 10: Resilient Cloud Applications

Designing for Failure

Page 11: Resilient Cloud Applications

Decompose by Workload

• Applications are composed of one or more workloads
  • Products like SharePoint and Windows Server are designed with this principle in mind
• Each workload has different profiles, requirements and boundaries
  • Management, Availability, Operational, Cost, Health, Security, Capacity, etc.
• Decomposition allows for workload-specific optimization
  • Technology selections, scalability and availability approaches, etc.

Page 12: Resilient Cloud Applications

What are the “9”s

Availability %            Downtime per year   Downtime per month*   Downtime per week
90% ("one nine")          36.5 days           72 hours              16.8 hours
99% ("two nines")         3.65 days           7.20 hours            1.68 hours
99.9% ("three nines")     8.76 hours          43.2 minutes          10.1 minutes
99.99% ("four nines")     52.56 minutes       4.32 minutes          1.01 minutes
99.999% ("five nines")    5.26 minutes        25.9 seconds          6.05 seconds
99.9999% ("six nines")    31.5 seconds        2.59 seconds          0.605 seconds

• Study Windows Azure Platform SLAs:
  • Compute External Connectivity: 99.95% (2 or more instances)
  • Compute Instance Availability: 99.9% (2 or more instances)
  • Storage Availability: 99.9%
  • SQL Azure Availability: 99.9%
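The downtime figures in the table follow directly from the availability percentage. The Python sketch below reproduces them; it assumes a 365-day year and a 30-day month, which is what the table's numbers imply, and the function name is illustrative only.

```python
# Sketch: downtime budget implied by an availability percentage.
# Assumption: 365-day year and 30-day month (consistent with the table).
SECONDS_PER = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
}


def downtime_seconds(availability_percent: float, period: str) -> float:
    """Return the allowed downtime in seconds for the given period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return SECONDS_PER[period] * unavailable_fraction


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.95, 99.99):
        hours = downtime_seconds(pct, "year") / 3600
        print(f"{pct}% -> {hours:.2f} hours of downtime per year")
```

Running the same function against a platform SLA (for example 99.95%) gives the downtime a dependency is allowed before it breaches that SLA.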

Page 13: Resilient Cloud Applications

The Truth About 9s

SLA = *
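The formula on this slide did not survive extraction, so the following is an assumption and not the slide's content. One common reading is that a workload which needs every one of its dependencies can be no more available than the product of their availabilities. A minimal Python sketch of that arithmetic, using the platform SLA figures listed earlier:

```python
from math import prod


def composite_availability(availabilities_percent):
    """Availability (in percent) of a workload that needs ALL dependencies.

    Assumption: dependencies fail independently, so availabilities multiply.
    """
    return 100.0 * prod(a / 100.0 for a in availabilities_percent)


if __name__ == "__main__":
    # Compute connectivity, storage and SQL Azure, per the SLA list above.
    print(f"{composite_availability([99.95, 99.9, 99.9]):.3f}%")
```

Under this assumption the composite figure is always lower than the weakest individual SLA, which is why the per-service numbers alone can be misleading.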

Page 14: Resilient Cloud Applications

Define Your SLAs

Page 15: Resilient Cloud Applications

Design for Failure

• Given enough scale, time and pressure, all components or services will fail
• Your application will experience 1..N failures. How will your application behave?
  • Gracefully handle failure modes, continue to deliver value
  • Not so gracefully …
• Fault types:
  • Transient: temporary service interruptions, self-healing
  • Enduring: require intervention

Page 16: Resilient Cloud Applications

Failure Scope

• Node: individual nodes may fail (connectivity issues and other transient failures, hardware failures)
• Service: entire services may fail (service dependencies, internal and external; configuration and code issues)
• Region: regions may become unavailable (connectivity issues, acts of nature)

Page 17: Resilient Cloud Applications

Handling Transient and Enduring Failures

• Use fault-handling frameworks that recognize transient errors
• Make it part of the background “noise”
• Appropriate retry and backoff policies
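As an illustration of the retry-and-backoff idea, here is a minimal Python sketch. It is not the fault-handling framework the slide refers to: `TransientError` and `call_with_retry` are hypothetical names, and the exponential-with-jitter schedule is one reasonable choice among several.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_retry(operation, max_retries=3, base_delay=0.05, sleep=time.sleep):
    """Run operation(); retry transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: treat as an enduring failure
            delay = base_delay * (2 ** attempt)
            # Jitter spreads out retries from many clients hitting one service.
            sleep(delay + random.uniform(0, delay / 2))
```

Only errors classified as transient are retried; anything else propagates immediately, so an enduring fault is surfaced instead of being hidden behind repeated attempts.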

Page 18: Resilient Cloud Applications

Handling Transient and Enduring Failures

Page 19: Resilient Cloud Applications
Page 20: Resilient Cloud Applications

Handling Transient and Enduring Failures

[Chart: Web Request Response Latency; series: Avg Latency, Response latency]

• At some point, your request is blocking the line
• Fail gracefully, and get out of the queue!
• Anti-patterns:
  • Too much trust in downstream services and client proxies
  • Not bounding non-deterministic calls
  • Blocking synchronous operations
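One way to bound a non-deterministic call is to give it a deadline and return a fallback when the deadline passes. The Python sketch below shows the idea with the standard library; `call_with_deadline` and `slow_dependency` are hypothetical names, and the pool size is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Bounded pool: slow dependencies cannot consume unlimited threads.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(func, timeout_s, fallback):
    """Wait at most timeout_s for func(); return fallback if it is too slow."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # only has an effect if the call has not started yet
        return fallback


def slow_dependency():
    """Stand-in for a downstream call that takes too long."""
    time.sleep(0.3)
    return "late answer"


if __name__ == "__main__":
    print(call_with_deadline(slow_dependency, 0.05, "fallback"))
```

Note that the worker thread keeps running after the timeout; the caller gets out of the queue, but the underlying call should carry its own timeout as well.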

Page 21: Resilient Cloud Applications

Sample Retry Policies

Platform      Context                                  Sample target e2e latency (max)  “Fast First”  Retry count  Delay   Backoff
SQL Database  Synchronous (e.g. render web page)       200 ms                           Yes           3            50 ms   Linear
SQL Database  Asynchronous (e.g. process queue item)   60 seconds                       No            4            5 s     Exponential
Azure Cache   Synchronous (e.g. render web page)       100 ms                           Yes           3            10 ms   Linear
Azure Cache   Asynchronous (e.g. process queue item)   500 ms                           Yes           3            100 ms  Exponential
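The policies above can be written down as data and turned into a delay schedule. The Python sketch below is one interpretation, not the deck's definition: it assumes “Fast First” means the first retry is issued immediately, that linear backoff grows by the base delay each time, and that exponential backoff doubles it. `RetryPolicy` is a hypothetical type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    retry_count: int
    delay_ms: int
    backoff: str      # "linear" or "exponential"
    fast_first: bool  # assumption: first retry happens with no delay

    def delays_ms(self):
        """Return the wait before each retry, in milliseconds."""
        if self.retry_count <= 0:
            return []
        # A fast first retry uses up one of the retries.
        steps = self.retry_count - (1 if self.fast_first else 0)
        if self.backoff == "linear":
            waits = [self.delay_ms * (i + 1) for i in range(steps)]
        else:
            waits = [self.delay_ms * 2 ** i for i in range(steps)]
        return ([0] if self.fast_first else []) + waits


# Two rows of the table, transcribed as policies.
SQL_SYNC = RetryPolicy(retry_count=3, delay_ms=50, backoff="linear", fast_first=True)
SQL_ASYNC = RetryPolicy(retry_count=4, delay_ms=5000, backoff="exponential", fast_first=False)
```

Summing a schedule gives the worst-case time spent waiting, which can be checked against the target end-to-end latency for that context.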

Page 22: Resilient Cloud Applications

Circuit Breaker at Netflix

• A request to a remote service times out
• Thread pool and bounded task queue used to interact with a service dependency are at 100%
• Client library used to interact with a service dependency throws an exception

[Diagram: circuit On / Off; labels: Error Rate, Threshold, Criteria]
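To make the pattern concrete, here is a minimal Python sketch of an error-rate circuit breaker with a fallback. It is not Netflix's implementation; the class name, the sliding-window size, the threshold, and the cool-down period are all assumptions chosen for illustration.

```python
import time
from collections import deque


class CircuitBreaker:
    """Open the circuit when the recent error rate crosses a threshold."""

    def __init__(self, window=10, error_rate_threshold=0.5,
                 open_seconds=5.0, clock=time.monotonic):
        self._results = deque(maxlen=window)  # True means the call failed
        self._threshold = error_rate_threshold
        self._open_seconds = open_seconds
        self._clock = clock
        self._opened_at = None

    def call(self, operation, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._open_seconds:
                return fallback()       # open: do not touch the dependency
            self._opened_at = None      # cool-down over: allow calls again
            self._results.clear()
        try:
            result = operation()
        except Exception:
            self._results.append(True)
            window_full = len(self._results) == self._results.maxlen
            error_rate = sum(self._results) / len(self._results)
            if window_full and error_rate >= self._threshold:
                self._opened_at = self._clock()
            return fallback()
        self._results.append(False)
        return result
```

While the circuit is open, callers receive the fallback immediately instead of waiting on a dependency that is already failing.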

Page 23: Resilient Cloud Applications

Circuit Breaker at Netflix - Fallbacks

Page 24: Resilient Cloud Applications

Deployment Redundancy

Page 25: Resilient Cloud Applications

Failure Points

Definition: design elements that can cause an outage.

Focus on identifying design elements that are subject to external change. For example:

• Database connection
• Website connection
• Configuration file
• Registry key

Categories of common failure points: ACLs, database access, external web site/service access, transactions, configuration, capacity, network

Page 26: Resilient Cloud Applications

Failure Modes

Definition: a predictable root cause of the outage that occurs at a Failure Point.

Examples of failure modes:

• Configuration file is not in correct location
• Too much traffic overusing resources
• Database reaches maximum capacity

The following would not be considered a failure mode:

• Product bugs
• Symptoms of problems
• Informational occurrences

Page 27: Resilient Cloud Applications

Failure Mode Example

    public int GetBusinessData(string[] parameters)
    {
        try
        {
            var config = Config.Open(_configPath);          // failure point: configuration file
            var conn = ConnectToDB(config.ConnectString);   // failure point: database server
            var data = conn.GetData(_sproc, parameters);    // failure point: database table
            return data;
        }
        catch (Exception)
        {
            // Record the failure, then rethrow so the caller still sees it.
            WriteEventLogEvent(100, E_ExceptionInDal);
            throw;
        }
    }

Potential Failure Points:

• Database Server
• Database Table
• Configuration File

Potential Failure Modes:

• DB Server not responding
• DB offline
• DB access denied
• Sproc execute denied
• DB doesn’t exist
• DB timeout on connect
• Index corrupt
• Database corrupt
• Table doesn’t exist
• Table corrupt
• Config file missing or invalid

Page 28: Resilient Cloud Applications

Design for operations

Page 29: Resilient Cloud Applications

Running a Live Site Service

Page 30: Resilient Cloud Applications

Running without Insight / Telemetry

Page 31: Resilient Cloud Applications

Capturing Insight

• Log all internal/external “transactions” (database, web services, etc.)
  • Application context (module/component)
  • Host context (server/role/instance/process)
  • Timing information (start/stop/duration)
  • Activity identifier
• Consolidate logs to central system / dashboard for health monitoring and troubleshooting

Page 32: Resilient Cloud Applications

Capturing Insight

• Capture timing and context information through helper delegates (background noise)
• Capture contextual errors (inner exceptions, etc.) on error
• Logging library is asynchronous (fire-and-forget) to avoid blocking
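The points above can be sketched in Python with the standard `logging` module: a helper wraps an operation to capture timing, context and an activity identifier, and a queue-backed handler keeps the caller from blocking on log I/O. The logger name, the `timed` helper and the message layout are assumptions made for this example, not the deck's library.

```python
import logging
import logging.handlers
import queue
import time
import uuid
from contextlib import contextmanager

log_queue = queue.Queue()
logger = logging.getLogger("telemetry")  # hypothetical logger name
logger.setLevel(logging.INFO)
# QueueHandler only enqueues the record, so the caller is not blocked by I/O.
logger.addHandler(logging.handlers.QueueHandler(log_queue))


@contextmanager
def timed(operation, component, activity_id=None):
    """Log duration and context for the wrapped block; log errors with detail."""
    activity_id = activity_id or uuid.uuid4().hex
    start = time.perf_counter()
    try:
        yield activity_id
    except Exception:
        # logger.exception attaches the traceback, including chained causes.
        logger.exception("op=%s component=%s activity=%s failed",
                         operation, component, activity_id)
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        logger.info("op=%s component=%s activity=%s duration_ms=%.1f",
                    operation, component, activity_id, duration_ms)


if __name__ == "__main__":
    # A background thread drains the queue and writes the records.
    listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
    listener.start()
    with timed("GetBusinessData", "dal"):
        time.sleep(0.01)
    listener.stop()
```

Passing the same activity identifier through every call made on behalf of one request lets the consolidated logs be joined back into a single trace.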

Page 33: Resilient Cloud Applications

Many Options

Windows Azure Diagnostics

Page 34: Resilient Cloud Applications

Designing for Insight

• Instrument for production logging: if you didn’t capture it, it didn’t happen
• Implement inter-service monitoring and alerting: capture and quantify inter-service behavior and activity
• Run-time configurable logging: enable activation (capture or delivery) of additional channels at run-time

Page 35: Resilient Cloud Applications

Define ALM

[Diagram: application lifecycle. Labels: Dev Fabric, Code, Unit Test, Run, Check In, Build, Automated Test, Run, Test, Deploy, Dev on Azure, CI, Stage, Deploy, Test, Monitor, QA/Pre-release on Azure, Production Release on Azure, Log Defect, Defect/Feature Triage, Plan Fixes/Updates, Plan, Design, Scope]

Page 36: Resilient Cloud Applications

Updating Configuration

• For a production service, configuration == code
• Need rigorous ALM process for rolling out (and rolling back) updates to both

Page 37: Resilient Cloud Applications

Updating Services

“We want global, simultaneous production rollouts of our new code.” Are you sure about that?

Production rollouts:

• Running N, N+1 concurrently
• Rolling load over to N+1, ability to fall back
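Running N and N+1 side by side and rolling load across can be pictured as a traffic split that is raised in steps and can be reset at any time. The Python sketch below is a simplified illustration; `RolloutRouter` is a hypothetical class, and a real rollout would be driven by the load balancer or deployment tooling.

```python
import random


class RolloutRouter:
    """Send a share of requests to version N+1; the rest stay on version N."""

    def __init__(self, rng=random.random):
        self.new_share = 0.0  # fraction of traffic routed to N+1
        self._rng = rng

    def set_share(self, share):
        """Roll load over gradually, e.g. 0.05, then 0.25, then 1.0."""
        self.new_share = min(1.0, max(0.0, share))

    def fall_back(self):
        """Return all traffic to version N."""
        self.new_share = 0.0

    def pick(self):
        return "N+1" if self._rng() < self.new_share else "N"
```

Because version N keeps running throughout, falling back is a routing change instead of a redeployment.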

Page 38: Resilient Cloud Applications

What is a health model?

Managed Entity
• Logical piece of an application
• A component that makes sense to an operator
• Each entity has a health state
• Entities can be external or internal
• Multiple instances of an entity may exist

Aspect
• Break down health state by functional team
• Must be mutually exclusive
• Group by organizational responsibility, e.g. security, performance, backup
• May be specific or non-technology, e.g. orders shipped

Operational Condition
• Defines level of operation currently available
• Normal state is fully functional
• Well-designed applications may support partial operation, e.g. read only

Page 39: Resilient Cloud Applications

Troubleshooting Workflow

• Detection: is there a problem?
• Classification: what’s not working, how bad is it?
• Diagnosis: why is there a problem?
• Recovery: what needs to be done to fix it?
• Verification: is the problem really gone?

Page 40: Resilient Cloud Applications

Resources

• Failsafe: Guidance for Resilient Cloud Architectures (http://msdn.microsoft.com/en-us/library/jj853352.aspx)
• Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services (http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx)
• Designing and Deploying Internet Scale Services (https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf)

Page 41: Resilient Cloud Applications

Design for Scale

Page 42: Resilient Cloud Applications

Scale

[Diagram labels: Resources, Demands, Unit of Scale, Workloads]

Page 43: Resilient Cloud Applications

Scale by Units

Page 44: Resilient Cloud Applications

[Chart: Workload 1 and Workload 2; labels: Bottom, Ramp, Peak]

Page 45: Resilient Cloud Applications

Data Partitioning

Page 46: Resilient Cloud Applications

Understanding the 3Vs

Page 47: Resilient Cloud Applications

Understanding Queryability

Page 48: Resilient Cloud Applications

Horizontal Partitioning

Page 49: Resilient Cloud Applications

Vertical Partitioning

Page 50: Resilient Cloud Applications

Hybrid Partitioning

Page 51: Resilient Cloud Applications

Data – to cache or not to cache…

Page 52: Resilient Cloud Applications

Push vs. Pull

Load Balanced Push
• Sync and good for sequential processing
• Dependent on downstream services
• Throttling vs. performance

Managed Pull/Throughput
• Asynchronous and event-driven processing
• Easy parallelisation and pipelining
• Extending logic is easy

Logic based
• Priority
• Date
• Amount
• Etc.

Time based
• ASAP
• Gradually
• Periodically
• On-Demand

Volume based
• Single
• In Batches

Page 53: Resilient Cloud Applications

Data on the inside – Data on the outside

http://msdn.microsoft.com/en-us/library/ms954587.aspx

Reference Data
• Immutable (versions)
• Requires open schema for interop

Activity Data
• Low concurrency updates (e.g. shopping basket)

Resource (shared) Data
• Highly concurrent update (e.g. inventory)
• Should live in worker role

Page 54: Resilient Cloud Applications

“Query Ready” Cache

• Query patterns
  • Push the data close to where it is queried (example: BING Maps)
• Process, structure, produce, format etc. data and cache “query ready” data
• Light/cheap data production is OK
• Pure and idempotent operations are usually good candidates
• Duplication is OK
  • Same data in a different format
  • Same data in multiple places
• This requires processing data before it is queried, NOT at query time
• All data can be cached
• Some data can be cached:
  • Frequently used
  • Process-heavy, expensive data
  • Build as you go
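The “query ready” idea is that processing and formatting happen when data is published, so a query is only a lookup. A minimal Python sketch follows; `QueryReadyCache`, the key scheme and the `amount` field are hypothetical, and an in-process dictionary stands in for whatever cache service is actually used.

```python
import json


class QueryReadyCache:
    """Store results already shaped the way they will be queried."""

    def __init__(self):
        self._by_key = {}

    def publish(self, key, raw_rows):
        """Process and format at write time so that reads do no work."""
        summary = {
            "count": len(raw_rows),
            "total": sum(row["amount"] for row in raw_rows),
        }
        # Cache the final serialized form, not the raw rows.
        self._by_key[key] = json.dumps(summary, sort_keys=True)

    def query(self, key):
        """Return the pre-formatted payload, or None if nothing was published."""
        return self._by_key.get(key)
```

Publishing is a pure function of its input, so it can be repeated safely, and the same rows can be published under several keys in different formats.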

Page 55: Resilient Cloud Applications

Distributed Caching

• Simple to administer: no need to manage and host a distributed cache yourself
• Integrates easily into existing applications: ASP.NET session state and output cache providers enable no-code integration
• Same managed interfaces as Windows Server AppFabric Cache

[Diagram: On-Premises App (Core Logic, AppFabric Cache APIs, Windows Server AppFabric Cache) and Windows Azure App (Core Logic, AppFabric Cache APIs, Windows Azure AppFabric Caching)]

Page 56: Resilient Cloud Applications

Data Resiliency

Page 57: Resilient Cloud Applications

Backup and Restore

Page 58: Resilient Cloud Applications

Backing Up Table and Blob Storage

[Diagram labels: Source, Replica, Log, Log Replica]

Page 59: Resilient Cloud Applications

Managing Backed Up Data

Page 60: Resilient Cloud Applications

CDN

[Diagram: Content Delivery Network; pic1.jpg in the Blob Service and at three Edge Locations]