kaushik veeraraghavan - usenix · kaushik veeraraghavan, justin meza, david chou, wonho kim, sonia...

Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song Facebook Inc.


Each user request touches hundreds of systems

[Diagram labels: Web server, Newsfeed, DB, Cache, PYML, Ads, Search, Everstore, Scribe, Ptail, Coefficient, Laser]


The workload is constantly evolving

Many products

Growing user base

Rapid software change

• Facebook: 3 daily releases

• Instagram: continuous release


Goals

• How many machines does each software system need?

• Can we serve peak load?

• Are we operating efficiently?


Common approaches to capacity management

• Load modeling: simulate how system behaves at high load

• Load testing: benchmark using synthetic workloads

[Diagram labels: Web server, Newsfeed, DB, Cache, PYML, Ads, Search, Everstore, Scribe, Ptail, Coefficient, Laser]


Live user traffic is the most representative workload

• Accurate distribution of reads & writes

• Do not need a custom test setup


Live traffic load tests measure peak serving capacity

• Direct live user traffic at target


Live traffic load tests measure peak serving capacity safely

• Monitor health metrics
  • Response latency
  • Server error

• Reset load when thresholds are hit
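The safety check on this slide can be sketched as a comparison of measured health metrics against limits. This is only an illustration: the `Thresholds` type, the `violated` function, and the numeric limits are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical limits; real values would be tuned per system.
    max_latency_ms: float
    max_error_rate: float


def violated(latency_ms: float, error_rate: float, limits: Thresholds) -> list[str]:
    """Return the names of the health metrics that exceed their limits."""
    problems = []
    if latency_ms > limits.max_latency_ms:
        problems.append("latency")
    if error_rate > limits.max_error_rate:
        problems.append("error_rate")
    return problems


limits = Thresholds(max_latency_ms=200.0, max_error_rate=0.01)
healthy = violated(latency_ms=120.0, error_rate=0.002, limits=limits)
overloaded = violated(latency_ms=260.0, error_rate=0.002, limits=limits)
```

In this sketch a non-empty result is the signal to reset load; an empty result means the test may continue.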


Roadmap

• Kraken measures peak serving capacity at all scales
  • A single web server
  • A single cluster
  • An entire geographical region

• Kraken identifies bottlenecks limiting utilization
  • Load imbalance
  • Network saturation

• Challenges in deploying Kraken


Kraken uses weights to route requests

[Diagram labels: DNS, Edge POPs, Regions, Frontend Clusters, Service Cluster, Backend Cluster, Search, Newsfeed; weights: Edge weight, Cluster weight, Web server weight]
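One common way to route by weight is to pick each target with probability proportional to its weight. The sketch below assumes that scheme; the talk does not spell out the exact selection rule, and the cluster names, weights, and function names are made up for illustration.

```python
import random


def pick_target(weights: dict[str, float], rng: random.Random) -> str:
    """Pick one target with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def traffic_share(weights: dict[str, float]) -> dict[str, float]:
    """Expected fraction of requests each target receives."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical cluster weights: raising one weight shifts traffic toward it.
cluster_weights = {"cluster_a": 1.0, "cluster_b": 1.0, "cluster_c": 2.0}
shares = traffic_share(cluster_weights)
rng = random.Random(0)  # fixed seed so the sample is repeatable
sample = [pick_target(cluster_weights, rng) for _ in range(1000)]
```

Under this assumption, doubling a cluster's weight relative to its peers doubles its expected share of requests, which is what makes weights a convenient knob for directing more live traffic at a test target.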


Kraken measures a web server’s peak serving capacity

• Peak web server capacity: 175 requests per second (RPS)
• Production target: 90% utilization, i.e., 157 RPS


Kraken measures a cluster’s peak serving capacity

• Max cluster capacity = (web server capacity) * (num. web servers in cluster)

[Plot annotation: 90%]
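The capacity arithmetic from the web server and cluster slides can be worked through in a few lines. The 175 RPS peak and the 90% target come from the slides; rounding down to reach 157 RPS is an assumption about how that figure was derived, and the cluster size of 1000 servers is invented purely for the example.

```python
import math


def production_target_rps(peak_rps: float, target_utilization: float) -> int:
    """Per-server target, rounded down (assumed) so it stays within the goal."""
    return math.floor(peak_rps * target_utilization)


def max_cluster_capacity(webserver_capacity_rps: float, num_webservers: int) -> float:
    """Max cluster capacity = (web server capacity) * (num. web servers in cluster)."""
    return webserver_capacity_rps * num_webservers


# 175 * 0.90 = 157.5, which rounds down to the 157 RPS quoted on the slide.
per_server_target = production_target_rps(175, 0.90)

# 1000 servers is a made-up cluster size, not a figure from the talk.
cluster_peak = max_cluster_capacity(175, 1000)
```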


Kraken measures a region’s peak serving capacity

• We now serve 20% more users with the same infrastructure

[Plot annotations: 74% and 90%; years 2015 and 2016]


Inefficient load balancing limits utilization


Network saturation limits utilization


Challenge: non-linear response to traffic shifts

[Diagram labels: Web server, Newsfeed, DB, Cache, PYML, Ads, Search, Everstore, Scribe, Ptail, Coefficient, Laser, marked with "?"]


Challenge: how can we foster experimentation?

• Set conservative thresholds

• Communicate widely about tests

• Encourage collaboration
  • Monitoring
  • Failure mitigation strategies


Conclusion

• We’ve run 50+ regional, 1000+ cluster live traffic load tests in 3 years

• Kraken has helped us identify hundreds of bottlenecks and verify fixes

• We can now serve 20% more users with the same infrastructure


Kraken: assumptions and caveats

• Stateless servers

• Routable requests

• Load impacts downstream systems


Kraken: user traffic management

[Diagram labels: DNS, PoPs, Datacenter regions, Datacenter; Web LB, Frontend Clusters, Service LB, Service Cluster; weights: Edge weight, Cluster weight, Server weight]


Kraken: live traffic load tests

[Diagram labels: DNS, Edge POP, Web LB, Frontend Cluster; Kraken components: Health Monitor, Feedback Control, Traffic Shifter; arrow labels: Measure health, Increase/reset load, Update weights]
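The three labelled steps (measure health, increase/reset load, update weights) suggest a simple control loop. The sketch below is one possible reading, not Kraken's actual implementation: the pairing of each step with a component in the comments is an assumption, and `run_load_test`, `toy_monitor`, the step size, and the 1.6 cutoff are all hypothetical.

```python
from typing import Callable


def run_load_test(
    measure_health: Callable[[float], bool],
    baseline_weight: float,
    step: float,
    max_steps: int,
) -> tuple[float, list[float]]:
    """Step the weight up until health fails, then reset to baseline.

    Returns the highest weight that stayed healthy and the weights applied.
    """
    weight = baseline_weight
    best_healthy = baseline_weight
    applied = [weight]
    for _ in range(max_steps):
        candidate = weight + step            # feedback control: increase load
        applied.append(candidate)            # traffic shifter: update weights
        if measure_health(candidate):        # health monitor: measure health
            weight = best_healthy = candidate
        else:
            applied.append(baseline_weight)  # feedback control: reset load
            break
    return best_healthy, applied


# Toy health monitor: pretend the target degrades above weight 1.6.
def toy_monitor(weight: float) -> bool:
    return weight <= 1.6


peak, history = run_load_test(toy_monitor, baseline_weight=1.0, step=0.25, max_steps=10)
```

With these toy values the loop applies weights 1.25, 1.5, and 1.75, sees the health check fail at 1.75, and resets to the baseline of 1.0, reporting 1.5 as the highest healthy weight.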


Health metrics for systems affected by web load

Service type                            Metrics
Web servers                             CPU utilization, latency, error rate, fraction of operational servers
Aggregator–leaf                         CPU utilization, error rate, response quality
Proxygen (software L7 load balancer)    CPU utilization, latency, connections, retransmit rate, Ethernet utilization, memory capacity utilization
Memcache                                Latency, object lease count
TAO                                     CPU utilization, write success rate, read latency
Batch processor                         Queue length, exception rate
Logging                                 Error rate
Search                                  CPU utilization
Service discovery                       CPU utilization
Message delivery                        CPU utilization


Continuous runs measuring a web server’s capacity

[Plot annotation: 175]


Some lessons from a thousand Kraken tests

• Simplicity is key to Kraken’s success.

• Identifying the right performance, error rate and latency metrics to track is difficult.

• Cheap solutions, like allocating capacity or fixing misconfiguration, are often more impactful than profile-based tuning or system redesign.
