TRANSCRIPT
Tail Latency: Beyond Queuing Theory. Kathryn S. McKinley, Xi Yang, Stephen M. Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini.
Microsoft's Quincy datacenter
Servers in US datacenters
[Figure: servers in millions (0-20), 2006-2020, broken down into unbranded 2+ sockets, unbranded 1 socket, branded 2+ sockets, and branded 1 socket.]
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
Electricity in US datacenters
[Figure: billion kWh / year (0-200), 2000-2020, under the 2010 trend, current trend, Better, Bigger, B+B, Best Practices, and BP + Bigger scenarios; actual gap $1 to $6 billion.]
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
Datacenter economics quick facts*
~ $500,000: cost of one datacenter
~ 3,000,000: US datacenters in 2016
~ $1.5 trillion: US capital investment to date
~ $3,000,000,000: kWh dollars / year
~ $30,000,000: savings from 1% less work; lots more by not building a datacenter
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
Improve efficiency!
Tail Latency Matters
A two-second slowdown reduced revenue/user by 4.3%. [Eric Schurman, Bing]
A 400-millisecond delay decreased searches/user by 0.59%. [Jack Brutlag, Google]
TOP PRIORITY
Server architecture
[Diagram: a client request goes to an aggregator, which fans out to many workers.]
Characteristics of interactive services
[Figure: CDF of request latency for a latency-critical (LC) service; latency (ms) 0-100 versus percentage of requests 0-100.]
Bursty, diurnal load. The CDF changes slowly. The slowest server dictates the tail. Orders of magnitude difference between the average and the 99+ percentile.
Client side observations
[Figure: the same latency CDF; the slowest requests in the tail are marked "noise?".]
Client side observations
[Figure: the same latency CDF, with the tail ("noise?") annotated as roughly 10% and 5% of requests.]
Solution to noise: replication. All requests? The CDF shows the cost and the potential.
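To make "the CDF shows the cost and the potential" concrete, here is a toy sketch (not from the talk; the latency values and the 50 ms threshold are invented) that reads the replication cost straight off measured latencies:

#include <stdio.h>
#include <stdlib.h>

/* Share of requests slower than a threshold: the fraction of the CDF that
 * replication (or any other tail remedy) would have to cover. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double percentile(double *lat_ms, size_t n, double p) {
    qsort(lat_ms, n, sizeof(double), cmp_double);
    return lat_ms[(size_t)(p / 100.0 * (n - 1))];
}

static double tail_fraction(const double *lat_ms, size_t n, double thresh_ms) {
    size_t slow = 0;
    for (size_t i = 0; i < n; i++)
        if (lat_ms[i] > thresh_ms)
            slow++;
    return (double)slow / (double)n;
}

int main(void) {
    double lat[] = {3, 4, 4, 5, 5, 6, 7, 9, 12, 80};   /* invented sample */
    size_t n = sizeof(lat) / sizeof(lat[0]);
    printf("99th percentile: %.1f ms\n", percentile(lat, n, 99));
    printf("share over 50 ms: %.0f%%\n", 100 * tail_fraction(lat, n, 50));
    return 0;
}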
Client side observations
[Figure: the same latency CDF; much of the tail is marked "not noise?".]
Roadmap
What's in the tail? Continuous profiling to diagnose the tail. Real problems:
• Noise: replication
• Work: parallelism
• Other opportunities
Still poor utilization due to the bursty, diurnal workload:
• Colocation for utilization without impacting tail latency
Opportunities in hardware/software codesign
Simplified life of a request
[Diagram: a request passes through a series of workers, each a stack of application, Java VM, and worker OS / VM, before the response returns.]
Prior state of the art. Dick Sites' talk: https://www.youtube.com/watch?v=QBu2Ae8-8LM
Dick Sites & team
Hand-instrument the system on a 1% on-line budget; sample – but tails are rare. Off-line schematics. Have insight. Improve the system.

This talk replaces hand instrumentation and sampling (✗) with:
Automated instrumentation on a 1% on-line budget: continuous on-line profiling. Off-line schematics. Have insight. Improve the system, plus on-line optimization.
Automated cycle-level on-line profiling [ISCA'15 (Top Picks HM), ATC'16]
Insight: hardware and software generate signals. Hardware signals: performance counters, tags. Software signals: memory locations.

SHIM Design [ISCA'15 (Top Picks HM), ATC'16]
Observe global state from other core
[Figure: LLC misses per cycle over time.]
A SHIM observer thread on another core spins, reading the counters into a buffer:

while (true):
    for counter in [LLC_misses, cycles]:
        buf[i++] = readCounter(counter)
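A rough, self-contained sketch of such an observer loop on Linux using perf_event_open (my illustration, not the SHIM implementation; SHIM reads memory-mapped counters directly and far faster than read() allows, and the counter choice and buffer size here are made up):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Open one hardware counter for the calling process on any CPU. */
static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int miss_fd  = open_counter(PERF_COUNT_HW_CACHE_MISSES);
    int cycle_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    static uint64_t buf[1 << 20];
    size_t i = 0;

    /* Spin, logging (misses, cycles) pairs until the buffer is full. */
    while (i + 2 <= sizeof(buf) / sizeof(buf[0])) {
        uint64_t misses, cycles;
        read(miss_fd, &misses, sizeof(misses));
        read(cycle_fd, &cycles, sizeof(cycles));
        buf[i++] = misses;
        buf[i++] = cycles;
    }
    printf("collected %zu samples\n", i / 2);
    return 0;
}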
Observe local state with SMT hardware
[Figure: time series of HT1 IPC, Core IPC, and HT2 SHIM IPC, each on a 0-4 scale, for two hyperthreads HT1 and HT2 sharing a core.]

HT1 IPC = Core IPC − HT2 SHIM IPC

The observer runs on the second SMT lane (HT2) of the same core:

while (true):
    for counter in [HT2_SHIM, Core, cycles]:
        buf[i++] = readCounter(counter)
Correlate hardware & software events
[Figure: aligned time series of HT1 IPC, Core IPC, and HT2 SHIM IPC (0-4), with points 1, 2, 3 marking where HT1 executes methods A(), B(), C().]

while (true):
    for counter in [HT2_SHIM, Core, cycles]:
        buf[i++] = readCounter(counter)
    tid = thread on HT1
    buf[i++] = tid.method
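One way to picture the software-signal half (a toy sketch, not SHIM's code: the application thread publishes its current method ID to a shared memory word, and the observer samples that word alongside the hardware counters):

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Software signal: the running thread stores its current method ID here;
 * the observer on the other SMT lane simply loads it with each sample. */
static _Atomic uint32_t current_method;

/* Called from (hypothetical) instrumented method prologues. */
void enter_method(uint32_t method_id) {
    atomic_store_explicit(&current_method, method_id, memory_order_relaxed);
}

/* Observer side: pair each hardware sample with the software signal. */
void take_sample(uint64_t *buf, size_t *i, uint64_t cycles, uint64_t instructions) {
    buf[(*i)++] = cycles;
    buf[(*i)++] = instructions;
    buf[(*i)++] = atomic_load_explicit(&current_method, memory_order_relaxed);
}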
Fidelity
Raw samples
[Figure: log-log distribution of raw IPC samples, IPC (0.01-1000) versus percentage of samples (1e-05 to 1).]
[Timeline: counter readings (R0, C0), (R1, C1), (R2, C2), (R3, C3) taken over time yield IPC1, IPC2, IPC3 for the intervals between them; some intervals are valid (✓), some are not (✗).]
Counters: C = cycles, R = retired instructions.
IPC over interval t: IPC = (Rt − Rt-1) / (Ct − Ct-1)
Problem: samples are not atomic
[Timeline: each sample now brackets its counter reads with clock reads: (Cs0, R0, C0, Ce0), (Cs1, R1, C1, Ce1), (Cs2, R2, C2, Ce2), (Cs3, R3, C3, Ce3).]
Solution: use the clock as ground truth.
CPC over interval t: CPC = (Cet − Cet-1) / (Cst − Cst-1); this should be 1.
CPC1 = 1.0 ± 1%, CPC2 = 1.0 ± 1%, CPC3 ≠ 1.0 ± 1%: discard that interval.
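A minimal post-processing sketch of that filter (my reconstruction from the formulas above, not the SHIM code; the 1% tolerance matches the slide):

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* One raw sample: clock read before (cs) and after (ce) the retired
 * instruction (r) and cycle (c) counter reads. */
struct sample { uint64_t cs, r, c, ce; };

/* Keep an interval only if the bracketing clock agrees with the cycle
 * counter to within 1%: CPC = dCe / dCs should be 1.  Returns the number
 * of IPC values written to ipc_out. */
size_t filter_ipc(const struct sample *s, size_t n, double *ipc_out) {
    size_t kept = 0;
    for (size_t t = 1; t < n; t++) {
        double cpc = (double)(s[t].ce - s[t - 1].ce) /
                     (double)(s[t].cs - s[t - 1].cs);
        if (fabs(cpc - 1.0) > 0.01)
            continue;   /* the reads were not atomic enough; drop the interval */
        ipc_out[kept++] = (double)(s[t].r - s[t - 1].r) /
                          (double)(s[t].c - s[t - 1].c);
    }
    return kept;
}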
Filtering Lusearch IPC samples
[Figure: log-log distributions, IPC (0.01-1000) versus percentage of samples (1e-05 to 1), for raw IPC, raw CPC, filtered IPC, and filtered CPC in [0.99, 1.01].]
IPC of individual methods in Lucene
[Figure: IPC (0-1.6) of the top 10 methods (74% of total execution time), sampled at the default 1 KHz, the interrupt maximum 100 KHz, and SHIM at 10 MHz.]
Overheads from other core
[Figure: overhead normalized to running without SHIM when sampling method and loop IDs at 30-cycle and 1213-cycle periods; the overheads come from write invalidations.]
3 MHz: 1+ order of magnitude over the interrupt 'maximum'. 113 MHz: 3+ orders of magnitude over the interrupt 'maximum'.
Understanding Tail Latency
SHIM signals
Requests: thread ids, request id (software configured), time stamps, PC.
System threads: thread ids, time stamp, PC.
All requests
[Figure: client latency and average queueing time (ms, 0-120) across request groups, from the slowest 1% to the fastest 1%.]
Longest 200 requests
[Figure: latency (ms, 0-120) of the 200 slowest requests, broken down into network and network queueing time, idle time, CPU time, and dispatch queueing time (legend: network & other, idle, CPU work, queuing at worker); annotations: "not noise" and "noise, bursts? queuing?".]
Parallelism
[Figure: Lucene 99th percentile latency (ms, 0-1500) versus requests per second (0-50) for sequential and 4-way parallel execution; parallelism improves the tail at low load and degrades it at high load.]
Parallelism historically was for throughput. Idea: parallelism for tail latency.
Insight: long requests reveal themselves.
Approach: incrementally add parallelism to long requests (the tail) based on request progress and load.
Few-to-Many Dynamic Parallelism [ASPLOS'15]
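A bare-bones sketch of the few-to-many idea (my paraphrase, not the ASPLOS'15 implementation; the interval table, the load threshold, and the helper functions are invented placeholders):

#include <stdint.h>

/* Hypothetical hooks supplied by the server runtime. */
extern uint64_t now_ms(void);
extern int      requests_in_flight(void);        /* load signal */
extern void     add_worker_thread(void *request);

struct request {
    uint64_t start_ms;
    int      parallelism;   /* threads currently working on this request */
};

/* Called at request progress points.  Every request starts sequential;
 * only requests still running after each interval boundary -- the
 * emerging tail -- get another thread, and under high load the extra
 * threads are withheld so throughput does not collapse. */
void few_to_many_step(struct request *r, int max_parallelism) {
    static const uint64_t interval_ms[] = {0, 50, 100};   /* invented */
    uint64_t age = now_ms() - r->start_ms;
    int target = 1;

    for (int i = 0; i < 3; i++)
        if (age >= interval_ms[i])
            target = i + 1;

    if (requests_in_flight() > 6)   /* invented threshold */
        target = 1;
    if (target > max_parallelism)
        target = max_parallelism;

    while (r->parallelism < target) {
        add_worker_thread(r);
        r->parallelism++;
    }
}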
Evaluation: 2x8 64-bit 2.3 GHz Xeon, 64 GB
[Figure: Lucene tail latency (ms, 300-1500) versus requests per second (30-48) for Few-to-Many and Sequential; annotations: "buy fewer servers", "reduce tail latency".]
Queuing theory: optimizing average latency maximizes throughput, but not the tail! Shortening the tail reduces queuing latency.
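For intuition, textbook M/M/1 queueing numbers (my addition, not from the talk; λ is the arrival rate, μ the service rate) make the point concrete:

\[
  E[T] = \frac{1}{\mu - \lambda}, \qquad
  \Pr[T > t] = e^{-(\mu - \lambda)t}, \qquad
  t_{99} = \frac{\ln 100}{\mu - \lambda} \approx 4.6\, E[T]
\]

Even in this simplest model the 99th percentile sits several times above the mean, and anything that stretches service times stretches queueing delay with it.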
Longest 200 requests
[Figure (repeated): latency breakdown of the 200 slowest requests (network & other, idle, CPU work, queuing at worker); a ✔ now marks the component addressed so far, leaving "noise, bursts? queuing?".]
Correlate bad requests with system state
[Diagram: a request's thread and GC threads interleaved across CPU0, CPU1, …, CPUN.]
Use time stamps to post-process traces.
[Diagram: the CPU 0 timeline t0, t1, …, ti, tj, tk, tl showing which thread (e.g., thread thx) ran when.]
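A simple post-processing sketch of that correlation (illustrative only; the trace format and the focus on GC threads are assumptions): given a slow request's [start, end) timestamps and the GC threads' execution intervals recovered from the trace, attribute overlap time to the request.

#include <stddef.h>
#include <stdint.h>

struct interval { uint64_t start, end; };   /* half-open [start, end) */

/* Overlap between two intervals, zero if they do not intersect. */
static uint64_t overlap(struct interval a, struct interval b) {
    uint64_t lo = a.start > b.start ? a.start : b.start;
    uint64_t hi = a.end   < b.end   ? a.end   : b.end;
    return hi > lo ? hi - lo : 0;
}

/* Time a slow request spent co-scheduled with GC activity, given GC
 * intervals from the trace.  (Concurrent GC threads may double count;
 * good enough for flagging suspects.) */
uint64_t gc_time_during_request(struct interval request,
                                const struct interval *gc, size_t n_gc) {
    uint64_t total = 0;
    for (size_t i = 0; i < n_gc; i++)
        total += overlap(request, gc[i]);
    return total;
}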
Recap & what’s next
SHIM continuous profiling to diagnose the tail:
• Noise: replication
• Work: parallelism
• Scalability bottlenecks
Continuous monitoring suggests dynamic optimizations, but utilization is still poor due to the bursty, diurnal workload:
• Colocation
Looking forward
Queuing theory: over-provision for the maximum burst; otherwise queuing delay degrades average and tail latency.
[Figure: Lucene running alone, 50th, 95th, and 99th percentile latency (ms, 0-200) versus RPS (0-180).]
High Responsiveness—Low Utilization
Intel Xeon-D 1540 Broadwell, 1 core, no SMT. Service Level Objective: 100 ms SLO.
Luiz André Barroso and Urs Hölzle, "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines": "Such WSCs tend to have relatively low average utilization, spending most of its time in the 10-50% CPU utilization range."
[Figure: the latency-critical (LC) Lucene workload alone: 99th percentile latency (ms, 0-200) and utilization normalized to no SMT (0-2) versus RPS (0-180); annotation: 34% w/ SMT, 67% no SMT.]
Soak up Slack with Batch?
[Diagram: LC and batch co-running on different cores with SMT turned off, sharing the last-level cache, versus co-running on the same core in SMT lanes (t1, t2 on each core).]
Goal: no tail latency impact. [TOCS'16, EuroSys'14] require idle cores, in part because OS descheduling is slow.
SMT Co-Runner
[Figure: 1 core, 2 SMT lanes; Lucene 99th percentile latency (ms, 0-600) versus RPS (10-100) for Lucene alone, with an IPC 1.0 co-runner, and with an IPC 0.01 co-runner, against the 100 ms SLO.]
Even an IPC 0.01 co-runner violates the SLO at low load!
The co-runners: while(1); spins at IPC 1.0, and while(1) { movnti(); mfence(); } stalls at IPC 0.01.
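For reference, a hedged sketch of what those two antagonists might look like with compiler intrinsics (my reconstruction; the talk only shows the two loops above):

#include <emmintrin.h>   /* _mm_stream_si32, _mm_mfence (SSE2) */

/* High-IPC antagonist: an empty spin loop retires roughly one
 * instruction per cycle. */
void spin_ipc_high(void) {
    for (;;) { }
}

/* Low-IPC antagonist: a non-temporal store (movnti) followed by a full
 * memory fence stalls the pipeline, retiring on the order of 0.01
 * instructions per cycle. */
void spin_ipc_low(void) {
    static int sink;
    for (;;) {
        _mm_stream_si32(&sink, 1);
        _mm_mfence();
    }
}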
Great utilization, though!
[Figure: utilization normalized to no SMT (0-2) versus RPS (10-100) with the SMT co-runner.]
Simultaneous Multithreading OFF
[Diagram: with SMT off, lucene alone has the core's functional units, load/store queue, and issue logic lanes to itself over time.]

Simultaneous Multithreading ON
[Diagram: with SMT on, lane 1 (lucene) and lane 2 (the IPC 0.01 co-runner) share the core: functional units are dynamically shared, the load/store queue is statically partitioned, and the issue logic lanes are round-robin shared.]
Active SMT lanes share critical resources.
Principled Borrowing
[Diagram: over time, the LC lane alternates between busy and idle; the batch lane runs only in the idle gaps.]
Batch borrows hardware when LC is idle. Batch releases hardware when LC is busy.
Can we implement principled borrowing on current hardware?
Hardware is Ready — Software is Not
[Timeline: LC and batch lanes over time.]
The batch lane calls "mwait". The thread sleeps, releasing the hardware to the OS (~2K cycles).
The OS supports thread sleeping, but not hardware sleeping (releasing the SMT hardware to the other lane): the OS schedules the batch lane with any ready job.
nanonap()
A thread invoking nanonap releases the SMT hardware without releasing the SMT context.

per_cpu_variable: nap_flag;

void nanonap() {
    enter_kernel();
    disable_preemption();
    my_nap_flag = this_cpu_flag(nap_flag);   /* per-lane wake-up flag */
    monitor(my_nap_flag);                    /* arm MONITOR on the flag */
    mwait();                                 /* lane sleeps until the flag is written */
    enable_preemption();
    leave_kernel();
}

The OS can interrupt and wake up the thread; the OS cannot schedule the hardware context.
Elfen Scheduler
Instrument batch workloads to detect LC threads and nap.
Bind latency-critical threads to the LC lane; bind batch threads to the batch lane.
No change to latency-critical threads.

[Timeline: (1) the batch thread borrows resources and continuously checks the LC lane's status; (2) an LC request starts, and the batch thread calls nanonap() to release the SMT hardware resources; (3) the OS touches nap_flag to wake the batch thread when the LC lane goes idle.]

/* fast-path check injected into batch method bodies */
check:
    if (!request_lane_idle)
        slow_path();

slow_path() {
    nanonap();
}

/* maps lane IDs to the running task; exposed as a SHIM signal: cpu_task_map */
task_switch(task T) {
    cpu_task_map[this_cpu] = T;
}

idle_task() {
    /* wake up any waiting batch thread */
    update_nap_flag_of_partner_lane();
    ...
}
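Putting the pieces together, a hypothetical batch-side loop under this style of scheduling (a sketch only; lc_lane_is_idle(), get_work_item(), and the user-level nanonap() entry point are assumed interfaces, not the Elfen API):

#include <stdbool.h>

/* Assumed interfaces: a SHIM-style signal exposing whether the partner
 * LC lane is idle, and the nanonap entry point from the previous slide. */
extern bool lc_lane_is_idle(void);    /* reads cpu_task_map for the LC lane */
extern void nanonap(void);            /* sleeps this lane via monitor/mwait */
extern bool get_work_item(void **item);
extern void process_briefly(void *item);

/* Batch thread pinned to the batch SMT lane of a core whose other lane
 * is reserved for latency-critical work. */
void batch_lane_loop(void) {
    void *item;
    while (get_work_item(&item)) {
        /* Fast path: before each small quantum of work, yield the SMT
         * hardware if the LC lane has become busy. */
        if (!lc_lane_is_idle())
            nanonap();   /* returns once the OS writes our nap_flag */
        process_briefly(item);
    }
}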
Results: Borrow Idle
[Figure: 1 core, 2 SMT lanes; Lucene 99th percentile latency (ms, 0-140) versus RPS (10-100) for Lucene alone and co-running with each batch workload (antlr, bloat, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan).]
[Figure: utilization normalized to no SMT (0-2) versus RPS; annotation: increased utilization 10x-1.5x.]
[Figure: 7 cores, 2x7 SMT lanes; 99th percentile latency (ms, 0-140) and utilization (normalized to no SMT) versus RPS (200-1000); annotations: 4x-0.19x, same latency!]
Exciting times
Hardware heterogeneity – opportunity & challenge
[Diagram: heterogeneous processors (big, little, custom) and memory (DDR, NVM, flash, paired PIM).]
Heterogeneous workload!
[Figure: the request-latency CDF again, showing widely varying request demand.]
Heterogeneous hardware dominates homogeneous hardware for throughput, performance, and energy given a fixed power budget and variable request demand. Slow-to-Fast: sacrifice average latency a bit to reduce energy and tail latency.
Requirements pull for heterogeneity! [DISC’14, ICAC’13, submission]
Thank you
Extras
Online self scheduling
[Timeline diagram of self-scheduling decisions for several concurrent requests.]

|requests|   Interval 0 = 0          Interval 1, 2 = 50, 100
≤ 2          @0 parallelism = 3
3            @0 parallelism = 1      @50, parallelism = 3
4-6          @50 parallelism = 1     @100, parallelism = 3
≥ 7          @exit parallelism = 1   @100, parallelism = 3
Software & hardware
Lucene: open-source enterprise search; English Wikipedia, 10 GB index of 33 million pages; 10k queries from Lucene nightly tests.
Bing: web search with one Index Serving Node (ISN); 160 GB web index on SSD, 17 GB cache; 30k Bing user queries.
Hardware: 2x8 64-bit 2.3 GHz Xeon, 64 GB, Windows; 15 request servers, 1 core issues requests; target parallelism = 24 threads.
Policies
Sequential, N-way: a single degree of parallelism for each request.
Adaptive: select the parallelism degree when the request starts, using system load. [EuroSys'13]
Request Clairvoyant: parallelizes long requests via perfect prediction of the tail.
FM, Few-to-Many: incrementally add parallelism.
Fixed interval: add a thread every X ms.
[Figure: Lucene tail latency (ms, 300-1500) versus RPS (30-48) for Sequential, 4-way, and fixed intervals of 20 ms, 100 ms, and 500 ms.]
Long intervals are good at high load; short intervals are good at low load.
Load variation
Alternate between high and low load: FM adapts to bursts with low variance.
[Figure: Lucene tail latency (ms, 0-1600) for Sequential, 2-way, 4-way, and FM as the load alternates low, low, high, high.]
Fewer servers: Total Cost of Ownership
[Figure: tail latency (ms, 300-1500) versus Lucene RPS (30-48): FM versus Sequential (21%) and FM versus Adaptive (9%).]