tail latency: beyond queuing theory · tail latency: beyond queuing theory kathryn s mckinley xi...

69
Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Upload: others

Post on 14-Oct-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Page 2: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini
Page 3: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

3

Microsoft’s Quincy datacenter

Page 4: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Servers in US datacenters

4

Serv

ers i

n m

illion

s

06 2008 2010 2012 2014 2016 2018 2020

20

0

4

8

16

12

Unbranded 2+ sockets Unbranded 1 socket Branded 2+ sockets Branded 1 socket

*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.

Page 5: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

2010 trend Current trend Better Bigger B+B Best Practices BP + Bigger

Billio

n KW

Hou

rs /

Year

2010 trend Current trend Better Bigger B+B Best Practices BP + Bigger

200 175 150 125 100 75 50 25 0 2000 2005 2010 2015 2020

Actual $1 to $6 billion

Electricity in US datacenters

*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.

Page 6: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

~ $30,000,000 Savings from 1% less work Lots more by not building a datacenter

Datacenter economics quick facts*

6

~ $500,000 Cost of one datacenter ~3,000,000 US datacenters in 2016

~ $1.5 trillion US Capital investment to date ~ $3,000,000,000 KW dollars / year

*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.

Page 7: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Improve efficiency!

7

Page 8: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini
Page 9: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Tail Latency Matters

9

Twosecondslowdownreducedrevenue/userby4.3%.[EricSchurman,Bing]

400milliseconddelaydecreasedsearches/userby0.59%.[JackBrutlag,Google]

TOP

PRIORITY

Page 10: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

10

Server architecture

aggregator

workers

client

Page 11: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

11

Characteristics of interactive services

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

0 20 40 60 80 100

0

20

40

60

80

100

Perc

enta

ge o

f re

quest

s

Latency (ms)

LC

�Bursty,diurnal�CDFchangesslowly�Slowestserverdictatestail�Ordersofmagnitudediffaverage&99+%Qle

Page 12: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

12

Client side observations

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

0 20 40 60 80 100

0

20

40

60

80

100

Perc

enta

ge o

f re

quest

s

Latency (ms)

noise?

Page 13: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

13

Client side observations

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

0 20 40 60 80 100

0

20

40

60

80

100

Perc

enta

ge o

f re

quest

s

Latency (ms)

noise?

Solution to noise Replication •  All requests? •  CFD shows cost

& potential

10 % of requests 5% of requests

Page 14: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

14

Client side observations

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

0 20 40 60 80 100

0

20

40

60

80

100

Perc

enta

ge o

f re

quest

s

Latency (ms)

not noise?

Page 15: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Roadmap

What’s in the tail? Continuous profiling to diagnose the tail Real problems

•  Noise: replication •  Work: parallelism •  Other opportunities

Still poor utilization due to bursty diurnal workload •  Colocation for utilization without impacting tail latency

Opportunities in hardware/software codesign 15

Page 16: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

16

application

worker OS / VM

Java VM

application

worker OS / VM

Java VM

application

worker OS / VM

Java VM

Simplified life of a request

request response

Page 17: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Prior state of the art Dick Site’s talk: https://www.youtube.com/watch?v=QBu2Ae8-8LM

17

Page 18: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Dick Sites & team

18

Hand instrument system 1% on-line budget sample – but tails are rare… Off-line schematics Have insight Improve the system

Page 19: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Dick Sites & team

19

Hand instrument system 1% on-line budget sample – but tails are rare… Off-line schematics Have insight Improve the system

Page 20: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Dick Sites & team

20

Automated instrumentation 1% on-line budget continuous on-line profiling Off-line schematics Have insight Improve the system + On-line optimization

✗Hand instrument system 1% on-line budget sample – but tails are rare… Off-line schematics Have insight Improve the system

Page 21: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

counters tags

Automated cycle-level on-line profiling

Insight Hardware & software generate signals

21

[ISCA’15(TopPicksHM),ATC’16]

21

hardware signals software signals performance counters memory locations

✓ ✓

✓ ✓

Page 22: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

SHIM Design ISCA’15 (Top Picks HM), ATC’16

22

Page 23: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Observe global state from other core

23

23

LLCmissespercycle

while(true):forcounterinLLCmisses,cycles:buf[i++]=readCounter(counter)

Page 24: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Observe local state with SMT hardware

24

24

HT1

HT2

0

4

HT1IPC

0

4

CoreIPC

0

4HT2SHIMIPC

HT1IPC=CoreIPC–HT2SHIMIPC

while(true):forcounterinHT2SHIM,Core,Cycles:buf[i++]=readCounter(counter);

Page 25: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Correlate hardware & software events

01234

HT1IPC

01234

CoreIPC

01234

HT2SHIMIPC

1

2

3

A()B()C()

HT1

HT2

while(true):forcounterinHT2SHIM,Core,cycles:buf[i++]=readCounter(counter);tid=threadonHT1buf[i++]=tid.method;

Page 26: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Fidelity

26

Page 27: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

27

1e-05

0.0001

0.001

0.01

0.1

1

0.01 0.1

1 10 100 1000

IPC(logscale)

%ofsamples(logscale)

Raw samples

Page 28: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

28

!me

R0C0

IPC1 IPC2 IPC3

R1C1 R2C2 R3C3

IPC=(Rt–Rt-1)/(Ct–Ct-1)

✗✓ ✓

CountersC:cyclesR:reQredinstrucQons

Problem: samples are not atomic

Page 29: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

29

!me

IPC1 IPC2 IPC3

✗✓ ✓

Solution: use clock as ground truth CPC=(Cet–Cet-1)/(Cst–Cst-1)thisshouldbe1!

CPC1=1.0+/-1% CPC2=1.0+/-1% CPC3!=1.0+/-1%

Cs0R0C0Ce0 Cs1R1C1Ce1 Cs2R2C2Ce2 Cs3R3C3Ce3

Page 30: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

30

1e-05

0.0001

0.001

0.01

0.1

1

0.01 0.1

1 10 100 1000

1e-05

0.0001

0.001

0.01

0.1

1

0.01 0.1

1 10 100 1000

----rawIPC

%ofsam

ples(logscale)

1e-05

0.0001

0.001

0.01

0.1

1

0.01 0.1

1 10 100 1000

----rawCPC----filteredIPC

----filteredCPCin[0.99,1.01]

Filtering Lusearch IPC samples

Page 31: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

31

top10methods(74%totalexecuQonQme)

IPC

00.20.40.60.81

1.21.41.6

1 2 3 4 5 6 7 8 9 10

default1KHz maximum100KHz SHIM10MHz

IPC of individual methods in Lucene

Page 32: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

32

0

0.5

1

1.5

2

2.5

3

3.5

4

30cycles 1213cycles

methodandloopIDs

Normalize

dto

with

outS

HIM

OverheadsfromwriteinvalidaQons

3MHz:1+orderofmagnitudeoverinterrupt‘maximum’

113MHz:3+ordersofmagnitudeoverinterrupt‘maximum’

Overheads from other core

Page 33: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Understanding Tail Latency

33

Page 34: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

SHIM signals

Requests •  thread ids •  request id (software configured) •  time stamps, PC System threads •  thread ids •  time stamp, PC

34

Page 35: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

All requests

35

0

20

40

60

80

100

120

0 20 40 60 80 100

late

ncy

(m

s)

Request groups (from the slowest 1% to the fastest 1%)

Client latency

Average queueing time

Page 36: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Longest 200 requests

36

0

20

40

60

80

100

120

0 50 100 150

200

late

ncy

(ms)

Top 200 requests

Network and networking queueing time

Idle time

CPU time

Dispatch queueing time

laten

cy

Network & other Idle CPU work Queuing at worker

not noise

noise, bursts? queuing?

Page 37: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

0 0

0 0

0 0 0 0

0 0

0 0

Parallelism

0 300 600 900

1200 1500

0 10 20 30 40 50

Late

ncy

ms

Lucene RPS

Sequential 99th 4 way 99th

improves at low load

degrades at high load

Parallelismhistoricallyforthroughput

ParallelismfortaillatencyIdea

Page 38: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

38

Parallelism

InsightApproach

Longrequestsrevealthemselves

Incrementallyaddparallelismtolongrequests–thetail–basedonrequestprogress&load

Parallelismhistoricallyforthroughput

ParallelismfortaillatencyIdea

Few-to-Many Dynamic Parallelism [ASPLOS’15]

0 0

0 0

0 0 0 0

0 0

0 0

Page 39: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Evaluation 2x8 64 bit 2.3 GHz Xeon, 64 GB

300

600

900

1200

1500

30 32 34 36 38 40 42 44 46 48

Tail

late

ncy

ms

Requests per Second

buy fewer servers

reduce tail latency

Few to Many Sequential

Page 40: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Queuing theory Optimizing average latency maximizes throughput But not the tail! Shortening the tail reduces queuing latency 40

Page 41: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Longest 200 requests

41

0

20

40

60

80

100

120

0 50 100 150

200

late

ncy

(ms)

Top 200 requests

Network and networking queueing time

Idle time

CPU time

Dispatch queueing time

laten

cy

Network & other Idle CPU work Queuing at worker

noise, bursts? queuing?

Page 42: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Correlate bad requests with system state

request CPU0

CPU1

… CPUN

GC thread

GC thread

GC thread

Use time stamps to post-process traces

CPU 0

t0 t1 … ti tj tk tl thread thx

time

Page 43: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Recap & what’s next

SHIM continuous profiling to diagnose the tail •  Noise: replication •  Work: parallelism •  Scalability bottlenecks

Continuous monitoring suggests dynamic optimizations but… still poor utilization due to bursty diurnal workload •  Colocation

Looking forward

43

Page 44: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Queuing theory Over provision for maximum burst, otherwise queuing delay degrades average and tail latency

44

Page 45: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

0

50

100

150

200

0 20 40 60 80 100 120 140 160 180

Late

ncy

(ms)

RPS

Lucene alone 50%ile Lucene alone 95%ile Lucene alone 99%ile

High Responsiveness—Low Utilization

45

IntelXeon-D1540Broadwell

LC

LuizAndréBarroso,UrsHölzle“TheDatacenterasaComputer:AnIntroducQontotheDesignofWarehouse-ScaleMachines”

“SuchWSCstendtohaverelaQvelylowaverageuQlizaQon,spendingmostofitsQmeinthe10-50%CPUuQlizaQonrange.”

1core,noSMT

ServiceLevelObjecQve100msSLO

0

50

100

150

200

0 20 40 60 80 100 120 140 160 180

Late

ncy

(ms)

RPS

Lucene alone 99%ile

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

0 20 40 60 80 100 120 140 160 180

Uti

lizat

ion

/ no

SM

T

RPS

34%w/SMT67%noSMT

LC

Page 46: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Soak up Slack with Batch?

LC batch

core core core core

sharedcache

Co-runningondifferentcoresSMTturnedoff

GoalNotaillatencyimpact[TOCS’16,EuroSys’14]requiresidlecoresinpartbecauseOSdeschedulingisslow

LC idle LCidlecorebatch

sharedcache sharedcache

core core core core t1 t2 t1 t2 t1 t2 t1 t2t1 t2 t1 t2 t1 t2 t1 t2

Co-runningondifferentcoresSMTturnedoff

Co-runningonsamecoreinSMTlanes

Page 47: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

0

100

200

300

400

500

600

10 20 30 40 50 60 70 80 90 100

99%i

le la

tenc

y (m

s)

RPS

Lucene alone with IPC 1.0 with IPC 0.01

SMT Co-Runner

47

EvenIPC0.01violatesSLOatlowload!

while(1);

while(1){movnti();mfence();}

IPC1.0

IPC0.01

1core,2SMTlanes

SLO

GreatuQlizaQon!

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20 30 40 50 60 70 80 90 100

Uti

lizat

ion

/ no

SM

T

RPS

Page 48: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

48

Func!onalUnits

LoadStoreQueue

IssueLogicLanes

Qmelucene

Simultaneous Multithreading OFF

Page 49: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

49

lane1

Func!onalUnits

LoadStoreQueue

IssueLogicLanes

Dynamicallyshared

StaQcallyparQQoned

Round-robinshared

Qmelucene

IPC0.01

AcQveSMTlanessharecriQcalresources

lane2

Simultaneous Multithreading ON

Page 50: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Principled Borrowing

50

Qme

busyidle

BatchborrowshardwarewhenLCisidle

BatchreleaseshardwarewhenLCisbusy

Canweimplementprincipledborrowingoncurrenthardware?

LC

batch

Page 51: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Hardware is Ready — Software is Not

51

Qme

Batchlanecalls“mwait”

Threadsleeps,releasinghardwaretoOS(~2Kcycles)

OSsupportsthreadsleeping,butnothardwaresleepingreleaseSMThardwaretootherlane

OSschedulesbatchlanewithanyreadyjob

LC

batch

Page 52: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

nanonap()

52

ThreadinvokingnanonapreleasesSMThardwarewithoutreleasingSMTcontext

per_cpu_variable:nap_flag;voidnanonap(){enter_kernel();disable_preemption();my_nap_flag=this_cpu_flag(nap_flag);monitor(my_nap_flag);mwait();enable_preemption();leave_kernel();}

OScaninterrupt&wakeupthreadOScannotschedulehardwarecontext

Page 53: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Elfen Scheduler

53

InstrumentbatchworkloadstodetectLCthreads&nap

Bindlatency-criQcalthreadstoLClaneBindbatchthreadstobatchlane

Nochangetolatency-criQcalthreads

Page 54: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Elfen Scheduler

54

Qme

12

1 Batchthreadborrowsresources,conQnuouslychecksLClanestatus

2 LCstarts,batchcallsnanonap()toreleaseSMThardwareresources

OStouchesnap_flagtowakeupbatchthread

/*fastpathcheckinjectedintomethodbody*/check:if(!request_lane_idle)slow_path();slow_path(){nanonap();}

1

2

/*mapslaneIDstotherunningtask*/exposedSHIMsignal:cpu_task_maptask_switch(taskT){cpu_task_map[thiscpu]=T;}idle_task(){//wakeupanywaitingbatchthreadupdate_nap_flag_of_partner_lane();......}

LC

batch

3

3

3

nanonap()

Page 55: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

0

20

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%i

le la

tenc

y (m

s)

RPS

w antlr

w bloat

w eclipse

w fop

w hsqldb

w jython

w luindex

w lusearch

w pmd

w xalan

Lucene alone

Results: Borrow Idle

55

1core,2SMTlanes

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20 30 40 50 60 70 80 90 100

Uti

lizat

ion

/ no

SM

T

RPS

increaseduQlizaQon10x-1.5x

0

20

40

60

80

100

120

140

200 400 600 800 1000

99%i

le la

tenc

y (m

s)

RPS

7cores,2x7SMTlanes

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 400 600 800 1000

Uti

lizat

ion

/ no

SM

T

RPS

4x-0.19x

samelatency!

Page 56: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Exciting times

56

Page 57: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

57

big

litle

000

000

0 0

0 0

0 0

0 0

000

000

0 0

0 0

0 0

0 0

custom

Hardware heterogeneity – opportunity & challenge

Processors Memory

DDRNVM flash

PIMpaired

Page 58: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Heterogeneous workload!

58

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

0 20 40 60 80 100

0

20

40

60

80

100

Perc

enta

ge o

f re

quest

s

Latency (ms)

Page 59: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Heterogeneous hardware dominates homogeneous hardware for throughput, performance, and energy with a fixed power budget & variable request demand Slow-to-Fast sacrifice average a bit to reduce energy & tail latency

59

Requirements pull for heterogeneity! [DISC’14, ICAC’13, submission]

Page 60: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

60

Page 61: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

61

Page 62: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Thank you

62

Page 63: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Extras

63

Page 64: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Online self scheduling

0 0

0 0

0

0

|requests| Interval0=0 Interval1,2=50,100≤2 @0parallelism=33 @0parallelism=1 @50,parallelism=34-6 @50parallelism=1 @100,parallelism=3≥7 @exitparallelism=1 @100,parallelism=3

Page 65: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Software & hardware

Lucene open source enterprise search Wikipedia English 10 GB index of 33 million pages 10k queries from Lucene nightly tests

Bing web search with one Index Serving Node (ISN) 160 GB web index in SSD, 17 GB cache 30k Bing user queries

Hardware 2x8 64 bit 2.3 GHz Xeon, 64 GB Windows 15 request servers, 1 core issues requests Target parallelism = 24 threads

65

Page 66: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Policies Sequential N way single degree of parallelism for each

request Adaptive Select parallelism degree when request starts using system load [EUROSYS’13] Request Clairvoyant parallelizes long requests by

perfect prediction of tail FM Few to Many incrementally add parallelism

66

Page 67: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

300

600

900

1200

1500

30 32 34 36 38 40 42 44 46 48

Tail l

atenc

y m

s Lucene RPS

Sequential

4 way

Fixed interval 20 ms

Fixed interval 100 ms

Fixed interval 500 ms

Fixed interval Add thread every X ms

0 0

0 0

0 0 0 0

0 0

0 0

Long intervals good at high load

Short intervals good at low load

Page 68: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Load variation

Alternatebetweenhigh&lowloadFMadaptstoburstswithlowvariance

68

0

400

800

1200

1600

Tail l

atenc

y m

s

Lucene RPS

Sequential 2 way 4 way FM

Low Low High High

Page 69: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Fewer servers: Total Cost of ownership

300

600

900

1200

1500

30 32 34 36 38 40 42 44 46 48

Tail

late

ncy

ms

Lucene RPS

Sequential FM

21%

30 32 34 36 38 40 42 44 46 48

Lucene RPS

Adaptive FM

9%