TRANSCRIPT
Tail Latency: Beyond Queuing Theory. Kathryn S. McKinley, Xi Yang, Stephen M. Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini.
Microsoft's Quincy datacenter
Servers in US datacenters
[Figure: servers in millions (0-20), 2006-2020, broken down into unbranded 2+ sockets, unbranded 1 socket, branded 2+ sockets, and branded 1 socket.]
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
Electricity in US datacenters
[Figure: billion kWh / year (0-200), 2000-2020, under the 2010 trend, current trend, Better, Bigger, B+B, Best Practices, and BP + Bigger scenarios; actual gap $1 to $6 billion.]
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
Datacenter economics quick facts*
~ $500,000: cost of one datacenter
~ 3,000,000: US datacenters in 2016
~ $1.5 trillion: US capital investment to date
~ $3,000,000,000: kWh dollars / year
~ $30,000,000: savings from 1% less work; lots more by not building a datacenter
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
Improve efficiency!
Tail Latency Matters
A two-second slowdown reduced revenue/user by 4.3%. [Eric Schurman, Bing]
A 400-millisecond delay decreased searches/user by 0.59%. [Jack Brutlag, Google]
TOP PRIORITY
Server architecture
[Diagram: a client request goes to an aggregator, which fans out to many workers.]
Characteristics of interactive services
[Figure: CDF of request latency for a latency-critical (LC) service; latency (ms) 0-100 versus percentage of requests 0-100.]
Bursty, diurnal load. The CDF changes slowly. The slowest server dictates the tail. Orders of magnitude difference between the average and the 99+ percentile.
Client side observations
[Figure: the same latency CDF; the slowest requests in the tail are marked "noise?".]
Client side observations
[Figure: the same latency CDF, with the tail ("noise?") annotated as roughly 10% and 5% of requests.]
Solution to noise: replication. All requests? The CDF shows the cost and the potential.
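To make "the CDF shows the cost and the potential" concrete, here is a toy sketch (not from the talk; the latency values and the 50 ms threshold are invented) that reads the replication cost straight off measured latencies:

#include <stdio.h>
#include <stdlib.h>

/* Share of requests slower than a threshold: the fraction of the CDF that
 * replication (or any other tail remedy) would have to cover. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double percentile(double *lat_ms, size_t n, double p) {
    qsort(lat_ms, n, sizeof(double), cmp_double);
    return lat_ms[(size_t)(p / 100.0 * (n - 1))];
}

static double tail_fraction(const double *lat_ms, size_t n, double thresh_ms) {
    size_t slow = 0;
    for (size_t i = 0; i < n; i++)
        if (lat_ms[i] > thresh_ms)
            slow++;
    return (double)slow / (double)n;
}

int main(void) {
    double lat[] = {3, 4, 4, 5, 5, 6, 7, 9, 12, 80};   /* invented sample */
    size_t n = sizeof(lat) / sizeof(lat[0]);
    printf("99th percentile: %.1f ms\n", percentile(lat, n, 99));
    printf("share over 50 ms: %.0f%%\n", 100 * tail_fraction(lat, n, 50));
    return 0;
}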
Client side observations
[Figure: the same latency CDF; much of the tail is marked "not noise?".]
Roadmap
What's in the tail? Continuous profiling to diagnose the tail. Real problems:
• Noise: replication
• Work: parallelism
• Other opportunities
Still poor utilization due to the bursty, diurnal workload:
• Colocation for utilization without impacting tail latency
Opportunities in hardware/software codesign
Simplified life of a request
[Diagram: a request passes through a series of workers, each a stack of application, Java VM, and worker OS / VM, before the response returns.]
Prior state of the art. Dick Sites' talk: https://www.youtube.com/watch?v=QBu2Ae8-8LM
Dick Sites & team
Hand-instrument the system on a 1% on-line budget; sample – but tails are rare. Off-line schematics. Have insight. Improve the system.

This talk replaces hand instrumentation and sampling (✗) with:
Automated instrumentation on a 1% on-line budget: continuous on-line profiling. Off-line schematics. Have insight. Improve the system, plus on-line optimization.
Automated cycle-level on-line profiling [ISCA'15 (Top Picks HM), ATC'16]
Insight: hardware and software generate signals. Hardware signals: performance counters, tags. Software signals: memory locations.

SHIM Design [ISCA'15 (Top Picks HM), ATC'16]
Observe global state from other core
[Figure: LLC misses per cycle over time.]
A SHIM observer thread on another core spins, reading the counters into a buffer:

while (true):
    for counter in [LLC_misses, cycles]:
        buf[i++] = readCounter(counter)
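A rough, self-contained sketch of such an observer loop on Linux using perf_event_open (my illustration, not the SHIM implementation; SHIM reads memory-mapped counters directly and far faster than read() allows, and the counter choice and buffer size here are made up):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Open one hardware counter for the calling process on any CPU. */
static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int miss_fd  = open_counter(PERF_COUNT_HW_CACHE_MISSES);
    int cycle_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    static uint64_t buf[1 << 20];
    size_t i = 0;

    /* Spin, logging (misses, cycles) pairs until the buffer is full. */
    while (i + 2 <= sizeof(buf) / sizeof(buf[0])) {
        uint64_t misses, cycles;
        read(miss_fd, &misses, sizeof(misses));
        read(cycle_fd, &cycles, sizeof(cycles));
        buf[i++] = misses;
        buf[i++] = cycles;
    }
    printf("collected %zu samples\n", i / 2);
    return 0;
}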
Observe local state with SMT hardware
[Figure: time series of HT1 IPC, Core IPC, and HT2 SHIM IPC, each on a 0-4 scale, for two hyperthreads HT1 and HT2 sharing a core.]

HT1 IPC = Core IPC − HT2 SHIM IPC

The observer runs on the second SMT lane (HT2) of the same core:

while (true):
    for counter in [HT2_SHIM, Core, cycles]:
        buf[i++] = readCounter(counter)
Correlate hardware & software events
[Figure: aligned time series of HT1 IPC, Core IPC, and HT2 SHIM IPC (0-4), with points 1, 2, 3 marking where HT1 executes methods A(), B(), C().]

while (true):
    for counter in [HT2_SHIM, Core, cycles]:
        buf[i++] = readCounter(counter)
    tid = thread on HT1
    buf[i++] = tid.method
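One way to picture the software-signal half (a toy sketch, not SHIM's code: the application thread publishes its current method ID to a shared memory word, and the observer samples that word alongside the hardware counters):

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Software signal: the running thread stores its current method ID here;
 * the observer on the other SMT lane simply loads it with each sample. */
static _Atomic uint32_t current_method;

/* Called from (hypothetical) instrumented method prologues. */
void enter_method(uint32_t method_id) {
    atomic_store_explicit(&current_method, method_id, memory_order_relaxed);
}

/* Observer side: pair each hardware sample with the software signal. */
void take_sample(uint64_t *buf, size_t *i, uint64_t cycles, uint64_t instructions) {
    buf[(*i)++] = cycles;
    buf[(*i)++] = instructions;
    buf[(*i)++] = atomic_load_explicit(&current_method, memory_order_relaxed);
}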
Fidelity
Raw samples
[Figure: log-log distribution of raw IPC samples, IPC (0.01-1000) versus percentage of samples (1e-05 to 1).]
[Timeline: counter readings (R0, C0), (R1, C1), (R2, C2), (R3, C3) taken over time yield IPC1, IPC2, IPC3 for the intervals between them; some intervals are valid (✓), some are not (✗).]
Counters: C = cycles, R = retired instructions.
IPC over interval t: IPC = (Rt − Rt-1) / (Ct − Ct-1)
Problem: samples are not atomic
[Timeline: each sample now brackets its counter reads with clock reads: (Cs0, R0, C0, Ce0), (Cs1, R1, C1, Ce1), (Cs2, R2, C2, Ce2), (Cs3, R3, C3, Ce3).]
Solution: use the clock as ground truth.
CPC over interval t: CPC = (Cet − Cet-1) / (Cst − Cst-1); this should be 1.
CPC1 = 1.0 ± 1%, CPC2 = 1.0 ± 1%, CPC3 ≠ 1.0 ± 1%: discard that interval.
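A minimal post-processing sketch of that filter (my reconstruction from the formulas above, not the SHIM code; the 1% tolerance matches the slide):

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* One raw sample: clock read before (cs) and after (ce) the retired
 * instruction (r) and cycle (c) counter reads. */
struct sample { uint64_t cs, r, c, ce; };

/* Keep an interval only if the bracketing clock agrees with the cycle
 * counter to within 1%: CPC = dCe / dCs should be 1.  Returns the number
 * of IPC values written to ipc_out. */
size_t filter_ipc(const struct sample *s, size_t n, double *ipc_out) {
    size_t kept = 0;
    for (size_t t = 1; t < n; t++) {
        double cpc = (double)(s[t].ce - s[t - 1].ce) /
                     (double)(s[t].cs - s[t - 1].cs);
        if (fabs(cpc - 1.0) > 0.01)
            continue;   /* the reads were not atomic enough; drop the interval */
        ipc_out[kept++] = (double)(s[t].r - s[t - 1].r) /
                          (double)(s[t].c - s[t - 1].c);
    }
    return kept;
}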
Filtering Lusearch IPC samples
[Figure: log-log distributions, IPC (0.01-1000) versus percentage of samples (1e-05 to 1), for raw IPC, raw CPC, filtered IPC, and filtered CPC in [0.99, 1.01].]
IPC of individual methods in Lucene
[Figure: IPC (0-1.6) of the top 10 methods (74% of total execution time), sampled at the default 1 KHz, the interrupt maximum 100 KHz, and SHIM at 10 MHz.]
Overheads from other core
[Figure: overhead normalized to running without SHIM when sampling method and loop IDs at 30-cycle and 1213-cycle periods; the overheads come from write invalidations.]
3 MHz: 1+ order of magnitude over the interrupt 'maximum'. 113 MHz: 3+ orders of magnitude over the interrupt 'maximum'.
Understanding Tail Latency
SHIM signals
Requests: thread ids, request id (software configured), time stamps, PC.
System threads: thread ids, time stamp, PC.
All requests
[Figure: client latency and average queueing time (ms, 0-120) across request groups, from the slowest 1% to the fastest 1%.]
Longest 200 requests
[Figure: latency (ms, 0-120) of the 200 slowest requests, broken down into network and network queueing time, idle time, CPU time, and dispatch queueing time (legend: network & other, idle, CPU work, queuing at worker); annotations: "not noise" and "noise, bursts? queuing?".]
Parallelism
[Figure: Lucene 99th percentile latency (ms, 0-1500) versus requests per second (0-50) for sequential and 4-way parallel execution; parallelism improves the tail at low load and degrades it at high load.]
Parallelism historically was for throughput. Idea: parallelism for tail latency.
Insight: long requests reveal themselves.
Approach: incrementally add parallelism to long requests (the tail) based on request progress and load.
Few-to-Many Dynamic Parallelism [ASPLOS'15]
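A bare-bones sketch of the few-to-many idea (my paraphrase, not the ASPLOS'15 implementation; the interval table, the load threshold, and the helper functions are invented placeholders):

#include <stdint.h>

/* Hypothetical hooks supplied by the server runtime. */
extern uint64_t now_ms(void);
extern int      requests_in_flight(void);        /* load signal */
extern void     add_worker_thread(void *request);

struct request {
    uint64_t start_ms;
    int      parallelism;   /* threads currently working on this request */
};

/* Called at request progress points.  Every request starts sequential;
 * only requests still running after each interval boundary -- the
 * emerging tail -- get another thread, and under high load the extra
 * threads are withheld so throughput does not collapse. */
void few_to_many_step(struct request *r, int max_parallelism) {
    static const uint64_t interval_ms[] = {0, 50, 100};   /* invented */
    uint64_t age = now_ms() - r->start_ms;
    int target = 1;

    for (int i = 0; i < 3; i++)
        if (age >= interval_ms[i])
            target = i + 1;

    if (requests_in_flight() > 6)   /* invented threshold */
        target = 1;
    if (target > max_parallelism)
        target = max_parallelism;

    while (r->parallelism < target) {
        add_worker_thread(r);
        r->parallelism++;
    }
}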
Evaluation: 2x8 64-bit 2.3 GHz Xeon, 64 GB
[Figure: Lucene tail latency (ms, 300-1500) versus requests per second (30-48) for Few-to-Many and Sequential; annotations: "buy fewer servers", "reduce tail latency".]
Queuing theory: optimizing average latency maximizes throughput, but not the tail! Shortening the tail reduces queuing latency.
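For intuition, textbook M/M/1 queueing numbers (my addition, not from the talk; λ is the arrival rate, μ the service rate) make the point concrete:

\[
  E[T] = \frac{1}{\mu - \lambda}, \qquad
  \Pr[T > t] = e^{-(\mu - \lambda)t}, \qquad
  t_{99} = \frac{\ln 100}{\mu - \lambda} \approx 4.6\, E[T]
\]

Even in this simplest model the 99th percentile sits several times above the mean, and anything that stretches service times stretches queueing delay with it.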
Longest 200 requests
[Figure (repeated): latency breakdown of the 200 slowest requests (network & other, idle, CPU work, queuing at worker); a ✔ now marks the component addressed so far, leaving "noise, bursts? queuing?".]
Correlate bad requests with system state
[Diagram: a request's thread and GC threads interleaved across CPU0, CPU1, …, CPUN.]
Use time stamps to post-process traces.
[Diagram: the CPU 0 timeline t0, t1, …, ti, tj, tk, tl showing which thread (e.g., thread thx) ran when.]
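A simple post-processing sketch of that correlation (illustrative only; the trace format and the focus on GC threads are assumptions): given a slow request's [start, end) timestamps and the GC threads' execution intervals recovered from the trace, attribute overlap time to the request.

#include <stddef.h>
#include <stdint.h>

struct interval { uint64_t start, end; };   /* half-open [start, end) */

/* Overlap between two intervals, zero if they do not intersect. */
static uint64_t overlap(struct interval a, struct interval b) {
    uint64_t lo = a.start > b.start ? a.start : b.start;
    uint64_t hi = a.end   < b.end   ? a.end   : b.end;
    return hi > lo ? hi - lo : 0;
}

/* Time a slow request spent co-scheduled with GC activity, given GC
 * intervals from the trace.  (Concurrent GC threads may double count;
 * good enough for flagging suspects.) */
uint64_t gc_time_during_request(struct interval request,
                                const struct interval *gc, size_t n_gc) {
    uint64_t total = 0;
    for (size_t i = 0; i < n_gc; i++)
        total += overlap(request, gc[i]);
    return total;
}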
Recap & what’s next
SHIM continuous profiling to diagnose the tail:
• Noise: replication
• Work: parallelism
• Scalability bottlenecks
Continuous monitoring suggests dynamic optimizations, but utilization is still poor due to the bursty, diurnal workload:
• Colocation
Looking forward
Queuing theory: over-provision for the maximum burst; otherwise queuing delay degrades average and tail latency.
[Figure: Lucene running alone, 50th, 95th, and 99th percentile latency (ms, 0-200) versus RPS (0-180).]
High Responsiveness—Low Utilization
Intel Xeon-D 1540 Broadwell, 1 core, no SMT. Service Level Objective: 100 ms SLO.
Luiz André Barroso and Urs Hölzle, "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines": "Such WSCs tend to have relatively low average utilization, spending most of its time in the 10-50% CPU utilization range."
[Figure: the latency-critical (LC) Lucene workload alone: 99th percentile latency (ms, 0-200) and utilization normalized to no SMT (0-2) versus RPS (0-180); annotation: 34% w/ SMT, 67% no SMT.]
Soak up Slack with Batch?
[Diagram: LC and batch co-running on different cores with SMT turned off, sharing the last-level cache, versus co-running on the same core in SMT lanes (t1, t2 on each core).]
Goal: no tail latency impact. [TOCS'16, EuroSys'14] require idle cores, in part because OS descheduling is slow.
SMT Co-Runner
[Figure: 1 core, 2 SMT lanes; Lucene 99th percentile latency (ms, 0-600) versus RPS (10-100) for Lucene alone, with an IPC 1.0 co-runner, and with an IPC 0.01 co-runner, against the 100 ms SLO.]
Even an IPC 0.01 co-runner violates the SLO at low load!
The co-runners: while(1); spins at IPC 1.0, and while(1) { movnti(); mfence(); } stalls at IPC 0.01.
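For reference, a hedged sketch of what those two antagonists might look like with compiler intrinsics (my reconstruction; the talk only shows the two loops above):

#include <emmintrin.h>   /* _mm_stream_si32, _mm_mfence (SSE2) */

/* High-IPC antagonist: an empty spin loop retires roughly one
 * instruction per cycle. */
void spin_ipc_high(void) {
    for (;;) { }
}

/* Low-IPC antagonist: a non-temporal store (movnti) followed by a full
 * memory fence stalls the pipeline, retiring on the order of 0.01
 * instructions per cycle. */
void spin_ipc_low(void) {
    static int sink;
    for (;;) {
        _mm_stream_si32(&sink, 1);
        _mm_mfence();
    }
}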
Great utilization, though!
[Figure: utilization normalized to no SMT (0-2) versus RPS (10-100) with the SMT co-runner.]
Simultaneous Multithreading OFF
[Diagram: with SMT off, lucene alone has the core's functional units, load/store queue, and issue logic lanes to itself over time.]

Simultaneous Multithreading ON
[Diagram: with SMT on, lane 1 (lucene) and lane 2 (the IPC 0.01 co-runner) share the core: functional units are dynamically shared, the load/store queue is statically partitioned, and the issue logic lanes are round-robin shared.]
Active SMT lanes share critical resources.
Principled Borrowing
[Diagram: over time, the LC lane alternates between busy and idle; the batch lane runs only in the idle gaps.]
Batch borrows hardware when LC is idle. Batch releases hardware when LC is busy.
Can we implement principled borrowing on current hardware?
Hardware is Ready — Software is Not
[Timeline: LC and batch lanes over time.]
The batch lane calls "mwait". The thread sleeps, releasing the hardware to the OS (~2K cycles).
The OS supports thread sleeping, but not hardware sleeping (releasing the SMT hardware to the other lane): the OS schedules the batch lane with any ready job.
nanonap()
A thread invoking nanonap releases the SMT hardware without releasing the SMT context.

per_cpu_variable: nap_flag;

void nanonap() {
    enter_kernel();
    disable_preemption();
    my_nap_flag = this_cpu_flag(nap_flag);   /* per-lane wake-up flag */
    monitor(my_nap_flag);                    /* arm MONITOR on the flag */
    mwait();                                 /* lane sleeps until the flag is written */
    enable_preemption();
    leave_kernel();
}

The OS can interrupt and wake up the thread; the OS cannot schedule the hardware context.
Elfen Scheduler
Instrument batch workloads to detect LC threads and nap.
Bind latency-critical threads to the LC lane; bind batch threads to the batch lane.
No change to latency-critical threads.

[Timeline: (1) the batch thread borrows resources and continuously checks the LC lane's status; (2) an LC request starts, and the batch thread calls nanonap() to release the SMT hardware resources; (3) the OS touches nap_flag to wake the batch thread when the LC lane goes idle.]

/* fast-path check injected into batch method bodies */
check:
    if (!request_lane_idle)
        slow_path();

slow_path() {
    nanonap();
}

/* maps lane IDs to the running task; exposed as a SHIM signal: cpu_task_map */
task_switch(task T) {
    cpu_task_map[this_cpu] = T;
}

idle_task() {
    /* wake up any waiting batch thread */
    update_nap_flag_of_partner_lane();
    ...
}
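Putting the pieces together, a hypothetical batch-side loop under this style of scheduling (a sketch only; lc_lane_is_idle(), get_work_item(), and the user-level nanonap() entry point are assumed interfaces, not the Elfen API):

#include <stdbool.h>

/* Assumed interfaces: a SHIM-style signal exposing whether the partner
 * LC lane is idle, and the nanonap entry point from the previous slide. */
extern bool lc_lane_is_idle(void);    /* reads cpu_task_map for the LC lane */
extern void nanonap(void);            /* sleeps this lane via monitor/mwait */
extern bool get_work_item(void **item);
extern void process_briefly(void *item);

/* Batch thread pinned to the batch SMT lane of a core whose other lane
 * is reserved for latency-critical work. */
void batch_lane_loop(void) {
    void *item;
    while (get_work_item(&item)) {
        /* Fast path: before each small quantum of work, yield the SMT
         * hardware if the LC lane has become busy. */
        if (!lc_lane_is_idle())
            nanonap();   /* returns once the OS writes our nap_flag */
        process_briefly(item);
    }
}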
Results: Borrow Idle
[Figure: 1 core, 2 SMT lanes; Lucene 99th percentile latency (ms, 0-140) versus RPS (10-100) for Lucene alone and co-running with each batch workload (antlr, bloat, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan).]
[Figure: utilization normalized to no SMT (0-2) versus RPS; annotation: increased utilization 10x-1.5x.]
[Figure: 7 cores, 2x7 SMT lanes; 99th percentile latency (ms, 0-140) and utilization (normalized to no SMT) versus RPS (200-1000); annotations: 4x-0.19x, same latency!]
Exciting times
Hardware heterogeneity – opportunity & challenge
[Diagram: heterogeneous processors (big, little, custom) and memory (DDR, NVM, flash, paired PIM).]
Heterogeneous workload!
[Figure: the request-latency CDF again, showing widely varying request demand.]
Heterogeneous hardware dominates homogeneous hardware for throughput, performance, and energy given a fixed power budget and variable request demand. Slow-to-Fast: sacrifice average latency a bit to reduce energy and tail latency.
Requirements pull for heterogeneity! [DISC’14, ICAC’13, submission]
Thank you
Extras
Online self scheduling
[Timeline diagram of self-scheduling decisions for several concurrent requests.]

|requests|   Interval 0 = 0          Interval 1, 2 = 50, 100
≤ 2          @0 parallelism = 3
3            @0 parallelism = 1      @50, parallelism = 3
4-6          @50 parallelism = 1     @100, parallelism = 3
≥ 7          @exit parallelism = 1   @100, parallelism = 3
Software & hardware
Lucene: open-source enterprise search; English Wikipedia, 10 GB index of 33 million pages; 10k queries from Lucene nightly tests.
Bing: web search with one Index Serving Node (ISN); 160 GB web index on SSD, 17 GB cache; 30k Bing user queries.
Hardware: 2x8 64-bit 2.3 GHz Xeon, 64 GB, Windows; 15 request servers, 1 core issues requests; target parallelism = 24 threads.
Policies
Sequential, N-way: a single degree of parallelism for each request.
Adaptive: select the parallelism degree when the request starts, using system load. [EuroSys'13]
Request Clairvoyant: parallelizes long requests via perfect prediction of the tail.
FM, Few-to-Many: incrementally add parallelism.
Fixed interval: add a thread every X ms.
[Figure: Lucene tail latency (ms, 300-1500) versus RPS (30-48) for Sequential, 4-way, and fixed intervals of 20 ms, 100 ms, and 500 ms.]
Long intervals are good at high load; short intervals are good at low load.
Load variation
Alternate between high and low load: FM adapts to bursts with low variance.
[Figure: Lucene tail latency (ms, 0-1600) for Sequential, 2-way, 4-way, and FM as the load alternates low, low, high, high.]
Fewer servers: Total Cost of Ownership
[Figure: tail latency (ms, 300-1500) versus Lucene RPS (30-48): FM versus Sequential (21%) and FM versus Adaptive (9%).]