05-07-2002, CS757
Lock Behaviour Characterization of Commercial Workloads
Jichuan Chang
Xidong Wang

Outline
• Motivation
• Methods
• Results
• Speculative Lock Elision Issues
• Conclusions

Motivation
• Understanding the synchronization behavior of commercial workloads (OLTP, Apache, SpecJBB)
• Identifying opportunities for Speculative Lock Elision (performance, ease of programming)

Questions to Answer
• Lock-related statistics
  – Can hardware identify critical sections?
  – Critical section size
  – Lock-free section size
  – Amount of lock contention
• Hardware optimizations by speculation
  – Context switching implications
  – Resource requirements
• Other issues
  – Realistic timing model
  – Other synchronization (reader/writer, etc.)
[Figure: execution timeline over time, divided into lock-free section, critical section, and contention (spin/wait)]
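The three phases on the timeline can be turned into per-workload fractions once each lock episode is reduced to three instruction counts. The sketch below is only an illustration: the `LockEpisode` record, its field names, and the assumption that episodes on one CPU do not nest or overlap are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class LockEpisode:
    # Hypothetical record: instruction counts observed on one CPU.
    first_try: int   # first acquisition attempt
    acquired: int    # successful acquisition
    released: int    # lock release

def phase_breakdown(episodes, total_instr):
    """Split total_instr into contention, critical-section and lock-free shares.

    Assumes the episodes do not nest or overlap (a simplification).
    """
    contention = sum(e.acquired - e.first_try for e in episodes)
    critical = sum(e.released - e.acquired for e in episodes)
    lock_free = total_instr - contention - critical
    return {
        "contention": contention / total_instr,
        "critical_section": critical / total_instr,
        "lock_free": lock_free / total_instr,
    }

if __name__ == "__main__":
    eps = [LockEpisode(100, 120, 150), LockEpisode(400, 400, 450)]
    print(phase_breakdown(eps, total_instr=1000))
```

Counting instructions rather than cycles matches the measurement unit used in the Methods slide.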

Methods
• Benchmarks
  – OLTP, Apache, JBB, Barnes (for comparison)
• Full system simulation (tracing) using Simics
  – Simple timing model: Simics tracer
  – Ruby timing model: Simics + Ruby
  – Using #instr (not #cycle) as the measurement unit
  – Set cpu_switch_time to 1, disable STC
• Validating our approach
  – Using micro-benchmarks to compare our stats with the results reported by kernel tools (lockstat)
  – Tracing into disassembly code (kernel/user)

Lock Identification
• Basic idea [from SLE]
  – Lock acquisition must use one atomic instruction.
  – Silent store pair: as a pair, the stores in lock acquisition and release operations are silent.
• SPARC v9 atomic instructions
  – ldstub, swap, casa (compare-and-swap)
• Example (OLTP), lock values 0x0->0xff … 0xff->0x0:
    ldstub [%o0 + %g0], %o4
    brnz,pn %o4, <0x10034b98>
    stbar
    … …
    stb %g0, [%o0 + 12]
• Example (JBB), lock values 0x1->0x8410f8bc … 0x8410f8bc->0x1:
    casa [%l2] 128,%g4,%g3
    … … …
    casa [%l2] 128,%l0,%g4

Lock Identification Algorithm
• Start with an atomic instruction
  – that writes back a different value to the lock
  – otherwise the lock acquisition was unsuccessful
• Examine each following store made by the same CPU
• Until we meet a normal store
  – that completes the silent store pair
  – usually with the value 0x0
• Other completion patterns
  – Self-release (by the same CPU)
    • using an atomic instruction, pair-silently (JBB)
    • using an atomic instruction, not pair-silently
  – Cross-release (by a different CPU)
    • using an atomic instruction
  – Removed: can’t observe lock release (16K limited window).
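The algorithm above can be sketched as a single pass over a memory trace. This is a minimal sketch, not the authors' tool: the `Event` layout, the function name, and measuring the 16K window in trace events (the slides do not state the unit) are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical trace record: kind is "atomic" or "store";
# old/new are the memory values before and after the write.
Event = namedtuple("Event", "cpu kind addr old new")

def identify_locks(trace, window=16 * 1024):
    """Return (addr, acquire_index, release_index, pattern) tuples.

    pattern is one of "resolved", "self-release", "cross-release", "removed".
    The window is counted in trace events (an assumption) and is only
    checked when the lock address is touched again.
    """
    open_locks = {}  # addr -> (owner_cpu, value_before_acquire, acquire_index)
    results = []
    for i, ev in enumerate(trace):
        held = open_locks.get(ev.addr)
        if held is not None and i - held[2] > window:
            results.append((ev.addr, held[2], None, "removed"))
            del open_locks[ev.addr]
            held = None
        if held is None:
            # A candidate acquisition: an atomic that changes the value.
            if ev.kind == "atomic" and ev.new != ev.old:
                open_locks[ev.addr] = (ev.cpu, ev.old, i)
            continue
        owner, before, start = held
        if ev.new == ev.old:
            continue  # failed acquisition attempt or silent write
        if ev.cpu == owner and ev.kind == "store" and ev.new == before:
            pattern = "resolved"        # normal store completes the silent pair
        elif ev.cpu == owner and ev.kind == "atomic":
            pattern = "self-release"    # released with an atomic instruction
        elif ev.cpu != owner and ev.kind == "atomic":
            pattern = "cross-release"   # released by a different CPU
        else:
            continue                    # unrelated store, keep scanning
        results.append((ev.addr, start, i, pattern))
        del open_locks[ev.addr]
    # Locks whose release was never observed.
    for addr, (_, _, start) in open_locks.items():
        results.append((addr, start, None, "removed"))
    return results

if __name__ == "__main__":
    oltp = [
        Event(0, "atomic", 0x100, 0x0, 0xFF),   # ldstub acquires
        Event(1, "atomic", 0x100, 0xFF, 0xFF),  # another CPU fails to acquire
        Event(0, "store", 0x100, 0xFF, 0x0),    # stb releases
    ]
    print(identify_locks(oltp))
```

The example trace reuses the OLTP value pattern 0x0->0xff … 0xff->0x0 from the Lock Identification slide.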
[Figure: Completion Patterns Breakdown. Stacked bars (0% to 100%) for OLTP, Apache, Barnes, JBB; series: cross-release, self-release, removed, resolved]

Lock Frequency
[Figure: Percentage of User/Kernel Locks. Stacked bars (0% to 100%) for OLTP, Apache, JBB, Barnes; series: user, super]
[Figure: Lock Frequency. Number of locks per 10000 instructions (axis 0 to 30) for OLTP, Apache, JBB, Barnes]

Execution Phase Breakdown
[Figure: stacked bars (0% to 100%) for OLTP/s, OLTP/R, Apache/s, Apache/R, Barnes/s, Barnes/R; series: lock-spinning, critical section, lock-free section]
Critical Section Size
Lock-free Section Size

Timing Models
• Adding Ruby doesn’t change the size of critical sections and lock-free sections, but removes lock contention.
• Why?
  – “Shrinking” caused by less frequent memory accesses within critical sections
  – or a simulation effect?
• Guess: more shrinking using Ruby and Opal
[Figure: Memory Access Frequency (axis 0.25 to 0.32) for OLTP, Apache, Barnes/s, Barnes/R; series: Overall, Inside Lock, Lock (incl. acq); timing models: SimpleTiming, RubyTiming]

Lock Contention
[Figure: Percentage of Instructions Spent on Waiting / Spinning (axis 0 to 0.2) for OLTP/s, Apache/s, Barnes/s, OLTP/R, Apache/R, Barnes/R; series: Waiting, Spinning (<=4K); bar labels: 46%, 236%, 70%]
• Waiting: from the first try to successful acquisition
• Spinning: ignores contentions that have been waiting for more than 4K instructions

Distinguishing “wait” and “spin”
• Why bother?
  – A very few long-waiting events make a big difference in the percentage of wasted instructions
• Easy if we can identify thread switching
  – But the identification is not easy
• Treat as waiting if spinning for too many instructions
  – Using 4096 instructions as the limit
  – 90+% of contentions are shorter than 4K instr
  – It makes sense for different timing models.
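The 4096-instruction cutoff can be applied to a list of contention durations as in the following sketch. The function name and the input format (one duration per contention, in instructions from first try to successful acquisition) are assumptions for illustration only.

```python
SPIN_LIMIT = 4096  # instructions; the limit used in the slides

def summarize_contention(durations, limit=SPIN_LIMIT):
    """Summarize contention durations (instructions per contention).

    "waiting" counts every contention in full; "spinning" ignores
    contentions longer than the limit, which are assumed to involve
    a thread switch rather than pure spinning.
    """
    contended = [d for d in durations if d > 0]
    short = [d for d in contended if d <= limit]
    return {
        "waiting_instr": sum(contended),
        "spinning_instr": sum(short),
        "short_fraction": len(short) / len(contended) if contended else 0.0,
    }

if __name__ == "__main__":
    print(summarize_contention([10, 200, 5000, 4096]))
```

Sweeping `limit` over 2k, 4k, 8k and 16k would show how sensitive the totals are to the chosen cutoff.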
[Figure: Instructions spent on waiting, in millions (axis 0 to 25); x-axis: 2k, 4k, 8k, 16k; series: simple, ruby]
[Figure: Number of Lock Contentions, in thousands (axis 0 to 14); x-axis: 2k, 4k, 8k, 16k; series: simple, ruby]

Lock Contention – Most Contended Locks

SLE on Commercial Workloads
• Context switching (later)
• Buffering requirement: not much
  – Small critical sections dominate
  – Except for Apache user locks (1-8K)
  – Single shared buffer among threads on the same CPU
• Possible performance gain
  – Not big if only counting the number of instructions (1 - 6%)
    • Critical section size already small
    • Contention already infrequent
  – Can be larger if lock spinning latency increases
  – Can be smaller
    • if fewer lock contentions happen (as in the Ruby case)
    • Must throttle speculation (to avoid unnecessary rollbacks)

Context Switch
• Why bother?
  – Needed to precisely quantify the number of instructions spent on lock waiting (process and thread switching)
  – Needed to correctly implement speculative lock elision (process switching only)
• Process switching identification
  – Marker: demap TLB on context switch
  – Apache (100 transactions, CPU #3)
    • Average: ~210K instructions (max ~360K, min ~160K)
  – Process switches are infrequent; the performance implication is negligible
• Thread switching identification is hard
  – No simple patterns to observe, no feedback to validate assumptions
  – Not a good idea to provide a separate buffer for each thread on a single processor: it is hard to detect conflicts and thread switches, and many buffers would be needed.

Other Synchronization Algorithms
• Hard to recognize complex synchronization
  – Barriers, reader/writer locks, etc.
• Mutual exclusion implementations are composed of small critical sections
  – pthread_mutex_lock(&lock) acquires 3 locks
  – Reader/writer locks use locks to maintain their data structures (reader/writer queues, number of current readers, etc.)
[Figure: timeline from writer_enter() to writer_exit(); the execution in between is serialized (maintained by the synchronization algorithm), but hardware only sees two small critical sections]
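To make the writer_enter()/writer_exit() picture concrete, here is a minimal sketch of a reader/writer lock whose bookkeeping is protected by one short mutex. It is not the implementation studied in the slides; the class, its fields, and the `small_sections` counter are hypothetical. The point is that the mutex is held only inside the enter and exit calls, while the long serialized region between them runs with no hardware-visible lock held.

```python
import threading

class RWLock:
    """Illustrative reader/writer lock built from short critical sections."""

    def __init__(self):
        self._mutex = threading.Lock()                 # the only real lock
        self._cond = threading.Condition(self._mutex)
        self._readers = 0
        self._writer = False
        self.small_sections = 0   # counts enter/exit critical sections

    def writer_enter(self):
        with self._cond:          # small critical section #1
            self.small_sections += 1
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def writer_exit(self):
        with self._cond:          # small critical section #2
            self.small_sections += 1
            self._writer = False
            self._cond.notify_all()

    def reader_enter(self):
        with self._cond:
            self.small_sections += 1
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def reader_exit(self):
        with self._cond:
            self.small_sections += 1
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

if __name__ == "__main__":
    lock = RWLock()
    lock.writer_enter()
    # Serialized region: exclusive by protocol, but the mutex is free here.
    print("mutex held in region:", lock._mutex.locked())
    lock.writer_exit()
    print("small critical sections seen:", lock.small_sections)
```

A mechanism that elides only the two short mutex sections would therefore not remove the serialization enforced by the `_writer` flag.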

Conclusion
• Commercial workloads lock characterization
  – Small critical sections dominate
  – Infrequent lock contention
• User and kernel code behave differently
  – Kernel locks can’t be ignored
  – (Kernel) contended PCs are predictable
• Performance improvements
  – SLE won’t help as much
Thank You!
Questions?

Backup Slides
• Thread switching details
• Critical section size using Ruby timing model
• SPARC atomic instructions
• Misc issues
• Acknowledgement

Thread Switch Identification
• User thread scheduling
  – Disassemble the user thread library and observe execution of scheduling methods (_disp, _switch). Not always possible!
• Kernel thread scheduling
  – Involves a set of interleaved method invocations (resume, disp, swtch, _resume_from_idle..). Hard to identify the starting and ending points of a thread switch
  – Impossible to identify a kernel thread switch by only observing register window swaps, since they also happen in user thread switches
  – No feedback from the OS to validate our assumptions
  – Methodology & preliminary observations
    • Disassemble kernel code to build a VA kernel method map. Observe the method control flow in the Simics trace.
    • resume may indicate a kernel thread switch
    • user_rtt may indicate a user-level thread switch.
• Conclusion: thread switch identification is a hard, unresolved issue
Critical Section Size (Ruby)

SPARC Atomic Instructions
• ldstub
  – Writes all 1s into a byte
• swap
  – Swaps the value of the register and the memory location
• Compare-and-swap
  – Swaps if (value in the 1st reg == value in mem)
• membar/stbar
  – Usually follows such atomic instructions
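The effect of the three instructions on a lock word can be modeled sequentially as below. This is a simplified sketch: the function names and the dictionary standing in for memory are hypothetical, the model runs on one thread and is not atomic, and register and address-space details of the real instructions are left out.

```python
def ldstub(mem, addr):
    """Load the byte at addr, then store all 1s (0xFF) into it."""
    old = mem[addr]
    mem[addr] = 0xFF
    return old

def swap(mem, addr, reg_value):
    """Exchange a register value with the memory location; return the old memory value."""
    old = mem[addr]
    mem[addr] = reg_value
    return old

def compare_and_swap(mem, addr, expected, new):
    """Store new only if memory holds expected; always return the old memory value."""
    old = mem[addr]
    if old == expected:
        mem[addr] = new
    return old

if __name__ == "__main__":
    mem = {0x100: 0x0, 0x200: 0x1}
    # ldstub-style lock: acquired if the old byte was 0.
    print(ldstub(mem, 0x100) == 0x0)        # first try succeeds
    print(ldstub(mem, 0x100) == 0x0)        # second try fails, value unchanged
    mem[0x100] = 0x0                        # release with a normal store
    # cas-style lock using the value pattern from the JBB example.
    compare_and_swap(mem, 0x200, 0x1, 0x8410F8BC)
    compare_and_swap(mem, 0x200, 0x8410F8BC, 0x1)
    print(hex(mem[0x200]))
```

A failed `ldstub` writes back the value that was already there, which is why only atomics that change the lock value count as acquisitions.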

Misc.
• Why is Apache “strange”?
  – Locks are more frequent, with few user locks (1-2%)
  – Large percentage of critical section instructions
• Nested locks
• Intertwined locks
• Critical sections in Barnes are more clustered
• Buffer size ≤ 2^9 * 30% * 1/3 ≈ 51, so 64 blocks suffice
  – The same as SLE
[Figure: Apache Lock L1 Sizes (axis 0 to 5000); x-axis: 1 to 16; series: Simple, Ruby]

Acknowledgement
• Project suggested by Prof. Mark Hill
  – Guiding and supporting
• Lots of discussion with and help from
  – Min Xu, our TA
  – Carl Mauer, Multifacet simulator expert
  – Ravi Rajwar, SLE paper author