
05-07-2002 CS757

Lock Behaviour Characterization of Commercial Workloads

Jichuan Chang

Xidong Wang



Outline

• Motivation
• Methods
• Results
• Speculative Lock Elision Issues
• Conclusions


Motivation

• Understanding the synchronization behavior of commercial workloads (OLTP, Apache, SpecJBB)
• Identifying opportunities for Speculative Lock Elision (performance, ease of programming)


Questions to Answer

• Lock-related statistics
  – Can hardware identify critical sections?
  – Critical section size
  – Lock-free section size
  – Amount of lock contention
• Hardware optimizations by speculation
  – Context switching implications
  – Resource requirements
• Other issues
  – Realistic timing model
  – Other synchronization (reader/writer, etc.)

[Figure: execution timeline along a time axis, divided into lock-free section, critical section, and contention (spin/wait)]


Methods

• Benchmarks
  – OLTP, Apache, JBB, Barnes (for comparison)
• Full system simulation (tracing) using Simics
  – Simple timing model: Simics tracer
  – Ruby timing model: Simics + Ruby
  – Using the number of instructions (not cycles) as the measurement unit
  – Set cpu_switch_time to 1, disable STC
• Validating our approach
  – Using micro-benchmarks to compare our stats with the results reported by kernel tools (lockstat)
  – Tracing into disassembly code (kernel/user)


Lock Identification

• Basic idea [from SLE]
  – Lock acquisition must use one atomic instruction.
  – Silent store pair: taken as a pair, the stores in the lock acquisition and release operations are silent (the release restores the value the lock held before the acquisition).
• SPARC v9 atomic instructions
  – ldstub, swap, casa (compare-and-swap)

OLTP (lock values: 0x0->0xff … … 0xff->0x0)

    ldstub [%o0 + %g0], %o4
    brnz,pn %o4, <0x10034b98>
    stbar
    … …
    stb %g0, [%o0 + 12]

JBB (lock values: 0x1->0x8410f8bc … … … 0x8410f8bc->0x1)

    casa [%l2] 128,%g4,%g3
    … … …
    casa [%l2] 128,%l0,%g4


Lock Identification Algorithm

• Start with an atomic instruction
  – that writes back a different value to the lock
  – otherwise the lock acquisition was unsuccessful
• Examine each following store made by the same CPU
• Until we meet a normal store
  – that completes the silent store pair
  – usually with the value 0x0
• Other completion patterns
  – Self-release (by the same CPU)
    • using an atomic instruction, pair-silently (JBB)
    • using an atomic instruction, not pair-silently
  – Cross-release (by a different CPU)
    • using an atomic instruction
  – Removed: can't observe the lock release (limited 16K window).
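The identification steps above can be sketched as a pass over a memory trace. This is a minimal sketch, not the tool used for the measurements: the `Event` record, the function name `classify_locks`, and the choice to count the 16K window in trace events are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

WINDOW = 16 * 1024  # assumption: the 16K window is counted in trace events


@dataclass
class Event:
    cpu: int
    kind: str   # "atomic" or "store" (hypothetical trace encoding)
    addr: int
    old: int    # value in memory before the write
    new: int    # value in memory after the write


def classify_locks(trace: List[Event], window: int = WINDOW) -> List[Tuple[int, int, str]]:
    """Return sorted (acquire_index, addr, pattern) for each candidate lock."""
    results = []
    held: Dict[int, Tuple[int, int, int]] = {}  # addr -> (index, cpu, value before acquire)
    for i, ev in enumerate(trace):
        # Give up on a candidate whose release was not seen inside the window.
        if ev.addr in held and i - held[ev.addr][0] > window:
            results.append((held[ev.addr][0], ev.addr, "removed"))
            del held[ev.addr]
        if ev.addr not in held:
            # A successful acquisition is an atomic that changes the lock value.
            if ev.kind == "atomic" and ev.new != ev.old:
                held[ev.addr] = (i, ev.cpu, ev.old)
            continue
        start, owner, before = held[ev.addr]
        pattern = None
        if ev.cpu == owner:
            if ev.kind == "store" and ev.new == before:
                pattern = "resolved"  # normal store completes the silent pair
            elif ev.kind == "atomic" and ev.new != ev.old:
                silent = ev.new == before
                pattern = "self-release (silent)" if silent else "self-release (non-silent)"
        elif ev.kind == "atomic" and ev.new != ev.old:
            pattern = "cross-release"  # another CPU changed the held lock
        if pattern is not None:
            results.append((start, ev.addr, pattern))
            del held[ev.addr]
    # Anything still held at the end of the trace has no observable release.
    results.extend((start, addr, "removed") for addr, (start, _, _) in held.items())
    return sorted(results)


demo_trace = [
    Event(0, "atomic", 0x100, 0x0, 0xFF),         # ldstub-style acquire
    Event(1, "atomic", 0x100, 0xFF, 0xFF),        # failed attempt by another CPU
    Event(0, "store", 0x100, 0xFF, 0x0),          # silent-pair release
    Event(2, "atomic", 0x200, 0x1, 0x8410F8BC),   # casa-style acquire
    Event(2, "atomic", 0x200, 0x8410F8BC, 0x1),   # casa-style release
    Event(0, "atomic", 0x300, 0x0, 0xFF),
    Event(1, "atomic", 0x300, 0xFF, 0x0),         # released by a different CPU
    Event(3, "atomic", 0x400, 0x0, 0xFF),         # release never observed
]
demo_result = classify_locks(demo_trace)
```

The demo trace reuses the OLTP and JBB value patterns from the previous slide; the addresses and CPU numbers are made up.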

[Figure: Completion Patterns Breakdown. Stacked bars (0% to 100%) for OLTP, Apache, Barnes, JBB; categories: cross-release, self-release, removed, resolved]


Lock Frequency

[Figure: Percentage of User/Kernel Locks. Stacked bars (0% to 100%) for OLTP, Apache, JBB, Barnes; categories: user, super]

[Figure: Lock Frequency. Number of locks per 10000 instructions (y-axis 0 to 30) for OLTP, Apache, JBB, Barnes]


Execution Phase Breakdown

[Figure: stacked bars (0% to 100%) for OLTP/s, OLTP/R, Apache/s, Apache/R, Barnes/s, Barnes/R; categories: Lock-spinning, Critical Section, Lock-free Section]


Critical Section Size


Lock-free Section Size


Timing Models

• Adding Ruby doesn't change the size of critical sections and lock-free sections, but removes lock contention.
• Why?
  – “Shrinking” caused by less frequent memory accesses within critical sections
  – or a simulation effect?
• Guess: more shrinking using Ruby and Opal

[Figure: Memory Access Frequency (y-axis 0.25 to 0.32) for OLTP, Apache, Barnes/s, Barnes/R; series: Overall, Inside Lock, Lock (incl. acq); labels: SimpleTiming, RubyTiming]


Lock Contention

[Figure: Percentage of Instructions Spent on Waiting / Spinning (y-axis 0 to 0.2) for OLTP/s, Apache/s, Barnes/s, OLTP/R, Apache/R, Barnes/R; series: Waiting, Spinning (<=4K); bar labels: 46%, 236%, 70%]

• Waiting: from the first try to the successful acquisition
• Spinning: ignores contentions that have been waiting for more than 4K instructions.


Distinguishing “wait” and “spin”

• Why bother?
  – Very few long-waiting events make a big difference in the percentage of wasted instructions
• Easy if we can identify thread switching
  – But the identification is not easy
• Treat as waiting if spinning for too many instructions
  – Using 4096 instructions as the limit
  – 90+% of contentions are shorter than 4K instructions
  – It makes sense for different timing models.
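The 4096-instruction rule can be illustrated with a short sketch. It assumes each contention event has already been reduced to a duration in instructions (first try to successful acquisition); the function name `summarize_contention` and the sample numbers are made up for illustration and are not measured data.

```python
from typing import Dict, List

SPIN_LIMIT = 4096  # instructions; the limit named on the slide


def summarize_contention(durations: List[int], total_instructions: int,
                         limit: int = SPIN_LIMIT) -> Dict[str, float]:
    """Split contention events into 'waiting' (all) and 'spinning' (<= limit)."""
    short = [d for d in durations if d <= limit]
    return {
        # Fraction of contention events that stay within the limit.
        "short_fraction": len(short) / len(durations) if durations else 0.0,
        # Waiting counts every event, from first try to acquisition.
        "waiting_pct": 100 * sum(durations) / total_instructions,
        # Spinning ignores events that waited longer than the limit.
        "spinning_pct": 100 * sum(short) / total_instructions,
    }


# Made-up example: three short spins and one long wait in 1M instructions.
summary = summarize_contention([100, 200, 300, 50_000], 1_000_000)
```

In this made-up example a single long wait accounts for almost all of the waiting percentage, which is the effect the first bullet describes.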

[Figure: Instructions spent on waiting, in millions (y-axis 0 to 25); x-axis: 2k, 4k, 8k, 16k; series: simple, ruby]

[Figure: Number of Lock Contentions, in thousands (y-axis 0 to 14); x-axis: 2k, 4k, 8k, 16k; series: simple, ruby]


Lock Contention – Most Contended Locks


SLE on Commercial Workloads

• Context switching (later)
• Buffering requirement: not much
  – Small critical sections dominate
  – Except for Apache user locks (1-8K)
  – Single shared buffer among threads on the same CPU
• Possible performance gain
  – Not big if only counting the number of instructions (1 - 6%)
    • Critical section size already small
    • Contention already infrequent
  – Can be larger if lock spinning latency increases
  – Can be smaller
    • if less lock contention happens (as in the Ruby case)
    • Must throttle speculation (to avoid unnecessary rollbacks)


Context Switch

• Why bother?
  – Needed to precisely quantify the number of instructions spent on lock waiting (process and thread switching)
  – Needed to correctly implement speculative lock elision (process switching only)
• Process switching identification
  – Marker: Demap TLB on context switch
  – Apache (100 transactions, CPU #3)
    • Average: ~210K instructions (Max ~360K, Min ~160K)
  – Process switches are infrequent; performance implication negligible
• Thread switching identification is hard
  – No simple patterns to observe, no feedback to validate assumptions
  – Not a good idea to provide a separate buffer for each thread on a single processor: hard to detect conflicts and thread switches, and many buffers are needed.
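One way to turn the TLB demap marker into statistics like the ones quoted above is to measure the gaps between consecutive markers on one CPU. This is a sketch under assumptions: it reads the quoted average as the distance between process switches, the function name `switch_intervals` is hypothetical, and the marker positions are invented.

```python
from typing import Dict, List, Optional


def switch_intervals(marker_positions: List[int]) -> Optional[Dict[str, float]]:
    """Summarize gaps (in instructions) between consecutive demap markers.

    marker_positions: instruction counts on one CPU at which a TLB demap
    (taken here as the process-switch marker) was observed.
    """
    ordered = sorted(marker_positions)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return None  # fewer than two markers: no interval to measure
    return {"avg": sum(gaps) / len(gaps), "min": min(gaps), "max": max(gaps)}


# Invented marker positions, for illustration only.
stats = switch_intervals([0, 200_000, 560_000, 720_000])
```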


Other Synchronization Algorithms

• Hard to recognize complex synchronization
  – Barriers, reader/writer locks, etc.
• Mutual exclusion implementations are composed of small critical sections
  – pthread_mutex_lock(&lock) acquires 3 locks
  – Reader/writer locks use locks to maintain their data structures (reader/writer queues, number of current readers, etc.)

[Figure: timeline from writer_enter() to writer_exit(); execution in between is serialized (maintained by the synchronization algorithm), but hardware only sees two small critical sections]


Conclusion

• Commercial workloads lock characterization
  – Small critical sections dominate
  – Infrequent lock contention
• User and kernel code behave differently
  – Kernel locks can't be ignored
  – (Kernel) contended PCs are predictable
• Performance improvements
  – SLE won't help much: critical sections are already small and contention is already infrequent


Thank You!

Questions?


Backup Slides

• Thread switching details
• Critical section size using Ruby timing model
• SPARC atomic instructions
• Misc. issues
• Acknowledgement


Thread Switch Identification

• User thread scheduling
  – Disassemble the user thread library and observe execution of the scheduling methods (_disp, _switch). Not always possible!
• Kernel thread scheduling
  – Involves a set of interleaved method invocations (resume, disp, swtch, _resume_from_idle..). Hard to identify the starting and ending points of a thread switch.
  – Impossible to identify a kernel thread switch only by observing register window swaps, since these also happen in user thread switches
  – No feedback from the OS to validate our assumptions
  – Methodology & preliminary observations
    • Disassemble kernel code to build a VA kernel method map. Observe the method control flow in the Simics trace.
    • resume may indicate a kernel thread switch
    • user_rtt may indicate a user-level thread switch.
• Conclusion: thread switch identification is a hard, unresolved issue


Critical Section Size (Ruby)


Sparc Atomic Instructions

• ldstub
  – Loads a byte and writes all 1s (0xff) into it
• swap
  – Swaps the value of the register and the memory location
• Compare-and-swap (casa)
  – Swaps if (value in the 1st reg == value in mem)
• membar/stbar
  – Usually follows such atomic instructions
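The semantics listed above can be modeled in a few lines. This is a sequential sketch for illustration only: real instructions perform the read and write atomically in hardware, memory is represented here as a plain dictionary, and the helper names `try_acquire` and `release` are hypothetical.

```python
from typing import Dict

Memory = Dict[int, int]  # address -> value; a stand-in for real memory


def ldstub(mem: Memory, addr: int) -> int:
    """Load the byte at addr, then set it to all 1s; return the old value."""
    old = mem[addr]
    mem[addr] = 0xFF
    return old


def swap(mem: Memory, addr: int, reg_value: int) -> int:
    """Exchange a register value with memory; return the old memory value."""
    old = mem[addr]
    mem[addr] = reg_value
    return old


def cas(mem: Memory, addr: int, compare_value: int, swap_value: int) -> int:
    """Store swap_value only if memory equals compare_value; return old value."""
    old = mem[addr]
    if old == compare_value:
        mem[addr] = swap_value
    return old


def try_acquire(mem: Memory, lock_addr: int) -> bool:
    """ldstub-style acquire: succeeds only if the lock byte was 0x00."""
    return ldstub(mem, lock_addr) == 0x00


def release(mem: Memory, lock_addr: int) -> None:
    """Release with a normal store of 0x00, completing the silent pair."""
    mem[lock_addr] = 0x00


mem = {0x40: 0x00}
first = try_acquire(mem, 0x40)    # lock was free, so this succeeds
second = try_acquire(mem, 0x40)   # lock is held, so this fails
release(mem, 0x40)
```

A failed ldstub attempt writes 0xff over 0xff, so it leaves the lock value unchanged, which is why the identification algorithm only treats an atomic that changes the value as an acquisition.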


Misc.

• Why is Apache “strange”?
  – Locks are more frequent, with few user locks (1-2%)
  – Large percentage of critical section instructions
• Nested locks
• Intertwined locks
• Critical sections in Barnes are more clustered
• Buffer size ≤ 2^9 * 30% * 1/3 ≈ 51, so 64 blocks suffice
  – The same as SLE

[Figure: Apache Lock L1 Sizes (y-axis 0 to 5000); x-axis: 1 to 16; series: Simple, Ruby]


Acknowledgement

• Project suggested by Prof. Mark Hill
  – Guiding and supporting
• Lots of discussion with and help from
  – Min Xu, our TA
  – Carl Mauer, Multifacet simulator expert
  – Ravi Rajwar, SLE paper author
