More on Locks: Case Studies

Topics
- Case study of two architectures: Xeon and Opteron
- Detailed lock code and cache coherence
Putting it all together

Background: architecture of the two testing machines

A more detailed treatment of locks and cache coherence, with code examples and implications for parallel software design in the above context.
Two case studies

- 48-core AMD Opteron
- 80-core Intel Xeon
48-core AMD Opteron

- Last-level cache (LLC) NOT shared
- Directory-based cache coherence

[Diagram: 8 dies on the motherboard, 6 cores per die. Each core has a private L1 cache; each die has its own LLC; all dies connect to RAM. Links between sockets are marked "cross-socket!"]
80-core Intel Xeon

- LLC shared
- Snooping-based cache coherence

[Diagram: 8 dies on the motherboard, 10 cores per die. Each core has a private L1 cache; the 10 cores on a die share one Last Level Cache (LLC); all dies connect to RAM over cross-socket links.]
Interconnect between sockets

Cross-socket communication can take 2 hops.
Performance of memory operations
Local caches and memory latencies

Memory access to a line cached locally (cycles):
- Best case: L1, < 10 cycles
- Worst case: RAM, 136–355 cycles
Latency of remote access: read (cycles)

"State" is the MESI state of the cache line in a remote cache.

Cross-socket communication is expensive!
- Xeon: loading from the Shared state is 7.5 times more expensive over two hops than within a socket
- Opteron: cross-socket latency is even larger than RAM latency

Opteron: uniform latency regardless of the cache state
- Directory-based protocol (the directory is distributed across all LLCs)

Xeon: a load from the "Shared" state is much faster than from the "M" and "E" states
- A "Shared"-state read is served from the LLC instead of from a remote cache
Latency of remote access: write (cycles)

"State" is the MESI state of the cache line in a remote cache.

Cross-socket communication is expensive!

Opteron: a store to a "Shared" cache line is much more expensive
- The directory-based protocol is incomplete: it does not keep track of the sharers
- Equivalent to a broadcast, and the store has to wait for all invalidations to complete

Xeon: store latency is similar regardless of the previous cache-line state
- Snooping-based coherence
Detailed Treatment of Lock-based Synchronization
Synchronization implementation

Hardware support is required to implement synchronization primitives
- In the form of atomic instructions
- Common examples include test-and-set, compare-and-swap, etc.
- Used to implement high-level synchronization primitives, e.g., lock/unlock, semaphores, barriers, condition variables, etc.

We will only discuss test-and-set here.
Test-And-Set

The semantics of test-and-set are:
- Record the old value
- Set the value to TRUE (this is a write!)
- Return the old value

Hardware executes it atomically!
Test-And-Set

What the hardware does for one TAS:
- Read-exclusive (invalidations)
- Modify (change state)
- Memory barrier:
  - completes all the memory operations before this TAS
  - cancels all the memory operations after this TAS

All of this happens atomically!
Using Test-And-Set
(Courtesy Ding Yuan)

Here is our lock implementation with test-and-set:

```c
struct lock {
    int held;  /* initialized to 0: the lock starts free */
};

void acquire(struct lock *lock) {
    while (test_and_set(&lock->held))
        ;  /* spin until TAS returns 0 */
}

void release(struct lock *lock) {
    lock->held = 0;
}
```
TAS and cache coherence

Walkthrough: Thread A acquires the lock, then Thread B tries to. Each processor's cache holds a State and Data for the lock's line; shared memory starts with held = 0 and both caches empty.

1. Thread A: acq(lock). A's TAS issues a Read-Exclusive request for the lock's cache line.
2. The line is filled into A's cache; TAS writes held = 1, leaving the line Dirty in A's cache. TAS returned 0, so A now holds the lock.
3. Thread B: acq(lock). B's TAS also issues a Read-Exclusive, which invalidates A's Dirty copy.
4. A's dirty data is written back, updating shared memory to held = 1; A's copy is now Invalid.
5. The line is filled into B's cache; B's TAS writes held = 1 (Dirty in B's cache) and returns the old value 1, so B does not get the lock and keeps spinning.
What if there is contention?

Shared memory: held = 1; both caches are empty. Thread A and Thread B each spin executing while (TAS(l)) ; so every iteration issues another TAS.
How bad can it be?

Recall: TAS essentially is a Store + Memory Barrier.

[Chart comparing the latency of TAS and Store operations under contention.]
How to optimize?

When the lock is being held, a contending "acquire" keeps modifying the lock variable to 1. Not necessary!

```c
void test_and_test_and_set(struct lock *lock) {
    do {
        while (lock->held == 1)
            ;  /* spin with plain reads only */
    } while (test_and_set(&lock->held));
}

void release(struct lock *lock) {
    lock->held = 0;
}
```
What if there is contention? (with test-and-test-and-set)

Walkthrough: Thread A holds the lock (held = 1, Dirty in A's cache; shared memory still shows held = 0). Threads B and C spin on the plain read while (held == 1) ;

1. B's read request reaches A's cache. A's Dirty copy is written back (shared memory becomes held = 1) and downgraded to Shared, and B's cache receives a Shared copy of held = 1.
2. C's read is served the same way, so A, B, and C all hold the line in the Shared state.

Repeated reads to a "Shared" cache line: no cache-coherence traffic!
Let's put everything together

[Chart comparing the latency of TAS, Load, and Write operations against local access.]
Implications for programmers

Cache coherence is expensive (more than you thought)
- Avoid unnecessary sharing (e.g., false sharing)
- Avoid unnecessary coherence traffic (e.g., TAS -> TATAS)
- Develop a clear understanding of the performance

Crossing sockets is a killer
- Can be slower than running the same program on a single core!
- pthreads provides a CPU affinity mask: pin cooperating threads on cores within the same die

Loads and stores can be as expensive as atomic operations

Programming gurus understand the hardware
- So do you now! Have fun hacking!

More details in "Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask", David et al., SOSP '13.