TRANSCRIPT
Thoughts on Shared Caches
Jeff Odom, University of Maryland
A Brief History of Time
First there was the single CPU
– Memory tuning new field
– Large improvements possible
– Life is good
Then came multiple CPUs
– Rethink memory interactions
– Life is good (again)
Now there’s multi-core on multi-CPU
– Rethink memory interactions (again)
– Life will be good (we hope)
SMP vs. CMP
Symmetric Multiprocessing (SMP)
– Single CPU core per chip
– All caches private to each CPU
– Communication via main memory
Chip Multiprocessing (CMP)
– Multiple CPU cores on one integrated circuit
– Private L1 cache
– Shared second-level and higher caches
CMP Features
Thread-level parallelism
– One thread per core
– Same as SMP
Shared higher-level caches
– Reduced latency
– Improved memory bandwidth
Non-homogeneous data decomposition
– Not all cores are created equal
CMP Challenges
New optimizations
– False sharing/private data copies
– Delaying reads until shared
Fewer locations to cache data
– More chance of data eviction in high-throughput computations
Hybrid SMP/CMP systems
– Connect multiple multi-core nodes
– Composite cache sharing scheme
– Cray XT4
• 2 cores/chip
• 2 chips/node
False Sharing
Occurs when two CPUs access different data structures on the same cache line
[Diagram: a single cache line holding both struct foo (accessed by CPU0) and struct bar (accessed by CPU1); a code sketch follows]
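A minimal C sketch of the situation in the diagram, assuming 64-byte cache lines, pthreads, and GCC-style attributes; the field names, loop counts, and the padded variant are illustrative rather than taken from the slides.

#include <pthread.h>
#include <stddef.h>

/* Two unrelated structures that, as adjacent globals, are likely to land
 * on the same cache line (as in the diagram above). */
struct foo { long count; };
struct bar { long count; };

static struct foo f;
static struct bar b;

/* One common fix: pad and align each structure to its own cache line so a
 * write by one CPU no longer invalidates the line the other CPU is using. */
struct foo_padded {
    long count;
    char pad[64 - sizeof(long)];
} __attribute__((aligned(64)));

static void *writer0(void *arg) {           /* plays the role of CPU0 */
    for (long i = 0; i < 1000000; i++)
        f.count++;                          /* repeatedly dirties the shared line */
    return arg;
}

static void *writer1(void *arg) {           /* plays the role of CPU1 */
    for (long i = 0; i < 1000000; i++)
        b.count++;                          /* same line, different structure */
    return arg;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, writer0, NULL);
    pthread_create(&t1, NULL, writer1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}

Even though the two threads never touch each other's data, every increment forces the coherence protocol to bounce the line between their caches; using the padded variant for each structure removes the contention.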
False Sharing (SMP)
[Animation: CPU0 and CPU1 each have a private L2 above main memory. CPU0 fetches foo, pulling the shared cache line into its L2; CPU1 fetches bar, pulling the same line into its L2; CPU0 writes foo, invalidating CPU1's copy; CPU1 must fetch bar from main memory again.]
False Sharing (CMP)
[Animation: CPU0 and CPU1 sit above a single shared L2 and main memory. The same fetch-and-write sequence plays out, but the falsely shared line stays in the shared L2, so invalidations only ripple between the cores' L1s instead of forcing trips to main memory.]
False Sharing (SMP vs. CMP)
With private L2 (SMP), modification of co-resident data structures results in trips to main memory
In CMP, false sharing impact is limited by the shared L2
Latency from L1 to L2 is much less than from L2 to main memory
Maintaining Private Copies
Two threads modifying the same cache line will each want to move the line into their own L1
Simultaneous reading/modification causes thrashing between the L1s and L2
Keeping a copy of the data in a separate cache line keeps it local to the processor (see the sketch below)
Updates to shared data occur less often
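A C sketch of the private-copy idea, assuming 64-byte lines, GCC-style alignment attributes, and pthreads; the batch size and the counter workload are made up for illustration.

#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64              /* assumed line size                           */
#define NTHREADS   2
#define BATCH      1024            /* publish to the shared line this often       */
#define ITERS      (1024 * 1024)   /* multiple of BATCH, so nothing is left over  */

static long shared_total;                              /* the contended data      */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* One private copy per thread, each on its own cache line, so the hot updates
 * stay in the owning core's L1 instead of thrashing between the L1s and L2.   */
struct private_copy {
    long local_total;
    char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));

static struct private_copy priv[NTHREADS];

static void *worker(void *arg) {
    int id = (int)(long)arg;
    for (long i = 0; i < ITERS; i++) {
        priv[id].local_total += i;              /* cheap: line stays private      */
        if ((i + 1) % BATCH == 0) {             /* shared update happens rarely   */
            pthread_mutex_lock(&lock);
            shared_total += priv[id].local_total;
            pthread_mutex_unlock(&lock);
            priv[id].local_total = 0;
        }
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("total = %ld\n", shared_total);
    return 0;
}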
Delaying Reads Until Shared
Often the results from one thread are pipelined to another
Typical signal-based sharing:
– Thread 1 (T1) accesses data, which is pulled into L1T1
– T1 modifies data
– T1 signals T2 that data is ready
– T2 requests data, forcing eviction from L1T1 into L2Shared
– Data is now shared
The evicted L1 line is not refilled, wasting space
Delaying Reads Until Shared
Optimized sharing:
– T1 pulls data into L1T1 as before
– T1 modifies data
– T1 waits until it has other data to fill the line with, then uses that to push the data into L2Shared
– T1 signals T2 that data is ready
– T1 and T2 now share data in L2Shared
Eviction is a side effect of loading the next line (sketched below)
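A rough C11 sketch of the pattern, not the exact hardware mechanism: the producer publishes a line only after it has started filling the next one, so by the time the consumer reads it, the line has had a chance to drift from the producer's L1 into the shared L2. The buffer sizes, line-granular layout, and flag protocol are assumptions.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define LINE_LONGS 8        /* 64-byte line / 8-byte long (assumed)            */
#define NLINES     128

static long results[NLINES][LINE_LONGS];    /* pipeline buffer, one line per row */
static atomic_int lines_ready;              /* how many lines the consumer may read */

/* T1: fill a whole line, move on to the next one, and only then publish the
 * previous line.  Filling line n tends to displace line n-1 from T1's L1
 * toward the shared L2, so T2's later read does not force an eviction.      */
static void *producer(void *arg) {
    for (int n = 0; n < NLINES; n++) {
        for (int i = 0; i < LINE_LONGS; i++)
            results[n][i] = (long)n * LINE_LONGS + i;       /* "modify data"   */
        if (n > 0)                              /* delay the signal by one line */
            atomic_store_explicit(&lines_ready, n, memory_order_release);
    }
    atomic_store_explicit(&lines_ready, NLINES, memory_order_release);
    return arg;
}

/* T2: wait for the signal, then read data that is (ideally) already in the
 * shared L2 rather than in T1's L1.                                          */
static void *consumer(void *arg) {
    long sum = 0;
    for (int n = 0; n < NLINES; n++) {
        while (atomic_load_explicit(&lines_ready, memory_order_acquire) <= n)
            ;                                   /* spin until line n is published */
        for (int i = 0; i < LINE_LONGS; i++)
            sum += results[n][i];
    }
    printf("sum = %ld\n", sum);
    return arg;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, producer, NULL);
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}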
Hybrid Models
Most CMP systems will have SMP as well
– Large core density not feasible
– Want to balance processing with cache sizes
Different access patterns
– Co-resident cores behave differently than cores on different nodes
– Results may differ depending on which processor pairs you get
Experimental Framework
Simics simulator
– Full-system simulation
– Hot-swappable components
– Configurable memory system
• Reconfigurable cache hierarchy
• Roll-your-own coherency protocol
Simulated environment
– SunFire 6800, Solaris 10
– Single CPU board, 4 UltraSPARC IIi
– Uniform main memory access
– Similar to actual hardware on hand
Experimental Workload
NAS Parallel Benchmarks
– Well-known, standard applications
– Various data access patterns (conjugate gradient, multi-grid, etc.)
OpenMP-optimized
– Already converted from original serial versions
– MPI-based versions also available
Small (W) workloads
– Simulation framework slows down execution
– Will examine larger (A-C) versions to verify tool correctness
Workload Results
[Four bar charts: execution time (s) broken down by CPU0-CPU3, comparing private vs. shared L2 for each benchmark.]
Some show marked improvement (CG)…
…others show marginal improvement (FT)…
…still others show asymmetrical loads (BT)…
…and asymmetrical improvement (EP)
The Next Step
How to get data and tools for programmers to deal with this?
– Hardware
– Languages
– Analysis tools
Specialized hardware counters
– Which CPU forced eviction
– Are cores or nodes contending for data
– Coherency protocol diagnostics
The Next Step
CMP-aware parallel languages
– A language-based framework makes it easier to perform automatic optimizations
– OpenMP, UPC likely candidates
– Specialized partitioning may be needed to leverage shared caches
• Implicit data partitioning
• Current languages distribute data uniformly
– May require extensions (hints) in the form of language directives (see the sketch below)
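For concreteness, a small OpenMP-in-C sketch: standard OpenMP spreads the iteration space uniformly with no notion of which threads share a cache; the cache-sharing directive shown in the comment is hypothetical syntax, not part of OpenMP or UPC.

#include <omp.h>

#define N (1 << 20)
static double a[N], b[N];

void scale(double s) {
    /* Today: iterations are distributed uniformly across all threads, with
     * no notion of which threads happen to share an L2.                    */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = s * b[i];

    /* A CMP-aware extension might accept a hint such as
     *     #pragma omp parallel for partition(shared_cache)
     * (hypothetical syntax) telling the runtime to give co-resident cores
     * adjacent blocks of the data they will touch together.                */
}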
The Next Step
Post-execution analysis tools
– Identify memory hotspots
– Provide hints on restructuring
• Blocking (see the sketch below)
• Execution interleaving
– Convert SMP-optimized code for use in CMP
– Dynamic instrumentation opportunities
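As an example of the kind of blocking hint such a tool might suggest, a standard loop-tiling sketch in C; the matrix size, the block size, and the assumption that the tiles fit in the shared L2 are illustrative.

#define N  512
#define BS 64    /* tile size: chosen so the working tiles fit in the shared L2 (assumption) */

/* Blocked (tiled) matrix multiply: C += A * B.  Each BSxBS tile of A, B,
 * and C is reused from cache instead of streaming whole rows and columns
 * through it.  Assumes C is zeroed by the caller and BS divides N.        */
void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++)
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];
}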
Questions?