
Thoughts on Shared Caches

Jeff Odom, University of Maryland


A Brief History of Time

First there was the single CPU
– Memory tuning a new field
– Large improvements possible
– Life is good

Then came multiple CPUs
– Rethink memory interactions
– Life is good (again)

Now there’s multi-core on multi-CPU
– Rethink memory interactions (again)
– Life will be good (we hope)


SMP vs. CMP

Symmetric Multiprocessing (SMP)
– Single CPU core per chip
– All caches private to each CPU
– Communication via main memory

Chip Multiprocessing (CMP)
– Multiple CPU cores on one integrated circuit
– Private L1 cache
– Shared second-level and higher caches


CMP Features

Thread-level parallelism
– One thread per core
– Same as SMP

Shared higher-level caches
– Reduced latency
– Improved memory bandwidth

Non-homogeneous data decomposition
– Not all cores are created equal


CMP Challenges

New optimizations
– False sharing/private data copies
– Delaying reads until shared

Fewer locations to cache data
– More chance of data eviction in high-throughput computations

Hybrid SMP/CMP systems
– Connect multiple multi-core nodes
– Composite cache sharing scheme
– Cray XT4
  • 2 cores/chip
  • 2 chips/node


False Sharing

Occurs when two CPUs access different data structures on the same cache line

[Figure: a single cache line holding both struct foo (accessed by CPU0) and struct bar (accessed by CPU1)]
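The effect is easiest to see in code. The following is a minimal C sketch, not taken from the slides: the struct and variable names, thread counts, and 64-byte line size are illustrative assumptions. Two threads update adjacent structures that are very likely to land on the same cache line.

/* false_sharing.c — minimal sketch of the pattern in the figure.
 * Names and the 64-byte line size are illustrative assumptions.
 * Build: cc -O2 -pthread false_sharing.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct foo { volatile int64_t x; };   /* updated only by thread 0 ("CPU0") */
struct bar { volatile int64_t y; };   /* updated only by thread 1 ("CPU1") */

/* Declared back to back, foo and bar very likely share one cache line,
 * so every write by one thread invalidates the line in the other's cache. */
static struct { struct foo f; struct bar b; } shared;

static void *bump_foo(void *arg) {
    (void)arg;
    for (long i = 0; i < 10000000L; i++) shared.f.x++;
    return NULL;
}

static void *bump_bar(void *arg) {
    (void)arg;
    for (long i = 0; i < 10000000L; i++) shared.b.y++;
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump_foo, NULL);
    pthread_create(&t1, NULL, bump_bar, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("foo=%lld bar=%lld\n", (long long)shared.f.x, (long long)shared.b.y);
    return 0;
}

Padding or aligning each structure to a full cache line (for example with __attribute__((aligned(64))) on compilers that support it) puts foo and bar on separate lines and removes the false sharing, at the cost of a little memory.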

False Sharing (SMP)

[Figure sequence: CPU0 and CPU1 each have a private L2 connected to main memory. CPU0 fetches foo and CPU1 fetches bar, so both private L2s hold the same cache line. When CPU0 writes foo, the line is invalidated in CPU1's cache, forcing CPU1 to fetch bar again from main memory.]
False Sharing (CMP)

[Figure sequence: CPU0 and CPU1 share a single L2 connected to main memory. After CPU0 fetches foo, the contested line resides in the shared L2, so subsequent accesses and invalidations move only between the private L1s and the shared L2 rather than back to main memory.]

False Sharing (SMP vs. CMP)

With private L2 (SMP), modification of co-resident data structures results in trips to main memory

In CMP, false sharing impact is limited by the shared L2

Latency from L1 to L2 is much less than from L2 to main memory


Maintaining Private Copies

Two threads modifying the same cache line will want to move data to their L1

Simultaneous reading/modification causes thrashing between the L1s and the L2

Keeping a copy of the data in a separate cache line keeps it local to the processor

Updates to shared data occur less often
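A minimal sketch of this idea follows, assuming OpenMP, 64-byte cache lines, and at most 64 threads (all assumptions, not taken from the slides): each thread accumulates into its own padded slot, so the hot updates stay in that thread's L1, and the shared total is touched only once per thread.

/* private_copies.c — sketch of per-thread private copies on separate lines.
 * Build: cc -O2 -fopenmp private_copies.c */
#include <omp.h>
#include <stdio.h>

#define MAX_THREADS 64
#define N 10000000

/* One padded slot per thread: each counter occupies its own cache line,
 * so the frequent updates never contend with another thread's counter. */
static struct { long count; char pad[64 - sizeof(long)]; } local[MAX_THREADS];

int main(void) {
    long shared_total = 0;

    #pragma omp parallel
    {
        int id = omp_get_thread_num();

        #pragma omp for
        for (int i = 0; i < N; i++)
            local[id].count++;            /* private line: no thrashing */

        #pragma omp atomic
        shared_total += local[id].count;  /* shared data updated once per thread */
    }

    printf("total = %ld\n", shared_total);
    return 0;
}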


Delaying Reads Until Shared

Often the results from one thread are pipelined to another

Typical signal-based sharing:
– Thread 1 (T1) accesses data, which is pulled into T1's L1
– T1 modifies the data
– T1 signals T2 that the data is ready
– T2 requests the data, forcing eviction from T1's L1 into the shared L2
– Data is now shared

The L1 line is not filled in, wasting space


Delaying Reads Until Shared

Optimized sharing:
– T1 pulls data into its L1 as before
– T1 modifies the data
– T1 waits until it has other data to fill the line with, then uses that to push the data into the shared L2
– T1 signals T2 that the data is ready
– T1 and T2 now share the data in the shared L2

Eviction is a side effect of loading the line
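The slide's optimization relies on hardware replacement behavior, but a rough software analogue is to batch a full line's worth of results before signaling. The sketch below is hypothetical (names, the 64-byte line size, and the use of C11 atomics are assumptions, not the author's code): the producer finishes the whole block before raising the ready flag, so the consumer's first read pulls one complete, finished line.

/* delayed_share.c — hypothetical software analogue of delaying the share.
 * Build: cc -O2 -pthread delayed_share.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define LINE 64
#define SLOTS (LINE / sizeof(int64_t))

static int64_t results[SLOTS] __attribute__((aligned(LINE)));
static atomic_int ready;

/* Producer (T1): finish the whole block, then publish it once. */
static void *produce_block(void *arg) {
    (void)arg;
    for (size_t i = 0; i < SLOTS; i++)
        results[i] = (int64_t)(i * i);   /* stays local until the line is full */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Consumer (T2): wait for the signal, then read the now-shared line. */
static int64_t consume_sum(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                /* spin until T1 publishes */
    int64_t sum = 0;
    for (size_t i = 0; i < SLOTS; i++)
        sum += results[i];
    return sum;
}

int main(void) {
    pthread_t t1;
    pthread_create(&t1, NULL, produce_block, NULL);
    printf("sum = %lld\n", (long long)consume_sum());
    pthread_join(t1, NULL);
    return 0;
}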


Hybrid Models

Most CMP systems will have SMP as well
– Large core density not feasible
– Want to balance processing with cache sizes

Different access patterns
– Co-resident cores act differently than cores on different nodes
– Results may differ depending on which processor pairs you get


Experimental Framework

Simics simulator
– Full system simulation
– Hot-swappable components
– Configurable memory system
  • Reconfigurable cache hierarchy
  • Roll-your-own coherency protocol

Simulated environment
– SunFire 6800, Solaris 10
– Single CPU board, 4 UltraSPARC IIi
– Uniform main memory access
– Similar to actual hardware on hand


Experimental Workload

NAS Parallel Benchmarks
– Well-known, standard applications
– Various data access patterns (conjugate gradient, multi-grid, etc.)

OpenMP-optimized
– Already converted from the original serial versions
– MPI-based versions also available

Small (W) workloads
– Simulation framework slows down execution
– Will examine larger (A-C) versions to verify tool correctness


Workload Results

[Figure: four bar charts of runtime (s) per benchmark, broken down by CPU0-CPU3, comparing private vs. shared L2]

Some benchmarks show marked improvement (CG), others show marginal improvement (FT), still others show asymmetrical loads (BT) and asymmetrical improvement (EP).


The Next Step

How to get data and tools for programmers to deal with this?
– Hardware
– Languages
– Analysis tools

Specialized hardware counters
– Which CPU forced an eviction
– Are cores or nodes contending for data
– Coherency protocol diagnostics


The Next Step

CMP-aware parallel languages
– A language-based framework makes it easier to perform automatic optimizations
– OpenMP, UPC likely candidates
– Specialized partitioning may be needed to leverage shared caches
  • Implicit data partitioning
  • Current languages distribute data uniformly
– May require extensions (hints) in the form of language directives (see the sketch below)
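Such directives do not exist yet; the closest approximation today is ordinary OpenMP scheduling. The sketch below is illustrative only and assumes, without any guarantee, that consecutive thread ids share an L2: a static schedule with large chunks controls which thread gets which block of the array, which is exactly the knob a shared-cache hint would tune.

/* partition_hint.c — illustrative only; uses existing OpenMP directives.
 * Assumes (not guaranteed) that consecutive thread ids share an L2.
 * Build: cc -O2 -fopenmp partition_hint.c */
#include <stdio.h>

#define N (1 << 20)

static double a[N];

static void scale(double s) {
    /* Block partitioning: thread k gets the k-th quarter of the array,
     * so co-resident threads work on adjacent blocks. */
    #pragma omp parallel for schedule(static, N / 4)
    for (int i = 0; i < N; i++)
        a[i] *= s;
}

int main(void) {
    for (int i = 0; i < N; i++) a[i] = 1.0;
    scale(2.0);
    printf("a[0] = %f\n", a[0]);
    return 0;
}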


The Next Step

Post-execution analysis tools
– Identify memory hotspots
– Provide hints on restructuring
  • Blocking
  • Execution interleaving
– Convert SMP-optimized code for use in CMP
– Dynamic instrumentation opportunities


Questions?