proximity-aware directory-based coherence for multi-core...
TRANSCRIPT
Proximity-Aware Directory-basedCoherence for Multi-core Processor
Architectures
Jeff BrownRakesh KumarDean Tullsen
UC San Diego ● University of Illinois at Urbana-ChampaignSPAA 19 ● June 9, 2007
Introduction● The chip multiprocessor (CMP)
era is upon us!● Caching complicate writes● Cache Coherence ensures
caching is done safely
● Multi-core designs offer new tradeoffs
Introduction● The chip multiprocessor (CMP)
era is upon us!● Caching complicate writes● Cache Coherence ensures
caching is done safely
● Multi-core designs offer new tradeoffs
P
M
P
M
Introduction● The chip multiprocessor (CMP)
era is upon us!● Caching complicate writes● Cache Coherence ensures
caching is done safely
● Multi-core designs offer new tradeoffs
P
M
P
M
P
P M
M
Background: Directory-basedCache Coherence
● Directory-based; explicit per-block accounting– Doesn't rely on broadcasts
● Directory operation: client/server
Background: Directory-basedCache Coherence
● Directory-based; explicit per-block accounting– Doesn't rely on broadcasts
● Directory operation: client/server– Processors request data, permissions
P
Background: Directory-basedCache Coherence
● Directory-based; explicit per-block accounting– Doesn't rely on broadcasts
● Directory operation: client/server– Processors request data, permissions– Directory controllers manage memory access
P Dir
Background: Directory-basedCache Coherence
● Directory-based; explicit per-block accounting– Doesn't rely on broadcasts
● Directory operation: client/server– Processors request data, permissions– Directory controllers manage memory access
P
M
Dir
Background: Directory-basedCache Coherence
● Directory-based; explicit per-block accounting– Doesn't rely on broadcasts
● Directory operation: client/server– Processors request data, permissions– Directory controllers manage memory access
P
M
Dir
Background: Directory-basedCache Coherence
● Directory-based; explicit per-block accounting– Doesn't rely on broadcasts
● Directory operation: client/server– Processors request data, permissions– Directory controllers manage memory access
● Updates, conflicts
P
M
PDir
Background: HistoricalMP Cache Coherence● Distributed directory, memory
P
M
P
M
P
M
P
M
Background: HistoricalMP Cache Coherence● Distributed directory, memory
P
M
P
M
P
M
P
M
Cache Miss
Background: HistoricalMP Cache Coherence● Distributed directory, memory
P
M
P
M
P
M
P
M
Cache Miss "Home Node"
Background: HistoricalMP Cache Coherence● Distributed directory, memory
P
M
P
M
P
M
P
M
Cache Miss "Home Node"
Background: HistoricalMP Cache Coherence● Distributed directory, memory
P
M
P
M
P
M
P
M
Cache Miss "Home Node"
Data Request
Background: HistoricalMP Cache Coherence● Distributed directory, memory
P
M
P
M
P
M
P
M
Cache Miss "Home Node"
Data Request
Reply
Motivation: Multi-coreCache Coherence
M
M
P
M
P
P P
M
Motivation: Multi-coreCache Coherence
M
M
P
M
P
P P
M
Cache Miss
Motivation: Multi-coreCache Coherence
M
M
P
M
P
P P
M
Cache Miss
Motivation: Multi-coreCache Coherence
"HomeNode"
M
M
P
M
P
P P
M
Cache Miss
Motivation: Multi-coreCache Coherence
"HomeNode"
Data Request
M
M
P
M
P
P P
M
Cache Miss
Motivation: Multi-coreCache Coherence
"HomeNode"
Data Request
M
M
P
M
P
P P
M
Cache Miss
Motivation: Multi-coreCache Coherence
"HomeNode"
Reply
M
M
P
M
P
P P
M
Cache Miss
Motivation: Multi-coreCache Coherence
M
M
P
M
P
P P
M
AdditionalSharer
Motivation: Multi-coreCache Coherence
M
M
P
M
P
P P
M
AdditionalSharer
● Multi-core designs present radically differentrelative latency & bandwidth
Outline
● Introduction & Background
● System Architecture
● Proximity-Aware Coherence
● Results
● Conclusion
Directory-based Cache Coherence● Directory structures
Directory-based Cache Coherence● Directory structures
MainMemory
Directory-based Cache Coherence● Directory structures
MainMemory
Directory-based Cache Coherence● Directory structures
– Directory Memory
MainMemory
DirectoryMemory
Directory-based Cache Coherence● Directory structures
– Directory Memory– Directory Entries
MainMemory
DirectoryMemory
Directory-based Cache Coherence● Directory structures
– Directory Memory– Directory Entries– Directory Controller
MainMemory
DirectoryMemory
Controller
A Traditional Multiprocessor
Core
L2 $
Dir
Mem
Interconnect
Core
L2 $
Dir
Mem
…
A Traditional Multiprocessor
Core
L2 $
Dir
Mem
Interconnect
Core
L2 $
Dir
Mem
…
(Chassis, board, etc.)
A Traditional Multiprocessor
Core
L2 $
Dir
Mem
Interconnect
Core
L2 $
Dir
Mem
…
(Chassis, board, etc.)
Our 16-Core Chip Multiprocessor
Core L2 $
BusDircontrol
Net.switch
Dir $
Mem.channel
Tile0
Tile1
Tile15
...
Our 16-Core Chip Multiprocessor
Core L2 $
BusDircontrol
Net.switch
Dir $
Mem.channel
Tile0
Tile1
Tile15
...
Our 16-Core Chip Multiprocessor
Core L2 $
BusDircontrol
Net.switch
Dir $
Mem.channel
Tile0
Tile1
Tile15
...
Our 16-Core Chip Multiprocessor
Core L2 $
BusDircontrol
Net.switch
Dir $
Mem.channel
Tile0
Tile1
Tile15
...
Our 16-Core Chip Multiprocessor
Core L2 $
BusDircontrol
Net.switch
Dir $
Mem.channel
Tile0
Tile1
Tile15
...
Our 16-Core Chip Multiprocessor
Core L2 $
BusDircontrol
Net.switch
Dir $
Mem.channel
Tile0
Tile1
Tile15
...
Outline
● Introduction & Background
● System Architecture
● Proximity-Aware Coherence
● Results
● Conclusion
Proximity-Aware Coherence● Idea: home node asks sharer nearest requester
to forward its cached copy
Proximity-Aware Coherence● Idea: home node asks sharer nearest requester
to forward its cached copy– Stay on-chip when possible
Proximity-Aware Coherence● Idea: home node asks sharer nearest requester
to forward its cached copy– Stay on-chip when possible– Minimize transit of large data-carrying replies
Proximity-Aware Coherence● Idea: home node asks sharer nearest requester
to forward its cached copy– Stay on-chip when possible– Minimize transit of large data-carrying replies
M
M
P
M
P
P P
M
Proximity-Aware Coherence● Idea: home node asks sharer nearest requester
to forward its cached copy– Stay on-chip when possible– Minimize transit of large data-carrying replies
"HomeNode"
Data Request
Cache Miss
M
M
P
M
P
P P
M
Proximity-Aware Coherence● Idea: home node asks sharer nearest requester
to forward its cached copy– Stay on-chip when possible– Minimize transit of large data-carrying replies
"HomeNode"
M
M
P
M
P
P P
M
AdditionalSharer
Proximity-Aware Coherence● Idea: home node asks sharer nearest requester
to forward its cached copy– Stay on-chip when possible– Minimize transit of large data-carrying replies
"HomeNode"
M
M
P
M
P
P P
M
AdditionalSharer
ForwardRequest
Proximity-Aware Coherence● Idea: home node asks sharer nearest requester
to forward its cached copy– Stay on-chip when possible– Minimize transit of large data-carrying replies
Reply
M
M
P
M
P
P P
M
Proximity-Aware Coherence
● To service read misses for shared data,traditional protocols use main memory
● Other nodes may hold copies
● On the CMP landscape, inter-node latency ismuch less than memory latency
Sharer Selection
● When the home node lacks a cached copy, itselects a sharer to ask
Sharer Selection
● When the home node lacks a cached copy, itselects a sharer to ask
Miss
Home
Sharer Selection
● When the home node lacks a cached copy, itselects a sharer to ask– rand
Miss
Home
Sharer Selection
● When the home node lacks a cached copy, itselects a sharer to ask– rand– near1
Miss
Home
Sharer Selection
● When the home node lacks a cached copy, itselects a sharer to ask– rand– near1– via1
Miss
Home
Sharer Selection
● When the home node lacks a cached copy, itselects a sharer to ask– rand– near1– via1
● Retries didn't prove beneficial
Miss
Home
Outline
● Introduction & Background
● System Architecture
● Proximity-Aware Coherence
● Results
● Conclusion
Methodology
● Detailed, execution-driven processor andnetwork simulation
● "RSIM" simulator, adapted to our CMP model● Parallel workloads from several suites● Hardware, benchmark details in paper
Proximity-Aware: Potential Coverage
appbt fft lu mp3d ocean quicksort unstruct0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
6
5
4
3
2
1
Fra
ctio
nofre
ad
mis
ses
tosh
are
dlin
es
Proximity-Aware: Potential Coverage
appbt fft lu mp3d ocean quicksort unstruct0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
6
5
4
3
2
1
Fra
ctio
nofre
ad
mis
ses
tosh
are
dlin
es
Proximity-Aware: Potential Coverage
appbt fft lu mp3d ocean quicksort unstruct0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
6
5
4
3
2
1
Fra
ctio
nofre
ad
mis
ses
tosh
are
dlin
es
Overallx=43%
Proximity-Aware: Potential Coverage
appbt fft lu mp3d ocean quicksort unstruct0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
6
5
4
3
2
1
Fra
ctio
nofre
ad
mis
ses
tosh
are
dlin
es
Overallx=43%
Proximity-Aware: Potential Coverage
appbt fft lu mp3d ocean quicksort unstruct0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
6
5
4
3
2
1
Fra
ctio
nofre
ad
mis
ses
tosh
are
dlin
es
Overallx=43%
dist 1x=75%
Proximity-Aware: Latency Benefit
appbt fft lu mp3d ocean quicksort
un-struct
mean0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
rand
near1
via1
Nor
mal
ized
L2m
iss
late
ncy
Proximity-Aware: Latency Benefit
appbt fft lu mp3d ocean quicksort
un-struct
mean0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
rand
near1
via1
Nor
mal
ized
L2m
iss
late
ncy
Proximity-Aware: Latency Benefit
appbt fft lu mp3d ocean quicksort
un-struct
mean0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
rand
near1
via1
Nor
mal
ized
L2m
iss
late
ncy
Latency-25%
Proximity-Aware: Latency Benefit
appbt fft lu mp3d ocean quicksort
un-struct
mean0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
rand
near1
via1
Nor
mal
ized
L2m
iss
late
ncy
Latency-25%
Proximity-Aware: Latency Benefit
appbt fft lu mp3d ocean quicksort
un-struct
mean0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
rand
near1
via1
Nor
mal
ized
L2m
iss
late
ncy
Latency-25%
Reply traffic-6%
Proximity-Aware: Latency Benefit
appbt fft lu mp3d ocean quicksort
un-struct
mean0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
rand
near1
via1
Nor
mal
ized
L2m
iss
late
ncy
Latency-25%
Reply traffic-6%
Proximity-Aware: Speedup
appbt fft lu mp3d ocean quicksort
un-struct
mean0.00
0.20
0.40
0.60
0.80
1.00
1.20
1.40
1.60
1.80
rand
near1
via1Spe
edup
Proximity-Aware: Speedup
appbt fft lu mp3d ocean quicksort
un-struct
mean0.00
0.20
0.40
0.60
0.80
1.00
1.20
1.40
1.60
1.80
rand
near1
via1Spe
edup
Proximity-Aware: Speedup
appbt fft lu mp3d ocean quicksort
un-struct
mean0.00
0.20
0.40
0.60
0.80
1.00
1.20
1.40
1.60
1.80
rand
near1
via1Spe
edup
Speedup16%
Proximity-Aware: Speedup
● L2 latency sensitivity of workloads
appbt fft lu mp3d ocean quicksort
un-struct
mean0.00
0.20
0.40
0.60
0.80
1.00
1.20
1.40
1.60
1.80
rand
near1
via1Spe
edup
Speedup16%
Conclusion
● The latency/bandwidth aspects of CMPsmotivates multicore-aware coherence redesign
● One such change: Proximity-Aware Coherence– Ideas: stay on-chip, decrease "bulk" transit– Mean speedup 16%, mean L2 latency down 25%
● More aggressive techniques are under study
Conclusion
● The latency/bandwidth aspects of CMPsmotivates multicore-aware coherence redesign
● One such change: Proximity-Aware Coherence– Ideas: stay on-chip, decrease "bulk" transit– Mean speedup 16%, mean L2 latency down 25%
● More aggressive techniques are under study
● Questions?