Fetch Directed Instruction Prefetching
Glenn Reinman, Brad Calder,
Department of Computer Science and Engineering,
University of California San Diego
and Todd Austin, Department of Electrical Engineering and Computer Science,
University of Michigan
Introduction
• Instruction supply is critical to processor performance
  – Complicated by instruction cache misses
  – Instruction cache miss solutions:
    • Increasing the size or associativity of the instruction cache
    • Instruction cache prefetching
      – Which cache blocks to prefetch?
      – Timeliness of prefetch
      – Interference with demand misses
[Diagram: Instruction Fetch → Issue Buffer → Execution Core]
Prior Instruction Prefetching Work
• Next line prefetching (NLP) (Smith)
  – Each cache block is tagged with an NLP bit
  – When a block is accessed during a fetch
    • The NLP bit determines whether the next sequential block is prefetched
  – Prefetch into a fully associative buffer
• Streaming buffers (Jouppi)
  – On a cache miss, sequential cache blocks, starting with the block that missed, are prefetched into a buffer
• Buffer can use fully associative lookup
• Uniqueness filter can avoid redundant prefetches
• Multiple streaming buffers can be used together
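The streaming-buffer bullets above can be sketched in a few lines of Python. This is a minimal model, not the talk's implementation: the 32-byte block size, the buffer depth, and the set-based uniqueness filter are illustrative assumptions.

```python
from collections import deque

BLOCK_SIZE = 32  # assumed cache block size in bytes

class StreamBuffer:
    """Minimal model of one streaming buffer: on a miss, queue the next
    sequential blocks; a uniqueness filter avoids redundant prefetches."""

    def __init__(self, depth=4):
        self.depth = depth
        self.queue = deque()   # blocks prefetched into the buffer
        self.seen = set()      # uniqueness filter (unbounded, for sketching)

    def on_icache_miss(self, miss_addr):
        """Prefetch sequential blocks starting with the block that missed."""
        base = miss_addr // BLOCK_SIZE
        issued = []
        for i in range(self.depth):
            block = (base + i) * BLOCK_SIZE
            if block not in self.seen:   # skip redundant prefetches
                self.seen.add(block)
                self.queue.append(block)
                issued.append(block)
        return issued

    def lookup(self, addr):
        """Fully associative lookup: hit if the block is anywhere in
        the buffer."""
        block = (addr // BLOCK_SIZE) * BLOCK_SIZE
        return block in self.queue

sb = StreamBuffer(depth=4)
sb.on_icache_miss(0x1000)   # queues blocks 0x1000, 0x1020, 0x1040, 0x1060
print(sb.lookup(0x1044))    # True: address falls in block 0x1040
```

Multiple streaming buffers would simply instantiate several of these and probe them all on each fetch.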
Our Prefetching Approach
• Desirable characteristics
– Accuracy of prefetch
• Useful prefetches
– Timeliness of prefetch
• Maximize prefetch gain
• Fetch Directed Prefetching
– Branch predictor runs ahead of instruction cache
– Instruction cache prefetch guided by instruction stream
Talk Overview
• Fetch Target Queue (FTQ)
• Fetch Directed Prefetching (FDP)
• Filtering Techniques
• Enhancements to Streaming Buffers
• Bandwidth Considerations
• Conclusions
Fetch Target Queue
• Queue of instruction fetch addresses
• Latency tolerance
  – Branch predictor can continue in the face of an icache miss
  – Instruction fetch can continue in the face of a branch predictor miss
• When combined with a high bandwidth branch predictor
  – Provides a stream of instruction addresses far in advance of the current PC
[Diagram: Branch Predictor → FTQ → Instruction Fetch → Issue Buffer → Execution Core]
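The FTQ's decoupling role can be illustrated with a minimal queue model. The queue size and the stall/idle return conventions here are assumptions for illustration, not details from the talk.

```python
from collections import deque

class FetchTargetQueue:
    """Sketch of an FTQ: the branch predictor enqueues predicted fetch
    addresses, the fetch stage dequeues them, and either side can stall
    without stopping the other."""

    def __init__(self, size=32):
        self.size = size
        self.q = deque()

    def enqueue(self, fetch_addr):
        """Called by the branch predictor; False means the FTQ is full
        and the predictor stalls this cycle."""
        if len(self.q) >= self.size:
            return False
        self.q.append(fetch_addr)
        return True

    def dequeue(self):
        """Called by instruction fetch; None means fetch idles."""
        return self.q.popleft() if self.q else None

# The predictor runs ahead while fetch is blocked on an icache miss:
ftq = FetchTargetQueue(size=32)
for pc in (0x400, 0x440, 0x480):   # predictions made during the miss
    ftq.enqueue(pc)
print(hex(ftq.dequeue()))          # fetch resumes in order; prints 0x400
```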
Fetch Directed PrefetchingFetch Directed Prefetching
Instruction FetchBranch Predictor
FTQ
Fully associative
buffer
Prefetch Enqueue(filtration mechanisms)
Prefetch
current FTQprefetch candidate
•Stream of PCs contained in FTQ guides prefetch–FTQ is searched in-order for entries to prefetch–Prefetched cache blocks stored in fully associative queue–Fully associative queue and instruction cache probed in parallel
PIQ
(32 entry)
(32 entry)
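The in-order FTQ scan that feeds the PIQ might look like the following sketch; the 32-byte block size and the set/list membership tests are simplifying assumptions standing in for real tag checks.

```python
def select_prefetch_candidates(ftq_entries, icache, piq, piq_size=32):
    """Walk the FTQ in order and enqueue prefetch candidates into the
    PIQ, skipping blocks already cached or already enqueued (a crude
    stand-in for the talk's filtration mechanisms)."""
    for addr in ftq_entries:
        if len(piq) >= piq_size:
            break                    # PIQ full: stop scanning the FTQ
        block = addr & ~0x1F         # assumed 32-byte cache blocks
        if block in icache or block in piq:
            continue                 # nothing to gain from prefetching
        piq.append(block)            # candidate goes to the PIQ
    return piq

# Two FTQ entries map to the same block; only one prefetch is enqueued,
# and the block already in the icache is skipped:
print(select_prefetch_candidates([0x1004, 0x1024, 0x1028], {0x1000}, []))
# → [4128] (block 0x1020)
```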
Methodology
• SimpleScalar Alpha 3.0 tool set (Burger, Austin)
  – SPEC95 C benchmarks
    • Fast forwarded past the initialization portion of each benchmark
  – Can issue 8 instructions per cycle
  – 128 entry reorder buffer
  – 32 entry load/store buffer
  – Variety of instruction cache sizes
    • 16K 2-way and 4-way associative
    • 32K 2-way associative
    • Tried both single and dual ported configurations
  – Instruction cache size for this talk is 16K 2-way
  – 32K 4-way associative data cache
  – Unified 1MB 4-way associative second level cache
Bandwidth Concerns
• Prefetching can disrupt demand fetching
– Need to model bus utilization
• Modified SimpleScalar’s memory hierarchy
– Accurate modeling of bus usage
– Two configurations of L2 cache bus to main memory
• 32 bytes/cycle
• 8 bytes/cycle
– Single port on L2 cache
• Shared by both data and instruction caches
Performance of Fetch Directed Prefetch
[Chart: % IPC speedup (0–70% axis) for groff, gcc, go, m88ksim, perl, vortex, and average, under 32 bytes/cycle and 8 bytes/cycle L2 buses; annotations note 41% and 66% bus utilization, and one bar reaches 89.9%]
Reducing Wasted Prefetches
• Reduce bus utilization while retaining speedup
– How to identify useless or redundant prefetches?
• Variety of filtration techniques
– FTQ Position Filtering
– Cache Probe Filtering
• Use idle instruction cache ports to validate prefetches
– Remove CPF
– Enqueue CPF
– Evict Filtering
Cache Probe Filtering
• Use the instruction cache to validate FTQ entries for prefetch
– FTQ entries are initially unmarked
– If cache block is in i-cache, invalidate FTQ entry
– If cache block is not in i-cache, validate FTQ entry
• Validation can occur whenever a cache port is idle
– When the instruction window is full
– Instruction cache miss
• Lockup-free instruction cache
Cache Probe Filtering Techniques
• Enqueue CPF
– Only enqueue Valid prefetches
– Conservative, low bandwidth approach
• Remove CPF
– By default, prefetch all FTQ entries.
– If idle cache ports are available for validation
• Do not prefetch entries which are found Invalid
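A rough model of the two policies, assuming a boolean icache-lookup probe available on an idle port (all names here are hypothetical):

```python
def cpf_validate(ftq_entries, icache_lookup):
    """Probe the icache on an idle port: entries whose block already
    hits need no prefetch (Invalid); the rest are worth prefetching
    (Valid)."""
    valid, invalid = [], []
    for addr in ftq_entries:
        (invalid if icache_lookup(addr) else valid).append(addr)
    return valid, invalid

def enqueue_cpf(ftq_entries, icache_lookup):
    """Conservative, low bandwidth: only enqueue prefetches proven
    Valid by a probe."""
    valid, _ = cpf_validate(ftq_entries, icache_lookup)
    return valid

def remove_cpf(ftq_entries, icache_lookup, port_idle):
    """By default prefetch every FTQ entry; when a port is idle,
    remove the entries found Invalid."""
    if not port_idle:
        return list(ftq_entries)
    valid, _ = cpf_validate(ftq_entries, icache_lookup)
    return valid

cached = {0x100}                        # hypothetical icache contents
lookup = lambda addr: addr in cached
print(enqueue_cpf([0x100, 0x200], lookup))        # → [512]
print(remove_cpf([0x100, 0x200], lookup, False))  # → [256, 512]
```

The contrast is visible in the last two lines: Enqueue CPF drops the already-cached block even when no port is idle, while Remove CPF prefetches everything until a probe proves a block redundant.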
Performance of Filtering Techniques (8 bytes/cycle)
[Chart: % IPC speedup (0–70% axis) for groff, gcc, go, m88ksim, perl, vortex, and average, comparing No Filt, Rem CPF, and Enq CPF; annotations note 55% and 30% bus utilization]
Eviction Prefetching Example
[Diagram: FTB entries augmented with evict bits; on an instruction cache miss, the evicted cache block's FTB index is recorded and that entry's evict bit is set, so the bit is set for the next prediction]
• If the branch predictor holds more state than the instruction cache
  – Mark evicted cache blocks in the branch predictor
  – Prefetch those blocks when predicted
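A toy model of the evict-bit mechanism described above; for simplicity it assumes the evicted block's address indexes the FTB directly, which abstracts away the real indexing shown on the slide.

```python
class FTBEntry:
    """Fetch target buffer entry extended with an evict bit."""
    def __init__(self, target):
        self.target = target   # predicted fetch target
        self.evict_bit = 0     # set when the target's block is evicted

class EvictionPrefetcher:
    def __init__(self):
        self.ftb = {}          # index -> FTBEntry (tiny stand-in FTB)

    def on_eviction(self, index):
        """The instruction cache evicted a block: mark the matching FTB
        entry so the block is prefetched at its next prediction."""
        entry = self.ftb.get(index)
        if entry is not None:
            entry.evict_bit = 1

    def on_prediction(self, index):
        """Return an address to prefetch, or None. The bit is cleared
        because the prefetch restores the block to the cache."""
        entry = self.ftb.get(index)
        if entry is not None and entry.evict_bit:
            entry.evict_bit = 0
            return entry.target
        return None

ep = EvictionPrefetcher()
ep.ftb[3] = FTBEntry(0x480)   # hypothetical FTB entry
ep.on_eviction(3)             # block for entry 3 leaves the icache
print(ep.on_prediction(3))    # next prediction triggers the prefetch
```

Note the timeliness advantage over Enqueue CPF: the prefetch fires as soon as the prediction is made, with no wait for an idle cache port.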
Performance of Filtering Techniques (8 bytes/cycle)
[Chart: % IPC speedup (0–70% axis) for groff, gcc, go, m88ksim, perl, vortex, and average, comparing No Filt, Rem CPF, Enq CPF, Evicted, and Enq CPF + Evict; annotations note 20% and 31% bus utilization]
Enqueue CPF and Eviction Prefetching
• Effective combination of two low bandwidth approaches
• Both attempt to prefetch entries not in instruction cache
• Enqueue CPF needs to wait on idle cache port to prefetch
• Eviction Prefetching can prefetch when prediction is made
• Combined
– Eviction Prefetching gives basic coverage
– Enqueue CPF finds additional prefetches that Evict misses
Streaming Buffer Enhancements
• All configurations used uniqueness filters and fully associative lookup
• Base configurations
  – Single streaming buffer (SB1)
  – Dual streaming buffers (SB2)
  – Eight streaming buffers (SB8)
• Cache Probe Filtering (CPF) enhancements
  – Filter out streaming buffer prefetches already in the icache
  – Stop filtering
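One plausible reading of "stop filtering" — ending a stream at the first block found already cached — can be sketched as follows. The policy details (probing only beyond the missing block, a fixed stream depth, 32-byte blocks) are assumptions based on this bullet, not confirmed by the talk.

```python
def stream_with_stop_filter(miss_addr, in_icache, depth=4, block=32):
    """Prefetch sequential blocks after a miss, but stop the stream at
    the first block the icache probe finds already cached."""
    base = (miss_addr // block) * block
    prefetches = []
    for i in range(depth):
        addr = base + i * block
        if i > 0 and in_icache(addr):  # probe on an idle icache port
            break                      # stop: rest of stream likely cached
        prefetches.append(addr)
    return prefetches

# The stream ends when 0x1040 is found in the cache:
hits = lambda addr: addr == 0x1040
print([hex(a) for a in stream_with_stop_filter(0x1000, hits)])
# → ['0x1000', '0x1020']
```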
Streaming Buffer Results
[Chart: % IPC speedup (0–70% axis) for groff, gcc, go, m88ksim, perl, vortex, and average at 8 bytes/cycle, comparing SB1, SB2, SB8 and SB1-Stop, SB2-Stop, SB8-Stop; annotations note 58% and 36% bus utilization]
Selected Low Bandwidth Results
[Chart: % IPC speedup (0–45% axis) and % bus utilization (0–70% axis) at 8 bytes/cycle, comparing NLP, SB1 FiltStop, SB8 FiltStop, No Filt, Rem CPF, Enq CPF, Evicted, and Enq CPF + Evict]
Selected High Bandwidth Results
[Chart: % IPC speedup (0–45% axis) and % bus utilization (0–70% axis) at 32 bytes/cycle, comparing NLP, SB1 FiltStop, SB8 FiltStop, No Filt, Rem CPF, Enq CPF, Evicted, and Enq CPF + Evict]
Conclusion
• Fetch Directed Prefetching
  – Accurate, just-in-time prefetching
• Cache Probe Filtering
  – Reduces the bus bandwidth of fetch directed prefetching
  – Also useful for streaming buffers
• Evict Filter
  – Provides accurate prefetching by identifying evicted cache blocks
• Fully associative versus in-order prefetch buffer
  – Available in an upcoming tech report by end of year
Prefetching Tradeoffs
• NLP
  – Simple, low bandwidth approach
  – No notion of prefetch usefulness
  – Limited timeliness
• Streaming Buffers
  – Take advantage of the latency of a cache miss
  – Can use low to moderate bandwidth with filtering
  – No notion of prefetch usefulness
• Fetch Directed Prefetching
  – Prefetches based on the prediction stream
  – Can use low to moderate bandwidth with filtering
  – Most useful with accurate branch prediction