Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors
Ali Shafiee, Narges Shahidi, Amirali Baniasadi
Sharif University of Technology, University of Victoria
This Work: Improving Snoop Coherency
Goal: Improving energy efficiency in snoop-based CMPs.
Motivation: Broadcasting/processing the entire tag is inefficient.
Our Solution: Using Partial Tag Comparison (PTC) prior to snoop.
Key Results: Performance (2.9%), Tag array power (52%), Bandwidth utilization (78.5%)
Our Solution (PTC) vs. Conventional
[Figure: two CMP organizations side by side, each with per-core D$ caches connected through an interconnect to the upper level cache.]
Conventional: Fast (+); Power & Bandwidth (−)
Our solution: Fast (++, early miss detection); Power & Bandwidth Efficient (+)
Conventional Snooping
[Figure: CPUs with D$ caches and a controller attached to the address bus, snoop bus, and command bus; numbered steps trace a snoop request.]
Redundant snoops (miss): ~70%
Snoop Filters
Goal: Eliminate redundant snoop requests.
Examples: RegionScout (ISCA'05), CGCT (ISCA'05), SSP (ASPLOS'08)
PTC:
(1) Early miss detection using a subset of tag bits.
(2) Once a miss is detected, snoop is avoided.
How often is that possible?
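A minimal sketch of the early miss test in steps (1) and (2), assuming the partial tag is the n least-significant tag bits. The function names and the 8-bit default are illustrative and do not come from the slides. A partial mismatch against every resident tag proves a miss, while a partial match only marks the cache as a potential target.

```python
def partial(tag: int, n_bits: int = 8) -> int:
    """Return the n least-significant bits of a tag (assumed partial-tag choice)."""
    return tag & ((1 << n_bits) - 1)


def early_miss(requested_tag: int, resident_tags: list[int], n_bits: int = 8) -> bool:
    """True when no resident tag shares the requested tag's low bits.

    Equal tags always have equal low bits, so True guarantees a miss.
    False is inconclusive: the full tags may still differ.
    """
    want = partial(requested_tag, n_bits)
    return all(partial(t, n_bits) != want for t in resident_tags)
```

For example, `early_miss(0x1234, [0x5634])` is `False` even though the full tags differ, because both tags end in `0x34`; such a request still needs a regular snoop.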
How often are n tag bits enough to detect a miss?
95+% of misses can be detected using 8 bits.
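One way to measure this kind of coverage is to replay lookups and count the misses that an n-bit comparison already rules out. The sketch below does this for a list of (requested tag, resident tags) pairs. The helper name and the toy trace are hypothetical, and the example makes no attempt to reproduce the 95% figure.

```python
def detected_miss_fraction(lookups, n_bits):
    """Fraction of true misses that an n-bit partial compare detects.

    lookups: iterable of (requested_tag, resident_tags) pairs.
    Returns None when the trace contains no misses.
    """
    mask = (1 << n_bits) - 1
    misses = detected = 0
    for requested, residents in lookups:
        if requested in residents:
            continue  # a real hit, not part of the miss population
        misses += 1
        if all((t & mask) != (requested & mask) for t in residents):
            detected += 1  # the partial tags alone prove the miss
    return detected / misses if misses else None


# Toy trace, made up for illustration only.
trace = [
    (0x1A2B, [0x0001, 0x0002]),  # miss, low bits differ
    (0x1A2B, [0x3C2B, 0x0002]),  # miss, but the low 8 bits collide with 0x3C2B
    (0x1A2B, [0x1A2B, 0x0002]),  # hit
]
```

Sweeping `n_bits` over a real trace would give a coverage curve of the kind this slide refers to.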
PTC-Filter
[Figure: the PTC-Filter compares the LSBs of the tag on the address bus against the LSBs stored for each D$.]
Filter miss: avoid snoop, access upper level.
Filter hit: snoop potential targets.
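The two filter outcomes on this slide (miss: avoid the snoop and access the upper level; hit: snoop the potential targets) can be sketched as a small routing function. The function name, the action labels, and the dictionary input are assumptions made for illustration.

```python
def route_request(partial_hits: dict[int, bool]):
    """Decide what to do with a request after the filter lookup.

    partial_hits maps a remote core id to whether its stored LSBs matched.
    """
    targets = sorted(core for core, hit in partial_hits.items() if hit)
    if not targets:
        return ("access_upper_level", [])  # filter miss: no snoop needed
    return ("snoop", targets)              # filter hit: snoop only these caches
```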
PTC-Filter
[Figure: four 4-way D$ caches, each paired with a PTC-Filter. A filter is drawn with columns 0 to 3 and sections labeled Core1's LSB, Core2's LSB, Core3's LSB. Each entry has V, D, and LSB fields; the LSB field is 8 bits wide.]
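A rough model of the filter storage drawn here, assuming each core's filter mirrors the other cores' D$ tag arrays, that it is indexed by cache set, and that columns 0 to 3 are the four ways. The class and method names are hypothetical, and the D field is left out because the slides do not define it.

```python
LSB_BITS = 8  # width of the stored partial tag, per the slide


class PTCFilter:
    """Partial-tag mirror of the other cores' D$ tag arrays (assumed layout)."""

    def __init__(self, remote_cores, num_sets, ways=4):
        self.remote_cores = list(remote_cores)
        # table[core][set][way] is None (invalid) or an 8-bit partial tag
        self.table = {
            core: [[None] * ways for _ in range(num_sets)]
            for core in self.remote_cores
        }

    def update(self, core, set_index, way, tag):
        """Record that `core` now holds `tag` in the given set and way."""
        self.table[core][set_index][way] = tag & ((1 << LSB_BITS) - 1)

    def potential_targets(self, set_index, tag):
        """Remote cores whose stored LSBs match the request's LSBs."""
        want = tag & ((1 << LSB_BITS) - 1)
        return [core for core in self.remote_cores
                if want in self.table[core][set_index]]
```

An empty result from `potential_targets` corresponds to a filter miss; a non-empty result lists the caches that still need a full-tag snoop.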
PTC: Filter Miss
[Figure: CPUs with D$ caches and a controller attached to the address bus, snoop bus, and command bus; numbered steps 1 to 3 trace a request that misses in the filter.]
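PTC: Filter Hit
[Figure: CPUs with D$ caches and a controller attached to the address bus, snoop bus, and command bus; numbered steps 1 to 6 trace a request that hits in the filter, with check marks (✓) and crosses (✗) annotating the caches.]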
Filter Maintenance
[Figure: a CPU, its PTC-Filter (Core 0 … Core i), the snoop controller, the address bus, and the command bus; numbered steps 1 to 6.]
Request = A; the set currently holds tags B, F, D, E.
Miss on A: place it in the position of tag F.
Pending Request Table (columns Addr., C, W, D): {Address=A, C=0, W=1, D=1}
Place A: insert in Way 1 of core 0.
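The maintenance example on this slide can be replayed in a few lines. This is only a sketch: it reads C as the requesting core and W as the way to fill (consistent with "Way 1 of core 0"), carries D through without interpreting it, and uses symbolic tags in place of 8-bit LSBs. The type and function names are hypothetical.

```python
from typing import NamedTuple


class PendingRequest(NamedTuple):
    address: str  # requested block, e.g. "A"
    c: int        # core that issued the request
    w: int        # way the block will be placed in
    d: int        # D field from the slide; its meaning is not modeled here


def apply_fill(filter_ways: dict[int, list[str]], req: PendingRequest) -> str:
    """Overwrite the filter's copy of core c, way w with the new address.

    Returns the entry that was replaced.
    """
    ways = filter_ways[req.c]
    evicted = ways[req.w]
    ways[req.w] = req.address
    return evicted


# Slide example: core 0's set holds B, F, D, E and misses on A.
filter_ways = {0: ["B", "F", "D", "E"]}
pending = PendingRequest(address="A", c=0, w=1, d=1)
evicted = apply_fill(filter_ways, pending)
```

After the fill, the filter's view of core 0 is B, A, D, E, and the replaced entry is F.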
Methodology
• SESC simulator, 4-way CMP
• SPLASH-2 benchmarks
• CACTI 6.0
• L2: 4 MB, 4-banked, 16-way, 10 cycle latency
• DL1/IL1: 64KB/32KB, 4-way/2-way, 3 cycle latency
• 64 B cache line
• 500 cycle memory access
• 6 cycle arbitration + 2 cycle core to controller latency
• Crossbar data network
• MESI protocol
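To put the 8-bit partial tag in context, the DL1 parameters above can be turned into address field widths. The sketch assumes a 32-bit address, which the slides do not state, and the helper name is hypothetical.

```python
def cache_bits(size_bytes, ways, line_bytes, addr_bits=32):
    """Split an address into (offset, index, tag) bit counts.

    addr_bits=32 is an assumption; the slides do not give the address width.
    Sizes are expected to be powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits


dl1 = cache_bits(64 * 1024, ways=4, line_bytes=64)
```

Under that assumption the 64KB, 4-way DL1 has 256 sets, so `dl1` is a 6-bit offset, an 8-bit index, and an 18-bit tag; an 8-bit partial tag would then cover fewer than half of the tag bits.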
Performance
Average improvement: 2.9%
Bandwidth
Average: 78.5%
Tag Power
Average: 52%
Discussion
Why do benchmarks show different performance improvements?
• Different cache miss frequency
• Different early miss detection frequency
• Not all cache misses are on the critical path
Filter overhead:
• Timing: 1 cycle
• Power: 78.5% of single tag array access
Summary
PTC: Using a subset of tag bits to improve bandwidth/power efficiency.
Results: Performance: 2.9%, Tag Power: 52%, Bandwidth: 78.5%
Global vs. Local Miss
[Figure: two copies of the CMP, each with D$ caches connected through an interconnect to the upper level cache. A requester asks the other caches "Have B?".]
Global Miss: every other cache answers NO.
Local Miss: some caches answer NO while another answers YES.
Remote miss detection (source-based approach) vs. local miss detection (destination-based filter): better power/bandwidth profile.
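The distinction in the figure can be written as a classification over the other caches' answers to "Have B?". The labels follow the slide; the function name and the extra "all hit" case are assumptions added so that the function is total.

```python
def classify_snoop(responses: list[bool]) -> str:
    """Classify a request from the other caches' answers to "Have B?".

    responses[i] is True when remote cache i holds the block.
    """
    if not any(responses):
        return "global miss"  # no remote cache holds the block
    if not all(responses):
        return "local miss"   # some caches miss while another one hits
    return "all hit"
```

With the answers from the figure, `[False, False]` is a global miss and `[False, True, False]` is a local miss.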
Partial tag lookup: global miss
Partial tag lookup: local miss