Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Aanjhan Ranganathan (ETH Zurich), Ali Galip Bayrak (EPFL), Theo Kluter (BFH), Philip Brisk (UC Riverside), Edoardo Charbon (TU Delft), Paolo Ienne (EPFL)

Post on 25-Feb-2016

TRANSCRIPT

Page 1: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Page 2: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Multicore Embedded Systems

• Increasing number of multiprocessor based embedded systems.

• Low energy requirement with little compromise on performance.

• Significant energy consumption in the memory subsystem (caches, shared bus, main memory).

Page 3: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Symmetric Multiprocessor System

[Figure: CPU 1, CPU 2, ..., CPU n, each with private data and instruction caches (D$, I$), and a shared memory.]

Page 4: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Cache Coherency Problem

[Figure: CPU 1, CPU 2, ..., CPU n, each with private data and instruction caches (D$, I$), and a shared memory.]

Page 5: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Snoopy Hardware Coherence Protocols

[Figure: CPU 1, CPU 2, ..., CPU n, each with private data and instruction caches (D$, I$), and a shared memory.]

Snoop misses consume excessive energy

Page 6: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Snoop Filters

[Figure: CPU 1, CPU 2, ..., CPU n, each with private data and instruction caches (D$, I$) and a snoop filter (SF), and a shared memory.]

A snoop filter lookup costs less energy than a cache lookup

Page 7: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Snoop Filters in Prior Art

• Include, Exclude and Hybrid JETTY
– Expensive for an embedded system in terms of area.
– Energy consumed by the JETTYs themselves is significant.

• Stream Registers
– Present in IBM's BlueGene supercomputer.
– Inclusive filter.
– Uses a base and mask register pair to track the cache lines.

Page 8: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Stream Registers

Base   Mask   Valid   Address
1001   1111   1       0b1001
1001   1100   1       0b1010
---    ---    0

Addresses with 10XX result in snoop filter hit

No general mechanism to remove address from SR without compromising correctness
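The base/mask update shown on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the BlueGene hardware: the class name `StreamRegister`, the 4-bit width and the method names are assumptions chosen to match the slide's example, and the update rule (clear every mask bit in which a new address differs from the base) is inferred from the two rows shown.

```python
class StreamRegister:
    """Illustrative base/mask stream register (names and width are assumptions)."""

    def __init__(self, width=4):
        self.full = (1 << width) - 1  # all mask bits set: every address bit is compared
        self.base = 0
        self.mask = 0
        self.valid = False

    def insert(self, addr):
        """Record that the cache now holds the line at `addr`."""
        if not self.valid:
            # First address: remember it exactly.
            self.base = addr
            self.mask = self.full
            self.valid = True
        else:
            # Clear the mask bits in which the new address differs from the base.
            self.mask &= ~(self.base ^ addr) & self.full

    def may_contain(self, addr):
        """True means a possible hit (snoop the cache); False means filter the snoop."""
        return self.valid and ((addr ^ self.base) & self.mask) == 0


sr = StreamRegister()
sr.insert(0b1001)
sr.insert(0b1010)
print(f"base={sr.base:04b} mask={sr.mask:04b}")
print(sr.may_contain(0b1011), sr.may_contain(0b1100))
```

Because the mask only ever loses bits, the register can over-approximate the cache contents but never miss a line that is present, which is why removing a single address is not possible without extra information.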

Page 9: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Drawbacks of Stream Register based Snoop Filters

• No efficient way to update the registers when a line is removed from cache
– Degraded filtering performance over time
– Additional logic units introduced but not efficient (e.g., cache wrap detection)

Page 10: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Our Contribution

• Counting Stream Registers
– Eliminates cache wrap detection logic
– Counter to track cache lines
– More robust to workload variability
– Better or similar energy savings compared to SRs

Page 11: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Counting Stream Registers

Base   Mask   Counter   Address
1001   1111   0x01      0b1001
1001   1100   0x02      0b1010
---    ---    0

Removes the need for extra logic such as cache wrap detection, active register history etc.

Invalidated cache lines can be tracked by decrementing the counter
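A small Python sketch may help show how the counter replaces the valid bit. The class and method names are hypothetical, and one behaviour is an assumption not stated on the slide: when the counter returns to zero the register is treated as empty and its base and mask are reset, so it can start tracking afresh.

```python
class CountingStreamRegister:
    """Illustrative counting stream register (names and reset policy are assumptions)."""

    def __init__(self, width=4):
        self.full = (1 << width) - 1
        self.base = 0
        self.mask = 0
        self.count = 0  # number of cached lines currently tracked by this register

    def insert(self, addr):
        """A line at `addr` was brought into the cache."""
        if self.count == 0:
            self.base, self.mask = addr, self.full
        else:
            self.mask &= ~(self.base ^ addr) & self.full
        self.count += 1

    def remove(self, addr):
        """A line tracked by this register was invalidated or evicted."""
        if self.count > 0:
            self.count -= 1
        if self.count == 0:
            # Assumed policy: an empty register forgets its base and mask.
            self.base = self.mask = 0

    def may_contain(self, addr):
        return self.count > 0 and ((addr ^ self.base) & self.mask) == 0


csr = CountingStreamRegister()
csr.insert(0b1001)
csr.insert(0b1010)
csr.remove(0b1001)
print(csr.count, csr.may_contain(0b1011))  # still conservative while one line remains
csr.remove(0b1010)
print(csr.count, csr.may_contain(0b1010))  # empty register filters every snoop
```

Note that the mask is not tightened on a single removal; filtering precision is regained only once the register empties, which is the trade-off of this sketch.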

Page 12: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Snoop Filter Architecture

Index to direct mapped snoop filter table

Set of cache lines grouped into a page

Used for comparison with base register
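The three labels above describe fields of a snooped address. The sketch below shows one way such a split could look in Python. The field order (upper bits compared with the base register, middle bits as table index, low bits selecting a line within a page) and the widths `PAGE_BITS` and `INDEX_BITS` are assumptions for illustration; the slide does not give them.

```python
PAGE_BITS = 4   # hypothetical: 16 cache lines grouped into one page
INDEX_BITS = 3  # hypothetical: 8-entry direct-mapped snoop filter table


def split_line_address(line_addr):
    """Split a cache-line address into (tag, index, page_offset).

    tag         -> bits compared with the base register of the selected entry
    index       -> selects the entry of the direct-mapped snoop filter table
    page_offset -> position of the line inside its page
    """
    page_offset = line_addr & ((1 << PAGE_BITS) - 1)
    index = (line_addr >> PAGE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = line_addr >> (PAGE_BITS + INDEX_BITS)
    return tag, index, page_offset


tag, index, page_offset = split_line_address(0b10110_101_0011)
print(f"tag={tag:b} index={index:b} page_offset={page_offset:b}")
```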

Page 13: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Experimental Analysis

• Virtex 2 FPGA running OpenRISC soft cores
– Configurable no. of processors, associativity and size of data and instruction cache, cache type and coherence protocol

• EEMBC Multibench Benchmarks

• CACTI 5.3 energy model
– Total memory subsystem energy accounted for main memory r/w energy, data and instruction cache r/w energy, leakage and snoop energy

Page 14: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Cache Design Space Exploration

Page 15: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Results: Filtering Percentage

CSR achieves a higher filtering % for a smaller number of registers

Page 16: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Analysis: RGB2CMYK Benchmark

Page 17: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Discussion: Energy Consumption

• For most benchmarks, snoop energy was around 8-10% of the total memory subsystem energy without snoop filters

• CSR filters more effective for certain benchmarks (H.264, Image rotation)
– Better filtering performance with smaller no. of stream registers.

• Small reduction in overall energy
– Platform limited to 32 MB of off-chip SDRAM
– No complex data sharing and limited no. of multiple producers of same data
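A back-of-the-envelope calculation shows why the overall reduction is small when snoops account for only 8-10% of memory subsystem energy. The function below is a simplified bound, not the CACTI model used in the experiments: it assumes all snoop energy scales with the number of unfiltered snoops and ignores the energy of the filter itself. The 90% filtering rate is a made-up input, not a measured result.

```python
def max_memory_energy_saving(snoop_fraction, filter_rate):
    """Upper bound on the fraction of memory subsystem energy a snoop filter can save.

    snoop_fraction: share of total memory subsystem energy spent on snoops
    filter_rate:    fraction of snoops the filter removes (hypothetical input)
    """
    return snoop_fraction * filter_rate


for snoop_fraction in (0.08, 0.10):
    bound = max_memory_energy_saving(snoop_fraction, filter_rate=0.90)
    print(f"snoop share {snoop_fraction:.0%}: saving is at most {bound:.1%}")
```

Even a perfect filter cannot save more than the snoop share itself, so the bound stays below 10% under these assumptions.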

Page 18: Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Summary

• Introduced counting stream registers based snoop filter architecture
– Lower hardware complexity and ability to track cache line invalidations

• Experimental evaluation shows better filtering percentage than stream registers with less performance variation for different workloads.