Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Aanjhan Ranganathan (ETH Zurich), Ali Galip Bayrak (EPFL), Theo Kluter (BFH), Philip Brisk (UC Riverside), Edoardo Charbon (TU Delft), Paolo Ienne (EPFL)

Post on 15-Dec-2015

Multicore Embedded Systems

• Increasing number of multiprocessor based embedded systems.

• Low energy requirement with little compromise on performance.

• Significant energy consumption in the memory subsystem (caches, shared bus, main memory).

Symmetric Multiprocessor System

[Figure: a shared memory connected to CPU 1, CPU 2, ..., CPU n, each with its own data cache (D$) and instruction cache (I$).]

Cache Coherency Problem

[Figure: a shared memory connected to CPU 1, CPU 2, ..., CPU n, each with its own data cache (D$) and instruction cache (I$).]

Snoopy Hardware Coherence Protocols

[Figure: a shared memory connected to CPU 1, CPU 2, ..., CPU n, each with its own data cache (D$) and instruction cache (I$).]

Snoop misses consume excessive energy.

Snoop Filters

[Figure: a shared memory connected to CPU 1, CPU 2, ..., CPU n, each with its own data cache (D$), instruction cache (I$) and snoop filter (SF).]

A snoop filter lookup costs less energy than a cache lookup.

Snoop Filters in Prior Art

• Include, Exclude and Hybrid JETTY
  – Expensive for an embedded system in terms of area.
  – Energy consumed by the JETTYs themselves is significant.

• Stream Registers
  – Present in IBM's BlueGene Supercomputer.
  – Inclusive filter.
  – Uses a base and mask register pair to track the cache lines.

Stream Registers

Base      Mask      Valid   Address inserted
1 0 0 1   1 1 1 1   1       0b1001
1 0 0 1   1 1 0 0   1       0b1010
---       ---       0       (unused entry)

Addresses matching 10XX result in a snoop filter hit.

No general mechanism to remove an address from an SR without compromising correctness.
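
The base/mask update shown in the table can be sketched in a few lines of Python. This is a minimal illustration rather than the slides' implementation: the class name `StreamRegister`, its method names and the 4-bit address width are assumptions chosen to match the example values above.

```python
class StreamRegister:
    """One base/mask/valid entry of an inclusive stream-register filter (sketch)."""

    def __init__(self, width=4):
        # Hypothetical 4-bit addresses, matching the slide's example values.
        self.full = (1 << width) - 1  # mask with every address bit significant
        self.base = 0
        self.mask = self.full
        self.valid = False

    def insert(self, addr):
        """Record that the cache now holds the line at `addr`."""
        if not self.valid:
            # The first address becomes the base; all bits must match.
            self.base, self.mask, self.valid = addr, self.full, True
        else:
            # Bits where addr differs from the base become "don't care".
            self.mask &= ~(addr ^ self.base) & self.full

    def may_contain(self, addr):
        """False means the line is definitely not cached, so the snoop is filtered."""
        return self.valid and ((addr ^ self.base) & self.mask) == 0


sr = StreamRegister()
sr.insert(0b1001)   # base = 1001, mask = 1111
sr.insert(0b1010)   # base = 1001, mask = 1100
hit = sr.may_contain(0b1000)    # matches 10XX although it was never inserted
miss = sr.may_contain(0b0101)   # differs in a significant bit
```

The mask can only lose bits in this scheme, which is why the filter stays safe but becomes less selective over time.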

Drawbacks of Stream Register based Snoop Filters

• No efficient way to update the registers when a line is removed from cache
  – Degraded filtering performance over time
  – Additional logic units introduced but not efficient (e.g., cache wrap detection)

Our Contribution

• Counting Stream Registers
  – Eliminates cache wrap detection logic
  – Counter to track cache lines
  – More robust to workload variability
  – Better or similar energy savings compared to SRs

Counting Stream Registers

Base      Mask      Counter   Address inserted
1 0 0 1   1 1 1 1   0x01      0b1001
1 0 0 1   1 1 0 0   0x02      0b1010
---       ---       0         (unused entry)

Removes the need for extra logic such as cache wrap detection, active register history, etc.

Invalidated cache lines can be tracked by decrementing the counter.
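
A counting variant can be sketched as follows. The slides state only that a counter replaces the valid bit and is decremented on invalidation; resetting the entry when the counter returns to zero is an assumption of this sketch, as are the class name `CountingStreamRegister`, its method names and the 4-bit address width. Counter overflow is not modelled.

```python
class CountingStreamRegister:
    """Base/mask entry whose valid bit is replaced by a line counter (sketch)."""

    def __init__(self, width=4):
        self.full = (1 << width) - 1  # mask with every address bit significant
        self.base = 0
        self.mask = self.full
        self.count = 0  # number of cached lines this entry currently covers

    def insert(self, addr):
        """A line at `addr` was brought into the cache."""
        if self.count == 0:
            # Assumption: an empty entry restarts with an exact-match mask.
            self.base, self.mask = addr, self.full
        else:
            self.mask &= ~(addr ^ self.base) & self.full
        self.count += 1

    def remove(self):
        """A line covered by this entry was invalidated or evicted."""
        if self.count > 0:
            self.count -= 1

    def may_contain(self, addr):
        """False means the snoop can be filtered without a cache lookup."""
        return self.count > 0 and ((addr ^ self.base) & self.mask) == 0


csr = CountingStreamRegister()
csr.insert(0b1001)            # counter = 0x01, mask = 1111
csr.insert(0b1010)            # counter = 0x02, mask = 1100
csr.remove()
csr.remove()                  # counter back to 0: the entry filters everything
empty_hit = csr.may_contain(0b1001)
csr.insert(0b0110)            # the entry starts over with a fresh base and mask
```

Under this assumption the mask recovers its selectivity each time the entry empties, without any separate cache wrap detection.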

Snoop Filter Architecture

Index to direct mapped snoop filter table

Set of cache lines grouped into a page

Used for comparison with base register
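
The three labels above describe fields of the snooped address, but the figure itself is not in the transcript, so the field order and widths are unknown. The sketch below assumes one plausible layout (upper bits compared with the base register, then the table index, then the line-within-page bits, then the byte offset); every constant and the function name `split_address` are hypothetical.

```python
# Hypothetical field widths, chosen only for illustration.
LINE_BITS = 5    # byte offset within a 32-byte cache line
PAGE_BITS = 4    # 16 cache lines grouped into one page
INDEX_BITS = 3   # direct-mapped snoop filter table with 8 entries


def split_address(addr):
    """Split an address into (tag, index, line_in_page) under the assumed layout."""
    line_in_page = (addr >> LINE_BITS) & ((1 << PAGE_BITS) - 1)
    index = (addr >> (LINE_BITS + PAGE_BITS)) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (LINE_BITS + PAGE_BITS + INDEX_BITS)
    return tag, index, line_in_page


# Build an address from known fields and split it again.
addr = (0xAB << 12) | (5 << 9) | (9 << 5) | 17
fields = split_address(addr)  # tag 0xAB selects nothing; index 5 selects the entry
```

With this layout, all lines of one page map to the same table entry, and only the upper bits take part in the base comparison.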

Experimental Analysis

• Virtex 2 FPGA running OpenRISC soft cores
  – Configurable no. of processors, associativity and size of data and instruction cache, cache type and coherence protocol

• EEMBC Multibench Benchmarks

• CACTI 5.3 energy model
  – Total memory subsystem energy accounted for main memory r/w energy, data and instruction cache r/w energy, leakage and snoop energy

Cache Design Space Exploration

Results: Filtering Percentage

CSR achieves a higher filtering percentage with a smaller number of registers.

Analysis: RGB2CMYK Benchmark

Discussion: Energy Consumption

• For most benchmarks, snoop energy was around 8-10% of the total memory subsystem energy without snoop filters

• CSR filters are more effective for certain benchmarks (H.264, Image rotation)
  – Better filtering performance with a smaller no. of stream registers.

• Small reduction in overall energy
  – Platform limited to 32 MB of off-chip SDRAM
  – No complex data sharing and limited no. of multiple producers of same data

Summary

• Introduced a snoop filter architecture based on counting stream registers
  – Lower hardware complexity and the ability to track cache line invalidations

• Experimental evaluation shows a better filtering percentage than stream registers, with less performance variation across different workloads.